A Novel Deep Learning Model as a Donor–Recipient Matching Tool to Predict Survival after Liver Transplantation

Background: The “digital era” in medicine is the new “here and now”. Artificial intelligence has entered many fields of medicine and is now emerging in the field of organ transplantation. Solid organs remain a scarce resource. Being able to predict the outcome after liver transplantation promises to address one of the long-standing problems within organ transplantation: what is the perfect donor–recipient match? Within this work we developed and validated a novel deep-learning-based donor–recipient allocation system for liver transplantation. Method: In this study we used data collected from all liver transplant patients between 2004 and 2019 at the university transplantation centre in Munich. We aimed to design a transparent and interpretable deep learning framework to predict the outcome after liver transplantation. An individually designed neural network was developed to meet the unique requirements of transplantation data. The metrics used to determine the model quality and its level of performance were accuracy, cross-entropy loss, F1 score and AUC. Results: A total of 529 transplantations, comprising 1058 matching donor and recipient observations, were entered into the database. The combined prediction of all outcome parameters was 95.8% accurate (cross-entropy loss of 0.042). The prediction of in-hospital death was 94.3% accurate (cross-entropy loss of 0.057). The overall F1 score averaged 0.899, and the overall AUC was 0.940. Conclusion: With the achieved results, the network serves as a reliable tool to predict survival and adds new insight into the potential of deep learning to assist medical decisions. Especially in the field of transplantation, an AUC of 0.940 is very valuable. This neural network is unique in that it uses transparent and easily interpretable data to predict the outcome after liver transplantation.
Further validation must be performed prior to utilization in a clinical context.


Introduction
Liver-related death accounts for approximately 2 million deaths per year and is continuously increasing [1]. Combined, liver cirrhosis and liver cancer account for 3.5% of all deaths worldwide [2]. Liver transplantation (LT) can be a curative and life-prolonging therapy for patients with end-stage liver disease. Furthermore, post-transplant patient survival and graft survival and functionality are continuously improving [3]. However, as we live in a time of organ shortage, there is a large gap between supply and demand [4]. Contrary to kidney transplantation, the guiding principle for the allocation of livers is urgency. This urgency is estimated by the model for end-stage liver disease (MELD) [5]. Aside from being an imperfect estimate of the severity of the underlying disease, this urgency-based system can lead to transplantations in increasingly futile cases and is vulnerable to manipulation [6,7]. There are many ordinal and deontological allocation concepts. However, none of these concepts, some of which seem logical at first (lottery, first-come-first-served, etc.), can fulfil all requirements (equal treatment, maximizing benefit for all, maximizing the benefit for the individual patient and respecting autonomy) for a truly fair allocation concept. In a recent publication we showed that a utility-based system aiming to maximize gain-of-survival after transplantation could potentially eliminate futile transplantations [7]. For this, however, an almost perfect prediction of survival is needed. In recent years, multiple simple models such as the donor risk index (DRI) [8], survival outcome following liver transplantation (SOFT) [9], balance of risk (BAR) [10] or the donor age x recipient MELD (D-MELD) [11] have been developed. However, none appear to be vastly superior in predicting outcome after liver transplantation [12]. Artificial intelligence (AI) might be better suited to this task [13].
AI has shown promising results in multiple medical fields [14][15][16]. It is also a competent method for reducing innate human subjectivity [17]. Recently, AI has been increasingly applied to transplantation data, since predicting outcome based on donor and recipient data is especially hard [18,19]. Results from these studies are promising. A recent review underlined that artificial neural networks are the most common algorithms used on transplantation datasets. The authors point out that neural networks are especially suited since they are more flexible than older score-based systems. They also conclude that more accurate neural networks could aid in better allocation by taking more variables into account [20].
In this study, we report on the development and testing of a novel deep learning model for the prediction of overall survival at different timepoints after transplantation.

Data Selection and Study Population
In this single-centre study, we used data collected from all liver transplant patients between 2004 and 2019 at the university transplant centre in Munich/Erlangen. Ethical approval was obtained from the institutional review board (EK 19-395) at the Ludwig-Maximilian University in Munich. The need for informed consent was waived by the institutional review board. Patients were included only if both donor and recipient values were present. Data from a total of 1058 individuals were included. To present a transparent and interpretable model, we included variables that are globally available and belong to the "standard" panel of recorded data. Recipient data consisted of demographic data (age, sex, BMI, etc.), the underlying disease (cirrhosis, etc.), disease features (ascites, etc.), the MELD score and fourteen laboratory values. Donor data consisted of demographic data (age, sex, etc.), living or deceased donation, cause of death, reanimation of the deceased, the donor risk index and fourteen laboratory values. We selected donor variables that were available prior to organ allocation, since this reflects the clinical reality; any donor variables only available after transplantation were disregarded. Importantly, we also included transplant-specific data that is not directly related to donor or recipient: ischemia time, full or split donation, distance the organ travelled to the transplantation location and graft quality. The detailed recipient, donor and transplant-specific datapoints are outlined in Tables 1-3.
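The combination of recipient, donor and transplant-specific variables into a single model input can be sketched as follows. This is an illustration only; the field names and values are hypothetical stand-ins for the 62 variables described above.

```python
import numpy as np

# Hypothetical minimal records; the real model uses 62 variables in total
recipient = {"age": 56, "bmi": 27.4, "meld": 24}            # recipient demographics + MELD
donor = {"age": 61, "dri": 1.9, "albumin_g_l": 28.0}        # donor demographics + DRI
transplant = {"cold_ischemia_min": 640, "distance_km": 310, "split_graft": 0}

def build_observation(recipient, donor, transplant):
    """Concatenate recipient, donor and transplant-specific values
    into one input row, as described in the text."""
    keys, values = [], []
    for prefix, record in (("r", recipient), ("d", donor), ("t", transplant)):
        for k in sorted(record):                 # stable column order
            keys.append(f"{prefix}_{k}")
            values.append(float(record[k]))
    return keys, np.array(values)

keys, x = build_observation(recipient, donor, transplant)
print(len(x))  # one row combining all three data sources
```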

Missing Data
Transplant data is generally heterogeneous and therefore challenges traditional statistical models. Since the organ is donated by a person largely independent of the recipient, missing donor and recipient data need to be imputed independently of one another. Furthermore, location data and laboratory markers may require different algorithms to estimate the missing value most accurately. We therefore developed and validated a novel multidimensional medical combined imputation (MMCI) algorithm to analyse this multifaceted and segmented dataset. The MMCI is a pipeline of interconnected imputation methods designed to impute segmented data with the highest accuracy. We tested and validated the imputation mechanism on two different complete datasets. For both datasets, the most established imputation methods were tested and accuracy (ACC) was compared with the novel MMCI. The model outperformed well-established imputation mechanisms such as missForest, k-NN and MICE.
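The validation strategy described above — masking entries of a complete dataset and scoring each imputer on how well it reconstructs them — can be illustrated with standard imputers. This sketch is not the MMCI itself (which is unpublished here); it only shows the benchmarking principle on synthetic correlated "laboratory" data.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

rng = np.random.default_rng(0)

# Complete synthetic matrix: 200 patients x 5 correlated markers
base = rng.normal(size=(200, 1))
X_true = np.hstack([base + rng.normal(scale=0.3, size=(200, 1)) for _ in range(5)])

# Mask 10% of entries to simulate missingness, as done when validating an imputer
mask = rng.random(X_true.shape) < 0.10
X_miss = X_true.copy()
X_miss[mask] = np.nan

def rmse(imputer):
    """Reconstruction error on the masked entries only."""
    X_hat = imputer.fit_transform(X_miss)
    return float(np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2)))

print("mean imputation RMSE:", rmse(SimpleImputer(strategy="mean")))
print("k-NN imputation RMSE:", rmse(KNNImputer(n_neighbors=5)))
```

Because the columns are correlated, the neighbour-based imputer should reconstruct masked values more accurately than a simple column mean, which is the kind of comparison the MMCI validation performed.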

Model Development
An individually designed neural network was developed to meet the requirements of the data. A neural network, in general, is an arrangement of linear and non-linear modules that enable a network of nodes to intercommunicate and learn from the training data. The training data includes input data (variables) and the outcome to be predicted. To capture the correlations and causal structure within the data, the network needs several layers between input and outcome. At every node (decision point), the importance of the data is weighted. An important hyperparameter to tune is the correct size of the layers and their depth. Too few layers and nodes cause the network to underrepresent individual data points, and underfitting occurs: the model is too rigid to predict the outcome. Conversely, an oversized model can lead to overfitting: the model is trained so specifically that it can only predict the dataset it was trained on and is therefore not generalizable. To avoid this, four hyperparameters were introduced to monitor the learning progress and to adjust the network in an iterative adaptation process. The goal was to scale the network to find the right balance between over- and underfitting. To measure this, the metrics accuracy, cross-entropy loss, F1 score and AUC were used for monitoring. Accuracy reflects how often the network's prediction was correct. Cross-entropy loss describes how far the prediction diverges from the true value. The F1 score is a weighted mean of precision and recall; it is especially helpful when one classifier "X" has high precision and another classifier "Y" has high recall, as it compares the average results of both models within one metric. Lastly, the AUC tells us how well the model can distinguish between different classes.
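The four monitoring metrics can be computed with standard library functions. The labels and probabilities below are illustrative toy values, not study data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

# Toy binary survival labels and model probabilities (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
p_pred = np.array([0.9, 0.2, 0.8, 0.4, 0.6, 0.7, 0.3, 0.95])
y_pred = (p_pred >= 0.5).astype(int)   # threshold probabilities at 0.5

print("accuracy:", accuracy_score(y_true, y_pred))    # fraction of correct predictions
print("cross-entropy:", log_loss(y_true, p_pred))     # divergence of p from the truth
print("F1:", f1_score(y_true, y_pred))                # harmonic mean of precision/recall
print("AUC:", roc_auc_score(y_true, p_pred))          # class separability
```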
(Detailed definitions are in Supplementary S1.b, Metric Definitions). After imputation and before training and cross-validation of the algorithm, the study cohort was split 8:1:1 into training data (80%), cross-validation data (10%) and a separate test dataset (10%). We used this common split ratio to allow the algorithm to train on as much data as possible while combating the phenomena of over- and underfitting. With a total of 529 transplantations in the study group, the training dataset (including 80% training data and 10% cross-validation data) comprised n = 477 and the test dataset n = 52 transplantations. After separation, the test dataset remained untouched throughout the analysis and was only used for testing the final model. Given the 62 variables and 477 transplantations (954 observations from donors and recipients) in the training dataset, the depth of the network was set at six layers. A simple schematic display of the model is outlined in Figure 1.
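A minimal sketch of the 8:1:1 random split is shown below. The seed and the exact rounding of the 10% partitions are assumptions; only the proportions follow the text.

```python
import numpy as np

rng = np.random.default_rng(42)   # assumed seed, for reproducibility of the sketch
n = 529                           # transplantations in the study group

indices = rng.permutation(n)      # shuffle once, then carve out the partitions
n_test = n // 10                  # ~10% held-out test set
n_val = n // 10                   # ~10% cross-validation set
test_idx = indices[:n_test]
val_idx = indices[n_test:n_test + n_val]
train_idx = indices[n_test + n_val:]

print(len(train_idx), len(val_idx), len(test_idx))  # 425 52 52
```

The test partition is set aside once and never touched again until the final evaluation, matching the procedure described above.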

Code Description
It is necessary to understand that the whole framework of this code has been built to serve as a dynamic usable structure for all types of transplant data. Therefore, it is built in a modular way where the user can train this specialized neural network on any transplant. The basic description is outlined in Supplementary S1.a. "Basic-Code".

Outcome Parameter
Accuracy was defined as the primary metric for training and testing the survival prediction. The last layer of the network is the outcome. The network learns and evaluates itself during the training phase. Because different timepoints of survival were evaluated, the data were subdivided within the algorithm to increase the precision of the prediction. This subdivision covered death within 48 h, in-hospital mortality, 3-month, 6-month, 9-month and 12-month survival rates and death within the follow-up period.
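The subdivision into outcome timepoints can be sketched as a labelling step. The input fields (days from transplant to death, in-hospital death flag) are assumed for illustration; the horizons follow the list above.

```python
# Derive the survival subdivision described above from raw follow-up data.
# Assumed inputs: days from transplant to death (None if alive at last
# follow-up) and whether death occurred during the index hospital stay.

def survival_labels(days_to_death, died_in_hospital):
    """Return one binary label per outcome timepoint (1 = survived)."""
    horizons = {"48h": 2, "3m": 90, "6m": 180, "9m": 270, "12m": 365}
    labels = {}
    for name, days in horizons.items():
        labels[name] = int(days_to_death is None or days_to_death > days)
    labels["in_hospital"] = int(not died_in_hospital)
    labels["follow_up"] = int(days_to_death is None)
    return labels

# A patient who died 120 days after transplant survives the 48 h and
# 3-month horizons but not the later ones
print(survival_labels(days_to_death=120, died_in_hospital=False))
```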
To gain value from this network, the allocation needs to be conducted with one current donor for all possible recipients that are available. Based on having a list of possible recipients who are waiting for an organ and the new availability of one donor in the process of organ-extraction, a real-time allocation needs to take place to determine what the best mapping is. Therefore, out of the evaluation dataset, one patient is extracted and mapped onto the whole recipient dataset. Thus, a prediction is made based on the previous attributes that were necessary for determining the rate of survival.
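The allocation step above — pairing one newly available donor with every waitlisted recipient and ranking the pairs by predicted survival — can be sketched as follows. The scoring function here is a toy stand-in for the trained network's forward pass; the feature vectors are hypothetical.

```python
import numpy as np

def rank_recipients(donor_vec, recipient_matrix, predict_survival):
    """Pair the donor with each recipient and sort candidates by predicted
    survival probability, highest first. `predict_survival` stands in for
    the trained network."""
    pairs = np.hstack([np.tile(donor_vec, (len(recipient_matrix), 1)),
                       recipient_matrix])        # one row per donor-recipient pair
    scores = predict_survival(pairs)
    order = np.argsort(scores)[::-1]             # descending by predicted survival
    return order, scores[order]

# Toy stand-in model: predicted survival falls as combined "risk" features rise
toy_model = lambda X: 1.0 / (1.0 + X.sum(axis=1))

donor = np.array([0.5, 1.0])
recipients = np.array([[2.0, 1.0],
                       [0.1, 0.2],
                       [1.0, 1.0]])
order, scores = rank_recipients(donor, recipients, toy_model)
print(order)  # recipient indices, best predicted match first
```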

Demographic Data Characteristics
A total of 529 transplantations were included from 2004 to 2019. For these, 529 matched donor and recipient observation pairs were entered into the database. The demographic and clinical data for the transplanted patients are listed in Table 1. Accepted organs were 321.56 ± 210.99 km away from the university transplantation centre in Munich. Consequently, the cold ischemia time was relatively high at 630.69 ± 156.61 min (Table 2). Donors were 54.79 ± 16.27 years old, with a calculated donor risk index of 1.98 ± 0.43. Albumin levels were 27.86 ± 6.46 g/L. Notably, in comparison with the recipient data, inflammation parameters were increased, with leukocytes at 13.85 ± 5.95 G/L and CRP at 14.78 ± 10.72 mg/dL. All data are listed in Table 3.
Variables were compared between the training and test datasets. Regarding recipients, none of the demographic or disease-specific variables showed a significant difference. Among the laboratory values, only potassium levels differed significantly between the datasets. In the comparison of the donor data, the DRI was higher in the training dataset (p = 0.0095).

Algorithm Performance
The metrics used to determine the model quality and its level of performance are accuracy, cross-entropy loss, F1 score and AUC. By splitting the initial dataset with a ratio of 8:1:1, where the first nine parts were used to train and cross-validate the network, the evaluation can take place on completely unseen data that were initially randomly set aside. The distribution of outcome data between the training set and evaluation set is shown in Supplementary S1.c, Figures S1 and S2.
The overall results given by the metrics calculated through an average of all outcome parameters show 0.958 (95.8%) accuracy with a cross-entropy loss of 0.042. The combined F1 score was 0.899, whereas the AUC score was 0.940. For better visibility, the summarized results are shown in Table 4.

Discussion
This study presents a novel deep-learning-based prediction model for survival after liver transplantation. The model was trained on 529 transplantations comprising 1058 donors and recipients. Further, it aims to be interpretable and transparent, especially in its process of data utilization. It achieved an AUC of 0.940, which in a clinical context represents a very strong prediction. We chose this method because of the dynamic nature of the interaction between donor and recipient. The framework is built in a modular way, so the user can train this specialized neural network on any transplant data with little effort. Previous studies have applied similar methods to predict outcome. Since we did not use the same data or the same outcome parameters, a direct comparison between prediction models is difficult. However, Ayllon et al. achieved an AUC of 0.82 for the prediction of 12-month survival after liver transplantation [20], and Ershoff et al. achieved an AUC of 0.703 for the prediction of 90-day post-transplant mortality [21]. Our model achieved accuracies comparable to or higher than those reported above. Moreover, while these models also appear to perform reasonably well, they can only predict one timepoint, whereas the model presented here predicts multiple timepoints.
The European General Data Protection Regulation of 2018 stated reasonable concerns with black-box predictions. The concerns not only include the opaqueness of the model itself but also the necessity to have control over the data, the processing and the interpretation of the results obtained [22]. Our data selection and processing were specifically set out to meet these concerns. As missing data is omnipresent within medical archives, we developed our own imputation method, which proved to be more accurate than readily known imputation algorithms (Boerner et al. under review).
Regarding the interpretation of the resulting predictions, we propose creating an AI-assisted utility-based allocation concept. AI offers the chance to use the vast amount of data in the field of transplantation to optimize organ utility. The state-of-the-art allocation target metrics such as the DRI, MELD score, SOFT score and BAR score offer some success in predicting the most favourable outcome. However, these scores are criticized for being inaccurate, untransparent and static [23][24][25]. An AI-assisted utility-based allocation concept using gain-of-survival as the target metric would be more flexible and could represent the best approximation of a perfect allocation practice [7,26].
This study has some limitations inherent to its design. First of all, this is a retrospective study with data from two German transplant centres. We have mitigated possible biases by using cross-validation to make our results more generalizable. Multiple models have been developed over the past years and have informed the discussion on how to incorporate machine learning and neural networks into our clinical decision process [27,28]. As mentioned above, we strove for maximum transparency; however, the black-box nature of the model could not be fundamentally changed.
This study is intended as a proof of concept. It represents a novel deep learning model that was trained and tested. Such a model could potentially be used as a part of a utility-based allocation concept. Before this model or any of its kind can be used as a bedside tool, the results need to be externally confirmed in a randomized clinical trial, ideally in a multicentre setting.