Blood Transfusion, All-Cause Mortality and Hospitalization Period in COVID-19 Patients: Machine Learning Analysis of National Health Insurance Claims Data

This study presents the most comprehensive machine-learning analysis for the predictors of blood transfusion, all-cause mortality, and hospitalization period in COVID-19 patients. Data came from Korea National Health Insurance claims data for 7943 COVID-19 patients diagnosed during November 2019–May 2020. The dependent variables were all-cause mortality and the hospitalization period, and 28 independent variables were considered. Random forest variable importance (GINI) was used to identify the main factors of the dependent variables and to evaluate their associations with these predictors, including blood transfusion. Based on the results of this study, blood transfusion had a positive association with all-cause mortality. The proportions of red blood cell, platelet, fresh frozen plasma, and cryoprecipitate transfusions were significantly higher in patients who died than in survivors (p-values < 0.01). Likewise, the top ten factors of all-cause mortality based on random forest variable importance were the Charlson Comorbidity Index (53.54), age (45.68), socioeconomic status (45.65), red blood cell transfusion (27.08), dementia (19.27), antiplatelet (16.81), gender (14.60), diabetes mellitus (13.00), liver disease (11.19) and platelet transfusion (10.11). The top ten predictors of the hospitalization period were the Charlson Comorbidity Index, socioeconomic status, dementia, age, gender, hemiplegia, antiplatelet, diabetes mellitus, liver disease, and cardiovascular disease. In conclusion, comorbidity, red blood cell transfusion, and platelet transfusion were the major factors of all-cause mortality based on machine learning analysis. The effective management of these predictors is needed in COVID-19 patients.


Introduction
Since its outbreak in December 2019, coronavirus disease 2019 (COVID-19) caused by the severe acute respiratory syndrome coronavirus-2 spread rapidly, leading to a public health crisis [1]. Moreover, the COVID-19 outbreak dramatically increased global mortality [2,3]. Various clinical factors, including gender (male), age, and comorbidities (obesity, chronic kidney disease, cancer, diabetes mellitus, lung disease, etc.), are related to mortality and prognosis in patients with COVID-19 [4,5]. In addition, anemia is prevalent and associated with long hospital stays and poor clinical conditions in patients with COVID-19 [6]. Several studies evaluating the association between anemia and clinical outcomes of COVID-19 have shown that patients with anemia have higher mortality and worse prognosis [7][8][9]. However, other studies have not shown a negative association between anemia and the prognosis of patients with COVID-19 [10,11]. Thus, the effects of anemia on COVID-19 are debatable.
The COVID-19 pandemic had a significant impact on blood transfusions [12]. Concurrently, the pandemic had a substantial impact on supply of blood owing to a reduction in blood donation [12,13]. Studies have shown that many patients with COVID-19 do not require blood transfusion [12,14]. In a previous study, only one third of the patients required transfusion due to anemia (nonbleeding), and very few patients required platelet (PLT) or fresh frozen plasma (FFP) transfusion [12]. However, the literature investigating the prevalence and clinical effects of blood transfusion in patients with COVID-19 is limited. A previous study has reported that red blood cell (RBC) transfusion predicts overall mortality, and the mortality rate is directly proportional to the number of RBC units transfused [15]. Thus, RBC transfusion may be a prognostic predictor for adverse outcomes in patients with COVID-19; however, the clinical implications of blood transfusion in patients with COVID-19 remain unclear.
This study presents the most comprehensive machine learning analysis for the predictors of blood transfusion, all-cause mortality, and hospitalization period in COVID-19 patients, using a population-based cohort of 7943 participants from the Korean National Health Insurance Database (NHID) and the richest collection of 28 predictors, such as demographic/socioeconomic determinants, comorbid conditions, and disease information. All medical institutions mandatorily enter into a contract with the national government, and all prescriptions, orders, and diagnostic codes are computerized and recorded in the NHID. We analyzed the associations between blood transfusion and comorbidities in patients with COVID-19. Further, we investigated the role of potential risk factors for all-cause mortality and hospitalization period in transfused and non-transfused patients with COVID-19. This study showed the overall implications of blood transfusion in patients with COVID-19.

Analysis
The t-test was used to evaluate the associations of blood transfusion with CCI and all-cause mortality in patients with COVID-19. Next, random forest analysis for variable importance was performed to investigate the main factors of all-cause mortality and hospitalization period in patients with COVID-19 and to test their associations with transfusion history and other variables. A random forest in this study created 500 training sets, trained 500 decision trees, and made predictions with a majority vote. Random forest variable importance, defined as the node impurity (GINI) decrease from the creation of a branch on a certain predictor, measures the importance of an independent variable for predicting the dependent variable. In addition, logistic regression and random forest were employed to analyze the effects of blood transfusion on all-cause mortality and the hospitalization period for various subgroups (i.e., RBC, PLT, FFP, and cryoprecipitate transfusions). The results of these subgroup analyses are reported in the supplementary tables. The data were split into training and validation sets at a 70:30 ratio. The criteria for the validation of the trained models were (1) accuracy for the prediction of all-cause mortality, that is, the ratio of correct predictions among all cases, and (2) the mean absolute percentage error for the prediction of the hospitalization period (here, the error was the difference between the actual and predicted values of the hospitalization period) [17]. The statistical analyses were performed using RStudio 1.3.959 (RStudio Inc., Boston, MA, USA).
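The validation steps described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in the R environment used for the study, and it is not the authors' code. All function names, the random seed, and the toy labels and hospitalization periods are assumptions made for illustration; only the 70:30 ratio, the cohort size of 7943, the majority vote, and the two validation criteria come from the text.

```python
import random
from collections import Counter


def split_train_validation(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and validation sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_vote(tree_votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def correct_prediction_ratio(actual, predicted):
    """Ratio of correct predictions among all cases (often called accuracy)."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, as a percentage.

    Assumes every actual value is nonzero.
    """
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Hypothetical patient identifiers; only the cohort size comes from the text.
    train, validation = split_train_validation(range(7943))
    print(len(train), len(validation))

    # Invented votes from five trees for one patient (1 = died, 0 = survived).
    print(majority_vote([1, 0, 1, 1, 0]))

    # Invented labels and hospitalization periods, for illustration only.
    print(correct_prediction_ratio([1, 0, 0, 1], [1, 0, 1, 1]))
    print(mean_absolute_percentage_error([10, 20], [12, 15]))
```

In this sketch the split is a simple random shuffle; whether the study stratified the split by outcome is not stated in the text.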

Ethics Statement
This retrospective study complied with the tenets of the Helsinki Declaration and was approved by the Institutional Review Board (IRB) of Korea University Anam Hospital on 10 August 2020 (2020AN0367). Informed consent was waived by the IRB, given that data were deidentified.

Results
Descriptive statistics for the 7943 patients with COVID-19 are summarized in Tables 1 and 2. The median hospitalization period was 33 days and the median CCI, age, and socioeconomic status scores were 1, 4, and 10, respectively. Age was given with a range of 0-8 in the original data. Likewise, socioeconomic status ranged from 0 (highest health insurance payment) to 20 (lowest health insurance payment) in the original data. The proportion of those who died was 3.08% (n = 245). The proportions of patients with RBC, PLT, FFP, and cryoprecipitate transfusions were 2.29% (182), 0.65% (52), 0.43% (34), and 0.05% (4), respectively. The proportions of patients with gender (male), cardiovascular disease, diabetes mellitus, dementia, hemiplegia, liver disease, and antiplatelet medications were 60.00% (4766), 5.09% (404), 15.35% (1219), 6.48% (515), 1.13% (90), 17.55% (1394), and 12.07% (959), respectively. Blood transfusion was positively associated with the CCI score. The proportions of patients with RBC, PLT, and FFP transfusions were significantly higher in Group 1 (high CCI score) than in Group 2 (low CCI score) (Table 3). Blood transfusion was also positively associated with all-cause mortality. The proportions of patients with RBC, PLT, FFP, and cryoprecipitate transfusions were significantly higher in patients who died than in survivors (Table 4). Within each CCI group, the proportion of patients who received RBC transfusion was also significantly higher among those who died than among survivors (Table S1-1 (Group 1) and Table S1-2 (Group 2), supplementary tables). p-values for the test of their equality were <0.05, as reported in the notes of Tables S1-1 and S1-2. Likewise, blood transfusion was a major determinant of all-cause mortality and hospitalization period in patients with COVID-19 based on random forest variable importance.
The top ten factors of all-cause mortality were CCI score (53.54), age (45.68), socioeconomic status (45.65), RBC transfusion (27.08), dementia (19.27), antiplatelet agents (16.81), gender (14.60), diabetes mellitus (13.00), liver disease (11.19), and PLT transfusion (10.11) (Figure 1). For example, the random forest variable importance of RBC transfusion for predicting all-cause mortality was approximately 27. This means that, on average, the accuracy of the decision tree in the random forest will decrease by approximately 0.05 if the RBC transfusion variable is removed from the model. Here, the number 0.05 was obtained by dividing 27 by 500 (the number of decision trees in the random forest). The top ten factors of the hospitalization period were CCI score, socioeconomic status, dementia, age, gender, hemiplegia, antiplatelet agents, diabetes mellitus, liver disease, and cardiovascular disease (Figure 2). The following supplementary tables are provided for additional reference: (1) the descriptive statistics of subgroups (Tables S2-1 to S2-16); and (2) the performance measures of machine learning for the prediction of all-cause mortality and hospitalization period (Tables S3-1 and S3-2).
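To make the importance measure more concrete, the following minimal Python sketch computes the node impurity (GINI) decrease for a single hypothetical branch and reproduces the per-tree arithmetic quoted above. The function names and the eight toy outcomes are invented for illustration and are not study data; the values 27.08 and 500 are taken from the text. The sketch illustrates the definition only and does not reproduce the exact computation performed by the software used in the study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


if __name__ == "__main__":
    # Invented outcomes (1 = died, 0 = survived) for a branch on a
    # hypothetical "RBC transfusion" predictor.
    transfused = [1, 1, 0]
    not_transfused = [1, 0, 0, 0, 0]
    parent = transfused + not_transfused
    print(gini_decrease(parent, transfused, not_transfused))

    # Per-tree figure quoted in the text: importance divided by the number of trees.
    importance_rbc = 27.08
    number_of_trees = 500
    print(importance_rbc / number_of_trees)
```

A branch that separates patients who died from survivors more cleanly produces a larger decrease, which is why predictors that are chosen often and split well accumulate higher importance values.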

Discussion
Transfusion requirements have been reported to be associated with poor clinical conditions [15]. A previous study reported that most patients with mild COVID-19 did not require blood transfusion, that RBC transfusion was necessary in severely ill patients, especially those with gastrointestinal bleeding, and that the requirement for FFP and PLT transfusions was lower [18]. In contrast, another study reported that transfusion requirement was low, even in critically ill patients with COVID-19 [12]. In this study, blood transfusions were significantly associated with comorbidities in patients with COVID-19. The percentage of patients with RBC transfusion was significantly higher in the high CCI score group than in the low CCI score group (5.5% vs. 0.3%). The proportions of patients receiving PLT and FFP transfusions showed the same trend across the two groups (PLT, 1.5% vs. 0.1%; FFP, 1.0% vs. 0.1%). Moreover, blood transfusions have been reported as independent predictors of mortality in patients with COVID-19 [15,19]. A previous multivariable analysis of patients with COVID-19 showed that blood transfusions were associated with a significantly higher mortality rate [19]. The present study showed that patients who died during the study period received significantly more transfusions than those who survived (Table 4). Furthermore, when we conducted random forest analysis to account for potential confounders, such as gender, socioeconomic status, comorbidities, and transfusion/medication history, RBC and PLT transfusions were major determinants of mortality in patients with COVID-19 (Figures 1 and 2). This finding was consistent with that of previous studies that analyzed blood transfusions and clinical outcomes in patients with COVID-19 [15,19].
Interestingly, in the random forest model, RBC and PLT transfusions were among the top ten factors of all-cause mortality. However, in a subgroup analysis based on CCI scores, only RBC transfusion maintained statistical significance for mortality in both the high (Group 1, Table S1-1) and low CCI score (Group 2, Table S1-2) groups. CCI score was the first-ranking factor for all-cause mortality in this study, and the validity of the CCI score for predicting mortality in patients with COVID-19 has been reported in previous studies [20][21][22]. Older age, low socioeconomic status, RBC transfusion, dementia, use of antiplatelet medications, gender (male), diabetes mellitus, liver disease, and PLT transfusion were also identified as important factors for all-cause mortality. The discrepancy between the random forest ranking and the subgroup analysis might arise because PLT transfusion has a statistically stronger interaction with CCI scores than RBC transfusion does, resulting in a bias. Thus, RBC transfusion can be a very strong factor for mortality in patients with COVID-19, because it maintained statistical significance even after correcting for bias using stratified subgroup analysis. This result is also concordant with that of a previous study that showed that the number of RBC units transfused is an independent factor for mortality in patients with COVID-19 [15]. Thus, restrictive RBC transfusion strategies might help to reduce the mortality in patients with COVID-19 independent of comorbidities. On the other hand, blood transfusion, even when presumably appropriately administered, depends on low hemoglobin values. It is well known that anemia is a common feature of severe COVID-19 for many possible and combined reasons, including inflammation, marrow hypoproliferation, drugs, multiple blood sampling, red cell membrane damage, autoantibodies, hospitalization, and kidney damage [6][7][8][9]. RBC transfusion has also been used to mitigate low pulmonary perfusion [6][7][8][9]. Considering these aspects, it is conceivable that patients with more severe COVID-19 received more transfusions than patients with less severe disease. Therefore, further studies are needed to assess the associations of blood transfusions with mortality and disease severity in patients with COVID-19.
We used machine learning and nationwide population data to analyze the associations of blood transfusion with all-cause mortality and hospitalization period in patients with COVID-19. Machine learning (or data mining) methods are statistical methods for "extracting knowledge from large amounts of data" [23]. Specifically, the random forest analysis does not require unrealistic assumptions of linear regression, such as ceteris paribus, "all the other variables remaining constant." In addition, as demonstrated in this study, the random forest analysis can identify factors that are more important for the prediction of all-cause mortality and hospitalization period in patients with COVID-19. Further studies are needed, but this study will be a good starting point in this direction.
This study had some limitations. First, this study had methodological biases related to its data source. As described above, this study used population data. Data of 7943 patients with COVID-19 diagnosed between 1 November 2019 and 31 May 2020 were obtained from the NHID. The NHID covers 52 million residents (nearly all residents) in Korea. All patients with any medical visit encoded as COVID-19 during the study period were included. The data size (7943) exceeds the minimum size to have desired properties with the 95% confidence interval and the 5% margin of error (156). However, the NHID provided limited information regarding the diagnosis, drugs, and service codes. Detailed information of individual patients was unavailable, including the dose of medication and laboratory data, such as hemoglobin level and PLT counts. Second, we used the ICD-10 codes to define the comorbidities included in this study. Thus, we may have underestimated the comorbidities of patients because the ICD-10 codes were broad and ambiguous. Finally, all patients included in this study were Korean. Therefore, our results should be applied cautiously to other populations.

Conclusions
In conclusion, this study based on a real-world population-based database showed that blood transfusions were associated with all-cause mortality in patients with COVID-19. RBC and PLT transfusions were major determinant factors for all-cause mortality based on machine learning analysis. In particular, a strong association between RBC transfusion and mortality was observed in patients with COVID-19. Thus, effective management of blood transfusion based on the comorbidities and disease severity in patients with COVID-19 may be beneficial. Further studies should evaluate the effect of blood transfusion in other populations with COVID-19.

Informed Consent Statement:
Informed consent was waived by the IRB, given that data were deidentified.

Data Availability Statement:
The data presented in this study are not publicly available. However, the data are available from the corresponding author upon reasonable request and under the permission of Korea National Health Insurance Service.