A Dynamic Bayesian Model for Identifying High-Mortality Risk in Hospitalized COVID-19 Patients

As Coronavirus Disease 2019 (COVID-19) hospitalization rates remain high, there is an urgent need to identify prognostic factors to improve patient outcomes. Existing prognostic models mostly consider the impact of biomarkers at presentation on the risk of a single patient outcome at a single follow up time. We collected data for 553 Polymerase Chain Reaction (PCR)-positive COVID-19 patients admitted to hospital whose eventual outcomes were known. The data collected for the patients included demographics, comorbidities and laboratory values taken at admission and throughout the course of hospitalization. We trained multivariate Markov prognostic models to identify high-risk patients at admission along with a dynamic measure of risk incorporating time-dependent changes in patients’ laboratory values. From the set of factors available upon admission, the Markov model determined that age >80 years, history of coronary artery disease and chronic obstructive pulmonary disease increased mortality risk. The lab values upon admission most associated with mortality included neutrophil percentage, red blood cells (RBC), red cell distribution width (RDW), protein levels, platelets count, albumin levels and mean corpuscular hemoglobin concentration (MCHC). Incorporating dynamic changes in lab values throughout hospitalization lead to dramatic gains in the predictive accuracy of the model and indicated a catalogue of variables for determining high-risk patients including eosinophil percentage, white blood cells (WBC), platelets, pCO2, RDW, large unstained cells (LUC) count, alkaline phosphatase and albumin. Our prognostic model highlights the nuance of determining risk for COVID-19 patients and indicates that, rather than a single variable, a range of factors (at different points in hospitalization) are needed for effective risk stratification.

Characteristics of raw data: Table S1. Summary of test values upon admission. Here, "Nobs" indicates the number of patients who had measurements taken for each test on their first day of admission. Note, that the data here reflects summaries calculated after the steps outlined in the "Data cleaning" section and included n = 540 individuals who had measurements taken on the first day. It includes data for those tests with 150 or more observations.     . Figure S4. Markov model: estimated odds ratios indicating mortality risk. Each marker represents a different selection of variables included as regressors (see "Determining patient risk" section). The upper and lower whiskers indicate the 75% and 25% posterior quantiles; the middle points of each range indicate the posterior median. Those points shown in orange indicate that the 5%-95% posterior quantiles did not cross zero.       Table S6. Summary statistics of lab values. These show the means and standard deviations (SDs) of the initial lab values and the percentage changes from these initial values for the data we analyze using the Markov model. Note, these summaries correspond to a different dataset (that obtained as described in the "Data processing for logistic and Markov models" section) than those provided in Table S1.    Table S7. Priors for Markov model parameters.

Parameter(s) Prior(s) Reasoning
Discharge individual patient intercepts and related population-level summaries

Data cleaning
We now detail the steps taken to clean and process the data, to convert it into a form amenable to estimation by both the logistic and Markov models.
There were n = 926 clinical measurements where the test was recorded to take place before the date when the patient was admitted: in these instances, if the difference between date-time of admission and date-time of measurements was less than 48 hours (n = 380 obs), we changed the date-time of admission for those individuals to be date-time of their first recorded test satisfying the 48 hour constraint; if the difference between date-time of admission and date-time of measurements exceeded 48 hours (n = 546), the observation was dropped. Similarly, there were instances when measurements were reported after their recorded datetime of change of status (when either discharge or death occurred): if the gap between the date-time of their change of status and the observation exceeded 48 hours, the observation was dropped (n = 274 obs); if this gap was less than 48 hours (n = 68 obs), the date of change of status was changed to the latest observation within the 48 hour constraint. For many patients, there was a gap between their last observation and the date-time of their change of status. For those patients where this gap was less than a day, we updated the change of status date-time to the date-time of their last test (n = 302 patients); for those patients who were eventually discharged but where the gap exceeded a day, we changed their change of status date-time to the date-time of their last test (n = 134 patients). For the analyses involving time-dependent measurements, we dropped those patients who eventually died, where the gap between the date of the last observation and their date of death exceeded 48 hours (n = 20 patients) as this typically signaled that the patients or their relatives had opted for palliative care. We dropped those patients where either the date-time of admission or the date-time of the change of status were unknown (n = 6 patients).

Data processing for logistic and Markov models
For the analyses involving time-dependent measurements, we carried out an additional series of transformations on the data. We then converted each patient's data into a regular day-block form: where, for each day the patient was in hospital, the patient had a single observation for each clinical test. To do this, we made a number of assumptions: if multiple tests of the same type were conducted on a patient on a given day, we took their mean as the daily observation; where a patient had a day when a specific test was not conducted, we assumed that day's test measurement was the same as on the last day it was conducted; when a patient was missing observations for a test until some days after their admission, we assumed the test measurements on these intervening days were the same as that from the first measurement. To reduce the impact of imputed observations, we included only those tests in the analysis where at least 70% of the individuals had a test taken (at least once); we then kept only those individuals who had been tested on at least 80% of the tests remaining after the previous step. Overall, approximately 27% of dynamic test observations were imputed, meaning that 27% of patient-test-days had observations imputed from previous or subsequent days' observations for that patient a described above. Finally, any individuals where age or sex were missing were dropped from the analysis. Collectively, these steps meant that n = 475 patients with data from n = 28 distinct clinical tests were included in the dynamic analyses.
For the analyses, we partitioned the influence of each clinical test into two separate variables: the initial measurement for each patient ("initial" meaning that the test was done within the first day of admission); and the percentage change in test value from this initial value. For those instances when an initial observation was zero, we set the percentage change as zero if all subsequent test values were also zero; alternatively, we calculated a percentage change versus half the first recorded non-zero value of the test. Finally, we standardized the percentage change data so that each test had a standard deviation of 1 and a mean of zero. Because of this, the effect sizes we estimate correspond to the typical scale of variation of each test (with those scales provided in Table S6).

Inference models Trend analysis
To compare trends in the clinical test values across the two patient groups (i.e. those that survived and those who died), we fit linear regression models incorporating time trends to the data. To do this, we used data for n = 541 patients representing those remaining after removing those observations where either the date-time of admission or the date-time of the change of status were unknown and after having removed observations when the date-time of change of status occurred more than 48 hours after the date-time of the last observation.
For each combination of patient and test value, we calculated the percentage change in test value from the first test value taken on the patient (note that this calculation was slightly different to what was done for the logistic and Markov models). We then scaled this value by subtracting the mean and dividing through by the standard deviation in percentage test changes. Before conducting regressions, we also removed any infinite (i.e. when the first test value was zero) or missing observations for the percentage change. We also removed extreme observations: specifically, those where the absolute value of the percentage change exceeded the 98% quantile.
We modelled the percentage change in test value as a function of a quadratic time trend, allowing for fixed effect trends but including individual patient slopes of both the linear and quadratic terms of the trend. We fit separate models to each of the two patient groups and extracted the fixed effect estimates of the trends. The models were estimated in a frequentist framework using the lme4 R package [41].

Determining patient risk
We conducted two analyses of the data. The first was a logistic regression analysis, which aimed to determine those factors most predictive of patient mortality. The second was a Markov model, which analysed the dynamic sequence of observations for each patient throughout their stay and aimed to examine how changes in these variables affected how quickly a patient was discharged or died.
In the logistic regression, we examined how an individual patient's characteristics affected their risk of dying. This model is described by: ∼ Bernoulli( ), = logistic( 0 + 1 ′ ), where ∈ {0,1} is the patient-specific outcome: = 0 represents discharge and = 1 represents death during the study period. The probability parameter of the Bernoulli model, ∈ [0,1], is given by a logistic transformation of a linear combination of the independent variables and parameters 1 .
We performed five separate regression types, each with different groups of independent variables included. In the first of these, we included only a single variable in an analysis to examine the influence of each variable in isolation: so, this first analysis really consisted of a series of univariate logistic regressions: one for each of the variables included in the analysis. The second regression included patient characteristics not specifically relating to health ("patient" variables): their age, sex, ethnicity, and the day they were admitted to hospital. The third regression ("pat. + comorbidities") supplemented the background variables with the recorded comorbidities for each patient: whether they had hypertension, diabetes etc. (13 conditions in total). The fourth regression ("admission") supplemented the third with the initial measurements for each patient for each of the 28 included clinical tests. The final regression ("post-admission") then included the percentage changes in each clinical test measurement from the initial values for each patient: this was calculated using the last recorded measurement for each patient.
We next sought to model not only the outcome of each patient but also the time taken for this outcome to be reached. An approach that is often used to analyse these sorts of data in the literature is Cox regression incorporating competing risks. We reviewed these models and decided not to use these due to issues with the assumptions underpinning them: specifically, that those individuals who are discharged would effectively be handled as if they were uninformatively censored observations to determine a mortality risk. Instead, we chose to develop a Markov model that simultaneously modelled individuals who were discharged and those who died. To do so, we considered the sequence of outcomes for each day the patient remained in hospital. On their first day, they were admitted and began in the "hospital" state (see Fig. 1); at the end of the first day, they either remained in hospital or transitioned to the "discharge" or "death" states. On subsequent days, the patients that remained in hospital faced the same possible transitions. The probabilities that these transitions occurred were modelled as a function of each patient's characteristics: some of which could potentially vary over time. Specifically, the un-normalised probabilities of each possible transition were modelled using a log link: