An Expectation-Maximization Algorithm for Including Oncological COVID-19 Deaths in Survival Analysis

We address the problem of how COVID-19 deaths observed in an oncology clinical trial can be consistently taken into account in typical survival estimates. We refer to oncological patients since there is empirical evidence of a strong correlation between COVID-19 and cancer deaths, which implies that COVID-19 deaths cannot be treated simply as non-informative censoring, a property usually required by the classical survival estimators. We consider the problem in the framework of the widely used Kaplan–Meier (KM) estimator. Through a counterfactual approach, an algorithmic method is developed allowing COVID-19 deaths to be included in the observed data by mean-imputation. The procedure belongs to the class of Expectation-Maximization (EM) algorithms and will be referred to as the Covid-Death Mean-Imputation (CoDMI) algorithm. We discuss the assumptions underlying CoDMI and its convergence. The algorithm provides a completed lifetime data set, where each Covid-death time is replaced by a point estimate of the corresponding virtual lifetime. This complete data set is naturally equipped with the corresponding KM survival function estimate, and all available statistical tools can be applied to these data. However, mean-imputation requires an increase in the variance of the estimates. We then propose a natural extension of the classical Greenwood's formula, thus obtaining expanded confidence intervals for the survival function estimate. To illustrate how the algorithm works, CoDMI is applied to real medical data extended by the addition of artificial Covid-death observations. The results are compared with the estimates provided by the two naïve approaches which count COVID-19 deaths as censoring or as deaths by the disease under study. In order to evaluate the predictive performance of CoDMI, an extensive simulation study is carried out.
The results indicate that in the simulated scenarios CoDMI is roughly unbiased and outperforms the estimates obtained by the naïve approaches. A user-friendly version of CoDMI programmed in R is freely available.


Introduction
The problem of defining a common and appropriate method in survival analysis for handling the dropouts due to coronavirus disease 2019 (COVID-19) deaths of patients participating in oncology clinical trials has been recently stressed [1,2]. In oncology trials, all-causality deaths are often counted as events for death-related endpoints, e.g., overall survival. However, as has been pointed out [2], counting a COVID-19 fatality as a death-related endpoint requires a complex redefinition of the estimand, considering a composite strategy for using the so-called intercurrent events [3], such as "discontinuation from treatment due to COVID-19" or "delay of scheduled intervention". The problem is also exacerbated by the difficulty of homogeneously determining whether a death is entirely attributable to COVID-19. In this paper, we address a simplified version of this problem, assuming that COVID-19-related deaths are homogeneously identified and are the only intercurrent events to be considered. In this framework, we tackle the problem of how data in an oncology trial having overall survival as the endpoint can be dealt with when deaths due to COVID-19 are present in the sample.
COVID-19 deaths should not be treated as standard censored data, because usual censoring should be considered, at least in principle, non-informative. Informative censoring, instead, occurs when participants are lost to follow-up due to reasons related to the study, as seems to be the case with COVID-19 deaths of oncological patients. Direct data on how COVID-19 affects survival outcomes in patients with active malignancy or a history of malignancy are immature. However, early evidence identified increased risks of COVID-19 mortality in patients with cancer, especially in those patients who have progressive disease [4]. Patients with cancer and COVID-19 showed an increased death rate compared to the unselected COVID-19 patient population (13% versus 1.4%) [4,5]. Based on these findings, in survival analysis dropouts due to COVID-19 deaths should be considered as cases of informative censoring. Another way used in the survival analysis literature to represent this dependence is to view cancer deaths and COVID-19 deaths as competing events, see, e.g., [6] Ch. 8. In this paper, we propose an algorithmic method to include COVID-19 deaths of oncological patients in typical survival data, focusing on the classical Kaplan-Meier (KM) product-limit survival estimator. Our method is in the spirit of the Expectation-Maximization (EM) algorithms [7] used for handling missing or fake data in statistical analysis. In this sense, the method could also be used in applications different from clinical trials, e.g., reliability analysis. Correction of actuarial life tables can also be a possible application.
An overview of methods for dealing with missing data in clinical trials is provided by DeSouza, Legedza and Sankoh [8]. See also Shih [9]. In Shen and Chen [10] the problem of doubly censored data is considered and a maximum likelihood estimator is obtained via EM algorithms that treat the survival times of left-censored observations as missing. Concerning situations with informative censoring, where there is stochastic dependence between the time to event and the time to censoring (which is our case if "censoring" is a COVID-19 death), a distinction is proposed by Willems, Schat and van Noorden [11] between cases where the stochastic dependence is direct and cases where it acts through covariates. In that paper [11], the latter case is considered and an "inverse probability censoring weighting" approach is proposed for handling this kind of censoring. Since at this stage it is difficult to model cancer deaths and COVID-19 deaths through common covariates, in this paper we consider the case of direct dependence. We do not consider a survival regression model based on specified covariates and limit the analysis, as has been said, to the basic Kaplan-Meier survival model, which is assumed to be applied, as usual, to a sufficiently homogeneous cohort of oncological patients. In this framework, we propose a so-called mean-imputation method for COVID-19 deaths using a purpose-built algorithm, referred to as the Covid-Death Mean-Imputation (CoDMI) algorithm. A user-friendly version of this algorithm programmed in R is freely available. The corresponding source code can be downloaded from the website: https://github.com/alef-innovation/codmi (accessed on 6 July 2021).
An alternative approach to survival analysis when COVID-19 deaths are present in an oncology clinical trial in addition to cancer deaths could be based on the cumulative incidence functions, which estimate the marginal probability of each competing risk. This would lead to dealing with subdistributions and would require appropriate statistical tests to be used, see, e.g., [12]. Our algorithmic approach, instead, acts directly on the data, producing an adjustment that virtually eliminates the presence of the competing risk, thus allowing the use of standard statistical tools. This comes at the price of accepting some simplifications and specific assumptions.
The basic idea of the CoDMI algorithm is of a counterfactual nature. Since the KM model provides an estimate of the probability distribution of surviving up to any chosen point on the time axis for each patient in the sample, for each patient observed to die of COVID-19 at time θ we derive from this distribution ê_θ, the expected lifetime beyond time θ, thus obtaining the "no-Covid" expected lifetime τ̂ = θ + ê_θ for each of these patients. Each θ value is then replaced by the virtual lifetime τ̂ (this is the mean-imputation) and the KM estimation is repeated on the original data completed in this way, providing a new estimate of τ̂. This procedure is iterated until the change between two successive τ̂ estimates is considered immaterial (according to a specified criterion).
It is pointed out by Shih [9] that "The attraction of imputation is that once the missing data are filled-in (imputed), all the statistical tools available for the complete data may be applied". Although in our case we are not dealing with missing data but with partially observed data, this attractive property of mean-imputation still holds true. It should be noticed, however, that in general, treating an estimated value, even an unbiased one, as an observed value should require some increase in variance. In particular, the confidence limits of KM estimates on data including imputations should be appropriately enlarged. We propose an extension of the classical Greenwood's formula providing this correction.
The paper is organized as follows. In Section 2, the notation and the basic structure of the KM survival estimator are provided and the related problem of computing expected lifetimes is illustrated. The representation of Covid-death cases in the sample is described. In Section 3, the CoDMI algorithm is introduced and the details of the iteration procedure are provided. The convergence issue is discussed and the underlying assumptions of the algorithm are considered, taking into account some subtleties required by the nonparametric nature of the KM estimator. A possible adjustment for censoring of the algorithm is presented and a correction of Greenwood's formula is derived for taking into account the estimation error in the imputed point estimates. Applications of CoDMI to real medical data are provided in Section 4. Two oncological survival data sets which are well referenced in the literature are completed by artificial Covid-death observations, and the survival curves estimated by CoDMI are compared with the no-Covid KM estimates and with the two naïve KM estimates obtained by considering COVID-19 deaths as censorings or as deaths of disease. The effect of the final adjustment for censoring is also illustrated. In Section 5, an extensive simulation study is presented to evaluate the CoDMI predictive performance. We discuss the details of the simulation procedure and provide tables illustrating the results. Some conclusions and comments are given in Section 6. In Appendix A a derivation of the extended Greenwood's formula is provided.

Typical Clinical Trial Data and the Kaplan-Meier Estimator
We consider a study group of n oncological patients who received a specified treatment and are followed up for a fixed calendar time interval. The response for each patient is the survival time T_0, which is computed starting from the date of enrollment in the study, date 0.

Remark 1.
This is in line with the standard actuarial notation, where T_x is used to denote the survival time of a subject of age x. Our patients actually have "age 0" (in the study) at the time they are enrolled.
Typically, the observations include censored data, that is, survival times known only to exceed the reported value. Formally, for a given patient there is a censoring at time point t if we only know that for this patient T_0 > t. If t_max denotes the last observed time point in the study, i.e., t_max corresponds to the current date or the end of the study, the case of a censored time t < t_max corresponds to a patient lost to follow-up. To take into account censoring, the observations can be represented in the form:

z = {(t_i, d_i), i = 1, . . . , n},

where t_i is the observed survival time of patient i and d_i is a "status" indicator at t_i which is equal to 1 if death of disease under study (DoD) is observed and is equal to 0 if there is a censoring (Cen) at that time. We assume that the group of patients provides a homogeneous sample, that is, all the observations t_i come from the same probability distribution of T_0, and our aim is to estimate the cumulated probability function F(t) = P(T_0 < t), or the related survival function:

S(t) = P(T_0 ≥ t) = 1 − F(t).

The estimation of S(t) can be realized non-parametrically by the well-known Kaplan-Meier product-limit estimator [13]. If we denote by:

z_(i) = (t_(i), d_(i)), i = 1, . . . , n,

the observations z_i ordered by increasing value of t (with t_(0) = d_(0) = 0), the KM estimator is written as:

Ŝ(t) = ∏_{i: t_(i) < t} (1 − d_(i)/R(t_(i))),  (1)

where R(t_(i)) is the number of subjects at risk at (immediately before) time t_(i), and the ratio h(t_(i)) = d_(i)/R(t_(i)) is the hazard rate at time t_(i). Therefore Ŝ(t) is a (left continuous) step function with steps at each time a DoD event occurs.
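As a concrete sketch, the product-limit estimate (1) can be computed in a few lines of Python, assuming for simplicity one event per time point (no ties); all names and numbers are illustrative and are not taken from the paper's R implementation.

```python
# Minimal sketch of the Kaplan-Meier product-limit estimator (1).

def km_survival(times, status):
    """Return (t_(i), S_hat just after t_(i)) pairs of the KM step function.

    times  : observed survival times t_i
    status : d_i = 1 for a death (DoD), 0 for a censoring (Cen)
    """
    data = sorted(zip(times, status))      # order observations by time
    n = len(data)
    s, curve = 1.0, []
    for i, (t, d) in enumerate(data):
        r = n - i                          # R(t_(i)): subjects still at risk
        if d == 1:                         # only DoD events produce a step
            s *= 1.0 - 1.0 / r             # factor (1 - d_(i)/R(t_(i)))
        curve.append((t, s))
    return curve

# Five patients, one censored at t = 3: the censoring lowers R but not S.
print(km_survival([1, 2, 3, 4, 5], [1, 1, 0, 1, 1]))
```

Note that the censored observation leaves the survival estimate unchanged while still reducing the number of subjects at risk, exactly as described in Remark 2(ii).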

Remark 2.
(i) If there are ties in the sample, the ordering can always be unambiguously defined by adopting the appropriate conventions. We refrain here from describing these conventions, already considered in the original paper [13] and extensively discussed in the subsequent literature.
(ii) In general, the event of interest (in our case DoD) acts on the ratio d_(i)/R(t_(i)) in the estimator (1) by modifying both the numerator and the denominator. The not-of-interest event (Cen) only acts on the denominator. This follows from the assumption that a Cen corresponds to a non-informative censoring.
It is assumed that the censored observations do not contribute additional information to the estimation, which is the case if censoring is independent of the survival process. If the time points t_i are given, it was already shown in the original paper [13] that (1) is a maximum likelihood estimator. Obviously t_(n) = t_max, the last time point in the observed sequence. For our purposes, it is important to distinguish two cases, depending on whether at t_max there is a DoD or a Cen.

The Case of Complete Death-Observations
If d_(n) = 1, i.e., t_max relates to a DoD event, and if R(t_max) = 1, then one has Ŝ(t_max) = 0, which means that the data allow us to estimate the entire probability distribution of T_0. Let us refer to this case as the complete death-observations case or, briefly, the complete case. In this situation, we can compute the estimated expected future lifetime for a patient who is alive at time θ ≥ 0. Let us denote the conditional lifetime, given θ, as:

T_θ = T_0 − θ | T_0 > θ.

Then the expected future lifetime (the life expectancy) beyond θ is:

e_θ = E(T_θ) = (1/S(θ)) ∫_θ^∞ S(t) dt.  (2)

Since Ŝ(t) is a step function and the jump at time t_(i) with d_(i) = 1 equals the probability q_(i) to die of disease at this time point, (2) is equivalent to the average taken on the truncated distribution of T_0 − θ:

ê_θ = ∑_{i: t_(i) > θ} (t_(i) − θ) q_(i) / ∑_{i: t_(i) > θ} q_(i),  (3)

where q_(i) = 0 if d_(i) = 0.
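The discrete average (3) can be sketched as follows, given the death times t_(i) and the probability masses q_(i) of an estimated KM curve; the toy distribution and the function name are invented for illustration.

```python
# Expected future lifetime beyond theta, a sketch of the discrete average (3).

def life_expectancy(theta, death_times, masses):
    """e_theta = sum of (t - theta) q over t > theta, divided by sum of q."""
    num = sum((t - theta) * q for t, q in zip(death_times, masses) if t > theta)
    den = sum(q for t, q in zip(death_times, masses) if t > theta)
    return num / den

# Complete toy distribution: deaths at 1, 2, 4 with masses 0.5, 0.3, 0.2.
print(life_expectancy(1.5, [1, 2, 4], [0.5, 0.3, 0.2]))  # → 1.3
```

The denominator is the estimated probability of surviving beyond θ, so the result is the mean of the distribution truncated at θ, as in (3).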

The Incomplete Case
If the condition d_(n) = 1 is not fulfilled, we are in an incomplete (death-observations) case: one has Ŝ(t_max) > 0, meaning that the data are not sufficient to estimate the entire survival distribution, and the expected future lifetime e_θ cannot be derived without some ad hoc choices or suitable additional assumptions.
Let us denote by t_max^(D) the last observed time point of a DoD event (i.e., t_max^(D) = max{t_i : d_i = 1}). For t > t_max^(D), the KM estimate only provides the final survival probability Q_fin = Ŝ(t_max^(D)) > 0. We then choose to complete the distribution by setting Ŝ(t_max) = 0, which is equivalent to posing the entire probability mass Q_fin on the last time point t_max. In terms of the data, this is also equivalent to changing to 1 the status indicator d_(n). The effect of this choice depends on the actual meaning we attribute to the random variable T_0. If T_0 represents the entire future lifetime of the patients since they entered the study, then posing Ŝ(t_max) = 0 provides an underestimation of the life expectancy, since we have θ + e_θ ≤ t_max while we know that at least one patient was alive at the end of the study. In many cases, however, it is convenient to assume that the variable of interest is the patient's lifetime in the study. Formally, we would consider the random variable T′_0 = min{T_0, t_max}, where t_max is the duration of the study. The completed survival function refers to this random variable and no underestimation would be produced in this case. This issue is strictly related to the special nature of the final time point t_max in this kind of survival problem. For example, self-consistency, an important property of the KM estimator, only holds if Ŝ(t_max) = 0.

Remark 3.
This was pointed out by Efron [14] p. 843, where it is observed that the iterative construction underlying the KM estimator "sheds some light on the special nature of the largest observation, which the self-consistent estimator always treats as uncensored, irrespective of" d_(n).

Including Covid-Death Events in the Data
Assume that, in addition to the n patients who left the study by a DoD or a Cen event, m patients were also present in the oncological trial for whom death of COVID-19 (DoC) was observed at the time points θ_j, j = 1, . . . , m. The corresponding observed data set can be represented as follows:

x = z ∪ {θ_j, j = 1, . . . , m},  (4)

where the status indicator of each DoC event is missing. It is clearly inappropriate to pose these indicators equal to 1, but it is also not appropriate to set them equal to 0, since the DoC event provides an informative censoring, given that we know this event does carry prognostic information about the survival experience of the oncological patients. More precisely, we know that there is a positive correlation between DoD and DoC events. However, ignoring DoC data would cause an unpleasant loss of information, and we would like to adjust these data in some way, so that they can be included in the study. Formally, we are interested in replacing each of the observed θ_j by a different appropriate time point τ_j > θ_j, a virtual lifetime conditional on θ_j, possibly with an appropriate value of the corresponding status indicator, which we will denote by δ_j. We are confident that this replacement of the DoC time points can be properly completed just because we assume that, due to the dependence between DoC and DoD events, the "standard" data z contain information on the COVID-19 data (and vice versa). The determination of the status indicators δ_j is more challenging. However, with the appropriate adjustment we can consider the whole data set:

w = z ∪ {(τ_j, δ_j), j = 1, . . . , m},  (5)

and we can safely apply the KM estimator to these data, thus also using the information contribution carried by COVID-19 deaths. In the following section, we will propose an iterative procedure to suitably realize this adjustment.
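For concreteness, the data sets z, x and w can be represented with plain Python structures as follows; all numerical values and variable names are invented for illustration.

```python
# Plain-Python representation of the data sets in (4) and (5).

# Standard observations z: pairs (t_i, d_i), d_i = 1 for DoD, 0 for Cen.
z = [(120, 1), (200, 0), (310, 1)]

# DoC time points theta_j, whose status indicators are missing.
theta = [150, 250]

# Observed data set x = z with the DoC times added, as in (4).
x = (z, theta)

# A completed data set w, as in (5): each theta_j is replaced by a
# virtual lifetime tau_j > theta_j with a status indicator delta_j.
tau, delta = [180.0, 275.0], [1, 1]
w = z + list(zip(tau, delta))
print(w)  # → [(120, 1), (200, 0), (310, 1), (180.0, 1), (275.0, 1)]
```

Once w is formed, the KM estimator of (1) can be applied to it directly, which is the property the adjustment procedure exploits.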

The CoDMI Algorithm
Obviously, the input data to the algorithm are given by the observation set x in (4). We will assume, however, that all patients who died of COVID-19 would have died of disease if COVID-19 had not intervened, thus setting δ_j ≡ 1 (i.e., assuming that all the virtual lifetimes τ_j would have been terminated by a DoD event). We will see in Section 3.4 how one can try to get around this limitation in this counterfactual problem. Under the assumption δ_j ≡ 1, the basic idea of our COVID-19 adjustment is to estimate the virtual lifetimes τ_j as the expectations E(T_θ_j) provided by the KM estimator itself. This is realized by a procedure consisting of the following steps.

• Initialization step. One starts by setting (τ̂_j^(0), δ_j), j = 1, . . . , m, where the τ̂_j^(0) > θ_j are arbitrarily chosen initial values. Then one obtains an artificial complete data set ŵ^(0), as defined in (5). Examples of initialization are τ̂_j^(0) = θ_j, or τ̂_j^(0) = θ_j + ê_θ_j, where ê_θ_j is the life expectancy computed by applying the KM estimator to the standard data z.
• Estimation step. The KM estimator is applied to ŵ^(0) to produce the survival function estimate Ŝ^(0)(t). In case of incomplete death-observations, the distribution is completed by setting Ŝ^(0)(t_max) = 0, as illustrated in the incomplete case above.
• Expectation step. The life expectancies ê^(0)_θ_j are computed as in (3). The corresponding time points τ̂_j^(1) = θ_j + ê^(0)_θ_j are then derived. One then obtains the new artificial complete data set:

ŵ^(1) = z ∪ {(τ̂_j^(1), δ_j), j = 1, . . . , m}.

• The estimation and the expectation steps are repeated, producing at the k-th stage a new complete data set ŵ^(k), provided by the expectations {ê^(k)_θ_j, j = 1, . . . , m}. The iterations stop when a specified convergence criterion is fulfilled. A natural criterion is:

max_{j=1,...,m} |τ̂_j^(k) − τ̂_j^(k−1)| ≤ ε,  (6)

for a suitably specified tolerance level ε > 0 (this choice will be left as an option for the user). If condition (6) is not satisfied after a fixed maximum number of iterations (which will also be chosen as a user option), the convergence is considered failed.
If the convergence criterion is met, the final values of the m life expectancies provide estimates which we will denote by ê_θ_j. The corresponding estimated lifetimes are τ̂_θ_j = θ_j + ê_θ_j and the estimated whole data set is:

ŵ = z ∪ {(τ̂_θ_j, δ_j), j = 1, . . . , m}.  (7)

This iterative procedure can be seen as belonging to the class of the well-known Expectation-Maximization (EM) algorithms, since the estimation step can be interpreted as a maximization, given that the KM approach provides a maximum likelihood estimator. In this class of algorithms the expectation step is often referred to as mean-imputation, hence we will call our iterative procedure the Covid-Death Mean-Imputation (CoDMI) algorithm.
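The whole iteration can be sketched as follows, under the simplifying assumptions δ_j ≡ 1, no ties, initialization τ̂_j^(0) = θ_j, and completion of the distribution by forcing the last observation to be a death. This is a self-contained illustration with invented data, not the paper's R implementation.

```python
# A sketch of the CoDMI loop: initialization, estimation, expectation,
# and the convergence check (6).

def km_masses(times, status):
    """KM probability masses q_(i) at each event time (the last point is
    forced to be a death so that the distribution is complete)."""
    data = sorted(zip(times, status))
    data[-1] = (data[-1][0], 1)            # complete the distribution
    n, s, masses = len(data), 1.0, []
    for i, (t, d) in enumerate(data):
        r = n - i                          # subjects at risk
        step = s * d / r                   # jump q_(i) of the KM curve
        masses.append((t, step))
        s -= step
    return masses

def life_expectancy(theta, masses):
    num = sum((t - theta) * q for t, q in masses if t > theta and q > 0)
    den = sum(q for t, q in masses if t > theta and q > 0)
    return num / den if den > 0 else 0.0

def codmi(z, theta, eps=0.1, max_iter=200):
    """Return the imputed virtual lifetimes tau_j for the DoC times theta_j."""
    tau = list(theta)                      # initialization tau_j^(0) = theta_j
    for _ in range(max_iter):
        w = z + [(t, 1) for t in tau]      # completed data set as in (5)
        masses = km_masses(*zip(*w))       # estimation step
        new_tau = [th + life_expectancy(th, masses) for th in theta]
        if max(abs(a - b) for a, b in zip(new_tau, tau)) <= eps:
            return new_tau                 # convergence criterion (6) met
        tau = new_tau
    raise RuntimeError("CoDMI did not converge")

z = [(100, 1), (150, 0), (250, 1), (400, 1), (500, 1)]
print(codmi(z, [200, 300]))
```

On this toy data set the iteration converges in a handful of steps, with each imputed lifetime settling strictly beyond its DoC time, as required.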

Remark 4.
(i) Usually EM algorithms, and the concept of imputation, refer to procedures aimed at filling in missing data. What we are dealing with here is data observed to a limited extent, rather than completely missing. Therefore, in this application the imputation corresponds rather to a replacement (of the observed time points θ_j by the estimated time points τ_j). Our method is, however, in the spirit of the fake-data principle, as illustrated by Efron and Hastie [15], pp. 148-149.
(ii) It should be noted that the idea of estimating the virtual lifetimes τ_j as the expectations E(T_θ_j) implies a further, more subtle assumption. Let DoC_j be the event "Patient j died of COVID-19 at time θ_j" and RoC_j the event "Patient j became ill with COVID-19 but recovered at time θ_j". Using notation introduced by Pearl in causal analysis (e.g., [16]), we are assuming for this patient that:

P(T_θ_j | do(DoC_j = 0)) = P(T_θ_j | do(DoC_j = 0), RoC_j),

where do(A) is the intervention operator on event A. This means that we are assuming that the event RoC_j, which is not excluded by DoC_j = 0, does not change the probability distribution of T_θ_j. This is clearly a simplifying assumption that makes our counterfactual problem easier to solve. In a more rigorous analysis, the effect of events such as RoC_j should also be taken into account [1]. We refrain from doing this here, since such an analysis would take us out of the KM survival framework.

The Convergence Issue
In general, CoDMI is not guaranteed to converge. If we make the classical binomial assumptions, we can derive the KM likelihood as a function of the hazard rates h_i. Running the algorithm, we find that different parameter sets, and hence different sets of ê_θ_j estimates, may correspond to the same likelihood value. This indicates a possible issue of parameter identifiability. However, the classical KM likelihood is defined for fixed time points, while the estimates ê_θ_j change at each step of our algorithm. Thus, the identifiability problem should be more properly studied by referring to a likelihood function which includes the event times among the parameters as well.

Remark 5.
A similar problem of iterated estimates for the KM product-limit estimator, but with fixed time points and hence without parameter identifiability issues, was studied by Efron [14]. He proved that in this case, provided that the probability distribution is complete, the solution of the convergence problem exists and is unique. The previously mentioned self-consistency refers precisely to this property.
However, in order to manage the convergence problem, also on the basis of the results of the simulation exercise presented in Section 5, it is worth considering the following three types of situation.
(1) Finite time convergence. The difference between two successive estimates becomes zero after a finite number of iterations.
(2) Asymptotic convergence. The difference between two successive estimates tends to zero asymptotically.
(3) Cyclicity. After a certain number of iterations, cycles of the estimated values are established which tend to repeat themselves indefinitely, so that the minimum difference between two successive estimates remains greater than zero. In this case, if this minimal difference is less than the tolerance ε, the corresponding estimate can be accepted (this is precisely what the term "tolerance" refers to). It often happens that small changes in some of the θ_j values are sufficient to get out of cyclicity cases. Therefore, some fudging of these data can be used to obtain acceptable solutions when the minimum improvement is out of tolerance.
As shown in the simulation study in Section 5, cases of non-convergence are not very frequent, and many of them can be circumvented by relaxing the convergence criterion (6) and fudging the COVID-19 data a little, if necessary. In general, the results are found to be sensitive to the initial values τ̂_j^(0). In cases of convergence this is not a problem, since different solutions within the chosen tolerance criterion are equivalent from a practical point of view. In some cases of non-convergence, on the other hand, it is possible to reach convergence by changing the initial values.

Assumptions Underlying CoDMI
The iterative procedure described in Section 3.1 can probably be easily justified by intuitive reasoning. However, also to give internal consistency to the simulation procedure presented in Section 5, it is convenient to specify more precisely the assumptions underlying the CoDMI algorithm. A preliminary remark is important. In our framework, the "true" probability distribution of the random variable T_0 is the best-fitting distribution in the KM sense, i.e., the distribution identified by applying the maximum likelihood product-limit estimator to the existing data. Without appropriate additional assumptions (e.g., specifying an analytic form of the hazard function) this distribution is completely non-parametric and there is no other way to identify it than by specifying the data as well as the estimator used (the product-limit estimator, in fact). One could say that data provide information to the estimator, and the estimator provides probabilistic structure to data. Having made this remark, the basic assumption underlying the CoDMI algorithm can be outlined as follows. When COVID-19 deaths are present in the study sample, there is an extended underlying data structure composed of the n observed lifetimes t_i (ending with a DoD or a Cen) and of the m partially observed lifetimes τ_j (virtually ending, if we assume δ_j = 1, with a DoD). The corresponding probability distribution is the best-fitting distribution specified by this extended data, i.e., by applying the KM estimator to the data set w = z ∪ z′. We will keep this property in mind when we generate the simulated scenarios on which to measure the algorithm's predictive performance.

Adjusting for the Assumption δ j ≡ 1
Relaxing the assumption that patients eliminated by a DoC event would have died of disease without this event is not an easy task. The prediction of the status indicators δ_j increases the forecasting problem by one dimension and requires a reliable predictive model, which is currently not available to us. We are therefore content to propose an adjustment for censoring of the response of the CoDMI algorithm, which should mitigate the possible bias produced by the assumption δ_j ≡ 1. If the algorithm met the convergence criterion, the final data set is given by (7). We then consider the modified data set:

ŵ^(R) = {(t_i, 1 − d_i), i = 1, . . . , n} ∪ {(τ̂_θ_j, 1 − δ_j), j = 1, . . . , m},  (8)

where both the observed and the estimated virtual lifetimes are kept the same, while all the status indicators are reversed. Running the KM estimator on the set ŵ^(R), one obtains the so-called reverse Kaplan-Meier survival curve Ŝ^(R)(t), which refers to Cen instead of DoD endpoints, and provides the new conditional expectations τ̂^(R)_j, given θ, of the virtual lifetimes. We then choose to derive the adjusted estimates τ̂*_j, for j = 1, . . . , m, as:

τ̂*_j = α(θ_j) τ̂_j + (1 − α(θ_j)) τ̂^(R)_j,  (9)

where α(t) is the probability that an event observed at time t is a DoD (as opposed to a Cen). In order to estimate these non-censoring probabilities, the standard observations z can be used. The above procedure is fairly ad hoc and the indications provided do not necessarily have to be accepted. It may be the case that the user of the procedure has a personal opinion, based on external information, on the value of (some of) the virtual status indicators δ_j. In this situation the coefficients α(θ_j) in (9) could be assigned or modified by the user on the basis of this expert judgment.
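Under one possible reading of this adjustment, each imputed lifetime is blended with its reverse-KM counterpart through the weight α(θ_j). The blending form, the helper names and all numbers below are illustrative assumptions, not taken verbatim from the paper.

```python
# Sketch of the censoring adjustment: flip the status indicators to build
# the reverse data set (8), then blend the two virtual lifetime estimates
# with the probability alpha that an event at theta_j is a true DoD.

def reverse_indicators(data):
    """Data set (8): same lifetimes, all status indicators flipped."""
    return [(t, 1 - d) for t, d in data]

def adjusted_lifetime(tau_dod, tau_cen, alpha):
    """Blend the DoD-based and the Cen-based virtual lifetime estimates."""
    return alpha * tau_dod + (1.0 - alpha) * tau_cen

w = [(120, 1), (200, 0), (400, 1)]
print(reverse_indicators(w))                 # [(120, 0), (200, 1), (400, 0)]
print(adjusted_lifetime(400.0, 360.0, 0.7))  # → 388.0
```

With α close to 1 the adjusted estimate stays near the DoD-based lifetime, while α close to 0 treats the patient as (potentially) censored, mirroring the expert-judgment option described above.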

An Extended Greenwood's Formula
The virtual lifetime expectations τ̂_j provided by CoDMI and included in the mean-imputed data ŵ are point estimates which allow these data to be used with any statistical tool available for survival analysis. However, replacing an observed value with a point estimate, even an unbiased one, increases the variance of the survival estimates, since the mean-imputed data convey their own estimation error. Usually the standard deviation of the KM survival function estimate is computed using Greenwood's formula. On the standard data, using the same notation as in (1), this can be written as:

Var[Ŝ(t)] = Ŝ(t)² ∑_{i: t_(i) < t} h(t_(i)) / (R(t_(i)) (1 − h(t_(i)))),  (10)

where the summand is set to 0 if h(t_(i)) = 1. We provide an extension of this formula in order to include the variance component due to the estimated time points τ̂_j. We start from the CoDMI output, possibly with the adjustment for censoring:

ŵ = z ∪ {(τ̂*_j, δ_j), j = 1, . . . , m},  (11)

where the τ̂*_j are derived by (9) and the indicators δ_j can be equal to 0 or 1. We represent the ŵ data set in the alternative form:

ŵ = {(t_i, d_i, δ_i), i = 1, . . . , n + m},

where:
• t_i = t_i or τ̂*_j are the observed or estimated survival times ordered by increasing value (the usual conventions on tied values apply);
• d_i = 0 if t_i corresponds to a Cen and 1 otherwise;
• δ_i = 1 if t_i corresponds to a DoC and 0 otherwise.
Since the time points t_i are assumed to be ordered, we simplify the exposition in this section by using the subscript i instead of (i) (and R_i instead of R(t_(i))). We then consider both the "direct" probability distribution {q_i, i = 1, . . . , n + m} and the reverse one {q^(R)_i, i = 1, . . . , n + m}, both taken from the CoDMI output, and from these we derive the m direct and the m reverse truncated distributions:

q_{i,j} = q_i / ∑_{k: t_k > θ_j} q_k if t_i > θ_j,   q^(R)_{i,j} = q^(R)_i / ∑_{k: t_k > θ_j} q^(R)_k if t_i > θ_j.

These distributions are defined, with null values, also for t_i ≤ θ_j.
Finally, we compute the total probabilities Q_i = ∑_{j=1}^m q*_{i,j}, where each q*_{i,j} combines the direct and the reverse truncated distributions through the non-censoring probabilities α(θ_j). With these definitions, we propose the following correction of Greenwood's formula:

Var[Ŝ(t)] = Ŝ(t)² ∑_{i: t_i < t} h̃_i / (R̃_i (1 − h̃_i)),  (12)

where the hazard rates h̃_i and the numbers of subjects at risk R̃_i are redefined so as to include the fractional Covid deaths Q_i. The basic idea underlying this formula is that the m COVID-19 deaths are distributed as "fractional deaths" Q_i = ∑_j q*_{i,j} over all the uncensored time points (both DoD and DoC), and the hazard rate at time t_i has a random component with mean Q_i/R_i and a corresponding variance. The details of the derivation of Formula (12) are provided in Appendix A. Using (12), the approximate 95% confidence intervals can be computed by:

Ŝ(t) ± 1.96 (Var[Ŝ(t)])^{1/2}.
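For reference, the classical formula (10) and the resulting approximate 95% confidence band can be sketched as follows (single event per time point, no ties; the function name and the data are illustrative). The extended formula (12) has the same structure, with the modified hazard rates and risk sets in place of the standard ones.

```python
# Classical Greenwood formula (10) with the resulting approximate 95%
# confidence band, as a self-contained sketch.
import math

def km_with_greenwood(times, status):
    """Return (t_(i), S_hat, standard error) triples."""
    data = sorted(zip(times, status))
    n, s, acc, out = len(data), 1.0, 0.0, []
    for i, (t, d) in enumerate(data):
        r = n - i                          # R(t_(i))
        if d == 1 and r > 1:               # summand set to 0 if h = 1
            acc += 1.0 / (r * (r - 1))     # h / (R (1 - h)) with d = 1
        if d == 1:
            s *= 1.0 - 1.0 / r
        out.append((t, s, s * math.sqrt(acc)))
    return out

# Confidence limits S_hat +/- 1.96 se, clipped to [0, 1].
for t, s, se in km_with_greenwood([1, 2, 3, 4], [1, 1, 0, 1]):
    lo, hi = max(s - 1.96 * se, 0.0), min(s + 1.96 * se, 1.0)
    print(t, round(s, 3), round(lo, 3), round(hi, 3))
```

Since the mean-imputed data add their own estimation error, the band produced by (12) is wider than the one produced by this classical computation, as illustrated in Figure 3.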

Application to COVID-19 Extended NCOG Data
To illustrate the effects of our mean-imputation adjustments, we start by considering some real survival data well referenced in the literature, and apply the CoDMI algorithm to these data after the addition of some artificial COVID-19 deaths. This is done because, currently, sufficiently rich real data sets containing both cancer-death and Covid-death events are hardly available. To this aim, we chose, as the real reference data, the head/neck cancer data of the NCOG (Northern California Oncology Group) study, which was used to illustrate the KM approach in the book by Efron and Hastie [15], Section 9.2. We considered data from the two arms, A and B, separately.

Arm A of NCOG Data
Survival times (in days) from Arm A in the first panel of Table 9.2 [15] are reported in Table 1. To save space, the data are presented, as in the book, in compact form, with the + sign representing censoring. The conversion of these data into the form of a two-component vector z = {(t_i, d_i), i = 1, 2, . . . , n} is immediate. There are n = 51 patients, with 43 DoD events and 8 Cen events. The final time point is 1417 days after the beginning of the study, and a DoD is observed on that date. Therefore we are in a complete death-observations case, with t_max = t_max^(D) = 1417. The corresponding KM estimate of the survival function Ŝ(t) is illustrated by the black line in Figure 1. To illustrate the application of the CoDMI algorithm, we add to these data an artificial group of m Covid-death observations, i.e., m DoC events assumed to be observed at the time points θ = {θ_j, j = 1, 2, . . . , m}. We chose m = 5 (roughly 10% of n) DoC events, on 5 time points roughly equally spaced in (0, t_max): θ = {250, 500, 750, 1000, 1250}.
Once the observation set x = z ∪ θ has been specified, we have to choose the virtual lifetimes τ̂^(0)_j in the data set ŵ^(0) which is used to initialize the CoDMI algorithm. Choosing, for example, the option τ̂^(0)_j = θ_j, the algorithm, run with tolerance ε = 0.1, converges; the resulting virtual lifetimes τ̂_j are then used as mean-imputed data to obtain the final complete data set ŵ in (7). As one can observe, the expectation Formula (3) provides non-integer values; this is not a problem, since the survival function provided by the KM estimator is defined on the real axis.
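The fit-then-impute loop used to produce the τ̂_j can be sketched as follows. This is a minimal Python illustration of the mean-imputation idea; the function names, the handling of residual KM mass by renormalization, and the toy data are our own simplifications, not the authors' R implementation.

```python
import numpy as np

def km_masses(times, status):
    """Product-limit (KM) fit returning the uncensored time points and
    the discrete probability masses the estimator places on them."""
    idx = sorted(range(len(times)), key=lambda i: (times[i], -status[i]))
    at_risk, S = len(times), 1.0
    pts, masses = [], []
    for i in idx:
        if status[i] == 1:
            h = 1.0 / at_risk              # hazard estimate at this point
            pts.append(times[i])
            masses.append(S * h)
            S *= 1.0 - h
        at_risk -= 1
    return np.array(pts), np.array(masses)

def codmi_sketch(z_times, z_status, theta, eps=0.1, max_iter=100):
    """Each DoC time theta_j is replaced by E(T | T > theta_j) under the
    KM fit of the completed data; fitting and imputation are repeated
    until the imputed lifetimes move by less than eps."""
    tau = np.asarray(theta, dtype=float)   # initialization: tau^(0) = theta
    for _ in range(max_iter):
        times = np.concatenate([z_times, tau])
        status = np.concatenate([z_status, np.ones_like(tau)])
        pts, q = km_masses(times, status)
        q = q / q.sum()                    # renormalize residual mass (toy choice)
        new_tau = np.array([
            np.dot(pts[pts > th], q[pts > th]) / q[pts > th].sum()
            for th in theta                # conditional expectation E(T | T > theta_j)
        ])
        if np.max(np.abs(new_tau - tau)) < eps:
            return new_tau
        tau = new_tau
    return tau
```

On a toy complete-death sample z = {1, . . . , 10 days, all DoD} with a single DoC at θ = 4.5, the loop stabilizes after a couple of iterations.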

Remark 6.
A tolerance of 0.1 already provides overabundant precision for our applications. However, in order to stress the algorithm, we also tried ε = 10^{-8} and ε = 10^{-18}, obtaining convergence after 33 and 51 iterations, respectively. This seems to be a case of asymptotic convergence.
The survival curve provided by the KM estimator applied to the completed data ŵ ("DoC Imputed") is illustrated in blue in Figure 1, where it can be compared with the original survival estimate based on the z data ("Without DoC", black). For further comparison, we also present the KM survival curves estimated by the two naïve strategies, which classify all DoC events as Cen, i.e., τ̂_j ≡ θ_j and δ_j ≡ 0 ("DoC as Cen", green), or all DoC events as DoD, i.e., τ̂_j ≡ θ_j and δ_j ≡ 1 ("DoC as DoD", red). In the figure, the "critical" time points are marked by indicating the 8 Cen points by ticks and the 5 θ_j points by red triangles on the black curve, while the 5 τ̂_j points are indicated by circles on the blue line (where, obviously, each circle corresponds to a jump).
We finally illustrate the application of the adjustment for censoring presented in Section 3.4. After deriving from ŵ the modified data set ŵ^(R) in (8), we estimate the probability curve α(t) as specified in Section 3.4; it is illustrated in the left panel of Figure 2. By this function, one finds that the last two time points should be considered as (potentially) censored; they are then estimated as in (17), providing the data set ẑ in (16). In Figure 3, the survival function estimated after the suggested adjustment for censoring is reported, together with the 95% confidence limits computed with the traditional Greenwood formula (red dotted lines) and with the extended Formula (12) (blue dashed lines).

Arm B of NCOG Data
In Table 2, we report the censored survival times (in days) from Arm B in the second panel of Table 9.2 of [15]. Also in this case, for reasons of space, we refrain from presenting the data converted into the z form. Data are heavily censored in this arm: there are n = 45 patients, with 14 Cen events, mainly concentrated among the largest time points. Moreover, we are in a case of incomplete death observations, since the final time point t_max = 2297 is a Cen point. The last time point with an observed DoD event is t^(D)_max = 1776, and 4 Cen events are observed thereafter. The final level of the survival curve provided by the KM estimator is Ŝ(t^(D)_max) = 22.99%, and we choose to allocate this probability mass entirely to the final Cen point 2297. For the artificial data on COVID-19 deaths, also in this case we chose m roughly equal to 10% of n, i.e., m = 5 DoC events, with θ_j values roughly equally spaced in the interval (0, 2297). The last time point in θ is after the last observed DoD time point (1776). As in the previous case, the initial data set ẑ is derived by setting δ_j ≡ 1, and the complete data set ŵ^(0) = z ∪ ẑ is used to initialize the CoDMI algorithm. The algorithm, run again with ε = 0.1, converged after 12 iterations (with a much smaller tolerance, convergence was met after 49 iterations). In Figure 4, we replicate the illustrations of Figure 1 on these data. As concerns the adjustment for censoring, from the estimated probability curve reported in the right panel of Figure 2 we find that the last four time points should be considered as censored. Using the criterion in (17), the final data set is obtained: ẑ = {. . . , (1654.63, 1), (1922.76, 0), (1978.15, 0), (2084.32, 0), (2201.93, 0)}.

A Simulation Study
In order to test the ability of CoDMI to correctly estimate the expected life-shortening (or the corresponding virtual lifetime) due to DoC events in a study population, we generate many scenarios, each containing simulated data. These pseudo-data include a z̃ data set of standard observations and a τ̃^(0) data set of (preliminary) virtual lifetimes. By randomly censoring the time variables in τ̃^(0), a corresponding set θ̃ of DoC time points is derived. In order to equip these pseudo-data with a probabilistic structure consistent with the CoDMI assumptions, a KM best-fitting distribution is derived by applying the product-limit estimator to z̃ ∪ τ̃^(0). The "true" virtual lifetimes τ̃ are then derived by conditional sampling, given θ̃, from this distribution. Running the CoDMI algorithm on the pseudo-observations x̃ = z̃ ∪ θ̃, the estimated virtual lifetimes τ̂ are obtained, and the quality of the estimator is measured by computing the average, over all scenarios, of the prediction errors τ̃ − τ̂.

Details of the Simulation Process
The details of each scenario simulation are as follows:
1. Simulation of standard observations. A set z̃ of n_sim pseudo-observations, together with the m_sim preliminary virtual lifetimes τ̃^(0)_j, is generated by resampling from the real data. Remark. It should be noted that many tied values can be generated in this step, especially if n_sim ≫ n. Moreover, t̃^(D)_max could turn out to be censored (a case of incomplete death observations) even if the death observations are complete in the original data. It is easy to guess that generating many scenarios in this way can produce a number of "extreme" pseudo-data sets z̃. This is useful, however, for testing the algorithm even in unrealistic situations. Most cases of failed convergence correspond to such extreme situations.
2. Simulation of DoC time points θ̃_j. Each DoC time point θ̃_j is obtained by randomly censoring the corresponding preliminary lifetime, drawing θ̃_j uniformly in (0, τ̃^(0)_j). Remark. The use of a uniform distribution is obviously questionable, and a more "informative" distribution could be suggested. For example, a beta distribution with first parameter greater than 1 and second parameter lower than 1 may be preferable, as it makes values of θ̃_j closer to τ̃^(0)_j more probable. However, the form of this distribution is irrelevant to our purposes: we are interested in observing how well CoDMI is able to capture the simulated virtual lifetimes, independently of how they are generated.
3. Simulation of virtual lifetimes τ̃_j. The temporary lifetimes τ̃^(0)_j (and the data set z̃) cannot be directly used to test the CoDMI algorithm, since their probabilistic structure is indeterminate and, in any case, we have too few (pseudo-)observations. In order to introduce a probabilistic structure consistent with the CoDMI assumptions, we first run the KM estimator on the data set w̃^(0) = z̃ ∪ {(τ̃^(0)_j, 1), j = 1, 2, . . . , m_sim}. The virtual lifetimes τ̃^(1)_j, j = 1, 2, . . . , m_sim, are then obtained by computing the conditional expectations E(T_{θ̃_j}) under this distribution. However, this is not yet fully consistent with the CoDMI assumptions since, as discussed in Section 3.3, the appropriate distribution is the KM best-fitting distribution specified on the extended data, i.e., data including the virtual lifetimes themselves. To obtain this result we should repeat the previous step, i.e., run the product-limit estimator on the new data set w̃^(1) = z̃ ∪ {(τ̃^(1)_j, 1)} and take the conditional expectations under this distribution. In principle, this step should be iterated, similarly to what is done in the CoDMI algorithm. To avoid convergence problems, however, we prefer to limit the number of iterations to a fixed (low) value n_iter, thereby implicitly accepting a certain level of bias in the estimations. After these n_iter iterations have been made, the final data set w̃^(n_iter) = z̃ ∪ {(τ̃^(n_iter)_j, 1)} is obtained. Running the KM estimator on these data again, the final distribution {q̃^(n_iter)_i, i = 1, 2, . . . , n_sim + m_sim} is obtained, and the definitive time points τ̃_j, with the corresponding ẽ_j = τ̃_j − θ̃_j, are computed by conditional sampling, given θ̃_j, i.e., by simulating from the truncated distribution {q̃^(n_iter)_i, i : t_i > θ̃_j} (after normalization). These sampled values are taken as the true values of the virtual lifetime and life expectancy, respectively, which should be estimated by CoDMI using only the information z̃ ∪ θ̃.
4. Application of CoDMI and naïve estimators. The CoDMI algorithm is applied to the simulated data x̃ = z̃ ∪ θ̃, with the z̃_i obtained in step 1 and the θ̃_j in step 2. Provided that the algorithm converges, we obtain the m_sim estimated virtual lifetimes τ̂_j and the estimated life expectancies ê_j.
To allow comparison, we also derive in this step the predictions of the two naïve "estimators", obtained by applying the KM estimator to the simulated data w̃, modified by setting, for all j, τ̃_j = θ̃_j and δ_j = 1 ("DoC as DoD") or δ_j = 0 ("DoC as Cen").
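The conditional-sampling mechanism of step 3, which generates the "true" virtual lifetimes, can be sketched as follows. This is a Python toy: the discrete distribution {q_i} stands in for the final KM best-fitting distribution, and all names and values are illustrative.

```python
import numpy as np

def sample_virtual_lifetime(pts, q, theta, rng):
    """Draw a 'true' virtual lifetime by conditional sampling, given the
    DoC time theta, from the truncated distribution {q_i, i: t_i > theta}
    after normalization (step 3 of the simulation)."""
    mask = pts > theta
    p = q[mask] / q[mask].sum()
    return rng.choice(pts[mask], p=p)

rng = np.random.default_rng(1)
pts = np.array([100.0, 400.0, 700.0, 1000.0])   # uncensored time points
q = np.array([0.4, 0.3, 0.2, 0.1])              # KM-type probability masses
theta = 450.0                                   # simulated DoC time
tau_true = sample_virtual_lifetime(pts, q, theta, rng)
e_true = tau_true - theta                       # true life expectancy
```

Each scenario then runs CoDMI on the pseudo-observations z̃ ∪ θ̃ and records the estimated life expectancy; averaging the differences between true and estimated values over scenarios yields the bias measure of the next section.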

Evaluation of the Predictive Performances
In the simulation exercise, a large number N of scenarios is generated. This provides, for j = 1, 2, . . . , m_sim and k = 1, 2, . . . , N, the N · m_sim CoDMI estimates ê^(k)_j (from step 4) and the N · m_sim true realizations ẽ^(k)_j (from step 3). We can then compute the prediction errors:
$$
\Delta_j^{(k)} = \tilde e_j^{(k)} - \hat e_j^{(k)}, \qquad j = 1, 2, \dots, m_{\mathrm{sim}}, \quad k = 1, 2, \dots, N,
$$
and the average errors:
$$
\bar\Delta_j = \frac{1}{N} \sum_{k=1}^{N} \Delta_j^{(k)}, \qquad j = 1, 2, \dots, m_{\mathrm{sim}}.
$$
Positive (negative) values of $\bar\Delta_j$ correspond to under(over)-estimates provided by CoDMI. As usual, we can associate with these average errors the corresponding standard error, i.e., the standard error of the mean (s.e.m.). Given the independence assumption, the central limit theorem guarantees, as usual, that the sample means are asymptotically normal. Therefore, the corresponding s.e.m. is inversely proportional to √N. The same summary statistics are computed for the prediction errors relative to the two naïve estimators.
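The summary statistics can be computed as in the following sketch (Python; the array layout, one row per scenario k and one column per DoC event j, is our own assumption):

```python
import numpy as np

def error_summary(e_true, e_hat):
    """Mean prediction errors and their standard errors of the mean.

    e_true, e_hat : arrays of shape (N, m_sim) holding the true values
                    and the CoDMI estimates of the life expectancies.
    Returns the average errors and the corresponding s.e.m., which
    shrinks proportionally to 1/sqrt(N).
    """
    delta = e_true - e_hat                       # prediction errors
    mean = delta.mean(axis=0)                    # average error per DoC event
    sem = delta.std(axis=0, ddof=1) / np.sqrt(delta.shape[0])
    return mean, sem
```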

Results from Simulation Exercises
Two separate simulation exercises were performed, one using Arm A and the other using Arm B as real-life data. In both exercises, N = 10,000 scenarios were generated, with n_sim = 100 standard observations (roughly double the real ones) and m_sim = 10 COVID-19 deaths. A tolerance ε = 1 was chosen for the CoDMI algorithm, with a maximum number of allowed iterations iter_max = 100. The number of iterations for generating the true values was n_iter = 10, and the same initialization option for τ̂^(0) was chosen in all runs. Since in some scenarios CoDMI failed to converge (with the chosen values of ε and iter_max), the sample means and the corresponding s.e.m. were computed only on the N_c convergence cases.
In Table 3, which refers to Arm A data, the simulation results are reported for each of the 10 DoC cases. We obtained N_c = 9,802 convergence cases out of the 10,000 simulated. In each row, the sample means of the DoC time points θ̃_j, of the true life expectancy ẽ_j and of the CoDMI-estimated life expectancy ê_j are reported in columns 2-4. In columns 5-9, we provide summary statistics of the corresponding prediction errors: the mean error Δ̄_j = ẽ_j − ê_j, the related s.e.m., the relative mean error Δ̄_j/ẽ_j and the minimum and maximum value of Δ^(k)_j. The same results for the 10,000 scenarios generated from Arm B data are reported in Table 4. Table 5 provides the results of Tables 3 and 4 aggregated over all the DoC events. These overall results are summarized in the block "DoC imputed". In the blocks "DoC as DoD" and "DoC as Cen" the average prediction errors are reported for the two corresponding naïve estimators. The main finding from the simulations is that the CoDMI estimates appear to be essentially unbiased, with a relative prediction error of around 0.5% for both the original data sets considered. Some more extensive (and time-consuming) tests, with N = 10^5 or N = 10^6, have shown a further reduction of the error (as well as, obviously, of the corresponding s.e.m.). As a final exercise, we used a modified version of the simulation procedure to assess the goodness of the adjustment for censoring described in Section 3.4. In the modified simulation, all the m_sim true virtual lifetimes τ̃_j were generated assuming a censoring, instead of a DoC, as the endpoint. We then set δ_j ≡ 0 and, in step 3 of Section 5, generated in all iterations the virtual lifetimes τ̃_j using the truncated reverse probability distribution, i.e., the distribution obtained by applying the Kaplan-Meier estimator to the reversed data w̃^(R) (see (8)).
Consistently with this change of assumption, the estimated values τ̂_j in each simulation were obtained by applying the CoDMI algorithm with the final adjustment for censoring, setting to 0 all the probabilities α(θ_j) in (9). The overall results from these simulations are summarized in Table 6, which has the same structure as Table 5 and where the results without adjustment are also provided for comparison. As we can see, the changed assumption on the status of the DoC endpoints produces a large increase of the true life expectancy ẽ, but the adjustment for censoring seems to capture this change quite well. Of course, in real life we do not know the true value of the δ_j, and we will have to choose the appropriate τ̂_j in (9) based on the α_j probabilities and/or using expert judgment.

Conclusions and Directions for Future Research
In the simulated scenarios, where all the virtual endpoints of COVID-19 cases are assumed to be DoD, the results indicate that the CoDMI estimator is roughly unbiased and outperforms the alternative estimates obtained by the naïve approaches. In the opposite extreme situation, where all the virtual endpoints of COVID-19 cases are assumed to be censored, the final adjustment for censoring of CoDMI also guarantees unbiasedness, provided that the information on the status of the DoC events is assumed to be known. The non-convergence cases can often be circumvented by relaxing the convergence criterion and/or slightly perturbing the COVID-19 data. Furthermore, changing the initialization of the algorithm can be useful in some cases.
By a natural extension of the binomial assumptions underlying the KM estimator, a version of the classical Greenwood formula can be derived for computing the variance of CoDMI estimates. Equipped with this formula, the CoDMI algorithm is proposed as a complete statistical estimation tool.
As we pointed out in the Introduction, the CoDMI algorithm, compared with the cumulative incidence functions method often used to study competing risks, is a pragmatic approach that allows one to directly apply all standard statistical tools to the "augmented" data. However, it remains important to compare the predictive performance of the two approaches. In our applications, where the competing events are DoD and DoC, we do not yet have sufficiently rich data to test the effectiveness, and possibly the necessity, of an approach based on the cumulative incidence functions, or even to test the possibility of using the two methods in conjunction. This topic is therefore left for future research.
Another interesting issue is the convergence of the CoDMI algorithm, which is discussed in Section 3.2. A natural way to approach this problem is to study the behavior of the log-likelihood function. However, as we have pointed out, we are not in a fixed-time-points situation. It is therefore not a trivial task to explicitly write the updated log-likelihood at each iteration step, because the replacements performed at each step imply a re-ordering of the time points and, consequently, a change in the number of subjects at risk in each death probability estimate. This problem is also left as future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Derivation of the Extended Greenwood's Formula
We organize the data in a life table with K time intervals k = 1, 2, . . . , K, spanning the interval [0, t_{n+m}], each of length ∆t = t_{n+m}/K. Let us denote by h_k and ν_k the hazard rate and the number of DoD events, respectively, in interval k, and by n_k the number of subjects at risk at the end of interval k − 1. In this setting the survival function is defined as
$$
S_l = \prod_{k=1}^{l} (1 - \bar h_k), \qquad l = 1, 2, \dots, K,
$$
and an estimate $\hat S_l$ of $S_l$ is obtained by plugging in an estimate $\hat h_k$ of $\bar h_k$, k = 1, 2, . . . , l. We make the binomial assumption:
$$
n_k \hat h_k \sim \mathrm{Bin}(n_k, \bar h_k), \qquad \text{with } \bar h_k = h_k + H_k, \qquad k = 1, 2, \dots, K, \tag{A1}
$$
where the parameters $h_k$ and $H_k$ are the DoD and the DoC hazard rate, respectively. In addition to this usual assumption, we express $H_k$ as the random variable:
$$
H_k = \frac{N_k}{n_k}, \qquad N_k = \sum_{j=1}^{m} \mathbf{1}\big\{T_{\theta_j} \in ((k-1)\Delta t,\, k\,\Delta t]\big\}, \tag{A2}
$$
where, as usual, the random variable $T_{\theta_j}$ is the conditional lifetime $T_0 \mid (T_0 \ge \theta_j)$ and $\theta_j$ is the time of the j-th observed DoC event. The probability distribution of $T_{\theta_j}$, however, is not necessarily specified for the moment. Let $\mu_k = E(N_k)$ and $\sigma_k^2 = \mathrm{Var}(N_k)$. In order to derive an approximation of the variance of $\hat S_l$, l = 1, 2, . . . , K, in the same spirit as Greenwood's formula, we consider the variance of the logarithm:
$$
\mathrm{Var}\big(\log \hat S_l\big) = \mathrm{Var}\left(\sum_{k=1}^{l} \log(1 - \hat h_k)\right) = \sum_{k=1}^{l} \mathrm{Var}\big(\log(1 - \hat h_k)\big). \tag{A3}
$$
As for the second equality, it should be noted that the $\hat h_k$ values are not independent, since $n_k$ depends on the events of the previous periods. However, successive conditional independence, given $n_k$ (essentially, a martingale argument), is a sufficient condition for the equality to hold.
Now we use the so-called delta-method approximation $\mathrm{Var}(\log X) \approx \mathrm{Var}(X)/[E(X)]^2$ to obtain:
$$
\mathrm{Var}\big(\log(1 - \hat h_k)\big) \approx \frac{\mathrm{Var}(\hat h_k)}{\big[1 - E(\hat h_k)\big]^2}. \tag{A4}
$$
By the binomial assumption (A1) we have, for all k:
$$
E(\hat h_k \mid H_k) = h_k + H_k \qquad \text{and} \qquad \mathrm{Var}(\hat h_k \mid H_k) = \frac{(h_k + H_k)(1 - h_k - H_k)}{n_k}.
$$
Therefore, for the expectation of $\hat h_k$ we obtain:
$$
E(\hat h_k) = h_k + \frac{\mu_k}{n_k}, \tag{A5}
$$
and for the variance of $\hat h_k$, by the law of total variance, we have:
$$
\mathrm{Var}(\hat h_k) = \frac{E\big[(h_k + H_k)(1 - h_k - H_k)\big]}{n_k} + \frac{\sigma_k^2}{n_k^2},
$$
or, with a little algebra:
$$
\mathrm{Var}(\hat h_k) = \frac{E(\hat h_k)\big(1 - E(\hat h_k)\big)}{n_k} + \frac{\sigma_k^2}{n_k^2}\left(1 - \frac{1}{n_k}\right). \tag{A6}
$$
By inserting (A5) and (A6) into (A4), and neglecting the factor $(1 - 1/n_k)$, we have:
$$
\mathrm{Var}\big(\log(1 - \hat h_k)\big) \approx \frac{E(\hat h_k)}{n_k\big(1 - E(\hat h_k)\big)} + \frac{\sigma_k^2}{n_k^2\big(1 - E(\hat h_k)\big)^2}.
$$
Plugging in $h_k = \nu_k/n_k$ and posing $\bar h_k = (\nu_k + \mu_k)/n_k$, and using the inverse approximation $\mathrm{Var}(X) \approx [E(X)]^2\, \mathrm{Var}(\log X)$, we finally have:
$$
\mathrm{Var}\big(\hat S_l\big) \approx \hat S_l^2 \sum_{k=1}^{l} \left[ \frac{\bar h_k}{n_k (1 - \bar h_k)} + \frac{\sigma_k^2}{n_k^2 (1 - \bar h_k)^2} \right]. \tag{A7}
$$
Now, in the life table we take ∆t small enough to make each time interval contain at most one time point $t_i$. In this limit, if $k_i$ denotes the interval containing $t_i$, we assume that:
$$
P\big(T_{\theta_j} \in ((k_i - 1)\Delta t,\, k_i\,\Delta t]\big) = q^*_{i,j},
$$
consistently with the fact that, in this setting, $T_{\theta_j}$ has a discrete distribution with probability masses at the points $t_i$. These probabilities are the $q^*_{i,j}$ provided by CoDMI. Then, by (A2) and (11), we obtain:
$$
\mu_{k_i} = \sum_{j=1}^{m} q^*_{i,j} = Q_i.
$$
For the variance, assuming the independence of the $T_{\theta_j}$, we have:
$$
\sigma_{k_i}^2 = \sum_{j=1}^{m} q^*_{i,j}\big(1 - q^*_{i,j}\big) = Q_i - Q_i^{(2)}.
$$
Thus, we estimate $\mu_{k_i}$ by $Q_i$ and $\sigma_{k_i}^2$ by $Q_i - Q_i^{(2)}$. Putting it all together, for the survival function $S_l$ we arrive at the product-limit estimator:
$$
\hat S'(t) = \prod_{i:\, t_i \le t} \big(1 - \bar h_i\big), \tag{A8}
$$
where:
$$
\bar h_i = \frac{\nu_i + Q_i}{R_i},
$$
and $R_i$ is computed recursively as:
$$
R_1 = n + m, \qquad R_i = R_{i-1} - \nu_{i-1} - c_{i-1} - Q_{i-1},
$$
with $c_{i-1}$ the number of censorings at $t_{i-1}$. Correspondingly, (A7) reduces to:
$$
\mathrm{Var}\big(\hat S'(t)\big) \approx \hat S'(t)^2 \sum_{i:\, t_i \le t} \left[ \frac{\bar h_i}{R_i (1 - \bar h_i)} + \frac{Q_i - Q_i^{(2)}}{R_i^2 (1 - \bar h_i)^2} \right]. \tag{A9}
$$
In summary, the addition of the random component in the binomial assumption (A1) has the effect of distributing each of the m COVID-19 deaths, imputed by CoDMI at the time points $\tau^*_j$, over all the uncensored time points according to its truncated distribution $q^*_{i,j}$. Summing over j, we obtain the total probabilities $Q_i$, for which the property $\sum_i Q_i = m$ holds. In the variance expression (A9), the estimation error of the $\tau^*_j$ is taken into account by the additional term containing the variance estimates $Q_i - Q_i^{(2)}$. It should be noted, however, that the survival function estimate $\hat S'(t)$ given by (A8) is slightly different from the estimate $\hat S(t)$ given by (1). Since the COVID-19 deaths are spread out over all the time points, one usually has $\hat S'(t) \ge \hat S(t)$ for small t and $\hat S'(t) \le \hat S(t)$ for large t. One can accept the approximation $\mathrm{Var}(\hat S(t)) \approx \mathrm{Var}(\hat S'(t))$, which gives Formula (12). A more conservative approximation could also be considered.