Article

An Expectation-Maximization Algorithm for Including Oncological COVID-19 Deaths in Survival Analysis

1 Department of Radiological Science, Oncology and Human Pathology, “Sapienza” University of Rome, Policlinico Umberto I, 00161 Rome, Italy
2 Alef—Advanced Laboratory Economics and Finance, 00198 Rome, Italy
3 Department of Economics, University of Perugia, 06123 Perugia, Italy
* Author to whom correspondence should be addressed.
Curr. Oncol. 2023, 30(2), 2105-2126; https://doi.org/10.3390/curroncol30020163
Submission received: 26 December 2022 / Revised: 31 January 2023 / Accepted: 3 February 2023 / Published: 8 February 2023

Abstract:
We address the problem of how COVID-19 deaths observed in an oncology clinical trial can be consistently taken into account in typical survival estimates. We refer to oncological patients since there is empirical evidence of a strong correlation between COVID-19 and cancer deaths, which implies that COVID-19 deaths cannot be treated simply as non-informative censoring, a property usually required by the classical survival estimators. We consider the problem in the framework of the widely used Kaplan–Meier (KM) estimator. Through a counterfactual approach, an algorithmic method is developed that allows COVID-19 deaths to be included in the observed data by mean-imputation. The procedure belongs to the class of the Expectation-Maximization (EM) algorithms and will be referred to as the Covid-Death Mean-Imputation (CoDMI) algorithm. We discuss the underlying assumptions of CoDMI and the convergence issue. The algorithm provides a completed lifetime data set, in which each Covid-death time is replaced by a point estimate of the corresponding virtual lifetime. This completed data set is naturally equipped with the corresponding KM survival function estimate, and all available statistical tools can be applied to these data. However, mean-imputation entails an increased variance of the estimates. We therefore propose a natural extension of the classical Greenwood’s formula, thus obtaining enlarged confidence intervals for the survival function estimate. To illustrate how the algorithm works, CoDMI is applied to real medical data extended by the addition of artificial Covid-death observations. The results are compared with the estimates provided by the two naïve approaches that count COVID-19 deaths as censorings or as deaths by the disease under study. In order to evaluate the predictive performance of CoDMI, an extensive simulation study is carried out.
The results indicate that in the simulated scenarios CoDMI is approximately unbiased and outperforms the estimates obtained by the naïve approaches. A user-friendly version of CoDMI programmed in R is freely available.

1. Introduction

The problem of defining a common and appropriate method in survival analysis for handling dropouts due to coronavirus disease 2019 (COVID-19) deaths of patients participating in oncology clinical trials has recently been stressed [1,2]. In oncology trials, all-causality deaths are often counted as events for death-related endpoints, e.g., overall survival. However, as has been pointed out [2], counting a COVID-19 fatality as a death-related endpoint requires a complex redefinition of the estimand, considering a composite strategy for handling the so-called intercurrent events [3], such as “discontinuation from treatment due to COVID-19” or “delay of scheduled intervention”. The problem is also exacerbated by the difficulty of homogeneously determining whether a death is entirely attributable to COVID-19. In this paper, we address a simplified version of this problem, assuming that COVID-19-related deaths are homogeneously identified and are the only intercurrent events to be considered. In this framework, we tackle the problem of how data in an oncology trial with overall survival as the endpoint can be dealt with when deaths due to COVID-19 are present in the sample.
COVID-19 deaths should not be treated as standard censored data, because usual censoring should be considered, at least in principle, non-informative. Informative censoring, instead, occurs when participants are lost to follow-up for reasons related to the study, as seems to be the case with COVID-19 deaths of oncological patients. Direct data on how COVID-19 affects survival outcomes in patients with active malignancy or a history of malignancy are immature. However, early evidence identified increased risks of COVID-19 mortality in patients with cancer, especially in those patients who have progressive disease [4]. Patients with cancer and COVID-19 showed an increased death rate compared to the unselected COVID-19 patient population (13% versus 1.4%) [4,5]. Based on these findings, in survival analysis dropouts due to COVID-19 deaths should be considered cases of informative censoring. Another way used in the survival analysis literature to represent this dependence is to view cancer deaths and COVID-19 deaths as competing events; see, e.g., [6], Ch. 8. In this paper, we propose an algorithmic method to include COVID-19 deaths of oncological patients in typical survival data, focusing on the classical Kaplan–Meier (KM) product-limit survival estimator. Our method is in the spirit of the Expectation-Maximization (EM) algorithms [7] used for handling missing or fake data in statistical analysis. In this sense, the method could also be used in applications other than clinical trials, e.g., reliability analysis. Correction of actuarial life tables is also a possible application.
An overview of methods for dealing with missing data in clinical trials is provided by DeSouza, Legedza and Sankoh [8]; see also Shih [9]. In Shen and Chen [10], the problem of doubly censored data is considered and a maximum likelihood estimator is obtained via EM algorithms that treat the survival times of left-censored observations as missing. Concerning situations with informative censoring, where there is stochastic dependence between the time to event and the time to censoring (which is our case if “censoring” is a COVID-19 death), a distinction is proposed by Willems, Schat and van Noorden [11] between cases where the stochastic dependence is direct and cases where it acts through covariates. In that paper [11], the latter case is considered and an “inverse probability censoring weighting” approach is proposed for handling this kind of censoring. Since at this stage it is difficult to model cancer deaths and COVID-19 deaths through common covariates, in this paper we consider the case of direct dependence. We do not consider a survival regression model based on specified covariates, and limit the analysis, as has been said, to the basic Kaplan–Meier survival model, which is assumed to be applied, as usual, to a sufficiently homogeneous cohort of oncological patients. In this framework, we propose a so-called mean-imputation method for COVID-19 deaths using a purpose-built algorithm, referred to as the Covid-Death Mean-Imputation (CoDMI) algorithm. A user-friendly version of this algorithm programmed in R is freely available. The corresponding source code can be downloaded from the website: https://github.com/alef-innovation/codmi (accessed on 6 July 2021).
An alternative approach to survival analysis when COVID-19 deaths are present in an oncology clinical trial in addition to cancer deaths could be based on the cumulative incidence functions, which estimate the marginal probability of each competing risk. This would lead to dealing with subdistributions and would require appropriate statistical tests to be used; see, e.g., [12]. Our algorithmic approach, instead, acts directly on the data, producing an adjustment that virtually eliminates the presence of the competing risk, thus allowing the use of standard statistical tools. This comes at the price of accepting some simplifications and specific assumptions.
The basic idea of the CoDMI algorithm is of a counterfactual nature. Since the KM model provides an estimate of the probability of surviving beyond any chosen point on the time axis for the patients in the sample, for each patient observed to die of COVID-19 at time $\theta$ we derive from this distribution $\hat e_\theta$, the expected lifetime beyond time $\theta$, thus obtaining the “no-Covid” expected lifetime $\hat\tau = \theta + \hat e_\theta$ for each of these patients. Each $\theta$ value is then replaced by the virtual lifetime $\hat\tau$ (this is the mean-imputation) and the KM estimation is repeated on the original data completed in this way, providing a new estimate of $\hat\tau$. This procedure is iterated until the change between two successive $\hat\tau$ estimates is considered immaterial (according to a specified criterion).
It is pointed out by Shih [9] that “The attraction of imputation is that once the missing data are filled-in (imputed), all the statistical tools available for the complete data may be applied”. Although in our case we are not dealing with missing data but with partially observed data, this attractive property of mean-imputation still holds true. It should be noted, however, that in general, treating an estimated value, even an unbiased one, as an observed value calls for some increase in variance. In particular, the confidence limits of KM estimates on data including imputations should be appropriately enlarged. We propose an extension of the classical Greenwood’s formula providing this correction.
The paper is organized as follows. In Section 2, the notation and the basic structure of the KM survival estimator are provided and the related problem of computing expected lifetimes is illustrated. The representation of Covid-death cases in the sample is described. In Section 3, the CoDMI algorithm is introduced and the details of the iteration procedure are provided. The convergence issue is discussed and the underlying assumptions of the algorithm are considered, taking into account some subtleties required by the non-parametric nature of the KM estimator. A possible adjustment for censoring of the algorithm is presented and a correction of Greenwood’s formula is derived for taking into account the estimation error in the imputed point estimates. Applications of CoDMI to real medical data are provided in Section 4. Two oncological survival data sets which are well referenced in the literature are completed by artificial Covid-death observations, and the survival curves estimated by CoDMI are compared with the no-Covid KM estimates and with the two naïve KM estimates obtained by considering COVID-19 deaths as censorings or as deaths of disease. The effect of the final adjustment for censoring is also illustrated. In Section 5, an extensive simulation study is presented to evaluate the predictive performance of CoDMI. We discuss the details of the simulation procedure and provide tables illustrating the results. Some conclusions and comments are given in Section 6. In Appendix A, a derivation of the extended Greenwood’s formula is provided.

2. Notation and Assumptions on Covid Deaths in the Sample

2.1. Typical Clinical Trial Data and the Kaplan–Meier Estimator

We consider a study group of $n$ oncological patients who received a specified treatment and are followed up for a fixed calendar time interval. The response for each patient is the survival time $T_0$, which is computed starting from the date of enrollment in the study, date 0.
Remark 1.
This is in line with the standard actuarial notation, where $T_x$ is used to denote the survival time of a subject of age $x$. Our patients actually have “age 0” (in the study) at the time they are enrolled.
Typically, the observations include censored data, that is, survival times known only to exceed the reported value. Formally, for a given patient there is a censoring at time point $t$ if we only know that $T_0 > t$ for this patient. If $t_{\max}$ denotes the last observed time point in the study, i.e., $t_{\max}$ corresponds to the current date or the end of the study, the case of a censored time $t < t_{\max}$ corresponds to a patient lost to follow-up. To take censoring into account, the observations can be represented in the form:
$$z_i = (t_i, d_i), \qquad i = 1, \ldots, n,$$
where $t_i$ is the observed survival time of patient $i$ and $d_i$ is a “status” indicator at $t_i$, equal to 1 if death of the disease under study (DoD) is observed and equal to 0 if there is a censoring (Cen) at that time. We assume that the group of patients provides a homogeneous sample, that is, all the observations $t_i$ come from the same probability distribution of $T_0$, and our aim is to estimate the cumulative probability function $F(t) = P(T_0 < t)$, or the related survival function:
$$S(t) = 1 - F(t) = P(T_0 \ge t).$$
The estimation of $S(t)$ can be realized non-parametrically by the well-known Kaplan–Meier product-limit estimator [13]. If we denote by:
$$z_{(i)} = (t_{(i)}, d_{(i)}), \qquad i = 0, 1, \ldots, n,$$
the observations $z_i$ ordered by increasing value of $t$ (with $t_{(0)} = d_{(0)} = 0$), the KM estimator is written as:
$$\hat S(t) = \prod_{i:\, t_{(i)} \le t} \left( 1 - \frac{d_{(i)}}{R(t_{(i)})} \right), \tag{1}$$
where $R(t_{(i)})$ is the number of subjects at risk at (immediately before) time $t_{(i)}$, and the ratio $h(t_{(i)}) = d_{(i)}/R(t_{(i)})$ is the hazard rate at time $t_{(i)}$. Therefore $\hat S(t)$ is a (left-continuous) step function with steps at each time a DoD event occurs.
Remark 2.
(i) If there are ties in the sample, the ordering can always be unambiguously defined by adopting the appropriate conventions. We refrain here from describing these conventions, already considered in the original paper [13] and extensively discussed in the subsequent literature.
(ii) In general, the event of interest (in our case DoD) acts on the ratio $d_{(i)}/R(t_{(i)})$ in the estimator (1) by modifying both the numerator and the denominator. The not-of-interest event (Cen) only acts on the denominator. This follows from the assumption that a Cen corresponds to a non-informative censoring.
It is assumed that the censored observations do not contribute additional information to the estimation, which is the case if censoring is independent of the survival process. If the time points $t_i$ are given, it was already shown in the original paper [13] that (1) is a maximum likelihood estimator. Obviously $t_{(n)} = t_{\max}$, the last time point in the observed sequence. For our purposes, it is important to distinguish two cases, depending on whether at $t_{\max}$ there is a DoD or a Cen.
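To make the estimator concrete, here is a minimal sketch; Python is used for illustration (the authors’ released code is in R), observations are assumed to be given as $(t_i, d_i)$ pairs, and the tie-breaking conventions of Remark 2 are ignored:

```python
# Minimal sketch of the Kaplan-Meier product-limit estimator (1).
# Observations are (t_i, d_i) pairs: d_i = 1 for a DoD event, 0 for a Cen.

def km_survival(observations):
    """Return the KM estimate as a list of (t, S(t)) step points."""
    data = sorted(observations)          # order by increasing time t_(i)
    n = len(data)
    s, steps = 1.0, [(0.0, 1.0)]         # S(0) = 1
    for i, (t, d) in enumerate(data):
        at_risk = n - i                  # R(t_(i)): subjects still at risk
        if d == 1:                       # only DoD events produce a step
            s *= 1.0 - 1.0 / at_risk     # factor 1 - d_(i)/R(t_(i))
        steps.append((t, s))
    return steps

# Toy example: 5 patients, one censored at t = 4
print(km_survival([(2, 1), (4, 0), (5, 1), (7, 1), (9, 1)]))
```

A censoring leaves the curve flat but still removes a subject from the risk set, which is exactly how a Cen acts only on the denominator of (1).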

2.2. The Case of Complete Death-Observations

If $d_{(n)} = 1$, i.e., $t_{\max}$ relates to a DoD event, and if $R(t_{\max}) = 1$, then one has $\hat S(t_{\max}) = 0$, which means that the data allow us to estimate the entire probability distribution of $T_0$. Let us refer to this case as the complete death-observations case or, briefly, the complete case. In this situation, we can compute the estimated expected future lifetime for a patient who is alive at time $\theta \ge 0$. Let us denote the conditional lifetime, given $\theta$, as:
$$T_\theta = T_0 \,\big|\, (T_0 \ge \theta).$$
Then the expected future lifetime (the life expectancy) beyond $\theta$ is:
$$\hat e_\theta := \hat E(T_\theta) - \theta = \frac{1}{\hat S(\theta)} \int_\theta^{t_{(n)}} \hat S(t)\, dt. \tag{2}$$
Since $\hat S(t)$ is a step function and the jump at time $t_{(i)}$ with $d_{(i)} = 1$ equals the probability $q_{(i)}$ of dying of disease at this time point, (2) is equivalent to the average taken over the truncated distribution of $T_0 - \theta$:
$$\hat e_\theta = \frac{\sum_{i:\, t_{(i)} > \theta} (t_{(i)} - \theta)\, q_{(i)}}{\sum_{i:\, t_{(i)} > \theta} q_{(i)}}, \tag{3}$$
where $q_{(i)} = 0$ if $d_{(i)} = 0$.
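As a minimal illustration of (3), assuming a KM step curve given as $(t, \hat S(t))$ points (the toy curve below is hard-coded, not real data), the life expectancy beyond $\theta$ is the mean of the truncated jump-mass distribution:

```python
# Sketch of formula (3): e_theta = sum (t_(i) - theta) q_(i) / sum q_(i),
# where q_(i) are the downward jumps of the KM step function beyond theta.

def expected_future_lifetime(steps, theta):
    """Life expectancy beyond theta from KM step points (t, S(t))."""
    num = den = 0.0
    for (t0, s0), (t1, s1) in zip(steps, steps[1:]):
        q = s0 - s1                      # jump mass q_(i) at time t1
        if t1 > theta and q > 0:         # censorings have q = 0 and are skipped
            num += (t1 - theta) * q
            den += q
    return num / den

# Toy KM curve (complete case: S reaches 0 at the last DoD time)
steps = [(0, 1.0), (2, 0.8), (4, 0.8), (5, 8/15), (7, 4/15), (9, 0.0)]
print(expected_future_lifetime(steps, 4.0))  # life expectancy beyond theta = 4
```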

2.3. The Incomplete Case

If the condition $d_{(n)} = 1$ is not fulfilled, we are in an incomplete (death-observations) case: one has $\hat S(t_{\max}) > 0$, meaning that the data are not sufficient to estimate the entire survival distribution; the expected future lifetime $e_\theta$ then cannot be derived without some ad hoc choices or suitable additional assumptions.
Let us denote by $t^{(D)}_{\max}$ the last observed time point of a DoD event (i.e., $t^{(D)}_{\max} = \max\{t_i : d_i = 1\}$). If $t_{\max} > t^{(D)}_{\max}$, the KM estimate only provides the final survival probability $Q_{\mathrm{fin}} = \hat S(t^{(D)}_{\max}) > 0$. We then choose to complete the distribution by setting $\hat S(t_{\max}) = 0$, which is equivalent to placing the entire probability mass $Q_{\mathrm{fin}}$ on the last time point $t_{\max}$. In terms of the data, this is also equivalent to changing the status indicator $d_{(n)}$ to 1. The effect of this choice depends on the actual meaning we attribute to the random variable $T_0$. If $T_0$ represents the entire future lifetime of the patients since they entered the study, then setting $\hat S(t_{\max}) = 0$ provides an underestimation of the life expectancy, since we have $\theta + e_\theta \le t_{\max}$ while we know that at least one patient was alive at the end of the study. In many cases, however, it is convenient to assume that the variable of interest is the patient’s lifetime in the study. Formally, we would consider the random variable $T'_0 = \min\{T_0, t_{\max}\}$, where $t_{\max}$ is the duration of the study. The completed survival function refers to this random variable and no underestimation is produced in this case. This issue is strictly related to the special nature of the final time point $t_{\max}$ in this kind of survival problem. For example, self-consistency, an important property of the KM estimator, only holds if $\hat S(t_{\max}) = 0$.
Remark 3.
This was pointed out by Efron [14], p. 843, where it is observed that the iterative construction underlying the KM estimator “sheds some light on the special nature of the largest observation, which the self-consistent estimator always treats as uncensored, irrespective of” $d_{(n)}$.

2.4. Including Covid-Death Events in the Data

Assume that, in addition to the $n$ patients who left the study by a DoD or a Cen event, $m$ patients were also present in the oncological trial for whom death of COVID-19 (DoC) was observed at the time points $\theta_j$, $j = 1, \ldots, m$. The corresponding observed data set can be represented as follows:
$$x = z \cup z^\theta = \big\{ z_i = (t_i, d_i),\ i = 1, \ldots, n \big\} \cup \big\{ (\theta_j, \cdot),\ j = 1, \ldots, m \big\}, \tag{4}$$
where the status indicator of each DoC event is missing. It is clearly inappropriate to set these indicators equal to 1, but it is also not appropriate to set them equal to 0, since the DoC event provides an informative censoring, given that we know this event carries prognostic information about the survival experience of the oncological patients. More precisely, we know that there is a positive correlation between DoD and DoC events. However, ignoring DoC data would cause an unpleasant loss of information, and we would like to adjust these data in some way so that they can be included in the study. Formally, we are interested in replacing each of the observed $\theta_j$ by a different appropriate time point $\tau_j > \theta_j$, a virtual lifetime conditional on $\theta_j$, possibly with an appropriate value of the corresponding status indicator, which we will denote by $\delta_j$. We are confident that this replacement of the DoC time points can be properly completed just because we assume that, due to the dependence between DoC and DoD events, the “standard” data $z$ contain information on the COVID-19 data (and vice versa). The determination of the status indicators $\delta_j$ is more challenging. However, with the appropriate adjustment we can consider the whole data set:
$$w = z \cup z' = \big\{ z_i = (t_i, d_i),\ i = 1, \ldots, n \big\} \cup \big\{ z'_j = (\tau_j, \delta_j),\ j = 1, \ldots, m \big\}, \tag{5}$$
and we can safely apply the KM estimator to these data, thus also using the information contribution carried by COVID-19 deaths. In the following section, we will propose an iterative procedure to suitably realize this adjustment.

3. The EM Mean-Imputation Procedure

3.1. The CoDMI Algorithm

Obviously, the input data to the algorithm are given by the observation set $x$ in (4). We will assume, however, that all patients who died of COVID-19 would have died of the disease if COVID-19 had not intervened, thus setting $\delta_j \equiv 1$, i.e., assuming that all the virtual lifetimes $\tau_j$ would have been terminated by a DoD event. We will see in Section 3.4 how one can try to get around this limitation in this counterfactual problem. Under the assumption $\delta_j \equiv 1$, the basic idea of our COVID-19 adjustment is to estimate the virtual lifetimes $\tau_j$ as the expectation $E(T_{\theta_j})$, provided by the KM estimator itself. This is realized by a procedure consisting of the following steps.
  • Initialization step. One starts by setting $(\tau_j, \delta_j) = (\hat\tau^{(0)}_j, 1)$ for $j = 1, 2, \ldots, m$, where the $\hat\tau^{(0)}_j$ are arbitrarily chosen initial values. One then obtains an artificial complete data set $\hat w^{(0)}$, as defined in (5). Examples of initialization are $\hat\tau^{(0)}_j \equiv \theta_j$ or $\hat\tau^{(0)}_j \equiv \theta_j + \hat e_{\theta_j}(z)$, where $\hat e_{\theta_j}(z)$ is the life expectancy computed by applying the KM estimator to the standard data $z$.
  • Estimation step. The KM estimator is applied to $\hat w^{(0)}$ to produce the survival function estimate $\hat S^{(0)}(t)$. In the case of incomplete death-observations, the distribution is completed by setting $\hat S^{(0)}(t_{\max}) = 0$.
  • Expectation step. Using $\hat w^{(0)}$, the $m$ future life expectancies $\hat e^{(0)}_{\theta_j}$ are computed as in (3). The corresponding time points $\hat\tau^{(0)}_j$ are then replaced by $\hat\tau^{(1)}_j = \theta_j + \hat e^{(0)}_{\theta_j}$. One then obtains the new artificial complete data set:
    $$\hat w^{(1)} = \big\{ (t_i, d_i),\ i = 1, \ldots, n \big\} \cup \big\{ (\hat\tau^{(1)}_j, 1),\ j = 1, \ldots, m \big\}.$$
  • The estimation and the expectation steps are repeated, producing at the $k$-th stage a new complete data set $\hat w^{(k)}$, provided by the expectations $\{\hat e^{(k)}_{\theta_j},\ j = 1, \ldots, m\}$. The iterations stop when a specified convergence criterion is fulfilled. A natural criterion is:
    $$\max_{1 \le j \le m} \left| \hat e^{(k+1)}_{\theta_j} - \hat e^{(k)}_{\theta_j} \right| < \varepsilon, \tag{6}$$
    for a suitably specified tolerance level $\varepsilon > 0$ (this choice will be left as an option for the user). If condition (6) is not satisfied after a fixed maximum number of iterations (which will also be chosen as a user option), the convergence is considered failed.
If the convergence criterion is met, the final values of the $m$ life expectancies provide estimates which we will denote by $\hat e_{\theta_j}$. The corresponding estimated lifetimes are $\hat\tau_j = \theta_j + \hat e_{\theta_j}$ and the estimated whole data set is:
$$\hat w = \big\{ z_i = (t_i, d_i),\ i = 1, \ldots, n \big\} \cup \big\{ \hat z_j = (\hat\tau_j, 1),\ j = 1, \ldots, m \big\}. \tag{7}$$
This iterative procedure can be seen as belonging to the class of the well-known Expectation-Maximization (EM) algorithms, since the estimation step can be interpreted as a maximization, given that the KM approach provides a maximum likelihood estimator. In this class of algorithms the expectation step is often referred to as mean-imputation; hence we will call our iterative procedure the Covid-Death Mean-Imputation (CoDMI) algorithm.
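To fix ideas, the steps above can be sketched end-to-end. The following Python code is an illustrative reimplementation under the assumption $\delta_j \equiv 1$, not the authors’ R release: it initializes with $\hat\tau^{(0)}_j = \theta_j$, alternates the estimation and expectation steps, and stops under criterion (6):

```python
# Illustrative sketch of the CoDMI iteration (assumption delta_j = 1 throughout).

def km_steps(obs):
    """KM survival curve as (t, S(t)) step points; completes with S(t_max) = 0."""
    data, n = sorted(obs), len(obs)
    s, steps = 1.0, [(0.0, 1.0)]
    for i, (t, d) in enumerate(data):
        if d == 1:
            s *= 1.0 - 1.0 / (n - i)     # factor 1 - d_(i)/R(t_(i))
        steps.append((t, s))
    if steps[-1][1] > 0:                 # incomplete case: put mass Q_fin at t_max
        steps[-1] = (steps[-1][0], 0.0)
    return steps

def life_expectancy(steps, theta):
    """Expectation (3) over the truncated KM distribution beyond theta."""
    num = den = 0.0
    for (t0, s0), (t1, s1) in zip(steps, steps[1:]):
        q = s0 - s1
        if t1 > theta and q > 0:
            num, den = num + (t1 - theta) * q, den + q
    return num / den if den > 0 else 0.0

def codmi(z, thetas, eps=1e-6, max_iter=200):
    """Mean-imputed lifetimes tau_j = theta_j + e_theta_j for Covid-death times."""
    taus = list(thetas)                  # initialization: tau_j^(0) = theta_j
    for _ in range(max_iter):
        w = z + [(tau, 1) for tau in taus]                        # completed data
        steps = km_steps(w)                                       # estimation step
        new = [th + life_expectancy(steps, th) for th in thetas]  # expectation step
        if max(abs(a - b) for a, b in zip(new, taus)) < eps:      # criterion (6)
            return new
        taus = new
    raise RuntimeError("CoDMI did not converge")

# Standard data z plus one Covid death observed at theta = 3
print(codmi([(2, 1), (4, 0), (5, 1), (7, 1), (9, 1)], [3]))
```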
Remark 4.
(i) Usually EM algorithms, and the concept of imputation, refer to procedures aimed at filling in missing data. What we are dealing with here is data observed to a limited extent, rather than completely missing. Therefore, in this application the imputation corresponds rather to a replacement (of the observed time points $\theta_j$ by the estimated time points $\tau_j$). Our method is, however, in the spirit of the fake-data principle, as illustrated by Efron and Hastie [15], pp. 148–149.
(ii) It should be noted that the idea of estimating the virtual lifetimes $\tau_j$ as the expectation $E(T_{\theta_j})$ implies a further, more subtle assumption. Let DoC_j be the event “Patient $j$ died of COVID-19 at time $\theta_j$” and RoC_j the event “Patient $j$ became ill with COVID-19 but recovered at time $\theta_j$”. Using the notation introduced by Pearl in causal analysis (e.g., [16]), we are assuming for this patient that:
$$E\big[T_{\theta_j} \,\big|\, do(\mathrm{DoC}_j = 0)\big] = E\big[T_{\theta_j} \,\big|\, \mathrm{DoC}_j = 0\big],$$
where $do(A)$ is the intervention operator on event $A$. This means that we are assuming that the event RoC_j, which is not excluded by DoC_j = 0, does not change the probability distribution of $T_{\theta_j}$. This is clearly a simplifying assumption that makes our counterfactual problem easy to solve. In a more rigorous analysis, the effect of events such as RoC_j should also be taken into account [1]. We refrain from doing this here, since such an analysis would take us out of the KM survival framework.

3.2. The Convergence Issue

In general, CoDMI is not guaranteed to converge. If we make the classical binomial assumptions, we can derive the KM likelihood as a function of the hazard rates $h_i$. Running the algorithm, we find it possible that different parameter sets, and hence different sets of $\hat e_{\theta_j}$ estimates, correspond to the same likelihood value. This indicates an issue of parameter identifiability. However, the classical KM likelihood is defined for fixed time points, while the estimates $\hat e_{\theta_j}$ change at each step of our algorithm. Thus, the identifiability problem should be more properly studied with reference to a likelihood function which includes the event times among the parameters as well.
Remark 5.
A similar problem of iterated estimates for the KM product-limit estimator, but with fixed time points hence without parameter identifiability issues, was studied by Efron [14]. He proved in this case that, provided that the probability distribution is complete, the solution of the convergence problem exists and is unique. The previously mentioned self-consistency refers precisely to this property.
However, in order to manage the convergence problem, also on the basis of the results of the simulation exercise presented in Section 5, it is worth considering the following three types of situation.
(1)
Finite time convergence. The difference between two successive estimates becomes zero after a finite number of iterations.
(2)
Asymptotic convergence. The difference between two successive estimates tends to zero asymptotically.
(3)
Cyclicity. After a certain number of iterations, cycles of the estimated values are established which tend to repeat themselves indefinitely, so that the minimum difference between two successive estimates remains greater than zero. In this case, if this minimal difference is less than the tolerance $\varepsilon$, the corresponding estimate can be accepted (this is actually what the term “tolerance” refers to). It often happens that small changes in some of the $\theta_j$ values are sufficient to escape cyclicity. Therefore, some fudging of these data could be used to obtain acceptable solutions when the minimum improvement is out of tolerance.
As shown in the simulation study in Section 5, cases of non-convergence are not very frequent, and many of them can be circumvented by relaxing the convergence criterion (6) and fudging the COVID-19 data a little, if necessary. In general, the results are found to be sensitive to the initial values $\hat\tau^{(0)}_j$. In cases of convergence this is not a problem, since different solutions within the chosen tolerance criterion are equivalent from a practical point of view. In some cases of non-convergence, on the other hand, it is possible to switch to a convergent run by changing the initial values.

3.3. Assumptions Underlying CoDMI

The iterative procedure described in Section 3.1 can probably be justified by intuitive reasoning. However, also in order to give internal consistency to the simulation procedure presented in Section 5, it is convenient to specify more precisely the assumptions underlying the CoDMI algorithm. A preliminary remark is in order. In our framework, the “true” probability distribution of the random variable $T_0$ is the best-fitting distribution in the KM sense, i.e., the distribution identified by applying the maximum likelihood product-limit estimator to the existing data. Without appropriate additional assumptions (e.g., specifying an analytic form of the hazard function) this distribution is completely non-parametric and there is no other way to identify it than by specifying the data as well as the estimator used (the product-limit estimator, in fact). One could say that the data provide information to the estimator, and the estimator provides probabilistic structure to the data. Having made this remark, the basic assumption underlying the CoDMI algorithm can be outlined as follows. When COVID-19 deaths are present in the study sample, there is an extended underlying data structure composed of the $n$ observed lifetimes $t_i$ (ending with a DoD or a Cen) and of the $m$ partially observed lifetimes $\tau_j$ (virtually ending, if we assume $\delta_j = 1$, with a DoD). The corresponding probability distribution is the best-fitting distribution specified by these extended data, i.e., by applying the KM estimator to the data set $w = z \cup z'$. We will keep this property in mind when we generate the simulated scenarios on which to measure the algorithm’s predictive performance.

3.4. Adjusting for the Assumption δ j 1

Relaxing the assumption that patients eliminated by a DoC event would have died of the disease without this event is not an easy task. The prediction of the status indicators $\delta_j$ increases the forecasting problem by one dimension and requires a reliable predictive model, which is currently not available to us. We therefore content ourselves with proposing an adjustment for censoring of the response of the CoDMI algorithm, which should mitigate the possible bias produced by the assumption $\delta_j \equiv 1$. If the algorithm met the convergence criterion, the final data set is given by (7). We then consider the modified data set:
$$\hat w^{(R)} = \big\{ (t_i, 1 - d_i),\ i = 1, \ldots, n \big\} \cup \big\{ (\hat\tau_j, 0),\ j = 1, \ldots, m \big\}, \tag{8}$$
where both the observed and the estimated virtual lifetimes are kept the same, while all the status indicators are reversed. Running the KM estimator on the set $\hat w^{(R)}$, one obtains the so-called reverse Kaplan–Meier survival curve $\hat S^{(R)}(t)$, which refers to Cen instead of DoD endpoints and provides the new conditional expectations $\hat\tau^{(R)}_j$, given $\theta$, of the virtual lifetimes. We then choose to derive the adjusted estimates $\hat\tau^*_j$, for $j = 1, \ldots, m$, as:
$$\hat\tau^*_j = \begin{cases} \hat\tau_j & \text{if } \alpha(\theta_j) \ge 0.5, \\ \hat\tau^{(R)}_j & \text{if } \alpha(\theta_j) < 0.5, \end{cases} \tag{9}$$
where $\alpha(t)$ is the probability that an event observed at time $t$ is a DoD (as opposed to a Cen). In order to estimate these non-censoring probabilities, the standard observations $\{z_i = (t_i, d_i)\}$ are represented on a time grid spanning the interval $[0, t_{\max}]$ with cells $l = 1, 2, \ldots, G$, and a parametric hazard rate function $\hat h_l$ is fitted on this grid. The same procedure is then applied to the “reverse observations” $\{z^{(R)}_i = (t_i, 1 - d_i)\}$ and the corresponding hazard rate function $\hat h^{(R)}_l$ is derived. The probability estimates are then computed as $\hat\alpha(t) = \hat h_{l(t)} / \big[\hat h_{l(t)} + \hat h^{(R)}_{l(t)}\big]$, where $l(t)$ is the cell containing the time point $t$. Examples of estimated $\hat\alpha(t)$ functions are provided in the next section.
The above procedure is fairly ad hoc and the indications it provides do not necessarily have to be accepted. It may be the case that the user of the procedure has a personal opinion, based on external information, on the value of (some of) the virtual status indicators $\delta_j$. In this situation the coefficients $\alpha(\theta_j)$ in (9) could be assigned or modified by the user on the basis of this expert judgment.
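The adjustment can be sketched as follows; note that this Python illustration replaces the parametric hazard fit described above with a crude occurrence/exposure (piecewise-constant) estimate on the grid, so it is only a stand-in for the actual procedure:

```python
import math

# Sketch of the non-censoring probability alpha(t) of Section 3.4.
# Cell l of the grid covers the interval (l*w, (l+1)*w], with w = t_max/G.

def grid_hazard(obs, t_max, G):
    """Crude events/exposure hazard per grid cell (not a parametric fit)."""
    w = t_max / G
    events, exposure = [0.0] * G, [0.0] * G
    for t, d in obs:
        for l in range(G):                     # time at risk spent in each cell
            exposure[l] += max(0.0, min(t, (l + 1) * w) - l * w)
        if d == 1 and 0 < t <= t_max:          # event counted in its own cell
            events[min(G - 1, max(0, math.ceil(t / w) - 1))] += 1
    return [e / x if x > 0 else 0.0 for e, x in zip(events, exposure)]

def alpha(t, obs, t_max, G=10):
    """Estimated probability that an event at time t is a DoD rather than a Cen."""
    h = grid_hazard(obs, t_max, G)                                   # direct data
    h_rev = grid_hazard([(ti, 1 - di) for ti, di in obs], t_max, G)  # reverse data
    l = min(G - 1, max(0, math.ceil(t / (t_max / G)) - 1))           # cell l(t)
    total = h[l] + h_rev[l]
    return h[l] / total if total > 0 else 0.5
```

With only DoD events in a cell, $\hat\alpha$ is 1 there; with equally many DoD and Cen events it is 0.5, so rule (9) keeps the direct estimate $\hat\tau_j$.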

3.5. An Extended Greenwood’s Formula

The virtual lifetime expectations $\hat\tau_j$ provided by CoDMI and included in the mean-imputed data $\hat w$ are point estimates which allow these data to be used with any statistical tool available for survival analysis. However, replacing an observed value with a point estimate, even an unbiased one, increases the variance of the survival estimates, since the mean-imputed data convey their own estimation error. Usually the standard deviation of the KM survival function estimate is computed using Greenwood’s formula. On the standard data, using the same notation as in (1), this can be written as:
$$
\mathrm{s.d.}\big[\hat S(t)\big] = \hat S(t)\left[\sum_{i:\,t_{(i)}\le t}\frac{h_{(i)}}{1-h_{(i)}}\,\frac{1}{R(t_{(i)})}\right]^{1/2},
\qquad h_{(i)} = \frac{d_{(i)}}{R(t_{(i)})},
$$
where the summand is set to 0 if h ( i ) = 1 . We provide an extension of this formula in order to include the variance component due to the estimated time points τ ^ j .
We start from the CoDMI output, possibly with the adjustment for censoring:
$$
\hat w = \big\{(t_i, d_i),\ i=1,\dots,n\big\} \cup \big\{(\hat\tau_j, \delta_j),\ j=1,\dots,m\big\},
$$
where the τ ^ j are derived by (9) and the indicators δ j can be equal to 0 or 1. We represent the w ^ data set in the alternative form:
$$
\hat y = \big\{(t_i, d_i, \delta_i),\ i=1,\dots,n+m\big\},
$$
where:
  • t i = t i or τ ^ j are the observed or estimated survival times ordered by increasing value (the usual conventions on tied values apply);
  • d i = 0 if t i corresponds to a Cen and 1 otherwise;
  • δ i = 1 if t i corresponds to a DoC and 0 otherwise.
Since the time points t i are assumed to be ordered, we simplify the exposition in this section by using the subscript i instead of ( i ) (and R i instead of R ( t ( i ) ) ). We then consider both the “direct” probability distribution { q i , i = 1 , , n + m } and the reverse one { q i ( R ) , i = 1 , , n + m } , both taken from the CoDMI output, and from these we derive the m direct and the m reverse truncated distributions:
$$
q_{i,j} = \frac{q_i\,\mathbf{1}_{\{t_i>\theta_j\}}}{\sum_{k:\,t_k>\theta_j} q_k},
\qquad
q_{i,j}^{(R)} = \frac{q_i^{(R)}\,\mathbf{1}_{\{t_i>\theta_j\}}}{\sum_{k:\,t_k>\theta_j} q_k^{(R)}},
\qquad i=1,\dots,n+m,\quad j=1,\dots,m.
\tag{10}
$$
These distributions are also defined, with null values, for t_i ≤ θ_j. Finally, we compute the total probabilities:
$$
Q_i = \sum_{j=1}^{m} q_{i,j}^{*},
\qquad
q_{i,j}^{*} = \delta_j\, q_{i,j} + (1-\delta_j)\, q_{i,j}^{(R)},
\qquad i=1,\dots,n+m,
\tag{11}
$$
and define $Q_i^{(2)} = \sum_{j=1}^{m}\big(q_{i,j}^{*}\big)^2$, for $i=1,\dots,n+m$. Observe that $\sum_{i=1}^{n+m} Q_i = m$.
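The construction of the truncated distributions (10) and of the total probabilities (11) can be sketched as follows. This is an illustrative reading with our own function name and list-based data layout, not the paper's R implementation:

```python
def total_probabilities(t, q, q_rev, theta, delta):
    # t: ordered time points t_i; q, q_rev: direct and reverse KM death
    # probabilities q_i and q_i^(R); theta, delta: DoC times and status flags.
    # Returns Q_i = sum_j q*_{i,j} and Q_i^(2) = sum_j (q*_{i,j})^2.
    n_pts = len(t)
    Q = [0.0] * n_pts
    Q2 = [0.0] * n_pts
    for th, dl in zip(theta, delta):
        src = q if dl == 1 else q_rev          # direct vs. reverse distribution
        mass = sum(p for ti, p in zip(t, src) if ti > th)
        for i in range(n_pts):
            p = src[i] / mass if (t[i] > th and mass > 0) else 0.0
            Q[i] += p                          # truncated mass q*_{i,j}
            Q2[i] += p * p
    return Q, Q2
```

Since each COVID-19 death contributes a total mass of one, the Q_i sum to m over all time points.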
With these definitions, we propose the following correction of Greenwood’s formula:
$$
\mathrm{s.d.}\big[\hat S(t)\big] = \hat S(t)\left[\sum_{i:\,t_i\le t}\left(\frac{\bar h_i}{1-\bar h_i}\,\frac{1}{\bar R_i} + \frac{1}{(1-\bar h_i)^2}\,\frac{\bar R_i-1}{\bar R_i}\,\frac{Q_i - Q_i^{(2)}}{\bar R_i^{\,2}}\right)\right]^{1/2},
\tag{12}
$$
where the hazard rates h ¯ i are specified as:
$$
\bar h_i = \frac{d_i\,\nu_i}{\bar R_i},
\qquad \nu_i = (1-\delta_i) + Q_i,
\qquad i=1,2,\dots,n+m,
$$
and the number of subjects at risk is computed as:
$$
\bar R_i =
\begin{cases}
n+m, & i=1,\\[2pt]
\bar R_{i-1} - \big[1 + (\nu_i - 1)\,d_i\big], & i=2,3,\dots,n+m.
\end{cases}
$$
The basic idea underlying this formula is that the m COVID-19 deaths are distributed as “fractional deaths” Q_i = Σ_j q_{i,j} over all the uncensored time points (both DoD and DoC), and the hazard rate at time t_i has a random component with mean Q_i/R̄_i and variance (Q_i − Q_i^(2))/R̄_i². The details of the derivation of Formula (12) are provided in Appendix A. Using (12), the approximate 95% confidence intervals can be computed by:
$$
\log \hat S(t) \pm 1.96\,\frac{\mathrm{s.d.}\big[\hat S(t)\big]}{\hat S(t)}.
$$
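A compact sketch of Formula (12), together with the recursions for h̄_i and R̄_i, may look as follows. This is illustrative only: the function name and input layout are our own assumptions, and the time points are assumed to be already ordered.

```python
import math

def extended_greenwood(d, delta, Q, Q2):
    # d[i], delta[i]: death and DoC indicators at the ordered time points;
    # Q[i], Q2[i]: total fractional-death probabilities and their squared sums.
    # Returns (survival estimate, s.d.) at each time point, per Formula (12).
    out = []
    S, var_sum = 1.0, 0.0
    R = float(len(d))                     # R-bar_1 = n + m subjects at risk
    for i in range(len(d)):
        nu = (1 - delta[i]) + Q[i]
        h = d[i] * nu / R                 # h-bar_i = d_i * nu_i / R-bar_i
        S *= 1.0 - h
        if h < 1.0:
            var_sum += ((h / (1.0 - h)) / R
                        + (1.0 / (1.0 - h) ** 2)
                        * ((R - 1.0) / R) * (Q[i] - Q2[i]) / R ** 2)
        out.append((S, S * math.sqrt(var_sum)))
        R -= 1.0 + (nu - 1.0) * d[i]      # update of the number at risk
    return out
```

When all the Q_i and δ_i are zero, the additional term vanishes and the expression reduces to the classical Greenwood formula.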

4. Examples of Application to Real Survival Data

4.1. Application to COVID-19 Extended NCOG Data

To illustrate the effects of our mean-imputation adjustments, we start from some real survival data well referenced in the literature and apply the CoDMI algorithm to these data after the addition of some artificial COVID-19 deaths. We do so because, currently, sufficiently rich real data sets containing both cancer-death and COVID-19-death events are hardly available. To this aim, we chose, as the real reference data, the head/neck cancer data of the NCOG (Northern California Oncology Group) study, which was used to illustrate the KM approach in the book by Efron and Hastie, Section 9.2 [15]. We considered data from the two arms, A and B, separately.

4.1.1. Arm A of NCOG Data

Survival times (in days) from Arm A in the first panel of Table 9.2 [15] are reported in Table 1.
To save space, the data are presented, as in the book, in compact form, with the + sign representing censoring. The conversion of these data into the two-component form z = {(t_i, d_i), i = 1, 2, …, n} is immediate. There are n = 51 patients, with 43 DoD events and 8 Cen events. The final time point is 1417 days after the beginning of the study, and a DoD is observed on that date. Therefore, we are in a complete death-observations case, with t_max = t_max^(D) = 1417. The corresponding KM estimate of the survival function Ŝ(t) is illustrated by the black line in Figure 1.
To illustrate the application of the CoDMI algorithm, we add to these data an artificial group of m COVID-19 death observations, i.e., m DoC events assumed to be observed at the time points θ = {θ_j, j = 1, 2, …, m}. We chose m = 5 (roughly 10% of n) DoC events, at 5 time points roughly equally spaced in (0, t_max):
θ = { 250 , 500 , 750 , 1000 , 1250 } .
Once the observation set x = z ∪ θ has been specified, we have to choose the virtual lifetimes τ̂_j^(0) in the data set ŵ^(0) which is used to initialize the CoDMI algorithm. If, for example, we choose the option of setting τ̂_j^(0) ≡ θ_j, then we have ŵ^(0) = z ∪ ẑ^(0), with:
z ^ ( 0 ) = { ( 250 , 1 ) , ( 500 , 1 ) , ( 750 , 1 ) , ( 1000 , 1 ) , ( 1250 , 1 ) } .
We run the CoDMI algorithm with this initialization and ε = 0.1. The procedure converged after 10 iterations, providing the following estimates for the lifetimes {τ̂_j, j = 1, …, 5}:
{ 894.32 , 1118.85 , 1253.58 , 1286.24 , 1354.00 } .
The corresponding COVID-19 data:
z ^ = { ( 894.32 , 1 ) , ( 1118.85 , 1 ) ( 1253.58 , 1 ) , ( 1286.24 , 1 ) , ( 1354.00 , 1 ) } ,
are then used as mean-imputed data to obtain the final complete data set ŵ in (7). As one can observe, the expectation Formula (3) provides non-integer values; this is not a problem, since the survival function provided by the KM estimator is defined on the real axis.
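A single mean-imputation step of this kind can be sketched in a few lines. This is a simplified stand-in for one expectation step, not the authors' R code; `km_death_probs` and `cond_expect` are hypothetical helper names, and any residual survival mass is allocated to the largest time point, as described above for complete death observations.

```python
def km_death_probs(times, deaths):
    # Kaplan-Meier death probability masses q_i at the observed time points;
    # any residual mass is placed on the largest time point.
    data = sorted(zip(times, deaths))
    S, at_risk, probs = 1.0, len(data), []
    for t, d in data:
        jump = S * d / at_risk            # KM jump at an uncensored point
        probs.append((t, jump))
        S -= jump
        at_risk -= 1
    if S > 0:
        t_last, p_last = probs[-1]
        probs[-1] = (t_last, p_last + S)
    return probs

def cond_expect(probs, theta):
    # E(T | T > theta): the mean-imputed virtual lifetime for a DoC at theta.
    tail = [(t, p) for t, p in probs if t > theta]
    mass = sum(p for _, p in tail)
    return sum(t * p for t, p in tail) / mass
```

For instance, on four uncensored times {1, 2, 3, 4} the KM masses are uniform and E(T | T > 2.5) = 3.5, a non-integer value; CoDMI then iterates this step after re-running the estimator on the completed data.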
Remark 6.
A tolerance of 0.1 already provides overabundant precision for our applications. However, in order to stress the algorithm, we also tried ε = 10^−8 and ε = 10^−18, obtaining convergence after 33 and 51 iterations, respectively. This seems to be a case of asymptotic convergence.
The survival curve provided by the KM estimator applied to the completed data ŵ (“DoC Imputed”) is illustrated in blue in Figure 1, where it can be compared with the original survival estimate based on the z data (“Without DoC”, black). For further comparison, we also present the KM survival curves estimated by the two naïve strategies, which classify all DoC events as Cen, i.e., τ̂_j ≡ θ_j and δ_j ≡ 0 (“DoC as Cen”, green), or all DoC events as DoD, i.e., τ̂_j ≡ θ_j and δ_j ≡ 1 (“DoC as DoD”, red). In the figure, the “critical” time points are reported by indicating the Cen points by ticks and the 5 θ_j points by red triangles on the black curve, while the 5 τ̂_j points are indicated by circles on the blue line (where, obviously, each circle corresponds to a jump).
We finally illustrate the application of the adjustment for censoring presented in Section 3.4. After deriving from w ^ the modified data set w ^ ( R ) in (8), we apply the KM estimator to these data, obtaining the following alternative lifetimes { τ ^ j ( R ) , j = 1 , , 5 } :
{ 1207.49 , 1296.23 , 1347.78 , 1347.78 , 1398.13 } .
The left panel of Figure 2 illustrates the probability curve α(t) estimated as specified in Section 3.4. From this function, one obtains:
α ( θ 1 ) = 0.623 , α ( θ 2 ) = 0.781 , α ( θ 3 ) = 0.699 , α ( θ 4 ) = 0.402 , α ( θ 5 ) = 0.193 .
Therefore, the procedure suggests considering the last two time points as (potentially) censored, and then estimating them as in (17). The data set ẑ in (16) is then modified as:
z ^ = { ( 894.32 , 1 ) , ( 1118.85 , 1 ) ( 1253.58 , 1 ) , ( 1347.78 , 0 ) , ( 1398.13 , 0 ) } .
These suggestions, however, are purely indicative and can be rejected or changed based on expert opinion.
In Figure 3, the survival function estimated after the suggested adjustment for censoring is reported, together with the 95 % confidence limits computed with the traditional Greenwood’s formula (red dotted lines) and with the extended Formula (12) (blue dashed lines).

4.1.2. Arm B of NCOG Data

In Table 2, we report censored survival times (in days) from Arm B in the second panel of Table 9.2 [15].
Also in this case, for reasons of space, we refrain from presenting the data converted into z form. Data are heavily censored in this arm: there are n = 45 patients, with 14 Cen events, which are mainly distributed among the largest time points. Moreover, we are in a case of incomplete death observations, since the final time point t_max = 2297 is a Cen point. The last time point with an observed DoD event is t_max^(D) = 1776, and 4 Cen events are observed thereafter. The final level of the survival curve provided by the KM estimator is Ŝ(t_max^(D)) = 22.99%, and we choose to allocate this probability mass entirely to the final Cen point 2297. For the artificial data on COVID-19 deaths, also in this case we choose m roughly equal to 10% of n and assume equally spaced DoC events in the interval (0, 2297). That is, we assume m = 5, with θ_j values in the set:
θ = { 400 , 800 , 1200 , 1600 , 2000 } .
The last time point in θ is later than the last observed DoD time point (1776). As in the previous case, the initial data set ẑ is derived by setting δ_j ≡ 1, and the complete data set ŵ^(0) = z ∪ ẑ is used to initialize the CoDMI algorithm. The algorithm, run again with ε = 0.1, converged after 12 iterations (convergence was met after 49 iterations for ε = 10^−8 and 78 iterations for ε = 10^−18), providing the following estimates for the adjusted lifetimes {τ̂_j, j = 1, …, 5}:
{ 1654.63 , 1934.24 , 2004.07 , 2041.32 , 2148.59 } .
In Figure 4, we replicate the illustrations of Figure 1 on these data. As concerns the adjustment for censoring, from the estimated probability curve reported in the right panel of Figure 2 we obtain:
α ( θ 1 ) = 0.667 , α ( θ 2 ) = 0.371 , α ( θ 3 ) = 0.192 , α ( θ 4 ) = 0.074 , α ( θ 5 ) = 0.0002 .
Therefore, in this case, the procedure suggests considering the last four time points as censored. Using the criterion in (17), the final data set is obtained:
z ^ = { ( 1654.63 , 1 ) , ( 1922.76 , 0 ) , ( 1978.15 , 0 ) , ( 2084.32 , 0 ) , ( 2201.93 , 0 ) } .
Figure 5 is the Arm B analogue of Figure 3.

5. A Simulation Study

In order to test the ability of CoDMI to correctly estimate the expected life-shortening (or the corresponding virtual lifetime) due to DoC events in a study population, we generate many scenarios, each containing simulated data. These pseudo-data include a data set z̃ of standard observations and a data set τ̃^(0) of (preliminary) virtual lifetimes. By randomly censoring the time variables in τ̃^(0), a corresponding set θ̃ of DoC time points is derived. In order to equip these pseudo-data with a probabilistic structure consistent with the CoDMI assumptions, a KM best-fitting distribution is derived by applying the product-limit estimator to z̃ ∪ τ̃^(0). The “true” virtual lifetimes τ̃ are then derived by conditional sampling, given θ̃, from this distribution. Running the CoDMI algorithm on the pseudo-observations x̃ = z̃ ∪ θ̃, the estimated virtual lifetimes τ̂ are obtained, and the quality of the estimator is measured by computing the average, over all scenarios, of the prediction errors τ̃ − τ̂.

5.1. Details of the Simulation Process

The details of each scenario simulation are as follows:
1.
Simulation of standard survival data z̃. The simulated standard (i.e., non-COVID) survival data z̃ are generated in each scenario starting from the same set of real data z = {(t_i, d_i), i = 1, 2, …, n}, spanning the time interval [0, t_max]. The set z̃ is generated by drawing with replacement n_sim pairs (t̃_i, d̃_i) from the n real-life pairs (t_i, d_i), maintaining the proportion between DoD and Cen events in z. Let us denote by t̃_max^(D) the largest uncensored time point in z̃.
Remark. It should be noted that many tied values can be generated in this step, especially if n_sim ≫ n. Moreover, the final time point could turn out to be censored (a case of incomplete death observations) even if the death observations are complete in the original data. It is easy to guess that generating many scenarios in this way can produce a number of “extreme” pseudo-data z̃. This is useful, however, for testing the algorithm even in unrealistic situations. Most cases of failed convergence correspond to extreme situations.
2.
Simulation of DoC time points θ̃. In order to simulate a number m_sim of COVID-19 deaths, the time points τ̃_j^(0), j = 1, 2, …, m_sim, are generated by drawing with replacement from the t_i points in the real data z satisfying the conditions d_i = 1 and t_i ≤ t̃_max^(D). These time points are interpreted as temporary virtual lifetimes and are first used to generate the DoC time points θ̃_j. A number m_sim of independent drawings ũ_j from a uniform (0, 1) distribution are performed, and the corresponding DoC time points are obtained as θ̃_j = ũ_j · τ̃_j^(0). Therefore, for all j one has 0 < θ̃_j < τ̃_j^(0) ≤ t̃_max^(D), with θ̃_j taking equally probable values in (0, τ̃_j^(0)).
Remark. The use of a uniform distribution is obviously questionable, and more “informative” distributions could be suggested. For example, a beta distribution with first parameter greater than 1 and second parameter lower than 1 may be preferable, as it makes values of θ̃_j closer to τ̃_j^(0) more probable. However, the form of this distribution is irrelevant to our purposes: we are interested in observing how well CoDMI is able to capture the simulated virtual lifetimes, independently of how they are generated.
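Steps 1 and 2 above can be sketched as follows. This is an illustrative reading of the procedure, with our own function name and standard-library sampling, not the simulation code used in the paper:

```python
import random

def simulate_scenario(z, n_sim, m_sim, seed=0):
    # z: real data as (t_i, d_i) pairs. Step 1: bootstrap z~ keeping the
    # DoD/Cen proportion of z. Step 2: draw preliminary virtual lifetimes
    # tau0 from the uncensored times and censor them uniformly into DoC times.
    rng = random.Random(seed)
    deaths = [p for p in z if p[1] == 1]
    cens = [p for p in z if p[1] == 0]
    n_d = round(n_sim * len(deaths) / len(z))
    z_sim = ([rng.choice(deaths) for _ in range(n_d)]
             + [rng.choice(cens) for _ in range(n_sim - n_d)])
    t_max_d = max(t for t, d in z_sim if d == 1)  # largest uncensored point
    pool = [t for t, d in z if d == 1 and t <= t_max_d]
    tau0 = [rng.choice(pool) for _ in range(m_sim)]
    theta = [rng.random() * tau for tau in tau0]  # theta_j = u_j * tau_j^(0)
    return z_sim, tau0, theta
```

By construction, each θ̃_j falls in (0, τ̃_j^(0)), as required by step 2.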
3.
Simulation of virtual lifetimes τ̃_j. The temporary lifetimes τ̃_j^(0) (and the data set z̃) cannot be directly used to test the CoDMI algorithm, since their probabilistic structure is indeterminate and, in any case, we have too few (pseudo-)observations. In order to introduce a probabilistic structure consistent with the CoDMI assumptions, we first run the KM estimator on the data set w̃^(0) = z̃ ∪ {(τ̃_j^(0), 1)}, thus obtaining the corresponding death probability distribution {q̃_i^(0), i = 1, 2, …, n_sim + m_sim}. The virtual lifetimes τ̃_j^(1), j = 1, 2, …, m_sim, are then obtained by computing the conditional expectations E(T | θ̃_j) under this distribution. However, this is not yet fully consistent with the CoDMI assumptions, since, as discussed in Section 3.3, the appropriate distribution is the KM best-fitting distribution specified on the extended data, i.e., data including the virtual lifetimes themselves. To obtain this result, we should repeat the previous step, i.e., run the product-limit estimator on the new data set w̃^(1) = z̃ ∪ {(τ̃_j^(1), 1)}, thus producing the new distribution {q̃_i^(1), i = 1, 2, …, n_sim + m_sim}, and then derive m_sim new time points τ̃_j^(2) by taking the conditional expectations under this distribution. In principle, this step should be iterated, similarly to what is done in the CoDMI algorithm. To avoid convergence problems, however, we prefer to limit the number of iterations to a fixed (low) value n_iter, thereby implicitly accepting a certain level of bias in the estimations. After these n_iter iterations have been performed, the final data set w̃^(n_iter) = z̃ ∪ {(τ̃_j^(n_iter), 1)} is obtained.
Running the KM estimator on these data again, the final distribution {q̃_i^(n_iter), i = 1, 2, …, n_sim + m_sim} is obtained, and the definitive time points τ̃_j, with the corresponding ẽ_j = τ̃_j − θ̃_j, are computed by conditional sampling, given θ̃_j, i.e., by simulating from the (normalized) truncated distribution {q̃_i^(n_iter), i : t_i > θ̃_j}. These sampled values are taken as the true values of the virtual lifetimes and life expectancies, respectively, which are to be estimated by CoDMI using only the information z̃ ∪ θ̃.
4.
Application of CoDMI and naïve estimators. CoDMI algorithm is applied to the simulated data:
$$
\tilde w = \big\{\tilde z_i = (\tilde t_i, \tilde d_i),\ i=1,\dots,n_{\mathrm{sim}}\big\} \cup \big\{\tilde z_j = (\tilde\theta_j, 1),\ j=1,\dots,m_{\mathrm{sim}}\big\},
$$
with z ˜ i obtained in step 1 and θ ˜ j in step 2. Provided that the algorithm converges, we obtain the m sim estimated virtual lifetimes τ ^ j and the estimated life expectancy e ^ j .
To allow comparison, we also derive in this step the predictions of the two naïve “estimators”, which are obtained by applying the KM estimator to the simulated data w̃ modified by setting, for all j, τ̃_j = θ̃_j and δ_j = 1 (“DoC as DoD”) or δ_j = 0 (“DoC as Cen”).

5.2. Evaluation of the Predictive Performance

In the simulation exercise, a large number N of scenarios is generated. This provides, for j = 1, 2, …, m_sim and k = 1, 2, …, N, the N · m_sim CoDMI estimates ê_j^(k) (from step 4) and the N · m_sim true realizations ẽ_j^(k) (from step 3). Then we can compute the prediction errors:
$$
\Delta_j^{(k)} = \tilde e_j^{(k)} - \hat e_j^{(k)},
\qquad j=1,2,\dots,m_{\mathrm{sim}},\quad k=1,2,\dots,N,
$$
and the average errors:
$$
\bar\Delta_j = \frac{1}{N}\sum_{k=1}^{N}\Delta_j^{(k)},\quad j=1,2,\dots,m_{\mathrm{sim}},
\qquad
\bar\Delta = \frac{1}{m_{\mathrm{sim}}}\sum_{j=1}^{m_{\mathrm{sim}}}\bar\Delta_j.
$$
Positive (negative) values of Δ_j^(k) correspond to under-estimates (over-estimates) provided by CoDMI. As usual, we can associate with these average errors the corresponding standard error, i.e., the standard error of the mean (s.e.m.). Given the independence assumption, the central limit theorem guarantees, as usual, that the sample means are asymptotically normal. Therefore, the corresponding s.e.m. is inversely proportional to √N.
The same summary statistics are computed for the prediction errors relative to the two naïve estimators.
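The error summaries and their s.e.m. can be computed as in the following sketch (the function name is our own):

```python
import math

def error_summary(e_true, e_hat):
    # Mean prediction error over the N scenarios and its standard error
    # (s.e.m.), which shrinks like 1/sqrt(N) under the CLT argument above.
    errs = [a - b for a, b in zip(e_true, e_hat)]
    n = len(errs)
    mean = sum(errs) / n
    var = sum((e - mean) ** 2 for e in errs) / (n - 1)
    return mean, math.sqrt(var / n)
```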

5.3. Results from Simulation Exercises

Two separate simulation exercises were performed, one using Arm A and the other using Arm B as real-life data. In both exercises, N = 10,000 scenarios were generated, with n_sim = 100 standard observations (roughly double the real ones) and m_sim = 10 COVID-19 deaths. A tolerance ε = 1 was chosen for the CoDMI algorithm, with a maximum number of allowed iterations iter_max = 100. The number of iterations for generating the true values was n_iter = 10, and for all the initializations the option τ̂_j^(0) = θ_j + ê_θj(z) was chosen. Since in some scenarios CoDMI failed to converge (with the chosen values of ε and iter_max), the sample means and the corresponding s.e.m. were computed only on the N_c convergence cases.
In Table 3, which refers to the Arm A data, the simulation results are reported for each of the 10 DoC cases. We obtained N_c = 9802 convergence cases out of the 10,000 simulated. In each row, the sample means of the DoC time points θ̃_j, of the true life expectancy ẽ_j and of the CoDMI-estimated life expectancy ê_j are reported in columns 2–4. In columns 5–9, we provide summary statistics of the corresponding prediction errors: the mean error Δ̄_j = ẽ̄_j − ê̄_j, the related s.e.m., the relative mean error Δ̄_j/ẽ̄_j, and the minimum and maximum values of Δ̄_j.
The same results for 10,000 scenarios generated by Arm B data are reported in Table 4.
Table 5 provides the results of Table 3 and Table 4 aggregated over all the DoC events. These overall results are summarized in the block “DoC imputed”. In the blocks “DoC as DoD” and “DoC as Cen”, the average prediction errors are reported for the two corresponding naïve estimators. The main finding from the simulations is that the CoDMI estimates seem to be essentially unbiased, with a relative prediction error of around 0.5% for both the original data sets considered. Some more extensive (and time-consuming) tests, with N = 10^5 or N = 10^6, have shown a further reduction of the error (as well as, obviously, of the corresponding s.e.m.).
As a final exercise, we used a modified version of the simulation procedure to obtain an assessment of the goodness of the adjustment for censoring described in Section 3.4. In the modified simulation, all the m_sim true virtual lifetimes τ̃_j were generated assuming a censoring, instead of a DoC, as the endpoint. We then set δ_j ≡ 0 and, in step 3 of Section 5.1, we generated in all iterations the virtual lifetimes τ̃_j using the truncated reverse probability distribution, i.e., the distribution obtained by applying the Kaplan–Meier estimator to the reversed data w̃^(R) (see (8)). Consistently with this change in assumption, the estimated values τ̂_j in each simulation were obtained by applying the CoDMI algorithm with the final adjustment for censoring, setting all the probabilities α(θ_j) in (9) to 0. The overall results from these simulations are summarized in Table 6, which has the same structure as Table 5 and where the results without adjustment are also provided for comparison.
As we can see, the changed assumption on the status of the DoC endpoints produces a large increase in the true life expectancy ẽ, but the adjustment for censoring seems to capture this change quite well. Of course, in real life we do not know the true values of the δ_j, and we have to try to choose the suitable τ̂_j in (9) based on the α(θ_j) probabilities and/or using expert judgment.

6. Conclusions and Directions for Future Research

In the simulated scenarios where all the virtual endpoints of COVID-19 cases are assumed to be DoD, the results indicate that the CoDMI estimator is roughly unbiased and outperforms the alternative estimates obtained by the naïve approaches. In the opposite extreme situation, where all the virtual endpoints of COVID-19 cases are assumed to be censored, the final adjustment for censoring of CoDMI also guarantees unbiasedness, provided that the information on the status of the DoC events is assumed to be known. The non-convergence cases can often be circumvented by relaxing the convergence criterion and/or slightly perturbing the COVID-19 data. Furthermore, changing the initialization of the algorithm can be useful in some cases.
By a natural extension of the binomial assumptions underlying the KM estimator, a version of the classical Greenwood formula can be derived for computing the variance of CoDMI estimates. Equipped with this formula, the CoDMI algorithm is proposed as a complete statistical estimation tool.
As we pointed out in the Introduction, the CoDMI algorithm, compared with the cumulative incidence function method often used to study competing risks, is a pragmatic approach that allows all standard statistical tools to be applied directly to the “augmented” data. However, it remains important to compare the predictive performance of the two approaches. In our applications, where the competing events are DoD and DoC, we do not yet have sufficiently rich data to test the effectiveness (and possibly the necessity) of an approach based on the cumulative incidence functions, or even to test the possibility of using the two methods in conjunction. Therefore, this topic is left for future research.
Another interesting issue is the convergence of the CoDMI algorithm, which was discussed in Section 3.2. A natural way to approach this problem is to study the behavior of the log-likelihood function. However, as we have pointed out, we are not in a fixed-time-points situation, so it is not a trivial task to explicitly write the updated log-likelihood at each iteration step: the replacements in each step imply a re-ordering of the time points and, consequently, a change in the number of items at risk in each death probability estimate. This problem is also left for future work.

Author Contributions

F.D.F. and F.M. conceived the basic structure of the paper. F.M. designed the CoDMI algorithm and derived the extended Greenwood’s formula. L.M. realized the simulation study and implemented the CoDMI algorithm in R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Derivation of the Extended Greenwood’s Formula

We organize the data in a life table with K time intervals k = 1, 2, …, K, spanning the interval [0, t_{n+m}], each with length Δt = t_{n+m}/K. Let us denote by h_k and ν_k the hazard rate and the number of DoD events, respectively, in the interval k, and by n_k the number of subjects at risk at the end of the interval k − 1. In this setting, the survival function is defined as S_l = ∏_{k=1}^{l} (1 − h_k), l = 1, 2, …, K, and an estimate Ŝ_l of S_l is obtained by plugging in an estimate ĥ_k of h_k, k = 1, 2, …, l. We make the binomial assumption:
$$
n_k\,\hat h_k \sim \mathrm{Bin}\big(n_k,\, h_k^{*}\big),
\qquad h_k^{*} = h_k + H_k,
\qquad k=1,2,\dots,K,
\tag{A1}
$$
where the parameters h_k and H_k are the DoD and the DoC hazard rates, respectively. In addition to this usual assumption, we express H_k as the random variable:
$$
H_k = \frac{N_k}{n_k},
\qquad N_k = \sum_{j=1}^{m}\mathbf{1}_{\{(k-1)\Delta t < T_{\theta_j}\le k\Delta t\}},
\tag{A2}
$$
where, as usual, the random variable T_{θ_j} is the conditional lifetime T_0 | (T_0 > θ_j), and θ_j is the time of the j-th observed DoC event. The probability distribution of T_{θ_j}, however, is not necessarily specified for the moment. Let μ_k = E(N_k) and σ_k² = Var(N_k).
In order to derive an approximation of the variance of Ŝ_l, l = 1, 2, …, K, in the same spirit as Greenwood’s formula, we consider the variance of the logarithm:
$$
\mathrm{Var}\big(\log\hat S_l\big) = \mathrm{Var}\left(\sum_{k=1}^{l}\log(1-\hat h_k)\right) = \sum_{k=1}^{l}\mathrm{Var}\big(\log(1-\hat h_k)\big).
\tag{A3}
$$
As for the second equality, it should be noted that the ĥ_k values are not independent, since n_k depends on the events in the previous periods. However, successive conditional independence, given n_k (essentially, a martingale argument), is a sufficient condition for the equality to hold. We now use the so-called delta-method approximation Var(log X) ≈ Var(X)/[E(X)]² to obtain:
$$
\sum_{k=1}^{l}\mathrm{Var}\big(\log(1-\hat h_k)\big) \approx \sum_{k=1}^{l}\frac{\mathrm{Var}(1-\hat h_k)}{\big[E(1-\hat h_k)\big]^2}.
\tag{A4}
$$
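The quality of the delta-method step can be checked numerically. The following Monte Carlo sketch (our own illustration, with assumed parameter values) compares the sample value of Var(log X) with the delta-method value Var(X)/[E(X)]² for X = 1 − ĥ_k under the binomial model:

```python
import math
import random
import statistics

def delta_method_check(n=100, h=0.2, trials=10000, seed=1):
    # Draw h_hat with n * h_hat ~ Bin(n, h), set X = 1 - h_hat, and compare
    # the sample Var(log X) with the delta-method value Var(X) / E(X)^2.
    rng = random.Random(seed)
    xs = [1.0 - sum(rng.random() < h for _ in range(n)) / n
          for _ in range(trials)]
    logs = [math.log(x) for x in xs]
    lhs = statistics.variance(logs)
    rhs = statistics.variance(xs) / statistics.mean(xs) ** 2
    return lhs, rhs
```

For hazard rates of realistic size, the two quantities agree to within a few percent, which is what the approximation in (A4) relies on.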
By the binomial assumption (A1) we have, for all k:
$$
E\big(\hat h_k \mid H_k\big) = h_k + H_k,
$$
and:
$$
\mathrm{Var}\big(\hat h_k \mid H_k\big) = \frac{1}{n_k}\,(h_k+H_k)\big[1-(h_k+H_k)\big].
$$
Therefore, for the expectation of h ^ k we obtain:
$$
E\big(\hat h_k\big) = h_k + \mu_k/n_k,
\tag{A5}
$$
and for the variance of h ^ k we have:
$$
\mathrm{Var}\big(\hat h_k\big) = \mathrm{Var}\big(E(\hat h_k \mid H_k)\big) + E\big(\mathrm{Var}(\hat h_k \mid H_k)\big)
= \mathrm{Var}(h_k+H_k) + E\big\{(h_k+H_k)\big[1-(h_k+H_k)\big]\big\}/n_k,
$$
or, with a little algebra:
$$
\mathrm{Var}\big(\hat h_k\big) = \frac{1}{n_k}\big(h_k+\mu_k/n_k\big)\big[1-\big(h_k+\mu_k/n_k\big)\big] + \frac{n_k-1}{n_k}\,\frac{\sigma_k^2}{n_k^2}.
\tag{A6}
$$
By inserting (A5) and (A6) into (A4) we have:
$$
\mathrm{Var}\big(\log\hat S_l\big) \approx \sum_{k=1}^{l}\frac{\mathrm{Var}(1-\hat h_k)}{\big[E(1-\hat h_k)\big]^2}
= \sum_{k=1}^{l}\left[\frac{h_k+\mu_k/n_k}{1-(h_k+\mu_k/n_k)}\,\frac{1}{n_k} + \frac{1}{\big[1-(h_k+\mu_k/n_k)\big]^2}\,\frac{n_k-1}{n_k}\,\frac{\sigma_k^2}{n_k^2}\right].
$$
Plugging in h_k = ν_k/n_k and setting h̄_k = (ν_k + μ_k)/n_k, we obtain:
$$
\mathrm{Var}\big(\log\hat S_l\big) \approx \sum_{k=1}^{l}\left[\frac{\bar h_k}{1-\bar h_k}\,\frac{1}{n_k} + \frac{1}{(1-\bar h_k)^2}\,\frac{n_k-1}{n_k}\,\frac{\sigma_k^2}{n_k^2}\right].
$$
Using the inverse approximation Var(X) ≈ [E(X)]² Var(log X), we finally have:
$$
\mathrm{Var}\big(\hat S_l\big) \approx \big(\hat S_l\big)^2\sum_{k=1}^{l}\left[\frac{\bar h_k}{1-\bar h_k}\,\frac{1}{n_k} + \frac{1}{(1-\bar h_k)^2}\,\frac{n_k-1}{n_k}\,\frac{\sigma_k^2}{n_k^2}\right].
\tag{A7}
$$
Now, in the life table, we take Δt small enough to make each time interval contain at most one time point t_i. In this limit, if k_i denotes the interval containing t_i, we assume that:
$$
\mathbf{1}_{\{(k_i-1)\Delta t < T_{\theta_j} \le k_i\Delta t\}} = \mathbf{1}_{\{T_{\theta_j} = t_i\}},
$$
consistently with the fact that, in this setting, T_{θ_j} has a discrete distribution with probability masses at the points t_i. These probabilities are the q_{i,j} provided by CoDMI. Then, by (A2) and (11), we obtain:
$$
E\big(N_{k_i}\big) = \sum_{j=1}^{m} E\big(\mathbf{1}_{\{T_{\theta_j}=t_i\}}\big) = \sum_{j=1}^{m} P\big(T_{\theta_j}=t_i\big) = \sum_{j=1}^{m} q_{i,j} = Q_i.
$$
For the variance, assuming the independence of the T θ j we have:
$$
\mathrm{Var}\big(N_{k_i}\big) = \sum_{j=1}^{m}\mathrm{Var}\big(\mathbf{1}_{\{T_{\theta_j}=t_i\}}\big) = \sum_{j=1}^{m} q_{i,j}\,(1-q_{i,j}) = \sum_{j=1}^{m} q_{i,j} - \sum_{j=1}^{m} q_{i,j}^{\,2} = Q_i - Q_i^{(2)}.
$$
Thus, we estimate μ_k by Q_i and σ_k² by Q_i − Q_i^(2). Putting it all together, for the survival function S_l we arrive at the product-limit estimator:
$$
\bar S(t) = \prod_{i:\,t_i\le t}\big(1-\bar h_i\big),\qquad t\ge 0,
\tag{A8}
$$
where:
$$
\bar h_i = \frac{d_i\,\nu_i}{\bar R_i},\qquad \nu_i = (1-\delta_i) + Q_i,\qquad i=1,2,\dots,n+m,
$$
and R̄_i is computed recursively as:
$$
\bar R_1 = n+m,\qquad \bar R_i = \bar R_{i-1} - \big[1 + (\nu_i-1)\,d_i\big],\qquad i=2,3,\dots,n+m.
$$
Correspondingly, (A7) reduces to:
$$
\mathrm{Var}\big(\bar S(t)\big) \approx \bar S(t)^2\sum_{i:\,t_i\le t}\left[\frac{\bar h_i}{1-\bar h_i}\,\frac{1}{\bar R_i} + \frac{1}{(1-\bar h_i)^2}\,\frac{\bar R_i-1}{\bar R_i}\,\frac{Q_i-Q_i^{(2)}}{\bar R_i^{\,2}}\right].
\tag{A9}
$$
In summary, the addition of the random component in the binomial assumption (A1) has the effect of distributing each of the m COVID-19 deaths, which CoDMI imputed at the time points τ̂_j, over all the uncensored time points according to its truncated distribution q_{i,j}. Summing over j, we obtain the total probabilities Q_i, for which the property Σ_i Q_i = m holds. In the variance expression (A9), the estimation error of the τ̂_j is taken into account by the additional term containing the variance estimates Q_i − Q_i^(2).
It should be noted, however, that the survival function estimate given by (A8) (denoted here by S̄(t)) is slightly different from the estimate Ŝ(t) given by (1). Since the COVID-19 deaths are spread out over all the time points, one usually has S̄(t) ≤ Ŝ(t) for small t and S̄(t) ≥ Ŝ(t) for large t. One can accept the approximation:
$$
\mathrm{Var}\big(\hat S(t)\big) \approx \hat S(t)^2\sum_{i:\,t_i\le t}\left[\frac{\bar h_i}{1-\bar h_i}\,\frac{1}{\bar R_i} + \frac{1}{(1-\bar h_i)^2}\,\frac{\bar R_i-1}{\bar R_i}\,\frac{Q_i-Q_i^{(2)}}{\bar R_i^{\,2}}\right],
$$
which gives Formula (12). The more conservative approximation:
$$
\mathrm{Var}\big(\hat S(t)\big) \approx \max\big\{\hat S(t),\, \bar S(t)\big\}^2\sum_{i:\,t_i\le t}\left[\frac{\bar h_i}{1-\bar h_i}\,\frac{1}{\bar R_i} + \frac{1}{(1-\bar h_i)^2}\,\frac{\bar R_i-1}{\bar R_i}\,\frac{Q_i-Q_i^{(2)}}{\bar R_i^{\,2}}\right],
$$
could also be considered.

References

  1. De Felice, F.; Moriconi, F. COVID-19 and Cancer: Implications for Survival Analysis. Ann. Surg. Oncol. 2021, 28, 5446–5447. [Google Scholar] [CrossRef] [PubMed]
  2. Degtyarev, E.; Rufibach, K.; Shentu, Y.; Yung, G.; Casey, M.; Englert, S.; Liu, F.; Liu, Y.; Sailer, O.; Siegel, J.; et al. Assessing the Impact of COVID-19 on the Clinical Trial Objective and Analysis of Oncology Clinical Trials—Application of the Estimand Framework. Stat. Biopharm. Res. 2020, 12, 427–437. [Google Scholar] [CrossRef] [PubMed]
  3. European Medicines Agency. ICH E9 (R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials to the Guideline on Statistical Principles for Clinical Trials. Scientific Guideline. Available online: https://www.ema.europa.eu/en/documents/scientific-guideline (accessed on 17 February 2020).
  4. Kuderer, N.M.; Choueiri, T.K.; Shah, D.P.; Shyr, Y.; Rubinstein, S.M.; Rivera, D.R.; Shete, S.; Hsu, C.Y.; Desai, A.; de Lima Lopes, G., Jr.; et al. Clinical impact of COVID-19 on patients with cancer (CCC19): A cohort study. Lancet 2020, 395, 1907–1918. [Google Scholar] [CrossRef] [PubMed]
  5. Guan, W.J.; Ni, Z.Y.; Hu, Y.; Liang, W.H.; Ou, C.Q.; He, J.X.; Liu, L.; Shan, H.; Lei, C.L.; Hui, D.S.; et al. Clinical Characteristics of Coronavirus Disease 2019 in China. N. Engl. J. Med. 2020, 382, 1708–1720. [Google Scholar] [CrossRef] [PubMed]
  6. Kalbfleisch, J.D.; Prentice, R.L. The Statistical Analysis of Failure Time Data; Wiley: Hoboken, NJ, USA, 2002. [Google Scholar]
  7. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–38. [Google Scholar]
  8. DeSouza, C.M.; Legedza, A.T.R.; Sankoh, A.J. An Overview of Practical Approaches for Handling Missing Data in Clinical Trials. J. Biopharm. Stat. 2009, 19, 1055–1073. [Google Scholar] [CrossRef] [PubMed]
  9. Shih, W.J. Problems in dealing with missing data and informative censoring in clinical trials. Curr. Control. Trials Cardiovasc. Med. 2002, 3, 4. [Google Scholar] [PubMed]
  10. Shen, P.S.; Chen, C.M. Aalen’s linear model for doubly censored data. Statistics 2018, 52, 1328–1343. [Google Scholar] [CrossRef]
  11. Willems, S.J.V.; Schat, A.; van Noorden, M.S.; Fiocco, M. Correcting for dependent censoring in routine outcome monitoring data by applying the inverse probability censoring weighted estimator. Stat. Methods Med. Res. 2018, 27, 323–335. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Gray, R.J. A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Stat. 1988, 4, 1141–1154. [Google Scholar]
13. Kaplan, E.L.; Meier, P. Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc. 1958, 53, 457–481.
14. Efron, B. The two sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1967; Volume 4, pp. 831–853.
15. Efron, B.; Hastie, T. Computer Age Statistical Inference. Algorithms, Evidence, and Data Science; Cambridge University Press: Cambridge, UK, 2016.
16. Pearl, J.; Glymour, M.; Jewell, N.P. Causal Inference in Statistics. A Primer; Wiley: Hoboken, NJ, USA, 2016.
Figure 1. Kaplan–Meier curves for alternative treatments of COVID-19 deaths—Arm A.
Figure 2. Non-censoring probability curves for Arm A (left) and Arm B (right).
Figure 3. Kaplan–Meier curves estimated by CoDMI with adjustment for censoring and related confidence intervals—Arm A.
Figure 4. Kaplan–Meier curves for alternative treatments of COVID-19 deaths—Arm B.
Figure 5. Kaplan–Meier curves estimated by CoDMI with adjustment for censoring and related confidence intervals—Arm B.
Table 1. Censored survival times from Arm A (Chemotherapy) of the NCOG study.
   7     34     42     63     64     74+    83     84     91
 108    112    129    133    133    139    140    140    146
 149    154    157    160    160    165    173    176    185+
 218    225    241    248    273    277    279+   297    319+
 405    417    420    440    523    523+   583    594   1101
1116+  1146   1226+  1349+  1412+  1417
Table 2. Censored survival times from Arm B (Chemotherapy+Radiation) of the NCOG study.
  37     84     92     94    110    112    119    127    130
 133    140    146    155    159    169+   173    179    194
 195    209    249    281    319    339    432    469    519
 528+   547+   613+   633    725    759+   817   1092+  1245+
1331+  1557   1642+  1771+  1776   1897+  2023+  2146+  2297+
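Tables 1 and 2 use the standard convention that a trailing “+” marks a right-censored observation. As a minimal sketch of how such data feed the product-limit estimator (our own illustration in plain Python, not the paper’s CoDMI implementation; `parse_times` and `kaplan_meier` are hypothetical helper names), one can parse the notation and compute the Kaplan–Meier curve:

```python
def parse_times(tokens):
    """Turn tokens like '74+' into (time, event) pairs.

    event = 1 means death observed, event = 0 means censored ('+')."""
    pairs = []
    for tok in tokens:
        if tok.endswith("+"):
            pairs.append((int(tok[:-1]), 0))
        else:
            pairs.append((int(tok), 1))
    return pairs

def kaplan_meier(data):
    """Product-limit estimate: S(t) = prod over death times t_i <= t
    of (1 - d_i / n_i), with d_i deaths and n_i at risk at t_i."""
    data = sorted(data)          # censored subjects stay at risk through t
    n = len(data)
    curve, s, i = [], 1.0, 0
    while i < n:
        t = data[i][0]
        d = sum(1 for tt, e in data[i:] if tt == t and e == 1)
        at_risk = n - i          # everyone with time >= t
        j = i
        while j < n and data[j][0] == t:
            j += 1               # skip past all subjects tied at t
        if d > 0:
            s *= 1.0 - d / at_risk
            curve.append((t, s))
        i = j
    return curve

# First row of Table 1 (Arm A) as a small worked example.
arm_a = parse_times("7 34 42 63 64 74+ 83 84 91".split())
km = kaplan_meier(arm_a)
```

On this nine-patient subset the curve steps down at the eight observed death times, while the censored time 74+ only reduces the risk set.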
Table 3. Results by DoC event from N = 10,000 simulations (N_c = 9802) generated by Arm A data.
Summary statistics of the prediction error Δ_j = ẽ_j − ê_j (averages over the N_c simulations)

 j   θ̃_j     ẽ_j      ê_j      avg.     avg.%     s.e.m.    min        max
 1   134.14   421.94   426.14   −4.20    −1.00%    4.21    −677.54    1083.35
 2   137.64   434.06   425.96    8.09     1.86%    4.31    −675.28    1077.25
 3   140.10   427.67   425.51    2.16     0.51%    4.26    −692.93    1084.41
 4   134.01   421.72   424.59   −2.87    −0.68%    4.31    −658.69    1070.25
 5   138.20   432.20   425.59    6.61     1.53%    4.33    −649.86    1067.94
 6   134.54   421.22   425.62   −4.40    −1.04%    4.23    −638.90    1067.69
 7   138.66   434.07   426.54    7.53     1.74%    4.32    −671.86    1067.94
 8   137.66   433.16   426.60    6.56     1.52%    4.31    −676.75    1071.44
 9   141.41   430.15   425.71    4.44     1.03%    4.29    −631.85    1067.11
10   140.08   427.10   427.14   −0.04    −0.01%    4.29    −703.10    1072.31
Table 4. Results by DoC event from N = 10,000 simulations (N_c = 9472) generated by Arm B data.
Summary statistics of the prediction error Δ_j = ẽ_j − ê_j (averages over the N_c simulations)

 j   θ̃_j     ẽ_j      ê_j      avg.      avg.%     s.e.m.    min         max
 1   170.39   901.20   893.29     7.91     0.88%    8.20    −1245.06    1546.86
 2   165.77   903.10   894.93     8.16     0.90%    8.17    −1221.83    1545.27
 3   168.02   892.31   891.88     0.43     0.05%    8.16    −1247.44    1527.10
 4   168.50   881.61   894.61   −13.00    −1.47%    8.17    −1235.65    1551.53
 5   168.56   887.58   893.39    −5.81    −0.65%    8.13    −1248.04    1557.19
 6   172.76   889.36   895.64    −6.28    −0.71%    8.11    −1281.15    1545.27
 7   167.56   885.83   895.42    −9.59    −1.08%    8.13    −1190.08    1547.59
 8   166.83   881.27   895.01   −13.74    −1.56%    8.13    −1271.57    1539.00
 9   169.95   886.48   894.43    −7.94    −0.90%    8.18    −1283.04    1547.59
10   167.30   888.51   892.08    −3.57    −0.40%    8.20    −1247.83    1550.47
Table 5. Overall results from 10,000 simulations.
Global averages of prediction errors

                                              DoC Imputed                 DoC as DoD         DoC as Cen
Data    N_c     θ̃       ẽ        ê        Δ̄       Δ̄%      s.e.m.    Δ̄       Δ̄%       Δ̄       Δ̄%
Arm A   9802    137.64   428.33   425.94    2.39     0.56%    1.338     4.97    1.16%    20.28    4.74%
Arm B   9472    168.56   889.72   894.07   −4.34    −0.49%    2.557    12.65    1.42%    63.64    7.16%
Table 6. Effect of CoDMI adjustment for censoring when all COVID-19 endpoints are simulated as censored (δ_j ≡ 0). Overall results from 10,000 simulations.
Global averages of prediction errors

                                                          DoC Imputed                  DoC as DoD          DoC as Cen
Data    Adjust.   N_c     θ̃       ẽ         ê         Δ̄        Δ̄%       s.e.m.    Δ̄        Δ̄%       Δ̄        Δ̄%
Arm A   NO        9804    137.55   1004.22    426.09   578.13    57.57%    1.377    579.94   57.80%   554.64   55.28%
Arm A   YES       9804    137.55   1004.22   1001.91     2.30     0.23%    1.212    579.94   57.80%   554.64   55.28%
Arm B   NO        9459    168.62   1394.29    894.01   500.28    35.88%    2.119    488.74   35.14%   437.73   31.47%
Arm B   YES       9459    168.62   1394.29   1396.58    −2.29    −0.16%    1.899    488.74   35.14%   437.73   31.47%
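The summary columns in Tables 3–6 can be reproduced from the vector of per-simulation prediction errors. The sketch below is our own illustration, not the paper’s code, and it assumes two conventions that are not spelled out in this excerpt: that “avg.%” is the mean error relative to the mean true lifetime, and that “s.e.m.” is the standard error of the mean error; `error_summary` is a hypothetical helper name.

```python
import math

def error_summary(e_true, e_est):
    """Summarize prediction errors Δ_i = ẽ_i − ê_i (assumed definitions,
    see lead-in): mean, mean as % of mean true lifetime, standard error
    of the mean, min and max."""
    deltas = [t - s for t, s in zip(e_true, e_est)]
    n = len(deltas)
    avg = sum(deltas) / n
    avg_pct = 100.0 * avg / (sum(e_true) / n)      # relative to mean ẽ
    var = sum((d - avg) ** 2 for d in deltas) / (n - 1)
    sem = math.sqrt(var / n)                       # s.e. of the mean error
    return {"avg": avg, "avg%": avg_pct, "sem": sem,
            "min": min(deltas), "max": max(deltas)}

# Tiny synthetic example (three simulated lifetimes vs. one flat estimate).
stats = error_summary([420.0, 430.0, 440.0], [425.0, 425.0, 425.0])
```

With these definitions a positive “avg.” means the estimator underpredicts the simulated lifetimes on average, matching the sign convention Δ = ẽ − ê used in the tables.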
De Felice, F.; Mazzoni, L.; Moriconi, F. An Expectation-Maximization Algorithm for Including Oncological COVID-19 Deaths in Survival Analysis. Curr. Oncol. 2023, 30, 2105–2126. https://doi.org/10.3390/curroncol30020163