The problem of defining a common and appropriate method in survival analysis for handling the dropouts due to coronavirus disease 2019 (COVID-19) deaths of patients participating in oncology clinical trials has recently been stressed [1]. In oncology trials, all-cause deaths are often counted as events for death-related endpoints, e.g., overall survival. However, as has been pointed out [2], counting a COVID-19 fatality as a death-related endpoint requires a complex redefinition of the estimand, considering a composite strategy for handling the so-called intercurrent events [3], such as "discontinuation from treatment due to COVID-19" or "delay of scheduled intervention". The problem is also exacerbated by the difficulty of homogeneously determining whether a death is entirely attributable to COVID-19. In this paper, we address a simplified version of this problem, assuming that COVID-19-related deaths are homogeneously identified and are the only intercurrent events to be considered. In this framework, we tackle the problem of how data in an oncology trial having overall survival as the endpoint can be dealt with when deaths due to COVID-19 are present in the sample.
COVID-19 deaths should not be treated as standard censored data, because usual censoring should be considered, at least in principle, non-informative. Informative censoring, instead, occurs when participants are lost to follow-up for reasons related to the study, as seems to be the case with COVID-19 deaths of oncological patients. Direct data on how COVID-19 affects survival outcomes in patients with active malignancy or a history of malignancy are immature. However, early evidence identified increased risks of COVID-19 mortality in patients with cancer, especially in those with progressive disease [4]. Patients with cancer and COVID-19 showed an increased death rate compared to the unselected COVID-19 patient population (13% versus 1.4%) [4]. Based on these findings, in survival analysis dropouts due to COVID-19 deaths should be considered as cases of informative censoring [5].
Another way used in the survival analysis literature to represent this dependence is to view cancer deaths and COVID-19 deaths as competing events; see, e.g., [6], Ch. 8. In this paper, we propose an algorithmic method to include COVID-19 deaths of oncological patients in typical survival data, focusing on the classical Kaplan–Meier (KM) product-limit survival estimator. Our method is in the spirit of the Expectation-Maximization (EM) algorithms [7] used for handling missing or fake data in statistical analysis. In this sense, the method could also be used in applications different from clinical trials, e.g., reliability analysis. Correction of actuarial life tables is also a possible application.
An overview of methods for dealing with missing data in clinical trials is provided by DeSouza, Legedza and Sankoh [8]; see also Shih [9]. In Shen and Chen [10], the problem of doubly censored data is considered and a maximum likelihood estimator is obtained via EM algorithms that treat the survival times of left-censored observations as missing. Concerning situations with informative censoring, where there is stochastic dependence between the time to event and the time to censoring (which is our case if "censoring" is a COVID-19 death), Willems, Schat and van Noorden [11] distinguish between cases where the stochastic dependence is direct and cases where it acts through covariates. In that paper [11], the latter case is considered and an "inverse probability censoring weighting" approach is proposed for handling this kind of censoring. Since at this stage it is difficult to model cancer deaths and COVID-19 deaths through common covariates, in this paper we consider the case of direct dependence. We do not consider a survival regression model based on specified covariates, and limit the analysis, as has been said, to the basic Kaplan–Meier survival model, which is assumed to be applied, as usual, to a sufficiently homogeneous cohort of oncological patients. In this framework, we propose a so-called mean-imputation method for COVID-19 deaths using a purpose-built algorithm, referred to as the Covid-Death Mean-Imputation (CoDMI) algorithm. A user-friendly version of this algorithm programmed in R is freely available. The corresponding source code can be downloaded from the website: https://github.com/alef-innovation/codmi (accessed on 6 July 2021).
An alternative approach to survival analysis when COVID-19 deaths are present in an oncology clinical trial in addition to cancer deaths could be based on the cumulative incidence functions, which estimate the marginal probability of each competing risk. This would lead to dealing with subdistributions and would require appropriate statistical tests to be used; see, e.g., [12]. Our algorithmic approach, instead, acts directly on the data, producing an adjustment that virtually eliminates the presence of the competing risk, thus allowing the use of standard statistical tools. This comes at the price of accepting some simplifications and specific assumptions.
The basic idea of the CoDMI algorithm is of a counterfactual nature. Since the KM model provides an estimate of the probability distribution of surviving until any chosen point on the time axis for the patients in the sample, for each patient observed to die of COVID-19 at a given time we derive from this distribution the expected lifetime beyond that time, thus obtaining the "no-Covid" expected lifetime of each of these patients. Each observed COVID-19 death time is then replaced by this virtual lifetime (this is the mean-imputation) and the KM estimation is repeated on the original data completed in this way, providing a new estimate of the survival distribution. This procedure is iterated until the change between two successive estimates is considered immaterial (according to a specified criterion).
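To make the iteration concrete, the following Python sketch implements the loop just described. It is not the released R implementation: the product-limit estimator is hand-rolled, any residual probability mass of the KM death distribution is assigned to the largest event time (a common but arbitrary convention), and all function names are illustrative.

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier product-limit survival estimates at the distinct event times."""
    t = np.asarray(times, float)
    e = np.asarray(events, int)
    uniq = np.unique(t[e == 1])
    surv, s = [], 1.0
    for u in uniq:
        n_at_risk = np.sum(t >= u)
        d = np.sum((t == u) & (e == 1))
        s *= 1.0 - d / n_at_risk
        surv.append(s)
    return uniq, np.array(surv)

def expected_lifetime_beyond(t0, uniq, surv):
    """E[T | T > t0] under the discrete KM death distribution.
    Residual tail mass is assigned to the largest event time (our convention)."""
    s_prev, mass, pts = 1.0, [], []
    for u, s in zip(uniq, surv):
        mass.append(s_prev - s)
        pts.append(u)
        s_prev = s
    mass[-1] += s_prev                       # leftover mass onto the last point
    mass, pts = np.array(mass), np.array(pts)
    sel = pts > t0
    if mass[sel].sum() == 0:
        return float(t0)                     # no mass beyond t0: leave unchanged
    return float(np.sum(pts[sel] * mass[sel]) / mass[sel].sum())

def codmi_sketch(times, events, covid_idx, tol=1e-6, max_iter=100):
    """Iterate: fit KM, replace each COVID-19 death time by its expected
    no-Covid lifetime beyond the observed DoC time, refit, until stable."""
    times = np.asarray(times, float).copy()
    events = np.asarray(events, int).copy()
    events[covid_idx] = 1                    # imputed points are treated as deaths
    t_covid = times[covid_idx].copy()        # observed DoC times stay fixed
    for _ in range(max_iter):
        uniq, surv = km_survival(times, events)
        new = np.array([expected_lifetime_beyond(t0, uniq, surv) for t0 in t_covid])
        if np.max(np.abs(new - times[covid_idx])) < tol:
            break
        times[covid_idx] = new
    return times
```

On a toy sample with six uncensored deaths at times 2, 4, ..., 12 and the death at time 6 marked as a DoC event, the sketch converges in two iterations.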
It is pointed out by Shih [9] that "The attraction of imputation is that once the missing data are filled-in (imputed), all the statistical tools available for the complete data may be applied". Although in our case we are not dealing with missing data but with partially observed data, this attractive property of mean-imputation still holds. It should be noticed, however, that, in general, treating an estimated value, even an unbiased one, as an observed value should entail some increase in variance. In particular, the confidence limits of KM estimates on data including imputations should be appropriately enlarged. We propose an extension of the classical Greenwood formula providing this correction.
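For reference, the classical Greenwood formula that our extension builds on can be sketched as follows; the extended version derived in Appendix A is not reproduced here, and the function name is our own.

```python
import numpy as np

def greenwood_variance(times, events):
    """Classical Greenwood variance of the KM estimate at each event time:
    Var[S(t)] ~= S(t)^2 * sum_{t_i <= t} d_i / (n_i * (n_i - d_i))."""
    t = np.asarray(times, float)
    e = np.asarray(events, int)
    s, acc, out = 1.0, 0.0, []
    for u in np.unique(t[e == 1]):
        n = np.sum(t >= u)                   # items at risk just before u
        d = np.sum((t == u) & (e == 1))      # deaths at u
        s *= 1.0 - d / n
        if n > d:                            # skip the degenerate last term
            acc += d / (n * (n - d))
        out.append((float(u), s, s * s * acc))
    return out                               # (time, survival, variance) triples
```

The imputation-aware extension enlarges these variances to account for the estimation error carried by the imputed points.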
The paper is organized as follows. In Section 2, the notation and the basic structure of the KM survival estimator are provided and the related problem of computing expected lifetimes is illustrated. The representation of Covid-death cases in the sample is also described. In Section 3, the CoDMI algorithm is introduced and the details of the iteration procedure are provided. The convergence issue is discussed and the underlying assumptions of the algorithm are considered, taking into account some subtleties required by the non-parametric nature of the KM estimator. A possible adjustment of the algorithm for censoring is presented, and a correction of Greenwood's formula is derived to take into account the estimation error in the imputed point estimates. Applications of CoDMI to real medical data are provided in Section 4. Two oncological survival data sets that are well referenced in the literature are completed by artificial Covid-death observations, and the survival curves estimated by CoDMI are compared with the no-Covid KM estimates and with the two naïve KM estimates obtained by considering COVID-19 deaths as censorings or as deaths of disease. The effect of the final adjustment for censoring is also illustrated. In Section 5, an extensive simulation study is presented to evaluate the CoDMI predictive performances. We discuss the details of the simulation procedure and provide tables illustrating the results. Some conclusions and comments are given in Section 6. In Appendix A, a derivation of the extended Greenwood formula is provided.
5. A Simulation Study
In order to test the ability of CoDMI to correctly estimate the expected life-shortening (or the corresponding virtual lifetime) due to DoC events in a study population, we generate many scenarios, each containing simulated data. These pseudo-data include a data set of standard observations and a data set of (preliminary) virtual lifetimes. By randomly censoring the time variables in the latter set, a corresponding set of DoC time points is derived. In order to equip these pseudo-data with a probabilistic structure consistent with the CoDMI assumptions, a KM best-fitting distribution is derived by applying the product-limit estimator to the combined data. The "true" virtual lifetimes are then derived by conditional sampling from this distribution, given the DoC time points. Running the CoDMI algorithm on the pseudo-observations, the estimated virtual lifetimes are obtained, and the quality of the estimator is measured by computing the average, over all scenarios, of the prediction errors.
5.1. Details of the Simulation Process
The details of each scenario simulation are as follows:
Simulation of standard survival data. The simulated standard (i.e., non-Covid) survival data set is generated in each scenario starting from the same set of real data, spanning a given time interval. The set is generated by drawing pairs with replacement from the n real-life pairs, maintaining the proportion between DoD and Cen events observed in the real data. The largest uncensored time point in the simulated data set is also recorded, as it is needed in the following steps.
Remark. It should be noted that many tied values can be generated in this step, especially if the number of drawn pairs is large compared to n. Moreover, the largest simulated time point could turn out to be censored (a case of incomplete death observations) even if the death observations are complete in the original data. It is easy to guess that generating many scenarios in this way can produce a number of "extreme" pseudo-data sets. This is useful, however, for testing the algorithm even in unrealistic situations. Most cases of failed convergence correspond to extreme situations.
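Step 1 can be sketched as a stratified bootstrap; the function below is our own illustration (names and the rounding of the death share are assumptions, not the paper's code).

```python
import numpy as np

def resample_survival_data(times, events, n_new, rng):
    """Draw n_new (time, status) pairs with replacement from the real data,
    keeping the original proportion between deaths (DoD, status 1)
    and censorings (Cen, status 0)."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    dod = np.flatnonzero(events == 1)
    cen = np.flatnonzero(events == 0)
    n_dod = round(n_new * len(dod) / len(times))   # preserve the DoD share
    idx = np.concatenate([rng.choice(dod, n_dod, replace=True),
                          rng.choice(cen, n_new - n_dod, replace=True)])
    rng.shuffle(idx)
    return times[idx], events[idx]
```

Drawing with replacement is what produces the tied values and the occasional censored maximum mentioned in the remark above.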
Simulation of DoC time points. In order to simulate a number of COVID-19 deaths, a set of time points is generated by drawing with replacement from the points in the real data z satisfying the required conditions. These time points are interpreted as temporary virtual lifetimes and are first used to generate the DoC time points. A number of independent drawings from a uniform (0, 1) distribution are performed, and each DoC time point is obtained by scaling the corresponding virtual lifetime by the uniform drawing. Therefore, for all j, each DoC time point is smaller than the corresponding virtual lifetime, taking equally probable values between 0 and that lifetime.
Remark. The use of a uniform distribution is obviously questionable, and a more "informative" distribution could be suggested. For example, a beta distribution with the first parameter greater than 1 and the second parameter lower than 1 may be preferable, as it makes values of the DoC time point closer to the virtual lifetime more probable. However, the form of this distribution is irrelevant to our purposes: we are interested in observing how well CoDMI is able to capture the simulated virtual lifetimes, independently of how they are generated.
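Under our reading of step 2 (each DoC time t_j = u_j * v_j with u_j ~ Uniform(0, 1), so that t_j is uniform on (0, v_j)), the construction can be sketched as follows; the function name and this multiplicative form are assumptions made for illustration.

```python
import numpy as np

def simulate_doc_times(virtual_lifetimes, rng):
    """Censor each temporary virtual lifetime v_j at a uniformly drawn
    fraction of itself: t_j = u_j * v_j with u_j ~ Uniform(0, 1),
    so that t_j takes equally probable values in (0, v_j)."""
    v = np.asarray(virtual_lifetimes, float)
    u = rng.uniform(0.0, 1.0, size=v.shape)
    return u * v
```

Replacing `rng.uniform` with a beta draw (first parameter > 1, second < 1) yields the more "informative" variant mentioned in the remark.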
Simulation of virtual lifetimes. The temporary lifetimes (and the corresponding data set) cannot be directly used to test the CoDMI algorithm, since their probabilistic structure is indeterminate and, in any case, we have too few (pseudo-)observations. In order to introduce a probabilistic structure consistent with the CoDMI assumptions, we first run the KM estimator on the combined data set, thus obtaining the corresponding death probability distribution. The virtual lifetimes are then obtained by computing the conditional expectations under this distribution. However, this is not yet fully consistent with the CoDMI assumptions, since, as discussed in Section 3.3, the appropriate distribution is the KM best-fitting distribution specified on the extended data, i.e., the data including the virtual lifetimes themselves. To obtain this result we should repeat the previous step, i.e., run the product-limit estimator on the new data set, thus producing a new distribution, and then derive new time points by taking the conditional expectation under this distribution. In principle, this step should be iterated similarly to what is done in the CoDMI algorithm. To avoid convergence problems, however, we prefer to limit the number of iterations to a fixed (low) value, thereby implicitly accepting a certain level of bias in the estimations. After these iterations have been made, the final data set is obtained. Running the KM estimator on these data again, the final distribution is obtained, and the definitive time points, with the corresponding life expectancies, are computed by conditional sampling, i.e., by simulating from the truncated distribution (after normalization). These sampled values are taken as the true values of the virtual lifetimes and life expectancies, respectively, which should be estimated by CoDMI using only the information available in the pseudo-observations.
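The conditional-sampling move that closes step 3 (simulating from the death distribution truncated to times beyond the DoC point, after renormalization) can be sketched as below; the discrete representation by (point, mass) pairs and the function name are our own.

```python
import numpy as np

def sample_conditional(points, mass, t0, rng):
    """Sample a 'true' virtual lifetime from the discrete KM death
    distribution truncated to {t > t0} and renormalized."""
    points = np.asarray(points, float)
    mass = np.asarray(mass, float)
    sel = points > t0                        # keep only mass beyond the DoC time
    p = mass[sel] / mass[sel].sum()          # renormalize the truncated masses
    return float(rng.choice(points[sel], p=p))
```

The expected value of this truncated distribution is the conditional life expectancy that CoDMI tries to recover.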
Application of CoDMI and naïve estimators. The CoDMI algorithm is applied to the simulated data: the standard observations obtained in step 1 and the DoC time points obtained in step 2. Provided that the algorithm converges, we obtain the estimated virtual lifetimes and the estimated life expectancies. To allow comparison, we also derive in this step the predictions of the two naïve "estimators", which are obtained by applying the KM estimator to the simulated data modified by treating, for all j, each DoC event either as a death of disease ("DoC as DoD") or as a censoring ("DoC as Cen").
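The two naïve treatments amount to relabeling the status indicator of the DoC observations before running the standard KM estimator, e.g. (function name ours):

```python
import numpy as np

def naive_datasets(times, events, covid_idx):
    """Build the two naïve data sets: 'DoC as DoD' treats each COVID-19
    death as a death of disease (status 1); 'DoC as Cen' treats it as a
    censoring (status 0). Times are left unchanged."""
    t = np.asarray(times, float)
    e = np.asarray(events, int)
    as_dod, as_cen = e.copy(), e.copy()
    as_dod[covid_idx] = 1
    as_cen[covid_idx] = 0
    return (t, as_dod), (t, as_cen)
```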
5.2. Evaluation of the Predictive Performances
In the simulation exercise, a large number N of scenarios is generated. This provides, for each scenario, the estimated life expectancies (from step 4) and the corresponding true values (from step 3). Then we can compute the prediction errors, i.e., the differences between the true and the estimated values, and the corresponding average errors over all scenarios. Positive (negative) values of these average errors correspond to under(over)-estimates provided by CoDMI. As usual, we can associate to these average errors the corresponding standard error, i.e., the standard error of the mean (s.e.m.). Given the independence assumption, the central limit theorem guarantees, as usual, that the sample means are asymptotically normal. Therefore, the corresponding s.e.m. is inversely proportional to the square root of N.
The same summary statistics are computed for the prediction errors relative to the two naïve estimators.
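The summary statistics above reduce to the standard sample-mean computation; a minimal sketch (function name ours):

```python
import numpy as np

def mean_error_and_sem(errors):
    """Average prediction error and its standard error of the mean:
    s.e.m. = sample standard deviation / sqrt(N), shrinking as 1/sqrt(N)."""
    errors = np.asarray(errors, float)
    n = errors.size
    return errors.mean(), errors.std(ddof=1) / np.sqrt(n)
```

A positive mean error signals under-estimation by the estimator being scored, a negative one over-estimation.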
5.3. Results from Simulation Exercises
Two separate simulation exercises were performed, one using Arm A and the other using Arm B as real-life data. In both exercises, 10,000 scenarios were generated, with a number of standard observations roughly double the real ones and 10 simulated COVID-19 deaths. A tolerance threshold and a maximum number of allowed iterations were chosen for the CoDMI algorithm. The number of iterations for generating the true values was kept fixed, and the same initialization option was chosen in all cases. Since in some scenarios CoDMI failed to converge (with the chosen tolerance and iteration limit), the sample means and the corresponding s.e.m. were computed only on the convergence cases.
In Table 3, which refers to Arm A data, the simulation results are reported for each of the 10 DoC cases; they refer to the convergence cases out of the 10,000 simulated scenarios. In each row, the sample mean of the DoC time points, the true life expectancy and the CoDMI-estimated life expectancy are reported in columns 2–4. In columns 5–9, we provide summary statistics of the corresponding prediction errors: the mean error, the related s.e.m., the relative mean error and the minimum and maximum values of the error.
The same results for 10,000 scenarios generated by Arm B data are reported in Table 4. Table 5 provides the results in Table 3 and Table 4 aggregated over all the DoC events. These overall results are summarized in the block "DoC imputed". In the blocks "DoC as DoD" and "DoC as Cen", the average prediction errors are reported for the two corresponding naïve estimators. The main finding from the simulations is that the CoDMI estimates seem to be essentially unbiased, with a small relative prediction error for both the original data sets considered. Some more extensive (and time-consuming) tests, with a larger number of iterations in the generation of the true values, have shown a further reduction of the error (as well as, obviously, of the corresponding s.e.m.).
As a final exercise, we used a modified version of the simulation procedure to obtain an assessment of the goodness of the adjustment for censoring described in Section 3.4. In the modified simulation, all the true virtual lifetimes were generated assuming a censoring, instead of a DoC, as the endpoint. Accordingly, in step 3 of Section 5.1 we generated in all iterations the virtual lifetimes using the truncated reverse probability distribution, i.e., the distribution obtained by applying the Kaplan–Meier estimator to the data with the roles of deaths and censorings reversed. Consistently with this change in assumption, the estimated values in each simulation were obtained by applying the CoDMI algorithm with the final adjustment for censoring, setting to 0 all the corresponding death probabilities. The overall results from these simulations are summarized in Table 6, which has the same structure as Table 5 and where the results without adjustment are also provided for comparison.
As we can see, the changed assumption on the status of the DoC endpoints produces a large increase in the true life expectancies, but the adjustment for censoring seems to capture this change quite well. Of course, in real life we do not know what the true status of the DoC endpoints is, and we will have to choose a suitable specification based on the estimated probabilities and/or on expert judgment.
6. Conclusions and Directions for Future Research
In the simulated scenarios where all the virtual endpoints of COVID-19 cases are assumed to be DoD, the results indicate that the CoDMI estimator is roughly unbiased and outperforms the alternative estimates obtained by the naïve approaches. In the opposite extreme situation, where all the virtual endpoints of COVID-19 cases are assumed to be censored, the final adjustment for censoring of CoDMI also guarantees unbiasedness, provided that the information on the status of DoC events is assumed to be known. The non-convergence cases can often be circumvented by relaxing the convergence criterion and/or slightly perturbing the COVID-19 data. Furthermore, changing the initialization of the algorithm can be useful in some cases.
By a natural extension of the binomial assumptions underlying the KM estimator, a version of the classical Greenwood formula can be derived for computing the variance of CoDMI estimates. Equipped with this formula, the CoDMI algorithm is proposed as a complete statistical estimation tool.
As we pointed out in the Introduction, the CoDMI algorithm, compared with the cumulative incidence functions method often used to study competing risks, is a pragmatic approach that allows one to apply all standard statistical tools directly to the "augmented" data. However, it remains important to compare the predictive performance of the two approaches. In our applications, where the competing events are DoD and DoC, we do not yet have sufficiently rich data to test the effectiveness, and possibly the necessity, of an approach based on the cumulative incidence functions, or even to test the possibility of using the two methods in conjunction. This topic is therefore left for future research.
Another interesting issue is the convergence of the CoDMI algorithm, which is discussed in Section 3.2. A natural way to approach this problem is to study the behavior of the log-likelihood function. However, as we have pointed out, we are not in a fixed-time-points situation, so it is not a trivial task to explicitly write the updated log-likelihood at each iteration step, because the replacements in each step imply a re-ordering of the time points and consequently a change in the number of items at risk in each death probability estimate. This problem is also left for future work.