Abstract
A novel cure rate model is introduced by considering, for the number of concurrent causes, the modified power series distribution and, for the time to event, the recently proposed power piecewise exponential distribution. This model includes a wide variety of cure rate models, such as binomial, Poisson, negative binomial, Haight, Borel, logarithmic, and restricted generalized Poisson. Some characteristics of the model are examined, and the estimation of parameters is performed using the Expectation–Maximization algorithm. A simulation study is presented to evaluate the performance of the estimators in finite samples. Finally, an application to a real medical dataset from a population-based study of incident cases of lobular carcinoma diagnosed in the state of São Paulo, Brazil, illustrates the advantages of the proposed model over other common cure rate models in the literature: the competing models underestimate the cure rate, whereas the proposed model estimates it with improved precision.
Keywords:
power piecewise exponential distribution; cure rate model; Expectation–Maximization algorithm; survival analysis; cancer dataset
MSC:
62N01; 62N02
1. Motivation
Cure models have seen enormous growth in the medical area [1], especially in cancer research, because they allow us to estimate two crucial components jointly: the probability of cure and the survival time of susceptible patients (that is, those who are not cured and will die from the cancer under study). The growing use of this type of model is due, in large part, to advances in preventive medical techniques that allow many types of cancer to be detected at an early stage and, therefore, give the patient a better prognosis.
In this context, we assume the existence of a latent random variable (r.v.), M, which represents the number of competing causes related to the occurrence of the event of interest. The pioneering model was proposed by Berkson and Gage [2], who assumed the Bernoulli distribution for M. Almost fifty years later, in a cancer context, Chen et al. [3] represented the causes by carcinogenic cells and modeled their number with the Poisson distribution. Other important models follow this approach with a different discrete distribution, including the negative binomial (NB) [4,5,6,7,8,9], zero-modified geometric [10], power series family [11], Conway–Maxwell–Poisson [12], weighted Poisson [13], and modified power series (MPS) [14].
On the other hand, the piecewise exponential (PE) model proposed by Feigl and Zelen [15], and extended by Friedman [16], is widely used for modeling clinical data because its hazard function is constant within each of its L predefined intervals. This can be very useful in certain situations, as it allows the survival function to fall more (or less) quickly at specific times that have a clear explanation from a practical point of view. Gómez et al. [17], based on the exponential method, extended the PE to obtain the power piecewise exponential (PPE) distribution. The PPE generalizes the PE model by adding flexibility to the hazard function, allowing for both monotonic and non-monotonic patterns within each of the L intervals, in addition to the already known constant hazard pattern. De Castro and Gómez [9] employed the PPE model in a cure rate context under a Bayesian approach, assuming the negative binomial distribution for the number of competing causes, whereas Gómez et al. [8] discussed the classical counterpart.
In this article, we propose the use of the PPE model, extending the NB cure fraction model discussed in [8] through the modified power series family of distributions. Thus, the proposed model offers multiple options for modeling the time to event because the exponential, PE, and exponentiated exponential (EE) [18] models are particular cases of the PPE model. On the other hand, particular cases of the MPS include traditional models such as Poisson (Po), binomial (Bin), NB, logarithmic (Lo), as well as less used discrete models such as Borel (Bo), Haight (Ha), generalized binomial (GB), and restricted generalized Poisson (RGP), to name a few. The manuscript is organized as follows. Section 2 is devoted to introducing details of the PPE and MPS distributions. In Section 3, we introduce the MPS cure rate model with baseline PPE. Section 4 discusses the estimation procedure for the proposed model, including an Expectation–Maximization (EM)-type algorithm [19] to obtain the maximum likelihood (ML) estimators. Section 5 presents a simulation to assess the performance of the ML estimators in finite samples. In Section 6, we present a real data illustration of the model for patients with lobular carcinoma. Finally, Section 7 presents the main conclusions of the work and possible future research based on this article.
2. Background
In this Section, we provide some details of the PPE and MPS distributions, which are relevant to introduce our proposal.
2.1. PPE Distribution
The PPE model was introduced by Gómez et al. [17]. For a fixed L (representing the breakpoints of the distribution), let T be an r.v. with PPE distribution with parameters and and known partition , such that . Note that each , is related to each partition. We denote . The probability density function (PDF) and cumulative distribution function (CDF) are determined by
where and , . In addition, is defined as
Remark 1.
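The PPE distribution includes the following particular cases.
- When the power parameter equals 1, the PPE reduces to the PE distribution.
- When L = 1, the PPE reduces to the EE distribution.
- When the power parameter equals 1 and L = 1, the PPE reduces to the exponential distribution (the standard exponential model).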
The survival function of the PPE model is given by
and its respective hazard function is given by
for and . Figure 1 shows the different forms adopted by the distribution of the PPE with partition at .
Figure 1.
PDF (left) and hazard function (right) of the PPE distribution for different values of and , with ( partitions).
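As a rough illustration of the quantities in this subsection, the sketch below codes the PPE functions in Python under one explicit assumption: the PPE CDF is taken to be the PE CDF raised to a power `alpha`, which is consistent with the particular cases listed in Remark 1. The function names, the argument `cuts` (interior breakpoints of the partition), and `rates` (the hazard in each interval) are illustrative choices, not notation from the article.

```python
import math
from bisect import bisect_right

def pe_cumhaz(t, rates, cuts):
    """Cumulative hazard of the PE model at time t.

    rates: hazard in each of the L intervals; cuts: the L-1 interior breakpoints.
    """
    edges = [0.0] + list(cuts) + [math.inf]
    total = 0.0
    for j, lam in enumerate(rates):
        if t <= edges[j]:
            break
        total += lam * (min(t, edges[j + 1]) - edges[j])
    return total

def ppe_cdf(t, rates, cuts, alpha):
    """Assumed PPE CDF: the PE CDF raised to the power alpha."""
    if t <= 0:
        return 0.0
    return (-math.expm1(-pe_cumhaz(t, rates, cuts))) ** alpha

def ppe_pdf(t, rates, cuts, alpha):
    """Derivative of ppe_cdf with respect to t (away from the breakpoints)."""
    if t <= 0:
        return 0.0
    lam = rates[bisect_right(list(cuts), t)]  # hazard of the interval containing t
    h = pe_cumhaz(t, rates, cuts)
    return alpha * lam * math.exp(-h) * (-math.expm1(-h)) ** (alpha - 1.0)

def ppe_surv(t, rates, cuts, alpha):
    """Survival function, 1 - CDF."""
    return 1.0 - ppe_cdf(t, rates, cuts, alpha)

def ppe_hazard(t, rates, cuts, alpha):
    """Hazard function, PDF divided by survival."""
    return ppe_pdf(t, rates, cuts, alpha) / ppe_surv(t, rates, cuts, alpha)
```

With `alpha = 1` the hazard is piecewise constant (the PE case), and with an empty `cuts` list the functions reduce to the exponentiated exponential case.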
2.2. Modified Power Series Family of Distributions
The MPS distribution was introduced by Noack [20]. We say that , if its probability mass function (PMF) is given by
where , , is a positive function, and . In a cure rate models context, the probability generating function (PGF) is very important, and for the MPS models, such a function is given by
Table 1 details some particular cases of the MPS distribution.
Table 1.
, , , , and (the inverse function of ) for some particular cases of the MPS model.
Note that very well-known models in the literature are particular cases of this class of distributions.
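The relationship between the PMF and the PGF of an MPS member can be checked numerically. The sketch below assumes the usual MPS form, P(M = m) = a(m) g(θ)^m / f(θ), for which the PGF is f(g⁻¹(s g(θ))) / f(θ). The two families coded here (Poisson and geometric) use one common parametrization chosen for illustration; they are not copied from Table 1.

```python
import math

def mps_pmf(m, theta, a, g, f):
    """MPS probability mass function a(m) * g(theta)**m / f(theta)."""
    return a(m) * g(theta) ** m / f(theta)

def mps_pgf(s, theta, g, g_inv, f):
    """Closed-form PGF: f evaluated at g_inv(s * g(theta)), divided by f(theta)."""
    return f(g_inv(s * g(theta))) / f(theta)

def identity(x):
    return x

# Illustrative members (assumed parametrizations, both with g the identity).
poisson = dict(a=lambda m: 1.0 / math.factorial(m), g=identity,
               g_inv=identity, f=math.exp)
geometric = dict(a=lambda m: 1.0, g=identity,
                 g_inv=identity, f=lambda th: 1.0 / (1.0 - th))

def pgf_by_series(s, theta, fam, terms=60):
    """Truncated series sum of s**m * P(M = m), used only as a numerical check."""
    return sum(s ** m * mps_pmf(m, theta, fam["a"], fam["g"], fam["f"])
               for m in range(terms))

def pgf_closed(s, theta, fam):
    return mps_pgf(s, theta, fam["g"], fam["g_inv"], fam["f"])
```

Evaluating `pgf_closed(0.0, theta, fam)` returns P(M = 0), which is the quantity that plays the role of the cure rate later in the article.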
3. The Proposed Model
In this Section, we introduce the MPS cure rate (MPScr) model with baseline PPE distribution. Henceforth, this model will be named the MPScr-PPE model.
Suppose that a patient diagnosed with some type of cancer has M carcinogenic cells. Evidently, M is not observable; thus, for the formulation of the model, we will assume that its PMF corresponds to the PMF of the MPS distribution. Further, we assume that , represents, for each M, the associated time to produce carcinogenesis. The time of death of the patient is given by the minimum of the s, as long as the patient has at least one cancer cell. Otherwise, the patient will be considered cured. Under this scheme, the failure time of the patient is given by , where is a degenerate r.v. at zero. We assume that, conditional on , are independent and identically distributed such as . With those assumptions, the population survival function is given by
The population PDF of the model is given by
and substituting the PDF and survival function of the PPE model provided in Equations (1) and (2), the population PDF is given by
where . is reduced in a simple function for some particular cases of the MPS model (Bin, Po, NB, and Lo, which coincide with as the identity function). For the other cases, cannot be reduced. On the other hand, the cure rate of the model is given by
which only depends on . Therefore, the model can be reparametrized directly in the cure rate term in order to perform a regression on this term. Let ; then, .
Considering that the population is not homogeneous, we assume the existence of a set of r covariates measured for each observation, say , , where the first term is related to the intercept, and n represents the sample size. The vector can be introduced in the cure rate term using, for instance, the logit link function such that , where corresponds to the vector of unknown coefficients. Note that for all the distributions in Table 1, we obtain . Therefore,
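The reparametrization in terms of the cure rate can be sketched as follows. The code assumes that the cure rate is p0 = a(0) / f(θ), so that θ = f⁻¹(a(0) / p0), and that p0 is linked to the covariates through the logit link. The Poisson and geometric inverses and the helper names are illustrative assumptions, and the population survival function is shown only for the Poisson case, where it equals exp(−θ(1 − S(t))).

```python
import math

def cure_probability(x, beta):
    """Logit link: p0 = 1 / (1 + exp(-x'beta))."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def theta_from_cure(p0, f_inv, a0=1.0):
    """Invert p0 = a0 / f(theta) for theta."""
    return f_inv(a0 / p0)

# Inverses of f for two illustrative MPS members.
f_inv_poisson = math.log                       # f(theta) = exp(theta)
f_inv_geometric = lambda y: 1.0 - 1.0 / y      # f(theta) = 1 / (1 - theta)

def pop_surv_poisson(s, theta):
    """Population survival for Poisson causes, given the baseline survival s = S(t)."""
    return math.exp(-theta * (1.0 - s))

# Example with a hypothetical covariate vector (intercept plus one covariate).
p0 = cure_probability([1.0, 0.5], [0.2, -1.0])
theta = theta_from_cure(p0, f_inv_poisson)
```

As the baseline survival goes to zero, `pop_surv_poisson(0.0, theta)` returns `p0`, which is the plateau of the population survival function.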
4. Estimation
In this Section, we discuss the estimation procedure for the model under a classical approach. As the studies related to the cure rate are prospective, it is natural to assume a right censoring scheme. The failure indicator of the i-th observation will be denoted by , which will take the value of 1 when the event of interest is observed and 0 when the time is censored, with . The observations are considered independent. Under this configuration, the log-likelihood function for the MPScr-PPE model is given by
where denotes the vector of the parameters, and , and are given in Equations (6) and (7), respectively. However, the maximization of Equation (9) can be difficult, especially because, for some particular cases of the MPScr-PPE model, cannot be reduced to a simpler form. For this reason, in the next subsection, we will discuss a more efficient estimation procedure with less complexity based on the EM algorithm.
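For right-censored data, the observed log-likelihood adds the log population density for observed events and the log population survival for censored times. The sketch below shows this structure with a deliberately simple stand-in model (Poisson causes and an exponential baseline) in place of the MPScr-PPE model; the function names are illustrative.

```python
import math

def censored_loglik(times, deltas, pop_pdf, pop_surv):
    """Sum of log f_pop(t_i) for events (delta = 1) and log S_pop(t_i) for censored times."""
    total = 0.0
    for t, d in zip(times, deltas):
        total += math.log(pop_pdf(t)) if d == 1 else math.log(pop_surv(t))
    return total

def make_poisson_exp(theta, lam):
    """Population PDF and survival for Poisson causes with an exponential baseline."""
    def surv(t):
        return math.exp(-theta * (1.0 - math.exp(-lam * t)))
    def pdf(t):
        return theta * lam * math.exp(-lam * t) * surv(t)
    return pdf, surv

pdf, surv = make_poisson_exp(theta=1.0, lam=0.5)
value = censored_loglik([1.0, 2.0], [1, 0], pdf, surv)
```

Any other pair of population functions, such as those of an MPScr-PPE member, could be passed to `censored_loglik` in the same way.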
EM Algorithm
The EM algorithm is a very useful tool to deal with models in the presence of latent variables, facilitating the estimation process. Let be the vector containing the number of concurrent causes for all the individuals (the unobserved data) and the observed data, where , and . Thus, represents the complete data. Considering proposition 1 of Gallardo et al. [14], it follows directly that, for the MPScr-PPE model, we obtain
with as the survival function of the PPE model evaluated at . Therefore, the PMF of the number of concurrent causes, , conditional on and , is given by
with and . In this way, the conditional expectation of given and becomes
The complete log-likelihood function is given by
where , with denoting the PDF of the PPE distribution evaluated at . Let be the estimate of at the k-th iteration of the EM algorithm, and let be the conditional expectation of the complete log-likelihood function given the observed data and . With those notations, can be rewritten as
where
and , which can be computed using the result in Equation (10). In summary, the k-th step of the EM algorithm is given by
- E-step: For , compute
The maximization of Equations (11) and (12) can be performed using numerical procedures. For instance, we use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm as implemented in R [21]. The E- and M-steps are iterated until a convergence criterion is met: we stop when the maximum absolute difference between the estimates in two consecutive steps is less than a preset tolerance. The asymptotic covariance matrix of the ML estimators of , say , can be estimated as
This matrix can be estimated numerically. For instance, we consider the hessian function included in the pracma [22] package of R [21] version 4.2.2.
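To make the structure of the E- and M-steps concrete, the sketch below runs an EM algorithm for a much simpler special case than the one in the article: Poisson causes with an exponential baseline, without covariates. In this case the conditional expectation of the number of causes is δ + θ S(t), and both M-step updates have closed forms, so no BFGS step is needed. This is an illustrative simplification, not the authors' implementation; the stopping rule is the maximum absolute difference between consecutive estimates.

```python
import math

def obs_loglik(times, deltas, theta, lam):
    """Observed log-likelihood for Poisson causes with an exponential baseline."""
    total = 0.0
    for t, d in zip(times, deltas):
        total += d * (math.log(theta) + math.log(lam) - lam * t)
        total -= theta * (1.0 - math.exp(-lam * t))
    return total

def em_poisson_exp(times, deltas, theta=1.0, lam=1.0, tol=1e-8, max_iter=5000):
    """EM iterations; requires at least one observed event and positive times."""
    for _ in range(max_iter):
        # E-step: expected number of causes given the data and current estimates.
        m = [d + theta * math.exp(-lam * t) for t, d in zip(times, deltas)]
        # M-step: closed-form maximizers of the expected complete log-likelihood.
        theta_new = sum(m) / len(m)
        lam_new = sum(deltas) / sum(mi * t for mi, t in zip(m, times))
        converged = max(abs(theta_new - theta), abs(lam_new - lam)) < tol
        theta, lam = theta_new, lam_new
        if converged:
            break
    return theta, lam

# Small made-up data set, with censoring at time 8.
times = [0.5, 1.2, 2.0, 3.1, 8.0, 8.0, 8.0, 4.4, 0.9, 8.0]
deltas = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]
theta_hat, lam_hat = em_poisson_exp(times, deltas)
```

Because each iteration performs an exact E-step followed by an exact maximization, the observed log-likelihood should not decrease from one iteration to the next, which gives a simple sanity check on an implementation.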
5. Simulation Study
In this Section, we present a simulation study to assess the performance of the ML estimators for the MPScr-PPE model obtained via the EM algorithm in finite samples.
Recovery Parameters
This study was devoted to assessing some properties of the ML estimators in finite samples. In particular, we performed the study for the GBcr-PPE model. We considered a scenario similar to a real data application with two covariates: simulated from the multinomial distribution with the parameter vectors (representing the four stages of the cancer) and simulated from a normal distribution with mean 59 and standard deviation 12.6 (representing the age of the patient). Therefore, the vector of the covariates for each individual is given by , (note that is not included to avoid identifiability problems). To draw samples, we consider the stochastic representation of the model. For a fixed vector , we compute as in Equation (8); then, is drawn from the corresponding GB distribution with and . If , then the failure time is defined as . For , we draw from the PPE distribution (with a predefined ), and the failure time is defined as . For simplicity, we also consider that all the censoring times are identical to C. Therefore, the observed times are given by , with the corresponding failure indicators . We consider partitions, with and . We also consider three sample sizes: 500, 750, and 1000; three values for C: 10, 14, and 18; and three values for : 0.8, 1.0, and 1.2, totaling 27 cases. Table 2 and Table 3 summarize the average bias (bias), the estimated root mean square error (RMSE), and the mean of the standard errors (SE). The results suggest that the estimators are consistent, because as the sample size increases, the bias generally decreases, while the precision of the estimate increases. In addition, the model works well to capture when the survival time has a PE distribution instead of PPE, because the estimate for has a higher accuracy with respect to the other scenarios, where takes larger values.
Table 2.
Estimated bias, RMSE, and SE for the PPE-GB model with and under different scenarios ( and ).
Table 3.
Estimated bias, RMSE, and SE for the PPE-GB model with and under different scenarios ().
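The stochastic representation used to draw samples can be sketched as below. Two simplifications are assumed and should be read as such: the number of causes is drawn from a Poisson distribution instead of the GB distribution used in the study, and the PPE draws use the inverse of a CDF taken to be the PE CDF raised to the power `alpha`. All names are illustrative, and covariates are omitted.

```python
import math
import random

def rpois(theta, rng):
    """Knuth's multiplication method for a Poisson(theta) draw."""
    limit, k, p = math.exp(-theta), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def pe_inv_cumhaz(h, rates, cuts):
    """Time t at which the PE cumulative hazard equals h."""
    edges = [0.0] + list(cuts) + [math.inf]
    acc = 0.0
    for j, lam in enumerate(rates):
        step = lam * (edges[j + 1] - edges[j])  # infinite in the last interval
        if h <= acc + step:
            return edges[j] + (h - acc) / lam
        acc += step
    raise ValueError("rates and cuts are inconsistent")

def rppe(rates, cuts, alpha, rng):
    """Inverse-CDF draw under the assumed PPE CDF (PE CDF raised to alpha)."""
    u = min(rng.random() ** (1.0 / alpha), 1.0 - 1e-16)  # guard against log(0)
    return pe_inv_cumhaz(-math.log1p(-u), rates, cuts)

def simulate(n, theta, rates, cuts, alpha, c, seed=1):
    """Return (observed time, failure indicator, latent number of causes) triples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        m = rpois(theta, rng)
        # The failure time is the minimum of m latent times; infinite if m == 0 (cured).
        t = min((rppe(rates, cuts, alpha, rng) for _ in range(m)), default=math.inf)
        out.append((min(t, c), int(t <= c), m))
    return out

data = simulate(500, 1.0, [0.2, 0.5, 0.1], [1.0, 3.0], 1.2, c=10.0, seed=42)
```

Cured individuals (`m == 0`) are always censored at `c`, so the censoring proportion in a simulated sample is at least the cured proportion.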
6. Application
The data set includes information from 2562 patients diagnosed with lobular carcinoma, a type of breast cancer that originates in the lobules (the glands that produce milk). The patients were treated in the mastology area, with a diagnosis date between 2009 and 2016 and follow-up conducted until 2018. The data set was obtained from the Oncocenter Foundation of São Paulo, Brazil (Fundação Oncocentro de São Paulo (FOSP) in Portuguese), which is responsible for coordinating the Hospital Cancer Registry of the State of São Paulo (http://fosp.saude.sp.gov.br, accessed on 31 December 2023).
Death due to cancer was defined as the event of interest, and the time was measured from the date of diagnosis until the patient’s death (in years, mean: 5.01, standard deviation (SD): 3.05, median: 4.50, range: 0.0027–13.62). A total of 461 (18%) events occurred during the follow-up period. The median follow-up time was 12.7 years. The observed independent variables were as follows: age at diagnosis (mean: 58.98, SD: 12.64, median: 59, range: 20–94) and the clinical stage (I: 721 (28.14%), II: 976 (38.1%), III: 651 (25.41%), and IV: 214 (8.35%)), with clinical stage IV representing the most advanced stage. Figure 2 shows the estimated survival curves obtained by the Kaplan–Meier (KM) estimator for the breast cancer dataset. According to the estimated overall survival (Figure 2a), the survival function appears to trend towards a plateau close to 0.5, suggesting the presence of long-term survivors in the population. Additionally, younger patients (≤55 years old) and those in early clinical stages exhibit higher survival rates (Figure 2b,c).
Figure 2.
Estimated survival curve obtained by the Kaplan–Meier estimator.
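For reference, the Kaplan–Meier estimator behind curves such as those in Figure 2 can be computed with a few lines of code. The sketch below is a plain product-limit implementation on made-up data; it is not the software used for the figure.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: list of (event time, estimated survival just after it)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

# Made-up example: events at times 1, 3, and 5; censoring at times 2 and 4.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1])
```

When the largest observed times are censored, the last value of the curve stays above zero, which is the kind of plateau that motivates a cure rate model.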
We fitted 59 particular cases of the MPScr-PPE model, considering homogeneous partitions based on the quantiles, with L ranging from 1 to 30. The models considered were Po, Lo, Bo, Ha, NB (with to ), Bin ( to ), RGP ( to ), and GB ( to and to ), including the most popular cure rate models in the literature mentioned in Section 1. Figure 3 shows the Akaike information criterion (AIC) [23] for all combinations of the available covariates, considering the five models with the best performance (Po, Lo, NB with , NB with , and GB with and ). The best scenario was , with a similar trend in all the proposed adjusted models. For comparative purposes, we also considered the same models with the baseline Weibull (WEI) distribution for the time to event; here, the lowest AIC was 23 points higher than that of our proposal.
Figure 3.
AIC for to partitions. The right panel is a zoomed in image of the outlined part of the graph.
Table 4 presents the log-likelihood function, AIC, and Bayesian information criterion (BIC) [24] for the members of the MPScr-PPE and MPScr-WEI models. Note that in both cases, the GBcr model provides the best result.
Table 4.
AIC and BIC criteria for PPE-MPScr and Wei-MPScr.
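Both criteria are simple functions of the maximized log-likelihood, the number of estimated parameters, and (for the BIC) the sample size. The sketch below uses hypothetical log-likelihood values and parameter counts purely for illustration; they are not the values reported in Table 4.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * loglik."""
    return 2.0 * n_params - 2.0 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k * log(n) - 2 * loglik."""
    return n_params * math.log(n_obs) - 2.0 * loglik

# Hypothetical fitted models: name -> (log-likelihood, number of parameters).
fits = {"model_a": (-1500.0, 8), "model_b": (-1495.0, 12)}
n_obs = 2562
ranking = sorted(fits, key=lambda name: aic(*fits[name]))
```

Smaller values are preferred for both criteria; the BIC penalizes additional parameters more heavily than the AIC whenever log(n) exceeds 2.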
The parameter estimation under the GBcr-PPE model with and shows that with a standard deviation of . This suggests that the PE model should be preferred instead of the PPE model for this particular problem. To verify this, we also performed the likelihood ratio test for the hypothesis versus , providing the observed statistic . Under , this statistic asymptotically follows a chi-square distribution with one degree of freedom, providing a p value greater than 0.05. Therefore, at a 5% significance level, there is not enough evidence to establish a difference between the PE and PPE distributions. Table 5 shows the parameter estimates with their respective standard errors for the PE-GBcr and WEI-GBcr models with and . Note that the estimates for the regression coefficients are concordant for both models in the sense that both have the same sign.
Table 5.
Estimates and standard errors (in parenthesis) for the PE-GBcr and WEI-GBcr models with and for the lobular carcinoma data.
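The likelihood ratio test with one degree of freedom can be computed without any statistical library, because the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)). The sketch below uses hypothetical log-likelihood values; it does not reproduce the statistic obtained for the lobular carcinoma data.

```python
import math

def lrt_one_df(loglik_null, loglik_alt):
    """Return the LRT statistic and its chi-square(1) p value."""
    stat = 2.0 * (loglik_alt - loglik_null)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical maximized log-likelihoods of a restricted and an unrestricted model.
stat, p_value = lrt_one_df(-1500.4, -1500.0)
reject_at_5_percent = p_value < 0.05
```

A p value above the chosen significance level means that the restricted model (here, the analogue of the PE model) is not rejected in favor of the more general one.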
Figure 4 shows the QQ-plot for the quantile residuals (left panel) and the KM estimator for the Cox–Snell residuals [25] (right panel). We also applied some common normality tests to the quantile residuals, namely the Kolmogorov–Smirnov (KS, [26]), Shapiro–Wilk (SW, [27]), Anderson–Darling (AD, [28]), and Cramér–von Mises (CVM, [29]) tests. The p values for these tests suggest that the quantile residuals have a standard normal distribution. Finally, the KM estimator for the Cox–Snell residuals suggests that it is reasonable to assume a standard exponential distribution for these residuals. Taken together, both types of residuals suggest that the GBcr-PE model provides a satisfactory fit for this data set.
Figure 4.
Quantile–quantile (QQ) plot with envelope for quantile residuals (and the corresponding p value for different normality tests) and the KM estimator for the Cox–Snell residuals for the GBcr-PE model for the lobular carcinoma data.
Finally, to illustrate the advantage of using the GBcr-PE model instead of the GBcr-WEI model, we computed the estimated cure rate and the corresponding 95% confidence interval (CI) for both models, which are presented in Figure 5. Note that the GBcr-WEI model underestimates the cure rate relative to the GBcr-PE model. Furthermore, in some cases (such as Figure 5a), the GBcr-PE model provides a more precise 95% CI.
Figure 5.
Estimated cure rate and the corresponding 95% confidence intervals for the MPScr-PE and MPScr-WEI models.
7. Conclusions and Future Work
A new cure rate model was introduced based on the power piecewise exponential distribution for the time to event and the modified power series distribution for the number of concurrent causes. The parameter estimation was performed using the EM algorithm, which simplifies the estimation procedure considerably. Properties of the ML estimators were assessed through a simulation study, which revealed that, as the sample size increases, the bias and standard error (SE) decrease. The estimators of the components of the vector (related to the PPE distribution) showed slower convergence than those of the other parameters, indicating the need for a larger sample size to reach acceptable properties. The model also identifies well when survival times follow a PE distribution rather than a PPE distribution. Finally, in a real data application related to breast cancer, the GBcr-PPE model performed better than the common models in this context. Specifically, we determined that, for this kind of cancer, the point estimate of the cure rate based on our proposal varies between 99% for the most favorable case (younger patients in stage I) and 35% for the least favorable case (older patients in stage IV), and that the competing model always underestimated it. Future research along these lines could consider a Bayesian approach to parameter estimation and the inclusion of random effects in the cure rate term of the model.
Author Contributions
Conceptualization, D.I.G. and Y.M.G.; methodology, D.I.G., Y.M.G. and H.W.G.; software, D.I.G., Y.M.G. and J.L.S.; validation, H.W.G.; formal analysis, D.I.G., Y.M.G., H.W.G. and J.L.S.; investigation, D.I.G., Y.M.G. and J.L.S.; resources, Y.M.G.; data curation, V.F.C. and J.L.S.; writing—original draft preparation, D.I.G., Y.M.G. and J.L.S.; writing—review and editing, D.I.G., Y.M.G. and J.L.S. All authors have read and agreed to the published version of the manuscript.
Funding
The work of the first author was partially supported by a grant from the Fondo Nacional de Desarrollo Científico y Tecnológico (FONDECYT) 11230397. This work was also supported in part by the NIH National Center for Advancing Translational Sciences UCLA CTSI UL1 TR001881.
Data Availability Statement
Data and computational codes are available upon request from the authors.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Yin, G.; Ibrahim, J.G. Cure rate models: A unified approach. Can. J. Stat. 2005, 33, 559–570.
- Berkson, J.; Gage, R.P. Survival curve for cancer patients following treatment. J. Am. Stat. Assoc. 1952, 47, 501–515.
- Chen, M.H.; Ibrahim, J.G.; Sinha, D. A new Bayesian model for survival data with a surviving fraction. J. Am. Stat. Assoc. 1999, 94, 909–919.
- Cancho, V.G.; Rodrigues, J.; de Castro, M. A flexible model for survival data with a cure rate: A Bayesian approach. J. Appl. Stat. 2011, 38, 57–70.
- Yiqi, B.; Maria Russo, C.; Cancho, V.G.; Louzada, F. Influence diagnostics for the Weibull-Negative-Binomial regression model with cure rate under latent failure causes. J. Appl. Stat. 2016, 43, 1027–1060.
- Ortega, E.M.; Cordeiro, G.M.; Kattan, M.W. The negative binomial–beta Weibull regression model to predict the cure of prostate cancer. J. Appl. Stat. 2012, 39, 1191–1210.
- D’Andrea, A.; Rocha, R.; Tomazella, V.; Louzada, F. Negative binomial Kumaraswamy-G cure rate regression model. J. Risk Financ. Manag. 2018, 11, 6.
- Gómez, Y.M.; Gallardo, D.I.; Leão, J.; Calsavara, V.F. On a new piecewise regression model with cure rate: Diagnostics and application to medical data. Stat. Med. 2021, 40, 6723–6742.
- de Castro, M.; Gómez, Y.M. A Bayesian cure rate model based on the power piecewise exponential distribution. Methodol. Comput. Appl. Probab. 2020, 22, 677–692.
- Leão, J.; Bourguignon, M.; Gallardo, D.I.; Rocha, R.; Tomazella, V. A new cure rate model with flexible competing causes with applications to melanoma and transplantation data. Stat. Med. 2020, 39, 3272–3284.
- Cancho, V.G.; Louzada, F.; Ortega, E.M. The power series cure rate model: An application to a cutaneous melanoma data. Commun. Stat.-Simul. Comput. 2013, 42, 586–602.
- Balakrishnan, N.; Pal, S. Expectation maximization-based likelihood inference for flexible cure rate models with Weibull lifetimes. Stat. Methods Med. Res. 2016, 25, 1535–1563.
- Balakrishnan, N.; Koutras, M.V.; Milienos, F.S. A weighted Poisson distribution and its application to cure rate models. Commun. Stat.-Theory Methods 2018, 47, 4297–4310.
- Gallardo, D.I.; Gomez, Y.M.; Gómez, H.W.; de Castro, M. On the use of the modified power series family of distributions in a cure rate model context. Stat. Methods Med. Res. 2020, 29, 1831–1845.
- Feigl, P.; Zelen, M. Estimation of exponential survival probabilities with concomitant information. Biometrics 1965, 21, 826–838.
- Friedman, M. Piecewise exponential models for survival data with covariates. Ann. Stat. 1982, 10, 101–113.
- Gómez, Y.M.; Gallardo, D.I.; Arnold, B.C. The power piecewise exponential model. J. Stat. Comput. Simul. 2018, 88, 825–840.
- Gupta, R.D.; Kundu, D. Generalized exponential distributions. Aust. N. Z. J. Stat. 1999, 41, 173–188.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
- Noack, A. A Class of Random Variables with Discrete Distributions. Ann. Math. Stat. 1950, 21, 127–132.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023.
- Borchers, H.W. pracma: Practical Numerical Math Functions; R Package Version 2.4.2; R Package Vignette: Madison, WI, USA, 2022.
- Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control. 1974, 19, 716–723.
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
- Cox, D.R.; Snell, E.J. A general definition of residuals. J. R. Stat. Soc. Ser. B (Methodol.) 1968, 30, 248–275.
- Kolmogorov, A.N. Sulla determinazione empirica di una legge di distribuzione (On the empirical determination of a distribution law). G. Dell’Inst. Ital. Degli Attuari 1933, 4, 83–91.
- Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611.
- Anderson, T.W.; Darling, D.A. Asymptotic theory of certain “goodness of fit” criteria based on stochastic processes. Ann. Math. Stat. 1952, 23, 193–212.
- Cramér, H. On the composition of elementary errors. Skand. Aktuarietidskr. 1928, 11, 13–74.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).