Article

A Mixture Model for Survival Data with Both Latent and Non-Latent Cure Fractions

by Eduardo Yoshio Nakano 1,*, Frederico Machado Almeida 1 and Marcílio Ramos Pereira Cardial 2
1 Department of Statistics, University of Brasilia, Campus Darcy Ribeiro, Asa Norte, Brasilia 70910-900, Brazil
2 Institute of Mathematics and Statistics, Federal University of Goias, Goiania 74001-970, Brazil
* Author to whom correspondence should be addressed.
Stats 2025, 8(3), 82; https://doi.org/10.3390/stats8030082
Submission received: 26 July 2025 / Revised: 10 September 2025 / Accepted: 11 September 2025 / Published: 13 September 2025
(This article belongs to the Section Survival Analysis)

Abstract

One of the most popular cure rate models in the literature is the Berkson and Gage mixture model. A characteristic of this model is that it considers the cure to be a latent event. However, there are situations in which the cure is well known, and this information must be considered in the analysis. In this context, this paper proposes a mixture model that accommodates both latent and non-latent cure fractions. More specifically, the proposal is to extend the Berkson and Gage mixture model to include the knowledge of the cure. A simulation study was conducted to investigate the asymptotic properties of maximum likelihood estimators. Finally, the proposed model is illustrated through an application to credit risk modeling.

1. Introduction

Statistical techniques for censored data have been extensively studied in the literature. A common assumption underlying these models is that each individual in the study will eventually experience the event of interest if followed long enough. However, this assumption does not hold in many real-world scenarios, including biomedical, financial, demographic, criminological, and engineering research. Individuals who will never experience the event are typically referred to as cured, non-susceptible, immune, or long-term survivors, and their survival times are considered infinite. The remaining individuals are classified as susceptible.
Models for analyzing data with a proportion of cured individuals are often called cure models or cure rate models. According to [1], in practical terms such a model can be used, for example, to model data related to various types of cancer for which a significant proportion of patients are cured.
Since the seminal mixture model proposed by [2], which assumes that the population under study is a combination of cured and susceptible individuals, several approaches have been developed to accommodate the presence of a cured fraction. These include promotion time and frailty models. Promotion time models treat the time-to-event as a result of the first occurrence among a set of latent failures. These models are useful in contexts where multiple failure mechanisms are present [3]. Extensions of this modeling approach have been studied in [4,5,6]. Frailty models incorporate unobserved heterogeneity between individuals by modeling vulnerability to the event of interest as a latent variable (frailty). These models can be extended to accommodate cure fractions, as presented in [7,8]. Models that incorporate long-term survivors offer an advantage over standard survival techniques, as they allow for the simultaneous estimation of parameters associated with both the susceptible and cured subpopulations [9,10].
Among the aforementioned models, the most popular cure fraction model is the Berkson and Gage mixture model [2]. This model assumes the existence of heterogeneity in the population under study. Consequently, the modeling is based on the mixture of two distributions: one representing the distribution of failure or survival times of susceptible individuals and the other corresponding to a degenerate distribution (which allows for, in principle, infinite survival times) for cured individuals.
Let T be a non-negative random variable denoting the survival time of the entire population. Under the mixture assumption, the Berkson and Gage [2] model has the form
$$S_T(t) = (1-\phi)\,S_Y(t) + \phi, \tag{1}$$
where $0 \le \phi \le 1$ is the proportion of cured individuals and $S_Y(\cdot)$ denotes the survival function of the susceptible group.
However, model (1) may not be suitable for some problems, since it treats the cure as a latent event. In fact, there are situations in which the cure is known, i.e., situations in which it is known that the censoring occurred because the individual was cured. One example arises in credit risk modeling, where the variable of interest is the time to default (i.e., delay in loan repayment): a customer who fully repays the loan is a known cured case, and this information should be incorporated into the analysis.
In this context, the main contribution of this work is to develop a model that accommodates both latent and non-latent cure fractions. More specifically, the proposal is to extend the Berkson and Gage mixture model (1) to incorporate known cure information. The model proposed in this paper is illustrated using artificial data of customers who have taken out loans from a financial institution. A credit risk score derived from the proposed model is then used to classify customers according to their risk of default.
This manuscript is organized as follows: Section 2 introduces the model formulation and the procedures for estimating the model parameters using the maximum likelihood method. Section 3 presents a simulation study conducted to investigate whether the usual asymptotic properties of maximum likelihood estimators hold and illustrates the proposed model using artificial data. Finally, concluding remarks are provided in Section 4.

2. Materials and Methods

2.1. Model Formulation

The model proposed in this paper is formulated considering that an individual observed in the sample can belong to one of three distinct subpopulations: susceptible (non-cured) individuals, non-susceptible individuals who are known to be cured (non-latent cure), and non-susceptible individuals whose status as cured is unknown (latent cure).
The knowledge of the cure of an individual can be represented by a random variable $K$ following a Bernoulli distribution with success probability $\rho$. In addition, given $K = 0$, the latent cure variable, $C$, follows a Bernoulli distribution with success probability $\phi$. Thus,
$$K = \begin{cases} 0, & \text{if it is unknown that the individual is cured} \\ 1, & \text{if it is known that the individual is cured}, \end{cases} \tag{2}$$
and
$$C = \begin{cases} 0, & \text{if the individual is not cured} \\ 1, & \text{if the individual is cured}. \end{cases} \tag{3}$$
Since $P(C = 1 \mid K = 1) = 1$, $P(C = 1 \mid K = 0) = \phi$, $P(C = 0 \mid K = 1) = 0$ and $P(C = 0 \mid K = 0) = 1 - \phi$, the probability of an individual being cured is given by
$$\begin{aligned} \mu = P(C = 1) &= P(C = 1, K = 1) + P(C = 1, K = 0) \\ &= P(K = 1)\,P(C = 1 \mid K = 1) + P(K = 0)\,P(C = 1 \mid K = 0) \\ &= \rho + (1-\rho)\,\phi. \end{aligned} \tag{4}$$
This model proposes a mixture of distributions for individuals who are not known to be cured, i.e., given $K = 0$, we have $f_{T \mid K=0}(t) = (1-\phi)\,f_Y(t) + \phi\,f_Z(t)$. Here, $T$ is a random variable that represents the time to failure, $Y$ is a continuous random variable that represents the time to failure of non-cured individuals, and $Z$ is a degenerate variable at infinity (i.e., $P(Z > t) = 1$ for all $t > 0$) that represents the time to failure of cured individuals.
Given the value of K, the survival function in mixture form is given by
$$\begin{aligned} S_{T \mid K=0}(t) &= P(T > t \mid K = 0) \\ &= P(T > t, C = 0 \mid K = 0) + P(T > t, C = 1 \mid K = 0) \\ &= P(C = 0 \mid K = 0)\,P(T > t \mid C = 0, K = 0) + P(C = 1 \mid K = 0)\,P(T > t \mid C = 1, K = 0) \\ &= (1-\phi)\,S_Y(t) + \phi, \end{aligned} \tag{5}$$
and
$$S_{T \mid K=1}(t) = P(T > t \mid K = 1) = 1. \tag{6}$$
Thus, from (5) and (6), we have that the survival function and the probability density function of the entire population are given, respectively, by
$$\begin{aligned} S_T(t) &= P(T > t, K = 0) + P(T > t, K = 1) \\ &= P(K = 0)\,P(T > t \mid K = 0) + P(K = 1)\,P(T > t \mid K = 1) \\ &= (1-\rho)\left[(1-\phi)\,S_Y(t) + \phi\right] + \rho, \end{aligned} \tag{7}$$
and
$$f_T(t) = -\frac{d}{dt} S_T(t) = (1-\rho)(1-\phi)\,f_Y(t), \tag{8}$$
where $\rho$ is the proportion of individuals who are known to be cured, $\phi$ is the proportion of cured individuals among those not known to be cured, and $S_Y(\cdot)$ and $f_Y(\cdot)$ are, respectively, the survival and probability density functions of susceptible individuals.
Note that if ρ = 0 , the model (7) reduces to the Berkson and Gage [2] mixture model (1). In addition, this model can also be characterized as a cure rate mixture model with competing risks [11], where the non-latent cure is considered a competing cause, and its time is assumed to be a degenerate variable at infinity.
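As a quick numerical check of Equations (4), (7) and (8), the following minimal Python sketch evaluates the population survival and density functions. It assumes a Weibull distribution for susceptible individuals purely for illustration; the function names and the parameter values are hypothetical and are not taken from the original analysis.

```python
import math

def weibull_survival(t, shape, scale):
    """S_Y(t) = exp(-(t / scale) ** shape) for a Weibull time to failure."""
    return math.exp(-((t / scale) ** shape))

def weibull_density(t, shape, scale):
    """f_Y(t) for the same Weibull parameterization."""
    return (shape / scale) * (t / scale) ** (shape - 1.0) * weibull_survival(t, shape, scale)

def population_survival(t, rho, phi, shape, scale):
    """Equation (7): (1 - rho) * [(1 - phi) * S_Y(t) + phi] + rho."""
    return (1.0 - rho) * ((1.0 - phi) * weibull_survival(t, shape, scale) + phi) + rho

def population_density(t, rho, phi, shape, scale):
    """Equation (8): (1 - rho) * (1 - phi) * f_Y(t)."""
    return (1.0 - rho) * (1.0 - phi) * weibull_density(t, shape, scale)

# Hypothetical values: 10% known cures, 12% latent cures among the rest.
rho, phi = 0.10, 0.12
mu = rho + (1.0 - rho) * phi  # Equation (4): overall cure probability

# For large t the survival function plateaus at the cure probability mu.
plateau = population_survival(1e6, rho, phi, shape=2.0, scale=5.0)
print(mu, plateau)
```

Setting `rho=0.0` in `population_survival` gives back the Berkson and Gage form (1), which is the reduction noted above.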

2.2. Likelihood Function

Assuming a non-informative right-censoring mechanism, the response of individual $i$ observed in the sample can be represented by the triple $(t_i, \delta_{1i}, \delta_{2i})$, where $t_i$ is the observed time of the $i$-th individual in the sample and $\delta_{1i}$ and $\delta_{2i}$ are the respective indicators of censoring and knowledge of the cure, $i = 1, 2, \ldots, n$. Here,
$$\delta_{1i} = \begin{cases} 0, & \text{if the observation } t_i \text{ is censored} \\ 1, & \text{if the observation } t_i \text{ is not censored}, \end{cases}$$
and
$$\delta_{2i} = \begin{cases} 0, & \text{if it is unknown that individual } i \text{ is cured in } t_i \\ 1, & \text{if it is known that individual } i \text{ is cured in } t_i. \end{cases}$$
The contribution of the $i$-th individual to the likelihood function is given by $P(K = 1) = \rho$, if it is known that individual $i$ is cured in $t_i$; $f_T(t_i) = (1-\rho)(1-\phi)\,f_Y(t_i; \theta)$, if it is unknown that individual $i$ is cured and the observation $t_i$ is not censored; and $P(T > t_i, K = 0) = (1-\rho)\left[(1-\phi)\,S_Y(t_i; \theta) + \phi\right]$, if it is unknown that individual $i$ is cured and the observation $t_i$ is censored. That is, for an observed value of $(t_i, \delta_{1i}, \delta_{2i})$, the contribution to the likelihood function is
$$\rho^{\delta_{2i}} \left\{(1-\rho)(1-\phi)\,f_Y(t_i; \theta)\right\}^{(1-\delta_{2i})\delta_{1i}} \left\{(1-\rho)\left[(1-\phi)\,S_Y(t_i; \theta) + \phi\right]\right\}^{(1-\delta_{2i})(1-\delta_{1i})}.$$
Thus, the likelihood function is given by
$$L(\rho, \phi, \theta; \boldsymbol{t}, \boldsymbol{\delta}_1, \boldsymbol{\delta}_2) \propto \prod_{i=1}^{n} \rho^{\delta_{2i}} \left\{(1-\rho)(1-\phi)\,f_Y(t_i; \theta)\right\}^{(1-\delta_{2i})\delta_{1i}} \times \left\{(1-\rho)\left[(1-\phi)\,S_Y(t_i; \theta) + \phi\right]\right\}^{(1-\delta_{2i})(1-\delta_{1i})}, \tag{11}$$
where $0 \le \rho \le 1$, $0 \le \phi \le 1$ and $\theta$ are the parameters to be estimated. Here, $S_Y(\cdot; \theta)$ and $f_Y(\cdot; \theta)$ are, respectively, the survival and probability density functions of susceptible (non-cured) individuals, and $\boldsymbol{t} = (t_1, t_2, \ldots, t_n)$ is the vector of times observed in the sample with their respective indicators of censoring $\boldsymbol{\delta}_1 = (\delta_{11}, \delta_{12}, \ldots, \delta_{1n})$ and knowledge of cure $\boldsymbol{\delta}_2 = (\delta_{21}, \delta_{22}, \ldots, \delta_{2n})$.
Note that when the distribution of susceptible individuals, $f_Y(\cdot; \theta)$, is identifiable, the likelihood function (11) is identifiable as well, except in cases where no observations with known cure status are present in the sample. In fact, when $\sum_{i=1}^{n} \delta_{2i} = 0$, the likelihood function (11) results in $(1-\rho)(1-\phi) \prod_{i=1}^{n} f_Y(t_i; \theta)$, which is not identifiable (the non-identifiability arises due to the permutation of the values of $\phi$ and $\rho$). In such scenarios, where no known cures are observed, the proposed model is not recommended, and it is more appropriate to consider the Berkson and Gage mixture model (1).

2.3. Regression Model

In this work, the proposal is to incorporate the covariates into the model through the probability of cure $\mu$, as presented in (4). Thus, considering the logit link function, the regression model is expressed by $\log\left(\frac{\mu}{1-\mu}\right) = X^{\top}\beta$, which results in
$$\mu = \frac{\exp\{X^{\top}\beta\}}{1 + \exp\{X^{\top}\beta\}}. \tag{12}$$
In (12), $X = (1, X_1, \ldots, X_k)$ is a vector of $k$ explanatory variables and $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ is its respective vector of regression coefficients.
The distribution of the random variable T was reparametrized in order to ensure that one of its new parameters represents the probability of cure. The proposed reparameterization is given by
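$$\mu = \rho + (1-\rho)\,\phi, \qquad \sigma = \frac{\rho}{(1-\rho)\,\phi}, \tag{13}$$
which results in
$$\phi = \frac{\mu}{\sigma(1-\mu) + 1}, \qquad \rho = \frac{\mu\sigma}{\sigma + 1}. \tag{14}$$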
In the new parameterization proposed in (13), μ is a parameter that corresponds to the probability of cure and σ represents the ratio between the odds of the known cure and the proportion of the latent cure. Large values of σ indicate a greater known cure in relation to latent cure.
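The mapping between $(\rho, \phi)$ and $(\mu, \sigma)$ in (13) and (14) can be checked with a short round trip. The sketch below is illustrative only: the function names are hypothetical, and it assumes $0 < \rho < 1$ and $0 < \phi \le 1$ so that $\sigma$ is well defined.

```python
def to_mu_sigma(rho, phi):
    """Equation (13): map (rho, phi) to (mu, sigma)."""
    mu = rho + (1.0 - rho) * phi
    sigma = rho / ((1.0 - rho) * phi)
    return mu, sigma

def to_rho_phi(mu, sigma):
    """Equation (14): map (mu, sigma) back to (rho, phi)."""
    rho = mu * sigma / (sigma + 1.0)
    phi = mu / (sigma * (1.0 - mu) + 1.0)
    return rho, phi

# Hypothetical values: 10% known cures, 12% latent cures among the rest.
mu, sigma = to_mu_sigma(0.10, 0.12)
print(mu, sigma)
print(to_rho_phi(mu, sigma))  # recovers (0.10, 0.12) up to rounding error
```

A value of `sigma` below 1 here reflects that the odds of known cure are smaller than the latent cure proportion, in line with the interpretation of $\sigma$ given above.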
From (11), (12) and (14), the likelihood function can be rewritten as
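$$L(\beta, \sigma, \theta; \boldsymbol{t}, \boldsymbol{x}, \boldsymbol{\delta}_1, \boldsymbol{\delta}_2) \propto \prod_{i=1}^{n} \left[\frac{\mu_i \sigma}{\sigma + 1}\right]^{\delta_{2i}} \left[(1-\mu_i)\,f_Y(t_i; \theta)\right]^{(1-\delta_{2i})\delta_{1i}} \left[(1-\mu_i)\,S_Y(t_i; \theta) + \frac{\mu_i}{\sigma + 1}\right]^{(1-\delta_{2i})(1-\delta_{1i})}. \tag{15}$$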
In (15), $\sigma > 0$, $\mu_i = \frac{\exp\{x_i^{\top}\beta\}}{1 + \exp\{x_i^{\top}\beta\}}$, and $S_Y(\cdot; \theta)$ and $f_Y(\cdot; \theta)$ are the survival and probability density functions of susceptible individuals, respectively. Here, $\boldsymbol{t} = (t_1, t_2, \ldots, t_n)$ is the vector of times observed in the sample with their respective indicators of censoring $\boldsymbol{\delta}_1 = (\delta_{11}, \delta_{12}, \ldots, \delta_{1n})$ and knowledge of the cure $\boldsymbol{\delta}_2 = (\delta_{21}, \delta_{22}, \ldots, \delta_{2n})$, $\theta$ is the vector of parameters of the distribution of susceptible individuals, and $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$ are the regression coefficients associated with the observed explanatory variables $x_i = (1, x_{i1}, \ldots, x_{ik})$, $i = 1, 2, \ldots, n$.
Applying the logarithm to the likelihood function (15), we get
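$$\ell(\beta, \sigma, \theta; \boldsymbol{t}, \boldsymbol{x}, \boldsymbol{\delta}_1, \boldsymbol{\delta}_2) = \sum_{i=1}^{n} \left\{ \delta_{2i} \log\left[\frac{\mu_i \sigma}{\sigma + 1}\right] + (1-\delta_{2i})\,\delta_{1i} \log\left[(1-\mu_i)\,f_Y(t_i; \theta)\right] + (1-\delta_{2i})(1-\delta_{1i}) \log\left[(1-\mu_i)\,S_Y(t_i; \theta) + \frac{\mu_i}{\sigma + 1}\right] \right\} + c, \tag{16}$$
where $c$ is a constant that does not depend on $\sigma$, $\theta$ and $\beta$.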
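To make the three types of contribution in the log-likelihood (16) concrete, the following Python sketch evaluates it (omitting the constant $c$) under the assumption that susceptible individuals follow a Weibull distribution with shape `shape` and scale `scale`. The function name and argument layout are hypothetical; this is not the authors' code.

```python
import math

def log_likelihood(beta, sigma, shape, scale, times, x, delta1, delta2):
    """Log-likelihood (16) without the constant c, Weibull susceptibles assumed.

    x is a list of covariate rows, each starting with 1 for the intercept.
    """
    total = 0.0
    for t, xi, d1, d2 in zip(times, x, delta1, delta2):
        eta = sum(b * v for b, v in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))            # logit link, Equation (12)
        surv = math.exp(-((t / scale) ** shape))      # S_Y(t; theta)
        dens = (shape / scale) * (t / scale) ** (shape - 1.0) * surv  # f_Y(t; theta)
        if d2 == 1:                                   # known cure
            total += math.log(mu * sigma / (sigma + 1.0))
        elif d1 == 1:                                 # observed failure
            total += math.log((1.0 - mu) * dens)
        else:                                         # censored, cure status unknown
            total += math.log((1.0 - mu) * surv + mu / (sigma + 1.0))
    return total

# Three hypothetical individuals: a known cure, a failure, and a censored case.
print(log_likelihood(
    beta=[0.0], sigma=1.0, shape=2.0, scale=5.0,
    times=[10.0, 5.0, 10.0], x=[[1.0], [1.0], [1.0]],
    delta1=[0, 1, 0], delta2=[1, 0, 0],
))
```

The branch order mirrors the exponents in (15): the known-cure term applies whenever $\delta_{2i} = 1$, and $\delta_{1i}$ only matters when $\delta_{2i} = 0$.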
The likelihood equation is given by
$$U(\vartheta) = \frac{\partial \ell(\vartheta)}{\partial \vartheta} = 0. \tag{17}$$
Thus, the value $\hat{\vartheta} = (\hat{\sigma}, \hat{\theta}, \hat{\beta})$ that satisfies Equation (17) is the maximum likelihood estimator of the model parameters. Under appropriate regularity conditions, it has asymptotically a multivariate normal distribution with mean $\vartheta$ and variance-covariance matrix given by
$$\Sigma(\hat{\vartheta}) = \left[ -\left. \frac{\partial^2 \ell(\vartheta)}{\partial \vartheta\, \partial \vartheta^{\top}} \right|_{\vartheta = \hat{\vartheta}} \right]^{-1} = \left[ J(\vartheta) \big|_{\vartheta = \hat{\vartheta}} \right]^{-1}. \tag{18}$$
The value of $\hat{\vartheta} = (\hat{\sigma}, \hat{\theta}, \hat{\beta})$ and the observed information matrix $J(\vartheta)$ can be obtained numerically with a Newton–Raphson-type optimization algorithm, which also provides an accurate numerical approximation of this matrix. From these results, it is possible to construct confidence intervals for the parameters and carry out significance tests on the model covariates.
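The mechanics of solving a likelihood equation of the form (17) and inverting the observed information as in (18) can be illustrated on a deliberately simple one-parameter case. The sketch below is a toy example, not the full model: it takes the factor of the likelihood (11) that involves $\rho$ alone, $\rho^{m}(1-\rho)^{n-m}$ with $m = \sum_i \delta_{2i}$, and applies Newton–Raphson with analytic score and information. The counts `m=30` and `n=100` are hypothetical.

```python
def newton_raphson_rho(m, n, start=0.2, tol=1e-10, max_iter=50):
    """Maximize l(rho) = m*log(rho) + (n - m)*log(1 - rho) by Newton-Raphson.

    Returns the estimate and the inverse observed information (its variance).
    """
    rho = start
    for _ in range(max_iter):
        score = m / rho - (n - m) / (1.0 - rho)           # U(rho)
        info = m / rho ** 2 + (n - m) / (1.0 - rho) ** 2  # J(rho) = -l''(rho)
        step = score / info
        rho += step
        if abs(step) < tol:
            break
    info = m / rho ** 2 + (n - m) / (1.0 - rho) ** 2
    return rho, 1.0 / info

rho_hat, var_hat = newton_raphson_rho(m=30, n=100)
print(rho_hat, var_hat ** 0.5)  # estimate and its asymptotic standard error
```

In this toy case the answer is available in closed form ($\hat{\rho} = m/n$ with variance $\hat{\rho}(1-\hat{\rho})/n$), which makes it a convenient check before moving to the multi-parameter problem, where the score and information are vectors and matrices.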

3. Results and Discussions

3.1. Simulation Study

This section describes a simulation study conducted to investigate whether the usual asymptotic properties of maximum likelihood estimators are present. The performance of the proposed model was also evaluated in the presence of censored data.
The proposed model was studied using data simulated in R software, version 4.3.3 [12]. The simulations included a dichotomous covariate, $x_1$, generated from a Bernoulli distribution with success probability $p = 0.5$, and a numerical covariate, $x_2$, generated from a standard normal distribution. These covariates were included in the probability of cure (4) through a logit link function, i.e., $\mu = \frac{\exp\{x^{\top}\beta\}}{1 + \exp\{x^{\top}\beta\}}$, where $x = (1, x_1, x_2)$ and $\beta = (\beta_0, \beta_1, \beta_2)$. For a fixed value of $\sigma$, the values of $\rho$ and $\phi$ are given by (14).
This simulation study considered that the time to failure ($U$) of susceptible individuals follows a Weibull distribution with shape parameter $\alpha$ and scale parameter $\lambda$. The Weibull distribution was chosen due to its popularity in modeling survival data. In addition, the time to censoring ($V$) follows a Weibull distribution with the same shape parameter $\alpha$ and scale parameter $\lambda_2 = \lambda \left(\frac{1-q}{q}\right)^{1/\alpha}$, where $q$ denotes the proportion of censoring. This result is based on the fact that $q = P(V < U) = \frac{\lambda^{\alpha}}{\lambda^{\alpha} + \lambda_2^{\alpha}}$ when $U \sim \mathrm{Weibull}(\alpha, \lambda)$ and $V \sim \mathrm{Weibull}(\alpha, \lambda_2)$. The survival times and their respective censoring indicators can be obtained following the steps of Algorithm 1.
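The relation between the censoring scale $\lambda_2$ and the censoring proportion $q$ can be verified by simulation. The following Python sketch is a hypothetical check (the paper's simulations were run in R); it uses `random.weibullvariate`, whose first argument is the scale and second the shape, and the function names are illustrative.

```python
import random

def censoring_scale(scale, shape, q):
    """lambda_2 = lambda * ((1 - q) / q) ** (1 / alpha), valid for 0 < q < 1."""
    return scale * ((1.0 - q) / q) ** (1.0 / shape)

def empirical_censoring_rate(shape, scale, q, n_draws=200_000, seed=1):
    """Monte Carlo estimate of P(V < U) for the implied censoring scale."""
    rng = random.Random(seed)
    scale2 = censoring_scale(scale, shape, q)
    censored = 0
    for _ in range(n_draws):
        u = rng.weibullvariate(scale, shape)    # time to failure
        v = rng.weibullvariate(scale2, shape)   # time to censoring
        censored += v < u
    return censored / n_draws

# The empirical rate should be close to the target q = 0.30.
print(empirical_censoring_rate(shape=2.0, scale=3.0, q=0.30))
```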
Algorithm 1: Obtaining the survival time of the proposed model
  1. Define the values of $\beta_0$, $\beta_1$, $\beta_2$, $\sigma$, $\alpha$, $\lambda$ and $q$ (censoring proportion) and set $i = 1$;
  2. Generate the covariates: $x_{1i} \sim \mathrm{Bernoulli}(0.5)$ and $x_{2i} \sim \mathrm{Normal}(0, 1)$;
  3. Calculate $\rho_i$ and $\phi_i$ from Equation (14);
  4. Generate $K_i \sim \mathrm{Bernoulli}(\rho_i)$ and $C_i \sim \mathrm{Bernoulli}(\phi_i)$;
  5. Generate the time to failure $u_i \sim \mathrm{Weibull}(\alpha, \lambda)$;
  6. Generate the time to censoring $v_i \sim \mathrm{Weibull}(\alpha, \lambda_2)$, where $\lambda_2 = \lambda \left(\frac{1-q}{q}\right)^{1/\alpha}$;
  7. Define the indicator of the failure of susceptible individuals, $R_i$: if $u_i \le v_i$, set $R_i = 1$; else, set $R_i = 0$;
  8. Define the censoring indicators: $\delta_{1i} = (1 - K_i)(1 - C_i) R_i$ and $\delta_{2i} = K_i$;
  9. Calculate the observed time $t_i = \delta_{1i} u_i + (1 - \delta_{1i}) v_i$;
  10. If $i < n$, set $i = i + 1$ and return to Step 2. Else, end the algorithm.
Note 1: In the case of no censoring ($q = 0$), the value of $\lambda_2$ in Step 6 can be specified so that $\lambda_2 \gg \lambda$. For example, $\lambda_2 = 10^5 \lambda$.
Note 2: Step 9 implies that the censored times of non-susceptible individuals whose cure is not known will be equal to $v_i$.
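The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative translation rather than the authors' R code: the function name and return layout are hypothetical, and it assumes that Step 3 evaluates Equation (14) at the individual cure probability $\mu_i$ given by the logit link, so that both $\rho_i$ and $\phi_i$ vary with the covariates.

```python
import math
import random

def simulate_sample(n, beta, sigma, shape, scale, q, seed=0):
    """Generate n rows (t, delta1, delta2, x1, x2) following Algorithm 1."""
    rng = random.Random(seed)
    # Step 6 scale; Note 1: with q = 0 use a censoring scale far above lambda.
    if q == 0:
        scale2 = 1e5 * scale
    else:
        scale2 = scale * ((1.0 - q) / q) ** (1.0 / shape)
    rows = []
    for _ in range(n):
        # Step 2: covariates.
        x1 = 1 if rng.random() < 0.5 else 0
        x2 = rng.gauss(0.0, 1.0)
        # Step 3: cure probability (logit link), then Equation (14).
        eta = beta[0] + beta[1] * x1 + beta[2] * x2
        mu = 1.0 / (1.0 + math.exp(-eta))
        rho = mu * sigma / (sigma + 1.0)
        phi = mu / (sigma * (1.0 - mu) + 1.0)
        # Step 4: known-cure and latent-cure indicators.
        k = 1 if rng.random() < rho else 0
        c = 1 if rng.random() < phi else 0
        # Steps 5 and 6: failure and censoring times (scale first, then shape).
        u = rng.weibullvariate(scale, shape)
        v = rng.weibullvariate(scale2, shape)
        # Steps 7 to 9: failure indicator, censoring indicators, observed time.
        r = 1 if u <= v else 0
        d1 = (1 - k) * (1 - c) * r
        d2 = k
        t = d1 * u + (1 - d1) * v
        rows.append((t, d1, d2, x1, x2))
    return rows

sample = simulate_sample(n=500, beta=(-2.0, 1.0, 0.5), sigma=1.0,
                         shape=2.0, scale=3.0, q=0.10, seed=123)
print(sum(row[2] for row in sample) / len(sample))  # share of known cures
```

Because $\delta_{1i}$ is multiplied by $(1 - K_i)$, a row can never have both $\delta_{1i} = 1$ and $\delta_{2i} = 1$, which matches the structure of the likelihood contributions.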
The survival time samples were simulated considering the values $\beta_1 = 1$, $\beta_2 = 0.5$, $\alpha = 2$, $\lambda = 3$ and several values of $\sigma$ and $\beta_0$. The values of $\sigma$ and $\beta_0$ were defined in order to vary the proportions of known and latent cures ($\rho$ and $\phi$, respectively). A total of 5 simulation scenarios were considered, as shown in Table 1.
The mean estimates, the mean square error (MSE), and the coverage probability (CP) of the estimators were obtained from 1000 Monte Carlo replicates, considering sample sizes n = 50, 100, 200, and 500 and censoring percentages equal to 0%, 10%, and 30% (i.e., censoring susceptible individuals).
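For reference, the three summary measures can be computed from the Monte Carlo output as in the short Python sketch below. The function name and the toy inputs are hypothetical; in the study each list would hold 1000 replicates for one parameter.

```python
def summarize_replicates(estimates, intervals, true_value):
    """Return the mean estimate, MSE and coverage probability of a parameter."""
    n = len(estimates)
    mean_estimate = sum(estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    # Share of confidence intervals that contain the true value.
    coverage = sum(lo <= true_value <= hi for lo, hi in intervals) / n
    return mean_estimate, mse, coverage

# Toy input with three replicates and true value 1.0.
print(summarize_replicates(
    estimates=[1.0, 1.2, 0.8],
    intervals=[(0.5, 1.5), (1.1, 1.3), (0.6, 1.0)],
    true_value=1.0,
))
```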
For the construction of confidence intervals (CI) for the calculation of the CP, a confidence level of 95% was considered. In addition, since σ and the parameters α and λ of the Weibull distribution are positive, a logarithmic transformation was applied for constructing the CIs of these parameters.
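One common way to build such an interval is to form a Wald interval for the logarithm of the parameter and transform the limits back. The Python sketch below assumes this delta-method construction, which may differ in detail from the one used in the study; the function name and the numerical inputs are hypothetical.

```python
import math

def log_wald_ci(estimate, std_error, z=1.959963984540054):
    """95% CI for a positive parameter, built on the log scale.

    By the delta method, the standard error of log(estimate) is
    approximately std_error / estimate.
    """
    half_width = z * std_error / estimate
    return estimate * math.exp(-half_width), estimate * math.exp(half_width)

# Hypothetical estimate of sigma with its standard error.
lower, upper = log_wald_ci(estimate=1.0, std_error=0.2)
print(lower, upper)
```

Unlike the untransformed interval `estimate ± z * std_error`, the limits produced this way are always positive, which respects the parameter space of $\sigma$, $\alpha$ and $\lambda$.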
The results for Scenario 2 are shown in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6, in which histograms of the estimates are plotted with normal distribution curves superimposed, along with the values of the average, mean square error (MSE) and coverage probability (CP). These results provide evidence of the asymptotic normality of the estimators, regardless of the censoring percentage.
When the censoring percentage is equal to zero, the estimators have excellent statistical properties. The parameter estimates exhibit small bias and low variance, indicating consistency. In addition, the CPs remain close to the nominal confidence level, regardless of the parameter and sample size. The largest difference observed was 0.0440 (0.9940–0.9500), for the σ parameter with a sample size of n = 50.
In the presence of censored observations, the deviations of the estimates from the true parameter values increase, and this effect is more pronounced as the percentage of censoring increases. Consequently, the coverage probability tends to move away from the nominal confidence level. However, this behavior is to be expected in censored contexts. It is important to note that as the sample size increases, even under high censoring percentages, the estimators once again show desirable properties. For example, for n = 500 and 30% censoring, the largest difference observed was 0.0660 (1.066–1.000) in the average, 0.1035 in the MSE, and 0.0150 (0.9650–0.9500) in the CP, all three associated with the parameter $\sigma$.
Finally, the results showed evidence of the asymptotic normality of the estimators for all parameters and censorship percentages, as evidenced by the fit of the normal curves superimposed on the histograms, which improves with increasing sample size. This behavior justifies the use of normal approximations for constructing confidence intervals and testing hypotheses about the model parameters, even in scenarios with censoring. The results of the other scenarios were similar to those observed in Scenario 2. The numerical results for these scenarios are shown in Table 2.

3.2. An Illustrative Example

This section illustrates the application of the proposed model to an artificial dataset of clients who took out a loan at a financial institution. Artificial data were used because of the high commercial value of credit risk data and because legal restrictions prevent the disclosure of sensitive customer information. Given these limitations, the existing literature was consulted to identify the variables most frequently used to classify low- and high-risk applicants in credit risk analysis. Based on this investigation, the following explanatory variables were selected: (1) credit limit, the maximum amount of credit a lender authorizes for each customer (log scale); (2) gender (female and male); (3) social class, coded into 5 levels (class A to class E); (4) marital status, coded into 3 levels (single, married, and widowed/separated); and (5) age in years.
Based on these variables, a simulated database was created with 1000 observations, also including the time (in months) until the occurrence of default. Credit risk studies generally report high censoring rates, resulting in a low number of registered defaults. In this context, this study adopted a censoring rate of 50% in order to approximate the real data reported in the literature in studies such as [13]. This strategy allowed the simulated database to represent more faithfully the characteristics observed in real credit risk analysis scenarios.
In this study, the explanatory variables were generated as follows: credit limit by a lognormal distribution, gender by a Bernoulli distribution, social classification by a multinomial distribution with five categories, marital status by a multinomial distribution with three categories, and age by a Poisson distribution. Survival times were generated using Algorithm 1 described in Section 3.1. In addition, the time values were truncated at $t = 60$. In these cases, the indicators of censoring and knowledge of cure were defined as $\delta_1 = 0$ and $\delta_2 = 1$, respectively. This procedure was adopted to represent loans with terms up to 60 months. Here, $\delta_2 = 1$ indicates that default will not occur (an individual who is known to be cured), either because they have paid off the loan early or because they have completed the entire loan period without missing a payment.
The simulated dataset is provided in Table S1 of the Supplementary Materials. The sample exhibited a censoring rate of 54.4%, of which 44.0% correspond to individuals known to be cured ($\delta_2 = 1$).
The proposed model was fitted to the application data considering that the time to default of susceptible individuals follows a Weibull distribution with shape parameter $\alpha$ and scale parameter $\lambda$, and the covariates were included in the probability of cure (4) through a logit link function. The maximum likelihood point estimates of the parameters, with their respective confidence intervals, are shown in Table 3.
The results in Table 3 indicate that, except for age, all covariates significantly influence the cure at the 5% significance level. The probability of cure for the clients was calculated using (12), based on the obtained estimates and considering all covariates. This probability of cure was used to classify clients as having a low or high risk of default. The cutoff point considered was 0.456, that is, a customer is classified as having a high risk of default if their probability of cure satisfies $\mu < 0.456$ (or classified as low risk if $\mu \ge 0.456$).
The definition of the cutoff point took into account the proportion of defaulting customers in the sample. In the observed sample, a customer is considered to be in default when $\delta_1 = 1$, i.e., when the time to default is observed rather than censored. This cutoff point resulted in correct predictions, sensitivity (probability of the model classifying a good customer as low risk) and specificity (probability of the model classifying a defaulting customer as high risk) of 70.3%, 56.4%, and 82.0%, respectively, indicating a good classification of customers, particularly the defaulting ones. In fact, the cutoff point can be adjusted to increase or decrease sensitivity or specificity in order to control the error that the financial institution considers to be the most critical.
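The classification rule and the three performance measures can be written compactly. The Python sketch below follows the definitions given in the text (sensitivity refers to good customers classified as low risk, specificity to defaulting customers classified as high risk); the function name and the six toy customers are hypothetical and unrelated to the dataset in Table S1.

```python
def classification_summary(cure_probs, defaulted, cutoff):
    """Accuracy, sensitivity and specificity of the cutoff rule on mu."""
    good_low = good_high = bad_low = bad_high = 0
    for mu, is_default in zip(cure_probs, defaulted):
        low_risk = mu >= cutoff            # low risk when mu >= cutoff
        if is_default:
            if low_risk:
                bad_low += 1
            else:
                bad_high += 1
        else:
            if low_risk:
                good_low += 1
            else:
                good_high += 1
    total = good_low + good_high + bad_low + bad_high
    return {
        "accuracy": (good_low + bad_high) / total,
        "sensitivity": good_low / (good_low + good_high),
        "specificity": bad_high / (bad_high + bad_low),
    }

# Toy example: estimated cure probabilities and observed default status.
print(classification_summary(
    cure_probs=[0.9, 0.6, 0.3, 0.2, 0.5, 0.1],
    defaulted=[0, 0, 0, 1, 1, 1],
    cutoff=0.456,
))
```

Raising `cutoff` moves customers from the low-risk to the high-risk group, which increases specificity at the expense of sensitivity.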
For benchmarking purposes, a logistic regression model and the Weibull Berkson and Gage mixture model were used to classify customers in this data set. The results of the logistic regression (correct predictions of 71.7%, sensitivity of 56.4%, and specificity of 82.4%) and the Berkson and Gage model (correct predictions of 70.2%, sensitivity of 54.4% and specificity of 83.5%) were similar to those of the proposed model. Figure 7 shows the probabilities of cure estimated by the proposed model, logistic regression, and Berkson and Gage mixture model, indicating agreement among these three methodologies.
The results show that the model proposed in this study achieved performance similar to that of the benchmark models. Under a naive comparison, the overall accuracy rate of the proposed model was 70.3%, while the logistic regression and the Berkson and Gage mixture models achieved rates of 71.7% and 70.2%, respectively.
However, it is important to note that, despite the similarity in results to the logistic regression model, the proposed model also accounts for the time until default for susceptible clients. Moreover, the logistic model is not fully appropriate in this context, as there are latent cured (non-defaulting) clients for whom it is not known with certainty whether they are truly cured. In addition, as expected, the Weibull mixture model underestimates the cure probabilities. This occurs because the Berkson and Gage model ignores the presence of known cures, potentially considering some clients who are clearly cured as non-cured individuals.

4. Conclusions

This work presents an extension of the Berkson and Gage mixture model [2] that accommodates both latent and non-latent cure fractions. This model can be viewed as a cure rate mixture model with competing risks, considering the non-latent cure as a competing cause.
The proposed model was reparameterized in terms of the cure probability, with a regression structure attached to this parameter. Furthermore, a simulation study was conducted considering a Weibull distribution for the time to the event of susceptible (non-cured) individuals. The results of these simulated data provide evidence of the asymptotic properties of the estimators.
The proposed model is illustrated using a synthetic dataset of customers who have taken out loans from a financial institution, and its performance is compared with that of logistic regression and the standard Berkson and Gage mixture model. The results show that the proposed model not only achieves competitive accuracy but also offers important conceptual and practical advantages, since the logistic model ignores the time-to-event distribution and the Berkson and Gage model neglects known cure information when such data are available.
It is important to note that any other probability distribution could be used to model the time to event of susceptible individuals. Moreover, a regression structure, such as a proportional hazards or accelerated failure time model, can also be employed to incorporate covariates into the modeling of this time. In addition, future work could consider informative censoring mechanisms, which are commonly encountered in credit risk modeling.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/stats8030082/s1, Table S1. Artificial dataset of 1000 clients who take a loan in a financial institution.

Author Contributions

Conceptualization, E.Y.N., F.M.A. and M.R.P.C.; methodology, E.Y.N., F.M.A. and M.R.P.C.; software, E.Y.N., F.M.A. and M.R.P.C.; validation, E.Y.N., F.M.A. and M.R.P.C.; formal analysis, E.Y.N., F.M.A. and M.R.P.C.; investigation, E.Y.N., F.M.A. and M.R.P.C.; resources, E.Y.N., F.M.A. and M.R.P.C.; data curation, E.Y.N. and M.R.P.C.; writing—original draft preparation, E.Y.N., F.M.A. and M.R.P.C.; writing—review and editing, E.Y.N., F.M.A. and M.R.P.C.; visualization, E.Y.N., F.M.A. and M.R.P.C.; supervision, E.Y.N.; project administration, E.Y.N.; funding acquisition, E.Y.N. and F.M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brazil (CAPES)—Finance Code 001, National Council for Scientific and Technological Development (CNPq), Editais de Auxílio Financeiro DPI/DPG/UnB, DPI/DPG/BCE/UnB, and PPGEST/UnB.

Data Availability Statement

The data presented in this study are available in the Supplementary Materials. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, M.H.; Ibrahim, J.G.; Sinha, D. A new Bayesian model for survival data with a surviving fraction. J. Am. Stat. Assoc. 1999, 94, 909–919.
  2. Berkson, J.; Gage, R.P. Survival curve for cancer patients following treatment. J. Am. Stat. Assoc. 1952, 47, 501–515.
  3. Yakovlev, A.Y.; Tsodikov, A.D. Stochastic Models of Tumor Latency and Their Biostatistical Applications; World Scientific: Singapore, 1996.
  4. Oliveira, M.R.; Moreira, F.; Louzada, F. The zero-inflated promotion cure rate model applied to financial data on time-to-default. Cogent Econ. Financ. 2017, 5, 1395950.
  5. Chen, T.; Du, P. Promotion time cure rate model with nonparametric form of covariate effects. Stat. Med. 2018, 37, 1625–1635.
  6. Gómez, Y.M.; Gallardo, D.I.; Bourguignon, M.; Bertolli, E.; Calsavara, V.F. A general class of promotion time cure rate models with a new biological interpretation. Lifetime Data Anal. 2023, 29, 66–86.
  7. Leão, J.; Leiva, V.; Saulo, H.; Tomazella, V. Incorporation of frailties into a cure rate regression model and its diagnostics and application to melanoma data. Stat. Med. 2018, 37, 4421–4440.
  8. Cancho, V.G.; Barriga, G.; Leão, J.; Saulo, H. Survival model induced by discrete frailty for modeling of lifetime data with long-term survivors and change-point. Commun. Stat.-Theory Methods 2021, 50, 1161–1172.
  9. Maller, R.A.; Zhou, X. Survival Analysis with Long-Term Survivors; Wiley: Hoboken, NJ, USA, 1996.
  10. Rodrigues, J.; Cancho, V.G.; de Castro, M.; Louzada-Neto, F. On the unification of long-term survival models. Stat. Probab. Lett. 2009, 79, 753–759.
  11. Larson, M.G.; Dinse, G.E. A Mixture Model for the Regression Analysis of Competing Risks Data. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1985, 34, 201–211.
  12. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2024; Available online: https://www.R-project.org/ (accessed on 1 September 2025).
  13. Dirick, G.C.L.; Baesens, B. Time to default in credit scoring using survival analysis: A benchmark study. J. Oper. Res. Soc. 2017, 68, 652–665.
Figure 1. Empirical distribution of the estimates of σ from 1000 Monte Carlo replications with the fitted normal curve (Scenario 2).
Figure 2. Empirical distribution of the estimates of $\beta_0$ from 1000 Monte Carlo replications with the fitted normal curve (Scenario 2).
Figure 3. Empirical distribution of the estimates of $\beta_1$ from 1000 Monte Carlo replications with the fitted normal curve (Scenario 2).
Figure 4. Empirical distribution of the estimates of $\beta_2$ from 1000 Monte Carlo replications with the fitted normal curve (Scenario 2).
Figure 5. Empirical distribution of the estimates of α from 1000 Monte Carlo replications with the fitted normal curve (Scenario 2).
Figure 6. Empirical distribution of the estimates of λ from 1000 Monte Carlo replications with the fitted normal curve (Scenario 2).
Figure 7. Probabilities of cure estimated by the proposed model, logistic regression model, and Berkson and Gage mixture model.
Table 1. Scenarios used in the simulations.

| Scenario | σ | $\beta_0$ | $\beta_1$ | $\beta_2$ | α | λ | $\tilde{\phi}$ | $\tilde{\rho}$ |
|---|---|---|---|---|---|---|---|---|
| $S_1$ | 1.0 | −3.0 | 1.0 | 0.5 | 2.0 | 5.0 | 5% | 5% |
| $S_2$ | 1.0 | −2.0 | 1.0 | 0.5 | 2.0 | 5.0 | 12% | 10% |
| $S_3$ | 1.5 | −0.5 | 1.0 | 0.5 | 2.0 | 5.0 | 31% | 30% |
| $S_4$ | 3.0 | −1.0 | 1.0 | 0.5 | 2.0 | 5.0 | 15% | 30% |
| $S_5$ | 0.5 | −1.0 | 1.0 | 0.5 | 2.0 | 5.0 | 30% | 13% |

Note: $\tilde{\phi}$ and $\tilde{\rho}$ are, respectively, the expected percentages of latent and non-latent cures under each scenario.
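Table 2. Average of estimates, MSE, and probability of coverage (CP) of parameters considering the simulation scenarios and different sample sizes (n) and censoring percentages.

| Censoring | Parameter | Scenario | Average (n = 50) | MSE (n = 50) | CP (n = 50) | Average (n = 100) | MSE (n = 100) | CP (n = 100) | Average (n = 200) | MSE (n = 200) | CP (n = 200) | Average (n = 500) | MSE (n = 500) | CP (n = 500) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0% | σ | $S_1$ | 1.3710 | 1.8469 | 0.9880 | 1.2491 | 1.2125 | 0.9960 | 1.1354 | 0.3518 | 0.9710 | 1.0540 | 0.1083 | 0.9580 |
| 0% | σ | $S_2$ | 1.2335 | 0.9216 | 0.9940 | 1.1100 | 0.3526 | 0.9670 | 1.0693 | 0.1221 | 0.9780 | 1.0257 | 0.0398 | 0.9620 |
| 0% | σ | $S_3$ | 1.6732 | 0.5973 | 0.9580 | 1.5933 | 0.2629 | 0.9470 | 1.5620 | 0.1094 | 0.9690 | 1.5260 | 0.0394 | 0.9580 |
| 0% | σ | $S_4$ | 3.4201 | 5.4567 | 0.9790 | 3.2404 | 1.7302 | 0.9680 | 3.1478 | 0.7116 | 0.9760 | 3.0662 | 0.2258 | 0.9740 |
| 0% | σ | $S_5$ | 0.5795 | 0.1021 | 0.9620 | 0.5305 | 0.0383 | 0.9520 | 0.5102 | 0.0158 | 0.9660 | 0.5045 | 0.0062 | 0.9440 |
| 0% | $\beta_0$ | $S_1$ | −3.3681 | 6.5650 | 0.9340 | −3.1403 | 0.5361 | 0.9750 | −3.0952 | 0.2522 | 0.9730 | −3.0424 | 0.0866 | 0.9550 |
| 0% | $\beta_0$ | $S_2$ | −2.1772 | 0.7581 | 0.9660 | −2.0766 | 0.1884 | 0.9750 | −2.0472 | 0.1121 | 0.9480 | −2.0179 | 0.0408 | 0.9490 |
| 0% | $\beta_0$ | $S_3$ | −0.5182 | 0.1624 | 0.9680 | −0.5022 | 0.0797 | 0.9580 | −0.5099 | 0.0403 | 0.9600 | −0.5064 | 0.0159 | 0.9650 |
| 0% | $\beta_0$ | $S_4$ | −1.0742 | 0.2131 | 0.9720 | −1.025 | 0.0960 | 0.9610 | −1.0296 | 0.0515 | 0.9560 | −1.0145 | 0.0212 | 0.9530 |
| 0% | $\beta_0$ | $S_5$ | −1.0555 | 0.1929 | 0.9820 | −1.0109 | 0.0906 | 0.9660 | −1.0209 | 0.0480 | 0.9560 | −1.0013 | 0.0198 | 0.9590 |
| 0% | $\beta_1$ | $S_1$ | 0.9545 | 4.2504 | 0.9920 | 1.0208 | 0.6388 | 0.9860 | 1.0219 | 0.3278 | 0.9700 | 1.0152 | 0.1103 | 0.9700 |
| 0% | $\beta_1$ | $S_2$ | 1.1378 | 1.0018 | 0.9770 | 1.0401 | 0.2696 | 0.9750 | 1.0289 | 0.1562 | 0.9540 | 1.0101 | 0.0599 | 0.9530 |
| 0% | $\beta_1$ | $S_3$ | 1.0426 | 0.3891 | 0.9630 | 1.0202 | 0.1912 | 0.9580 | 1.0273 | 0.0954 | 0.9490 | 1.0164 | 0.0368 | 0.9420 |
| 0% | $\beta_1$ | $S_4$ | 1.0947 | 0.4374 | 0.9590 | 1.0325 | 0.2171 | 0.9500 | 1.0354 | 0.0997 | 0.9470 | 1.0191 | 0.0407 | 0.9400 |
| 0% | $\beta_1$ | $S_5$ | 1.0430 | 0.3964 | 0.9660 | 1.0093 | 0.1981 | 0.9520 | 1.0390 | 0.0924 | 0.9540 | 1.0056 | 0.0379 | 0.9470 |
| 0% | $\beta_2$ | $S_1$ | 0.5596 | 0.5837 | 0.9810 | 0.5609 | 0.2174 | 0.9600 | 0.5144 | 0.0737 | 0.9580 | 0.5186 | 0.0304 | 0.9480 |
| 0% | $\beta_2$ | $S_2$ | 0.5572 | 0.3123 | 0.9760 | 0.5318 | 0.0973 | 0.9540 | 0.5134 | 0.0433 | 0.9540 | 0.5133 | 0.0158 | 0.9540 |
| 0% | $\beta_2$ | $S_3$ | 0.5473 | 0.1935 | 0.9650 | 0.5380 | 0.0622 | 0.9530 | 0.5261 | 0.0295 | 0.9490 | 0.5107 | 0.0103 | 0.9510 |
| 0% | $\beta_2$ | $S_4$ | 0.5666 | 0.2074 | 0.9710 | 0.5275 | 0.0672 | 0.9450 | 0.5227 | 0.0299 | 0.9550 | 0.5107 | 0.0110 | 0.9530 |
| 0% | $\beta_2$ | $S_5$ | 0.5556 | 0.1980 | 0.9710 | 0.5400 | 0.0607 | 0.9660 | 0.5205 | 0.0313 | 0.9480 | 0.5107 | 0.0105 | 0.9530 |
| 0% | α | $S_1$ | 2.0725 | 0.0653 | 0.9470 | 2.0363 | 0.0283 | 0.9540 | 2.0172 | 0.0156 | 0.9370 | 2.0061 | 0.0053 | 0.9450 |
| 0% | α | $S_2$ | 2.0763 | 0.0722 | 0.9460 | 2.0469 | 0.0331 | 0.9490 | 2.0235 | 0.0172 | 0.9510 | 2.0075 | 0.0061 | 0.9500 |
| 0% | α | $S_3$ | 2.1217 | 0.1284 | 0.9320 | 2.0664 | 0.0568 | 0.9410 | 2.0349 | 0.0267 | 0.9370 | 2.0172 | 0.0093 | 0.9500 |
| 0% | α | $S_4$ | 2.1047 | 0.0975 | 0.9350 | 2.0543 | 0.0459 | 0.9420 | 2.0312 | 0.021 | 0.9500 | 2.0121 | 0.0080 | 0.9500 |
| 0% | α | $S_5$ | 2.1186 | 0.1076 | 0.9340 | 2.0651 | 0.0447 | 0.9450 | 2.0308 | 0.0203 | 0.9470 | 2.0126 | 0.0078 | 0.9610 |
| 0% | λ | $S_1$ | 5.0184 | 0.1564 | 0.9400 | 5.0157 | 0.0728 | 0.9490 | 5.0084 | 0.0391 | 0.9450 | 5.0048 | 0.0146 | 0.9580 |
| 0% | λ | $S_2$ | 5.0175 | 0.1547 | 0.9480 | 5.0218 | 0.0793 | 0.9510 | 5.0135 | 0.0421 | 0.9490 | 5.0087 | 0.0162 | 0.9560 |
| 0% | λ | $S_3$ | 5.0137 | 0.2481 | 0.9410 | 5.0318 | 0.1307 | 0.9530 | 5.0133 | 0.0607 | 0.9510 | 5.0088 | 0.0274 | 0.9500 |
| 0% | λ | $S_4$ | 5.0272 | 0.2071 | 0.9480 | 5.0150 | 0.1125 | 0.9460 | 5.0173 | 0.0513 | 0.9550 | 5.0040 | 0.0225 | 0.9470 |
| 0% | λ | $S_5$ | 5.0245 | 0.1903 | 0.9500 | 5.0297 | 0.0989 | 0.9520 | 5.0147 | 0.0522 | 0.9460 | 5.0084 | 0.0218 | 0.9540 |
| 10% | σ | $S_1$ | 1.2154 | 1.5844 | 0.9970 | 1.1788 | 0.9440 | 0.9940 | 1.1843 | 0.6120 | 0.9820 | 1.0666 | 0.1404 | 0.9630 |
| 10% | σ | $S_2$ | 1.2099 | 0.8959 | 0.9940 | 1.1414 | 0.4058 | 0.9750 | 1.0744 | 0.1500 | 0.9720 | 1.0351 | 0.0488 | 0.9690 |
| 10% | σ | $S_3$ | 1.6926 | 0.6867 | 0.9810 | 1.6120 | 0.2889 | 0.9540 | 1.5597 | 0.1193 | 0.9660 | 1.5353 | 0.0459 | 0.9550 |
| 10% | σ | $S_4$ | 3.3084 | 4.3520 | 0.9740 | 3.2556 | 2.1547 | 0.9740 | 3.1729 | 0.8767 | 0.9750 | 3.0969 | 0.2928 | 0.9710 |
| 10% | σ | $S_5$ | 0.5911 | 0.1344 | 0.9640 | 0.5345 | 0.0414 | 0.9580 | 0.5121 | 0.0162 | 0.9580 | 0.5062 | 0.0068 | 0.9450 |
| 10% | $\beta_0$ | $S_1$ | −3.2250 | 6.5393 | 0.9450 | −3.1851 | 1.3014 | 0.9700 | −3.1135 | 0.3746 | 0.9690 | −3.0493 | 0.1088 | 0.9590 |
| 10% | $\beta_0$ | $S_2$ | −2.1176 | 0.6513 | 0.9690 | −2.1042 | 0.2759 | 0.9720 | −2.0558 | 0.1312 | 0.9540 | −2.0192 | 0.0482 | 0.9530 |
| 10% | $\beta_0$ | $S_3$ | −0.5378 | 0.1959 | 0.9640 | −0.5133 | 0.0925 | 0.9580 | −0.5138 | 0.0455 | 0.9620 | −0.5096 | 0.0181 | 0.9640 |
| 10% | $\beta_0$ | $S_4$ | −1.0401 | 0.2475 | 0.9650 | −1.0198 | 0.1115 | 0.9490 | −1.0231 | 0.0568 | 0.9540 | −1.0089 | 0.0215 | 0.9570 |
| 10% | $\beta_0$ | $S_5$ | −1.0707 | 0.2742 | 0.9810 | −1.0300 | 0.1091 | 0.9630 | −1.017 | 0.0563 | 0.9630 | −1.0091 | 0.0235 | 0.9570 |
| 10% | $\beta_1$ | $S_1$ | 0.9581 | 4.7612 | 0.9870 | 1.0900 | 1.4194 | 0.9860 | 1.0435 | 0.4529 | 0.9730 | 1.0066 | 0.1391 | 0.9720 |
| 10% | $\beta_1$ | $S_2$ | 1.0661 | 1.0329 | 0.9810 | 1.0544 | 0.3605 | 0.9690 | 1.0336 | 0.1756 | 0.9660 | 1.0103 | 0.0673 | 0.9510 |
| 10% | $\beta_1$ | $S_3$ | 1.0818 | 0.4512 | 0.9510 | 1.0378 | 0.2198 | 0.9520 | 1.0309 | 0.1042 | 0.9440 | 1.0199 | 0.0403 | 0.9540 |
| 10% | $\beta_1$ | $S_4$ | 1.0739 | 0.4803 | 0.9560 | 1.0396 | 0.2389 | 0.9520 | 1.0231 | 0.1100 | 0.9480 | 1.0146 | 0.0425 | 0.9560 |
| 10% | $\beta_1$ | $S_5$ | 1.0751 | 0.5271 | 0.9660 | 1.0301 | 0.2334 | 0.9590 | 1.0290 | 0.1079 | 0.9550 | 1.0101 | 0.0432 | 0.9560 |
| 10% | $\beta_2$ | $S_1$ | 0.5797 | 0.9074 | 0.9760 | 0.5615 | 0.2597 | 0.9630 | 0.5217 | 0.0918 | 0.9540 | 0.5121 | 0.0371 | 0.9560 |
| 10% | $\beta_2$ | $S_2$ | 0.5767 | 0.3737 | 0.9740 | 0.5536 | 0.1230 | 0.9510 | 0.5154 | 0.0466 | 0.9620 | 0.5176 | 0.0182 | 0.9490 |
| 10% | $\beta_2$ | $S_3$ | 0.5850 | 0.2265 | 0.9670 | 0.5439 | 0.0717 | 0.9600 | 0.5257 | 0.0328 | 0.9510 | 0.5115 | 0.0117 | 0.9590 |
| 10% | $\beta_2$ | $S_4$ | 0.5957 | 0.2584 | 0.9620 | 0.5392 | 0.0746 | 0.9560 | 0.5286 | 0.0330 | 0.9590 | 0.5087 | 0.0119 | 0.9570 |
| 10% | $\beta_2$ | $S_5$ | 0.6122 | 0.2545 | 0.9730 | 0.5385 | 0.0751 | 0.9620 | 0.5273 | 0.0347 | 0.9560 | 0.5143 | 0.0124 | 0.9590 |
| 10% | α | $S_1$ | 2.0870 | 0.0716 | 0.9510 | 2.0410 | 0.0330 | 0.9560 | 2.0185 | 0.0168 | 0.9380 | 2.0052 | 0.0062 | 0.9490 |
| 10% | α | $S_2$ | 2.0886 | 0.0884 | 0.9390 | 2.0458 | 0.0383 | 0.9490 | 2.0224 | 0.0196 | 0.9460 | 2.0068 | 0.0074 | 0.9470 |
| 10% | α | $S_3$ | 2.1321 | 0.1642 | 0.9200 | 2.0728 | 0.0671 | 0.9400 | 2.0414 | 0.0309 | 0.9420 | 2.0170 | 0.0112 | 0.9550 |
| 10% | α | $S_4$ | 2.1300 | 0.1407 | 0.9190 | 2.0577 | 0.0526 | 0.9430 | 2.0280 | 0.0248 | 0.9420 | 2.0141 | 0.0100 | 0.9520 |
| 10% | α | $S_5$ | 2.1333 | 0.1357 | 0.9240 | 2.0682 | 0.0529 | 0.9390 | 2.0377 | 0.0252 | 0.9500 | 2.0145 | 0.0093 | 0.9490 |
| 10% | λ | $S_1$ | 4.9806 | 0.1624 | 0.9400 | 5.0018 | 0.0836 | 0.9490 | 5.0068 | 0.0422 | 0.9580 | 5.0029 | 0.0165 | 0.9550 |
| 10% | λ | $S_2$ | 4.9976 | 0.1936 | 0.9400 | 5.0233 | 0.1021 | 0.9430 | 5.0153 | 0.0497 | 0.9540 | 5.0097 | 0.0192 | 0.9530 |
| 10% | λ | $S_3$ | 5.0235 | 0.2918 | 0.9430 | 5.0168 | 0.1504 | 0.9580 | 5.0196 | 0.0782 | 0.9400 | 5.0178 | 0.0337 | 0.9470 |
| 10% | λ | $S_4$ | 5.0132 | 0.2472 | 0.9340 | 5.0125 | 0.1197 | 0.9550 | 5.0122 | 0.0648 | 0.9500 | 5.0043 | 0.0254 | 0.9530 |
| 10% | λ | $S_5$ | 5.0442 | 0.2466 | 0.9450 | 5.0295 | 0.1313 | 0.9400 | 5.0155 | 0.0628 | 0.9510 | 5.0100 | 0.0275 | 0.9480 |
| 30% | σ | $S_1$ | 1.1062 | 2.2105 | 0.9710 | 1.0792 | 1.1345 | 0.9690 | 1.1302 | 0.7752 | 0.9680 | 1.1337 | 0.3702 | 0.9680 |
| 30% | σ | $S_2$ | 1.1656 | 0.9523 | 0.9800 | 1.1667 | 0.7038 | 0.9710 | 1.1164 | 0.3507 | 0.9720 | 1.0660 | 0.1035 | 0.9650 |
| 30% | σ | $S_3$ | 1.7216 | 0.9721 | 0.9750 | 1.6745 | 0.6094 | 0.9630 | 1.5967 | 0.1942 | 0.9670 | 1.5572 | 0.0723 | 0.9600 |
| 30% | σ | $S_4$ | 3.0242 | 4.1212 | 0.9450 | 3.2559 | 3.8182 | 0.9670 | 3.2553 | 1.9640 | 0.9730 | 3.1657 | 0.7049 | 0.9730 |
| 30% | σ | $S_5$ | 0.6328 | 0.2201 | 0.9770 | 0.5271 | 0.0877 | 0.9660 | 0.5232 | 0.0240 | 0.9700 | 0.5100 | 0.0093 | 0.9410 |
| 30% | $\beta_0$ | $S_1$ | −3.5985 | 16.9135 | 0.9000 | −3.2110 | 3.6533 | 0.9430 | −3.0845 | 0.5922 | 0.9590 | −3.0518 | 0.1701 | 0.9600 |
| 30% | $\beta_0$ | $S_2$ | −2.1456 | 1.5666 | 0.9560 | −2.1176 | 0.7313 | 0.9510 | −2.0741 | 0.2218 | 0.9590 | −2.0260 | 0.0717 | 0.9510 |
| 30% | $\beta_0$ | $S_3$ | −0.4919 | 0.3014 | 0.9500 | −0.5270 | 0.1350 | 0.9540 | −0.5131 | 0.0617 | 0.9610 | −0.5139 | 0.0243 | 0.9650 |
| 30% | $\beta_0$ | $S_4$ | −0.9959 | 0.3337 | 0.9550 | −1.0086 | 0.1298 | 0.9620 | −1.0243 | 0.0709 | 0.9600 | −1.0120 | 0.0273 | 0.9600 |
| 30% | $\beta_0$ | $S_5$ | −1.0756 | 0.4171 | 0.9670 | −1.0774 | 0.2440 | 0.9520 | −1.0309 | 0.1026 | 0.9650 | −1.0173 | 0.0414 | 0.9490 |
| 30% | $\beta_1$ | $S_1$ | 0.8974 | 11.2328 | 0.9760 | 1.0723 | 2.8033 | 0.9890 | 1.0637 | 0.6524 | 0.9720 | 1.0227 | 0.1880 | 0.9630 |
| 30% | $\beta_1$ | $S_2$ | 1.1742 | 3.3208 | 0.9840 | 1.0766 | 0.7790 | 0.9690 | 1.0575 | 0.2456 | 0.9630 | 1.0142 | 0.0896 | 0.9590 |
| 30% | $\beta_1$ | $S_3$ | 1.1063 | 0.6261 | 0.9580 | 1.0884 | 0.3041 | 0.9590 | 1.0245 | 0.1325 | 0.9580 | 1.0196 | 0.0486 | 0.9600 |
| 30% | $\beta_1$ | $S_4$ | 1.1503 | 0.6570 | 0.9560 | 1.0643 | 0.2656 | 0.9560 | 1.0447 | 0.1301 | 0.9550 | 1.0216 | 0.0485 | 0.9560 |
| 30% | $\beta_1$ | $S_5$ | 1.0921 | 0.6976 | 0.9810 | 1.0690 | 0.3869 | 0.9460 | 1.0390 | 0.1588 | 0.9560 | 1.0073 | 0.0582 | 0.9600 |
| 30% | $\beta_2$ | $S_1$ | 0.5918 | 1.5797 | 0.9240 | 0.6365 | 0.4378 | 0.9690 | 0.5353 | 0.1197 | 0.9640 | 0.5201 | 0.0481 | 0.9540 |
| 30% | $\beta_2$ | $S_2$ | 0.6477 | 0.6137 | 0.9820 | 0.5863 | 0.1824 | 0.9580 | 0.5210 | 0.0663 | 0.9540 | 0.5167 | 0.0221 | 0.9570 |
| 30% | $\beta_2$ | $S_3$ | 0.6290 | 0.3468 | 0.9690 | 0.5613 | 0.0981 | 0.9530 | 0.5260 | 0.0434 | 0.9460 | 0.5132 | 0.0146 | 0.9460 |
| 30% | $\beta_2$ | $S_4$ | 0.6152 | 0.3375 | 0.9680 | 0.5438 | 0.0950 | 0.9420 | 0.5253 | 0.0355 | 0.9640 | 0.5120 | 0.0139 | 0.9510 |
| 30% | $\beta_2$ | $S_5$ | 0.6265 | 0.4351 | 0.9830 | 0.5748 | 0.1286 | 0.9610 | 0.5335 | 0.0522 | 0.9490 | 0.5213 | 0.0176 | 0.9590 |
| 30% | α | $S_1$ | 2.1724 | 0.1398 | 0.9240 | 2.0832 | 0.0569 | 0.9320 | 2.0336 | 0.0216 | 0.9530 | 2.0070 | 0.0086 | 0.9600 |
| 30% | α | $S_2$ | 2.1652 | 0.1356 | 0.9300 | 2.0651 | 0.0616 | 0.9390 | 2.0326 | 0.0272 | 0.9450 | 2.0077 | 0.0104 | 0.9490 |
| 30% | α | $S_3$ | 2.2008 | 0.2610 | 0.9090 | 2.0991 | 0.1030 | 0.9310 | 2.0518 | 0.0449 | 0.9470 | 2.0185 | 0.0179 | 0.9350 |
| 30% | α | $S_4$ | 2.2064 | 0.2160 | 0.9220 | 2.0892 | 0.0823 | 0.9360 | 2.0391 | 0.0333 | 0.9580 | 2.0099 | 0.0142 | 0.9430 |
| 30% | α | $S_5$ | 2.1610 | 0.1840 | 0.9240 | 2.0801 | 0.0813 | 0.9320 | 2.0427 | 0.0367 | 0.9480 | 2.0146 | 0.0134 | 0.9530 |
| 30% | λ | $S_1$ | 4.8033 | 0.2512 | 0.9100 | 4.9119 | 0.1367 | 0.9300 | 4.9648 | 0.0612 | 0.9510 | 4.9956 | 0.0264 | 0.9580 |
| 30% | λ | $S_2$ | 4.8668 | 0.2699 | 0.9260 | 4.9750 | 0.1591 | 0.9400 | 4.9998 | 0.0781 | 0.9560 | 5.0082 | 0.0334 | 0.9590 |
| 30% | λ | $S_3$ | 4.9333 | 0.4564 | 0.9290 | 5.0227 | 0.2816 | 0.9440 | 5.0163 | 0.1388 | 0.9450 | 5.0163 | 0.0650 | 0.9410 |
| 30% | λ | $S_4$ | 4.8761 | 0.3553 | 0.9280 | 4.9811 | 0.2026 | 0.9400 | 5.0069 | 0.1046 | 0.9480 | 5.0062 | 0.0436 | 0.9530 |
| 30% | λ | $S_5$ | 4.9788 | 0.4640 | 0.9290 | 5.0235 | 0.2677 | 0.9420 | 5.0125 | 0.1282 | 0.9510 | 5.0170 | 0.0575 | 0.9450 |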
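The three summary measures in Table 2 can be computed from Monte Carlo replicates along the following lines. This is a minimal sketch rather than the authors' code. It assumes that the MSE is the mean squared deviation of the estimates from the true parameter value, and that CP is the fraction of nominal 95% Wald intervals (estimate ± 1.96 × standard error) that contain the true value. The function name `summarize_replicates` and its arguments are hypothetical.

```python
from statistics import mean

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def summarize_replicates(estimates, std_errors, true_value, z=Z_975):
    """Return (average, MSE, coverage probability) over Monte Carlo replicates.

    estimates  : point estimates of one parameter, one per replicate
    std_errors : matching standard errors, one per replicate
    true_value : parameter value used to generate the data
    """
    if len(estimates) != len(std_errors) or len(estimates) == 0:
        raise ValueError("estimates and std_errors must be non-empty and equally long")

    average = mean(estimates)
    # Assumed definition: mean squared deviation from the true value.
    mse = mean([(est - true_value) ** 2 for est in estimates])
    # Assumed interval: symmetric Wald interval on the original scale.
    covered = [
        est - z * se <= true_value <= est + z * se
        for est, se in zip(estimates, std_errors)
    ]
    coverage = sum(covered) / len(covered)
    return average, mse, coverage


if __name__ == "__main__":
    # Toy replicates for illustration only; these are not values from the paper.
    avg, mse, cp = summarize_replicates([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], true_value=2.0)
    print(avg, mse, cp)
```

For the positive parameters, the coverage check would need to use whatever interval the study actually reported (for example, one built on the log scale), so the symmetric interval here is only a stand-in.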
Table 3. Point and interval estimates of the parameters of the proposed model.

| Covariates | Parameter | Estimate | Standard Error | 95% CI |
|---|---|---|---|---|
| Intercept | $\beta_0$ | −4.040 | 1.002 | (−6.004; −2.076) |
| Credit limit | $\beta_1$ | 0.209 | 0.104 | (0.005; 0.413) |
| Gender ¹: Male | $\beta_2$ | 1.027 | 0.142 | (0.749; 1.305) |
| Social class ²: B | $\beta_{31}$ | 0.433 | 0.221 | (−0.001; 0.867) |
| Social class ²: C | $\beta_{32}$ | 0.863 | 0.226 | (0.419; 1.306) |
| Social class ²: D | $\beta_{33}$ | 0.765 | 0.224 | (0.326; 1.204) |
| Social class ²: E | $\beta_{34}$ | 1.423 | 0.224 | (0.985; 1.861) |
| Marital status ³: Married | $\beta_{41}$ | 1.325 | 0.174 | (0.984; 1.666) |
| Marital status ³: Widowed/separated | $\beta_{42}$ | 0.959 | 0.174 | (0.618; 1.301) |
| Age | $\beta_5$ | 0.013 | 0.011 | (−0.009; 0.035) |
| | σ | 3.386 | 0.312 | (2.827; 4.056) ⁴ |
| | α | 1.283 | 0.043 | (1.202; 1.367) ⁴ |
| | λ | 9.956 | 0.378 | (9.243; 10.725) ⁴ |

Reference level of the covariates: ¹ Female; ² Class A; ³ Single. ⁴ A logarithmic transformation was applied for constructing the CIs.
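The two kinds of interval in Table 3 can be sketched as follows. The source states only that a logarithmic transformation was applied for σ, α, and λ. The code assumes the usual delta-method construction, in which the standard error of log θ is approximated by se/θ and the interval is exponentiated back. It also assumes ordinary Wald intervals for the regression coefficients. The function names `wald_ci` and `log_wald_ci` are hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution


def wald_ci(estimate, std_error, z=Z_975):
    """Symmetric Wald interval on the original scale (assumed for the betas)."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width


def log_wald_ci(estimate, std_error, z=Z_975):
    """Interval built on the log scale and mapped back (assumed for sigma, alpha, lambda).

    Delta method: se(log(theta)) is approximated by std_error / estimate,
    so the limits are estimate * exp(-h) and estimate * exp(+h).
    """
    if estimate <= 0:
        raise ValueError("log-scale interval requires a positive estimate")
    half_width = z * std_error / estimate
    return estimate * math.exp(-half_width), estimate * math.exp(half_width)


if __name__ == "__main__":
    # Estimates and standard errors as reported in Table 3.
    print("beta_0:", wald_ci(-4.040, 1.002))
    print("sigma: ", log_wald_ci(3.386, 0.312))
    print("lambda:", log_wald_ci(9.956, 0.378))
```

With the rounded values from the table, this construction comes close to the reported limits for the intercept, σ, and λ. Small discrepancies are expected because the published estimates and standard errors are rounded to three decimals. The log-scale interval is always positive, which is presumably why it was preferred for these parameters.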
