1. Introduction
Advances in scientific research, particularly in the health sciences, have been pivotal in driving substantial improvements in the average life expectancy of populations worldwide. Consequently, patients with longer survival times may experience certain clinical events more than once. For this reason, since the 1980s, there has been a growing effort to develop survival models for the analysis of multivariate time-to-event data [1,2] arising from the observation of several episodes of a particular event of interest. These episodes are referred to as recurrent events [3] and are commonly observed in medical studies on cancer relapses, myocardial infarctions, asthma attacks and hospital readmissions. Furthermore, in scenarios where patients may recover after each occurrence, researchers (e.g., [4]) often focus on modelling the time elapsed between two consecutive events, known as the gap time.
As mentioned by Cook and Lawless [3], there are two fundamental approaches to modelling recurrent events: models based on event counts and models based on gap times. Poisson processes are the canonical models for event counts, typically using calendar time as the time scale. In contrast, the renewal process is the canonical framework for analysing gap times between recurrent events. Such a stochastic process relies on the assumption that gap times are independent and identically distributed (iid) random variables. Since this is a strong condition that holds only in a few cases, more general gap time models that account for within-individual dependence have been developed over the years using conditional distributions. These include various regression models for assessing covariate effects on a given function of interest (e.g., intensity, rate, hazard, mean or quantile functions) [5,6,7,8,9,10,11,12,13], multistate models to capture transitions between different event states [14,15], gap time models with random effects to account for unobserved heterogeneity [4,16,17] and copula-based models for the joint modelling of dependent gap times [18,19]. For a more in-depth overview of the topic, readers are referred to [3].
Another approach in gap time modelling involves obtaining the conditional distribution of a gap time, given a previous recurrence time. This issue has been investigated by several authors [20,21,22], who have proposed different non-parametric estimators of the conditional survival (or distribution) function that, in some way, are based on the Kaplan–Meier estimator of the survival function. From a different perspective, the conditional distribution of the gap times can be deduced under the classic assumption that the number of recurrent events up to a given time follows a non-homogeneous Poisson process (NHPP) [3]. In this setting, Zhao and Zhou [23] developed an additive semiparametric model with a rate function derived from an NHPP. One advantage of this model is its ability to estimate covariate effects without assuming a specific parametric form for the baseline rate function, similarly to the well-known Cox-based models, which feature an unspecified baseline hazard function. However, leaving these functions unspecified may be a limitation when estimating them is of primary interest, particularly in medicine, where such estimates reveal how the risk of disease occurrence evolves over time. As stated by Royston and Parmar [24] and Jullum and Hjort [25], using a parametric version of the Cox model, when appropriate, can provide more accurate survival probability estimates, enhancing the understanding of the phenomenon under study. Therefore, adopting a fully parametric model may be more suitable, a stance supported by Cox in [26].
Following Zhao and Zhou [23], a class of parametric rate models has emerged considering alternative specifications for the baseline rate function. Macera et al. [27] and Louzada et al. [28] studied a model based on the extended exponential–Poisson (EEP) distribution. Similarly, Louzada et al. [29] and Sousa-Ferreira et al. [30] studied the use of the Weibull form to specify the baseline rate, but the models were extended in different directions to account for the presence of zero-recurrence (cured) individuals. Both the EEP and Weibull rate models have been shown to include the classical homogeneous Poisson process (HPP) as a special case, a property that is useful for model-checking purposes. Nonetheless, these models only allow for monotonic rate functions and may fall short in capturing complex disease patterns over a patient's lifetime. To overcome this limitation, Sousa-Ferreira et al. [31] proposed a model based on the extended Chen–Poisson (ECP) distribution, whose rate function can accommodate non-monotonic shapes, including bathtub and unimodal forms.
Despite the variety of available parametric approaches, a comprehensive comparison of existing parametric rate models is still lacking. Thus, a more detailed study is justified to assess the strengths and limitations of different parametric rate models, providing a valuable understanding of their relative performance and applicability.
This paper is organised as follows.
Section 2 provides an overview of the mathematical properties underlying the general rate model for gap times between recurrent events derived from an NHPP, followed by the formulation of existing parametric rate models that differ in terms of the distributional assumptions on gap times. A new model, based on restricted cubic splines (RCSs), is also introduced in this section.
Section 3 outlines the inferential procedure based on the usual maximum likelihood (ML) method for right-censored data and addresses the likelihood ratio (LR) test for model selection.
Section 4 illustrates the application of these models to two real data sets from the literature, namely bowel motility data and hospital readmission data. Finally, Section 5 presents some concluding remarks and directions for future work.
2. Parametric Rate Models for Gap Times Between Recurrent Events
Survival models for recurrent events can be formulated by defining the distribution of the number of events in an infinitesimal interval $[t, t + \Delta t)$, given the process history up to time $t$. In general, fully specifying this distribution is unfeasible due to its complexity. As the interest often lies in marginal features of the recurrence process, Poisson process-based models focusing on the rate and mean functions have been developed [7,9].
Consider an individual whose recurrence process begins at time 0 for simplicity. Let $0 < T_1 < T_2 < \cdots$ be the continuous and non-negative random variables denoting the ordered times corresponding to episodes of a given event of interest, where $T_k$ ($k = 1, 2, \ldots$) is the time from the beginning of the study until the occurrence of the $k$th episode. These times are realisations of a counting process, $\{N(t),\, t \geq 0\}$, which records the cumulative number of events in $(0, t]$. An alternative representation of the same process is through the gap times, defined as $Y_k = T_k - T_{k-1}$, with $T_0 = 0$.
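The equivalence between the two representations can be illustrated with a minimal Python sketch (the event times below are hypothetical):

```python
import numpy as np

# Hypothetical ordered event times t_1 < t_2 < ... for one individual (in days)
event_times = np.array([5.0, 12.0, 20.0, 33.0])

# Gap times y_k = t_k - t_{k-1}, with t_0 = 0
gap_times = np.diff(event_times, prepend=0.0)
print(gap_times)  # [ 5.  7.  8. 13.]

# The cumulative sums of the gap times recover the original event times
assert np.allclose(np.cumsum(gap_times), event_times)
```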
Assuming that $\{N(t),\, t \geq 0\}$ is a Poisson process, the intensity function is defined as
$$\rho(t) = \lim_{\Delta t \downarrow 0} \frac{P\{\Delta N(t) = 1\}}{\Delta t}, \qquad (1)$$
where $\Delta N(t) = N(t + \Delta t^{-}) - N(t^{-})$ is the increment over $[t, t + \Delta t)$. A key assumption is that $\{N(t),\, t \geq 0\}$ has independent increments. Therefore, Equation (1) is known as the rate function and represents the marginal (unconditional on the history) instantaneous probability of an event at time $t$ [3]. Considering that the probability of more than one event over $[t, t + \Delta t)$ is negligible, it follows that $P\{\Delta N(t) \geq 2\} = o(\Delta t)$ and $E\{\Delta N(t)\} \approx \rho(t)\Delta t$. Thus, an equivalent way to define this process is $\Delta N(t) \sim \text{Poisson}(\rho(t)\Delta t)$. The mean function (also called the cumulative rate function), defined as $\mu(t) = E\{N(t)\} = \int_0^t \rho(u)\,\mathrm{d}u$, describes the expected number of events at $t$. The process is called homogeneous if $\rho(t) = \rho$ is a positive constant; otherwise, it is non-homogeneous.
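As a numerical illustration of a non-homogeneous process, the following Python sketch simulates an NHPP by thinning a homogeneous process and checks that the average number of events over $(0, 10]$ agrees with the mean function; the rate function used is a hypothetical choice, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def rho(t):
    # Hypothetical non-homogeneous rate function
    return 0.5 + 0.2 * t

def simulate_nhpp(t_max, rho_max, rng):
    """Simulate NHPP event times on (0, t_max] by thinning an HPP(rho_max)."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / rho_max)   # candidate event from the HPP
        if t > t_max:
            return np.array(events)
        if rng.uniform() < rho(t) / rho_max:  # keep with probability rho(t)/rho_max
            events.append(t)

# mu(10) = integral of (0.5 + 0.2 u) over (0, 10] = 5 + 10 = 15
counts = [len(simulate_nhpp(10.0, 2.5, rng)) for _ in range(2000)]
print(np.mean(counts))  # close to 15
```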
The Poisson process is useful in deriving the conditional distribution of gap times, given the previous recurrence time [3]. Under the assumptions of an NHPP, the probability that no event occurs during a gap time of length $y$, given that the individual survived beyond time $t_{k-1}$, is defined as
$$S(y \mid t_{k-1}) = P(Y_k > y \mid T_{k-1} = t_{k-1}) = \exp\{-[\mu_0(t_{k-1} + y) - \mu_0(t_{k-1})]\} = \frac{S_0(t_{k-1} + y)}{S_0(t_{k-1})}, \qquad (2)$$
where $S_0(t) = \exp\{-\mu_0(t)\}$ is the baseline survival function. The expected value of $Y_k$, conditional on $T_{k-1} = t_{k-1}$, takes the expression $E(Y_k \mid t_{k-1}) = \int_0^{\infty} S(y \mid t_{k-1})\,\mathrm{d}y$, which, by definition, corresponds to the mean residual life function associated with $S_0$ at time $t_{k-1}$.
The conditional survival function (2) expresses the dependence structure among gap times within an individual but reduces to $S_0(y)$ for the first gap time ($k = 1$). In general, gap times are not independent, except in the special case of an HPP, where $\mu_0(t) = \rho t$ and the gap times are iid exponential random variables with rate parameter $\rho$.
The corresponding expected number of recurrences over the interval $(t_{k-1}, t_{k-1} + y]$ is equal to the conditional cumulative rate function
$$\mu(y \mid t_{k-1}) = \mu_0(t_{k-1} + y) - \mu_0(t_{k-1}), \qquad (3)$$
where $\mu_0(t)$ is the baseline cumulative rate function. Then, the rate function of the recurrence process $\{N(t),\, t \geq 0\}$ can be straightforwardly deduced from (3) as
$$\rho(y \mid t_{k-1}) = \frac{\partial}{\partial y}\,\mu(y \mid t_{k-1}) = \rho_0(t_{k-1} + y), \qquad (4)$$
where $\rho_0(t)$ is the baseline rate function.
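The relationship between the conditional cumulative rate and the shifted baseline rate can be verified numerically; the sketch below assumes a hypothetical Weibull-type baseline $\mu_0(t) = t^{1.5}$, chosen purely for illustration:

```python
import numpy as np

# Hypothetical Weibull-type baseline: mu0(t) = t^1.5, so rho0(t) = 1.5 * t^0.5
mu0 = lambda t: t**1.5
rho0 = lambda t: 1.5 * np.sqrt(t)

t_prev, y, h = 2.0, 1.3, 1e-6

# Conditional cumulative rate: mu(y | t_prev) = mu0(t_prev + y) - mu0(t_prev)
mu_cond = lambda y: mu0(t_prev + y) - mu0(t_prev)

# Differentiating in y recovers the shifted baseline rate rho0(t_prev + y)
numeric_rate = (mu_cond(y + h) - mu_cond(y - h)) / (2 * h)
print(numeric_rate, rho0(t_prev + y))  # the two values agree
```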
During the past two decades, gap time models characterised by the rate function (4) have been developed. These models differ in the nature (non-parametric or parametric) of the baseline rate. The first model was proposed by Zhao and Zhou [23], who considered a kernel estimation method to estimate the baseline rate non-parametrically. Subsequent models [27,28,29,30,31] assumed a specific parametric form for $\rho_0(t)$ based on a particular distribution. Moreover, a Poisson process can be generalised to incorporate covariates or random effect terms [3], which has also been done by some authors (e.g., [23,29,30]) under this approach. Here, however, the focus is exclusively on gap time modelling.
2.1. Extended Exponential–Poisson (EEP) Rate Model
Macera et al. [27] and Louzada et al. [28] proposed similar models, assuming that the baseline rate function has the same analytical form as the hazard function of the exponential–Poisson and Poisson–exponential distributions, respectively. These distributions, intended for single-event analysis, arise in competitive and complementary risk (CCR) problems, where the lifetime of each cause (which, in this case, follows an exponential distribution) is unobservable, and only the minimum or maximum lifetime across all possible causes can be observed. Ramos et al. [32] showed that, if the number of causes follows a zero-truncated Poisson distribution with parameter $\theta > 0$, the distributions of the minimum and maximum can be unified into a single model by extending the parameter space to $\theta \in \mathbb{R} \setminus \{0\}$, giving rise to the unified Poisson family of distributions. The expected number of latent causes is $\theta/(1 - e^{-\theta})$, approaching 1 as $\theta \to 0$. The exponential–Poisson and Poisson–exponential distributions are particular cases of the EEP distribution for $\theta > 0$ (distribution of the minimum) and $\theta < 0$ (distribution of the maximum), respectively. Thus, the two parametric rate models of Macera et al. [27] and Louzada et al. [28] can likewise be merged, as discussed in [32].
For the EEP rate model, the rate function of the recurrence process $\{N(t),\, t \geq 0\}$ is
$$\rho(y \mid t_{k-1}) = \rho_0(t_{k-1} + y), \quad \text{with} \quad \rho_0(t) = \frac{\lambda \theta\, e^{-\lambda t}\, e^{\theta e^{-\lambda t}}}{e^{\theta e^{-\lambda t}} - 1}, \qquad (5)$$
where $\lambda > 0$ and $\theta \in \mathbb{R} \setminus \{0\}$. Its shape is monotonically decreasing for $\theta > 0$ and increasing for $\theta < 0$, stabilising at $\lambda$ as the gap time tends to infinity. Since $\lim_{\theta \to 0} \rho_0(t) = \lambda$, the exponential rate model (HPP) with constant rate $\lambda$ is a limiting case. While broadly applicable to recurrent gap time data, this model is especially useful in CCR problems as it yields a practical interpretation: for $\theta > 0$ (or $\theta < 0$), $Y_k$ represents the minimum (or maximum) gap time among all competitive (or complementary) causes.
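The stated shape properties can be checked numerically. The sketch below assumes the EEP baseline rate takes the hazard form of the unified Poisson family with exponential lifetimes, $\rho_0(t) = \lambda\theta e^{-\lambda t} e^{\theta e^{-\lambda t}} / (e^{\theta e^{-\lambda t}} - 1)$; this parametrisation is an assumption consistent with the limiting behaviour described in the text, not a formula quoted from [27,28]:

```python
import numpy as np

def eep_rate(t, lam, theta):
    # Assumed EEP baseline rate (unified Poisson family, exponential lifetimes);
    # expm1 keeps the denominator stable when theta * exp(-lam * t) is tiny
    x = theta * np.exp(-lam * t)
    return lam * x * np.exp(x) / np.expm1(x)

t = np.linspace(0.01, 10.0, 200)
lam = 0.8

dec = eep_rate(t, lam, theta=2.0)    # theta > 0: monotonically decreasing
inc = eep_rate(t, lam, theta=-2.0)   # theta < 0: monotonically increasing
assert np.all(np.diff(dec) < 0) and np.all(np.diff(inc) > 0)
print(dec[-1], inc[-1])  # both stabilise near lam = 0.8 for large gap times
```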
2.2. Extended Chen–Poisson (ECP) Rate Model
Sousa-Ferreira et al. [31] deduced the conditional distribution of a gap time, given the previous recurrence time, assuming a baseline rate function with an ECP form [33]. Their work addresses a limitation of the EEP rate model, whose monotonic rate function may not be suitable for scenarios where the risk peaks and then declines, such as in disease progression. Based on the ECP distribution, the resulting parametric rate model accommodates non-monotonic rate shapes, offering a more accurate representation of real-world data. Similarly to the EEP distribution, the ECP distribution is a member of the unified Poisson family [32] but, in this case, the lifetime of each unobservable cause follows a Chen distribution.
For the ECP rate model, the rate function of the $k$th gap time, $Y_k$, conditional on $T_{k-1} = t_{k-1}$, takes the expression
$$\rho(y \mid t_{k-1}) = \rho_0(t_{k-1} + y), \quad \text{with} \quad \rho_0(t) = \lambda \gamma \theta\, t^{\gamma - 1} e^{t^{\gamma}}\, S_0^{C}(t)\, \frac{e^{\theta S_0^{C}(t)}}{e^{\theta S_0^{C}(t)} - 1}, \qquad (6)$$
where $S_0^{C}(t) = \exp\{\lambda(1 - e^{t^{\gamma}})\}$ is the survival function of the Chen distribution, $\lambda, \gamma > 0$ and $\theta \in \mathbb{R} \setminus \{0\}$. This function can adopt the same shapes as the hazard function of the ECP distribution [33], including monotonic increasing, monotonic decreasing, unimodal, bathtub, increasing–decreasing–increasing and decreasing–increasing–decreasing–increasing. Since $\lim_{\theta \to 0} \rho_0(t) = \lambda \gamma t^{\gamma - 1} e^{t^{\gamma}}$ has the same analytical form as the hazard function of the Chen distribution, it follows that the ECP rate model reduces to the Chen rate model as $\theta \to 0$. In CCR settings, $\theta$ retains its interpretation in terms of competing ($\theta > 0$) or complementary ($\theta < 0$) causes, enabling the estimation of the average number of latent causes.
2.3. Flexible Rate Model Based on Restricted Cubic Splines (RCSs)
Traditional distributions—such as the exponential, Weibull, gamma or other generalised exponential distributions—often lack the flexibility to capture rate functions that increase and decrease multiple times. For this reason, we propose using RCSs in this class of parametric rate models. A cubic spline is a smooth function defined by a set of third-degree polynomial functions joined at a predefined number of points, with continuous first and second derivatives. The first and last of these points are named boundary knots, and the remaining are internal knots. This function may also be constrained to be linear beyond the boundary knots, ensuring a sensible functional form in the tails [34,35], where data are typically sparse. These types of splines are known as RCSs and have been used in survival models [24,36].
For a predefined number $m$ of internal knots, denoted by $k_1 < k_2 < \cdots < k_m$, with boundary knots $k_{\min}$ and $k_{\max}$, the RCS function of an observation $x$ can be written as
$$s(x; \boldsymbol{\gamma}) = \gamma_0 + \gamma_1 x + \gamma_2 v_1(x) + \cdots + \gamma_{m+1} v_m(x), \qquad (7)$$
where $\boldsymbol{\gamma} = (\gamma_0, \gamma_1, \ldots, \gamma_{m+1})^{\top}$ is the vector of parameters and $v_l(x)$ is the $l$th basis function ($l = 1, \ldots, m$), defined as
$$v_l(x) = (x - k_l)_+^3 - \phi_l (x - k_{\min})_+^3 - (1 - \phi_l)(x - k_{\max})_+^3, \quad \phi_l = \frac{k_{\max} - k_l}{k_{\max} - k_{\min}},$$
with $(x - a)_+ = \max(0, x - a)$. The complexity of the curve is regulated by the number of degrees of freedom (df), given by df $= m + 1$. As $m$ increases, the curve gains flexibility; however, it may become unstable if $m$ is too large. By convention, df $= 1$ indicates that no internal knots are specified, so $s(x; \boldsymbol{\gamma}) = \gamma_0 + \gamma_1 x$. Some authors [24,37] recommend modelling on the log time scale, i.e., $x = \ln(t)$ with $t > 0$, as this strategy reduces the variation between curves with different df.
Following the approach of Royston and Parmar [24], we propose to model the log-cumulative baseline rate function as an RCS function of log time, which yields analytically tractable functions. From (3), the flexible rate model has a cumulative rate function characterised by
$$\mu(y \mid t_{k-1}) = \exp\{s(\ln(t_{k-1} + y); \boldsymbol{\gamma})\} - \exp\{s(\ln(t_{k-1}); \boldsymbol{\gamma})\},$$
where $\ln \mu_0(t) = s(\ln(t); \boldsymbol{\gamma})$ is the log-cumulative baseline rate function and $s(\cdot; \boldsymbol{\gamma})$ is the RCS function (7) of $\ln(t)$. Therefore, the corresponding rate function of the recurrence process $\{N(t),\, t \geq 0\}$ is
$$\rho(y \mid t_{k-1}) = \rho_0(t_{k-1} + y), \quad \text{with} \quad \rho_0(t) = \frac{1}{t}\, s'(\ln(t); \boldsymbol{\gamma}) \exp\{s(\ln(t); \boldsymbol{\gamma})\}, \qquad (8)$$
with the expression for $s'(\cdot; \boldsymbol{\gamma})$, the derivative of the RCS function, available in
Appendix A. This flexible rate function can capture rollercoaster shapes with multiple inflection points. Each choice of $m$ internal knots defines a different parametric rate model. With no internal knots, the rate simplifies to $\rho_0(t) = \gamma_1 e^{\gamma_0} t^{\gamma_1 - 1}$, which means that the process follows a Weibull rate model, with scale parameter $e^{\gamma_0}$ and shape parameter $\gamma_1$. This case, studied in [29,30] using a more common parametrisation, includes the HPP as a nested model when $\gamma_1 = 1$. All other cases correspond to an NHPP.
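The reduction to a Weibull rate model in the no-internal-knot case follows from $\mu_0(t) = \exp\{\gamma_0 + \gamma_1 \ln(t)\} = e^{\gamma_0} t^{\gamma_1}$ and can be checked numerically; the coefficient values below are hypothetical:

```python
import numpy as np

g0, g1 = -0.5, 1.7   # hypothetical spline coefficients (no internal knots)

mu0 = lambda t: np.exp(g0 + g1 * np.log(t))   # exp{s(ln t)} with s linear
weibull_mu0 = lambda t: np.exp(g0) * t**g1    # Weibull cumulative rate

t = np.linspace(0.1, 5.0, 50)
assert np.allclose(mu0(t), weibull_mu0(t))

# Differentiating mu0 numerically recovers the Weibull rate g1 * exp(g0) * t**(g1-1)
h = 1e-6
rate = (mu0(t + h) - mu0(t - h)) / (2 * h)
print(np.max(np.abs(rate - g1 * np.exp(g0) * t**(g1 - 1.0))))  # ≈ 0
```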
Considerations on the Use of Restricted Cubic Splines (RCSs)
The placement and number of internal knots are widely debated topics in the literature (e.g., [34,38]). Royston and Parmar [24] discourage the use of data-driven optimisation methods to automatically select the location of each knot, arguing that such approaches may lead to overfitting by capturing minor features of the data. Instead, they recommend placing boundary knots at the minimum and maximum of the uncensored log survival times, with internal knots set at equally spaced empirical quantiles (e.g., quartiles for three knots; see Table 1). From a practical perspective, these authors found that the location of the internal knots has a minimal impact on the overall shape of the estimated hazard function.
The automated selection of the number $m$ of internal knots is also discouraged for similar reasons. Note that RCS-based models are not necessarily nested, so the application of the LR test is generally inappropriate to compare models with different values of $m$. Royston and Parmar [24] and Rutherford et al. [38] found that little is gained by considering $m > 3$. Consequently, these authors suggested informally examining the observed values of the Akaike information criterion (AIC) and Bayesian information criterion (BIC) for models fitted with zero to three internal knots.
Rutherford et al. [38] conducted a comprehensive simulation study on the use of RCSs to approximate complex hazard functions. Their findings revealed that, with a sufficient number of knots, the hazard function estimated via an RCS closely approximates the true simulated hazard function across a wide range of complex shapes. Moreover, they concluded that the hazard ratio values are largely insensitive to baseline hazard misspecification.
Although the primary objective of using RCSs is to accurately approximate the baseline hazard function, Royston and Parmar [24] chose to model the log-cumulative baseline hazard function as an RCS function of log time, rather than modelling the baseline hazard or cumulative hazard function directly, to obtain analytically tractable functions. By contrast, Crowther and Lambert [35] modelled the log baseline hazard function using RCSs, acknowledging the need for numerical integration to derive the survival and cumulative hazard functions, but highlighting its advantages when incorporating time-dependent covariates.
3. Statistical Inference
The inferential procedures are based on the usual ML approach and large sample properties, under a general right-censoring mechanism and within the framework of recurrent event analysis [3], assuming that gap times are conditionally independent given the previous observed recurrence time.
Suppose that data are available from $n$ independent individuals. Let $(y_{ik}, \delta_{ik})$ be the pair associated with the $k$th recurrence of the $i$th individual, $i = 1, \ldots, n$ and $k = 1, \ldots, m_i$, where $y_{ik}$ is the observed gap time between two consecutive events, and $\delta_{ik}$ is the censoring indicator, taking the value 1 if $y_{ik}$ is completely observed and 0 if it is right-censored. Thus, $t_{ik} = \sum_{j=1}^{k} y_{ij}$ represents the observed event times corresponding to the $m_i$ recurrences. In general, consider that the parametric rate model, characterised by the rate function (4), is known up to a vector of parameters $\boldsymbol{\vartheta}$. Assuming that the censoring mechanism is non-informative, the ML estimate of $\boldsymbol{\vartheta}$ can be obtained by maximising the log-likelihood function, $\ell(\boldsymbol{\vartheta})$, given by
$$\ell(\boldsymbol{\vartheta}) = \sum_{i=1}^{n} \sum_{k=1}^{m_i} \left[ \delta_{ik} \ln \rho(y_{ik} \mid t_{i,k-1}) - \mu(y_{ik} \mid t_{i,k-1}) \right], \qquad (9)$$
where $\rho(\cdot \mid t_{i,k-1})$ is specified according to the parametric distributional assumption on the gap times, conditional on the previous recurrence time. If $Y_k$ follows an EEP rate model, the rate function (5) is used; if it follows an ECP rate model, the rate function (6) is used; and if it follows a flexible rate model, the rate function (8) is used.
Large sample inference for the vector of parameters $\boldsymbol{\vartheta}$ can be based on the corresponding ML estimates and their estimated standard errors, evaluated in the usual manner from the inverse of the observed information matrix. The confidence intervals (CIs) for the parameters can be constructed using the normal approximation. For computational implementation, the optim function available in the R [39] statistical software (version 4.5.0) is applied to directly maximise the log-likelihood function (9) using standard numerical optimisation methods, such as the Broyden–Fletcher–Goldfarb–Shanno algorithm.
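A minimal sketch of this fitting procedure, written in Python with scipy rather than R's optim for the purposes of this illustration; the Weibull parametrisation $\mu_0(t) = \alpha t^{\beta}$, the simulation settings and all names are assumptions, not taken from the paper:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def simulate_individual(rate, tau, rng):
    """Gap times from an HPP observed on (0, tau]; the last gap is right-censored."""
    gaps, deltas, t = [], [], 0.0
    while True:
        y = rng.exponential(1.0 / rate)
        if t + y > tau:
            gaps.append(tau - t); deltas.append(0)
            return gaps, deltas
        gaps.append(y); deltas.append(1)
        t += y

# Flatten the simulated data into (gap, indicator, previous recurrence time)
y_all, d_all, tprev_all = [], [], []
for _ in range(300):
    gaps, deltas = simulate_individual(0.5, 10.0, rng)
    t = 0.0
    for y, d in zip(gaps, deltas):
        y_all.append(y); d_all.append(d); tprev_all.append(t)
        t += y
y_all, d_all, tprev_all = map(np.asarray, (y_all, d_all, tprev_all))

def negloglik(par):
    # Weibull-parametrised baseline: mu0(t) = alpha * t**beta
    alpha, beta = np.exp(par)                      # positivity via log-scale
    tk = tprev_all + y_all
    mu = alpha * (tk**beta - tprev_all**beta)      # conditional cumulative rate
    log_rho = np.log(alpha * beta) + (beta - 1.0) * np.log(tk)
    return -np.sum(d_all * log_rho - mu)

fit = minimize(negloglik, x0=np.zeros(2), method="BFGS")
alpha_hat, beta_hat = np.exp(fit.x)
print(alpha_hat, beta_hat)  # close to the true values (0.5, 1.0)
```

Since the data were simulated from an HPP, the fitted shape parameter should be near 1, which is exactly the nested case discussed below for the LR test.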
The goodness of fit comparison between the EEP and exponential rate models (and between the ECP and Chen rate models) involves testing the hypotheses $H_0\colon \theta = 0$ versus $H_1\colon \theta \neq 0$. The LR test is commonly used for model selection between two nested models. The LR statistic is given by $\Lambda = -2(\ell_0 - \ell_1)$, where $\ell_0$ and $\ell_1$ are the maximised log-likelihoods under the null and alternative hypotheses, respectively. Note that the test is performed at the boundary of the parametric space of $\theta$. Under this non-standard condition, we conjecture that the asymptotic distribution of the LR statistic is a 50:50 mixture of a degenerate distribution at zero and a chi-squared distribution with one degree of freedom, as suggested by the theoretical results of Chernoff [40] and Self and Liang [41]. Simulation studies provide empirical support for the use of this asymptotic distribution of the LR statistic when testing the EEP and exponential rate models [27,28], as well as the ECP and Chen rate models [31].
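In practice, this mixture distribution translates into a simple halving of the usual chi-squared p-value, which can be sketched as follows:

```python
from scipy.stats import chi2

def boundary_lr_pvalue(lr_stat):
    """p-value under the 50:50 mixture of a point mass at zero and chi2(1)."""
    if lr_stat <= 0.0:
        return 1.0
    return 0.5 * chi2.sf(lr_stat, df=1)

# The usual chi2(1) p-value is halved; e.g. LR = 2.71 gives p close to 0.05
print(boundary_lr_pvalue(2.71))
```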
In addition, the exponential rate model is nested within the Weibull rate model. The LR test can again be carried out to evaluate the hypotheses $H_0\colon \gamma_1 = 1$ versus $H_1\colon \gamma_1 \neq 1$, where $\gamma_1$ denotes the Weibull shape parameter. In this situation, the classical asymptotic distribution theory for the LR statistic remains valid, as the test is conducted within the interior of the parameter space of $\gamma_1$. Incorporating the exponential rate model as a sub-model is particularly valuable, as it corresponds to the special case of the HPP, where gap times are iid. Consequently, hypothesis testing allows for the assessment of the independence assumption between a gap time, $Y_k$, and the previous recurrence time, $T_{k-1}$.
When comparing two or more models, not necessarily nested, an information criterion based on the maximised likelihood, such as the AIC or BIC, can be used for model selection. Furthermore, to verify that the assumptions underlying the fitted model are reasonable given the available data, a residual analysis can be performed to informally assess whether the observed times follow the specified parametric model. In this context, a generalised version of Cox–Snell residuals is useful in evaluating the overall goodness of fit of models for recurrent events [3]. These residuals are defined as $\hat{e}_{ik} = \hat{\mu}(y_{ik} \mid t_{i,k-1})$, $i = 1, \ldots, n$ and $k = 1, \ldots, m_i$, where $\hat{\mu}(\cdot \mid t_{i,k-1})$ is the estimated cumulative rate function of the fitted model. For the correct model, the graphical representation of the pairs $(\hat{e}_{ik}, \hat{\mu}_{NA}(\hat{e}_{ik}))$ yields a straight line through the origin with a unit slope, where $\hat{\mu}_{NA}(\cdot)$ is the Nelson–Aalen estimate of the cumulative rate function based on the residuals. Alternatively, plotting the pairs on the log scale makes it easier to identify deviations from linearity.
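A sketch of this residual diagnostic, using simulated unit-exponential values as stand-ins for the fitted residuals $\hat{\mu}(y_{ik} \mid t_{i,k-1})$ (under a correctly specified model these behave like censored unit-exponential data), together with a simple Nelson–Aalen computation:

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-ins for the Cox-Snell residuals under a correct model: censored
# unit-exponential variables (about 20% right-censored)
n = 2000
e = rng.exponential(1.0, n)
c = rng.exponential(4.0, n)
res = np.minimum(e, c)
delta = (e <= c).astype(int)

# Nelson-Aalen estimate of the cumulative rate function of the residuals
order = np.argsort(res)
res_s, delta_s = res[order], delta[order]
at_risk = n - np.arange(n)            # number still at risk at each residual
na = np.cumsum(delta_s / at_risk)

# For a correct model the points (res_s, na) lie near the line through the
# origin with unit slope; least-squares slope through the origin as a summary:
mask = res_s < 2.0
slope = np.sum(na[mask] * res_s[mask]) / np.sum(res_s[mask] ** 2)
print(slope)  # close to 1
```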
5. Concluding Remarks and Future Work
The main objective of our study was to provide a comprehensive comparison of parametric rate models for the analysis of gap times between recurrent events from theoretical and practical perspectives. The models differ in their distributional assumptions on the gap times but share the feature of having a rate function derived from an NHPP, enabling the conditional distribution of a gap time given the previous recurrence time to be obtained. An additional contribution of this work is the proposal of the flexible rate model, accompanied by a thorough discussion of its formulation and modelling strategy using RCS functions.
In the application to two well-known clinical data sets—the bowel data and the readmission data—our findings suggest that a model with a monotonic rate function (exponential, Weibull or EEP form) may fit the data poorly when dependence is present, as shown by the results of the LR test. Both applications showed that the ECP and flexible rate models markedly improved the fit by capturing non-monotonic rate shapes, with the latter yielding the best overall goodness of fit. Notably, the baseline cumulative rate function of the flexible rate model did not require great algebraic complexity, as a single internal knot (df $= 2$) was sufficient in the spline part to achieve the lowest AIC and BIC values, while also exhibiting the best alignment with the reference line in the Cox–Snell residual plot. This highlights the advantage of modelling the log-cumulative baseline rate function as an RCS function of log time, particularly when no qualitative information is available to guide the selection of the most appropriate baseline distribution for the gap times.
At this stage of our research, no simulation study was conducted to assess the performance of the flexible rate model. This decision was motivated by the extensive literature supporting the use of RCS functions to approximate complex hazard shapes in survival analysis, as discussed throughout this work. In particular, when an adequate number of internal knots is specified, RCS-based models provide accurate approximations to the baseline hazard function. Nonetheless, we recognise that simulation studies could be pursued under the flexible rate model to evaluate the risk of overfitting in small samples and the impact of different degrees of dependence between a gap time and the previous recurrence time within an individual.
A promising avenue for future work would be to extend the existing parametric rate models to include covariates through the scale parameter, assuming a multiplicative effect on the rate function. Moreover, these models can also be generalised to incorporate a frailty term (random effect), aiming to represent unobserved or unmeasurable risk factors. Finally, we plan to develop an R package dedicated to the methodology discussed here.