Subgroup Identification and Regression Analysis of Clustered and Heterogeneous Interval-Censored Data

Clustered and heterogeneous interval-censored data occur in many fields such as medical studies. For example, in a migraine study with the Netherlands Twin Registry, information including the time to diagnosis of migraine and gender was collected for 3975 monozygotic and dizygotic twins. Since each study subject is observed only at discrete and periodic follow-up time points, the failure times of interest (i.e., the time when an individual first had a migraine) are known only to belong to certain intervals and hence are interval-censored. Furthermore, these twins come from different genetic backgrounds and may be associated with differential risks of developing migraines. For simultaneous subgroup identification and regression analysis of such data, we propose a latent Cox model in which the number of subgroups is not assumed a priori but rather estimated in a data-driven manner. A nonparametric maximum likelihood method and an EM algorithm with the monotone ascent property are also developed for estimating the model parameters. Simulation studies are conducted to assess the finite sample performance of the proposed estimation procedure. We further illustrate the proposed methodologies with an empirical analysis of the migraine data.


Introduction
Subgroup identification for heterogeneous data has become a ubiquitous problem in a broad range of applications, including social science, marketing, and clinical trials. For instance, in clinical trials, heterogeneity may arise due to underlying differences among groups of patients. For patients with similar attributes, disease progression or treatment effects often exhibit similar patterns. It is therefore valuable to classify the patients into a few homogeneous groups and tailor a disease treatment specifically to each subgroup to optimize the treatment effect. Conceptually, analyzing data from a heterogeneous population consisting of a few homogeneous subgroups amounts to viewing the data as generated from a mixture of subgroups, which leads to a finite mixture model. In unsupervised learning, parametric mixture models have been widely used in many fields. The books [1-4] and the review paper [5] provide a thorough introduction to finite mixture models and their applications. In addition, mixture models have been applied in reliability analysis [6-8].
Interval-censored data are common in real applications. In many clinical studies, observations are recorded periodically, so the failure times of interest are known only to lie between consecutive examination times, which complicates the analysis of such data. Refs. [9-11] reviewed existing methods that applied parametric models and nonparametric estimation of survival curves to interval-censored data. In particular, Ref. [12] proposed a nonparametric method for estimating the survival distribution, and [13] provided score statistics for parameter estimation with interval-censored data. However, for heterogeneous interval-censored data, mixture models should be considered for subgroup classification, and only limited research targets this area. Ref. [14] proposed estimation methods for Gaussian mixtures using MCMC methodology, and [15] proposed a semiparametric mixture model for antimicrobial resistance data with interval-censored observations. However, the methods mentioned above perform density estimation with a mixture model without conducting regression analysis on other observed covariates. Computational difficulties remain in conducting subgroup identification and regression analysis based on mixture models in survival analysis for interval-censored data.
In this paper, motivated by the Netherlands twin study on migraines, we propose a new latent Cox model for analyzing clustered and heterogeneous interval-censored data. The population is separated into a few subgroups according to the covariate effects. The baseline hazard functions for the subgroups, as well as the number of subgroups, are left unspecified to avoid restrictive distributional assumptions and to allow for flexibility.
Compared with existing mixture survival models for right-censored data in the literature [16-18], the proposed model aims to accomplish simultaneous subgroup identification and regression analysis. It is important to note that, compared with right-censored data, the incomplete information and computational complexity of interval-censored survival data pose greater challenges for the aforementioned tasks. Moreover, we investigate heterogeneity driven by unknown covariate effects without specifying the baseline hazard functions or the number of subgroups, which makes the model estimation more challenging and the computation more intensive. Our proposed nonparametric maximum likelihood estimation approach separates the parameters during estimation, which greatly reduces the computational complexity. In addition, the proposed EM algorithm enjoys the monotone ascent property for estimating the model parameters. Numerical studies demonstrate its good performance. A modified Bayesian information criterion is also proposed to select the number of mixing components [19].
The rest of the paper is organized as follows. In Section 2, we present the latent Cox model for clustered interval-censored data. In Section 3, we develop an estimation procedure for the proposed model using the EM algorithm. Selection of the number of subgroups and assessment of the finite-sample performance of the proposed methods are presented in Section 4. We further provide an application to the migraine data to illustrate the practical utility of the proposed methods in Section 5.

Data and Model
Let T_ij denote the response of interest (i.e., the failure time) for the jth subject in the ith cluster, where j = 1, …, n_i, i = 1, …, n, n_i is the number of subjects in the ith cluster, and n is the number of clusters in the dataset. Furthermore, T_ij is interval-censored and only known to belong to the interval (L_ij, R_ij]. The q-dimensional vector of covariates is denoted by X_ij = (X_ij1, …, X_ijq)'. The observations are summarized as Y_obs = {(L_ij, R_ij], X_ij; i = 1, …, n, j = 1, …, n_i}. To accommodate heterogeneous covariate effects that may exist among subgroups, we propose a latent Cox model for simultaneous subgroup identification and regression analysis. Specifically, the instantaneous hazard function for the jth subject in the ith cluster is postulated as

λ_ij(t) = λ_0i(t) exp(X_ij'β_i),

where λ_0i(·) is an unspecified baseline hazard function and β_i is the vector of regression coefficients for the ith cluster. As in [10,20], we make the following two assumptions: (A1) L_ij and R_ij are random, and (A2) T_ij is independent of (L_ij, R_ij]. It is important to note that the baseline hazard functions and the covariate effects are allowed to vary across the clusters and accordingly accommodate the heterogeneity. In the spirit of mixture modeling, and for extrapolation and interpretation purposes, we further assume that the n clusters come from M subgroups with M ≥ 1 and that clusters in the same subgroup share the same baseline hazard and covariate effects. In other words, let G = (G_1, …, G_M) be a partition of {1, …, n}. Let the mixing probabilities be π_m, m = 1, …, M, with π_1 + … + π_M = 1. For each cluster i = 1, …, n, with probability π_m we have i ∈ G_m, λ_0i(·) = λ_0m(·), and β_i = β_m. In practice, the number of subgroups M is unknown and will be estimated in a data-driven way; it is usually reasonable to assume that M is much smaller than n. Our goal is to estimate M and the model parameters Λ_0 = (Λ_01, …, Λ_0M), β = (β_1, …, β_M), and Π = (π_1, …, π_M).

The observed likelihood function of the ith cluster {L_ij, R_ij, X_ij}, j = 1, …, n_i, can be written as

f_i(Λ_0, β, Π) = Σ_{m=1}^{M} π_m f_i(m)(Λ_0m, β_m),

where f_i(m)(Λ_0m, β_m) denotes the likelihood function of the ith cluster when it belongs to the mth subgroup. When the ith cluster comes from the mth subgroup, its hazard function is λ_ij(m)(t) = λ_0m(t) exp(X_ij'β_m) for j = 1, …, n_i, where λ_0m(·) is the unspecified baseline hazard function and Λ_0m(·) is the corresponding cumulative baseline hazard of the mth subgroup; β_m is the corresponding effect of X_ij in the mth subgroup. Under (A1) and (A2), the contribution of subject j is P(L_ij < T_ij ≤ R_ij), so that

f_i(m)(Λ_0m, β_m) = Π_{j=1}^{n_i} [exp{−Λ_0m(L_ij) exp(X_ij'β_m)} − exp{−Λ_0m(R_ij) exp(X_ij'β_m)}].

Furthermore, we suppose that T_ij is monitored at a sequence of positive time points, and the log-likelihood ℓ(Λ_0, β, Π | Y_obs) based on the observed data is ℓ(Λ_0, β, Π | Y_obs) = Σ_{i=1}^{n} log f_i(Λ_0, β, Π). To estimate Λ_0, β, and Π, we adopt the nonparametric maximum likelihood estimation approach. Let 0 < t_1 < … < t_K denote the ordered distinct observation time points. The estimator for Λ_0m is a step function with Λ_0m(0) = 0 that jumps only at those time points, with respective jump sizes λ_0m(t_1), …, λ_0m(t_K). It follows that the observed log-likelihood in (3) can be rewritten in terms of these jump sizes.
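As a minimal numerical sketch of the per-cluster observed likelihood above, the following function evaluates Σ_m π_m f_i(m), assuming the subgroup cumulative baseline hazards Λ_0m are supplied as callables (the function name and interface are hypothetical, not from the paper):

```python
import numpy as np

def cluster_likelihood(L, R, X, pis, betas, Lambda0s):
    """Observed-data likelihood of one cluster under the latent Cox model.

    L, R      : arrays of left/right interval endpoints for the n_i subjects
                (use np.inf in R for right-censored subjects)
    X         : (n_i, q) covariate matrix
    pis       : (M,) mixing probabilities
    betas     : (M, q) subgroup regression coefficients
    Lambda0s  : list of M callables, cumulative baseline hazards Λ_0m(t)
    """
    total = 0.0
    for pi_m, beta_m, Lam in zip(pis, betas, Lambda0s):
        eta = np.exp(X @ beta_m)            # exp(X_ij' β_m)
        S_L = np.exp(-Lam(L) * eta)         # survival at left endpoints
        S_R = np.exp(-Lam(R) * eta)         # survival at right endpoints
        total += pi_m * np.prod(S_L - S_R)  # π_m · f_{i(m)}
    return total
```

For a single subject with a unit-exponential baseline (Λ_0m(t) = t) and zero covariate effect, the contribution over (1, 2] reduces to exp(−1) − exp(−2), matching the interval-censoring probability above.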

Estimation and Algorithm
The observed data with unknown subgroup memberships can be formulated as an incomplete-data problem in the EM framework.
We view the observed data {(L_ij, R_ij], X_ij; i = 1, …, n, j = 1, …, n_i} as incomplete and introduce unobserved Bernoulli random variables Z_im (with Z_im = 1 if cluster i belongs to subgroup m, m = 1, …, M) and latent Poisson random variables W_mijk, chosen so that the marginal likelihood of the observed data is the same as (4). We therefore develop an EM algorithm to maximize (4) by treating Z_im and W_mijk as missing data. The complete-data log-likelihood is then proportional to (6). In the M-step, we maximize (6): for any given β_m, closed-form updates are available as in (7) and (8), where m = 1, …, M and k = 1, …, K. After incorporating (8) into (6), we obtain a profile objective, and to update β_m we employ a Newton-Raphson algorithm. In the E-step, we evaluate the conditional expectations of Z_im and W_mijk involved in the M-step: the posterior mean of Z_im is given in (10), and the conditional expectation of W_mijk is given in (11). We now summarize the iterations between the E-step and M-step for the proposed algorithm as follows.
Step 2. Calculate the conditional expectations of Z im and W mijk via (10) and (11).
Step 3. Replace Z im in (7) by Ê(Z im ) and update the estimate of Π via (7).
We iterate between the E-step and M-step until the sum of the absolute differences of the estimates at two successive iterations is less than ε, i.e., the stopping criterion is that the L_1 distance between successive parameter estimates falls below ε, where ||α||_1 denotes the L_1 norm of α, i.e., ||α||_1 = Σ_{i=1}^{q} |α_i| for α = (α_1, …, α_q). In the following section, we set ε = 10^−3, and simulation studies are conducted to assess the finite sample performance of the proposed method; in particular, we propose a modified BIC criterion to select the number of subgroups M.
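To make the E-step weights, the mixing-probability update in Step 3, and the L_1 stopping rule concrete, the following toy sketch runs the EM iteration for the mixing probabilities alone, treating the per-cluster likelihoods f_{i(m)} as fixed inputs (in the full algorithm they depend on the current Λ_0m and β_m; the function name is hypothetical):

```python
import numpy as np

def em_mixing_probs(cluster_liks, pi_init, eps=1e-3, max_iter=500):
    """EM iteration for the mixing probabilities of a finite mixture.

    cluster_liks : (n, M) array; entry (i, m) = f_{i(m)}, the likelihood
        of cluster i under subgroup m (held fixed in this sketch).
    pi_init : (M,) initial mixing probabilities.
    """
    pi = np.asarray(pi_init, dtype=float)
    for _ in range(max_iter):
        # E-step: posterior membership weights E(Z_im) ∝ π_m f_{i(m)}
        w = pi * cluster_liks
        w /= w.sum(axis=1, keepdims=True)
        # M-step for Π: π_m = (1/n) Σ_i E(Z_im)
        pi_new = w.mean(axis=0)
        # Stop when the L1 change between iterations falls below eps
        if np.sum(np.abs(pi_new - pi)) < eps:
            return pi_new
        pi = pi_new
    return pi
```

Because each iteration is an exact EM update, the observed-data likelihood is nondecreasing across iterations, consistent with the monotone ascent property noted above.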

Simulation Study
As in mixture models [21], the number of subgroups M in the proposed model is unknown and is estimated in a data-driven manner. Here, we use a modified Bayesian information criterion (BIC; [19]) to choose the number of components M by minimizing the criterion function in (12), where β̂ = (β̂_1, …, β̂_M), N = Σ_{i=1}^{n} n_i is the total sample size, and q is the dimension of β_i. In the following, we conduct a set of simulation studies to assess the finite sample performance of our proposed method.
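Criterion (12) itself is not reproduced here, so the following is only a generic sketch of a BIC-type selector: it penalizes the maximized log-likelihood with an assumed parameter count of (M − 1) mixing probabilities plus M·q regression coefficients and a tuning constant c (c = 1 recovers ordinary BIC); the exact penalty in the paper may differ.

```python
import numpy as np

def modified_bic(loglik, M, q, N, c=1.0):
    """BIC-type criterion for selecting M: -2·loglik + c·(#params)·log N.

    Assumed parameter count (a sketch, not the paper's exact formula):
    (M - 1) free mixing probabilities plus M*q regression coefficients.
    """
    n_par = (M - 1) + M * q
    return -2.0 * loglik + c * n_par * np.log(N)

def select_M(logliks_by_M, q, N, c=1.0):
    """Pick the M (1-indexed) minimizing the criterion over candidate fits."""
    scores = [modified_bic(ll, M, q, N, c)
              for M, ll in enumerate(logliks_by_M, start=1)]
    return int(np.argmin(scores)) + 1
```

A larger c penalizes model complexity more heavily, which is the usual motivation for "modified" BIC variants in mixture settings.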
Example 1. We generate clustered interval-censored data from a latent Cox model with two covariates and three subgroups, where the covariates X_ij1 and X_ij2 are independent and both follow the standard normal distribution.
The n clusters are randomly assigned to the three subgroups with equal probabilities, i.e., P(i ∈ G_1) = P(i ∈ G_2) = P(i ∈ G_3) = 1/3. The cluster size is set to m for each cluster. We consider different combinations of the number of clusters (n) and the cluster size (m) to assess the performance of the proposed estimation procedure. We identify the number of subgroups M by minimizing the modified BIC given in (12). Table 1 presents the mean, median, and standard error (s.d.) of the estimated number of subgroups, denoted by M̂, and the empirical percentage of M̂ equal to the true number of subgroups based on 100 replications. It can be seen from Table 1 that for (n, m) = (400, 4) and (n, m) = (800, 2), the BIC identifies the true number of subgroups in all 100 replications, indicating its favorable performance. Table 2 reports the estimation results for the regression coefficients β̂_1, β̂_2, and β̂_3 and the mixing probabilities π̂_1 and π̂_2 based on 100 replications. The proposed method performs well and yields estimators with small biases for the case of (n, m) = (400, 4). However, as we increase the number of clusters to 800 and decrease the number of subjects within each cluster to 2, the biases of some estimators become larger even though the total sample size is the same for the two cases.
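The exact model specification of Example 1 is not shown above, but the generic data-generating mechanism can be sketched as follows: Cox event times are drawn by inverse-transform sampling, T = Λ_0^{-1}(−log U / exp(X'β)), and then interval-censored at a grid of examination times. The function, its arguments, and the examination grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_cluster(m_size, beta, Lambda0_inv, exam_times):
    """Simulate one cluster of interval-censored Cox data.

    Lambda0_inv : inverse cumulative baseline hazard (identity for a
                  unit-exponential baseline).
    exam_times  : increasing examination times; events past the last
                  exam are right-censored (R = inf).
    """
    X = rng.standard_normal((m_size, len(beta)))
    u = rng.uniform(size=m_size)
    # Inverse-transform: T = Λ0^{-1}(-log U / exp(X'β))
    T = Lambda0_inv(-np.log(u) / np.exp(X @ beta))
    L = np.zeros(m_size)
    R = np.full(m_size, np.inf)
    for j, t in enumerate(T):
        for a, b in zip(np.r_[0.0, exam_times[:-1]], exam_times):
            if a < t <= b:                  # event bracketed by two exams
                L[j], R[j] = a, b
                break
        else:
            L[j] = exam_times[-1]           # event beyond the last exam
    return L, R, X, T
```

Each simulated subject then satisfies L < T ≤ R (or T > L with R = ∞ when right-censored), which is the data structure assumed by the model.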
In Example 2, the covariates X_ij = (X_ij1, X_ij2, X_ij3)' are generated from a multivariate normal distribution with mean zero and a first-order autoregressive covariance structure Σ = (σ_st) with σ_st = 0.5^|s−t| for s, t = 1, 2, 3. The clusters are randomly assigned to two subgroups with equal probabilities, i.e., P(i ∈ G_1) = P(i ∈ G_2) = 1/2. The cluster size is set to m for all n clusters. We consider the cases of (n, m) = (400, 3) and (600, 2). As in Example 1, we estimate the number of subgroups M by minimizing the modified BIC given in (12). Table 3 reports the mean, median, and standard error (s.d.) of the estimator M̂ and the empirical percentage of M̂ equal to the true number of subgroups based on 100 replications. We observe that the median of M̂ is equal to the true number of subgroups, 2, and that the mean approaches 2 as the number of clusters increases. Moreover, the empirical percentage of correctly identifying the true number of subgroups is close to 1 when the number of clusters is moderately large. The estimation results for the regression coefficients β and mixing probabilities Π are summarized in Table 4. In terms of estimation accuracy, the proposed estimation procedure performs quite well and yields estimators with small biases for (n, m) = (400, 3). Similar to Example 1, as we increase the number of clusters to 600 and decrease the number of subjects within each cluster to 2, the biases of some estimators become larger even though the total sample size is the same for the two cases.
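The AR(1) covariate structure used in Example 2 is straightforward to construct: the covariance matrix has entries σ_st = ρ^|s−t|, from which correlated normal covariates can be drawn (the helper name is illustrative).

```python
import numpy as np

def ar1_cov(q, rho=0.5):
    """First-order autoregressive covariance Σ with σ_st = ρ^|s−t|."""
    idx = np.arange(q)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(1)
# Draw 10 mean-zero covariate vectors with the AR(1) structure, as in Example 2
X = rng.multivariate_normal(np.zeros(3), ar1_cov(3), size=10)  # shape (10, 3)
```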
In Example 3, based on the BIC criterion given in (12), we estimate the number of subgroups M and report the sample mean, median, and standard error (s.d.) of the estimated number of subgroups M̂ and the empirical percentage of M̂ equal to the true number of subgroups M based on 100 replications. We consider (n, m) = (200, 4) and (400, 2). The results are given in Table 5. We observe that in each replication, the number of subgroups is correctly identified as 1. The estimation results are summarized in Table 6. We find that the regression coefficients are estimated accurately with small biases for (n, m) = (200, 4). Similar to Examples 1 and 2, as we increase the number of clusters to 400 and decrease the number of subjects within each cluster to 2, the biases of some estimators become larger even though the total sample size is the same for the two cases. To assess the estimation accuracy of the cumulative baseline hazard function, Figure 1 plots the true cumulative baseline hazard function Λ_0(t) together with the estimated curve Λ̂_0(t). From Figure 1, it can be seen that the two curves are quite close to each other over the time periods (0, 2) and (6, 12). However, because no sample points fall in the time period (2, 6), the two curves exhibit a noticeable difference over this period.

An Application to the Netherlands Twin Study on Migraine
We now apply the proposed model to analyze the Netherlands twin migraine data. The participants were volunteer members of the Netherlands Twin Registry, which is maintained by the Department of Biological Psychology at the Vrije Universiteit in Amsterdam [22]. The data were collected between 1991 and 2002 as part of an ongoing study of health, lifestyle, and genetics involving a large cohort of Dutch twins and their relatives. The primary response of interest in the migraine study is the time when an individual first had a migraine. Since the individuals were followed up on a periodic basis, the time to event may be known only to belong to an interval and hence is interval-censored. The twins form clusters of size 2 and come from different genetic backgrounds, so they can naturally be classified into heterogeneous subgroups based on the genetic profiles of the twin families, which are not directly observed. Our analysis is based on 3975 monozygotic and dizygotic twin pairs. The left and right endpoints of the interval in which the individual had a migraine (in years) are denoted by L_ij and R_ij, respectively, for the jth individual in the ith cluster. In this dataset, L_ij and R_ij are random and are independent of the event time. Furthermore, the two covariates included in the model are gender (1 = male, 0 = female) and the type of twins (1 = monozygotic, 0 = dizygotic).
To explore the heterogeneity across the twins, as indicated by their initial health status, household lifestyle, disease progression, and genetic profiles, we assume that the twins can be classified into a few homogeneous subgroups, for each of which the conditional hazard function is postulated by a Cox model. We fit the migraine data with the proposed model for varying M, and the number of subgroups M is estimated by minimizing the BIC criterion function in (12). By the BIC criterion, the optimal M is 3. In Table 7, for the number of subgroups M = 1, 2, 3, 4, we report the maximum log-likelihood values (LL), the BIC values (BIC), and the estimated parameters. The model with three subgroups yields the best fit. The twins can be classified into three homogeneous subgroups with mixing probabilities of 77%, 19%, and 4%, respectively. The estimated regression coefficients for the three subgroups are also detailed in Table 7. In addition, the baseline cumulative hazard functions for the three subgroups are plotted in Figure 2. For the optimal model selected by the BIC criterion, we calculate the empirical standard errors and 95% CIs of the parameters by the bootstrap method. We repeatedly generate bootstrap samples G times and obtain bootstrap estimates (Π̂_g, β̂_g), g = 1, …, G, with G = 500. Then, the normal-based 100(1 − α)% bootstrap interval for π_1 is [π̄_1 − z_{α/2} ŝe(π_1), π̄_1 + z_{α/2} ŝe(π_1)], where π̄_1 = (Σ_{g=1}^{G} π̂_1g)/G and ŝe(π_1) = {Σ_{g=1}^{G} (π̂_1g − π̄_1)²/(G − 1)}^{1/2}. The bootstrap 100(1 − α)% percentile interval for π_1 is [π̂_1L, π̂_1U], where π̂_1L and π̂_1U are the (α/2)G-th and (1 − α/2)G-th order statistics of {π̂_1g}, g = 1, …, G. The confidence intervals and empirical standard errors for the other parameters can be calculated in a similar way, and the results are reported in Table 8.

Author Contributions: Data curation, X.H. and J.X.; Formal analysis, X.H.; Funding acquisition, J.X.; Investigation, J.X.; Methodology, J.X. All authors have read and agreed to the published version of the manuscript.
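The normal-based and percentile bootstrap intervals described in Section 5 can be sketched as follows; the percentile endpoints here use `np.quantile` rather than the exact (α/2)G-th order statistics, and the function name is illustrative.

```python
import numpy as np
from statistics import NormalDist

def bootstrap_intervals(est_g, alpha=0.05):
    """Normal-based and percentile bootstrap CIs from G bootstrap
    estimates of a scalar parameter (e.g. the mixing probability π_1)."""
    est_g = np.asarray(est_g, dtype=float)
    G = len(est_g)
    mean = est_g.mean()
    # Bootstrap standard error: {Σ(θ̂_g − θ̄)² / (G − 1)}^{1/2}
    se = np.sqrt(np.sum((est_g - mean) ** 2) / (G - 1))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{α/2} ≈ 1.96 for α = 0.05
    normal_ci = (mean - z * se, mean + z * se)
    # Percentile interval from the empirical (α/2, 1 − α/2) quantiles
    lo, hi = np.quantile(est_g, [alpha / 2, 1 - alpha / 2])
    return se, normal_ci, (lo, hi)
```

With G = 500 bootstrap replicates, both intervals are cheap to compute once the bootstrap estimates are in hand, and the same routine applies to every scalar parameter in Table 8.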

Figure 1. The dotted and solid lines plot the true and estimated baseline cumulative hazard functions, respectively. The estimated baseline cumulative hazard function is the empirical average of the estimated baseline cumulative hazard functions based on 50 replications with (n, m) = (200, 4) or (400, 2) in Example 3.

Figure 2. The estimated baseline cumulative hazard functions for the migraine data in the optimal model with three subgroups.

Table 1. The sample mean, median, and standard error (s.d.) of M̂ and the empirical percentage (per) of M̂ equal to the true number of subgroups based on 100 replications in Example 1.
Example 2. We simulate data from a latent Cox model with three covariates and two subgroups.

Table 3. The sample mean, median, and standard error (s.d.) of M̂ and the empirical percentage (per) of M̂ equal to the true number of subgroups based on 100 replications in Example 2.

Table 4. The empirical bias and standard error (s.d.) of the estimators π̂_1, β̂_1, and β̂_2 based on 100 replications in Example 2.

Example 3. We next generate data from the Cox model with two covariates.

Table 5. The sample mean, median, and standard error (s.d.) of M̂ and the empirical percentage (per) of M̂ equal to the true number of subgroups M based on 100 replications in Example 3.

Table 8. The estimated parameters for the optimal model. Notes: SE, the empirical standard error based on the bootstrap samples; CI †, normal-based bootstrap CI; CI ‡, percentile bootstrap CI.