Abstract
Current status data are encountered in a wide range of applications, including tumorigenicity experiments and demographic studies. In this setting, each subject is observed only once, and the only information obtained is whether the event of interest has occurred by the observation time. In addition to censoring, truncation is also very common in practice. This paper examines the regression analysis of current status data with informative censoring times in the presence of left truncation. We propose an inference approach based on sieve maximum likelihood estimation (SMLE). A copula-based approach is used to describe the relationship between the failure time of interest and the censoring time, and spline functions are employed to approximate the unknown nonparametric functions. We establish the asymptotic properties of the proposed estimator. Simulation studies suggest that the developed procedure works well in practice. We also apply the developed method to a real dataset from an AIDS cohort study.
MSC:
62N01; 62N02; 62G99
1. Introduction
Current status data commonly arise in demographic, tumorigenicity, and epidemiological studies [1,2,3]. A defining feature of current status data is that the failure time is never observed exactly. Instead, it is known only whether the failure time is smaller or larger than the observation (examination) time. Studies that produce such data typically observe each participant only once, often because of limited resources. In this manuscript, we consider the semiparametric regression analysis of current status data with left truncation and informative censoring, and propose an SMLE method for inference.
Several methods have been developed for the analysis of current status data. For example, under the proportional hazards (PH) model, ref. [1] considered the efficiency problem and established the asymptotic properties of the maximum likelihood estimators of the regression parameters and the baseline cumulative hazard function. Ref. [4] studied this problem under the additive hazards model and proposed an estimating equation approach for the regression coefficients. Ref. [5] discussed regression analysis under the proportional odds model.
Note that all of the studies above assume that the failure time is independent of the examination or observation time. When the two are not independent, the resulting data are generally referred to as current status data with informative observation times, or dependent current status data. Several studies have discussed the regression analysis of current status data under informative censoring. For example, ref. [2] discussed the regression analysis of current status data with informative examination times under the additive hazards regression model. Ref. [6] developed a class of semiparametric transformation models for dependent current status data. These studies introduced a frailty (latent variable) to describe the correlation between the failure time and the examination time. It is well known that this approach requires a specific distribution to be assumed for the latent variable, which limits its applicability. An alternative way to describe the correlation between the failure time and the examination time is to introduce a copula function. For example, ref. [7] employed this approach and discussed the regression analysis of current status data under the PH model. The copula method has also been applied to many other types of dependent data [8,9,10].
In addition to censoring, truncation is another statistical phenomenon that arises in various fields, including survival analysis, astronomy, epidemiology, and economics [11,12,13,14,15]. Subjects whose failure times are truncated provide no information to researchers. Left truncation occurs when only individuals whose event times exceed a certain random time (the left truncation time) are recorded. Under left truncation, individuals with smaller event times are less likely to be observed, which biases the study sample toward larger event times. Some studies have developed regression analyses of current status data with left truncation [16,17]. In the following, we discuss the regression analysis of current status data with both left truncation and informative censoring.
The remainder of the article is structured as follows: We introduce the models and assumptions in Section 2. In Section 3, we introduce the SMLE method, including the estimation procedure and asymptotic properties. In Section 4, we conduct simulation studies to evaluate the practical performance of the developed approach. In Section 5, we apply the established approaches to a real dataset. Our discussions and concluding remarks are presented in Section 6.
2. Notation, Assumptions, and Models
Suppose that a failure time study consists of n independent subjects. For subject i, let represent the failure time and be the p-dimensional vector of the covariate associated with the subject. As mentioned above, truncation is also very common in practice. For this, assume that, for every subject, there exists a left truncation time , such that . The examination time is denoted by . It is possible that Y is dependent on the failure time . Define . The observed data can be represented as follows: . In other words, for the failure time , we only have current status data with the left truncation available.
We assume that follows a Cox model given by
where is the hazard function of given , denotes an unspecified baseline hazard function, and denotes a vector of regression coefficients.
In practice, the covariate may also affect the observation time . So, we suppose that the hazard function of has the following form
where represents an unknown baseline hazard function, and represents the regression parameter.
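For concreteness, the two marginal models described above can be written out as follows. This is only a sketch under assumed notation: the symbols $T$, $Y$, and $Z$ follow the text, whereas $\lambda_T$, $\lambda_Y$, $\lambda_1$, $\lambda_2$, $\Lambda_1$, $\Lambda_2$, $\beta$, and $\gamma$ are placeholder names chosen here and may differ from the original notation. It also assumes, as the description suggests, that the examination time follows a proportional hazards form as well.

```latex
\begin{align*}
\lambda_T(t \mid Z) &= \lambda_1(t)\,\exp(Z^\top \beta), &
F_T(t \mid Z) &= 1 - \exp\{-\Lambda_1(t)\,e^{Z^\top \beta}\}, \\
\lambda_Y(y \mid Z) &= \lambda_2(y)\,\exp(Z^\top \gamma), &
F_Y(y \mid Z) &= 1 - \exp\{-\Lambda_2(y)\,e^{Z^\top \gamma}\},
\end{align*}
% with cumulative baseline hazards \Lambda_j(t) = \int_0^t \lambda_j(s)\,ds, \; j = 1, 2.
```

The distribution functions on the right follow from the hazards on the left by the usual relation $F(t) = 1 - \exp\{-\int_0^t \lambda(s)\,ds\}$; they are the marginals that enter the copula.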
To describe the correlation between and , we next introduce a copula function. Let represent the joint distribution function of and given . According to Theorem 2.3.3 in [9], there exists a copula function , satisfying
where and denote the marginal distribution function of and , respectively, is the association parameter representing the correlation between and . satisfies , , and . We then have the conditional distribution function of given and , as follows:
Let , and . We define to be the marginal density function of , given the covariate, so we can obtain
and
When , we have
For , we have
Therefore, the likelihood function based on is
Thus, the full likelihood function based on the i.i.d. sample has the following form
In the next section, we consider the maximization of the above likelihood function. It should be noted that, as mentioned by [3], for a specified parametric copula family, the association parameter cannot be identified without prior or additional information. Hence, in the next section, we assume that both the copula function and the association parameter are known.
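The conditional distribution of the failure time given the examination time, which drives the likelihood above, is the partial derivative of the copula with respect to its second argument. The following minimal sketch illustrates this for the FGM copula only. The function names are hypothetical, the inputs `u` and `v` stand for the two marginal distribution functions evaluated at the examination time, and left truncation is deliberately ignored, so this is not the full likelihood of the paper.

```python
def fgm_copula(u, v, alpha):
    """FGM copula C(u, v) = u v [1 + alpha (1 - u)(1 - v)], with |alpha| <= 1."""
    return u * v * (1.0 + alpha * (1.0 - u) * (1.0 - v))


def fgm_cond_u_given_v(u, v, alpha):
    """Partial derivative dC/dv, i.e. P(U <= u | V = v) under the FGM copula."""
    return u * (1.0 + alpha * (1.0 - u) * (1.0 - 2.0 * v))


def current_status_loglik_term(delta, u, v, alpha):
    """Log-probability of the current status indicator given the examination time.

    delta = 1 if the event occurred by the examination time, 0 otherwise.
    u, v  = marginal CDFs of failure and examination time at the examination time.
    Assumption: no truncation adjustment is applied in this sketch.
    """
    import math
    p = fgm_cond_u_given_v(u, v, alpha)
    return math.log(p) if delta == 1 else math.log(1.0 - p)
```

With `alpha = 0` the conditional probability reduces to `u`, which recovers the usual independent-censoring current status likelihood.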
3. Maximum Likelihood Estimation
Now, we discuss how to maximize the full likelihood function . Direct maximization is difficult because the likelihood contains not only finite-dimensional parameters but also infinite-dimensional functional parameters, and . We therefore approximate and by linear combinations of basis functions. Specifically, we use I-spline functions for this task [18]. Let represent the I-spline basis functions, where k and are the order of the spline and the number of interior knots, respectively. Additionally, and . Then, we define
as the sieve space, , and denote the range of . Thus, the functions in are all non-negative and non-decreasing on the interval [18]. Therefore, we can employ to approximate or replace in the likelihood function, and estimate the regression parameters , and coefficients simultaneously by maximizing the subject to , .
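To make the monotone spline approximation concrete, the sketch below builds an I-spline basis in plain Python. It relies on the standard identity that each I-spline equals a tail sum of clamped B-splines of the same degree; the function names, the knot placement, and the coefficient values in the usage note are illustrative assumptions, not the authors' implementation.

```python
def bspline_basis(x, knots, degree):
    """Values of all B-spline basis functions at x (Cox-de Boor recursion)."""
    n0 = len(knots) - 1
    b = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n0)]
    if x == knots[-1]:
        # Assign the right endpoint to the last non-empty knot interval.
        for i in range(n0 - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                b[i] = 1.0
                break
    for d in range(1, degree + 1):
        nb = []
        for i in range(n0 - d):
            left = right = 0.0
            den = knots[i + d] - knots[i]
            if den > 0:
                left = (x - knots[i]) / den * b[i]
            den = knots[i + d + 1] - knots[i + 1]
            if den > 0:
                right = (knots[i + d + 1] - x) / den * b[i + 1]
            nb.append(left + right)
        b = nb
    return b


def ispline_basis(x, interior_knots, lo, hi, degree):
    """I-spline basis at x on [lo, hi]: tail sums of the clamped B-spline basis."""
    knots = [lo] * (degree + 1) + sorted(interior_knots) + [hi] * (degree + 1)
    b = bspline_basis(x, knots, degree)
    out, tail = [], 0.0
    for value in reversed(b[1:]):  # drop the first B-spline: its tail sum is constant
        tail += value
        out.append(tail)
    out.reverse()
    return out


def monotone_spline(x, coefs, interior_knots, lo, hi, degree):
    """Non-negative, non-decreasing function when all coefficients are >= 0."""
    basis = ispline_basis(x, interior_knots, lo, hi, degree)
    return sum(c * v for c, v in zip(coefs, basis))
```

For example, `monotone_spline(1.5, [0.5, 0.0, 1.2, 0.3, 2.0], [1.0, 2.0, 3.0], 0.0, 4.0, 2)` evaluates a quadratic monotone spline with three interior knots. Each basis function rises from 0 at `lo` to 1 at `hi`, so the non-negativity constraint on the coefficients is what enforces monotonicity.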
One issue when using splines is how to choose k and . A simple approach for a given problem is to try several different values of k and and compare the results. Alternatively, one can employ the Bayesian information criterion (BIC) and choose the values of k and that give the smallest BIC.
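The BIC-based choice can be sketched as follows, using the standard definition BIC = -2 log L + (number of parameters) log n. How the parameters are counted (regression coefficients plus spline coefficients is assumed here) and the helper names are assumptions; the numbers in the usage note are made-up placeholders, not results from the paper.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; smaller values are preferred."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_by_bic(candidates, n_obs):
    """Pick the (order, number of interior knots) pair with the smallest BIC.

    candidates maps (k, m) -> (maximized log-likelihood, number of parameters).
    """
    return min(candidates, key=lambda key: bic(*candidates[key], n_obs))
```

For instance, with the purely hypothetical inputs `{(2, 3): (-120.0, 12), (2, 5): (-119.5, 16)}` and `n_obs = 200`, the small gain in log-likelihood from two extra knots does not offset the penalty, so `(2, 3)` is selected.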
Let represent the estimator of described above and be the true value of . To establish the asymptotic properties, we need to describe the regularity conditions first.
- (C1) The copula functions are differentiable and the partial derivatives satisfy the Lipschitz condition.
- (C2) The covariate Z has bounded support in .
- (C3) (i) If a constant and the constant vector satisfy almost surely, then and . (ii) Assume that for any open set B in , , where represents the probability measure generated by the copula function .
- (C4) For , suppose that the Fisher matrix is positive-definite, where is defined in Appendix A.
- (C5) Let denote the kth derivative of , . Assume they are Hölder-continuous with exponent . In other words, there exists a positive constant K and some , such that for all . In the following, let
According to [19,20], the aforementioned conditions are mild and are typically met in practical situations. The following theorems provide the large-sample properties of the estimators. Here, for function g, let , where is the probability measure generated by X.
Theorem 1.
Assume that the regularity conditions C1–C4 are satisfied. Then,
almost surely.
Theorem 2.
Assume that the regularity conditions C1–C4 are satisfied. Then
Theorem 3.
Assume that conditions C1–C5 are satisfied and . Then, as , we have ; the definition of Σ is in Appendix A.
We provide the proofs of the above theorems in Appendix A. To estimate the covariance matrix, we recommend a common and direct method based on the sieve likelihood function, i.e., using the inverse of the observed information matrix. Because the observed information matrix may be high-dimensional or poorly conditioned, this method can be computationally demanding. Nevertheless, the simulation results in the following section suggest that it typically performs well, particularly when k and are not very large.
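The covariance estimate based on the inverse observed information can be illustrated with a small, dependency-free sketch. It approximates the Hessian of a user-supplied log-likelihood by central finite differences and inverts it by Gauss-Jordan elimination; the function names, the step size, and the pivot tolerance are assumptions made for illustration, and a production implementation would more likely rely on analytic derivatives or a numerical library.

```python
import math


def numerical_hessian(f, theta, h=1e-4):
    """Central finite-difference Hessian of f at theta."""
    p = len(theta)
    hess = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            def shifted(di, dj):
                t = list(theta)
                t[i] += di
                t[j] += dj
                return f(t)
            hess[i][j] = (shifted(h, h) - shifted(h, -h)
                          - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return hess


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("information matrix is singular or poorly conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def standard_errors(loglik, theta_hat):
    """Square roots of the diagonal of the inverse observed information."""
    info = [[-v for v in row] for row in numerical_hessian(loglik, theta_hat)]
    cov = invert(info)
    return [math.sqrt(cov[i][i]) for i in range(len(theta_hat))]
```

The explicit singularity check reflects the caveat in the text: when the number of spline coefficients is large, the observed information can become nearly singular, and the resulting standard errors should then be treated with caution.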
4. Simulation Study
In this section, simulation studies are conducted to evaluate the performance of the developed procedure. We suppose that the covariates follow the Bernoulli(0.5) distribution. We first generate the left truncation time from the exponential distribution with parameter a, where the constant a is selected to provide a suitable percentage of left truncation. We set , and is generated from model (1). To generate the examination time , we consider the following three copula models:
These are the FGM, Frank, and Gumbel models, respectively. As mentioned above, the association parameter indicates the correlation between T and Y. Since the range of in the above three copula models is not the same, one needs a common measure of association between T and Y, such as Kendall’s . For the FGM model, , and for the Frank model, and . Under the Gumbel model, the relationship is .
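As a reference for the mapping between the association parameter and Kendall's tau, the sketch below codes the standard textbook relationships for the three families: tau = 2*alpha/9 for FGM, tau = 1 - (4/alpha)[1 - D1(alpha)] for Frank (with D1 the first Debye function), and tau = 1 - 1/alpha for Gumbel. The parameterizations assumed here are the common ones (FGM with |alpha| <= 1, Gumbel with alpha >= 1) and may differ from those used in the paper; the function names and the midpoint-rule integration are illustrative choices.

```python
import math


def tau_fgm(alpha):
    """Kendall's tau for the FGM copula, |alpha| <= 1, so |tau| <= 2/9."""
    return 2.0 * alpha / 9.0


def debye1(x, n_steps=2000):
    """First Debye function D1(x) = (1/x) * integral_0^x t / (e^t - 1) dt (midpoint rule)."""
    h = x / n_steps
    total = 0.0
    for i in range(n_steps):
        t = (i + 0.5) * h
        total += t / math.expm1(t)
    return total * h / x


def tau_frank(alpha):
    """Kendall's tau for the Frank copula, alpha != 0 (tau -> 0 as alpha -> 0)."""
    if alpha == 0:
        return 0.0
    return 1.0 - 4.0 / alpha * (1.0 - debye1(alpha))


def tau_gumbel(alpha):
    """Kendall's tau for the Gumbel copula, alpha >= 1."""
    return 1.0 - 1.0 / alpha
```

Because the FGM family can only reach |tau| <= 2/9, comparisons across the three families have to be made at values of tau that all of them can attain.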
Based on a fixed copula function, we set , then the examination time is generated from the conditional distribution, given . Specifically, we first generate a random number ; given and b, we can solve the following equation for ,
For the spline functions, we employ the quadratic splines with the quantiles of the pooled set of all ’s and ’s as three interior knots. The results shown below are based on 1000 replications.
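The conditional-sampling step described above (draw a uniform number, then solve the conditional-distribution equation for the examination time) has a closed form under the FGM copula, because the equation is quadratic. The sketch below shows that step only. It assumes unit baseline hazards for both margins, a scalar covariate, and hypothetical function and parameter names; the left-truncation step is omitted, so it is not a reproduction of the simulation design.

```python
import math
import random


def fgm_conditional_inverse(u, w, alpha):
    """Solve dC(u, v)/du = w for v under the FGM copula.

    dC/du = v * [1 + a * (1 - v)] with a = alpha * (1 - 2u); the root in [0, 1]
    is written in a form that stays stable when a is close to 0.
    """
    a = alpha * (1.0 - 2.0 * u)
    disc = (1.0 + a) ** 2 - 4.0 * a * w
    denom = (1.0 + a) + math.sqrt(max(disc, 0.0))
    return 0.0 if denom == 0.0 else 2.0 * w / denom


def draw_pair(z, beta, gamma, alpha, rng):
    """Draw (failure time, examination time, current status indicator).

    Assumption: unit baseline hazards, so T | z ~ Exp(rate e^{beta z}) and
    Y | z ~ Exp(rate e^{gamma z}); dependence is induced by the FGM copula.
    """
    rate_t = math.exp(beta * z)
    rate_y = math.exp(gamma * z)
    t = rng.expovariate(rate_t)
    u = 1.0 - math.exp(-rate_t * t)            # F_T(t | z)
    w = rng.random()                           # uniform draw for the conditional CDF
    v = min(fgm_conditional_inverse(u, w, alpha), 1.0 - 1e-12)
    y = -math.log1p(-v) / rate_y               # invert F_Y(. | z)
    delta = 1 if t <= y else 0
    return t, y, delta
```

A call such as `draw_pair(1, 0.5, 0.2, 0.9, random.Random(1))` returns one simulated subject; only `y`, `delta`, and the covariate would be retained as current status data.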
Table 1 and Table 2 report the simulation results under the FGM model with sample sizes and 400. The results show the estimated bias (Bias, the empirical average of the parameter estimates minus the true value), the sample standard error of the parameter estimates (SSE), the empirical average of the standard error estimates (SEE), and the empirical coverage percentage of the 95% confidence interval (CP). Figure 1 presents the boxplots of estimators of and with and under the FGM copula. It can be seen that the estimators have only a slight bias, which becomes smaller as n increases. The variance estimators accurately reflect the true variability, and the confidence intervals have proper coverage probabilities, which indicates that the normal approximation to the distribution of the estimated regression parameters is reasonable. The estimation results based on the Frank and Gumbel copulas are presented in Table 3, Table 4, Table 5 and Table 6; they lead to conclusions comparable to those from Table 1 and Table 2.
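The four summary measures reported in the tables can be computed from the replicated estimates as in the following minimal sketch. The function name and the input layout (one estimate and one standard error estimate per replication) are assumptions, and the 95% interval is taken to be the usual Wald interval, estimate ± 1.96 × standard error.

```python
import statistics


def summarize(estimates, std_errors, true_value, z=1.96):
    """Bias, SSE, SEE, and CP for one parameter across simulation replications."""
    bias = statistics.mean(estimates) - true_value
    sse = statistics.stdev(estimates)            # sample SD of the estimates
    see = statistics.mean(std_errors)            # average estimated standard error
    covered = [abs(e - true_value) <= z * s for e, s in zip(estimates, std_errors)]
    cp = sum(covered) / len(covered)             # Wald-interval coverage proportion
    return {"bias": bias, "sse": sse, "see": see, "cp": cp}
```

Agreement between SSE and SEE indicates that the variance estimator is working, and a CP close to 0.95 supports the normal approximation.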
Table 1. Simulation results under the FGM model with .
Table 2. Simulation results under the FGM model with .
Figure 1. Boxplots for the estimators of and with and under the FGM copula.
Table 3. Simulation results under the Frank model with .
Table 4. Simulation results under the Frank model with .
Table 5. Simulation results under the Gumbel model with .
Table 6. Simulation results under the Gumbel model with .
Regarding model misspecification, Table 7 shows the estimation results for data generated under the Frank model but analyzed under the FGM model. In the table, denotes the value of Kendall’s used in the estimation. The table suggests that the estimator of may be biased when the copula is specified correctly but is misspecified. When is specified correctly but the copula model is misspecified, the estimator seems relatively reasonable. In addition, the estimator of seems to be less sensitive to the choice of the association parameter or the copula model. We tried other set-ups and obtained similar conclusions.
Table 7. Simulation results for model misspecification.
5. An Application
Now, we apply our method to an AIDS cohort study of hemophiliacs analyzed by [21], among others. The study included 257 hemophilia patients who had been receiving treatment at medical centers in France since 1978. Because the blood factor used for treatment was potentially contaminated, these patients were at risk of contracting HIV-1. In this study, the failure time represents the duration from HIV-1 infection to AIDS diagnosis. Patients were divided into two groups, heavily and lightly treated, based on the blood volume they received. Here, the primary objective was to evaluate the effect of the treatment on the total time to HIV diagnosis from the beginning of the treatment.
The times of HIV-1 infection and AIDS diagnosis cannot be observed exactly because the patients were examined only periodically; only the time intervals that contain HIV-1 infection and AIDS diagnosis are observed. Here, the left truncation time is taken as the midpoint of the examination interval for HIV-1 infection, and the observation time is taken as the right endpoint of the examination interval for AIDS diagnosis [21]. For our analysis, we focus on the 188 patients identified as HIV-1-infected during the study. Of these, 41 had been diagnosed with AIDS.
Let the covariate be 1 if the ith subject is in the heavily treated group, and 0 otherwise. As in the simulation study, we use quadratic spline functions with , or 8 for approximation, and use the FGM, Frank, and Gumbel copulas for dependent censoring. BIC values were calculated for each combination, and the smallest one is given by the FGM copula model with and . We present the results obtained under the FGM model in Table 8 with and 7, and several values. The table shows the estimated treatment effect , the estimated effect on the examination time , the estimated standard error (SE), and the p-value for testing the absence of treatment effects. The results indicate that the patients in the heavily treated group had a higher hazard of being diagnosed with AIDS, which is similar to the conclusions presented in [16].
Table 8. Estimation results for the AIDS data.
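A p-value of the kind reported alongside each estimate and its standard error can be obtained from a two-sided Wald test, sketched below. The text does not state which test was used, so the Wald form is an assumption, and the function name is hypothetical.

```python
import math


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: parameter = 0, using the normal approximation."""
    z = estimate / std_error
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2.0))
```

The normal reference distribution is motivated by the asymptotic normality result for the regression parameter estimators.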
6. Discussion and Concluding Remarks
In the previous sections, we discussed the regression analysis of dependent current status data with left truncation and developed an SMLE method for inference. The developed procedure uses a copula function to describe the correlation between the failure time and the examination time, and spline functions to approximate the unknown nonparametric functions in the model. Simulation studies suggest that the considered approach works well in practice.
The presented approach could be extended to other statistical models, such as the linear transformation model and the accelerated failure time model, with comparable estimation procedures. In our approach, we applied a copula model to construct the joint distribution of the failure time and the examination time. One direction for future study is to use frailty methods to describe the dependence between the two and to establish the corresponding statistical methods. Furthermore, although we adopted I-splines to approximate the unknown functions, other basis functions could also be used, such as monotone B-splines, Bernstein polynomials, and even step functions.
Author Contributions
Methodology, T.H.; Software, M.Z. and D.X.; Writing—original draft, D.X.; Writing—review & editing, J.S.; Supervision, J.S.; Project administration, S.Z.; Funding acquisition, D.X. All authors have read and agreed to the published version of the manuscript.
Funding
Xu’s work was partially supported by the National Natural Science Foundation of China (NSFC) (12001093) and the National Key Research and Development Program of China (no. 2020YFA0714102). Zhao’s work was partially supported by the National Natural Science Foundation of China (NSFC) (12071176).
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proofs of Theorems 1–3
In the following proofs, let and represent the empirical process of . Let
be the sieve space for , where , and H be a positive number. In the following, let represent constants whose values may vary at different locations.
Proof of Theorem 1.
Let represent the log-likelihood of one observation . For consistency, let .
According to Lemma A.1 in [22], the covering number satisfies
where In addition, by inequality (31) in [23], we acquire
Let , , , and . Hence, we have
If then
Thus, by condition (C3), we have Combining (A2) and (A3), we can obtain with Thus, and . Therefore, by (A1), we have and . Finally, we conclude that . □
Proof of Theorem 2.
In order to prove Theorem 2, for any define with According to [24], it can be shown that , where After a little bit of algebra, we can obtain for any
Hence, based on Lemma 3.4.2 in [25], we have that
where . Therefore, we have , and decreases with . In addition, we also have here, with . Based on Theorem 3.2.5 in [25], we obtain . Then, according to Lemma A1 in [26], we can obtain . □
Proof of Theorem 3.
First, let represent the true value of , and let V represent the linear span of . Define For any denote
to be the directional derivative of at the direction . Define
the Fisher inner product on W, and the corresponding norm . Let denote the closure of W under the Fisher norm. Therefore, is a Hilbert space.
Let
where is a vector with . So, is a linear function on . For , let
Then we have . Based on the Riesz theorem, there exists , so that for all with
Since , by the Cramér–Wold device, in order to prove Theorem 3, it suffices to show
Recall that . For each component, let be the minimizer of
where , is a -dimensional vector whose qth element is 1 and all other elements are 0. and denote the directional derivatives with respect to and , respectively. Then, according to a similar method by [27], we can obtain where . This completes the proof of Theorem 3. □
References
- Huang, J. Efficient estimation for the proportional hazards model with interval censoring. Ann. Stat. 1996, 24, 540–568.
- Zhang, Z.; Sun, J.; Sun, L. Statistical analysis of current status data with informative observation times. Stat. Med. 2005, 24, 1399–1407.
- Titman, A.C. A pool-adjacent-violators type algorithm for non-parametric estimation of current status data with dependent censoring. Lifetime Data Anal. 2014, 20, 444–458.
- Lin, D.Y.; Oakes, D.; Ying, Z. Additive hazards regression with current status data. Biometrika 1998, 85, 289–298.
- Rossini, A.J.; Tsiatis, A.A. A semiparametric proportional odds regression model for the analysis of current status data. J. Am. Stat. Assoc. 1996, 91, 713–721.
- Chen, C.M.; Lu, T.F.C.; Chen, M.H.; Hsu, C.M. Semiparametric transformation models for current status data with informative censoring. Biom. J. 2012, 54, 641–656.
- Ma, L.; Hu, T.; Sun, J. Sieve maximum likelihood regression analysis of dependent current status data. Biometrika 2015, 102, 731–738.
- Zheng, M.; Klein, J.P. Estimates of marginal survival for dependent competing risks based on an assumed copula. Biometrika 1995, 82, 127–138.
- Nelsen, R.B. An Introduction to Copulas; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
- Chen, M.H.; Tong, X.; Sun, J. A frailty model approach for regression analysis of multivariate current status data. Stat. Med. 2009, 28, 3424–3436.
- Bilker, W.B.; Wang, M.C. A semiparametric extension of the Mann-Whitney test for randomly truncated data. Biometrics 1996, 52, 10–20.
- Rennert, L.; Xie, S.X. Bias induced by ignoring double truncation inherent in autopsy-confirmed survival studies of neurodegenerative diseases. Stat. Med. 2019, 38, 3599–3613.
- Dorre, A. Bayesian estimation of a lifetime distribution under double truncation caused by time-restricted data collection. Stat. Pap. 2020, 61, 945–965.
- Saha, A.; Sundaram, R. Variable selection for discrete survival model with frailty in presence of left truncation and right censoring: Studying association of environmental toxicants on time-to-pregnancy. Stat. Med. 2023, 42, 193–208.
- Withana Gamage, P.W.; McMahan, C.S.; Wang, L. A flexible parametric approach for analyzing arbitrarily censored data that are potentially subject to left truncation under the proportional hazards model. Lifetime Data Anal. 2023, 29, 188–212.
- Kim, J.S. Efficient estimation for the proportional hazards model with left-truncated and case 1 interval-censored data. Stat. Sin. 2003, 13, 519–537.
- Sun, T.; Li, Y.; Xiao, Z.; Ding, Y.; Wang, X. Semiparametric copula method for semi-competing risks data subject to interval censoring and left truncation: Application to disability in elderly. Stat. Methods Med. Res. 2023, 32, 656–670.
- Schumaker, L.L. Spline Functions: Basic Theory; Wiley: Hoboken, NJ, USA, 1981.
- Zhang, Y.; Hua, L.; Huang, J. A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand. J. Stat. 2010, 37, 338–354.
- Huang, J.; Rossini, A.J. Sieve estimation for the proportional odds failure-time regression model with interval censoring. J. Am. Stat. Assoc. 1997, 92, 960–967.
- Kim, M.Y.; De Gruttola, V.G.; Lagakos, S.W. Analyzing doubly censored data with covariates, with application to AIDS. Biometrics 1993, 49, 13–22.
- Xu, D.; Zhao, S.; Hu, T.; Sun, J. Regression analysis of informatively interval-censored failure time data with semiparametric linear transformation model. J. Nonparametr. Stat. 2019, 31, 663–679.
- Pollard, D. Convergence of Stochastic Processes; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1984.
- Shen, X.; Wong, W.H. Convergence rate of sieve estimates. Ann. Stat. 1994, 22, 580–615.
- van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes: With Applications to Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1996.
- Lu, M.; Zhang, Y.; Huang, J. Estimation of the mean function with panel count data using monotone polynomial splines. Biometrika 2007, 94, 705–718.
- Chen, X.; Fan, Y.; Tsyrennikov, V. Efficient estimation of semiparametric multivariate copula models. J. Am. Stat. Assoc. 2006, 101, 1228–1240.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).