Model Averaging for Accelerated Failure Time Models with Missing Censoring Indicators

: Model averaging has become a crucial statistical methodology, especially in situations where numerous models vie to elucidate a phenomenon. Over the past two decades, there has been substantial advancement in the theory of model averaging. However, a gap remains in the field regarding model averaging in the presence of missing censoring indicators. Therefore, in this paper, we present a new model-averaging method for accelerated failure time models with right censored data when censoring indicators are missing. The model-averaging weights are determined by minimizing the Mallows criterion. Under mild conditions, the calculated weights exhibit asymptotic optimality, leading to the model-averaging estimator achieving the lowest squared error asymptotically. Monte Carlo simulations demonstrate that the method proposed in this paper has lower mean squared errors compared to other model-selection and model-averaging methods. Finally, we conducted an empirical analysis using the real-world Acute Myeloid Leukemia (AML) dataset. The results of the empirical analysis demonstrate that the method proposed in this paper outperforms existing approaches in terms of predictive performance.


Introduction
In some practical scenarios, we often need to select useful models from a candidate model set.A popular approach to address this issue is model selection.Methods such as the Akaike Information Criterion (AIC) [1], Mallows' Cp [2] and Bayesian Information Criterion (BIC) [3] are designed to identify the best model.However, in cases where a single model does not receive strong support from the data, these model-selection methods may overlook valuable information from other candidate models, leading to issues of model-selection uncertainty and bias [4].
To tackle these challenges and enhance prediction accuracy, several model-averaging techniques have been developed to leverage all information from the candidate models.Taking inspiration from AIC and BIC, Buckland et al. [5] proposed smoothed AIC (SAIC) and smoothed BIC (SBIC) methods based on AIC and BIC, respectively.Hansen [6] introduced the Mallows model-averaging (MMA) estimator, obtaining weights through the minimization of Mallow's Cp criterion.The MMA estimator asymptotically attains the minimum squared error among the model-averaging estimators in its class.Subsequently, Wan et al. [7] relaxed the constraints of Hansen [6], allowing for non-nested candidate models and continuous weights.In practical applications, many datasets exhibit heteroscedasticity.Therefore, it is essential to explore model-averaging methods tailored for heteroscedastic settings.Firstly, Hensen and Racine [8] proposed Jackknife model averaging (JMA), which determines weights by minimizing a cross-validation criterion.JMA significantly reduces Mean Squared Error (MSE) compared to MMA when errors are heteroscedastic.Secondly, Liu and Okui [9] modified the MMA method proposed by Hensen [6] to make it suitable for heteroscedastic scenarios.Furthermore, Zhao et al. [10] extended [6]'s work by estimating the covariance matrix based on the weighted average of squared residuals corresponding to all candidate models.This approach improves the model average estimator under heteroskedasticity settings.
In survival analysis, the accelerated failure time (AFT) model provides a straightforward description of how covariates directly impact survival time and has consequently garnered widespread attention.There are several parameter-estimation methods for the Accelerated Failure Time (AFT) model, including Miller's estimator [11], Buckley-James estimator [12], Koul-Susarla-Van Ryzin (KSV) estimator [13] and WLS estimator [14].However, all these methods assume that the censoring indicator is observable.Therefore, Wang and Dinse [15] improved the KSV estimator to make it adaptable to situations where the censoring indicator is missing.
Under practical conditions, it is common to encounter situations where only the observed time is available and it is uncertain whether the event of interest has occurred.In such cases, data suffer from missingness in the censoring indicator.For example, in a clinical trial for lung cancer, a patient may die for unknown reasons and while the survival time is observed, it is uncertain whether the patient died specifically due to lung cancer.This situation leads to missingness in the censoring indicator.Previous studies have mainly addressed the issue of missingness in the censoring indicator under a specific model.Research on model averaging for right-censored data typically assumes that the censoring indicator is observable.Therefore, this paper adopts the inverse probability weighting method proposed by [15] to construct the response variable.Through appropriate weight-selection criteria, weights are chosen to build the model-averaged estimator for the accelerated failure time model.It significantly enhances the predictive performance of the model and mitigates the bias introduced by the selection of a single model.Compared to previous research, this paper makes two main contributions: First, it introduces a novel model-averaging method for the case of missingness in the censoring indicator.Second, the paper allows for heteroscedasticity and employs model-averaging techniques to estimate variance.
The remaining sections of this paper are organized as follows.In Section 2, we commence by introducing the notation and progressively delineate the methodology and associated theoretical properties of the proposed model-averaging approach.In Section 3, we report the Monte Carlo simulation results.In Section 4, we assess the predictive performance of the proposed model-averaging method against other approaches using the real-world Acute Myeloid Leukemia (AML) dataset.In Section 5, we provide a comprehensive summary of the entire paper and suggest future research directions in this area.All theorem proofs will be presented in Appendix A.

Methodology and Theoretical Property
We denote where T represents the survival time and V denotes the censored time.X = (X ′ 1 , X ′ 2 , . . ., X ′ n ) ′ denotes the covariate matrix for n independent observations, where X i = (x i1 , x i2 , . . ., x ip ) .The accelerated failure time model can be expressed as follows: where e i is the random error with E(e i |X i ) = 0 and E(e 2 i |X i ) = σ 2 i .We assume that there are M candidate models in the candidate model set.Where the mth candidate model contains p m covariates.Following [7], these candidate model forms are non-nested.The mth candidate model is where X m is an n × p m dimensional full column-rank matrix, ′ , e m = (e m1 , . . . ,e mn ) ′ .
In the case of right censored data, the response variable Y i might be censored, making it unobservable.We only observe (Z i , X i , δ i ), where Z i = min(Y i , C i ) and the censoring indicator δ i = I(Y i ≤ C i ).Define a missingness indicator ξ i which is 1 if δ i is observed and is 0 otherwise.When the censoring indicators are missing, the observed data are {Z i , X i , ξ i , ξ i δ i } .For simplicity, we set U i = (Z i , X i ) ′ .In this paper, similar to [15], we assume the missing mechanism for δ to be: This assumption is more stringent than the missing at random (MAR) condition yet less restrictive than the assumption of missing completely at random (MCAR).
Koul et al. [13] introduced a method that involves synthetic data for constructing linear regression models.Wang and Dinse [15] extended [13]'s method to address the situation where censoring indicators are missing.In our work, we follow the approach proposed by [15] to construct a response in the form of inverse probability weighting, specifically: where It is easy to observe that under the missing data mechanism in this paper: Similar to Equation (2), we have: where E(e Wi |X i ) = 0 , σ 2 Wi = var(e Wi |X i ).This is expressed in matrix form as: where And then the weighted least squares estimator of β m : where where The model-averaging estimator of µ is defined as follows: for any w ∈ H M , where 2 , where ∥ • ∥ denotes the Euclidean norm.Then the risk function is defined as: where The derivation of ( 10) is as follows: Regarding the choice of weights, a natural approach is to minimize the risk function to obtain the optimal weights.However, as shown in Equation ( 11), we recognize that the risk function includes the unknowns µ, which makes it infeasible to directly minimize the risk function to obtain the optimal weights.Therefore, we replace µ with Y W and seek an unbiased estimator of the risk function as the criterion for weight selection.
Define the criterion for weight selection as Wi .By disregarding a term that is independent of w, C G n (w) serves as an unbiased estimator of the risk function .
In practice, m(•) , π(•) and G n (•) are usually unknown; therefore, we need to estimate them.Firstly regarding the estimation of m(u), it is usually estimated by the Logit model.Suppose m(u) is estimated by the parametric model m 0 (u; θ), where m 0 (u; θ) = e Uθ 1+e Uθ .By the maximum likelihood estimation method, we can obtain the parameter estimate θ n for the parameter θ. π(z) usually can be estimated nonparametrically by where W(•) is a kernel function and b n is a bandwidth sequence.Next, we define u(z) = E(δ|Z = z), u(z) estimated nonparametrically by , where K(•) is a kernel function and h n is a bandwidth sequence.We adopt the following estimator of G n (z): , where R i denotes the rank of Z i .
Next, replacing m(•), π(•) and G n (•) with m 0 (•, •), π n (•) and G n (•), we have: And the corresponding weight selection criterion is as follows: where The weights for minimizing C G n (w) are given by: Then, we enumerate the necessary regularity conditions for the asymptotic optimality.
m is an M × 1 unit vector in which the mth element is 1 and the others are 0.For some integer 1 ≤ J < ∞ and some positive constant k such that E(e ii denote the ith diagonal element of P m .There exists a constant c such that ρ Condition (C1) is utilized in [16] and it ensures that 1 − G n (t) is not equal to 0. Condition (C2) mandates that the conditional expectation of µ i remains within bounded limits, in line with assumptions seen in prior research, including [7,17].Condition (C3) is a requirement commonly found in model-averaging literature (e.g., [7,18]).Condition (C4) mandates the non-degeneracy of the covariance matrix Ω as n → ∞.Similar assumptions can also be found in [9,10].Similar to [15], Conditions (C5) and (C6) impose constraints on the bounds of m(•), π(•) and bandwidth, respectively.Condition (C7) is frequently employed in the analysis of the asymptotic optimality of cross-validation methods, as seen in prior works like [8].
Theorem 1 establishes the asymptotic optimality of the model-averaging procedure employing weights w, as its squared loss converges to that of the infeasible best possible model average estimator.
In most cases, Ω is unknown and needs to be estimated.We estimate Ω using residuals derived from the model-averaging process: where σ 2 Wi = var( e Wi ).In the existing literature on model averaging, most estimates of variance are predominantly derived from the largest candidate model, as exemplified by works such as [6,16].In contrast, our approach, following [10], leverages information from all candidate models for estimation rather than relying on a single model.Such an estimation method is more robust.Replacing Ω by Ω(w) in (13), C(w) becomes The weights that minimize C G n (w) are as follows: This weight selection criterion C G n (w) is a cubic function of w.

Simulation
In the simulation study, we generate data from the accelerated failure time (AFT) model, log(T i ) = Y i = ∑ 1000 j=1 β j x ij + e i , where β j = 1/j 2 ; the observations of ) are generated from a multivariate normal distribution with zero mean and covariance matrix Σ = (σ ij ) with σ ij = 0.5 |i−j| .The errors e i follow normal distribution N(0, γ 2 (x 4 i2 + 0.01)).By varying the value of γ, we allow R 2 to range from 0.1 to 0.9.This variance specification closely resembles that of [8].However, we introduce a small constant, 0.01, to ensure that the variances remain strictly positive.The censoring time C i is generated from N(C 0 , 7).By varying the value of C 0 , we achieve censoring rates (CRs) of approximately 20%, 40%.We set sample sizes n = 150, 300.Here, our model configuration is set in a nested form, meaning the first m models include the first m regressors.The number of candidate models M was set to be ⌈3n 1/3 ⌉, where ⌈x⌉ denote the smallest integer greater than x.
Based on the missing mechanism described in this paper, we assume that the probability of missing censoring indicators, denoted as 1 − π(z), is determined via a logistic model: log{ [15], we employed the uniform kernel function W(x) = 1 2 for |x| ≤ 1 and W(x) = 0 otherwise.Additionally, we used the biweight kernel function K(x) = 15  16 (1 − 2x 2 + x 4 ) for |x| ≤ 1 and K(x) = 0 otherwise.The bandwidths were b n = h n = n − 1 3 max(Z).We estimated m(u) under the logistic model: log{ m(u) 1−m(u) } = γ 1 + γ 2 z + γ 3 x.As highlighted by [19], when the data on δ are completely (or quasi-completely) separated, the maximum likelihood estimate of γ = (γ 1 , γ 2 , γ 3 ) does not exist.In our simulation setup, the number of covariates significantly exceeds the sample size.Therefore, we employ the lasso method to estimate the parameters.
We compare the proposed Model-Averaging method for the Missing Censoring Indicators in the Heteroscedastic setting (HCIMA) with other classical model-selection and modelaveraging methods in this article.Brief descriptions of these methods are provided below: • The model-selection methods rely on AIC and BIC, where the AIC and BIC criterion for the mth model are defined as follows: • Model methods based on SAIC and SBIC: The weights for the mth candidate model are given by: where AIC m = AIC(m) − min(AIC) , BIC m = BIC(m) − min(BIC) .• Additionally, we compare our approach with the method that estimates the variance using the maximum candidate model (MCIMA).And the specifics of variance estimation and weight selection in their approach are as follows: where In the simulation, we utilize the Mean Squared Error (MSE) to evaluate the performance of various methods, where the MSE is defined as We present the mean of MSEs from 500 replications.
Figures 1 and 2, respectively, show the Mean Squared Error (MSE) values for various methods across 500 repetitions under different censored rates and sample sizes, with missing rates of 20% and 40%.In terms of Mean Squared Error (MSE), our proposed HCIMA method outperforms other approaches.Additionally, the MCIMA method performs better than existing methods in all cases except for when compared to HCIMA.Furthermore, it is evident that SAIC and SBIC outperform their respective AIC and BIC counterparts, further highlighting the advantages of model-averaging methods.Comparing Figures 1 and 2, it is observed that the MSE at MR = 20% is slightly higher than at MR = 40%.The reason for this occurrence is that when ξ i = 1, δ i = 0, the signs of Y Wi and Z i are opposite.As MR increases, the occurrence of the ξ i = 1, δ i = 0 situation decreases.Although this result may seem counterintuitive, it does not affect the performance of the method proposed in this paper, which still keeps its advantages in this case.

Real Data Analysis
In this section, we assess the predictive performance of our proposed HCIMA method using the real Acute Myeloid Leukemia (AML) dataset.This dataset contains 672 samples, including 97 variables such as patient age, survival time, gender, race, mutation count, etc.For more specific information about this dataset, we refer the reader to https://www.cbioportal.org/study/clinicalData?id=aml_ohsu_2018 (accessed on 13 December 2023).
We selected ten variables for analysis: Cause Of Death, Age, Sex, Overall Survival Status, Overall Survival Months (Survival Time), Number of Cumulative Treatment Stages, Cumulative Treatment Regimen Count, Mutation Count, Platelet Count and WBC (White Bloodcell Count).After removing rows with missing values, we retained a total of 396 samples.We treat samples with unknown causes of death as missing censoring indicators.Among these 396 samples, 76 have unknown causes of death and 167 samples are still alive after the clinical trial ends.Therefore, the missing rate is approximately 19% and the censoring rate is 42%.We focus on the impact of seven variables, excluding "Cause Of Death" and "Overall Survival Status" on Survival Time.Therefore, we can construct 2 7 − 1 = 127 non-nested candidate models.
We randomly select data from n 0 samples as the training dataset, while the remaining n 1 = n − n 0 samples are used as the testing dataset.We set the training dataset size to 50%, 60%, 70% and 80% of the total dataset size, respectively.Following [16,20], we employed the normalized mean squared prediction error (NMSPE) as the performance metric: where µ i represents the predicted value and µ mi denotes the value of µ for the mth model.We calculate the mean, the standard deviation and the optimal rate of each method over these 1000 repetitions.Specifically, the optimal rate refers to the frequency at which the minimum value is achieved across these 1000 repetitions.
Table 1 displays the mean, optimal rates and standard deviations of NMSPE for each method over 1000 repetitions.Consistent with the simulation results, the HCIMA method exhibits the lowest average NMSPE and standard deviation and the highest optimal rate.The MCIMA method also performs well, ranking second after HCIMA.This indicates that the proposed model-averaging methods in this paper demonstrate superior predictive performance compared to other approaches.

Discussion
To address the uncertainty in model selection and enhance predictive accuracy, this paper proposes a novel model-averaging approach for the accelerated failure time model with missing indicators.Moreover, we establish asymptotic optimality under certain mild conditions.In Monte Carlo simulations, the method proposed in this paper exhibits lower mean squared errors compared to other model-selection and model-averaging methods.Empirical results demonstrate that the proposed method has a lower NMSPE compared to other approaches, indicating its superior predictive performance.This further underscores the applicability of the proposed method to real-life data scenarios with missing censoring indicators.
In this paper, we introduce the inverse probability weighted form of response variable proposed in [15].The primary advantage of this form of response variable lies in its double robustness, making it less susceptible to the impact of model misspecification (if π(•) or m(•) is misspecified).However, as mentioned in [15], its drawback, compared to synthetic response [13], regression calibration and imputation [15], is a larger variance.Yet, in practical scenarios, the harm caused by model misspecification often outweighs the harm of higher variance.Therefore, in our work, we follow the recommendation of [15] to use the inverse probability weighted form of the response variable.A future research direction is to further enhance this response variable for better applicability in the context of missing censoring indicators.
As far as we know, there is currently very limited research on model averaging for missing censoring indicators.Therefore, there are still many questions that deserve further investigation.There is potential for extending our approach to high-dimensional data in terms of data and in terms of models, exploration into partial linear models, generalized linear models and other extensions could be pursued. .
Under our specific conditions, we can prove this lemma using the same techniques as the proof of (A.3) in [7].Therefore, we omit the proof here.

Lemma A3. Under Conditions
where K is a constant.By Condition (C2), we have 1 n µ T µ = O p (1) and 1 n e T W e W = O p (1).According to [15] (1).Combined with Condition (C1), we have: Similar to the proof of Lemma 6.2 in [16], we have: .
Furthermore, we can obtain .
With the three lemmas mentioned above, we can now proceed to prove Theorem 1.
According to Lemma A3, we have: .
Combining this with Lemma A3, we can conclude that (A9) is of o p (1), which establishes the proof for (A3).

Figure 1 .
Figure 1.Mean Squared Errors (MSEs) of various methods under different sample sizes and censor rates at MR = 20%.

Figure 2 .
Figure 2. Mean Squared Errors (MSEs) of various methods under different sample sizes and censor rates at MR = 40%.

Table 1 .
The mean, optimal rate and standard deviation of NMSPE.