Optimal Model Averaging for Semiparametric Partially Linear Models with Censored Data

In the past few decades, model averaging has received extensive attention and has been regarded as a feasible alternative to model selection. However, most of this work is based on a parametric model framework and complete datasets. This paper develops frequentist model-averaging estimation for semiparametric partially linear models with censored responses. The nonparametric function is approximated by B-splines, and the weights in the model-averaging estimator are selected by minimizing a leave-one-out cross-validation criterion. The resulting model-averaging estimator is proved to be asymptotically optimal in the sense of achieving the lowest possible squared error. A simulation study demonstrates that the proposed method is superior to traditional model-selection and model-averaging methods. Finally, as an illustration, the proposed procedure is applied to two real datasets.


Introduction
The semiparametric partially linear model (PLM), proposed by [1], has attracted extensive attention in statistics because it combines the interpretability of the linear model with the flexibility of the nonparametric model. A large body of literature has explored estimation methods for this model, covering both the parametric part and the nonparametric function; see, for example, [2][3][4][5]. The premise of the methods in the aforementioned literature is that the model is correctly specified. However, in real data analysis, researchers typically collect a variety of variables and are not sure which of them should be included in the true model. This kind of uncertainty is generally referred to as model uncertainty, and it complicates statistical analysis considerably.
Refs. [6][7][8] pointed out that model selection and model averaging are two mainstream approaches to dealing with model uncertainty. Model selection, which has a long history, selects a single model from a series of candidate models through selection criteria, for example, the Akaike information criterion (AIC [9]), the Bayesian information criterion (BIC [10]), the focused information criterion (FIC [11]), and so on. In addition, shrinkage-estimation-based variable selection has also been applied to determine which variables are needed to build a PLM (see, e.g., [12][13][14], among others). These model-selection methods can be viewed as combining a series of candidate models while assigning a weight of 1 to the selected model and 0 to all other models.
As an important alternative to model selection, model averaging incorporates model uncertainty into statistical analysis by assigning a nonzero weight vector to a set of candidate models, which frequently leads to more effective results (see [15]). Bayesian model averaging, an important branch of model averaging, has been fully developed in the past decades; see [16] for details. In the current paper, we focus on the model-averaging approach for PLMs from a frequentist perspective. Since ref. [17] pioneered the use of the Mallows criterion for weight choice in model averaging, research on asymptotically optimal model averaging has grown rapidly. Various optimal model-averaging methods have been proposed, including jackknife model averaging (JMA [7]), Kullback-Leibler model averaging [18], generalized least-squares model averaging [19], leave-subject-out cross-validation [20], K-fold cross-validation [21], and so on. In addition, optimal model-averaging methods have been extended to quantile regression [22], semiparametric models [23,24], missing data [25,26], functional data [27], measurement error data [28], and high-dimensional data [29,30].
Censored data are ubiquitous in the fields of biomedicine, industry, econometrics, etc. For example, in biomedicine, when some sampled individuals are lost to follow-up before the end of the study or drop out during the study, the survival time will be subject to censoring. Compared with complete data, censoring makes some observations impossible to observe fully, which increases the difficulty of statistical analysis. Although an extensive body of literature on estimation methods in the presence of censored data has been developed, such as [31,32], there is little work on model-averaging approaches for censored data. Based on the FIC advocated in [11], refs. [33][34][35] developed model-averaging methods for different regression models with censored data under the local misspecification framework, where the weights of the model-averaging estimators were constructed from information criterion values rather than being selected in a data-driven fashion. Moreover, this framework requires that the distance between each candidate model and the true model be O(1/√n). This means every candidate model is close to the true model when the sample size is large, which is unrealistic.
Without the local misspecification framework, ref. [36] constructed the optimal model-averaging estimator for a high-dimensional linear model with censored data by adapting a leave-one-out cross-validation criterion. Ref. [37] studied the Mallows model-averaging method for linear models with censored responses, and the resulting model-averaging estimator was proved to be asymptotically optimal in terms of minimizing the squared error loss. The optimal model-averaging methods for censored data mentioned above are based on classical linear models. The primary objective of the current paper is to construct an optimal model-averaging estimator for the semiparametric PLM with censored responses, in which the weight vector is selected by minimizing a leave-one-out cross-validation criterion. Compared with [36,37], we confront two major challenges. Firstly, the nonparametric function in the PLM significantly complicates the construction of the model-averaging estimator and the development of the weight choice criterion. Secondly, our proof of optimality cannot follow the approach used for the linear model, since those proof techniques cannot be directly applied when a nonparametric part is present.
The plan of this article is as follows. In Section 2, we describe the model setup and introduce the parametric estimation method in the candidate PLM. Section 3 constructs the model-averaging estimator and proposes a weight choice criterion. Section 4 establishes the asymptotic optimality of the model-averaging estimator. Section 5 explores the finite sample performance of our method by a simulation study. Section 6 applies the proposed method to two real datasets. Section 7 gives some conclusions. Proofs are listed in Appendix A.

Model Setup and Parametric Estimation
To facilitate presentation, we first list the basic notations used in this paper in Table 1. Then, we consider the following PLM:

Y_i = X_i^⊤ β + g(U_i) + ε_i, i = 1, ..., n, (1)

where Y_i is a response variable with a continuous distribution function F(·), the countably infinite vector X_i = (x_{i1}, x_{i2}, ...)^⊤ is linearly related to Y_i, U_i is a covariate nonlinearly related to Y_i, g(·) is an unknown smooth function, and ε_i is the model error with E(ε_i | X_i, U_i) = 0.

Table 1. Notations and their descriptions.

Notations: Descriptions
Y_i: The response variable, a transformation of T_i
X_i: The covariate vector of the ith subject
δ_i: The censoring indicator of the ith subject
C_i: The last follow-up time of the ith subject
Z_i: The observed time, equal to min(Y_i, C_i)
G(·): The cumulative distribution function of C_i
Z_G: n × 1 synthetic response vector
µ: n × 1 conditional mean vector of the response
B_(s): n × k_n B-spline basis matrix for the sth model
β_(s): p_s × 1 linear regression coefficient vector for the sth model
α_(s): k_n × 1 spline coefficient vector for the sth model
Ĝ_n(·): The Kaplan-Meier estimator of G(·)
β̂_{G,(s)}: The estimator of β_(s) with G(·)
μ̂_{G,(s)}: The estimator of µ for the sth model with G(·)

In survival analysis, we assume Y_i to be a known monotonic transformation of the survival time T_i, for example, the commonly used logarithm Y_i = log T_i. Y_i may be censored by a censoring time C_i and hence cannot always be observed completely. We consider a sample of independent observations (Z_i, δ_i, X_i, U_i) for i = 1, ..., n, where Z_i = min(Y_i, C_i) and δ_i = I(Y_i ≤ C_i). Let G(·) be the cumulative distribution function of the censoring time, and Z_{G,i} = Z_i δ_i / {1 − G(Z_i)} be a synthetic response. Then, from [38], it is not difficult to verify that

E(Z_{G,i} | X_i, U_i) = E(Y_i | X_i, U_i) = µ_i. (2)

Model (2) can be expressed in matrix form as

Z_G = µ + e_G = Xβ + M + e_G, (3)

where Z_G = (Z_{G,1}, ..., Z_{G,n})^⊤ is an n-dimensional synthetic response vector, the conditional mean µ = (µ_1, ..., µ_n)^⊤ is an n-dimensional vector, M = {g(U_1), ..., g(U_n)}^⊤, X = (X_1, ..., X_n)^⊤ is a linear covariate matrix, U = (U_1, ..., U_n)^⊤, and e_G = (e_{G,1}, ..., e_{G,n})^⊤ is an n-dimensional error vector satisfying E(e_G | X, U) = 0 and E(e_G e_G^⊤ | X, U) = Ω_G = diag(σ²_{G,1}, ..., σ²_{G,n}). Assume that we have a total of S candidate PLMs to approximate the true data-generating process, where S is allowed to go to infinity. Suppose the sth candidate model is

Z_G = X_(s) β_(s) + M_(s) + e_{G,(s)}, (4)

where X_(s) = (X_{(s),1}, ..., X_{(s),n})^⊤ is an n × p_s covariate matrix which includes p_s columns of X and has full column rank, β_(s) = (β_{(s)1}, ..., β_{(s)p_s})^⊤ is the corresponding p_s × 1 unknown linear regression coefficient vector, M_(s) = {g_(s)(U_1), ..., g_(s)(U_n)}^⊤ is an n × 1 unknown nonparametric function vector, and e_{G,(s)} is the model error.
To obtain the estimator of µ under the sth candidate model, we should first estimate the coefficient vector β_(s) and the nonparametric function vector M_(s). There are many estimation methods for model (4), including kernel smoothing, polynomial spline smoothing, and so on. Recently, ref. [39] pointed out that using B-splines to approximate nonparametric functions in the area of model averaging has great advantages; therefore, in this paper, we adopt the spline technique to estimate the unknowns in model (4). Denote a sequence of interior knots 0 = u_0 < u_1 < ... < u_{J_n} < u_{J_n + 1} = 1, where J_n is the number of interior knots. Let ζ_n be the polynomial spline space on the interval [0, 1] of degree r. From [40], the nonparametric function in the PLM can be well approximated by a B-spline expansion. Then, for g_(s)(u), one can write

g_(s)(u) ≈ B_(s)(u)^⊤ α_(s), (5)

where B_(s)(·) = (B_{(s)1}(·), ..., B_{(s)k_n}(·))^⊤ is the normalized B-spline basis function vector in the sth candidate model, k_n = J_n + r + 1, and α_(s) = (α_{(s)1}, ..., α_{(s)k_n})^⊤ is the spline coefficient vector. Therefore, there exists a design matrix

X*_(s) = (X_(s), B_(s)), with B_(s) = {B_(s)(U_1), ..., B_(s)(U_n)}^⊤, (6)

and the corresponding unknown parameter θ_(s) = (β_(s)^⊤, α_(s)^⊤)^⊤, where X*_(s) is supposed to be of full column rank. By regressing Z_G on X*_(s), we can obtain the least-squares estimators of β_(s) and α_(s):

β̂_{G,(s)} = (X̃_(s)^⊤ X̃_(s))^{-1} X̃_(s)^⊤ Z_G (7)

and

α̂_{G,(s)} = (B_(s)^⊤ B_(s))^{-1} B_(s)^⊤ (Z_G − X_(s) β̂_{G,(s)}), (8)

where Q_(s) = B_(s)(B_(s)^⊤ B_(s))^{-1} B_(s)^⊤ and X̃_(s) = (I − Q_(s)) X_(s). Therefore, the estimator of µ under the sth candidate model is given by

μ̂_{G,(s)} = X_(s) β̂_{G,(s)} + B_(s) α̂_{G,(s)} = P_(s) Z_G, (9)

where P_(s) = Q_(s) + X̃_(s)(X̃_(s)^⊤ X̃_(s))^{-1} X̃_(s)^⊤. From Equation (9), we find that μ̂_{G,(s)} depends linearly on Z_G.
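As an illustration of the estimation step above, the following Python sketch (our own illustration, not the paper's code; variable names are ours) builds the projection matrix and fitted values for one candidate model. A cubic-degree polynomial basis stands in for the B-spline basis B_(s); swapping in a proper B-spline basis matrix changes only the construction of B, while the projection algebra is identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                      # linear covariates X_(s)
U = rng.uniform(size=n)                          # nonparametric covariate
beta = np.array([1.0, 0.5, -0.5])
Z = X @ beta + np.sin(2 * np.pi * U) + rng.normal(scale=0.3, size=n)

# Basis matrix for g; a degree-5 polynomial basis stands in for B_(s)
B = np.vander(U, 6, increasing=True)

# Q_(s) = B (B'B)^{-1} B', the projection onto the spline space
Q = B @ np.linalg.solve(B.T @ B, B.T)
Xt = X - Q @ X                                   # X-tilde = (I - Q_(s)) X_(s)

# Least-squares estimators and the fitted conditional mean mu-hat = P_(s) Z
beta_hat = np.linalg.solve(Xt.T @ Xt, Xt.T @ Z)
alpha_hat = np.linalg.solve(B.T @ B, B.T @ (Z - X @ beta_hat))
P = Q + Xt @ np.linalg.solve(Xt.T @ Xt, Xt.T)
mu_hat = P @ Z
```

Note that P is the orthogonal projection onto the columns of (X, B), so mu_hat coincides with X @ beta_hat + B @ alpha_hat.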

Model-Averaging Estimator and Weight Choice Criterion
Let ω = (ω_1, ..., ω_S)^⊤ be a weight vector belonging to the set W = {ω ∈ [0, 1]^S : ∑_{s=1}^S ω_s = 1}. Then the model-averaging estimator of µ can be formulated as

μ̂_G(ω) = ∑_{s=1}^S ω_s μ̂_{G,(s)} = P(ω) Z_G, (10)

where P(ω) = ∑_{s=1}^S ω_s P_(s). Motivated by [7], we propose a leave-one-out cross-validation criterion to select the weight vector ω for Equation (10) in the PLM framework. Let X_(s)^{(−i)}, B_(s)^{(−i)}, and Z_G^{(−i)} be the matrices/vectors X_(s), B_(s), and Z_G with the ith row deleted. The leave-one-out estimator of µ_i in the sth candidate model is given by

μ̃_{G,(s),i} = X_{(s),i}^⊤ β̃_{G,(s)}^{(−i)} + B_(s)(U_i)^⊤ α̃_{G,(s)}^{(−i)}, (11)

where β̃_{G,(s)}^{(−i)} and α̃_{G,(s)}^{(−i)} are the least-squares estimators of β_(s) and α_(s) computed from the sample with the ith observation deleted. Denote the sth jackknife estimator and the corresponding jackknife version of the averaging estimator as μ̃_{G,(s)} = (μ̃_{G,(s),1}, ..., μ̃_{G,(s),n})^⊤ and μ̃_G(ω) = ∑_{s=1}^S ω_s μ̃_{G,(s)}. The leave-one-out cross-validation weight choice criterion is then

CV_G(ω) = n^{-1} ‖Z_G − μ̃_G(ω)‖², (12)

and minimizing CV_G(ω) over the space W yields the optimal weight vector. However, in practice, such a minimization is computationally infeasible because the cumulative distribution function G(·) in Equation (12) is unknown and needs to be estimated. Similar to [41], we can estimate G(·) by the commonly used Kaplan-Meier estimator

Ĝ_n(t) = 1 − ∏_{i: Z_(i) ≤ t} {(n − i)/(n − i + 1)}^{I(δ_(i) = 0)}, (13)

where Z_(1) ≤ ... ≤ Z_(n) are the order statistics of Z_1, ..., Z_n and δ_(i) is the censoring indicator associated with Z_(i). In what follows, a letter subscripted by Ĝ_n denotes that it is obtained by replacing G in its corresponding estimator with Ĝ_n. For instance, Z_{Ĝ_n} is obtained by replacing G with its estimator Ĝ_n in Z_G. Then a feasible counterpart of CV_G(ω) is given by

CV_{Ĝ_n}(ω) = n^{-1} ‖Z_{Ĝ_n} − μ̃_{Ĝ_n}(ω)‖², (14)

where μ̃_{Ĝ_n}(ω) = ∑_{s=1}^S ω_s μ̃_{Ĝ_n,(s)}, and μ̃_{Ĝ_n,(s)} = (μ̃_{Ĝ_n,(s),1}, ..., μ̃_{Ĝ_n,(s),n})^⊤. Minimizing CV_{Ĝ_n}(ω) with respect to ω over the set W leads to the jackknife choice of weight vector

ω̂ = argmin_{ω ∈ W} CV_{Ĝ_n}(ω). (15)

Plugging Ĝ_n and ω̂ into Equation (10) yields the model-averaging estimator of µ, written as μ̂_{Ĝ_n}(ω̂), which is named the censored partially linear model averaging (CPLMA) estimator hereafter.
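To make the synthetic-response construction concrete, here is a minimal Python sketch (our own illustration) of the Kaplan-Meier estimator of the censoring distribution G and the resulting feasible synthetic responses Z_{Ĝ_n,i} = Z_i δ_i / {1 − Ĝ_n(Z_i)}. Ties and left limits are handled in the simplest possible way, so treat this as a sketch rather than a production implementation.

```python
import numpy as np

def km_censoring_cdf(z, delta):
    """Kaplan-Meier estimator of the censoring distribution G, treating
    delta == 0 (a censored survival time) as the 'event' for C_i."""
    z, delta = np.asarray(z, float), np.asarray(delta, int)
    order = np.argsort(z, kind="stable")
    zs, ds = z[order], delta[order]
    n = len(zs)
    surv, G = 1.0, np.empty(n)
    for i in range(n):
        if ds[i] == 0:                       # a censoring event at zs[i]
            surv *= (n - i - 1) / (n - i)
        G[i] = 1.0 - surv
    out = np.empty(n)
    out[order] = G                           # back to the original ordering
    return out

def synthetic_response(z, delta):
    """Feasible synthetic responses Z_i * delta_i / (1 - G_n(Z_i))."""
    z, delta = np.asarray(z, float), np.asarray(delta, int)
    G = km_censoring_cdf(z, delta)
    out = np.zeros_like(z)
    mask = delta == 1                        # censored points contribute 0
    out[mask] = z[mask] / (1.0 - G[mask])
    return out
```

For instance, with observed times (1, 2, 3, 4) and indicators (1, 0, 1, 1), the single censoring event at time 2 inflates the two later uncensored observations by the factor 1/(1 − 1/3) = 3/2.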
However, minimizing the weight choice criterion (14) is not easy, because the computation of μ̃_{Ĝ_n,(s)} requires n separate regressions, which is especially cumbersome when the number of candidate models and the sample size are large. Motivated by the computationally efficient cross-validation criterion introduced by [7] for the linear regression model, we express μ̃_{Ĝ_n,(s)} in a simple form which yields an enormous reduction in calculation time. Let φ_{(s)ii} be the ith diagonal entry of P_(s). From [20,42], μ̃_{Ĝ_n,(s)} can be conveniently written as

μ̃_{Ĝ_n,(s)} = Z_{Ĝ_n} − D_(s) A_(s) Z_{Ĝ_n}, (16)

where D_(s) = diag{(1 − φ_{(s)11})^{-1}, ..., (1 − φ_{(s)nn})^{-1}} and A_(s) = I − P_(s). The shortcut formula of μ̃_{Ĝ_n,(s)} given in (16) indicates that all elements of μ̃_{Ĝ_n,(s)} can be calculated simultaneously from the full sample, which is much more convenient and time-saving than the standard method based on Equation (11).
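The algebraic identity behind the shortcut can be checked numerically. The sketch below (our own illustration, using a purely linear candidate model for simplicity; the identity holds for any least-squares hat matrix, including one with a spline block) compares the one-pass jackknife residuals D_(s) A_(s) Z against n brute-force leave-one-out fits.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = rng.normal(size=(n, p))
z = X @ rng.normal(size=p) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix of one candidate
phi = np.diag(P)                             # diagonal entries phi_(s)ii
D = np.diag(1.0 / (1.0 - phi))               # D_(s)
A = np.eye(n) - P                            # A_(s) = I - P_(s)

# Shortcut: all n jackknife residuals in one pass
e_shortcut = D @ A @ z

# Brute force: n separate leave-one-out regressions
e_loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ z[keep])
    e_loo[i] = z[i] - X[i] @ b
```

The two residual vectors agree exactly (up to floating-point error), which is the content of the shortcut formula: the jackknife fit never has to be recomputed observation by observation.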
Let ẽ_{Ĝ_n,(s)} = Z_{Ĝ_n} − μ̃_{Ĝ_n,(s)} = D_(s) A_(s) Z_{Ĝ_n} denote the jackknife residual vector of the sth model, and Ẽ_{Ĝ_n} = (ẽ_{Ĝ_n,(1)}, ..., ẽ_{Ĝ_n,(S)}). The corresponding computational shortcut formula for the feasible jackknife criterion (14) then follows as

CV_{Ĝ_n}(ω) = n^{-1} ω^⊤ Ẽ_{Ĝ_n}^⊤ Ẽ_{Ĝ_n} ω. (17)

From (17), we observe that the minimization of CV_{Ĝ_n}(ω) is a standard quadratic programming problem, which can be performed by various existing software packages, for example, the quadprog package in R [43].
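Since the criterion is a quadratic form in ω over the unit simplex, any quadratic programming routine can compute the weights. The paper uses the quadprog package in R; the following Python sketch with scipy is our own illustration (the function name and setup are ours).

```python
import numpy as np
from scipy.optimize import minimize

def cv_weights(E):
    """Minimize CV(w) = n^{-1} w' E'E w over the simplex, where E is the
    n x S matrix of jackknife residuals (one column per candidate model)."""
    n, S = E.shape
    H = E.T @ E / n
    cons = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)
    res = minimize(lambda w: w @ H @ w, np.full(S, 1.0 / S),
                   jac=lambda w: 2 * H @ w,
                   bounds=[(0.0, 1.0)] * S, constraints=cons,
                   method="SLSQP")
    return res.x
```

When one model's jackknife residuals are much smaller than the others', nearly all weight should land on that model, which is easy to verify on a toy residual matrix.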

Asymptotic Optimality
In this section, we demonstrate that the resulting weight vector, which is obtained by minimizing the weight choice criterion CVĜ n (ω), is asymptotically optimal under some mild conditions.
Define the squared loss as L_G(ω) = ‖μ̂_G(ω) − µ‖², and the corresponding risk function as R_G(ω) = E{L_G(ω) | X, U}. Let ξ_n = inf_{ω ∈ W} R_G(ω), and let ω_s^0 be the S × 1 weight vector whose sth entry is 1 and whose other entries are 0. To prove the asymptotic optimality of the model-averaging estimator μ̂_{Ĝ_n}(ω̂), we list the following regularity conditions, where all limiting processes correspond to n → ∞.
• (Condition (C.2)) λ(Ω_G) ≤ C_G, where λ(·) denotes the maximum singular value of a matrix, and C_G is a constant.
• (Condition (C.6)) φ_{(s)ii} ≤ C_s n^{-1} tr(P_(s)), a.s., where C_s is a constant.
• (Condition (C.7)) The function g belongs to a class of functions A whose rth derivative g^{(r)} exists and is Lipschitz of order α_0; that is,

|g^{(r)}(u_1) − g^{(r)}(u_2)| ≤ M |u_1 − u_2|^{α_0} for all u_1, u_2 ∈ U,

for some positive constant M, where U is the support of U, r is a nonnegative integer, and α_0 ∈ (0, 1] is such that r + α_0 > 0.5.

Condition (C.1), which is the same as condition (C5) in [35], is widely used to ensure the uniform convergence of the Kaplan-Meier estimator Ĝ_n(·) in studies of censored data. Condition (C.2) imposes a mild restriction on the maximum singular value of the covariance matrix Ω_G, and is also used by [44]. Condition (C.3), which is from condition (21) of [45], is less restrictive than the condition S ξ_n^{-2N} ∑_{s=1}^S {R_G(ω_s^0)}^N → 0, a.s., for some constant N ≥ 1, which is commonly used in the model-averaging literature. Condition (C.4) places constraints on the growth rates of p̄ and q̄, and is similar to condition (22) in [45]. Condition (C.5) concerns the sum of the µ_i², and is frequently used in the model-averaging literature, such as [23,24]. Condition (C.6) is a common assumption utilized to guarantee the asymptotic optimality of cross-validation; see [7,24], for instance. Conditions (C.3)-(C.6) require almost sure convergence, which ensures that the result (18) holds whether the covariates X and U are random or not. Specifically, when X and U are nonstochastic, we only need to assume convergence in probability in Conditions (C.3)-(C.6); see [46]. Otherwise, we impose almost sure convergence to guarantee that the proof method used in the nonstochastic case remains effective. Condition (C.7) is required for the B-spline approximation in the PLM; see [39,47].
Theorem 1. Suppose Conditions (C.1)-(C.7) hold. Then

L_G(ω̂) / inf_{ω ∈ W} L_G(ω) → 1 (18)

in probability as n → ∞.

Theorem 1 indicates that the CPLMA estimator proposed in this paper is asymptotically optimal in the sense that its squared error loss is asymptotically equivalent to that of the infeasible best possible model-averaging estimator in the PLM framework. The proof of Theorem 1 is given in Appendix A.

A Simulation Study
In this section, a simulation experiment is conducted to investigate the finite sample performance of the CPLMA estimator, which arises from the proposed leave-one-out cross-validation weight choice approach, in the PLM with censored responses. We compare it with several popular information-criterion-based model-selection methods as well as other model-averaging procedures.

The Design of Simulation
The data-generating process in this part is similar to the infinite-order regression model proposed by [17], except that the responses are subject to censoring and a nonparametric function is included in addition to the linear part. Specifically, the data are generated by the following regression model:

Y_i = µ_i + ε_i = ∑_{j=1}^{200} β_j x_{ij} + g(U_i) + ε_i, i = 1, ..., n,

where X_i = (x_{i1}, ..., x_{i200})^⊤, the covariate vector of the linear component, follows a multivariate normal distribution with mean 0 and covariance 0.5^{|j_1 − j_2|} between x_{ij_1} and x_{ij_2}. The coefficients of the linear part are set as β_j = 1/j^α, and the parameter α varies between 2 and 0.5; a larger α implies that the coefficients decay more quickly as j increases. The nonparametric function is g(U_i) = sin(2πU_i²), where U_i is generated from the uniform distribution on [0, 1]. The model error ε_i follows the normal distribution N(0, η²(x_{i2}² + 0.01)). We choose the value of η so that R² = var(µ_i)/var(Y_i) varies from 0.1 to 0.9, where var(·) denotes the sample variance. In addition, the censoring variable C_i is generated from a uniform distribution on the interval [a_1, a_2], where different values of a_1 and a_2 are selected to yield a censoring rate (CR) of either 20% or 40%. In order to evaluate the performance of the methods as comprehensively as possible, we consider two designs and set the sample size n = 50, 75, 100, 200, 300, and 400. In the non-nested setting, the linear parts of all candidate models are subsets of {x_{i1}, ..., x_{i5}} (with the rest of X_i being ignored), so the number of candidate models is 2^5 = 32.
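A Python sketch of this data-generating process follows (our own illustration; the default interval [a_1, a_2] is a placeholder that would be tuned to hit the target censoring rate, and the generating equation reflects our reading of the design above).

```python
import numpy as np

rng = np.random.default_rng(2023)

def generate_data(n, alpha=2.0, eta=1.0, a1=0.0, a2=6.0):
    """Simulate (X, U, Z, delta, mu) from the design of Section 5:
    mu_i = sum_j x_ij / j^alpha + sin(2*pi*U_i^2), heteroscedastic errors,
    and uniform censoring on [a1, a2]."""
    p = 200
    j1, j2 = np.meshgrid(np.arange(p), np.arange(p))
    cov = 0.5 ** np.abs(j1 - j2)                 # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    U = rng.uniform(size=n)
    beta = 1.0 / (np.arange(1, p + 1) ** alpha)
    mu = X @ beta + np.sin(2 * np.pi * U ** 2)
    eps = rng.normal(size=n) * eta * np.sqrt(X[:, 1] ** 2 + 0.01)
    Y = mu + eps
    C = rng.uniform(a1, a2, size=n)              # censoring times
    Z = np.minimum(Y, C)
    delta = (Y <= C).astype(int)
    return X, U, Z, delta, mu
```

In the Monte Carlo loop one would draw 500 such samples per configuration, apply each method, and average the resulting losses.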

Estimation and Comparison
As suggested by [35], the cubic B-spline is used to approximate the nonparametric function, and the spline basis matrix is generated by bs(·) in the splines package of the R project [48]. To select the number of knots, we set η = 1 and α = 2, and investigate the impact of the number of knots on the risk of the CPLMA estimator under different scenarios. Figure 1 shows how the mean risk varies with the number of knots, over 500 replications, for the four combinations of designs and CRs considered. From Figure 1, we see that in almost all cases 1 knot yields the smallest mean risk; the exception is the lower-left panel, where 1 knot leads to a mean risk second only to 2 knots but smallest among the remaining choices. In addition, in all cases, the mean risk increases with the number of knots once the number of knots exceeds 2. This observation coincides with the finding in [39] that a larger number of knots results in a more serious overfitting effect. Therefore, the number of knots is set to 1 in the simulation studies.

We compare the performance of the CPLMA method with two traditional model-selection methods (AIC and BIC) and two model-averaging methods based on the scores of information criteria (SAIC and SBIC). For the sth model, we calculate the AIC and BIC scores by

AIC_s = log(σ̂²_{Ĝ_n,(s)}) + 2n^{-1} tr(P_(s)) and BIC_s = log(σ̂²_{Ĝ_n,(s)}) + n^{-1} tr(P_(s)) log(n),

respectively, where σ̂²_{Ĝ_n,(s)} = n^{-1} ‖Z_{Ĝ_n} − μ̂_{Ĝ_n,(s)}‖², and μ̂_{Ĝ_n,(s)} is obtained by replacing G in Equation (9) with Ĝ_n. Both methods pick the model with the smallest information criterion score.
For SAIC and SBIC, the weight of the sth model is defined as

ω_s^{SAIC} = exp(−AIC_s/2) / ∑_{t=1}^S exp(−AIC_t/2) and ω_s^{SBIC} = exp(−BIC_s/2) / ∑_{t=1}^S exp(−BIC_t/2),

respectively. To evaluate these five methods, we draw 500 independent samples of size n and compute the risk of each estimator of µ. For ease of comparison, the risks of all estimators are normalized by the risk produced by the AIC method.
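The smoothed-information-criterion weights take the standard exponential form, which a small Python helper makes concrete (our own illustration; subtracting the minimum score is a routine numerical safeguard, not part of the definition).

```python
import numpy as np

def smoothed_ic_weights(scores):
    """SAIC/SBIC weights: w_s proportional to exp(-IC_s / 2), normalized
    to sum to one. Shifting by the minimum score avoids overflow and does
    not change the weights."""
    scores = np.asarray(scores, dtype=float)
    w = np.exp(-(scores - scores.min()) / 2.0)
    return w / w.sum()
```

For example, two models whose AIC scores differ by 2 receive weights in the ratio e : 1, i.e., roughly 0.73 versus 0.27.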

Results
The simulation results for Design 1 are presented in Figures 2 and 3 for the censoring rate of 20% and in Figures 4 and 5 for the censoring rate of 40%. These four figures show that our CPLMA method leads to the smallest risk in most cases; the exceptions occur when R² is large, where SAIC and SBIC sometimes have marginal advantages over ours, with the advantage of SBIC more obvious when n is small. In particular, comparing the simulation results for α = 0.5 and α = 2, we find that CPLMA performs better when α is small. As expected, SAIC and SBIC consistently produce more accurate outcomes than their respective model-selection counterparts.
The simulation results for Design 2 are depicted in Figures 6-9 with CR ≈ 20% and 40%, from which we see that in most cases our proposed CPLMA method still outperforms its rivals in terms of risk. The superiority of the CPLMA method over the other methods is more apparent than in Design 1. Additionally, we find that BIC-based model selection and averaging estimators have much worse risk performance than the other three estimators when R 2 is small, which is different from the simulation results in Design 1. We also note that SAIC and our method CPLMA almost perform equally well when R 2 is very large.
In summary, no matter whether the candidate models are nested or not, our proposal, CPLMA, is superior to the traditional model-selection and model-averaging methods for all combinations of censoring rates and sample sizes considered.

Real Data Analysis
In this section, we apply the proposed CPLMA method to analyze two real datasets by R software. The first real dataset can be found in the R package "survival" [49], and the other is available at http://llmpp.nih.gov/MCL (accessed on 18 January 2023).

Primary Biliary Cirrhosis Dataset Study
The primary biliary cirrhosis (PBC) dataset includes information on 424 patients, collected at the Mayo Clinic from January 1974 to May 1984, and has been extensively explored by [34,35,37,50,51]. Following the related literature, we restrict our attention to the n = 276 patients without missing observations, for each of whom 17 covariates are recorded. There are 111 deaths among the 276 patients, which corresponds to a censoring rate of about 60%.
A total of 17 covariates leads to a huge number of candidate models, which brings a heavy computational burden. Ref. [50] pointed out that only eight covariates, namely age, edema, bili, albumin, copper, ast, protime, and stage, have a significant impact on the response variable, and ref. [35] found that albumin has a functional impact on the response variable.
Thus, we only consider eight significant covariates. Specifically, we assign albumin in the nonparametric part while including others in the linear part of PLM, and we run model selection and averaging on the covariates in the linear part. Accordingly, there are 2 7 = 128 prepared models as candidates. Similar to [35], we also use the cubic B-spline with two knots to approximate the nonparametric component.
To evaluate the prediction performance of the two model-selection methods (AIC and BIC) and the three model-averaging methods (SAIC, SBIC, and CPLMA), we randomly split the data into a training sample and a test sample. Let n_0 be the size of the training sample, and n_1 = n − n_0 be the size of the test sample. We set n_0 to 140, 160, 180, 200, 220, and 240. The mean squared prediction error (MSPE) is used to describe the out-of-sample prediction performance of the proposed CPLMA and its competitors. We further calculate the mean and the median of the MSPE for each method based on 1000 replications. Specifically,

MSPE_mean = (1/1000) ∑_{d=1}^{1000} MSPE^{(d)} and MSPE_median = median_{d=1,...,1000} MSPE^{(d)},

where MSPE^{(d)} = n_1^{-1} ∑_{i=1}^{n_1} (Z_{Ĝ_n,i} − μ̂_i^{(d)})² and μ̂_i^{(d)} is the predicted value of µ_i in the dth repetition.
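The relative-MSPE comparison reduces to a few lines of code; the helper below is our own illustrative sketch (method names are placeholders), computing each method's test-sample MSPE and dividing by the AIC baseline.

```python
import numpy as np

def rmspe(pred_by_method, z_test, baseline="AIC"):
    """Relative MSPE: each method's mean squared prediction error on the
    test sample, divided by the baseline method's MSPE."""
    mspe = {m: np.mean((np.asarray(z_test) - np.asarray(p)) ** 2)
            for m, p in pred_by_method.items()}
    base = mspe[baseline]
    return {m: v / base for m, v in mspe.items()}
```

A value below 1 means the method predicts better than the baseline on that split; averaging these ratios over 1000 random splits gives the entries reported in the tables.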
To facilitate comparison, we calculate the ratio of the MSPE for a given method to the MSPE produced by AIC, which is referred to as the relative MSPE (RMSPE). Table 2 reports the mean and median of the RMSPE across 1000 repetitions. We see that our proposed CPLMA always yields the lowest mean and median RMSPE for all considered training sample sizes. In all cases, the RMSPE values for BIC are larger than 1 and those for the three model-averaging methods are smaller than 1, which indicates that, in terms of prediction performance, AIC is clearly preferable between the two model-selection methods, and the model-averaging methods outperform the model-selection methods. Table 3 presents the Diebold and Mariano (DM) [52] test results for the differences in MSPE, where a positive DM statistic implies that the method in the numerator yields a larger MSPE than the method in the denominator. The results in columns 6, 9, 11, and 12 indicate that the differences between CPLMA and its competitors are statistically significant and that our method always produces a smaller MSPE than the other four methods, which again demonstrates the superiority of our proposal. The results in column 3 show that AIC is significantly better than BIC, which coincides with the finding in Table 2. Columns 4 and 8 indicate that SAIC and SBIC are significantly different from their respective model-selection counterparts.

Mantle Cell Lymphoma Data Analysis
The mantle cell lymphoma (MCL) dataset contains 92 patients who were classified as having MCL based on established morphologic and immunophenotypic criteria. Since 2003, this dataset has been widely studied, for example by [53,54]. The response variable of interest is time (time of follow-up, in years). The variable status denotes the patient's status at follow-up (1 = death, 0 = censored). The six covariates include an indicator of INK/ARF deletion (1 = yes, 0 = no), an indicator of ATM deletion (1 = yes, 0 = no), an indicator of P-53 deletion (1 = yes, 0 = no), the cyclin D-1 Taqman result, BMI expression, and the proliferation signature average. After removing seven records with missing covariates, we focus on the information of the remaining 85 patients, for whom the censoring rate is 29.41%.
Ref. [54] found that BMI expression has a functional impact on the response variable; therefore, we establish a full PLM with BMI expression as the nonparametric variable and the other covariates as linear variables. We then obtain 2^5 = 32 candidate models on which to conduct model selection and model averaging. Let the size of the training sample be n_0 = 55 or 65; the mean and median of the RMSPE across 1000 repetitions are shown in Table 4. It can be seen from Table 4 that, in terms of both mean and median, our CPLMA method is clearly superior to the other competing methods. Figure 10 shows that the variation of the MSPE for CPLMA is small relative to that of the other methods, regardless of whether n_0 is 55 or 65.

Conclusions
In the context of semiparametric partially linear models with censored responses, we develop a jackknife model-averaging method which selects the weights by minimizing a leave-one-out cross-validation criterion, in which B-splines are used to approximate the nonparametric function and least-squares estimation is applied to estimate the unknown parameters in each candidate model. The resulting model-averaging estimator, the CPLMA estimator, is shown to be asymptotically optimal. A simulation study and two real data examples indicate that our method possesses some advantages over other model-selection and model-averaging methods.
Based on the results in this paper, we can further explore the optimal model averaging for the semiparametric partially linear quantile regression models with censored data. In addition, it is worthwhile to apply other optimal model-averaging methods, such as the model-averaging method based on Kullback-Leibler distance, to generalized partially linear models with censored responses.
Lemma A1. Under Conditions (C.2)-(C.7), we have the following results:

Proof. According to the proof of (A.45) ...

Proof. This result follows directly from Lemma 6.2 in [37]; we omit the proof procedure.
By Equation (A7) and Lemma A3, we can obtain (A15). This concludes the proof.