Abstract
This paper studies statistical inference for the partially linear varying coefficient quantile regression model when the responses are missing at random. A two-stage procedure is developed to estimate the parametric and nonparametric components of the model, and the asymptotic properties of the resulting estimators are established under mild regularity conditions. In addition, an imputation-based empirical log-likelihood ratio statistic is proposed and shown to follow a standard chi-square distribution asymptotically, which allows an empirical likelihood confidence interval to be constructed for the parametric component of the model. Finally, simulation results show that the proposed estimation method is feasible and effective.
Keywords:
composite quantile regression; partially linear varying coefficient model; empirical likelihood; confidence interval; missing responses
MSC:
62G05; 62G08; 62G20; 62G30
1. Introduction
The partially linear varying coefficient model, originally proposed by Zhang et al. [1], is a very flexible model that includes parametric, nonparametric, and semiparametric models as sub-models. It combines the advantages of the general linear model and the nonparametric model, and it can also express the relationship between covariates and responses dynamically, retaining both the interpretability of parameters and the flexibility of nonparametric models (see [2,3]). For these reasons, the model has been widely studied. It is usually estimated using the least-squares (LS) method. Although the LS method is optimal for normally distributed data, it may produce large deviations when the data contain outliers or the errors follow a heavy-tailed distribution. To remedy this defect, Koenker and Bassett [4] proposed the quantile regression method, which can be used to explore the potential relationship between the response variable and the covariates. Further, to avoid dependence on the specific quantile level chosen and to improve estimation efficiency, Zou and Yuan [5] proposed the composite quantile regression (CQR) method, which builds on quantile regression. The CQR method has since been applied to many problems. For example, Kai et al. [6] proposed new estimation and variable selection procedures for the semiparametric partially linear varying coefficient model and showed that the CQR method is much more efficient than the least-squares-based method for many non-normal errors. Jiang et al. [7] proposed a functional single-index composite quantile regression method and estimated the unknown slope function and link function using B-spline basis functions. Song et al. [8] proposed a penalized composite quantile regression estimator based on SCAD and the Laplacian error penalty (LEP), which performs variable selection and estimation simultaneously.
Alongside significant advances in CQR theory and its applications, the CQR method has attracted increasing attention for data samples that contain missing values, which arise in various fields, including economics, engineering, biology, and epidemiology. Owing to human error or other unknown factors, a large amount of data can easily go missing. Several methods have been proposed to deal with missing data, such as the imputation method [9,10,11], the complete-case (CC) analysis method [12], the likelihood-based method [13], and the inverse probability weighted (IPW) method [14,15,16]. Some scholars have accordingly applied the CQR method to missing data. Based on inverse probability weighting and B-spline approximations, Jin et al. [17] proposed a weighted B-spline composite quantile regression method to estimate the nonparametric functions and regression coefficients of the partially linear varying coefficient model with missing covariates.
Further, the empirical likelihood method, introduced by Owen [18,19], is a nonparametric statistical inference method for complete samples whose sampling properties are similar to those of the bootstrap. Compared with classical or modern statistical methods, it has many advantages, including range preservation, a confidence region whose shape is determined by the data, Bartlett correctability, and no need to construct a pivotal statistic. As a result, this method has attracted the attention of many statisticians and has been applied to linear, nonparametric, and semiparametric regression models. For example, Zhao and Xue [20] used this method to give an adjusted empirical likelihood ratio function for the parametric part and proved that it asymptotically follows a standard chi-square distribution. Wang and Zhu [21] developed two empirical likelihood methods for longitudinal data inference within the quantile regression framework. Yan et al. [22] proposed an imputation-based empirical likelihood method combined with quantile regression to construct confidence intervals for the parametric and nonparametric components. In this paper, we combine the imputation method and empirical likelihood inference with composite quantile regression estimation for the partially linear varying coefficient model when the responses are missing at random.
The rest of this paper is organized as follows. In Section 2, a two-stage estimation procedure is developed to estimate the parametric and nonparametric components. Based on this, methods for constructing imputation empirical likelihood-based confidence intervals for the parametric components are presented, and some asymptotic properties of the proposed estimator are studied. In Section 3, some simulation studies and a real data application are used to evaluate the performance of the proposed method. In addition, the proofs of the main results are given in Section 4.
2. Two-Stage Estimation Method
In what follows, we use the two-stage estimation procedure [23] to estimate the parametric and nonparametric components involved in the following partially linear varying coefficient model:
where Y is the response, U is a one-dimensional variable, and and are the corresponding covariates. Suppose that is the unknown vector whose components are smooth coefficient functions, is a vector, is the model error, and and are independent of each other. If are random samples, we have
An indicator variable is introduced in this paper such that
Further, follows the following missing mechanism:
Assuming is known, the model in (2) can be regarded as the general varying coefficient model. Next, we use the local linear composite quantile estimation method to estimate the varying coefficient function . Specifically, for u in a neighborhood of , the unknown coefficient function can be locally linearly approximated as
Let be the quantile of , then . For a given non-negative integer K, let , and let be a quantile loss function for , so is uniquely determined for any . Then, the initial estimate of under complete data can be obtained by minimizing the following objective function
where , , , and is the kernel function. Then, we have . Further, by substituting into Formula (2), we can obtain
where . Now, the model in (6) can be regarded as a general linear model. Then, the estimation for can be obtained by minimizing the following formula using the composite quantile regression estimation method
The estimated efficiency of can be further improved by substituting into (5). Then, we have and .
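For orientation, the objects described in this section take the following standard forms in the CQR literature (e.g., [5,6]). This is a sketch under stated assumptions rather than a transcription of the displays in this section: the dimensions p and q, the pairing of X with the varying coefficients and of Z with β, the symbol π(·) for the selection probability, and the equally spaced quantile levels are all assumptions.

```latex
% Partially linear varying coefficient model (assumed notation)
Y = X^{\top}\alpha(U) + Z^{\top}\beta + \varepsilon,
\qquad \alpha(\cdot) = (\alpha_1(\cdot),\dots,\alpha_p(\cdot))^{\top},
\quad \beta \in \mathbb{R}^{q}.

% Responses missing at random: \delta_i = 1 if Y_i is observed, 0 otherwise
P(\delta_i = 1 \mid Y_i, X_i, Z_i, U_i)
  = P(\delta_i = 1 \mid X_i, Z_i, U_i) = \pi(X_i, Z_i, U_i).

% Local linear approximation for u in a neighborhood of u_0
\alpha_j(u) \approx \alpha_j(u_0) + \alpha_j'(u_0)\,(u - u_0),
\qquad j = 1,\dots,p.

% Check loss at level \tau_k, with \tau_k = k/(K+1), k = 1,\dots,K
\rho_{\tau_k}(r) = \tau_k\, r - r\, I(r < 0).
```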
2.1. Imputation-Based Empirical Likelihood Inference for β
Now, the confidence interval for β is constructed using the imputation-based empirical likelihood inference method. When a large amount of data are missing, the imputation-based empirical likelihood inference method of [24] can improve the accuracy of the confidence interval. Let and denote the estimators of and , respectively. In addition, the auxiliary random vectors based on imputation are defined as
where is an indicator function. The empirical log-likelihood ratio for is defined as
If zero is inside the convex hull of the point , there exists a unique optimal point for the optimization problem (9). It follows from the Lagrange multiplier method that can be rewritten as
where is a Lagrange multiplier that satisfies
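To make the Lagrange multiplier step concrete, the sketch below evaluates an empirical log-likelihood ratio for scalar auxiliary values. It assumes the usual Owen-type form [18,19], in which the multiplier λ solves Σ η_i / (1 + λ η_i) = 0 and the statistic equals 2 Σ log(1 + λ η_i). The function name `el_log_ratio`, the bisection solver, and the restriction to one dimension are illustrative choices, not the authors' implementation.

```python
import math


def el_log_ratio(eta):
    """Return -2 log empirical likelihood ratio for scalar values eta_i.

    Returns math.inf when zero is not strictly inside the convex hull
    (here: the open interval between min and max) of the eta_i.
    """
    lo_eta, hi_eta = min(eta), max(eta)
    if not (lo_eta < 0.0 < hi_eta):
        return math.inf

    # Every weight must stay positive, i.e. 1 + lam * eta_i > 0 for all i,
    # which restricts lam to the open interval (-1/max, -1/min).
    lo, hi = -1.0 / hi_eta, -1.0 / lo_eta
    eps = 1e-10 * (hi - lo)
    lo, hi = lo + eps, hi - eps

    def score(lam):
        # Strictly decreasing in lam on the admissible interval.
        return sum(e / (1.0 + lam * e) for e in eta)

    # Bisection for the root of the score equation.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log1p(lam * e) for e in eta)
```

For a vector-valued parameter the same score equation is solved for a vector multiplier, typically by Newton iterations; the one-dimensional case is used here only because bisection makes the positivity constraint easy to see.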
2.2. Asymptotic Properties of Estimators
For convenience, the following notation is given. Note that and represent the density function and distribution function of the model error , respectively. represents the marginal density of the covariate u. Note that . To prove the asymptotic properties, we assume that the following regularity conditions hold:
C1: The random variable U has a bounded support , and its density function satisfies the Lipschitz condition and is continuous.
C2: has a continuous second derivative.
C3: The density function of the model error satisfies f(·) > 0, and its derivative is continuously bounded.
C4: The random variable X has bounded support.
C5: The kernel function is a symmetric probability density function with bounded support and satisfies the Lipschitz condition. Let
C6: For a given , and are positive definite matrices.
C7: The covariates are centred random vectors and satisfy
as .
C8: Suppose that and . Furthermore, we let for all , and z.
Theorem 1.
Assume that conditions C1-C8 hold. Then,
if β is the true value of the parameter, where denotes convergence in distribution, , , , and
where
Theorem 2.
Assume that conditions C1-C8 hold, as . Then,
where denotes convergence in distribution, and , , and .
Theorem 3.
Assume that conditions C1-C8 hold. Then,
if β is the true value of the parameter, where denotes convergence in distribution, and is the Chi-square distribution with q degrees of freedom.
Then, the confidence region for will be constructed based on the results. For a given with , let satisfy . Then, the approximate confidence region for can be defined as
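As an illustration of how such a region is read off in practice, the sketch below keeps the candidate parameter values whose statistic does not exceed the chi-square cutoff. It is restricted to a scalar parameter (one degree of freedom), where the cutoff equals the squared standard normal quantile; the grid search and the callable `stat`, which stands in for the empirical log-likelihood ratio, are assumptions made for this example.

```python
from statistics import NormalDist


def chi2_cutoff_1df(alpha):
    """Upper-alpha quantile of the chi-square distribution with 1 df."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def confidence_interval(stat, grid, alpha=0.05):
    """Smallest and largest grid values b with stat(b) <= cutoff.

    Returns None when no grid value falls inside the region.
    """
    cutoff = chi2_cutoff_1df(alpha)
    inside = [b for b in grid if stat(b) <= cutoff]
    if not inside:
        return None
    return min(inside), max(inside)
```

For a q-dimensional parameter the cutoff would instead be the corresponding quantile of the chi-square distribution with q degrees of freedom.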
3. Numerical Results
3.1. Simulation Studies
A numerical simulation experiment was carried out to study the finite-sample performance of the proposed method. The data were generated from the following partially linear varying coefficient model:
where , , , , , and the response is generated according to the model. Throughout the simulation study, we used the Epanechnikov kernel function, where . Further, we used the cross-validation method to select the optimal bandwidth
Thus, the missing probabilities corresponding to the scenarios were approximately 0.1, 0.25, and 0.4, respectively. In this simulation, the quantile vector was taken as with , and the sample sizes were set to , and , respectively. For each case, we conducted 1000 simulation runs. The estimation errors, standard deviations, and mean squared errors of the estimator with the three missing probabilities are summarized in Table 1. For the nonparametric component, we provide an estimation curve for the component of the coefficient with a missing probability of 0.25 and .
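The Epanechnikov kernel mentioned above has the standard form K(u) = 0.75(1 − u²) on [−1, 1] and zero elsewhere. A minimal sketch of the kernel and of the local weights it induces around a point u0 is given below; the scaled form K_h(u) = K(u/h)/h and the function names are assumptions made for this example.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def scaled_kernel(u, h):
    """Scaled kernel K_h(u) = K(u / h) / h for a bandwidth h > 0."""
    if h <= 0.0:
        raise ValueError("bandwidth h must be positive")
    return epanechnikov(u / h) / h


def kernel_weights(u_obs, u0, h):
    """Local weights K_h(U_i - u0) for each observed value U_i."""
    return [scaled_kernel(u - u0, h) for u in u_obs]
```

Observations farther than h from u0 receive zero weight, which is why the bandwidth controls how local the fit is.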
Table 1.
The bias, SD, and MSE for with three missing probabilities.
The following conclusions can be drawn from Table 1:
(1) For a given missing probability, the standard deviation and mean squared error of the estimator decrease as the sample size increases.
(2) For a given sample size, the standard deviation and mean squared error of the estimator increase as the missing probability increases.
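The three summary measures reported in Table 1 can be computed from the replicated estimates as sketched below. The function name and the use of the population (divide-by-n) standard deviation are assumptions; with that convention the mean squared error decomposes exactly as bias² + SD².

```python
import math


def summarize_estimates(estimates, true_value):
    """Bias, standard deviation and MSE of replicated estimates."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    # Population variance (divide by n) so that mse = bias^2 + sd^2.
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return {"bias": bias, "sd": math.sqrt(variance), "mse": mse}
```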
In addition, by comparing the real curve with the estimated curve, as shown in Figure 1, we can see that the method proposed in this paper is effective.
Figure 1.
Estimation curve (dashed line) and true curve (solid line) of .
Next, we used the empirical likelihood method to construct the confidence interval for . In order to evaluate the performance of the proposed statistical inference method, two methods were compared in the following simulation study: the imputation-based empirical likelihood method (IEL) proposed in this paper and the complete data-based empirical likelihood method (CEL). Under three different missing probabilities, the upper and lower 95% confidence limits for and the corresponding coverage probabilities were computed. The sample sizes were set to , and , respectively. The upper and lower confidence limits, the average length of the confidence intervals, and the coverage probabilities are summarized in Table 2.
Table 2.
Confidence intervals and coverage probabilities for under two different methods.
As can be seen in Table 2, for a fixed missing probability, the confidence intervals of both methods became shorter and the coverage probabilities increased as the sample size increased. Conversely, for a fixed sample size, the confidence intervals of both methods became longer and the coverage probabilities decreased as the missing probability increased. In both comparisons, the IEL method yielded shorter confidence intervals and higher coverage probabilities than the CEL method, so the IEL method is superior to the CEL method.
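The coverage probability and average interval length used in this comparison can be computed from the simulated intervals as in the following sketch. The function name and the representation of each interval as a (lower, upper) pair are assumptions made for this example.

```python
def coverage_and_length(intervals, true_value):
    """Empirical coverage probability and average length of intervals.

    `intervals` is a list of (lower, upper) pairs, one per simulation run.
    """
    n = len(intervals)
    covered = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    average_length = sum(upper - lower for lower, upper in intervals) / n
    return covered / n, average_length
```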
3.2. Application to a Real Data Example
In this section, we apply our proposed method to NCCTG lung cancer data, which are available in the R software (latest v. 4.2.3). In this data set, there are 228 patients with lung cancer, and the survival days for 63 of these patients have been deleted for various reasons, with a missing rate of about 27.63%. There are 10 variables in this data set, but we paid more attention to the following variables: time (survival time), ph.karno (Karnofsky performance score, assessed by doctors), meal.cal (calories consumed from three meals), and age (age of lung cancer patients). Here, we used the model in (1) to fit the lung cancer data, where Y represents the survival time, U represents age, Z represents meal.cal, and X represents ph.karno. For comparison, we considered composite levels K of 5 and 9 in the CQR method, denoted as CQR5 and CQR9, respectively.
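To show what the composite level K controls, the sketch below builds the quantile levels for CQR5 and CQR9 and evaluates the composite check loss that CQR minimizes. The equally spaced levels τ_k = k/(K+1) follow the common choice of Zou and Yuan [5] and are an assumption here, as are the function names.

```python
def quantile_levels(K):
    """Equally spaced quantile levels tau_k = k / (K + 1), k = 1..K."""
    return [k / (K + 1) for k in range(1, K + 1)]


def check_loss(r, tau):
    """Quantile check loss rho_tau(r) = r * (tau - I(r < 0))."""
    return r * (tau - (1.0 if r < 0 else 0.0))


def composite_check_loss(residuals, intercepts, taus):
    """Sum of check losses over all levels, one intercept per level."""
    return sum(
        check_loss(r - b, tau)
        for b, tau in zip(intercepts, taus)
        for r in residuals
    )
```

With K = 9 the levels are 0.1, 0.2, ..., 0.9, so CQR9 pools information from more quantile levels than CQR5.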
Next, we compare the results obtained using the CQR5 and CQR9 methods for estimating and give an estimation curve for the component of the coefficient using the CQR9 method. Then, we present the estimator and its corresponding 95% confidence interval for based on the proposed IEL method. The results for the CQR method and the confidence interval for are summarized in Table 3. The estimation curve for the component of the varying coefficient is shown in Figure 2. From Table 3, we can see that the confidence interval obtained with the CQR9 method is shorter than that obtained with the CQR5 method. This indicates that the CQR9 method combines information from more quantile levels, which is consistent with the theory. In addition, it can be seen in Figure 2 that the varying coefficient function fluctuates with age.
Table 3.
The results and confidence intervals for under the CQR5 method and CQR9 method.
Figure 2.
Estimation curve of .
4. Proofs of Theorems
Lemma 1.
Let be the random vectors, where are one-dimensional random variables. At the same time, and hold, where represents the joint density function of (X, Y). Let be a bounded positive function with bounded support, satisfying the Lipschitz condition. Then, we have
where for .
For the proof of Lemma 1, the reader is referred to Fan and Huang [3].
Lemma 2.
Suppose that is convex and can be represented as , where V is a symmetric and positive definite matrix, is stochastically bounded, is arbitrary, and converges to 0 in probability for any s. Let the minimized solution of be , the minimized solution of be , and the difference between them be of order . Further, if , then , where denotes convergence in distribution.
Lemma 2 is taken from the basic propositions in Hjort and Pollard [25].
Lemma 3.
Suppose that conditions C1-C8 hold. Then, we have
where
Proof.
Note that . Then, the estimate of can be obtained by minimizing the following loss function:
(12) is equivalent to minimizing the following formula:
where , , . is the K-dimensional unit vector whose kth component is 1 and whose remaining components are 0. According to the identity proposed by Knight [26],
and it is easy to obtain
where
According to Lemma 1, we can obtain
Moreover,
where . Thus, it is easy to obtain
Since is established, where . Then,
From Lemma 2, the minimizer of can be written as
Because is a pseudo-diagonal matrix, Lemma 3 can be proven. □
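For reference, the identity of Knight [26] invoked in this proof is commonly stated for the check loss as follows. The symbols u, v, and ψ_τ are generic here and are not tied to the notation of the proof.

```latex
\rho_\tau(u - v) - \rho_\tau(u)
  = -v\,\psi_\tau(u)
    + \int_0^{v} \bigl[ I(u \le s) - I(u \le 0) \bigr]\, ds,
\qquad \psi_\tau(u) = \tau - I(u < 0), \quad u \neq 0.
```

The first term is linear in v and drives the asymptotic normality, while the integral term is non-negative and supplies the quadratic part of the expansion.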
Lemma 4.
Assume that conditions C1-C8 hold. Then,
if β is the true value of the parameter, where denotes convergence in distribution, .
Proof.
It is easy to show that
Note that
then, we have the following calculation:
Then, we can obtain
Thus, we have
Moreover, using the central limit theorem, we can obtain
Lemma 5.
Assume that conditions C1–C8 hold. Then,
if β is the true value of the parameter, where denotes convergence in probability, .
Proof.
We also use the notations in the proof of Lemma 4. Then, we obtain
Using the law of large numbers, we can derive that . Let be the component of and be the rth component of . Further, using the Cauchy–Schwarz inequality, we can obtain
From the proof of Lemma 4, we can obtain and . Hence, . and can also be proven using a similar argument. Hence, Lemma 5 is proven. □
Proof of Theorem 1.
Note that , , . Then, the estimate of can be obtained by minimizing the following loss function:
(19) is equivalent to minimizing the following formula,
According to the identity proposed by Knight [26],
(20) can be expressed equivalently as
where .
Next, we calculate the expectation of . The calculation is as follows:
Note that , then . Thus,
where . Further, we obtain
Then, in (22) can be represented as
where . Below, we give the calculation
where . According to Lemma 2, the minimizer of can be expressed as
By using the Cramér–Wold theorem and the central limit theorem, it is easy to obtain
Further, by the Lindeberg–Feller central limit theorem, we can obtain
□
Proof of Theorem 2.
From Lemma 3, we can obtain
where , . It is easy to obtain
Then, we have
Further,
Thus, we can obtain
Consequently, we complete the proof of Theorem 2. □
5. Conclusions
In this paper, we discussed statistical inference for the partially linear varying coefficient composite quantile regression model with missing data. Using a two-stage estimation procedure, we addressed the challenge posed by responses missing at random and established the asymptotic properties of the resulting estimators under mild regularity conditions. In particular, this paper contributes a new imputation-based empirical log-likelihood ratio statistic and shows that it asymptotically follows a standard chi-square distribution. The feasibility and performance of the proposed method were verified through a simulation study and an application to real data.
However, for more complex problems with missing data, there is still a lot of work to be done. For example, some new imputation methods should be proposed to solve these problems. On the other hand, in future work, we will use bootstrap methods to conduct simulation studies for varying coefficient functions in terms of coverage probability and average length [28].
Author Contributions
Methodology, S.L. and Y.Y.; software, Y.Y.; formal analysis, C.-y.Z.; investigation, S.L.; writing-original draft, Y.Y.; writing-review and editing, S.L. and C.-y.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 12271420) and the Natural Science Foundation of Shaanxi Province of China (No. 2024JC-YBMS-007).
Data Availability Statement
Data are contained within the article.
Acknowledgments
The authors are grateful to all the reviewers for their constructive comments and suggestions, which led to significant improvements to the original manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Zhang, W.; Lee, S.; Song, X. Local polynomial fitting in semi-varying coefficient models. J. Multivar. Anal. 2002, 82, 166–188.
2. Zhou, X.; You, J.H. Wavelet estimation in varying coefficient partially linear regression models. Stat. Probab. Lett. 2004, 68, 91–104.
3. Fan, J.Q.; Huang, T. Profile likelihood inferences on semiparametric varying coefficient partially linear models. Bernoulli 2005, 11, 1031–1057.
4. Koenker, R.; Bassett, G., Jr. Regression quantiles. Econometrica 1978, 46, 33–50.
5. Zou, H.; Yuan, M. Composite quantile regression and the oracle model selection theory. Ann. Stat. 2008, 36, 1108–1126.
6. Kai, B.; Li, R.; Zou, H. New efficient estimation and variable selection methods for semiparametric varying-coefficient partially linear models. Ann. Stat. 2011, 39, 305–332.
7. Jiang, Z.; Huang, Z.; Zhang, J. Functional single-index composite quantile regression. Metrika 2022, 86, 595–603.
8. Song, Y.; Li, Z.; Fang, M. Robust variable selection based on penalized composite quantile regression for high-dimensional single-index models. Mathematics 2022, 10, 2000.
9. Rubin, D. Multiple Imputations for Nonresponse in Surveys; John Wiley & Sons Inc.: New York, NY, USA, 1987.
10. Lipsitz, S.R.; Zhao, L.P.; Molenberghs, G. A semiparametric method of multiple imputation. J. R. Stat. Soc. Ser. B 1998, 60, 127–144.
11. Aerts, M.; Claeskens, G.; Hens, N.; Molenberghs, G. Local multiple imputation. Biometrika 2002, 89, 375–388.
12. Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2014.
13. Ibrahim, J. Incomplete data in generalized linear models. J. Am. Stat. Assoc. 1990, 85, 765–769.
14. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
15. Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
16. Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 1995, 90, 106–121.
17. Jin, J.; Ma, T.F.; Dai, J.J. Penalized weighted composite quantile regression for partially linear varying coefficient models with missing covariates. Comput. Stat. 2020, 1, 1–35.
18. Owen, A. Empirical likelihood ratio confidence intervals for a single function. Biometrika 1988, 75, 237–249.
19. Owen, A. Empirical likelihood ratio confidence regions. Ann. Stat. 1990, 18, 90–120.
20. Zhao, P.X.; Xue, L.G. Empirical likelihood inferences for semiparametric varying coefficient partially linear models with missing responses at random. Commun. Stat.-Theory Methods 2010, 27, 771–780.
21. Wang, H.J.; Zhu, Z. Empirical likelihood for quantile regression models with longitudinal data. J. Stat. Plan. Inference 2011, 141, 1603–1615.
22. Yan, Y.X.; Luo, S.H.; Zhang, C.Y. Statistical inference for partially linear varying coefficient quantile models with missing responses. Symmetry 2022, 14, 2258.
23. Xue, L.G. Two-stage estimation and bias-corrected empirical likelihood in a partially linear single-index varying-coefficient model. Stat. Methodol. 2023, 85, 1299–1325.
24. Zhao, P.X.; Tang, X.R. Imputation based statistical inference for partially linear quantile regression models with missing responses. Metrika 2016, 79, 991–1009.
25. Hjort, N.; Pollard, D. Asymptotics for minimizers of convex processes. arXiv 2011, arXiv:1107.3806.
26. Knight, K. Limiting distributions for L1 regression estimators under general conditions. Ann. Stat. 1998, 26, 755–770.
27. Xue, L.G.; Zhu, L.X. Empirical likelihood semiparametric regression analysis for longitudinal data. Biometrika 2007, 94, 921–937.
28. Xue, L.G.; Zhu, L.X. Empirical likelihood in a partially linear single-index model with censored response data. Comput. Stat. Data Anal. 2024, 193, 107912.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).