Abstract
The construction of confidence intervals is investigated for the partially linear varying coefficient quantile model with responses missing at random. Combining quantile regression with imputation, an empirical likelihood method is proposed to construct confidence intervals for the parametric and varying coefficient components. It is then proved that the proposed empirical log-likelihood ratios are asymptotically Chi-square distributed. Finally, in the simulation studies, symmetric confidence intervals for the parametric components and pointwise confidence intervals for the varying coefficient components are constructed, which further demonstrates that the proposed method yields shorter confidence intervals and higher coverage probabilities.
1. Introduction
The partially linear varying coefficient model, originally proposed by [1], is a very important semi-parametric model, and it has not only the flexibility of the semi-parametric model, but also the easy interpretation of the parametric model. In recent years, the model has been studied by many scholars. Zhou and You [2] combined wavelet and least squares to estimate the parameter part and the varying coefficient part of the model. Zhang et al. [3] proposed a method for estimating the parameter part and the varying coefficient part of the model, and derived the asymptotic conditional bias and variance to give some idea about the mean square error of the estimator. Xia et al. [4] investigated variable selection for semi-parametric varying coefficient partially linear models with random missing responses. Then, they presented a bias-corrected variable selection procedure and established the oracle property of the regularized estimators.
On the other hand, the quantile regression (QR) model, proposed by [5], has been extensively applied in environmental monitoring, population census, biomedicine, etc. One significant reason is that, compared with the mean model, quantile regression directly estimates the effects of the covariates at different quantiles, so the quantile regression estimator plays an important role in characterizing the entire conditional distribution of a dependent variable given the regressors, and it is robust to outlying observations. Cai and Xiao [6] studied quantile regression for dynamic models with partially varying coefficients, in which some coefficients are functions of covariates, and proposed estimators for both the parametric and the nonparametric function coefficients. Jin et al. [7] studied partially linear varying coefficient models with missing covariates. They proposed a weighted B-spline composite quantile regression method based on inverse probability weighting and B-spline approximations to estimate the non-parametric function and the regression coefficients.
Despite significant advances in QR theory and its applications, QR analysis has received little attention when data samples contain missing values. In many practical problems, such as clinical trials and medical follow-up trials, a large amount of data can easily go missing due to various human or other unknown factors. Several methods are available to handle the missing data problem, such as the complete-case (CC) analysis method [8], the imputation method [9,10,11], the inverse probability weighted (IPW) method [12,13,14] and the likelihood-based method [15]. Among these methods, the imputation method is the most popular and effective method for managing missing data under the missing at random (MAR) assumption. In this paper, we study the partially linear varying coefficient quantile regression model with missing response variables and propose an imputation-based empirical likelihood inference method, which can fully utilize the information in the non-missing data when the observed data contain missing values at a specific time point.
The empirical likelihood method has been widely researched in recent years. Owen [16] first proposed the empirical likelihood method to deal with nonparametric statistical problems. Under certain regularity conditions, the estimators obtained by this method have good statistical properties. Thus, this method has attracted the interest of many statisticians and is applied in various statistical fields. You and Zhou [17] investigated the empirical likelihood inference of the parameter component in the partially linear varying coefficient model and obtained some results. Huang and Zhang [18] considered the statistical inference for the nonparametric component in the partially linear varying coefficient model and showed that the proposed method can obtain more desirable coverage probabilities and average areas of confidence regions. Chen [19] investigated the empirical likelihood estimator based on the imputation of missing values using quantile regression, and showed that the proposed method has competitive advantages over some of the most widely used parametric and non-parametric imputation estimators. Wang and Rao [20] constructed empirical likelihood confidence intervals for the mean of the response variable in linear models and nonparametric regression models under random design and missing data. Their results show that the adjusted empirical likelihood method performs competitively and that the use of auxiliary information improves inferences.
The rest of this paper is organized as follows. In Section 2, the confidence interval construction method based on imputation empirical likelihood is presented for the parametric component and the varying coefficient component, and some asymptotic properties of the proposed empirical log-likelihood ratios are investigated. In Section 3, simulation studies are conducted to assess the performance of the proposed method. The proofs of the main results are given in Section 4.
2. Quantile Regression Estimates for Partially Linear Varying Coefficient Model
We consider the partially linear varying coefficient regression model:
where Y is the response variable, X is a p-dimensional covariate vector, is an unknown smooth coefficient function vector, and Z are q-dimensional vectors, and U is a covariate; in order to avoid the curse of dimensionality, and without loss of generality, its range is assumed to be the unit interval . In addition, is the model error with , and and are independent of each other. Suppose are random samples; then we have
Throughout this paper, an indicator variable such that means that is observed and indicates that is missing. We assume that the data missing mechanism follows
Due to , we have
where is an indicator function. Then, we approximate by means of basis functions. Generally, let be B-spline basis functions with the order of M, where , and K denotes the number of interior knots. Then, can be approximated by
where is a vector of basis function coefficients. Further, the quantile regression estimation of and under complete data can be obtained by solving the following equation:
where is the quantile loss function. Then, the model (6) can be written as
where . Differentiating (7) with respect to the parameters, we obtain
Let and be the solution of (8), then is the estimator of , and the estimator of can be given by . Use to estimate the missing , and give an imputation for the responses as
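The B-spline approximation step used in this section can be illustrated with a small pure-Python sketch of the Cox-de Boor recursion on the unit interval. With clamped boundary knots, an order-M spline with K interior knots has K + M basis functions. This knot convention, the equally spaced knots in the usage lines, and the names `bspline_basis` and `spline_value` are assumptions of the sketch rather than details taken from the paper.

```python
def bspline_basis(u, interior_knots, order):
    """Evaluate all B-spline basis functions of the given order at u in [0, 1].

    Assumption: clamped knots, i.e. the boundary knots 0 and 1 are each
    repeated `order` times. Returns a list of length K + order, where K is
    the number of interior knots.
    """
    t = [0.0] * order + sorted(interior_knots) + [1.0] * order
    # Order-1 (piecewise constant) basis: indicator of the knot interval.
    b = [1.0 if t[i] <= u < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if u == 1.0:
        # Right endpoint belongs to the last non-degenerate interval.
        b[len(t) - order - 1] = 1.0
    # Cox-de Boor recursion, raising the order one step at a time.
    for k in range(2, order + 1):
        new_b = []
        for i in range(len(t) - k):
            left_den = t[i + k - 1] - t[i]
            right_den = t[i + k] - t[i + 1]
            left = (u - t[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = (t[i + k] - u) / right_den * b[i + 1] if right_den > 0 else 0.0
            new_b.append(left + right)
        b = new_b
    return b


def spline_value(u, interior_knots, order, coef):
    """Spline approximation at u: inner product of basis vector and coefficients."""
    basis = bspline_basis(u, interior_knots, order)
    return sum(bj * cj for bj, cj in zip(basis, coef))


# Usage: cubic splines (order 4) with 3 equally spaced interior knots.
knots = [0.25, 0.5, 0.75]
basis = bspline_basis(0.37, knots, 4)
print(len(basis), sum(basis))
```

In a fit, one coefficient vector of this length would be estimated per component of the varying coefficient function, so the nonparametric part becomes linear in the unknown spline coefficients.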
2.1. The Imputation Empirical Likelihood for
In order to construct a confidence interval for based on the empirical likelihood method, we define the following imputation-based auxiliary random vector
The empirical log-likelihood ratio for is as follows:
If zero is inside the convex hull of the point , a unique value for exists. By using the Lagrange multiplier method and some calculations, can be written as follows:
where is a Lagrange multiplier, which satisfies
Under some regularity conditions, we can show that is asymptotically Chi-square distributed with q degrees of freedom when is the true parameter.
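As a rough numerical illustration of the Lagrange multiplier step, the following Python sketch computes an empirical log-likelihood ratio in the scalar case (q = 1). It relies on the standard empirical likelihood form, in which the ratio equals 2 Σ log(1 + λ η_i) with λ solving Σ η_i / (1 + λ η_i) = 0. The names `eta`, `lam` and `el_log_ratio` are hypothetical, and the input values merely stand in for the imputation-based auxiliary variables; this is a sketch under those assumptions, not the authors' implementation.

```python
import math


def el_log_ratio(eta, tol=1e-12, max_iter=200):
    """Empirical log-likelihood ratio for scalar auxiliary values eta_i.

    Returns (ratio, lam). If zero is not strictly inside the convex hull
    of the eta_i (here: the interval [min, max]), no solution exists and
    the ratio is reported as infinity.
    """
    lo_eta, hi_eta = min(eta), max(eta)
    if not (lo_eta < 0.0 < hi_eta):
        return math.inf, None

    # Admissible multipliers keep every implied weight positive:
    # 1 + lam * eta_i > 0 for all i.
    lower, upper = -1.0 / hi_eta, -1.0 / lo_eta
    margin = 1e-10 * (upper - lower)
    lower, upper = lower + margin, upper - margin

    def score(lam):
        # Strictly decreasing in lam on the admissible interval.
        return sum(e / (1.0 + lam * e) for e in eta)

    # Bisection for the root of the score equation.
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        if score(mid) > 0.0:
            lower = mid
        else:
            upper = mid
        if upper - lower < tol:
            break
    lam = 0.5 * (lower + upper)
    ratio = 2.0 * sum(math.log1p(lam * e) for e in eta)
    return ratio, lam


# Usage with made-up auxiliary values.
ratio, lam = el_log_ratio([-1.0, 2.0, 3.0])
print(ratio, lam)
```

Bisection is used here only because the scalar score function is monotone on the admissible interval; for a q-dimensional auxiliary vector, a Newton-type solver for the multiplier would normally take its place.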
2.2. The Imputation Empirical Likelihood for
In this section, we construct the confidence interval for based on the empirical likelihood method. Due to , it is easy to prove that
where is the density function of . Then we construct the empirical log-likelihood ratio for according to a similar method, and we use the following imputation-based auxiliary random vector
where is given by (8), and is a kernel function. However, using existing results, we can prove that is not asymptotically Chi-square distributed. To solve this problem, we use some undersmoothing techniques.
We propose a bias correction for as follows:
where , and are obtained by (8). In what follows, we define a bias correction based empirical log-likelihood ratio function for as follows:
2.3. Asymptotic Properties of Estimators
To prove the asymptotic properties, we suppose that the following regularity conditions hold. For convenience and brevity, let c denote a positive constant that may take different values at different occurrences.
C1. The function has continuous rth derivatives, where .
C2. Let be the conditional density function of given and Z. Then has continuous and uniformly bounded first-order and second-order derivatives.
C3. If are the interior knots of [0, 1], there exists a constant such that
where and .
C4. Suppose , then have bounded partial derivatives up to the order , where . We let for all and z.
C5. Assuming holds.
C6. We assume the matrix is a non-singular and finite matrix, where T represents the transpose of the matrix.
Theorem 1.
Suppose that conditions C1–C6 hold, and the number of knots satisfies . Then
(1) ,
(2) ,
where means convergence in distribution, .
Theorem 2.
Suppose that conditions C1–C6 hold. If β is the true value of the parameter, then
where means convergence in distribution, and is the Chi-square distribution with q degrees of freedom.
Based on the results, we can construct a confidence region for . For a given with , let satisfy . Then the approximate confidence region for can be defined as
Theorem 3.
Suppose that conditions C1–C6 hold. If is the true value of the parameter of u, then
where means convergence in distribution, and is the Chi-square distribution with p degrees of freedom.
Based on Theorem 3, we can obtain an approximate confidence interval for . For a given with , let satisfy . Then the approximate confidence region for can be defined as
The proofs of the theorems rely on the lemmas in Section 4.
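The way a Chi-square limit is turned into a confidence region (keep every parameter value whose log-likelihood ratio does not exceed the Chi-square critical value) can be sketched on a toy problem. The Python example below builds an empirical likelihood interval for a single τ-th quantile of a sample, using the auxiliary variable I(y_i ≤ θ) − τ, for which the ratio has a closed form. This toy setting, the one-degree-of-freedom calibration, and all function names are assumptions made for illustration; the paper's regions use its own imputation-based auxiliary vectors.

```python
import math
from statistics import NormalDist


def chi2_1_critical(alpha):
    """Upper-alpha critical value of the Chi-square distribution with 1 df.

    Uses the fact that a Chi-square(1) variable is a squared standard normal.
    """
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2


def el_ratio_quantile(y, theta, tau):
    """Empirical log-likelihood ratio for 'theta is the tau-th quantile'.

    The auxiliary variable I(y_i <= theta) - tau takes only two values, so
    the optimal weights are constant within each group and the ratio is
    available in closed form.
    """
    n = len(y)
    n1 = sum(1 for v in y if v <= theta)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return math.inf  # zero is outside the convex hull
    return 2.0 * (n1 * math.log(n1 / (n * tau))
                  + n0 * math.log(n0 / (n * (1.0 - tau))))


def el_confidence_interval(y, tau, alpha=0.05):
    """Smallest and largest observed values accepted at level 1 - alpha."""
    c = chi2_1_critical(alpha)
    accepted = [v for v in sorted(y) if el_ratio_quantile(y, v, tau) <= c]
    return (accepted[0], accepted[-1]) if accepted else None


# Usage: median of the values 1, 2, ..., 100.
y = [float(v) for v in range(1, 101)]
print(el_confidence_interval(y, tau=0.5))
```

Only observed values are scanned as candidates because this particular ratio is piecewise constant in θ and changes only at data points.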
3. Simulation Studies
To demonstrate the finite sample performance of the proposed method, we consider the following partially linear varying coefficient model:
where , , , . The response is generated according to the model. The model error is generated according to , where follows the Chi-square distribution with one degree of freedom, and is the th quantile of . It is easy to obtain . In the following simulation, we take . Consider the following two cases of selection probability : (1) ; (2) . Then, the missing rates corresponding to the two scenarios are approximately 0.1 and 0.3, respectively. In the following simulation, we take , where c is chosen as the standard deviation of . The number of interior knots K is used in (5) and the bandwidth h is used in (14). Further, K is estimated by minimizing the cross-validation score:
where is the quantile loss function, and and are the estimators of and , which are obtained by (8) after deleting the ith subject.
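The error design described above can be sketched in Python using only the standard library. The sketch draws η as the square of a standard normal variable (which is Chi-square with one degree of freedom), subtracts its τ-th quantile, and checks that the resulting errors have a τ-th quantile close to zero. It also implements the usual quantile (check) loss ρ_τ(r) = r(τ − I(r < 0)). The value τ = 0.5, the sample size, the seed and all function names are assumptions made for illustration only.

```python
import random
from statistics import NormalDist


def chi2_1_quantile(tau):
    """tau-th quantile of Chi-square(1).

    If eta = W**2 with W standard normal, then
    P(eta <= q) = 2 * Phi(sqrt(q)) - 1, which is inverted here.
    """
    return NormalDist().inv_cdf((1.0 + tau) / 2.0) ** 2


def simulate_errors(n, tau, seed=0):
    """Draw n errors eta - q_tau, so that their tau-th quantile is zero."""
    rng = random.Random(seed)
    q_tau = chi2_1_quantile(tau)
    return [rng.gauss(0.0, 1.0) ** 2 - q_tau for _ in range(n)]


def check_loss(r, tau):
    """Quantile (check) loss: r * (tau - I(r < 0))."""
    return r * (tau - (1.0 if r < 0 else 0.0))


def mean_check_loss(errors, c, tau):
    """Average check loss of the residuals errors - c."""
    return sum(check_loss(e - c, tau) for e in errors) / len(errors)


# Usage: the share of non-positive errors should be close to tau, and the
# average check loss should be smaller at c = 0 than at shifted values.
tau = 0.5
eps = simulate_errors(20000, tau)
share = sum(1 for e in eps if e <= 0.0) / len(eps)
print(share, mean_check_loss(eps, 0.0, tau), mean_check_loss(eps, 0.5, tau))
```

A leave-one-out cross-validation score of the kind used to choose K would average this same loss over held-out residuals, refitting the model once per deleted subject.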
We compared the following three methods to evaluate the performance of the proposed statistical inference method: the imputation-based empirical likelihood method (IEL) proposed by this paper, the complete data-based empirical likelihood method (CEL), and the full data set (i.e., no missing data)–based empirical likelihood method (FEL). In this simulation, the sample size is set to be 100, 500 and 1000, and we take 500 simulation runs for each case. Based on the run results, the averages of the confidence intervals for the parametric component are summarized in Table 1, and the corresponding coverage probabilities are summarized in Table 2.
Table 1.
Confidence intervals for for different selection probability functions under three different methods.
Table 2.
Confidence interval length and coverage probabilities for under three different methods.
From Tables 1 and 2, the following can be seen for the parametric component:
(1) The IEL method performs better than the CEL method, because it yields shorter confidence intervals and higher coverage probabilities.
(2) With the same missing rate, as the sample size increases, the confidence intervals for both the IEL method and the CEL method become smaller, but the confidence interval for the IEL method is always smaller than the confidence interval for the CEL method.
(3) As n increases, the performance of the IEL method becomes closer and closer to that of the FEL estimation process. These results suggest that the proposed IEL process can weaken the effect of the missing rate compared to the CEL method.
For the varying coefficient part, we compare the IEL method with the CEL method. Here, we compare the case of with the case of . Because the results for and differ little, we do not show them here. Figure 1 and Figure 2 summarize the finite sample performances of the IEL and CEL methods for the varying coefficient part under different missing rates and different sample sizes. In each figure, panel (a) shows the averages of the pointwise confidence intervals over 500 simulation runs under the first missing rate , and panel (b) shows the averages of the pointwise confidence intervals over 500 simulation runs under the second missing rate .
Figure 1.
Average of 95% pointwise confidence intervals of two choice probabilities for varying coefficient part with .
Figure 2.
Average of 95% pointwise confidence intervals of two choice probabilities for varying coefficient part with .
For the varying coefficient part, the following can be seen from the figures:
(1) With the same missing rate, the IEL method is better than the CEL method because the IEL method gives a smaller interval length.
(2) As the missing rate increases, the pointwise confidence intervals for both the IEL method and the CEL method become larger, but the confidence interval for the IEL method is always smaller than that for the CEL method.
(3) As the sample size increases, the pointwise confidence intervals for both the IEL method and the CEL method become smaller. The pointwise confidence interval for the IEL method is always smaller than the pointwise confidence interval for the CEL method.
4. Proofs of Theorems
Lemma 1.
Suppose that conditions C1–C6 hold, and the number of knots satisfies . We can obtain
Proof of Lemma 1.
Let , , , . For any given , there exists a constant C, such that
where . It is obvious that (17) implies with probability of at least that there exists a local such that .
Based on the definition of , below we give a calculation
where . Suppose . Then we have
From conditions C1, C3 and Corollary 6.21 in [21], we get that . Combining Conditions C2 and C5, we obtain , . Therefore, by choosing a sufficiently large C, dominates uniformly in . This means that for any given , if we choose C large enough, we obtain
According to (17), there exists a local minimum such that with probability at least . The proof of Lemma 1 is completed. □
Proof of Theorem 1.
Note that
where . It is easy to obtain that . We can obtain with the same result as in the proof of Lemma 1. Thus, we have
Next, we prove Theorem 1(2). Let , then we can obtain
Note that
Thus, we have
Moreover, we can obtain and . Then, using the central limit theorem, we can obtain
Next, we prove . Let be the jth component of , be the jth component of . Note that is the centered covariate. Through Lemma A.2 in [22], we obtain
In addition, by Theorem 1(1), we obtain
In the following we will prove . Below we give a simple calculation
Thus, we can obtain
Proof of Theorem 2.
Based on the definition of , and using the theories similar to [23], we obtain
and
Lemma 2.
Suppose that conditions C1–C6 hold, and the number of knots satisfies . We can obtain
where .
Proof of Lemma 2.
Let , . It can be obtained by calculation
Thus, we can have
Moreover, we have the calculation as follows:
Thus,
We can obtain and . Using the central limit theorem, we can obtain
Next we prove . According to Theorem 1, and using a conclusion similar to that used in the proof of Theorem 2 in [24], we obtain . In addition, using conditions C5 and C6, we can obtain
Therefore,
Next, we prove . Using the Taylor expansion to and at u, we can obtain
By conditions C1–C6, we have
Next, we prove . We have the calculation as follows:
Then, using a method similar to (26) based on Abel’s inequality, we can obtain
Proof of Theorem 3.
Similar to the proof of Theorem 2, for given u, we have
where . According to Lemma 2, we can get . Further, combined with Lemma 2, we obtain . □
Author Contributions
Methodology, S.L.; Writing—original draft, Y.Y.; Writing—review and editing, C.-y.Z. All the authors inferred the main conclusions and approved of the current version of this manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (No. 11601409) and the Natural Science Foundation of Shaanxi Province of China (Nos. 2020JM571, 2021JM-002).
Data Availability Statement
The data presented in this paper are obtained through computer simulation.
Acknowledgments
The authors would like to thank the anonymous referees for their valuable comments and suggestions, which actually stimulated this work.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Fan, J.Q.; Huang, T. Profile likelihood inferences on semiparametric varying coefficient partially linear models. Bernoulli 2005, 11, 1031–1057.
- Zhou, X.; You, J.H. Wavelet estimation in varying coefficient partially linear regression models. Stat. Probab. Lett. 2004, 68, 91–104.
- Zhang, W.; Lee, S.; Song, X. Local polynomial fitting in semivarying coefficient models. J. Multivar. Anal. 2002, 82, 166–188.
- Xia, Y.F.; Qu, Y.R.; Sun, N.L. Variable selection for semiparametric varying coefficient partially linear model based on modal regression with missing data. Commun. Stat.-Theory Methods 2019, 48, 5121–5137.
- Koenker, R.; Bassett, G., Jr. Regression Quantiles. Econometrica 1978, 46, 33–50.
- Cai, Z.W.; Xiao, Z.J. Semiparametric quantile regression estimation in dynamic models with partially varying coefficients. J. Econom. 2012, 167, 413–425.
- Jin, J.; Ma, T.F.; Dai, J.J.; Liu, S.Z. Penalized weighted composite quantile regression for partially linear varying coefficient models with missing covariates. Comput. Stat. 2021, 36, 541–575.
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2014.
- Rubin, D. Multiple Imputations for Nonresponse in Surveys; John Wiley & Sons Inc.: New York, NY, USA, 1987.
- Lipsitz, S.R.; Zhao, L.P.; Molenberghs, G. A semiparametric method of multiple imputation. J. R. Stat. Soc. Ser. B 1998, 60, 127–144.
- Aerts, M.; Claeskens, G.; Hens, N.; Molenberghs, G. Local multiple imputation. Biometrika 2002, 89, 375–388.
- Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
- Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
- Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 1995, 90, 106–121.
- Ibrahim, J. Incomplete data in generalized linear models. J. Am. Stat. Assoc. 1990, 85, 765–769.
- Owen, A.B. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 1988, 75, 237–249.
- You, J.H.; Zhou, Y. Empirical likelihood for semi-parametric varying coefficient partially linear model. Stat. Probab. Lett. 2006, 76, 412–422.
- Huang, Z.S.; Zhang, R.Q. Empirical likelihood for nonparametric parts in semiparametric varying coefficient partially linear models. Stat. Probab. Lett. 2009, 79, 1798–1808.
- Chen, S.N. Imputation of Missing Values Using Quantile Regression. Ph.D. Thesis, Iowa State University, Ames, IA, USA, 2014.
- Wang, Q.; Rao, J.N.K. Empirical likelihood-based inference under imputation for missing response data. Ann. Stat. 2002, 30, 896–924.
- Schumaker, L.L. Spline Functions; Wiley: Hoboken, NJ, USA, 2007.
- Zhao, P.X.; Xue, L.G. Empirical likelihood inferences for semiparametric varying coefficient partially linear models with longitudinal data. Commun. Stat.-Theory Methods 2010, 39, 1898–1914.
- Xue, L.G.; Zhu, L.X. Empirical likelihood semiparametric regression analysis for longitudinal data. Biometrika 2007, 94, 921–937.
- Lv, X.F.; Li, R. Smoothed empirical likelihood analysis of partially linear quantile regression models with missing response variables. Adv. Stat. Anal. 2013, 97, 317–347.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).