Article

Identification and Empirical Likelihood Inference in Nonlinear Regression Model with Nonignorable Nonresponse

1 Department of Statistics, Jiangsu University of Technology, Changzhou 213001, China
2 School of Mathematics and Information Technology, Yuncheng University, Yuncheng 044000, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(9), 1388; https://doi.org/10.3390/math13091388
Submission received: 25 March 2025 / Revised: 20 April 2025 / Accepted: 23 April 2025 / Published: 24 April 2025
(This article belongs to the Special Issue Modeling, Control and Optimization of Biological Systems)

Abstract:
The identification of model parameters is a central challenge in the analysis of nonignorable nonresponse data. In this paper, we propose a novel penalized semiparametric likelihood method to obtain sparse estimators for a parametric nonresponse mechanism model. Based on these sparse estimators, an instrumental variable is introduced, enabling the identification of the observed likelihood. Two classes of estimating equations for the nonlinear regression model are constructed, and the empirical likelihood approach is employed to make inferences about the model parameters. The oracle properties of the sparse estimators in the nonresponse mechanism model are systematically established. Furthermore, the asymptotic normality of the maximum empirical likelihood estimators is derived. It is also shown that the empirical log-likelihood ratio functions are asymptotically weighted chi-squared distributed. Simulation studies are conducted to validate the effectiveness of the proposed estimation procedure. Finally, the practical utility of our approach is demonstrated through the analysis of ACTG 175 data.

1. Introduction

Consider a dataset comprising $n$ independent observations $\{(x_i, Y_i)\}_{i=1}^{n}$, where each observation includes a covariate vector $x_i \in \mathbb{R}^{d_x}$ and a scalar response variable $Y_i \in \mathbb{R}$. We consider a family of nonlinear regression models given by
$$Y_i = f(x_i; \theta) + \varpi(x_i)\,\varepsilon_i, \quad i = 1, \ldots, n,$$
where $f(x_i;\theta): \mathbb{R}^{d_x}\times\mathbb{R}^{p} \to \mathbb{R}$ is a known nonlinear function with an unknown vector of parameters $\theta \in \mathbb{R}^{p}$. The error term consists of two components: (1) $\varpi(x_i): \mathbb{R}^{d_x} \to \mathbb{R}^{+}$, a variance function that modulates the error scale as a function of the covariates, and (2) $\varepsilon_i$, a sequence of i.i.d. random variables with $E(\varepsilon_i \mid x_i) = 0$ and $\mathrm{Var}(\varepsilon_i \mid x_i) = \sigma^2$. Model (1) has been extensively studied in the statistical literature, including seminal contributions by Jennrich [1] and Wu [2]. A key example of such models is the Gompertz growth process, which is widely used in biology, epidemiology, and economics. The corresponding function is given by $f(x;\theta_1,\theta_2,\theta_3) = \theta_1\exp\{-\theta_2\exp(-\theta_3 x)\}$, where $\theta_1$ is the upper asymptote, $\theta_2$ controls displacement, and $\theta_3$ represents the growth rate (Fekedulegn et al. [3]). Similarly, the logistic growth function, frequently employed in population dynamics and epidemic modeling, is $f(x;\theta_1,\theta_2,\theta_3) = \theta_1/[1+\exp\{-\theta_2(x-\theta_3)\}]$, where $\theta_1$ is the carrying capacity, $\theta_2$ is the growth rate, and $\theta_3$ is the inflection point. For a fully observed dataset $\{(Y_i, x_i)\}_{i=1}^{n}$, the parameter $\theta$ in model (1) is traditionally estimated using the weighted least squares (WLS) criterion. This estimator is obtained by minimizing the weighted residual sum of squares $\sum_{i=1}^{n}\varpi^{-2}(x_i)\{Y_i - f(x_i;\theta)\}^2$. As a fundamental method in nonlinear regression, the WLS estimator achieves asymptotic efficiency under heteroscedasticity and serves as the theoretical foundation for various inferential procedures. For further details, see Ivanov [4].
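To make the WLS criterion concrete, the following is a minimal numerical sketch fitting the Gompertz model by weighted least squares. The data-generating values, the variance function, and all variable names are illustrative assumptions, not taken from the paper.

```python
# Sketch: WLS for the Gompertz model f(x; theta) = theta1*exp{-theta2*exp(-theta3*x)},
# assuming a known variance function varpi(x); all values are illustrative.
import numpy as np
from scipy.optimize import least_squares

def gompertz(x, theta):
    t1, t2, t3 = theta
    return t1 * np.exp(-t2 * np.exp(-t3 * x))

def weighted_residuals(theta, x, y, varpi):
    # least_squares minimizes the sum of squared residuals, so dividing by
    # varpi(x) reproduces the criterion sum_i varpi^{-2}(x_i){Y_i - f(x_i; theta)}^2
    return (y - gompertz(x, theta)) / varpi(x)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
varpi = lambda x: 1.0 + 0.1 * x            # assumed variance function
theta_true = np.array([5.0, 2.0, 0.8])     # illustrative true parameters
y = gompertz(x, theta_true) + varpi(x) * rng.normal(0.0, 0.5, size=200)

fit = least_squares(weighted_residuals, x0=np.array([4.0, 1.0, 1.0]),
                    args=(x, y, varpi))
print("WLS estimate:", fit.x)
```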
Missing data frequently arise in practical applications due to factors such as reluctance to answer sensitive survey questions. In such cases, directly applying conventional least squares procedures to estimate the parameter vector θ may lead to biased estimates and invalid conclusions (see, for example, Little and Rubin [5]). The inverse probability weighting (IPW) method, introduced by Horvitz and Thompson [6], remains a fundamental approach for addressing missing data challenges. To improve efficiency, Robins et al. [7] developed the augmented inverse probability weighting (AIPW) method, which builds upon a corrected version of complete-case analysis. Subsequent extensions of this methodology across various domains include significant contributions by Han [8], Xue and Xie [9], Sharghi et al. [10], and Li et al. [11], among others. For missing at random (MAR) scenarios, Tang and Zhao [12] developed IPW and AIPW estimating equations for empirical likelihood (EL) inference on θ , extending the foundational methodology of Owen [13]. In more challenging not missing at random (NMAR) settings, where nonresponse depends on the unobserved values, Yang and Tang [14] proposed an EL approach for inference in this modeling framework.
The identification challenge remains a fundamental issue in the analysis of nonignorable missing data. The observed likelihood is identifiable if two distinct populations do not produce identical observed likelihood functions. Crucially, identifiability can fail even when both the outcome model and the nonresponse mechanism model are parametrically defined, as demonstrated by Wang et al. [15]. Significant methodological advancements have been made in recent decades to address this issue. For parametric logistic nonresponse mechanisms, Yang and Tang [14] established identifiability conditions within the EL framework. In broader parametric settings, Wang et al. [15] introduced an instrumental variable (IV) approach to resolve the identifiability issue. More recently, Wang et al. [16] investigated an optimal subset selection method for identifying the IV from a set of candidate models. In addition, Chen et al. [17] suggested an IV selection technique based on pseudo-likelihood principles. Further advancements include the work of Du et al. [18] and Beppu and Morikawa [19]. Current estimation strategies for nonresponse mechanisms typically involve a two-stage process: first identifying an appropriate IV, and then estimating the parameters in the nonresponse mechanism model. However, these methods face significant computational challenges as the candidate model space expands.
A novel penalized semiparametric likelihood method for IV selection is proposed under the parametric assumption of the missingness mechanism. By leveraging the sparse structure of the observed likelihood, we develop a regularized approach to obtain the sparse estimators for the nonresponse mechanism model. To achieve this, we integrate the semiparametric likelihood framework with the SCAD penalty function (Fan and Li [20]). This shrinkage technique enables the simultaneous identification of IV and estimation of a sparse nonresponse mechanism model. Subsequently, the unbiased estimating equations based on IPW and AIPW methods are constructed, and the profile empirical log-likelihood ratio functions (ELLRFs) are rigorously formulated.
Our primary contributions are threefold. First, we propose a penalized semiparametric likelihood framework that effectively combines the SCAD penalty and the sparse likelihood structure. This approach facilitates simultaneous IV selection and parameter estimation for the nonresponse mechanism model. The resulting sparse estimators exhibit the oracle properties, ensuring both selection consistency and asymptotic efficiency. Second, the flexibility of the EL method enables the proposed estimation procedure to produce confidence regions with natural shapes and orientations. Third, under some regularity conditions, we show that the ELLRFs are asymptotically weighted chi-squared distributed, while the maximum empirical likelihood estimators (MELEs) are asymptotically normally distributed, providing a valid foundation for regression parameter inference.
The rest of this article is organized as follows. In Section 2, we present the penalized semiparametric likelihood methodology and construct two types of unbiased estimating equations. The MELEs and ELLRFs are also introduced. In Section 3, we investigate the oracle properties of the sparse estimators for the nonresponse mechanism model, as well as the asymptotic normality of the MELEs and the asymptotic properties of the proposed ELLRFs. Simulation studies and a real data analysis are conducted to evaluate the finite sample performance of the proposed estimates in Section 4 and Section 5, respectively. The concluding discussions are included in Section 6. Proofs of the asymptotic results are relegated to Appendix A.

2. Methods

2.1. Penalized Semiparametric Likelihood Estimation

Let $F(x, Y)$ be the unconditional joint distribution of $x$ and $Y$. Suppose that $n_1$ out of the $n$ individuals respond on both $Y$ and $x$, which results in data $(x_1, Y_1), \ldots, (x_{n_1}, Y_{n_1})$. For the rest of the $n - n_1$ individuals, their $Y$ values are not observed, but their $x$ values are always observed. Let $\delta$ represent a missingness indicator of $Y$, i.e., it takes 1 if $Y$ is observed, and takes 0 otherwise. Suppose that the covariate $x$ has two components, $x = (u^\top, z^\top)^\top$, such that the nonresponse mechanism can be modeled as
$$\pi(x, Y; \alpha) = P(\delta = 1 \mid x, Y) = \Psi(\alpha_0 + \alpha_u^\top u + \alpha_z^\top z + \alpha_y Y),$$
where $\alpha = (\alpha_0, \alpha_u^\top, \alpha_z^\top, \alpha_y)^\top \in \mathbb{R}^{d}$ is an unknown parameter to be estimated, and $\Psi$ is a known, strictly monotonic, twice-differentiable function from $\mathbb{R}$ to $(0, 1]$. Since model (2) depends explicitly on the potentially unobserved $Y$ when $\alpha_y \neq 0$, it describes a nonignorable missingness mechanism, often referred to as NMAR. In this context, the missingness indicator $\delta$ is typically assumed to follow a conditional Bernoulli distribution with probability $\pi(x, Y;\alpha)$. Notably, when $\alpha_y = 0$, the missingness mechanism simplifies to MAR, as the dependence on the unobserved $Y$ is eliminated.
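As an illustration of model (2), the following sketch generates a nonignorable missingness indicator under a logistic choice of $\Psi$; the coefficient values, covariate distributions, and variable names are assumptions made purely for demonstration.

```python
# Sketch: simulating an NMAR missingness indicator delta under model (2)
# with a logistic link Psi; all numeric values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 500
u = rng.normal(size=n)               # covariate entering the mechanism
z = rng.normal(size=n)               # candidate instrument (alpha_z = 0)
y = 1.0 + 0.5 * u + rng.normal(size=n)

alpha0, alpha_u, alpha_z, alpha_y = 0.5, 0.3, 0.0, 0.4
eta = alpha0 + alpha_u * u + alpha_z * z + alpha_y * y
pi = 1.0 / (1.0 + np.exp(-eta))      # Psi = logistic cdf
delta = rng.binomial(1, pi)          # delta = 1 iff Y is observed
print("empirical response rate:", delta.mean())
```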
Following Qin et al. [21], the likelihood of $(\alpha, F)$ based on the complete observations $\{(x_j, Y_j): j = 1, \ldots, n_1\}$ is
$$\prod_{j=1}^{n_1}\pi(x_j, Y_j;\alpha)\,dF(x_j, Y_j)\;\prod_{j=n_1+1}^{n}\int\{1 - \pi(x, Y;\alpha)\}\,dF(x, Y),$$
which can be rewritten as
$$L_C(\alpha, W) = \Bigg[\prod_{j=1}^{n_1}\frac{\pi(x_j, Y_j;\alpha)\,dF(x_j, Y_j)}{W}\Bigg]\,W^{n_1}(1 - W)^{n - n_1},$$
where $W = P(\delta = 1) = \int\pi(x, Y;\alpha)\,dF(x, Y)$ is the unconditional response rate. The first term in Equation (3) is the likelihood conditioning on $\delta = 1$, and the term $W^{n_1}(1 - W)^{n - n_1}$ is the binomial likelihood of $\delta$. The direct maximization of $L_C(\alpha, W)$ in Equation (3) may lose some information contained in $\{x_i: i = n_1 + 1, \ldots, n\}$. To address this limitation, we assume that the auxiliary information on $x$ can be characterized as $E\{\Delta(x)\} = 0$, where $\Delta(x) = (\Delta_1(x), \ldots, \Delta_r(x))^\top$ is a known $r$-vector (or scalar) function. To illustrate the rationale underlying the construction of the auxiliary function $\Delta(x)$, consider the case where the population mean of $x$, denoted by $\mu_x$, is known. In this setting, one may define $\Delta(x) = x - \mu_x$ to serve as auxiliary information. When the population mean $\mu_x$ is unavailable, it can be replaced by the estimated mean $\bar x = n^{-1}\sum_{i=1}^{n}x_i$. Thus, part of the information contained in $\{x_i: i = n_1 + 1, \ldots, n\}$ is recovered through $\mu_x$ or $\bar x$, thereby enhancing the efficiency of estimation under incomplete data.
By the auxiliary information on $x$ and without assuming any specific form of $F(x, Y)$, we can maximize the semiparametric likelihood (3) subject to the constraints
$$\phi_j \ge 0, \quad \sum_{j=1}^{n_1}\phi_j = 1, \quad \sum_{j=1}^{n_1}\phi_j\{\pi(x_j, Y_j;\alpha) - W\} = 0, \quad \sum_{j=1}^{n_1}\phi_j\,\Delta(x_j) = 0,$$
where $\phi_j$ is the jump of $F(x, Y)$ at $(x_j, Y_j)$, $j = 1, \ldots, n_1$.
By introducing Lagrange multipliers and profiling over the values of $\phi_j$, we obtain
$$\phi_j = \frac{1}{n_1}\cdot\frac{1}{1 + \lambda_1^\top\Delta(x_j) + \lambda_2\{\pi(x_j, Y_j;\alpha) - W\}},$$
where $\lambda_1$ and $\lambda_2$ are Lagrange multipliers, as described in Qin and Lawless [22].
Substituting the $\phi_j$ into Equation (3), the log-likelihood with respect to $\alpha$ and $W$ becomes
$$\ell(\alpha, W) = \sum_{j=1}^{n_1}\log\pi(x_j, Y_j;\alpha) + (n - n_1)\log(1 - W) - \sum_{j=1}^{n_1}\log\big[1 + \lambda_1^\top\Delta(x_j) + \lambda_2\{\pi(x_j, Y_j;\alpha) - W\}\big].$$
The identifiability of the observed likelihood as established by Wang et al. [16] relies on the existence of an IV $z$ that satisfies two conditions: (i) $z$ can be excluded from the nonresponse mechanism model, i.e., $z \perp \delta \mid (u, Y)$, and (ii) $z$ must be related to the study variable $Y$. Specifically, if the true parameter subvector $\alpha_{z0}$ corresponding to $z$ satisfies $\alpha_{z0} = 0$, then $z$ qualifies as a valid IV by design. This critical insight motivates the development of a penalized semiparametric likelihood framework to achieve the sparse estimation of $\alpha$ in the nonresponse mechanism model. The penalized likelihood estimator $\hat\alpha_p$ of $\alpha$ can be obtained by maximizing the following objective function:
$$\ell_p(\alpha, W) = \ell(\alpha, W) - n_1\sum_{j=1}^{d}g_\gamma(|\alpha_j|),$$
where $g_\gamma(\cdot)$ represents the SCAD penalty function. The first derivative of the penalty term is specified as
$$g'_\gamma(\beta) = \gamma\bigg\{I(\beta \le \gamma) + \frac{(a\gamma - \beta)_+}{(a - 1)\gamma}\,I(\beta > \gamma)\bigg\}$$
for $\beta > 0$, where $a > 2$ is a fixed constant, $\gamma$ is a tuning parameter, and $(z)_+ = \max(z, 0)$. Following Fan and Li [20], we set $a = 3.7$ throughout this study. As demonstrated in Theorem 1, the sparse estimator $\hat\alpha_p$ achieves the oracle properties, ensuring that $P(\hat\alpha_z = 0) \to 1$ as $n \to \infty$. This guarantees the consistent identification of $z$ as the IV.
Implementing the optimization procedure for (4) presents a notable challenge due to the involvement of the non-concave penalty function $g_\gamma(|\alpha_j|)$. To enhance numerical stability, we adopt the local quadratic approximation method introduced by Fan and Li [20]. Given the $m$-th iteration estimate $\alpha_j^{(m)}$, the penalty function can be approximated quadratically as follows:
$$g_\gamma(|\alpha_j|) \approx g_\gamma(|\alpha_j^{(m)}|) + \frac{1}{2}\,\frac{g'_\gamma(|\alpha_j^{(m)}|)}{|\alpha_j^{(m)}|}\,\big\{\alpha_j^2 - (\alpha_j^{(m)})^2\big\}.$$
This approximation simplifies the non-concave penalty function, thereby improving both the computational tractability and convergence properties of the optimization procedure. In addition to the approximation strategy, selecting an appropriate penalty parameter γ is crucial for optimizing model performance. To achieve this, we employ the following Bayesian information criterion:
$$\mathrm{BIC}(\gamma) = -2\,\ell_p(\hat\alpha_p, \hat W) + \log(n_1)\,\mathrm{df}_\gamma,$$
where $\mathrm{df}_\gamma$ is the number of nonzero elements in $\hat\alpha_p$. The optimal tuning parameter is obtained by minimizing $\mathrm{BIC}(\gamma)$ over $\gamma$.
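For reference, here is a small sketch of the SCAD first derivative, the LQA majorizer slope, and the BIC criterion described above; the function names are hypothetical and the outer tuning loop that would call them is omitted.

```python
# Sketch of the SCAD derivative, the LQA weight, and the BIC criterion;
# a = 3.7 follows Fan and Li [20]; function names are hypothetical.
import numpy as np

def scad_deriv(beta, gamma, a=3.7):
    """g'_gamma(b) = gamma*{I(b<=gamma) + (a*gamma-b)_+ / ((a-1)*gamma) * I(b>gamma)}."""
    b = np.abs(beta)
    tail = np.maximum(a * gamma - b, 0.0) / ((a - 1.0) * gamma)
    return gamma * np.where(b <= gamma, 1.0, tail)

def lqa_weight(alpha_m, gamma, eps=1e-8):
    # slope of the quadratic majorizer: g'_gamma(|alpha_j^(m)|) / |alpha_j^(m)|;
    # eps guards against division by zero once a coefficient is shrunk to zero
    return scad_deriv(alpha_m, gamma) / (np.abs(alpha_m) + eps)

def bic(penalized_loglik, alpha_hat, n1):
    # BIC(gamma) = -2*l_p(alpha_hat, W_hat) + log(n1) * df_gamma,
    # with df_gamma the number of nonzero components of alpha_hat
    return -2.0 * penalized_loglik + np.log(n1) * np.count_nonzero(alpha_hat)
```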

2.2. Construction of Estimating Equations

For complete data { ( Y i , x i ) : i = 1 , , n } , the WLS estimator can be obtained by solving the following equations:
$$\frac{1}{n}\sum_{i=1}^{n}\dot f_i(\theta)\,\varpi^{-2}(x_i)\{Y_i - f(x_i;\theta)\} = 0,$$
where $\dot f_i(\theta) = \partial f(x_i;\theta)/\partial\theta$.
When Y is subject to NMAR, we introduce the following estimating function based on the IPW approach for the ith individual:
$$\hat\varphi_1(x_i, Y_i;\theta, \hat\alpha_p) = \frac{\delta_i}{\hat\pi(x_i, Y_i;\hat\alpha_p)}\,\varphi(x_i, Y_i;\theta),$$
where $\varphi(x_i, Y_i;\theta) = \dot f_i(\theta)\,\varpi^{-2}(x_i)\{Y_i - f(x_i;\theta)\}$ and $\hat\pi(x_i, Y_i;\hat\alpha_p)$ is the consistent estimate of $\pi(x_i, Y_i;\alpha)$.
To improve efficiency, we develop the AIPW-type estimating function with imputation
$$\varphi_2(x_i, Y_i;\theta, \alpha) = \frac{\delta_i}{\pi(x_i, Y_i;\alpha)}\,\varphi(x_i, Y_i;\theta) + \bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha)}\bigg\}\,m_{\varphi 0}(x_i;\theta, \alpha),$$
where $m_{\varphi 0}(x_i;\theta, \alpha) = E\{\varphi(x_i, Y_i;\theta)\mid x_i, \delta_i = 0\}$. Following Tang et al. [23], the conditional density $f_0(Y_i\mid x_i) = f(Y_i\mid x_i, \delta_i = 0)$ satisfies
$$f_0(Y_i\mid x_i) = f_1(Y_i\mid x_i)\times\frac{O(x_i, Y_i;\alpha)}{E\{O(x_i, Y_i;\alpha)\mid x_i, \delta_i = 1\}},$$
where $f_1(Y_i\mid x_i) = f(Y_i\mid x_i, \delta_i = 1)$ is the conditional density of $Y_i$ given $x_i$ and $\delta_i = 1$, and $O(x_i, Y_i;\alpha) = \pi^{-1}(x_i, Y_i;\alpha) - 1$ is the conditional odds of nonresponse. Simple algebraic manipulations show that
$$m_{\varphi 0}(x_i;\theta, \alpha) = \frac{E\{\delta_i\,\varphi(x_i, Y_i;\theta)\,O(x_i, Y_i;\alpha)\mid x_i\}}{E\{\delta_i\,O(x_i, Y_i;\alpha)\mid x_i\}}.$$
A nonparametric kernel estimator of $m_{\varphi 0}(x_i;\theta, \alpha)$ can be obtained by
$$\hat m_{\varphi 0}(x;\theta, \alpha) = \sum_{i=1}^{n}\omega^*_{i0}(x;\alpha)\,\varphi(x_i, Y_i;\theta),$$
where the weight $\omega^*_{i0}(x;\alpha) = \delta_i\,O(x_i, Y_i;\alpha)\,K_h(x - x_i)\big/\sum_{k=1}^{n}\delta_k\,O(x_k, Y_k;\alpha)\,K_h(x - x_k)$, and $K_h(\cdot) = h^{-\kappa}K(\cdot/h)$ with $K(\cdot)$ being a $\kappa$-dimensional kernel function and $h$ representing a bandwidth sequence. Given $\hat\alpha_p$, a kernel-assisted estimating function for the $i$th observation is given by
$$\hat\varphi_2(x_i, Y_i;\theta, \hat\alpha_p) = \frac{\delta_i}{\hat\pi(x_i, Y_i;\hat\alpha_p)}\,\varphi(x_i, Y_i;\theta) + \bigg\{1 - \frac{\delta_i}{\hat\pi(x_i, Y_i;\hat\alpha_p)}\bigg\}\,\hat m_{\varphi 0}(x_i;\theta, \hat\alpha_p).$$
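The kernel-assisted imputation above can be sketched as follows for a scalar covariate with a Gaussian kernel; the helper names and the scalar simplifications (scalar $x$ and scalar-valued $\varphi$) are assumptions for illustration only.

```python
# Sketch: kernel estimator of m_{phi,0}(x0; theta, alpha) built from complete
# cases, with weights proportional to delta_i * O_i * K_h(x0 - x_i).
import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def m_phi0_hat(x0, x, delta, odds, phi, h):
    # odds: O(x_i, Y_i; alpha) = 1/pi_i - 1 evaluated at the complete cases;
    # phi: values of the estimating function (scalar-valued here for brevity)
    w = delta * odds * gaussian_kernel((x0 - x) / h) / h   # K_h(u) = K(u/h)/h
    return np.sum(w * phi) / np.sum(w)
```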

2.3. MELEs of Model Parameters

To fix the notation, we temporarily assume that $\alpha$ is known. Let $p_i^*$ be the non-negative weight allocated to $\varphi_1(x_i, Y_i;\theta, \alpha) = \{\delta_i/\pi(x_i, Y_i;\alpha)\}\,\varphi(x_i, Y_i;\theta)$ with a total mass of 1. Under the moment condition $E\{\varphi_1(x_i, Y_i;\theta, \alpha)\} = 0$, the profile EL ratio for $\theta$ is defined as
$$L_1(\theta) = \sup\bigg\{\prod_{i=1}^{n}np_i^* \;\bigg|\; p_i^*\ge 0,\; \sum_{i=1}^{n}p_i^* = 1,\; \sum_{i=1}^{n}p_i^*\,\varphi_1(x_i, Y_i;\theta, \alpha) = 0\bigg\}.$$
By introducing a Lagrange multiplier $t_n \in \mathbb{R}^{p}$, we have
$$p_i^* = \frac{1}{n\{1 + t_n^\top\varphi_1(x_i, Y_i;\theta, \alpha)\}},$$
where $t_n$ satisfies
$$Q_{n1}(\theta, t_n) = \frac{1}{n}\sum_{i=1}^{n}\frac{\varphi_1(x_i, Y_i;\theta, \alpha)}{1 + t_n^\top\varphi_1(x_i, Y_i;\theta, \alpha)} = 0.$$
Therefore, the ELLRF for $\theta$ can be shown to be
$$\ell_1(\theta, \alpha) = 2\sum_{i=1}^{n}\log\{1 + t_n^\top\varphi_1(x_i, Y_i;\theta, \alpha)\}.$$
Maximizing $\ell_1(\theta, \alpha)$ leads to the MELE of $\theta$, denoted by $\hat\theta_1^*$. Under some smoothness conditions, $\hat\theta_1^*$ can be obtained by simultaneously solving
$$Q_{n1}(\theta, t_n) = 0, \qquad Q_{n2}(\theta, t_n) = \frac{1}{n}\sum_{i=1}^{n}\frac{t_n^\top\,\partial_\theta\varphi_1(x_i, Y_i;\theta, \alpha)}{1 + t_n^\top\varphi_1(x_i, Y_i;\theta, \alpha)} = 0,$$
where $\partial_\theta$ denotes the partial derivative with respect to $\theta$.
In practical applications, since the parameter $\alpha$ is typically unknown, the ELLRF in Equation (5) cannot be used directly for inference about $\theta$. To address this, given $\hat\alpha_p$, the estimated ELLRF based on the IPW method is
$$\hat\ell_1(\theta, \hat\alpha_p) = 2\sum_{i=1}^{n}\log\{1 + \nu_n^\top\hat\varphi_1(x_i, Y_i;\theta, \hat\alpha_p)\},$$
where $\nu_n$ is the Lagrange multiplier and satisfies
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\hat\varphi_1(x_i, Y_i;\theta, \hat\alpha_p)}{1 + \nu_n^\top\hat\varphi_1(x_i, Y_i;\theta, \hat\alpha_p)} = 0.$$
Thus, the IPW-based MELE of $\theta$, denoted by $\hat\theta_1$, can be obtained by maximizing $\hat\ell_1(\theta, \hat\alpha_p)$. Similarly, the AIPW-based MELE of $\theta$, denoted by $\hat\theta_2$, can be obtained by maximizing $\hat\ell_2(\theta, \hat\alpha_p)$, where $\hat\ell_2(\theta, \hat\alpha_p) = 2\sum_{i=1}^{n}\log\{1 + \nu_n^{*\top}\hat\varphi_2(x_i, Y_i;\theta, \hat\alpha_p)\}$ with $\nu_n^*$ solving the corresponding Lagrange multiplier equations.
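Computationally, each evaluation of an ELLRF reduces to solving the inner Lagrange multiplier equation. A minimal sketch of this step is given below, assuming the estimating-function values are stacked in an $n\times p$ array; it follows Owen's standard formulation rather than any implementation by the authors.

```python
# Sketch: solve (1/n) sum_i g_i / (1 + t'g_i) = 0 for the Lagrange multiplier t
# by damped Newton iterations, then evaluate the empirical log-likelihood ratio.
import numpy as np

def el_lagrange(g, max_iter=50, tol=1e-10):
    n, p = g.shape
    t = np.zeros(p)
    for _ in range(max_iter):
        denom = 1.0 + g @ t
        grad = (g / denom[:, None]).mean(axis=0)
        if np.linalg.norm(grad) < tol:
            break
        # Jacobian of grad w.r.t. t: -(1/n) sum_i g_i g_i' / (1 + t'g_i)^2
        hess = -(g / denom[:, None] ** 2).T @ g / n
        step = np.linalg.solve(hess, -grad)
        # halve the step until all empirical weights stay positive
        while np.any(1.0 + g @ (t + step) <= 1e-10):
            step *= 0.5
        t = t + step
    return t

def ellrf(g):
    """Empirical log-likelihood ratio 2 * sum_i log(1 + t'g_i)."""
    t = el_lagrange(g)
    return 2.0 * np.sum(np.log(1.0 + g @ t))
```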
Remark 1. 
The proposed method is developed under the assumption that the response variable is subject to NMAR. This assumption is commonly adopted in practical applications, particularly in contexts such as longitudinal studies or clinical trials with outcome-dependent dropout. Notably, as demonstrated by the sensitivity analyses by Ding and Tang [24] and Yang and Tang [14], estimation methods based on the NMAR assumption can still perform well when the true missingness mechanism is MAR, suggesting their robustness to MAR data. However, when the data exhibit a mixture of MAR and NMAR mechanisms, such as when different individuals follow distinct missingness patterns, the validity of NMAR-based methods may be compromised unless a hierarchical structure missingness framework is explicitly incorporated as discussed by Morikawa and Kano [25]. Consequently, in real-world data applications, it is crucial to assess the plausibility of the NMAR assumption on a case-by-case basis, as model performance and identifiability may be sensitive to deviations from the assumed missingness mechanism.

3. Main Results

3.1. Asymptotic Properties

The asymptotic properties of the MELEs and ELLRFs require the following assumptions:
(A1)
The nonresponse mechanism satisfies $\pi(x, Y;\alpha) \ge c > 0$ almost surely and $\pi(x) = E\{\pi(x, Y;\alpha_0)\mid x\} \le 1$ almost surely; in a neighborhood of $\alpha_0$, $E|\pi(x_i, Y_i;\alpha)|^3 < \infty$, and $\partial^2\pi(x, Y;\alpha)/\partial\alpha\,\partial\alpha^\top$ exists and is bounded by an integrable function.
(A2)
The probability density function $G(x)$ is bounded away from zero on the support of $x$; the first and second derivatives of $G(x)$ are continuous, smooth and bounded; and $\sup_{x}E(\varepsilon^2\mid x)$ and $E(\|x\|^2)$ are finite.
(A3)
$m_{\varphi 0}(x;\theta, \alpha)$ is twice continuously differentiable in a neighborhood of $x$.
(A4)
The function $f(x;\theta)$ is continuous with respect to $\theta$, where $\theta$ lies in a compact set; $\dot f(\theta) = \partial f(x;\theta)/\partial\theta$ and $\ddot f(x;\theta) = \partial^2 f(x;\theta)/\partial\theta\,\partial\theta^\top$ exist; $\ddot f(x;\theta)$ has full column rank.
(A5)
$E\{\varpi^{-2}(x)\,\dot f(\theta)\,\dot f^\top(\theta)\}$ has full column rank.
(A6)
The kernel function $K(\cdot)$ is a probability density function such that (a) it is bounded and has a compact support; (b) it is symmetric with $\int u^2 K(u)\,du < \infty$; (c) $K(u) \ge d_0$ for some $d_0 > 0$ in some closed interval centered at zero; and (d) the bandwidth $h$ satisfies $nh \to \infty$ and $nh^4 \to 0$ as $n \to \infty$.
(A7)
As $n \to \infty$, $\liminf_{\beta\to 0^+}g'_\gamma(\beta)/\gamma > 0$, and the tuning parameter $\gamma$ satisfies $\sqrt{n}\,\gamma \to \infty$ and $\gamma \to 0$ as $n \to \infty$.
(A8)
The penalty function $g_\gamma(\cdot)$ satisfies $\max_{j\in\mathcal{B}}g'_\gamma(|\alpha_{0j}|) = o_p(n^{-1/2})$ and $\max_{j\in\mathcal{B}}g''_\gamma(|\alpha_{0j}|) = o_p(1)$, where $\mathcal{B} = \{j: \alpha_{0j} \neq 0\}$.
(A9)
The moment conditions
$$E\bigg\{\sup_{\xi\in\Theta}\bigg\|\frac{\partial^2 h_k(x_i, Y_i;\xi)}{\partial\alpha_j\,\partial\alpha_l}\bigg\|^2\bigg\} < \infty
\quad\text{and}\quad
E\bigg\{\sup_{\xi\in\Theta}\bigg\|\frac{\partial h_k(x_i, Y_i;\xi)}{\partial\alpha_j}\bigg\|^2\bigg\} < \infty$$
hold for $k = 1, \ldots, r+1$, $j = 1, \ldots, d$, and $l = 1, \ldots, d$, where $\xi = (\alpha^\top, W)^\top \in \Theta$ with $\Theta$ being a compact set, and $h(x_i, Y_i;\xi)$ is defined in (A1). The notation $h_k(x_i, Y_i;\xi)$ denotes the $k$-th component of $h(x_i, Y_i;\xi)$.
Condition A(1) is similar to that used by Qin et al. [21] and is necessary to establish the asymptotic normality of α ^ p . Condition A(2) is a standard condition in probability theory. Assumptions A(3)–A(5) are typical in empirical likelihood-based inference with estimating equations. Condition A(6) is a common assumption in the nonparametric literature. Assumptions A(7)–A(9) are required to establish the oracle properties of penalized semiparametric likelihood estimators.
Let $\alpha_0 = (\alpha_{10}^\top, \alpha_{20}^\top)^\top$ denote the true value of $\alpha = (\alpha_1^\top, \alpha_z^\top)^\top$, where $\alpha_1 = (\alpha_0, \alpha_u^\top, \alpha_y)^\top$. As discussed in Fan and Li [20], the SCAD penalty function possesses the oracle properties. These properties ensure that the SCAD penalty not only promotes a sparse model structure but also yields an estimator that is nearly unbiased for large parameter values. We establish the oracle properties and the consistency of $\hat\alpha_p$ in Theorem 1.
Theorem 1. 
Under Assumptions A(1) and A(7)–A(9), as $n \to \infty$, we have
(i) $\|\hat\alpha_p - \alpha_0\| = O_p(n^{-1/2})$;
(ii) $P\{\hat\alpha_z = 0\} \to 1$;
(iii) $\sqrt{n}\,M^{-1/2}(\hat\alpha_1 - \alpha_{10}) \xrightarrow{L} N(0, I)$, where $I$ represents the identity matrix, and $M$ is defined in Appendix A.
From Theorem 1, we establish the stochastic expansion
$$\sqrt{n}(\hat\alpha_p - \alpha_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}H_i(\alpha_0) + o_p(1),$$
where the influence function H i ( α 0 ) is defined in (A2) of the Appendix. The first part of Theorem 1 demonstrates that by appropriately selecting the tuning parameter γ , a root-n consistent penalized likelihood estimator can be obtained. Furthermore, Theorem 1 (ii) establishes the sparsity property, ensuring that α ^ z = 0 with probability approaching one. This result confirms that the penalized estimator effectively identifies and selects the IV z with high probability. Finally, Theorem 1 (iii) establishes the asymptotic normality of α ^ 1 , suggesting that the penalized likelihood method can yield efficient estimates of the nonzero components by effectively reducing the dimensionality of α .
Within the framework of the penalized semiparametric likelihood, the asymptotic properties of $\hat\theta_1$ and $\hat\theta_2$ are established below. For any vector $B$, let $B^{\otimes 2} = BB^\top$, and convergence in distribution is denoted by the symbol $\xrightarrow{L}$. We first define the key quantities:
$$V_1 = E\big\{\pi^{-1}(x, Y;\alpha_0)\,\varpi^{-4}(x)\,\dot f(\theta_0)^{\otimes 2}\,\varepsilon^2\big\}, \qquad \Gamma = E\big\{\varpi^{-2}(x)\,\dot f(\theta_0)^{\otimes 2}\big\},$$
$$V_2 = E\Big[\pi^{-1}(x, Y;\alpha_0)\big\{\varphi(x, Y;\theta_0) - m_{\varphi 0}(x;\theta_0, \alpha_0)\big\}^{\otimes 2}\Big] + E\big\{m_{\varphi 0}(x;\theta_0, \alpha_0)^{\otimes 2}\big\},$$
$$D(x, Y;\alpha_0) = \{\delta - \pi(x, Y;\alpha_0)\}\,\frac{\partial\,\mathrm{logit}\{\pi(x, Y;\alpha_0)\}}{\partial\alpha},$$
$$B_k = \mathrm{Cov}\big\{\varphi_k(x, Y;\theta_0, \alpha_0),\, D(x, Y;\alpha_0)\big\}, \qquad \Omega_k = \mathrm{Var}\big\{\varphi_k(x_i, Y_i;\theta_0, \alpha_0) - B_k H_i(\alpha_0)\big\}, \quad k = 1, 2.$$
Theorem 2. 
Suppose that Conditions (A1)–(A9) hold, $\Omega_1$ and $\Omega_2$ are positive definite matrices, $\theta_0$ is the unique true parameter value of $\theta$, and $\alpha$ is estimated by $\hat\alpha_p$. Define
$$\Sigma_1 = (\Gamma^\top V_1^{-1}\Gamma)^{-1}\Gamma^\top V_1^{-1}\Omega_1 V_1^{-1}\Gamma(\Gamma^\top V_1^{-1}\Gamma)^{-1}, \qquad \Sigma_2 = (\Gamma^\top V_2^{-1}\Gamma)^{-1}\Gamma^\top V_2^{-1}\Omega_2 V_2^{-1}\Gamma(\Gamma^\top V_2^{-1}\Gamma)^{-1}.$$
Then, as n , we have
(1)
Asymptotic normality:
$$\sqrt{n}(\hat\theta_1 - \theta_0) \xrightarrow{L} N(0, \Sigma_1), \qquad \sqrt{n}(\hat\theta_2 - \theta_0) \xrightarrow{L} N(0, \Sigma_2);$$
(2)
Likelihood ratio convergence:
$$\hat\ell_1(\theta_0, \hat\alpha_p) \xrightarrow{L} \sum_{k^*=1}^{p}\rho_{1,k^*}\,\chi^2_{1,k^*}, \qquad \hat\ell_2(\theta_0, \hat\alpha_p) \xrightarrow{L} \sum_{k^*=1}^{p}\rho_{2,k^*}\,\chi^2_{1,k^*},$$
where $\{\chi^2_{1,k^*}\}_{k^*=1}^{p}$ are independent chi-squared variates with 1 degree of freedom, and $\{\rho_{m,k^*}\}_{k^*=1}^{p}$ $(m = 1, 2)$ are the eigenvalues of $V_m^{-1}\Omega_m$.
Theorem 2 (1) establishes the asymptotic normality of $\hat\theta_1$ and $\hat\theta_2$, enabling normal approximation (NA)-based inference. Specifically, the $(1 - \vartheta)$-level NA confidence region for $\theta$ is
$$\big\{\theta: (\hat\theta_j - \theta)^\top n\hat\Sigma_j^{-1}(\hat\theta_j - \theta) \le \chi^2_{p, 1-\vartheta}\big\}, \quad j = 1, 2,$$
where $\hat\Sigma_j^{-1}$ is a consistent plug-in estimator of $\Sigma_j^{-1}$, and $\chi^2_{p, 1-\vartheta}$ denotes the $1 - \vartheta$ quantile of the chi-squared distribution $\chi^2_p$ with $p$ degrees of freedom. Theorem 2 (2) characterizes the ELLRFs, yielding the EL confidence region
$$\mathrm{CI}_\vartheta(\theta) = \big\{\theta: \hat\ell_j(\theta, \hat\alpha_p) \le c^{(j)}_\vartheta\big\}, \quad j = 1, 2,$$
where $c^{(j)}_\vartheta$ is the $1 - \vartheta$ quantile of the distribution of $\sum_{k^*=1}^{p}\rho_{j,k^*}\,\chi^2_{1,k^*}$, and $\{\rho_{j,k^*}\}_{k^*=1}^{p}$ are the eigenvalues of $V_j^{-1}\Omega_j$.
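In practice the critical value $c^{(j)}_\vartheta$ has no closed form; assuming the eigenvalues $\rho_{j,k^*}$ have already been estimated, it can be approximated by Monte Carlo, as in the following sketch (all names hypothetical).

```python
# Sketch: Monte Carlo approximation of the (1 - vartheta) quantile of the
# weighted chi-squared distribution sum_k rho_k * chi^2_{1,k}.
import numpy as np

def weighted_chi2_quantile(rho, level=0.95, n_sim=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # each row: independent chi^2_1 draws combined with the estimated weights
    draws = rng.chisquare(1, size=(n_sim, len(rho))) @ np.asarray(rho)
    return np.quantile(draws, level)

# e.g., with (illustrative) estimated eigenvalues rho = (0.9, 1.2):
print(weighted_chi2_quantile([0.9, 1.2]))
```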

3.2. Double Robustness

From Theorem 2, we know that if the model $\pi(x, Y;\alpha)$ is correctly specified, the proposed estimator $\hat\theta_1$ is unbiased and consistent under certain regularity conditions. However, verifying this specification is a challenging task, and the misspecification of $\pi(x, Y;\alpha)$ may result in biased estimates and reduced efficiency unless additional data assumptions are imposed. To address these limitations, a doubly robust estimation procedure has been developed in NMAR settings. Specifically, following Miao and Tchetgen [26] and Liu and Yuan [27], the conditional density function of $(z, Y, \delta)$ given $u$ can be factorized as
$$f(z, Y, \delta\mid u) = c(u)\exp\{(1 - \delta)\,\mathrm{OR}(Y\mid u)\}\,P(\delta\mid Y = 0, u)\,f(z, Y\mid\delta = 1, u),$$
where $c(u) = P(\delta = 1\mid u)/P(\delta = 1\mid Y = 0, u)$, $P(\delta = 1\mid Y = 0, u)$ is the baseline propensity score, $f(z, Y\mid\delta = 1, u)$ is the baseline outcome density, and
$$\mathrm{OR}(Y\mid u) = \log\frac{P(\delta = 0\mid Y, u)\,P(\delta = 1\mid Y = 0, u)}{P(\delta = 0\mid Y = 0, u)\,P(\delta = 1\mid Y, u)}$$
is the log of the conditional odds ratio function relating $Y$ and $\delta$ given $u$.
When focusing on the estimation of the response mean, $\theta = E(Y)$, we have $\varphi(x_i, Y_i;\theta) = Y - \theta$. As demonstrated by Liu and Yuan [27], if $\mathrm{OR}(Y\mid u)$ is correctly specified, the estimator $\hat\theta_1$ is unbiased and consistent if either $f(z, Y\mid\delta = 1)$ or $P(\delta = 1\mid Y = 0, u)$ is correctly specified. Therefore, by selecting an appropriate model of the log odds ratio from a set of candidate models, the proposed estimation procedure is recommended within the EL framework for nonlinear regression. This approach helps mitigate potential biases arising from the misspecification of the missingness mechanism.
Moreover, if both $\pi(x, Y;\alpha)$ and the moment functions $m_{\varphi 0}(x_i;\theta, \alpha)$ are correctly specified, the proposed estimator $\hat\theta_2$ remains unbiased and consistent under certain regularity conditions. Following Zhao et al. [28], we show that the moment functions $\varphi_2(x_i, Y_i;\theta, \alpha)$ possess the double robustness property when the missingness mechanism, as specified in model (2), is modeled parametrically. The double robustness property is summarized in the following Proposition 1.
Proposition 1. 
(1) Regardless of the choice of $m_{\varphi 0}(x_i;\theta, \alpha)$, $\varphi_2(x_i, Y_i;\theta, \alpha)$ has mean zero, provided that the model for $\pi(x, Y;\alpha)$ is correctly specified. (2) If the true missingness mechanism is a parametric logistic model $\mathrm{logit}\{\pi(x, Y;\alpha^*)\} = F(x_i;\alpha^*) + q(Y_i)$, where $F(\cdot)$ is a smooth function with an unknown parameter vector $\alpha^*$, and $q(\cdot)$ is an arbitrary user-specified function that measures the deviation from the ignorable missing-data mechanism assumption, then the AIPW moment functions $\varphi_2(x_i, Y_i;\theta, \alpha)$ still have mean zero, even if the model for $F(x_i;\alpha_0)$ is misspecified.

3.3. Dimension Reduction

In many practical applications, the covariate dimension can be large, making it challenging to obtain an appropriate estimator for $m_{\varphi 0}(x_i;\theta, \alpha)$ using kernel-smoothing imputation methods. To address this issue, let $S$ be a continuous function from $\mathbb{R}^{d_x}$ to $\mathbb{R}$ such that $E\{\varphi(x_i, Y_i;\theta)\mid S_i, \delta_i = 0\} = E\{\varphi(x_i, Y_i;\theta)\mid x_i, \delta_i = 0\}$ with $S_i = S(x_i)$. Under this assumption, we have
$$E\bigg[\frac{\delta_i}{\pi(x_i, Y_i;\alpha)}\,\varphi(x_i, Y_i;\theta) + \bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha)}\bigg\}\,m_{\varphi 0}(S_i;\theta, \alpha)\bigg] = 0,$$
where $m_{\varphi 0}(S_i;\theta, \alpha) = E\{\delta_i\,\varphi(x_i, Y_i;\theta)\,O(x_i, Y_i;\alpha)\mid S_i\}/E\{\delta_i\,O(x_i, Y_i;\alpha)\mid S_i\}$. Consequently, the kernel-assisted estimating equations can be constructed as
$$\hat\varphi^*(x_i, Y_i;\theta, \alpha) = \frac{\delta_i}{\pi(x_i, Y_i;\alpha)}\,\varphi(x_i, Y_i;\theta) + \bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha)}\bigg\}\,\hat m_{\varphi 0}(S_i;\theta, \alpha),$$
where $\hat m_{\varphi 0}(S_i;\theta, \alpha)$ is structurally identical to $\hat m_{\varphi 0}(x_i;\theta, \alpha)$ except that $x$ is replaced by $S$. Given $\hat\alpha_p$, one can obtain a semiparametric dimension-reduction EL estimator $\hat\theta^*$ based on $\hat\varphi^*(x_i, Y_i;\theta, \alpha)$. It is common to assume that the working index $S = S(x, \gamma^*)$ involves an unknown parameter vector $\gamma^*$. If an estimator $\hat\gamma^*$ is available, following the arguments of Hu et al. [29], we can show that the resulting EL estimator based on $\hat\varphi^*(x_i, Y_i;\theta, \alpha)$ is asymptotically equivalent to $\hat\theta_2$ when $\hat\gamma^* - \gamma^* = O_p(n^{-1/2})$.

3.4. Asymptotic Variance Estimation

In order to construct confidence regions for the proposed estimators, it is essential to estimate their asymptotic variances $\Sigma_1$ and $\Sigma_2$ consistently from the sample $\{(x_i, Y_i, \delta_i): i = 1, \ldots, n\}$. To achieve this, we employ the plug-in method in the simulation studies. For instance, the sample-based estimate $\hat V_1$ of $V_1$ is
$$\hat V_1 = \frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi^2(x_i, Y_i;\hat\alpha_p)}\,\varpi^{-4}(x_i)\,\dot f_i(\hat\theta_1)^{\otimes 2}\,\hat\varepsilon_i^2.$$
Other estimates for Γ , V 2 , B 1 , B 2 , Ω 1 and Ω 2 can be obtained in a similar manner. We omit the details here for brevity.
While the plug-in method is effective in NMAR settings, it can be computationally intensive due to the complexity of the asymptotic variances involved. As an alternative, particularly when dealing with large datasets, the bootstrap procedure provides an effective approach for approximating these asymptotic variances. This method, which has been explored in studies such as those of Zhao et al. [30] and Jiang et al. [31] for NMAR data, can help alleviate computational challenges and provide more practical estimations in large-scale applications.
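A minimal sketch of such a bootstrap variance approximation is given below; the `estimate` callable stands in for the full IPW or AIPW estimation pipeline and is an assumption, not the authors' code.

```python
# Sketch: nonparametric bootstrap approximation of the variance of an
# estimator theta_hat; `estimate` is a hypothetical stand-in for the full
# penalized-EL estimation pipeline applied to a resampled dataset.
import numpy as np

def bootstrap_variance(data, estimate, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    # resample rows with replacement and re-estimate B times
    reps = np.array([estimate(data[rng.integers(0, n, n)]) for _ in range(B)])
    return np.cov(reps, rowvar=False)
```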

4. Simulation Study

Finite-sample performance of the proposed methods is evaluated through Monte Carlo experiments. For bandwidth selection, we implement the data-driven approaches of Zhou et al. [32] and Tang et al. [23], adopting the rule-of-thumb bandwidth $h_n = \hat\sigma_X n^{-1/3}$, where $\hat\sigma_X$ denotes the sample standard deviation of the observed covariate $X$. This practical bandwidth selector balances asymptotic optimality and computational simplicity.

4.1. Simulation 1

Let $f(x_i;\theta) = \exp(X_{i1}\theta_1 + X_{i2}\theta_2)$ and $\varpi(x_i) = \exp(X_{i1}X_{i2})$. The true parameter is set as $\theta_0 = (0.8, 1)^\top$, and the error term $\varepsilon_i \sim N(0, 0.5^2)$. The covariates are generated under two scenarios: in Model A, $X_{i1}$ and $X_{i2}$ are independently sampled from $U(0, 1)$; in Model B, $X_{i1} \sim U(0, 1)$ while $X_{i2} = X_{i1} + \varepsilon_i^*$ with $\varepsilon_i^* \sim U(-1, 1)$, allowing us to examine the impact of covariate dependence. We implement a sample size of $n = 150$, with response variables generated following the specification in model (1).
The missing indicator follows the nonignorable mechanism
$$\delta_i \sim \mathrm{Bernoulli}\{\pi(x_i, Y_i;\alpha)\}, \qquad \pi(x_i, Y_i;\alpha) = \frac{\exp(\alpha_0 + \alpha_y Y_i)}{1 + \exp(\alpha_0 + \alpha_y Y_i)},$$
where $\alpha = (0.05, 0.05)^\top$. The covariates $X_{i1}$ and $X_{i2}$ serve as instrumental variables. The parameter $\alpha$ is estimated using the penalized semiparametric likelihood method, incorporating the following auxiliary information matrix:
$$\Delta(x_i, Y_i;\alpha) = \begin{pmatrix}\delta_i\,\pi^{-1}(x_i, Y_i;\alpha)(X_{i1} - \bar X_1)\\[2pt] \delta_i\,\pi^{-1}(x_i, Y_i;\alpha)(X_{i2} - \bar X_2)\end{pmatrix},$$
where $\bar X_1 = n^{-1}\sum_{i=1}^{n}X_{i1}$ and $\bar X_2 = n^{-1}\sum_{i=1}^{n}X_{i2}$. We adopt the product Gaussian kernel $K(u_1, u_2) = \frac{1}{2\pi}e^{-(u_1^2 + u_2^2)/2}$ and set the bandwidth as $h = \hat\sigma_{X_1}n^{-1/3}$ following Tang et al. [23].
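For concreteness, the data-generating step of the Simulation 1 design (Model A) can be reproduced along the following lines; this is a sketch of the data generation only, and the subsequent penalized estimation is omitted.

```python
# Sketch: generating one Monte Carlo sample under Simulation 1, Model A:
# f(x; theta) = exp(X1*theta1 + X2*theta2), varpi(x) = exp(X1*X2),
# logistic nonignorable missingness with alpha = (0.05, 0.05).
import numpy as np

rng = np.random.default_rng(2025)
n = 150
X1, X2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
theta = np.array([0.8, 1.0])
eps = rng.normal(0.0, 0.5, n)
Y = np.exp(X1 * theta[0] + X2 * theta[1]) + np.exp(X1 * X2) * eps

alpha0, alpha_y = 0.05, 0.05
pi = 1.0 / (1.0 + np.exp(-(alpha0 + alpha_y * Y)))   # response probability
delta = rng.binomial(1, pi)                          # missingness indicator
h = np.std(X1, ddof=1) * n ** (-1 / 3)               # rule-of-thumb bandwidth
```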
Based on 500 independent replications, the proposed penalized method achieves an average IV selection rate of 92.8% for $X_1$ and $X_2$, demonstrating its effectiveness. For Model A and Model B, the 95% confidence regions for the parameter $\theta$ and their coverage probabilities are calculated based on the EL methods EL($\hat\theta_1$) and EL($\hat\theta_2$), as well as the normal approximation approaches NA($\hat\theta_1$) and NA($\hat\theta_2$). The simulation results for the confidence regions are displayed in Figure 1.
The left panel of Figure 1 presents the simulated confidence regions for Model A based on the four aforementioned approaches, whereas the right panel displays the corresponding results for Model B. Several notable findings emerge from Figure 1. First, the confidence regions constructed using EL approaches are smaller than those based on NA methods, indicating the superior efficiency of EL-based inference. Second, the EL-based confidence region for θ ^ 2 is smaller than that for θ ^ 1 , highlighting the enhanced efficiency of the AIPW estimator relative to the IPW estimator. Third, even in the presence of correlation between covariates in Model B, the EL and NA approaches yield confidence regions similar to those in Model A, implying the stability of these methods. The coverage probabilities for all four approaches are comparable in models A and B, closely aligning with the nominal 95% level. Overall, the EL-based approaches demonstrate superior efficiency relative to the NA-based methods, and the AIPW-based estimation is shown to be more efficient than the IPW-based estimation.

4.2. Simulation 2

We consider the regression model with nonlinear components
$$Y_i = \theta_0 + \sum_{k=1}^{4}\theta_k X_{ik} + \exp(\theta_5 X_{i5}) + 0.5\,\varepsilon_i, \quad i = 1, \ldots, n,$$
where the true parameter vector is $\theta_0 = (2.5, 0.5, 1.5, 0.5, 1, 0.5)^\top$. The covariates $x_i = (X_{i1}, \ldots, X_{i5})^\top$ follow the multivariate normal distribution $N_5(0, \Sigma)$ with covariance matrix $\Sigma = (0.5^{|i-j|})_{5\times 5}$. The error terms $\varepsilon_i$ are independently distributed as $N(0, 1)$.
The nonresponse mechanism follows the nonignorable logistic model
$$\pi(x_i, Y_i;\alpha) = \bigg\{1 + \exp\bigg(-\alpha_0 - \sum_{k=1}^{5}\alpha_k X_{ik} - \alpha_y Y_i\bigg)\bigg\}^{-1},$$
where $\alpha = (0.5, 0, 0, 0.8, 0, 0, 0.5)^\top$. The IV $z = (X_1, X_2, X_4, X_5)^\top$ is identified through the proposed penalized estimation process. To address high-dimensional challenges, we employ the MAR-based propensity score estimate
$$\tilde s(x, \hat\gamma^*) = \{1 + \exp(-\hat\gamma_0^* - \hat\gamma_1^{*\top}x)\}^{-1},$$
where $\hat\gamma^*$ denotes the maximum likelihood estimate. This enables the construction of a semiparametric estimator
$$\hat m_{\varphi 0}(\tilde s, \alpha, \theta) = \frac{\sum_{i=1}^{n}\delta_i\,O(x_i, Y_i;\alpha)\,F_h(\tilde s - \tilde s_i)\,\varphi(x_i, Y_i;\theta)}{\sum_{i=1}^{n}\delta_i\,O(x_i, Y_i;\alpha)\,F_h(\tilde s - \tilde s_i)},$$
where $F_h(\cdot)$ represents a univariate kernel density function. For each data-generating mechanism, we generate 500 Monte Carlo random samples of sizes 150 and 250.
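The MAR-based working propensity score $\tilde s(x, \hat\gamma^*)$ can be fitted by ordinary logistic maximum likelihood; a numpy-only Newton–Raphson sketch is shown below, with the fitted index then available as the univariate smoothing variable. This is a sketch under the stated assumptions, not the authors' implementation.

```python
# Sketch: logistic MLE for the working propensity score by Newton-Raphson,
# regressing the missingness indicator delta on the covariates X.
import numpy as np

def logistic_mle(X, delta, n_iter=25):
    X1 = np.column_stack([np.ones(len(X)), X])     # add intercept column
    gamma = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ gamma))
        W = p * (1.0 - p)                          # IRLS weights
        gamma += np.linalg.solve(X1.T * W @ X1, X1.T @ (delta - p))
    return gamma

# fitted univariate index: s_tilde = 1 / (1 + exp(-[1, x] @ gamma_hat))
```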
Table 1 summarizes the finite-sample performance of the proposed method, presenting three key metrics for nonzero components in α : empirical bias (Bias), root mean square (RMS) error, and standard deviation (SD). The variable selection outcomes are quantified through two measures: “T” (mean count of correctly excluded irrelevant variables) and “F” (mean count of erroneously excluded significant variables). Table 2 compares the estimation accuracy of regression coefficients between IPW and AIPW approaches, reporting their respective bias, RMS, and SD values.
The principal findings emerge as follows:
(1) Variable Selection Efficacy: The penalized semiparametric likelihood method demonstrates robust variable selection capability in the nonresponse mechanism model, effectively distinguishing between relevant and irrelevant covariates.
(2) Estimation Precision: For active components in α , the observed minimal bias with closely matched SD and RMS values confirms the estimator’s statistical efficiency.
(3) MELE Performance Consistency: For both $\hat\theta_1$ and $\hat\theta_2$, the SD and RMS are nearly identical, suggesting that the proposed penalized semiparametric likelihood approach yields stable MELEs.
(4) Sample Size Effects: Enhanced estimation efficiency emerges with larger samples for both the missingness data model and regression model.

5. Application to the ACTG 175 Data

We demonstrate the proposed methodology using data from the AIDS Clinical Trials Group Protocol 175 (ACTG 175) involving 2139 HIV-infected participants (Hammer et al. [33]). Following the established analytical approaches of Davidian et al. [34], Tsiatis et al. [35], and Han [8], we classify treatments into two groups: zidovudine (ZDV) monotherapy (532 subjects) versus combined therapies (1607 subjects). The analysis focuses on CD4 counts at 96 ± 5 weeks post-baseline ($Y = \mathrm{CD4}_{96}$) as the primary endpoint, with the following covariates:
  • Treatment assignment ($X_1$: 0 = ZDV monotherapy)
  • Baseline CD4 count ($X_2$: $\mathrm{CD4}_{0}$)
  • Demographic covariates: age ($X_3$), weight ($X_4$), race ($X_5$: 0 = White), gender ($X_6$: 0 = Female)
  • Clinical covariates: antiretroviral history ($X_7$: 0 = naive), early treatment termination ($X_8$: 0 = completed)
The binary indicator variable $r$ encodes the missingness status of the response $Y$, where $r_i = 1$ indicates an observed outcome, and $r_i = 0$ denotes a missing value. Previous studies of Davidian et al. [34], Tsiatis et al. [35], and Han [8] assumed that the missingness mechanism depends solely on covariates through a MAR framework. Our penalized semiparametric likelihood approach enhances robustness by incorporating shrinkage estimation within the nonresponse mechanism model. Specifically, shrinkage of the response variable coefficient toward zero provides formal evidence supporting the MAR assumption, while a nonzero estimate suggests NMAR.
To facilitate direct comparison with Han [8], we specialize the general model (1) to a linear regression framework
$$Y = \theta_1 + \sum_{l=2}^{9}\theta_l X_{l-1} + \varepsilon, \qquad E(\varepsilon\mid X) = 0,$$
where $X_1$–$X_8$ represent the baseline covariates defined previously. The nonresponse mechanism is parameterized via logistic regression
$$P(r = 1\mid X, Y) = \frac{\exp\big(\alpha_0 + \sum_{k=1}^{8}\alpha_k X_k + \alpha_9 Y\big)}{1 + \exp\big(\alpha_0 + \sum_{k=1}^{8}\alpha_k X_k + \alpha_9 Y\big)},$$
with parameter vector $\alpha = (\alpha_0, \ldots, \alpha_9)^\top$. To address dimensionality challenges, we implement the regularization strategy detailed in Section 4.2, constructing consistent estimators for $m_{\varphi 0}(x;\theta, \alpha)$ through MAR-based nonresponse mechanism weighting.
The penalized semiparametric likelihood estimates are presented in Table 3, with p-values calculated using 200 bootstrap replications (Efron and Tibshirani [36]). Age ($\alpha_3$) and weight ($\alpha_4$) show nonsignificant contributions to the nonresponse mechanism, as their coefficients are shrunk to zero with p-values exceeding 0.1. The significant coefficient for CD4 counts at 96 ± 5 weeks ($\alpha_9$) indicates NMAR in this dataset.
Table 4 presents the analysis results for model (6), with standard errors estimated through 200 bootstrap replications. The comparative results from the complete-case analysis and Han’s multiply robust method, as described by Han [8], are also included. The nonsignificant predictors include age, weight, and gender. The analysis reveals five critical clinical insights:
(1) Treatment Superiority: Combination antiretroviral therapies (Trt = 1) demonstrate significantly higher CD4 counts at 96 ± 5 weeks compared to ZDV monotherapy, establishing the enhanced therapeutic effectiveness of newer regimens.
(2) Baseline Predictive Power: Baseline CD4 counts (CD40) show significant positive association with follow-up counts, confirming their prognostic value in HIV management.
(3) Racial Disparity: White patients maintain clinically significant CD4 count advantage over nonwhite counterparts, suggesting differential disease progression trajectories.
(4) Treatment History Impact: Antiretroviral-experienced patients exhibit substantially reduced CD4 counts compared to naive patients, indicating potential cumulative treatment effects.
(5) Adherence Consequences: Early treatment discontinuation is associated with a marked CD4 count reduction, underscoring the critical importance of sustained therapeutic engagement.

6. Conclusions and Future Work

We developed a penalized semiparametric likelihood approach that resolves the identification challenges in nonignorable missing data analysis. The proposed estimator achieves the oracle properties under appropriate tuning parameter selection as established in our theoretical framework. The construction of profile EL ratio functions incorporated IPW and AIPW estimating equations. Our analysis demonstrated that when using consistently estimated nonresponse mechanism parameters, the ELLRFs follow an asymptotic weighted χ 2 distribution. Furthermore, we systematically established the asymptotic normality of regression parameter estimators. Simulation studies and real-data applications confirmed the method’s practical effectiveness in both parameter estimation and variable selection. Comparative analyses revealed superior performance over existing approaches in handling nonignorable nonresponse data.
In practical applications, nonlinear regression models often involve high-dimensional covariates, which can lead to sparsity within the model. The direct application of the proposed estimation procedure in such contexts may lead to biased estimates. One potential approach to address this challenge is the application of the penalized EL method, as studied by Ren and Zhang [37], for model selection. It could effectively balance the model complexity and goodness of fit, thereby reducing the bias induced by high-dimensional covariates. This extension requires a systematic and separate investigation within the NMAR framework. A detailed exploration of this important issue will be undertaken in future research.

Author Contributions

Conceptualization, X.D. and X.L.; methodology, X.D. and X.L.; validation, X.D.; formal analysis, X.L.; investigation, X.D.; writing—original draft, X.D.; writing—review and editing, X.D. and X.L.; supervision, X.L.; project administration, X.D.; funding acquisition, X.D. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Numbers 12426666 and 12426668), Zhongwu Young Teachers Program for the Innovative Talents of Jiangsu University of Technology, and Doctoral Research Project of Yuncheng University (YQ-2023074).

Data Availability Statement

The real data that are used to illustrate the proposed methods are available at https://github.com/dingxianwen-dxw/ACTG175 (accessed on 22 March 2025).

Acknowledgments

The authors wish to thank the Editor-in-Chief, the Associate Editor and two reviewers for their many helpful and insightful comments and suggestions that greatly improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To establish the proofs for Theorem 1, we first introduce some essential notations and supporting lemmas.
The log-likelihood $\ell(\alpha, W)$ (after profiling out the $\phi_j$'s) can be rewritten as
$$\ell(\alpha, W, v) = \ell_1(\alpha, W, v) + \ell_2(W),$$
where $v = (v_1^\top, v_2)^\top$ with $v_1 = \lambda_1$ and $v_2 = \lambda_2 - 1/W$. The components of the log-likelihood function are defined as follows:
$$\ell_1(\alpha, W, v) = \sum_{i=1}^{n_1}\log\{1 + v^\top h(x_i, Y_i;\alpha, W)\},$$
$$\ell_2(W) = n_1\log W + (n - n_1)\log(1 - W),$$
where the function $h(x_i, Y_i;\alpha, W)$ is given by
$$h(x_i, Y_i;\alpha, W) = \bigg(\frac{W\,\Delta^\top(x_i)}{\pi(x_i, Y_i;\alpha)},\; \frac{W - \pi(x_i, Y_i;\alpha)}{W\,\pi(x_i, Y_i;\alpha)}\bigg)^{\!\top}.$$
Following the arguments presented by Qin et al. [21], it can be shown that
$$\sum_{i=1}^{n_1}\frac{h(x_i, Y_i;\alpha, W)}{1 + v^\top h(x_i, Y_i;\alpha, W)} = 0.$$
It is worthwhile to note that $E_c\{h(x_i, Y_i;\alpha, W)\} = 0$.
Lemma A1. 
Assume that $\sup_{\xi\in\Theta}\|h(x_i, Y_i;\xi)\| \le \mathcal{W}(x_i, Y_i)$, where $E\{\mathcal{W}^{\kappa}(x_i, Y_i)\} < \infty$ for some constant $\kappa > 2$. Then, for any $1/\kappa < \varsigma < 1/2$ and $\Lambda_{n_1} = \{v: \|v\| \le n_1^{-\varsigma}\}$, we have
$$\sup_{\xi\in\Theta,\; v\in\Lambda_{n_1},\; 1\le i\le n_1}|v^\top h(x_i, Y_i;\xi)| = o_p(1).$$
Proof of Lemma A1. 
By Markov's inequality, we obtain
$$\max_{1\le i\le n_1}\sup_{\xi\in\Theta}\|h(x_i, Y_i;\xi)\| = O_p(n_1^{1/\kappa}).$$
Then, applying the Cauchy–Schwarz inequality, we have
$$\sup_{\xi\in\Theta,\; v\in\Lambda_{n_1},\; 1\le i\le n_1}|v^\top h(x_i, Y_i;\xi)| \le n_1^{-\varsigma}\,O_p(n_1^{1/\kappa}) = O_p(n_1^{1/\kappa - \varsigma}) = o_p(1).$$
This completes the proof. □
Lemma A2. 
Under condition A(1), let $\xi = (\alpha^\top, W)^\top$ denote the parameter vector, and let its true value be $\xi_0 = (\alpha_0^\top, W_0)^\top$. Define $H(\xi) = E\{h(x_i, Y_i;\xi)\,h^\top(x_i, Y_i;\xi)\}$ and assume $H(\xi)$ is a positive definite matrix. Then, for all $\xi \in \{\xi: \|\xi - \xi_0\| = O_p(n_1^{-1/2})\}$, we have
$$(1)\;\; v(\xi) = H^{-1}(\xi)\,\frac{1}{n_1}\sum_{i=1}^{n_1}h(x_i, Y_i;\xi) + o_p(n_1^{-1/2}); \qquad (2)\;\; v(\hat\xi) = O_p(n_1^{-1/2}).$$
Proof of Lemma A2. 
We begin by considering the first part of Lemma A2. By Lemma A1, applying the Taylor series expansion to Equation (A1) yields $\sum_{i=1}^{n_1}\{1 - v^\top h(x_i, Y_i;\xi)(1 + o_p(1))\}\,h(x_i, Y_i;\xi) = 0$, which establishes the desired result. The second part of Lemma A2 follows directly from Owen [13]. □
Proof of Theorem 1. 
We begin by considering the first part of Theorem 1. By noting that $\xi = (\alpha^\top, W)^\top$, we have $\ell_p(\alpha, W) = \ell_p(\xi, v) = \ell_1(\xi, v) + \ell_2(W) - n_1\sum_{j=1}^{d}g_\gamma(|\alpha_j|)$. Let $\varrho_{n_1} = n_1^{-1/2}$. Following the arguments of Fan and Li [20], it is necessary to show that for any given $\epsilon > 0$, there exists a sufficiently large constant $C$ such that
$$P\Big[\sup_{\|b\| = C}\{\ell_p(\xi_0 + \varrho_{n_1}b) - \ell_p(\xi_0)\} < -\epsilon\Big] \ge 1 - \epsilon.$$
This result implies the existence of a local maximizer $\hat\xi$ in the ball $\{\xi_0 + \varrho_{n_1}b: \|b\| \le C\}$.
From the condition $g_\gamma(0) = 0$, we have
$$\ell_p(\xi_0 + \varrho_{n_1}b, v) - \ell_p(\xi_0, v) \le \ell_1(\xi_0 + \varrho_{n_1}b, v) - \ell_1(\xi_0, v) + \ell_2(W_0 + \varrho_{n_1}b) - \ell_2(W_0) - n_1\sum_{j=1}^{k}\{g_\gamma(|\alpha_{0j} + \varrho_{n_1}b_j|) - g_\gamma(|\alpha_{0j}|)\} := I_1 + I_2 + I_3,$$
where $k$ is the number of components of $\alpha_{10}$. Taking the Taylor series expansion of $\ell_1(\xi, v)$ around $\xi_0$ yields
$$\ell_1(\xi, v) = \sum_{i=1}^{n_1}\log\{1 + v^\top(\xi)h(x_i, Y_i;\xi)\} = \sum_{i=1}^{n_1}v^\top(\xi)h(x_i, Y_i;\xi)\{1 + o_p(1)\}.$$
Following the results of Lemma A2, we have
$$\begin{aligned}
I_1 &= -n_1\bar h^\top(\xi)H^{-1}(\xi)\bar h(\xi) + n_1\bar h^\top(\xi_0)H^{-1}(\xi_0)\bar h(\xi_0)\{1 + o_p(1)\}\\
&= -n_1 F^\top\big[H^{-1}(\xi_0)\{1 + o_p(1)\}\big]F + n_1\bar h^\top(\xi_0)H^{-1}(\xi_0)\bar h(\xi_0)\\
&= -2n_1\varrho_{n_1}\bar h^\top(\xi_0)H^{-1}(\xi_0)U(\xi_0)b\{1 + o_p(1)\} - n_1 b^\top U^\top(\xi_0)H^{-1}(\xi_0)U(\xi_0)b\,\varrho_{n_1}^2\{1 + o_p(1)\} := I_{11} + I_{12},
\end{aligned}$$
where $F = \bar h(\xi_0) + U(\xi_0)\{1 + o_p(1)\}\varrho_{n_1}b$, $\bar h(\xi) = n_1^{-1}\sum_{i=1}^{n_1}h(x_i, Y_i;\xi)$, and $U(\xi_0) = E\{\partial h(x, Y;\xi_0)/\partial\alpha\}$. Because $\ell_2(W)$ is the log binomial likelihood, we have
$$I_2 = \ell_2(W_0 + \varrho_{n_1}b) - \ell_2(W_0) < 0.$$
It follows from the Taylor expansion that
$$I_3 = -\sum_{j=1}^{k}n_1\varrho_{n_1}g'_\gamma(|\alpha_{0j}|)\,\mathrm{sign}(\alpha_{0j})\,b_j - \sum_{j=1}^{k}n_1\varrho_{n_1}^2 g''_\gamma(|\alpha_{0j}|)\,b_j^2\{1 + o_p(1)\} := I_{31} + I_{32}.$$
Note that
$$|I_{31}| \le \sum_{j=1}^{k}\big|n_1\varrho_{n_1}g'_\gamma(|\alpha_{0j}|)\,\mathrm{sign}(\alpha_{0j})\,b_j\big| \le \sqrt{k}\,n_1\varrho_{n_1}\max_{1\le j\le k}g'_\gamma(|\alpha_{0j}|)\,\|b\| \le n_1\varrho_{n_1}^2\|b\|,$$
$$|I_{32}| \le \sum_{j=1}^{k}\big|n_1\varrho_{n_1}^2 g''_\gamma(|\alpha_{0j}|)\,b_j^2\{1 + o_p(1)\}\big| \le \max\big\{|g''_\gamma(|\alpha_{0j}|)|: \alpha_{0j}\neq 0\big\}\,n_1\varrho_{n_1}^2\|b\|^2.$$
When $\|b\|$ is chosen to be large enough, $I_{12}$ dominates the other terms $I_{11}$, $I_{31}$, and $I_{32}$; taking into account the negative term $I_2$, we conclude that $\ell_p(\xi_0 + \varrho_{n_1}b, v) - \ell_p(\xi_0, v)$ is negative with probability tending to one. Thus, the first part of Theorem 1 holds.
Now, we proceed to prove Theorem 1 (ii).
By Lemma A1, we have $v^\top(\xi)h(x_i, Y_i;\xi) = o_p(1)$. Taking the Taylor series expansion of the first partial derivative of $\ell_p(\xi, v)$ with respect to $\alpha_j$ $(j\notin\mathcal{B})$ yields
$$\begin{aligned}
\frac{1}{n_1}\frac{\partial\ell_p(\xi)}{\partial\alpha_j} &= \frac{1}{n_1}\sum_{i=1}^{n_1}v^\top(\xi)\frac{\partial h(x_i, Y_i;\xi)}{\partial\alpha_j}\{1 + o_p(1)\} - g'_\gamma(|\alpha_j|)\,\mathrm{sign}(\alpha_j)\\
&= \frac{1}{n_1}\sum_{i=1}^{n_1}v^\top(\xi)\bigg\{\frac{\partial h(x_i, Y_i;\xi_0)}{\partial\alpha_j} + \frac{\partial^2 h(x_i, Y_i;\xi_0)}{\partial\alpha_j\,\partial\alpha^\top}(\alpha - \alpha_0)\bigg\}\{1 + o_p(1)\} - g'_\gamma(|\alpha_j|)\,\mathrm{sign}(\alpha_j)\\
&:= T_{1j} + T_{2j} + T_{3j} + o_p(n_1^{-1/2}).
\end{aligned}$$
Let $U_j(\alpha)$ denote the $j$th column vector of the matrix $U(\alpha) = E\{\partial h(x, Y;\alpha, W)/\partial\alpha\}$. It follows from Assumptions A(7)–A(9) and Lemma A2 that
$$\max_{j\notin\mathcal{B}}(|T_{1j}|) \le \max_{j\notin\mathcal{B}}|v^\top U_j(\alpha_0)| + \|v\|\,\bigg\|\frac{1}{n_1}\sum_{i=1}^{n_1}\frac{\partial h(x_i, Y_i;\alpha_0, W)}{\partial\alpha_j} - U_j(\alpha_0)\bigg\| = O_p(n_1^{-1/2}).$$
Furthermore, by Assumption A(9), we have
$$\max_{j\notin\mathcal{B}}(|T_{2j}|) \le C\,\frac{1}{n_1}\sum_{l=1}^{k}\|v\|\,\bigg\|\frac{\partial^2 h(x_i, Y_i;\xi_0)}{\partial\alpha_j\,\partial\alpha_l}\bigg\| = O_p\Big(\frac{1}{n_1}\Big) = o_p\Big(\frac{1}{\sqrt{n_1}}\Big).$$
So we obtain $n_1^{-1}\,\partial\ell_p(\xi)/\partial\alpha_j = \gamma\{-g'_\gamma(|\alpha_j|)\,\mathrm{sign}(\alpha_j)/\gamma + O_p(1/(\sqrt{n}\gamma))\}$, which implies that the sign of $n_1^{-1}\,\partial\ell_p(\xi)/\partial\alpha_j$ is dominated by the sign of $\alpha_j$. Thus, for any $j\notin\mathcal{B}$ and $n\to\infty$, we have $n_1^{-1}\,\partial\ell_p(\xi)/\partial\alpha_j < 0$ when $\alpha_j \in (0, C/\sqrt{n})$ and $n_1^{-1}\,\partial\ell_p(\xi)/\partial\alpha_j > 0$ when $\alpha_j \in (-C/\sqrt{n}, 0)$ with probability tending to one. This result implies that $\hat\alpha_z = 0$ with probability tending to one. Therefore, Theorem 1 (ii) holds.
Now we proceed to prove Theorem 1 (iii).
For simplicity of notation, we temporarily denote $h_i = h(x_i, Y_i;\xi)$. Let $I_d = (H_1^\top, H_2^\top)^\top$ denote the $d\times d$ identity matrix, where $H_1 \in \mathbb{R}^{|\mathcal{B}|\times d}$ and $H_2 \in \mathbb{R}^{(d - |\mathcal{B}|)\times d}$ with $|\mathcal{B}|$ being the cardinality of $\mathcal{B}$. Let
$$S(v, \alpha, \tau) = \frac{1}{n_1}\bigg\{\sum_{i=1}^{n_1}\log(1 + v^\top h_i) + n_1\log W + (n - n_1)\log(1 - W)\bigg\} - \sum_{j=1}^{d}g_\gamma(|\alpha_j|) - \tau^\top H_2\alpha,$$
where $\tau \in \mathbb{R}^{d - |\mathcal{B}|}$ is another Lagrange multiplier vector. The penalized likelihood $\ell_p(v, \alpha, \tau)$ can be rewritten as $\ell_p(v, \alpha, \tau) = n_1 S(v, \alpha, \tau)$. Let
$$S_1(v, \alpha, \tau) = \frac{\partial S(v, \alpha, \tau)}{\partial v} = \frac{1}{n_1}\sum_{i=1}^{n_1}\frac{h_i}{1 + v^\top h_i} = \sum_{i=1}^{n}\pi_i^* h_i,$$
$$S_2(v, \alpha, \tau) = \frac{\partial S(v, \alpha, \tau)}{\partial\alpha} = \sum_{i=1}^{n}\pi_i^*(\partial_\alpha h_i)^\top v - w(\alpha) - H_2^\top\tau,$$
$$S_3(v, \alpha, \tau) = \frac{\partial S(v, \alpha, \tau)}{\partial\tau} = -H_2\alpha,$$
where $w(\alpha) = (g'_\gamma(|\alpha_1|)\,\mathrm{sign}(\alpha_1), \ldots, g'_\gamma(|\alpha_k|)\,\mathrm{sign}(\alpha_k), 0, \ldots, 0)^\top$, $\pi_i^* = \{n_1(1 + v^\top h_i)\}^{-1}$, and $\partial_\alpha h_i = \partial h_i/\partial\alpha$. Thus, $\hat v$, $\hat\alpha$ and $\hat\tau$ satisfy $S_t(\hat v, \hat\alpha, \hat\tau) = 0$ for $t = 1, 2, 3$. Letting $\hat H(\alpha_0) = n_1^{-1}\sum_{i=1}^{n_1}h(x_i, Y_i;\alpha_0, W)\,h^\top(x_i, Y_i;\alpha_0, W)$ and $\hat U(\alpha_0) = n_1^{-1}\sum_{i=1}^{n_1}\partial_\alpha h(x_i, Y_i;\alpha_0, W)$, we obtain
$$S_{11}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial v\,\partial v^\top} = -\hat H(\alpha_0), \quad S_{12}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial v\,\partial\alpha^\top} = \hat U(\alpha_0), \quad S_{13}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial v\,\partial\tau^\top} = 0,$$
$$S_{21}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial\alpha\,\partial v^\top} = \hat U^\top(\alpha_0), \quad S_{22}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial\alpha\,\partial\alpha^\top} = 0, \quad S_{23}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial\alpha\,\partial\tau^\top} = -H_2^\top,$$
$$S_{31}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial\tau\,\partial v^\top} = 0, \quad S_{32}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial\tau\,\partial\alpha^\top} = -H_2, \quad S_{33}(0, \alpha_0, 0) = \frac{\partial^2 S(0, \alpha_0, 0)}{\partial\tau\,\partial\tau^\top} = 0.$$
Let $H = E\{\hat H(\alpha_0)\}$ and $U = E\{\hat U(\alpha_0)\}$. Taking the Taylor expansion of $S_t(\hat v, \hat\alpha, \hat\tau) = 0$ $(t = 1, 2, 3)$ at $(0, \alpha_0, 0)$ yields
$$-\begin{pmatrix}S_1(0, \alpha_0, 0)\\ 0\\ 0\end{pmatrix} = \begin{pmatrix}-H & U & 0\\ U^\top & 0 & -H_2^\top\\ 0 & -H_2 & 0\end{pmatrix}\begin{pmatrix}\hat v - 0\\ \hat\alpha - \alpha_0\\ \hat\tau - 0\end{pmatrix} + o_p(n^{-1/2}).$$
Define the matrix $Q$ as follows:
$$Q = \begin{pmatrix}Q_{11} & Q_{12}\\ Q_{21} & Q_{22}\end{pmatrix},$$
where $Q_{11} = -H$, $Q_{12} = (U, 0)$, $Q_{21} = Q_{12}^\top$, and
$$Q_{22} = \begin{pmatrix}0 & -H_2^\top\\ -H_2 & 0\end{pmatrix}.$$
Additionally, let $\Xi = (\alpha^\top, \tau^\top)^\top$. Then, we have
$$\begin{pmatrix}\hat v - 0\\ \hat\Xi - \Xi_0\end{pmatrix} = -Q^{-1}\begin{pmatrix}S_1(0, \alpha_0, 0)\\ 0\end{pmatrix} + o_p(n^{-1/2}).$$
Let $\tilde Q = Q_{22} - Q_{21}Q_{11}^{-1}Q_{12}$. By applying the block matrix inversion formula, we obtain $\hat\Xi - \Xi_0 = \tilde Q^{-1}Q_{21}Q_{11}^{-1}S_1(0, \alpha_0, 0) + o_p(1)$. Define $H_i(\alpha_0)$ as the appropriate subvector of $\delta_i\,\tilde Q^{-1}Q_{21}Q_{11}^{-1}h(x_i, Y_i;\xi_0)$. Then, we have
$$\sqrt{n}(\hat\alpha_p - \alpha_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}H_i(\alpha_0) + o_p(1).$$
By invoking the Lindeberg–Feller central limit theorem, we conclude that $\sqrt{n}(\hat\alpha_p - \alpha_0)\xrightarrow{L}N(0, M)$, where $M = \mathrm{Var}\{H_i(\alpha_0)\}$. □
Lemma A3. 
Suppose Conditions A(1)–A(9) hold; if $\alpha$ is estimated by the penalized likelihood method, $\hat\alpha = \hat\alpha_p$, then as $n\to\infty$, we have
$$(1)\;\; \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p)\xrightarrow{L}N(0, \Omega_1), \quad \frac{1}{n}\sum_{i=1}^{n}\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p)^{\otimes 2}\xrightarrow{P}V_1, \quad \frac{1}{n}\sum_{i=1}^{n}\partial_\theta\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p)\xrightarrow{P}-\Gamma;$$
$$(2)\;\; \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat\varphi_2(x_i, Y_i;\theta_0, \hat\alpha_p)\xrightarrow{L}N(0, \Omega_2), \quad \frac{1}{n}\sum_{i=1}^{n}\hat\varphi_2(x_i, Y_i;\theta_0, \hat\alpha_p)^{\otimes 2}\xrightarrow{P}V_2, \quad \frac{1}{n}\sum_{i=1}^{n}\partial_\theta\hat\varphi_2(x_i, Y_i;\theta_0, \hat\alpha_p)\xrightarrow{P}-\Gamma.$$
Proof of Lemma A3. 
We begin by proving part (1). Expanding $n^{-1/2}\sum_{i=1}^{n}\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p)$ at $\alpha = \alpha_0$ using a Taylor series gives
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varphi_1(x_i, Y_i;\theta_0, \alpha_0) + \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\hat\varphi_1(x_i, Y_i;\theta_0, \alpha_0)}{\partial\alpha^\top}\,\sqrt{n}(\hat\alpha_p - \alpha_0) + o_p(1).$$
We observe that
$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\frac{\partial\hat\varphi_1(x_i, Y_i;\theta_0, \alpha_0)}{\partial\alpha^\top} &= -\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi^2(x_i, Y_i;\alpha_0)}\,\varphi(x_i, Y_i;\theta_0)\,\frac{\partial\pi(x_i, Y_i;\alpha_0)}{\partial\alpha^\top}\\
&= -\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi(x_i, Y_i;\alpha_0)}\{1 - \pi(x_i, Y_i;\alpha_0)\}\,\varphi(x_i, Y_i;\theta_0)\,\partial\,\mathrm{logit}\{\pi(x_i, Y_i;\alpha_0)\}/\partial\alpha^\top\\
&\xrightarrow{P} -E\big[\varphi_1(x, Y;\theta_0)\{\delta - \pi(x, Y;\alpha_0)\}\,\partial\,\mathrm{logit}\{\pi(x, Y;\alpha_0)\}/\partial\alpha^\top\big]\\
&= -\mathrm{Cov}\{\varphi_1(x, Y;\theta_0), D(x, Y;\alpha_0)\} = -B_1.
\end{aligned}$$
Thus, we obtain
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\{\varphi_1(x_i, Y_i;\theta_0, \alpha_0) - B_1 H_i(\alpha_0)\} + o_p(1)\xrightarrow{L}N(0, \Omega_1).$$
By direct calculation, we obtain
$$\frac{1}{n}\sum_{i=1}^{n}\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p)^{\otimes 2} = \frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi^2(x_i, Y_i;\alpha_0)}\,\dot f_i(\theta_0)^{\otimes 2}\,\varpi^{-4}(x_i)\,\varepsilon_i^2 + o_p(1)\xrightarrow{P}V_1.$$
We note that
$$\frac{1}{n}\sum_{i=1}^{n}\partial_\theta\hat\varphi_1(x_i, Y_i;\theta_0, \hat\alpha_p) = \frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i\,\varpi^{-2}(x_i)}{\pi(x_i, Y_i;\alpha_0)}\,\ddot f(x_i;\theta_0)\{Y_i - f(x_i;\theta_0)\} - \frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i\,\varpi^{-2}(x_i)}{\pi(x_i, Y_i;\alpha_0)}\,\dot f_i(\theta_0)^{\otimes 2} + o_p(1)\xrightarrow{P}-\Gamma,$$
where $\ddot f(x_i;\theta)$ is defined in Assumption A(4). This completes the proof of Lemma A3 (1).
Now, we proceed to prove the second part of Lemma A3.
We observe that
$$\frac{1}{n}\sum_{i=1}^{n}\hat\varphi_2(x_i, Y_i;\theta_0, \hat\alpha_p) = \frac{1}{n}\sum_{i=1}^{n}\hat\varphi_2(x_i, Y_i;\theta_0, \alpha_0) + O(\theta_0, \alpha_0)(\hat\alpha_p - \alpha_0) + o_p(\|\hat\alpha_p - \alpha_0\|),$$
where $O(\theta_0, \alpha_0) = n^{-1}\sum_{i=1}^{n}\partial\hat\varphi_2(x_i, Y_i;\theta_0, \alpha_0)/\partial\alpha^\top$.
Through direct computation, we obtain
$$\begin{aligned}
O(\theta_0, \alpha_0) &= -\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi^2(x_i, Y_i;\alpha_0)}\{\varphi(x_i, Y_i;\theta_0) - m_{\varphi 0}(x_i;\theta_0, \alpha_0)\}\,\frac{\partial\pi(x_i, Y_i;\alpha_0)}{\partial\alpha^\top}\\
&\quad - \frac{1}{n}\sum_{i=1}^{n}\frac{\delta_i}{\pi^2(x_i, Y_i;\alpha_0)}\{\hat m_{\varphi 0}(x_i;\theta_0, \alpha_0) - m_{\varphi 0}(x_i;\theta_0, \alpha_0)\}\,\frac{\partial\pi(x_i, Y_i;\alpha_0)}{\partial\alpha^\top}\\
&\quad + \frac{1}{n}\sum_{i=1}^{n}\bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha_0)}\bigg\}\frac{\partial\hat m_{\varphi 0}(x_i;\theta_0, \alpha_0)}{\partial\alpha^\top} := T_{n1} + T_{n2} + T_{n3}.
\end{aligned}$$
By leveraging the consistency property of the kernel regression estimator, we establish that $T_{n2} = o_p(1)$. Define $z(x_i, Y_i;\alpha_0) = \partial\,\mathrm{logit}\{\pi(x_i, Y_i;\alpha_0)\}/\partial\alpha$. Additionally, let $m_{z0}(x;\alpha_0) = E\{z(x, Y;\alpha_0)\mid x, \delta = 0\}$ and $m_{z\varphi 0}(x;\alpha_0) = E\{z(x, Y;\alpha_0)\,\varphi(x, Y;\theta_0)\mid x, \delta = 0\}$.
Using the kernel regression method, we obtain the following estimators:
$$\hat m_{z0}(x_i;\alpha_0) = \frac{\sum_{j=1}^{n}\delta_j O(x_j, Y_j)K_h(x_j - x_i)\,z(x_j, Y_j;\alpha_0)}{\sum_{j=1}^{n}\delta_j O(x_j, Y_j)K_h(x_j - x_i)}, \qquad \hat m_{z\varphi 0}(x_i;\alpha_0) = \frac{\sum_{j=1}^{n}\delta_j O(x_j, Y_j)K_h(x_j - x_i)\,z(x_j, Y_j;\alpha_0)\,\varphi(x_j, Y_j;\theta_0)}{\sum_{j=1}^{n}\delta_j O(x_j, Y_j)K_h(x_j - x_i)}.$$
Consequently, we have
$$\frac{\partial\hat m_{\varphi 0}(x_i;\theta_0, \alpha_0)}{\partial\alpha} = \hat m_{\varphi 0}(x_i;\theta_0, \alpha_0)\,\hat m_{z0}(x_i;\alpha_0) - \hat m_{z\varphi 0}(x_i;\alpha_0).$$
Let $\Lambda_n(x_i) = \hat G(x_i) - G(x_i)$ and $z_j(\alpha_0) = z(x_j, Y_j;\alpha_0)$. For notational simplicity, we temporarily denote $\varphi_i = \varphi(x_i, Y_i;\theta_0)$, $m_{\varphi 0}(x_i) = m_{\varphi 0}(x_i;\theta_0, \alpha_0)$, and $\hat m_{\varphi 0}(x_i) = \hat m_{\varphi 0}(x_i;\theta_0, \alpha_0)$. By performing a further decomposition of $T_{n3}$, we obtain
$$\begin{aligned}
T_{n3} &= \frac{1}{n}\sum_{i=1}^{n}\bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha_0)}\bigg\}\frac{\partial\hat m_{\varphi 0}(x_i;\theta_0, \alpha_0)}{\partial\alpha}\\
&= \frac{1}{n}\sum_{i=1}^{n}\bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha_0)}\bigg\}\{\hat m_{\varphi 0}(x_i)\,\hat m_{z0}(x_i;\alpha_0) - \hat m_{z\varphi 0}(x_i;\alpha_0)\}\\
&= \frac{1}{n}\sum_{i=1}^{n}\bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha_0)}\bigg\}\{\hat m_{\varphi 0}(x_i)\,\hat m_{z0}(x_i;\alpha_0) - m_{\varphi 0}(x_i)\,m_{z0}(x_i;\alpha_0)\}\\
&\quad - \frac{1}{n}\sum_{i=1}^{n}\bigg\{1 - \frac{\delta_i}{\pi(x_i, Y_i;\alpha_0)}\bigg\}\{\hat m_{z\varphi 0}(x_i;\alpha_0) - m_{\varphi 0}(x_i)\,m_{z0}(x_i;\alpha_0)\} := T_{n31} - T_{n32}.
\end{aligned}$$
For T n 31 , we have
$$
T_{n31} = \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\{\hat m_{\varphi 0}(x_i;\theta_0,\alpha_0)\,\hat m_{z0}(x_i;\alpha_0) - m_{\varphi 0}(x_i)\,m_{z0}(x_i;\alpha_0)\}
= \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\,m_{z0}(x_i;\alpha_0)\{\hat m_{\varphi 0}(x_i;\theta_0,\alpha_0) - m_{\varphi 0}(x_i)\}
+ \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\{\hat m_{z0}(x_i;\alpha_0) - m_{z0}(x_i;\alpha_0)\}\{\hat m_{\varphi 0}(x_i;\theta_0,\alpha_0) - m_{\varphi 0}(x_i)\}
+ \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\,m_{\varphi 0}(x_i)\{\hat m_{z0}(x_i;\alpha_0) - m_{z0}(x_i;\alpha_0)\}
:= T_{n311} + T_{n312} + T_{n313}.
$$
By applying standard arguments, we obtain $T_{n31j} = o_p(n^{-1/2})$ for $j = 1,2,3$. For $T_{n32}$, we have
$$
T_{n32} = \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\{\hat m_{z\varphi 0}(x_i;\alpha_0) - m_{\varphi 0}(x_i)\,m_{z0}(x_i;\alpha_0)\}
= \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\frac{n^{-1}\sum_{j=1}^n \delta_j O_j(\alpha_0) K_h(x_j-x_i)\{z_j(\alpha_0)\,\varphi_j - m_{\varphi 0}(x_j)\,m_{z0}(x_j)\}}{G(x_i)}
+ \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\frac{n^{-1}\sum_{j=1}^n \delta_j O_j(\alpha_0) K_h(x_j-x_i)\{m_{\varphi 0}(x_j)\,m_{z0}(x_j) - m_{\varphi 0}(x_i)\,m_{z0}(x_i)\}}{G(x_i)}
+ \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\frac{\hat m_{z\varphi 0}(x_i)\hat G(x_i) - m_{z\varphi 0}(x_i) G(x_i)}{G^2(x_i)}\,\Lambda_n(x_i)
+ \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\frac{\hat m_{z\varphi 0}(x_i)}{G^2(x_i)}\,\Lambda_n^2(x_i)
+ \frac{1}{n}\sum_{i=1}^n (1-\delta_i)\frac{m_{z\varphi 0}(x_i)\,G(x_i)}{G^2(x_i)}\,\Lambda_n(x_i)
+ \frac{1}{n}\sum_{i=1}^n \Big\{1-\frac{\delta_i}{\pi(x_i,Y_i;\alpha_0)}\Big\}\frac{m_{\varphi 0}(x_i)\,m_{z0}(x_i)\,\Lambda_n(x_i)}{G(x_i)}
:= T_{n321} + T_{n322} + T_{n323} + T_{n324} + T_{n325} + T_{n326}.
$$
Standard arguments can also be employed to conclude that $T_{n32j} = o_p(n^{-1/2})$ for $j = 1,\ldots,6$. Combining the above results, we obtain $T_{n2} = o_p(1)$ and $T_{n3} = o_p(1)$.
Next, we consider $T_{n1}$. A straightforward calculation yields
$$
\frac{\partial \pi(x,Y;\alpha)}{\partial \alpha} = \pi(x,Y;\alpha)\{1-\pi(x,Y;\alpha)\}\,z(x,Y;\alpha).
$$
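This identity is easy to verify numerically. The sketch below does so for a logistic response model $\pi(x,Y;\alpha) = \mathrm{expit}(\alpha^\top w)$ with $w = (1, x, Y)^\top$, in which case $z = \partial\,\mathrm{logit}(\pi)/\partial\alpha = w$; it is purely illustrative.

```python
import numpy as np

expit = lambda t: 1.0 / (1.0 + np.exp(-t))
rng = np.random.default_rng(0)
alpha = rng.normal(size=3)
w = np.array([1.0, 0.7, -1.2])            # (1, x, Y) for a single data point
pi = lambda a: expit(a @ w)

eps = 1e-6
grad_fd = np.array([(pi(alpha + eps * e) - pi(alpha - eps * e)) / (2.0 * eps)
                    for e in np.eye(3)])  # central finite differences
grad_formula = pi(alpha) * (1.0 - pi(alpha)) * w  # pi (1 - pi) z with z = w
assert np.allclose(grad_fd, grad_formula, atol=1e-8)
```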
On the other hand,
$$
E\Big[\frac{\delta}{\pi(x,Y;\alpha_0)}\{1-\pi(x,Y;\alpha_0)\}\,z(x,Y;\alpha_0)\{\varphi(x,Y;\theta_0)-m_{\varphi 0}(x;\theta_0,\alpha_0)\}\Big]
= E\Big[\frac{\delta}{\pi(x,Y;\alpha_0)}\{\delta-\pi(x,Y;\alpha_0)\}\,z(x,Y;\alpha_0)\{\varphi(x,Y;\theta_0)-m_{\varphi 0}(x;\theta_0,\alpha_0)\}\Big]
= E\Big[\frac{\delta}{\pi(x,Y;\alpha_0)}\{\varphi(x,Y;\theta_0)-m_{\varphi 0}(x;\theta_0,\alpha_0)\}\,D(x,Y;\alpha_0)\Big]
= E\Big[\Big\{\frac{\delta}{\pi(x,Y;\alpha_0)}\{\varphi(x,Y;\theta_0)-m_{\varphi 0}(x;\theta_0,\alpha_0)\} + m_{\varphi 0}(x;\theta_0,\alpha_0)\Big\}\,D(x,Y;\alpha_0)\Big]
= \mathrm{Cov}\{\hat\varphi_2(x,Y;\theta_0,\alpha_0),\, D(x,Y;\alpha_0)\}.
$$
The third equality holds because
$$
E[\{\delta-\pi(x,Y;\alpha_0)\}\,z(x,Y;\alpha_0)\,|\,x]
= E\big[E[\{\delta-\pi(x,Y;\alpha_0)\}\,z(x,Y;\alpha_0)\,|\,x,Y]\,\big|\,x\big] = 0,
$$
which results in
$$
E[m_{\varphi 0}(x;\theta_0,\alpha_0)\,D(x,Y;\alpha_0)]
= E\big[m_{\varphi 0}(x;\theta_0,\alpha_0)\,E[\{\delta-\pi(x,Y;\alpha_0)\}\,z(x,Y;\alpha_0)\,|\,x]\big] = 0.
$$
Then, for $T_{n1}$, we have $T_{n1} = -\mathrm{Cov}\{\hat\varphi_2(x,Y;\theta_0,\alpha_0),\, D(x,Y;\alpha_0)\} + o_p(1) = -B_2 + o_p(1)$. Furthermore, we have
$$
\frac{1}{n}\sum_{i=1}^n \hat\varphi_2(x_i,Y_i;\theta_0,\hat\alpha_p)
= \frac{1}{n}\sum_{i=1}^n \hat\varphi_2(x_i,Y_i;\theta_0,\alpha_0) - B_2(\hat\alpha_p-\alpha_0) + o_p(n^{-1/2}),
$$
which is equivalent to
$$
\frac{1}{\sqrt{n}}\sum_{i=1}^n \hat\varphi_2(x_i,Y_i;\theta_0,\hat\alpha_p)
= \frac{1}{\sqrt{n}}\sum_{i=1}^n \hat\varphi_2(x_i,Y_i;\theta_0,\alpha_0) - B_2\,\sqrt{n}(\hat\alpha_p-\alpha_0) + o_p(1)
= \frac{1}{\sqrt{n}}\sum_{i=1}^n \{\hat\varphi_2(x_i,Y_i;\theta_0,\alpha_0) - B_2 H_i(\alpha_0)\} + o_p(1)
\xrightarrow{L} N(0,\Omega_2).
$$
The second and third parts of Lemma A3 (2) can be proved using arguments similar to those for the corresponding parts of Lemma A3 (1). Thus, the proof of Lemma A3 is complete. □
Proof of Theorem 2. 
We begin by considering the first part of Theorem 2. Let $\hat\theta_1$ and $\hat\nu_n$ be the solutions to the following equations:
$$
Q_{n1}(\theta,\nu_n) = \frac{1}{n}\sum_{i=1}^n \frac{\hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)}{1+\nu_n^\top \hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)} = 0,
\qquad
Q_{n2}(\theta,\nu_n) = \frac{1}{n}\sum_{i=1}^n \frac{\nu_n^\top\,\partial \hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)/\partial \theta^\top}{1+\nu_n^\top \hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)} = 0.
$$
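In practice, for fixed $\theta$ the first equation is the stationarity condition of the concave function $\nu_n \mapsto n^{-1}\sum_{i=1}^n \log\{1+\nu_n^\top\hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)\}$, so damped Newton iterations solve it reliably. The sketch below is our illustration of that inner step, not the authors' implementation; `phi` holds the rows $\hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)$.

```python
import numpy as np

def solve_nu(phi, tol=1e-10, max_iter=50):
    """Solve Q_n1(theta, nu) = 0 for nu at fixed theta by damped Newton."""
    n, p = phi.shape
    nu = np.zeros(p)
    for _ in range(max_iter):
        t = 1.0 + phi @ nu                       # 1 + nu' phi_i for each i
        grad = (phi / t[:, None]).mean(axis=0)   # Q_n1(theta, nu)
        if np.linalg.norm(grad) < tol:
            break
        scaled = phi / t[:, None]
        hess = -scaled.T @ scaled / n            # Jacobian of Q_n1 in nu
        step = np.linalg.solve(hess, -grad)      # Newton direction
        lam = 1.0
        while np.any(1.0 + phi @ (nu + lam * step) <= 0.0):
            lam *= 0.5                           # keep all EL weights positive
        nu += lam * step
    return nu
```

Given $\hat\nu_n(\theta)$, the profile empirical log-likelihood ratio $2\sum_{i=1}^n \log\{1+\hat\nu_n^\top\hat\varphi_1(x_i,Y_i;\theta,\hat\alpha_p)\}$ can then be minimized over $\theta$ with a generic optimizer to obtain $\hat\theta_1$.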
Taking the Taylor series expansion of $Q_{n1}(\hat\theta_1,\hat\nu_n)$ and $Q_{n2}(\hat\theta_1,\hat\nu_n)$ at $(\theta_0,0)$, we obtain
$$
0 = Q_{n1}(\hat\theta_1,\hat\nu_n) = Q_{n1}(\theta_0,0) + \frac{\partial Q_{n1}(\theta_0,0)}{\partial \theta^\top}(\hat\theta_1-\theta_0) + \frac{\partial Q_{n1}(\theta_0,0)}{\partial \nu_n^\top}(\hat\nu_n-0) + o_p(\sigma_n),
$$
$$
0 = Q_{n2}(\hat\theta_1,\hat\nu_n) = Q_{n2}(\theta_0,0) + \frac{\partial Q_{n2}(\theta_0,0)}{\partial \theta^\top}(\hat\theta_1-\theta_0) + \frac{\partial Q_{n2}(\theta_0,0)}{\partial \nu_n^\top}(\hat\nu_n-0) + o_p(\sigma_n),
$$
where $\sigma_n = \|\hat\theta_1-\theta_0\| + \|\hat\nu_n\|$.
Through direct calculation, we obtain
$$
\frac{\partial Q_{n1}(\theta_0,0)}{\partial \theta^\top} = \frac{1}{n}\sum_{i=1}^n \frac{\partial \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)}{\partial \theta^\top},
\qquad
\frac{\partial Q_{n1}(\theta_0,0)}{\partial \nu_n^\top} = -\frac{1}{n}\sum_{i=1}^n \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)^{\otimes 2},
$$
$$
\frac{\partial Q_{n2}(\theta_0,0)}{\partial \theta^\top} = 0,
\qquad
\frac{\partial Q_{n2}(\theta_0,0)}{\partial \nu_n^\top} = \frac{1}{n}\sum_{i=1}^n \Big\{\frac{\partial \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)}{\partial \theta^\top}\Big\}^\top.
$$
Then, we have
$$
\begin{pmatrix} \hat\nu_n \\ \hat\theta_1-\theta_0 \end{pmatrix}
= S_n^{-1}\begin{pmatrix} -Q_{n1}(\theta_0,0) + o_p(\sigma_n) \\ o_p(\sigma_n) \end{pmatrix},
$$
where
$$
S_n = \begin{pmatrix}
-\dfrac{1}{n}\displaystyle\sum_{i=1}^n \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)^{\otimes 2} & \dfrac{1}{n}\displaystyle\sum_{i=1}^n \dfrac{\partial \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)}{\partial \theta^\top} \\
\dfrac{1}{n}\displaystyle\sum_{i=1}^n \Big\{\dfrac{\partial \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)}{\partial \theta^\top}\Big\}^\top & 0
\end{pmatrix}.
$$
From Lemma A3, we obtain the following convergence result for S n :
$$
S_n \xrightarrow{P} S = \begin{pmatrix} -V_1 & \Gamma \\ \Gamma^\top & 0 \end{pmatrix}.
$$
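For completeness, the next display records the block inversion of $S$ that produces the representation below; this is a standard computation (cf. Qin and Lawless [22]), written under the sign convention adopted here:
$$
S^{-1} = \begin{pmatrix}
-V_1^{-1} + V_1^{-1}\Gamma(\Gamma^\top V_1^{-1}\Gamma)^{-1}\Gamma^\top V_1^{-1} & V_1^{-1}\Gamma(\Gamma^\top V_1^{-1}\Gamma)^{-1} \\
(\Gamma^\top V_1^{-1}\Gamma)^{-1}\Gamma^\top V_1^{-1} & (\Gamma^\top V_1^{-1}\Gamma)^{-1}
\end{pmatrix},
$$
so the $\hat\theta_1-\theta_0$ block of $S_n^{-1}$ applied to $-Q_{n1}(\theta_0,0)$ equals $-(\Gamma^\top V_1^{-1}\Gamma)^{-1}\Gamma^\top V_1^{-1} Q_{n1}(\theta_0,0)$ up to the $o_p(\sigma_n)$ terms.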
Additionally, from Lemma A3, we have $Q_{n1}(\theta_0,0) = n^{-1}\sum_{i=1}^n \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p) = O_p(n^{-1/2})$, which implies that $\sigma_n = O_p(n^{-1/2})$. Thus, we obtain
$$
\sqrt{n}\,(\hat\theta_1-\theta_0) = -(\Gamma^\top V_1^{-1}\Gamma)^{-1}\Gamma^\top V_1^{-1}\,\frac{1}{\sqrt{n}}\sum_{i=1}^n \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p) + o_p(1).
$$
Therefore, we have $\sqrt{n}(\hat\theta_1-\theta_0) \xrightarrow{L} N(0,\Sigma_1)$. Following the same procedure as outlined above, we can also establish that $\sqrt{n}(\hat\theta_2-\theta_0) \xrightarrow{L} N(0,\Sigma_2)$.
We now consider the second part of Theorem 2. Using the same argument as in Tang et al. [23], we obtain
$$
\hat\ell_1(\theta_0,\hat\alpha_p) = Z^\top\Big\{\frac{1}{n}\sum_{i=1}^n \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)^{\otimes 2}\Big\}^{-1} Z + o_p(1),
$$
where $Z = n^{-1/2}\sum_{i=1}^n \hat\varphi_1(x_i,Y_i;\theta_0,\hat\alpha_p)$. Applying Lemma A3, we obtain the desired result. The asymptotic distribution of $\hat\ell_2(\theta_0,\hat\alpha_p)$ can be derived by following the same reasoning as in the proof for $\hat\ell_1(\theta_0,\hat\alpha_p)$. This completes the proof of Theorem 2. □
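As a computational footnote, the exact profile form and the quadratic form of $\hat\ell_1(\theta_0,\hat\alpha_p)$ used in the last display can be compared directly. The sketch below reuses the illustrative `solve_nu` helper from earlier and is again our assumption about an implementation, not the authors' code.

```python
import numpy as np

def el_ratio_forms(phi, solve_nu):
    """phi: (n, p) rows phi_1(x_i, Y_i; theta_0, alpha_hat)."""
    n = phi.shape[0]
    nu = solve_nu(phi)
    exact = 2.0 * np.sum(np.log1p(phi @ nu))  # 2 sum_i log(1 + nu' phi_i)
    Z = phi.sum(axis=0) / np.sqrt(n)          # n^{-1/2} sum_i phi_i
    V = phi.T @ phi / n                       # (1/n) sum_i phi_i phi_i'
    quad = Z @ np.linalg.solve(V, Z)          # Z' V^{-1} Z
    return exact, quad                        # equal up to o_p(1) at theta_0
```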

References

1. Jennrich, R.I. Asymptotic properties of non-linear least squares estimators. Ann. Math. Stat. 1969, 40, 633–643.
2. Wu, C.F. Asymptotic theory of nonlinear least squares estimation. Ann. Stat. 1981, 9, 501–513.
3. Fekedulegn, D.; Mac Siurtain, M.P.; Colbert, J.J. Parameter estimation of nonlinear growth models in forestry. Silva Fenn. 1999, 33, 327–336.
4. Ivanov, A.V. Asymptotic Theory of Nonlinear Regression; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1997.
5. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: New York, NY, USA, 2019.
6. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
7. Robins, J.M.; Rotnitzky, A.; Zhao, L. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
8. Han, P. Multiply robust estimation in regression analysis with missing data. J. Am. Stat. Assoc. 2014, 109, 1159–1173.
9. Xue, L.; Xie, J. Efficient robust estimation for single-index mixed effects models with missing observations. Stat. Pap. 2024, 65, 827–864.
10. Sharghi, S.; Stoll, K.; Ning, W. Statistical inferences for missing response problems based on modified empirical likelihood. Stat. Pap. 2024, 65, 4079–4120.
11. Li, W.; Luo, S.; Xu, W. Calibrated regression estimation using empirical likelihood under data fusion. Comput. Stat. Data Anal. 2024, 190, 107871.
12. Tang, N.; Zhao, P. Empirical likelihood-based inference in nonlinear regression models with missing responses at random. Statistics 2013, 47, 1141–1159.
13. Owen, A.B. Empirical likelihood ratio confidence regions. Ann. Stat. 1990, 18, 90–120.
14. Yang, Z.; Tang, N. Empirical likelihood for nonlinear regression models with nonignorable missing responses. Can. J. Stat. 2020, 48, 386–416.
15. Wang, S.; Shao, J.; Kim, J.K. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat. Sin. 2014, 24, 1097–1116.
16. Wang, L.; Shao, J.; Fang, F. Propensity model selection with nonignorable nonresponse and instrument variable. Stat. Sin. 2021, 31, 647–672.
17. Chen, J.; Shao, J.; Fang, F. Instrument search in pseudo-likelihood approach for nonignorable nonresponse. Ann. Inst. Stat. Math. 2021, 73, 519–533.
18. Du, J.; Li, Y.; Cui, X. Identification and estimation of generalized additive partial linear models with nonignorable missing response. Commun. Math. Stat. 2024, 12, 113–156.
19. Beppu, K.; Morikawa, K. Verifiable identification condition for nonignorable nonresponse data with categorical instrumental variables. Stat. Theory Relat. Fields 2024, 8, 40–50.
20. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
21. Qin, J.; Leung, D.; Shao, J. Estimation with survey data under nonignorable nonresponse or informative sampling. J. Am. Stat. Assoc. 2002, 97, 193–200.
22. Qin, J.; Lawless, J.F. Empirical likelihood and general estimating equations. Ann. Stat. 1994, 22, 300–325.
23. Tang, N.; Zhao, P.; Zhu, H. Empirical likelihood for estimating equations with nonignorably missing data. Stat. Sin. 2014, 24, 723–747.
24. Ding, X.; Tang, N. Adjusted empirical likelihood estimation of distribution function and quantile with nonignorable missing data. J. Syst. Sci. Complex. 2018, 31, 820–840.
25. Morikawa, K.; Kano, Y. Statistical inference with different missing-data mechanisms. arXiv 2014, arXiv:1407.4971.
26. Miao, W.; Tchetgen, E.J. On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika 2016, 103, 475–482.
27. Liu, T.; Yuan, X. Doubly robust augmented-estimating-equations estimation with nonignorable nonresponse data. Stat. Pap. 2020, 61, 2241–2270.
28. Zhao, P.; Tang, N.; Zhu, H. Generalized empirical likelihood inferences for nonsmooth moment functions with nonignorable missing values. Stat. Sin. 2020, 30, 217–249.
29. Hu, Z.; Follmann, D.A.; Qin, J. Semiparametric dimension reduction estimation for mean response with missing data. Biometrika 2010, 97, 305–319.
30. Zhao, P.; Tang, N.; Qu, A.; Jiang, D. Semiparametric estimating equations inference with nonignorable missing data. Stat. Sin. 2017, 27, 89–113.
31. Jiang, D.; Zhao, P.; Tang, N. A propensity score adjustment method for regression models with nonignorable missing covariates. Comput. Stat. Data Anal. 2016, 94, 98–119.
32. Zhou, Y.; Wan, A.T.K.; Wang, X. Estimating equations inference with missing data. J. Am. Stat. Assoc. 2008, 103, 1187–1199.
33. Hammer, S.M.; Katzenstein, D.A.; Hughes, M.D.; Gundacker, H.; Schooley, R.T.; Haubrich, R.H.; Henry, W.K.; Lederman, M.M.; Phair, J.P.; Niu, M.; et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N. Engl. J. Med. 1996, 335, 1081–1090.
34. Davidian, M.; Tsiatis, A.A.; Leon, S. Semiparametric estimation of treatment effect in a pretest–posttest study with missing data. Stat. Sci. 2005, 20, 261–301.
35. Tsiatis, A.A.; Davidian, M.; Zhang, M.; Lu, X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Stat. Med. 2008, 27, 4658–4677.
36. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall: New York, NY, USA, 1993.
37. Ren, Y.; Zhang, X. Variable selection using penalized empirical likelihood. Sci. China Math. 2011, 54, 1829–1845.
Figure 1. Simulation 1 results comparing EL and NA methods. Line specifications: EL ($\hat\theta_2$) (red solid), EL ($\hat\theta_1$) (green dotted), NA ($\hat\theta_2$) (black dash-dot), NA ($\hat\theta_1$) (blue thick solid).
Table 1. Simulation results on the estimation performance of $\alpha$ in Simulation 2.

                        n = 150                                 n = 250
Est.             Bias      SD        RMS      TF         Bias      SD        RMS      TF
$\hat\alpha_0$   0.0549    0.1021    0.1158   3.690      0.0447    0.0703    0.0833   3.790
$\hat\alpha_3$   0.0211    0.2313    0.2321              0.0038    0.1612    0.1612
$\hat\alpha_y$   0.0031    0.1397    0.1397              0.0118    0.0984    0.0990
Table 2. Simulation results on the estimation performance of $\theta$ in Simulation 2.

                              IPW                             AIPW
n      Est.             Bias      SD        RMS        Bias      SD        RMS
150    $\hat\theta_0$   0.0014    0.0442    0.0443     0.0007    0.0439    0.0439
       $\hat\theta_1$   0.0015    0.0519    0.0520     0.0012    0.0520    0.0522
       $\hat\theta_2$   0.0015    0.0569    0.0570     0.0008    0.0576    0.0576
       $\hat\theta_3$   0.0006    0.0596    0.0596     0.0001    0.0601    0.0601
       $\hat\theta_4$   0.0023    0.0573    0.0574     0.0020    0.0574    0.0574
       $\hat\theta_5$   0.0011    0.0305    0.0305     0.0009    0.0305    0.0305
250    $\hat\theta_0$   0.0006    0.0382    0.0382     0.0008    0.0379    0.0379
       $\hat\theta_1$   0.0012    0.0399    0.0399     0.0012    0.0401    0.0402
       $\hat\theta_2$   0.0015    0.0466    0.0466     0.0013    0.0466    0.0466
       $\hat\theta_3$   0.0010    0.0450    0.0450     0.0013    0.0449    0.0449
       $\hat\theta_4$   0.0016    0.0409    0.0409     0.0021    0.0409    0.0409
       $\hat\theta_5$   0.0005    0.0227    0.0227     0.0007    0.0225    0.0225
Table 3. Estimation of response model parameters $\alpha$.

Est.             Estimate    p-Value      Est.             Estimate    p-Value
$\hat\alpha_0$   0.64        <0.001       $\hat\alpha_5$   0.0068      <0.001
$\hat\alpha_1$   −0.0007     <0.001       $\hat\alpha_6$   0.0002      0.002
$\hat\alpha_2$   0.0011      0.003        $\hat\alpha_7$   0.0010      <0.001
$\hat\alpha_3$   0           0.574        $\hat\alpha_8$   −0.6299     <0.001
$\hat\alpha_4$   0           0.191        $\hat\alpha_9$   −0.0010     <0.001
Table 4. Results of the analysis on the ACTG 175 data.

                 Complete-Case Analysis                Han's Method
                 Estimate    s.e.      p-Value         Estimate    s.e.      p-Value
Intercept        21.50       27.44     0.433           65.53       34.06     0.054
Trt              63.68       9.09      <0.001          52.72       10.34     <0.001
CD4_0            0.76        0.04      <0.001          0.73        0.05      <0.001
Age              0.10        0.45      0.816           0.14        0.55      0.796
Weight           0.54        0.28      0.054           0.27        0.33      0.417
Race             −20.60      8.51      0.015           −18.30      9.66      0.058
Gender           −10.73      10.79     0.320           −16.54      11.34     0.145
History          −42.02      7.62      <0.001          −41.45      8.65      <0.001
Offtrt           −80.72      9.62      <0.001          −86.87      10.31     <0.001

                 IPW                                   AIPW
                 Estimate    s.e.      p-Value         Estimate    s.e.      p-Value
Intercept        33.15       30.66     0.2796          34.28       30.77     0.2651
Trt              62.14       9.52      <0.001          61.77       9.60      <0.001
CD4_0            0.76        0.05      <0.001          0.76        0.05      <0.001
Age              0.18        0.53      0.7278          0.18        0.54      0.7326
Weight           0.42        0.32      0.1797          0.42        0.32      0.1884
Race             −22.07      10.10     0.0288          −22.01      10.13     0.0297
Gender           −9.38       12.07     0.4369          −9.33       12.10     0.4402
History          −41.34      8.67      <0.001          −41.24      8.70      <0.001
Offtrt           −74.74      11.62     <0.001          −74.44      11.64     <0.001