Identification and Empirical Likelihood Inference in Nonlinear Regression Model with Nonignorable Nonresponse

Ding, Xianwen; Li, Xiaoxia

doi:10.3390/math13091388


by Xianwen Ding ¹ and Xiaoxia Li ²,*

¹ Department of Statistics, Jiangsu University of Technology, Changzhou 213001, China

² School of Mathematics and Information Technology, Yuncheng University, Yuncheng 044000, China

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(9), 1388; https://doi.org/10.3390/math13091388

Submission received: 25 March 2025 / Revised: 20 April 2025 / Accepted: 23 April 2025 / Published: 24 April 2025

(This article belongs to the Special Issue Modeling, Control and Optimization of Biological Systems)


Abstract

The identification of model parameters is a central challenge in the analysis of nonignorable nonresponse data. In this paper, we propose a novel penalized semiparametric likelihood method to obtain sparse estimators for a parametric nonresponse mechanism model. Based on these sparse estimators, an instrumental variable is introduced, enabling the identification of the observed likelihood. Two classes of estimating equations for the nonlinear regression model are constructed, and the empirical likelihood approach is employed to make inferences about the model parameters. The oracle properties of the sparse estimators in the nonresponse mechanism model are systematically established. Furthermore, the asymptotic normality of the maximum empirical likelihood estimators is derived. It is also shown that the empirical log-likelihood ratio functions are asymptotically weighted chi-squared distributed. Simulation studies are conducted to validate the effectiveness of the proposed estimation procedure. Finally, the practical utility of our approach is demonstrated through the analysis of ACTG 175 data.

Keywords: identification; empirical likelihood; nonignorable nonresponse; nonlinear model

MSC: 62J02

1. Introduction

Consider a dataset comprising n independent observations

{(x_{i}, Y_{i})}_{i = 1}^{n}

, where each observation includes a covariate vector

x_{i} \in R^{d_{x}}

and a scalar response variable

Y_{i} \in R

. We consider a family of nonlinear regression models given by

Y_{i} = f (x_{i}; θ) + ϖ (x_{i}) ε_{i}, i = 1, \dots, n,

(1)

where

f (x_{i}; θ) : R^{d_{x}} \times R^{p} \to R

is a known nonlinear function with an unknown vector of parameters

θ \in R^{p}

. The error term consists of two components: (1)

ϖ (x_{i}) : R^{d_{x}} \to R^{+}

, a variance function that modulates the error scale as a function of the covariates, and (2)

ε_{i}

, a sequence of i.i.d. random variables with

E (ε_{i} | x_{i}) = 0

and

Var (ε_{i} | x_{i}) = σ^{2}

. Model (1) has been extensively studied in the statistical literature, including seminal contributions by Jennrich [1] and Wu [2]. A key example of such models is the Gompertz growth process, which is widely used in biology, epidemiology, and economics. The corresponding function is given by

f (x; θ_{1}, θ_{2}, θ_{3}) = θ_{1} exp \{- θ_{2} exp (- θ_{3} x)\}

, where

θ_{1}

is the upper asymptote,

θ_{2}

controls displacement, and

θ_{3}

represents the growth rate (Fekedulegn et al. [3]). Similarly, the logistic growth function, frequently employed in population dynamics and epidemic modeling, is

f (x; θ_{1}, θ_{2}, θ_{3}) = θ_{1} / [1 + exp \{- θ_{2} (x - θ_{3})\}]

, where

θ_{1}

is the carrying capacity,

θ_{2}

is the growth rate, and

θ_{3}

is the inflection point. For a fully observed dataset

{(Y_{i}, x_{i})}_{i = 1}^{n}

, the parameter

θ

in model (1) is traditionally estimated using the weighted least squares (WLS) criterion. This estimator is obtained by minimizing the weighted residual sum of squares:

\sum_{i = 1}^{n} ϖ^{- 2} (x_{i}) {Y_{i} - f (x_{i}; θ)}^{2}

. As a fundamental method in nonlinear regression, the WLS estimator achieves asymptotic efficiency under heteroscedasticity and serves as the theoretical foundation for various inferential procedures. For further details, see Ivanov [4].
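The two growth curves and the WLS criterion above can be written down directly. The following is a minimal sketch in plain Python; the toy design points, the parameter values, and the variance function `scale` are hypothetical choices made only for illustration and are not taken from the paper.

```python
import math

def gompertz(x, theta1, theta2, theta3):
    """Gompertz curve: theta1 * exp(-theta2 * exp(-theta3 * x))."""
    return theta1 * math.exp(-theta2 * math.exp(-theta3 * x))

def logistic_growth(x, theta1, theta2, theta3):
    """Logistic curve: theta1 / (1 + exp(-theta2 * (x - theta3)))."""
    return theta1 / (1.0 + math.exp(-theta2 * (x - theta3)))

def wls_objective(theta, xs, ys, f, scale):
    """Weighted residual sum of squares: sum of scale(x)^-2 * (y - f(x; theta))^2."""
    return sum((y - f(x, *theta)) ** 2 / scale(x) ** 2 for x, y in zip(xs, ys))

# Noise-free toy data from a Gompertz curve (illustrative values only).
true_theta = (10.0, 2.0, 0.5)
xs = [0.0, 1.0, 2.0, 4.0, 8.0]
ys = [gompertz(x, *true_theta) for x in xs]

def scale(x):
    """Hypothetical variance function, standing in for the scale function of model (1)."""
    return 1.0 + 0.1 * x

loss_at_truth = wls_objective(true_theta, xs, ys, gompertz, scale)
loss_elsewhere = wls_objective((9.0, 2.0, 0.5), xs, ys, gompertz, scale)
```

With noise-free data the criterion vanishes at the generating parameter and is strictly positive at the perturbed value, which is the property a numerical minimizer would exploit.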

Missing data frequently arise in practical applications due to factors such as reluctance to answer sensitive survey questions. In such cases, directly applying conventional least squares procedures to estimate the parameter vector

θ

may lead to biased estimates and invalid conclusions (see, for example, Little and Rubin [5]). The inverse probability weighting (IPW) method, introduced by Horvitz and Thompson [6], remains a fundamental approach for addressing missing data challenges. To improve efficiency, Robins et al. [7] developed the augmented inverse probability weighting (AIPW) method, which builds upon a corrected version of complete-case analysis. Subsequent extensions of this methodology across various domains include significant contributions by Han [8], Xue and Xie [9], Sharghi et al. [10], and Li et al. [11], among others. For missing at random (MAR) scenarios, Tang and Zhao [12] developed IPW and AIPW estimating equations for empirical likelihood (EL) inference on

θ

, extending the foundational methodology of Owen [13]. In more challenging not missing at random (NMAR) settings, where nonresponse depends on the unobserved values, Yang and Tang [14] proposed an EL approach for inference in this modeling framework.

Identification remains a fundamental issue in the analysis of nonignorable missing data. The observed likelihood is identifiable if two distinct populations do not produce identical observed likelihood functions. Crucially, identifiability can fail even when both the outcome model and the nonresponse mechanism model are fully parametric, as demonstrated by Wang et al. [15]. Significant methodological advancements have been made in recent decades to address this issue. For parametric logistic nonresponse mechanisms, Yang and Tang [14] established identifiability conditions within the EL framework. In broader parametric settings, Wang et al. [15] introduced an instrumental variable (IV) approach to resolve the identifiability issue. More recently, Wang et al. [16] investigated an optimal subset selection method for identifying the IV from a set of candidate models. In addition, Chen et al. [17] suggested an IV selection technique based on pseudo-likelihood principles. Further advancements include the work of Du et al. [18] and Beppu and Morikawa [19]. Current estimation strategies for nonresponse mechanisms typically involve a two-stage process: an appropriate IV is identified first, and the parameters of the nonresponse mechanism model are then estimated. However, these methods face significant computational challenges as the candidate model space expands.

A novel penalized semiparametric likelihood method for IV selection is proposed under the parametric assumption of the missingness mechanism. By leveraging the sparse structure of the observed likelihood, we develop a regularized approach to obtain the sparse estimators for the nonresponse mechanism model. To achieve this, we integrate the semiparametric likelihood framework with the SCAD penalty function (Fan and Li [20]). This shrinkage technique enables the simultaneous identification of IV and estimation of a sparse nonresponse mechanism model. Subsequently, the unbiased estimating equations based on IPW and AIPW methods are constructed, and the profile empirical log-likelihood ratio functions (ELLRFs) are rigorously formulated.

Our primary contributions are threefold. First, we propose a penalized semiparametric likelihood framework that effectively combines the SCAD penalty and the sparse likelihood structure. This approach facilitates simultaneous IV selection and parameter estimation for the nonresponse mechanism model. The resulting sparse estimators exhibit the oracle properties, ensuring both selection consistency and asymptotic efficiency. Second, the flexibility of the EL method enables the proposed estimation procedure to produce confidence regions with natural shapes and orientations. Third, under some regularity conditions, we show that the ELLRFs are asymptotically weighted chi-squared distributed, while the maximum empirical likelihood estimators (MELEs) are asymptotically normally distributed, providing a valid foundation for regression parameter inference.

The rest of this article is organized as follows. In Section 2, we present the penalized semiparametric likelihood methodology and construct two types of unbiased estimating equations. The MELEs and ELLRFs are also introduced. In Section 3, we investigate the oracle properties of the sparse estimators for the nonresponse mechanism model, as well as the asymptotic normality of the MELEs and the asymptotic properties of the proposed ELLRFs. Simulation studies and a real data analysis are conducted to evaluate the finite sample performance of the proposed estimates in Section 4 and Section 5, respectively. The concluding discussions are included in Section 6. Proofs of the asymptotic results are relegated to Appendix A.

2. Methods

2.1. Penalized Semiparametric Likelihood Estimation

Let

F (x, Y)

be the unconditional joint distribution of

x

and Y. Suppose that

n_{1}

out of the n individuals respond on both Y and

x

, which results in data

(x_{1}, Y_{1}), \dots, (x_{n_{1}}, Y_{n_{1}})

. For the rest of the

n - n_{1}

individuals, their Y values are not observed, but their

x

values are always observed. Let

δ

represent the missingness indicator of Y, i.e., δ = 1 if Y is observed and δ = 0 otherwise. Suppose that the covariate

x

has two components,

x = {(u^{⊤}, z^{⊤})}^{⊤}

such that the nonresponse mechanism can be modeled as

π (x, Y; α) = P (δ = 1 | x, Y) = Ψ (α_{0} + α_{u}^{⊤} u + α_{z}^{⊤} z + α_{y} Y),

(2)

where

α = {(α_{0}, α_{u}^{⊤}, α_{z}^{⊤}, α_{y})}^{⊤} \in R^{d}

is an unknown parameter to be estimated, and

Ψ

is a known, strictly monotonic, twice-differentiable function from

R

to

(0, 1]

. Since model (2) depends explicitly on the potentially unobserved Y when

α_{y} \neq 0

, it describes a nonignorable missingness mechanism, often referred to as NMAR. In this context, the missingness indicator

δ

is typically assumed to follow a conditional Bernoulli distribution with probability

π (x, Y; α)

. Notably, when

α_{y} = 0

, the missingness mechanism simplifies to MAR, as the dependence on the unobserved Y is eliminated.
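To make the distinction between NMAR and MAR in model (2) concrete, the sketch below evaluates the response probability with the logistic link as one admissible choice of Ψ. The function names and all numerical values are hypothetical and serve only as an illustration.

```python
import math

def expit(t):
    """Logistic link Psi(t) = 1 / (1 + exp(-t)), one admissible choice of Psi."""
    return 1.0 / (1.0 + math.exp(-t))

def response_prob(u, z, y, alpha0, alpha_u, alpha_z, alpha_y):
    """pi(x, Y; alpha) = Psi(alpha0 + alpha_u'u + alpha_z'z + alpha_y * Y)."""
    lin = (alpha0
           + sum(a * v for a, v in zip(alpha_u, u))
           + sum(a * v for a, v in zip(alpha_z, z))
           + alpha_y * y)
    return expit(lin)

u, z = [0.5], [1.0, -2.0]
coef = dict(alpha0=0.3, alpha_u=[1.0], alpha_z=[0.0, 0.0])

# NMAR: alpha_y != 0, so the response probability moves with the (possibly unseen) Y.
p_low = response_prob(u, z, y=0.0, alpha_y=-0.8, **coef)
p_high = response_prob(u, z, y=2.0, alpha_y=-0.8, **coef)

# MAR: alpha_y == 0 removes the dependence on Y.
q_low = response_prob(u, z, y=0.0, alpha_y=0.0, **coef)
q_high = response_prob(u, z, y=2.0, alpha_y=0.0, **coef)
```

Under the NMAR setting the two probabilities differ, whereas under MAR they coincide for every value of Y.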

Following Qin et al. [21], the likelihood of

(α, F)

based on the complete observations

{(x_{j}, Y_{j}) : j = 1, \dots, n_{1}}

is

\prod_{j = 1}^{n_{1}} π (x_{j}, Y_{j}; α) d F (x_{j}, Y_{j}) \prod_{j = n_{1} + 1}^{n} \int \int {1 - π (x, Y; α)} d F (x, Y),

which can be rewritten as

L_{C} (α, W) = \{\prod_{j = 1}^{n_{1}} \frac{π (x_{j}, Y_{j}; α) d F (x_{j}, Y_{j})}{W}\} W^{n_{1}} {(1 - W)}^{n - n_{1}},

(3)

where

W = P (δ = 1) = \int \int π (x, Y; α) d F (x, Y)

is the unconditional respondent rate. The first term in Equation (3) is the likelihood conditioning on

δ = 1

, and the term

W^{n_{1}} {(1 - W)}^{n - n_{1}}

is the binomial likelihood of

δ

. The direct maximization of

L_{C} (α, W)

in Equation (3) may lose some information contained in

{x_{i} : i = n_{1} + 1, \dots, n}

. To address this limitation, we assume that the auxiliary information on

x

can be characterized as

E {Δ (x)} = 0

, where

Δ (x) = {(Δ_{1} (x), \dots, Δ_{r} (x))}^{⊤}

is a known r-vector (or scalar) function. To illustrate the rationale underlying the construction of the auxiliary function

Δ (x)

, consider the case where the population mean of

x

, denoted by

μ_{x}

, is known. In this setting, one may define

Δ (x) = x - μ_{x}

to serve as auxiliary information. When the population mean

μ_{x}

is unavailable, it can be replaced by the estimated mean

\bar{x} = n^{- 1} \sum_{i = 1}^{n} x_{i}

. Thus, part of the information contained in

{x_{i} : i = n_{1} + 1, \dots, n}

is recovered through

μ_{x}

or

\bar{x}

, thereby enhancing the efficiency of estimation under incomplete data.

By the auxiliary information on

x

and without assuming any specific form of

F (x, Y)

, we can maximize the semiparametric likelihood (3) subject to the constraints

\begin{matrix} ϕ_{j} \geq 0, \sum_{j = 1}^{n_{1}} ϕ_{j} = 1, \\ \sum_{j = 1}^{n_{1}} ϕ_{j} {π (x_{j}, Y_{j}; α) - W} = 0, \\ \sum_{j = 1}^{n_{1}} ϕ_{j} Δ (x_{j}) = 0, \end{matrix}

where

ϕ_{j}

is the jump of

F (x, Y)

at

{(x_{j}, Y_{j}) : j = 1, \dots, n_{1}}

.

By introducing Lagrange multipliers and profiling for all of the values of

ϕ_{j}

, we obtain

ϕ_{j} = \frac{1}{n_{1} [1 + λ_{1}^{⊤} Δ (x_{j}) + λ_{2} {π (x_{j}, Y_{j}; α) - W}]},

where

λ_{1}

and

λ_{2}

are Lagrange multipliers as described in Qin and Lawless [22].

Substituting all of the values of

ϕ_{j}

into Equation (3), the log-likelihood with respect to

α

and W becomes

\begin{matrix} ℓ (α, W) & = \sum_{j = 1}^{n_{1}} log π (x_{j}, Y_{j}; α) + (n - n_{1}) log (1 - W) \\ & \quad - \sum_{j = 1}^{n_{1}} log [1 + λ_{1}^{⊤} Δ (x_{j}) + λ_{2} {π (x_{j}, Y_{j}; α) - W}] . \end{matrix}

The identifiability of the observed likelihood as established by Wang et al. [16] relies on the existence of an IV

z

that satisfies two conditions: (i)

z

can be excluded from the nonresponse mechanism model, i.e.,

z ⊥ δ ∣ (u, Y)

, and (ii)

z

must be related to the study variable Y. Specifically, if the true parameter subvector

α_{z}^{0}

corresponding to

z

satisfies

α_{z}^{0} = 0

, then

z

qualifies as a valid IV by design. This critical insight motivates the development of a penalized semiparametric likelihood framework to achieve the sparse estimation of

α

in the nonresponse mechanism model. The penalized likelihood estimator

{\hat{α}}_{p}

of

α

can be obtained by maximizing the following objective function:

ℓ_{p} (α, W) = ℓ (α, W) - n_{1} \sum_{j = 1}^{d} g_{γ} (| α_{j} |),

(4)

where

g_{γ} (\cdot)

represents the SCAD penalty function. The first derivative of the penalty term is specified as

\begin{matrix} g_{γ}^{'} (β) = γ \{I (β \leq γ) + \frac{{(a γ - β)}_{+}}{(a - 1) γ} I (β > γ)\} \end{matrix}

for

β > 0

, where

a > 2

is a fixed constant,

γ

is a tuning parameter, and

{(z)}_{+} = max (z, 0)

. Following Fan and Li [20], we set

a = 3.7

throughout this study. As demonstrated in Theorem 1, the sparse estimator

{\hat{α}}_{p}

achieves the oracle properties, ensuring that

P ({\hat{α}}_{z} = 0) \to 1

as

n \to \infty

. This guarantees the consistent identification of

z

as the IV.
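The SCAD derivative above, together with the penalty itself, can be sketched as follows. The closed form of the penalty is the standard one obtained by integrating the derivative from zero; it is not written out in the text and is included here as an aid, with a = 3.7 as the default.

```python
def scad_derivative(beta, gamma, a=3.7):
    """First derivative g'_gamma(beta) of the SCAD penalty for beta > 0."""
    if beta <= gamma:
        return gamma
    return max(a * gamma - beta, 0.0) / (a - 1.0)

def scad_penalty(beta, gamma, a=3.7):
    """SCAD penalty g_gamma(|beta|): linear, then quadratic, then constant."""
    b = abs(beta)
    if b <= gamma:
        return gamma * b
    if b <= a * gamma:
        return (2.0 * a * gamma * b - b * b - gamma * gamma) / (2.0 * (a - 1.0))
    return (a + 1.0) * gamma * gamma / 2.0

gamma = 0.5
small_slope = scad_derivative(0.2, gamma)   # equals gamma: lasso-like shrinkage near zero
large_slope = scad_derivative(5.0, gamma)   # equals 0: no shrinkage for large coefficients
```

The vanishing derivative beyond aγ is what leaves large coefficients nearly unbiased, while the constant slope γ near zero is what produces exact zeros.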

Implementing the optimization procedure for (4) is challenging because it involves the non-concave penalty function

g_{γ} (| α_{j} |)

. To enhance numerical stability, we adopt the local quadratic approximation method introduced by Fan and Li [20]. Given the m-th iteration estimate

α_{j}^{(m)}

, the penalty function can be approximated quadratically as follows:

g_{γ} (| α_{j} |) \approx g_{γ} (| α_{j}^{(m)} |) + \frac{1}{2} {g_{γ}^{'} (| α_{j}^{(m)} |) / | α_{j}^{(m)} |} {α_{j}^{2} - {(α_{j}^{(m)})}^{2}} .

This approximation simplifies the non-concave penalty function, thereby improving both the computational tractability and convergence properties of the optimization procedure. In addition to the approximation strategy, selecting an appropriate penalty parameter

γ

is crucial for optimizing model performance. To achieve this, we employ the following Bayesian information criterion:

BIC (γ) = - 2 ℓ_{p} ({\hat{α}}_{p}, \hat{W}) + log (n_{1}) {df}_{γ},

where

{df}_{γ}

is the number of nonzero elements in

{\hat{α}}_{p}

. By minimizing

BIC (γ)

over

γ

, the resulting optimal tuning parameter can be obtained.
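The local quadratic approximation described in this subsection can be illustrated for a single coefficient. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, and an iterate equal to zero is rejected because the surrogate is undefined there (in practice such coefficients are commonly fixed at zero).

```python
def scad_derivative(beta, gamma, a=3.7):
    """g'_gamma(beta) for beta > 0."""
    return gamma if beta <= gamma else max(a * gamma - beta, 0.0) / (a - 1.0)

def scad_penalty(beta, gamma, a=3.7):
    """g_gamma(|beta|) in closed form."""
    b = abs(beta)
    if b <= gamma:
        return gamma * b
    if b <= a * gamma:
        return (2.0 * a * gamma * b - b * b - gamma * gamma) / (2.0 * (a - 1.0))
    return (a + 1.0) * gamma * gamma / 2.0

def lqa_penalty(alpha_j, alpha_m, gamma, a=3.7):
    """Quadratic surrogate of g_gamma(|alpha_j|) around the iterate alpha_m."""
    am = abs(alpha_m)
    if am < 1e-8:
        raise ValueError("surrogate undefined at a zero iterate")
    curvature = scad_derivative(am, gamma, a) / am
    return scad_penalty(am, gamma, a) + 0.5 * curvature * (alpha_j ** 2 - alpha_m ** 2)

gamma, alpha_m = 0.5, 0.8
surrogate_at_iterate = lqa_penalty(alpha_m, alpha_m, gamma)
exact_at_iterate = scad_penalty(alpha_m, gamma)
```

The surrogate matches the penalty at the current iterate and is quadratic in the coefficient, so each inner step reduces to a smooth problem that Newton-type updates can handle.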

2.2. Construction of Estimating Equations

For complete data

{(Y_{i}, x_{i}) : i = 1, \dots, n}

, the WLS estimator can be obtained by solving the following equations:

\frac{1}{n} \sum_{i = 1}^{n} \nabla_{i} (θ) ϖ^{- 2} (x_{i}) {Y_{i} - f (x_{i}; θ)} = 0,

where

\nabla_{i} (θ) = \partial f (x_{i}; θ) / \partial θ

.

When Y is subject to NMAR, we introduce the following estimating function based on the IPW approach for the ith individual:

{\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p}) = \frac{δ_{i}}{\hat{π} (x_{i}, Y_{i}; {\hat{α}}_{p})} φ (x_{i}, Y_{i}; θ),

where

φ (x_{i}, Y_{i}; θ) = \nabla_{i} (θ) ϖ^{- 2} (x_{i}) {Y_{i} - f (x_{i}; θ)}

and

\hat{π} (x_{i}, Y_{i}; {\hat{α}}_{p})

is a consistent estimate of

π (x_{i}, Y_{i}; α)

.
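As an illustration of the IPW estimating function, the sketch below uses the Gompertz mean function from Section 1 with its analytic gradient. The function names, the constant variance function, and the fitted response probability 0.5 are hypothetical; in the proposed procedure the probability would come from the penalized estimator of α.

```python
import math

def gompertz(x, th):
    """Mean function f(x; theta) = theta1 * exp(-theta2 * exp(-theta3 * x))."""
    return th[0] * math.exp(-th[1] * math.exp(-th[2] * x))

def gompertz_grad(x, th):
    """Gradient of f with respect to (theta1, theta2, theta3)."""
    e = math.exp(-th[2] * x)
    g = math.exp(-th[1] * e)
    return [g, -th[0] * g * e, th[0] * th[1] * x * e * g]

def phi(x, y, th, scale):
    """Complete-data WLS score: grad f * scale(x)^-2 * (y - f(x; theta))."""
    r = (y - gompertz(x, th)) / scale(x) ** 2
    return [d * r for d in gompertz_grad(x, th)]

def phi_ipw(x, y, delta, pi_hat, th, scale):
    """IPW estimating function (delta / pi_hat) * phi; zero for nonrespondents."""
    if delta == 0:
        return [0.0] * len(th)  # y is unobserved and is never evaluated
    return [v / pi_hat for v in phi(x, y, th, scale)]

th = (10.0, 2.0, 0.5)
scale = lambda x: 1.0  # hypothetical homoscedastic case
respondent = phi_ipw(1.0, 4.0, 1, 0.5, th, scale)
nonrespondent = phi_ipw(1.0, None, 0, None, th, scale)
```

A respondent with fitted probability 0.5 contributes twice its complete-data score, which compensates for the similar units that did not respond.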

To improve efficiency, we develop the AIPW-type estimating function with imputation

φ_{2} (x_{i}, Y_{i}; θ, α) = \frac{δ_{i}}{π (x_{i}, Y_{i}; α)} φ (x_{i}, Y_{i}; θ) + \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α)}\} m_{φ}^{0} (x_{i}; θ, α),

where

m_{φ}^{0} (x_{i}; θ, α) = E {φ (x_{i}, Y_{i}; θ) | x_{i}, δ_{i} = 0}

. Following Tang et al. [23], the conditional density

f_{0} (Y_{i} | x_{i}) = f (Y_{i} | x_{i}, δ_{i} = 0)

satisfies

f_{0} (Y_{i} | x_{i}) = f_{1} (Y_{i} | x_{i}) \times \frac{O (x_{i}, Y_{i}; α)}{E {O (x_{i}, Y_{i}; α) | x_{i}, δ_{i} = 1}},

where

f_{1} (Y_{i} | x_{i}) = f (Y_{i} | x_{i}, δ_{i} = 1)

is the conditional density of

Y_{i}

given

x_{i}

and

δ_{i} = 1

, and

O (x_{i}, Y_{i}; α) = π^{- 1} (x_{i}, Y_{i}; α) - 1

is the conditional odds of nonresponse. Simple algebraic manipulations show that

m_{φ}^{0} (x_{i}; θ, α) = \frac{E {δ_{i} φ (x_{i}, Y_{i}; θ) O (x_{i}, Y_{i}; α) | x_{i}}}{E {δ_{i} O (x_{i}, Y_{i}; α) | x_{i}}} .

A nonparametric kernel estimator of

m_{φ}^{0} (x_{i}; θ, α)

can be obtained by

{\hat{m}}_{φ}^{0} (x; θ, α) = \sum_{i = 1}^{n} ω_{i 0}^{*} (x; α) φ (x_{i}, Y_{i}; θ),

where the weight

ω_{i 0}^{*} (x; α) = δ_{i} O (x_{i}, Y_{i}; α) K_{h} (x - x_{i}) / \sum_{k = 1}^{n} δ_{k} O (x_{k}, Y_{k}; α) K_{h} (x - x_{k})

, and

K_{h} (\cdot) = h^{- κ} K (\cdot / h)

with

K (\cdot)

being a

κ

-dimensional kernel function and h representing a bandwidth sequence. Given

{\hat{α}}_{p}

, a kernel-assisted estimating function for the ith observation is given by

{\hat{φ}}_{2} (x_{i}, Y_{i}; θ, {\hat{α}}_{p}) = \frac{δ_{i}}{\hat{π} (x_{i}, Y_{i}; {\hat{α}}_{p})} φ (x_{i}, Y_{i}; θ) + \{1 - \frac{δ_{i}}{\hat{π} (x_{i}, Y_{i}; {\hat{α}}_{p})}\} {\hat{m}}_{φ}^{0} (x_{i}; θ, {\hat{α}}_{p}) .
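The kernel weights and the resulting imputation estimate can be sketched for a scalar covariate and a scalar estimating function. The Epanechnikov kernel is one choice that is bounded, symmetric, and compactly supported; the data, the fitted response probabilities, and the bandwidth below are hypothetical.

```python
def epanechnikov(t):
    """Compactly supported, symmetric kernel density on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def nonresponse_odds(pi):
    """O = 1 / pi - 1, the conditional odds of nonresponse."""
    return 1.0 / pi - 1.0

def kernel_weights(x0, xs, deltas, pis, h):
    """Weights omega*_{i0}(x0; alpha) for scalar x; only respondents carry mass."""
    raw = [nonresponse_odds(p) * epanechnikov((x0 - x) / h) / h if d == 1 else 0.0
           for x, d, p in zip(xs, deltas, pis)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no respondent within the bandwidth of x0")
    return [r / total for r in raw]

def m_hat(x0, xs, ys, deltas, pis, h, phi):
    """Kernel estimate of E{phi | x = x0, delta = 0} built from respondents only."""
    w = kernel_weights(x0, xs, deltas, pis, h)
    return sum(wi * phi(x, y) for wi, x, y, d in zip(w, xs, ys, deltas) if d == 1)

xs = [0.0, 0.2, 0.4, 0.6, 0.8]
ys = [1.0, None, 2.0, 3.0, None]          # None marks a missing response
deltas = [1, 0, 1, 1, 0]
pis = [0.8, None, 0.5, 0.4, None]         # fitted probabilities, respondents only
weights = kernel_weights(0.2, xs, deltas, pis, h=0.5)
m_at_02 = m_hat(0.2, xs, ys, deltas, pis, 0.5, lambda x, y: y - 2.0)
```

Respondents with small response probability receive large odds weights, so the imputed value is tilted toward the kind of unit that tends to be missing.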

2.3. MELEs of Model Parameters

To fix the notation, we temporarily assume that

α

is known. Let

p_{i}^{*}

be the non-negative weight allocated to

φ_{1} (x_{i}, Y_{i}; θ, α) = \frac{δ_{i}}{π (x_{i}, Y_{i}; α)} φ (x_{i}, Y_{i}; θ)

, with the weights summing to 1. Under the moment condition

E {φ_{1} (x_{i}, Y_{i}; θ, α)} = 0

, the profile EL ratio for

θ

is defined as

\begin{matrix} L_{1} (θ) = sup \{\prod_{i = 1}^{n} n p_{i}^{*} | p_{i}^{*} \geq 0, \sum_{i = 1}^{n} p_{i}^{*} = 1, \sum_{i = 1}^{n} p_{i}^{*} φ_{1} (x_{i}, Y_{i}; θ, α) = 0\} . \end{matrix}

By introducing the Lagrange multiplier

t_{n} \in R^{p}

, we have

p_{i}^{*} = \frac{1}{n \{1 + t_{n}^{⊤} φ_{1} (x_{i}, Y_{i}; θ, α)\}},

where

t_{n}

satisfies

Q_{n 1} (θ, t_{n}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{φ_{1} (x_{i}, Y_{i}; θ, α)}{1 + t_{n}^{⊤} φ_{1} (x_{i}, Y_{i}; θ, α)} = 0 .

Therefore, the ELLRF for

θ

can be shown to be

ℓ_{1} (θ, α) = 2 \sum_{i = 1}^{n} log \{1 + t_{n}^{⊤} φ_{1} (x_{i}, Y_{i}; θ, α)\} .

(5)

Maximizing

- ℓ_{1} (θ, α)

leads to the MELE of

θ

, denoted by

{\hat{θ}}_{1}^{*}

. Under some smoothness conditions,

{\hat{θ}}_{1}^{*}

can be obtained by simultaneously solving

\begin{matrix} Q_{n 1} (θ, t_{n}) = 0, Q_{n 2} (θ, t_{n}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{t_{n}^{⊤} \partial_{θ} φ_{1} (x_{i}, Y_{i}; θ, α)}{1 + t_{n}^{⊤} φ_{1} (x_{i}, Y_{i}; θ, α)} = 0, \end{matrix}

where

\partial_{θ}

denotes the partial derivative with respect to

θ

.
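For a scalar parameter (p = 1), the Lagrange multiplier equation and the ELLRF in Equation (5) can be computed by bisection, since the multiplier equation is strictly decreasing in the multiplier. The sketch below takes the values of the estimating function as given; the bisection solver and the toy data are illustrative assumptions, and the vector case would typically use a Newton-type solver instead.

```python
import math

def el_multiplier(g, tol=1e-12, max_iter=200):
    """Solve sum g_i / (1 + t * g_i) = 0 for scalar t by bisection."""
    lo_g, hi_g = min(g), max(g)
    if not (lo_g < 0.0 < hi_g):
        raise ValueError("0 is outside the convex hull of the estimating functions")
    # Positive weights require 1 + t * g_i > 0, i.e. -1/hi_g < t < -1/lo_g.
    eps = 1e-10
    lo, hi = -1.0 / hi_g + eps, -1.0 / lo_g - eps
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        score = sum(gi / (1.0 + mid * gi) for gi in g)  # decreasing in mid
        if score > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def el_log_ratio(g):
    """Empirical log-likelihood ratio 2 * sum log(1 + t * g_i)."""
    t = el_multiplier(g)
    return 2.0 * sum(math.log(1.0 + t * gi) for gi in g)

def el_weights(g):
    """Profile weights p_i = 1 / [n * (1 + t * g_i)]."""
    t, n = el_multiplier(g), len(g)
    return [1.0 / (n * (1.0 + t * gi)) for gi in g]

ys = [1.2, 0.7, 2.9, 1.8, 2.4]            # toy values with sample mean 1.8
g_at_mean = [y - 1.8 for y in ys]
g_off = [y - 1.5 for y in ys]
ratio_at_mean = el_log_ratio(g_at_mean)
ratio_off = el_log_ratio(g_off)
```

The ratio is essentially zero at the value that solves the sample estimating equation and grows as the candidate value moves away from it, which is what shapes the EL confidence region.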

In practical applications, since the parameter

α

is typically unknown, the ELLRF in Equation (5) cannot be used directly for inference about

θ

. To address this, given

{\hat{α}}_{p}

, the estimated ELLRF based on the IPW method is

{\hat{ℓ}}_{1} (θ, {\hat{α}}_{p}) = 2 \sum_{i = 1}^{n} log {1 + ν_{n}^{⊤} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})},

where

ν_{n}

is the Lagrange multiplier and satisfies

\frac{1}{n} \sum_{i = 1}^{n} \frac{{\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})}{1 + ν_{n}^{⊤} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})} = 0 .

Thus, the IPW-based MELE of

θ

, denoted by

{\hat{θ}}_{1}

, can be obtained by maximizing

- {\hat{ℓ}}_{1} (θ, {\hat{α}}_{p})

. Similarly, the AIPW-based MELE of

θ

, denoted by

{\hat{θ}}_{2}

, can be obtained by maximizing

- {\hat{ℓ}}_{2} (θ, {\hat{α}}_{p})

, where

{\hat{ℓ}}_{2} (θ, {\hat{α}}_{p}) = 2 \sum_{i = 1}^{n} log {1 + ν_{n}^{* ⊤} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})}

with

ν_{n}^{*}

solving the corresponding Lagrange multiplier equations.

Remark 1.

The proposed method is developed under the assumption that the response variable is subject to NMAR. This assumption is common in practice, particularly in longitudinal studies and clinical trials with outcome-dependent dropout. Notably, the sensitivity analyses of Ding and Tang [24] and Yang and Tang [14] show that estimation methods based on the NMAR assumption can still perform well when the true missingness mechanism is MAR, suggesting robustness to MAR data. However, when the data exhibit a mixture of MAR and NMAR mechanisms, for example when different individuals follow distinct missingness patterns, the validity of NMAR-based methods may be compromised unless a hierarchical missingness structure is explicitly incorporated, as discussed by Morikawa and Kano [25]. Consequently, in real-data applications, the plausibility of the NMAR assumption should be assessed case by case, because model performance and identifiability may be sensitive to deviations from the assumed missingness mechanism.

3. Main Results

3.1. Asymptotic Properties

The asymptotic properties of the MELEs and ELLRFs require the following assumptions:

(A1): The nonresponse mechanism $π (x, Y; α) \geq c > 0$ almost surely and $π (x) = E {π (x, Y; α_{0}) ∣ x} \neq 1$ almost surely; in a neighborhood of $α_{0}$ , $E | π (x_{i}, Y_{i}; α) |^{3} < \infty$ , and $\partial^{2} π (x, Y; α) / \partial α \partial α^{⊤}$ exists and is bounded by an integrable function.
(A2): The probability density function $G (x)$ is bounded away from ∞ in the support of $x$ ; the first and second derivatives of $G (x)$ are continuous, smooth and bounded; and ${sup}_{x} E (ε^{2} | x)$ and $E (∥ x ∥^{2})$ are finite.
(A3): $m_{φ}^{0} (x; θ, α)$ is twice continuously differentiable in the neighborhood of $x$ .
(A4): The function $f (x; θ)$ is continuous with respect to $θ$ , where $θ$ lies in a compact set; $\nabla (θ) = \partial f (x; θ) / \partial θ$ and $\ddot{f} (θ) = \partial^{2} f (x; θ) / \partial θ \partial θ^{⊤}$ exist; $\ddot{f} (θ)$ has full column rank.
(A5): $E \{ϖ^{- 2} (x) \nabla (θ) \nabla {(θ)}^{⊤}\}$ has full column rank.
(A6): The kernel function $K (\cdot)$ is a probability density function such that (a) it is bounded and has a compact support; (b) it is symmetric with $\int u^{2} K (u) d u < \infty$ ; (c) $K (u) \geq d_{0}$ for some $d_{0} > 0$ in some closed interval centered at zero; and (d) the bandwidth h satisfies $n h \to \infty$ and $n h^{4} \to 0$ as $n \to \infty$ .
(A7): As $n \to \infty$ , ${lim inf}_{β \to 0^{+}} g_{γ}^{'} (β) / γ > 0$ , and the tuning parameter $γ$ satisfies $γ \to 0$ and $\sqrt{n} γ \to \infty$ .
(A8): The penalty function $g_{γ} (\cdot)$ satisfies $\max_{j \in B} g_{γ}^{'} (| α_{0 j} |) = o_{p} (n^{- 1 / 2})$ and $\max_{j \in B} g_{γ}^{″} (| α_{0 j} |)$ $= o_{p} (1)$ , where $B = {j : α_{0 j} \neq 0}$ .
(A9): The moment conditions

$E {\{sup_{ξ \in ℵ} |\frac{\partial^{2} h_{k} (x_{i}, Y_{i}; ξ)}{\partial α_{j} \partial α_{l}}|\}}^{2} < \infty$

and

$E {\{sup_{ξ \in ℵ} |\frac{\partial h_{k} (x_{i}, Y_{i}; ξ)}{\partial α_{j}}|\}}^{2} < \infty$

hold for $k = 1, \dots, r + 1$ , $j = 1, \dots, d$ , and $l = 1, \dots, d$ , where $ξ = {(α^{⊤}, W)}^{⊤}$ with ℵ being a compact set, and $h (x_{i}, Y_{i}; ξ)$ is defined in Equation (A1) of Appendix A. The notation $h_{k} (x_{i}, Y_{i}; ξ)$ denotes the k-th component of $h (x_{i}, Y_{i}; ξ)$ .

Condition (A1) is similar to that used by Qin et al. [21] and is necessary to establish the asymptotic normality of

{\hat{α}}_{p}

. Condition (A2) is a standard condition in probability theory. Assumptions (A3)–(A5) are typical in empirical likelihood-based inference with estimating equations. Condition (A6) is a common assumption in the nonparametric literature. Assumptions (A7)–(A9) are required to establish the oracle properties of penalized semiparametric likelihood estimators.

Let

α_{0} = {(α_{10}^{⊤}, α_{20}^{⊤})}^{⊤}

denote the true value of

α = {(α_{1}^{⊤}, α_{z}^{⊤})}^{⊤}

, where

α_{1} = {(α_{0}, α_{u}^{⊤}, α_{y})}^{⊤}

. As discussed in Fan and Li [20], the SCAD penalty function possesses the oracle properties. These properties ensure that the SCAD penalty not only promotes a sparse model structure but also yields an estimator that is nearly unbiased for large parameter values. We establish the oracle properties and the consistency of

{\hat{α}}_{p}

in Theorem 1.

Theorem 1.

Under Assumptions (A1) and (A7)–(A9), as

n \to \infty

, we have

(i)

| | {\hat{α}}_{p} - α_{0} | | = O_{p} (n^{- 1 / 2})

;

(ii)

P {{\hat{α}}_{z} = 0} \to 1

;

(iii)

\sqrt{n} M^{- 1 / 2} ({\hat{α}}_{1} - α_{10})

\overset{L}{\to}

N (0, I)

, where I represents the identity matrix, and

M

is defined in Appendix A.

From Theorem 1, we establish the stochastic expansion

\sqrt{n} ({\hat{α}}_{p} - α_{0}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} H_{i} (α_{0}) + o_{p} (1),

where the influence function

H_{i} (α_{0})

is defined in Equation (A2) of Appendix A. Theorem 1 (i) demonstrates that, by appropriately selecting the tuning parameter

γ

, a root-n consistent penalized likelihood estimator can be obtained. Furthermore, Theorem 1 (ii) establishes the sparsity property, ensuring that

{\hat{α}}_{z} = 0

with probability approaching one. This result confirms that the penalized estimator effectively identifies and selects the IV

z

with high probability. Finally, Theorem 1 (iii) establishes the asymptotic normality of

{\hat{α}}_{1}

, suggesting that the penalized likelihood method can yield efficient estimates of the nonzero components by effectively reducing the dimensionality of

α

.

Within the framework of the penalized semiparametric likelihood, the asymptotic properties of

{\hat{θ}}_{1}

and

{\hat{θ}}_{2}

are established below. For any vector

B

, let

B^{\otimes 2} = B B^{⊤}

, and convergence in distribution is denoted by the symbol

\overset{L}{\to}

. We first define the key quantities:

\begin{matrix} V_{1} & = E \{π^{- 1} (x, Y; α_{0}) ϖ^{- 4} (x) \nabla {(θ_{0})}^{\otimes 2} ε^{2}\}, \\ Γ & = - E \{ϖ^{- 2} (x) \nabla {(θ_{0})}^{\otimes 2}\}, \\ V_{2} & = E [π^{- 1} (x, Y; α_{0}) {\{φ (x, Y; θ_{0}) - m_{φ}^{0} (x; θ_{0}, α_{0})\}}^{\otimes 2}] \\ & \quad + E \{m_{φ}^{0} {(x; θ_{0}, α_{0})}^{\otimes 2}\}, \\ D (x, Y; α_{0}) & = \{δ - π (x, Y; α_{0})\} \frac{\partial logit {π (x, Y; α_{0})}}{\partial α}, \\ B_{k} & = Cov \{φ_{k} (x, Y; θ_{0}, α_{0}), D (x, Y; α_{0})\}, k = 1, 2, \\ Ω_{k} & = Var \{φ_{k} (x_{i}, Y_{i}; θ_{0}, α_{0}) - B_{k} H_{i} (α_{0})\}, k = 1, 2 . \end{matrix}

Theorem 2.

Suppose that Conditions (A1)–(A9) hold,

Ω_{1}

and

Ω_{2}

are positive definite matrices,

θ_{0}

is the unique true parameter value of θ, and α is estimated by

{\hat{α}}_{p}

. Define

\begin{matrix} Σ_{1} & = {(Γ^{⊤} V_{1}^{- 1} Γ)}^{- 1} Γ^{⊤} V_{1}^{- 1} Ω_{1} V_{1}^{- 1} Γ {(Γ^{⊤} V_{1}^{- 1} Γ)}^{- 1}, \\ Σ_{2} & = {(Γ^{⊤} V_{2}^{- 1} Γ)}^{- 1} Γ^{⊤} V_{2}^{- 1} Ω_{2} V_{2}^{- 1} Γ {(Γ^{⊤} V_{2}^{- 1} Γ)}^{- 1} . \end{matrix}

Then, as

n \to \infty

, we have

(1): Asymptotic normality:

$\begin{matrix} \sqrt{n} ({\hat{θ}}_{1} - θ_{0}) & \overset{L}{\to} N (0, Σ_{1}), \\ \sqrt{n} ({\hat{θ}}_{2} - θ_{0}) & \overset{L}{\to} N (0, Σ_{2}); \end{matrix}$
(2): Likelihood ratio convergence:

$\begin{matrix} {\hat{ℓ}}_{1} (θ_{0}, {\hat{α}}_{p}) & \overset{L}{\to} \sum_{k^{*} = 1}^{p} ρ_{1, k^{*}} χ_{1, k^{*}}^{2}, \\ {\hat{ℓ}}_{2} (θ_{0}, {\hat{α}}_{p}) & \overset{L}{\to} \sum_{k^{*} = 1}^{p} ρ_{2, k^{*}} χ_{1, k^{*}}^{2}, \end{matrix}$

where ${χ_{1, k^{*}}^{2}}_{k^{*} = 1}^{p}$ are independent chi-squared variates with 1 degree of freedom, and ${ρ_{m, k^{*}}}_{k^{*} = 1}^{p}$ ( $m = 1, 2$ ) are eigenvalues of $V_{m}^{- 1} Ω_{m}$ .

Theorem 2 (1) establishes the asymptotic normality of

{\hat{θ}}_{1}

and

{\hat{θ}}_{2}

, enabling normal approximation (NA)-based inference. Specifically, the

(1 - ϑ)

-level NA confidence region for

θ

is

\{θ : {({\hat{θ}}_{j} - θ)}^{⊤} n {\hat{Σ}}_{j}^{- 1} ({\hat{θ}}_{j} - θ) \leq χ_{p, 1 - ϑ}^{2}\}, j = 1, 2,

where

{\hat{Σ}}_{j}^{- 1}

is a consistent plug-in estimator of

Σ_{j}^{- 1}

, and

χ_{p, 1 - ϑ}^{2}

denotes the

1 - ϑ

quantile of the chi-squared distribution

χ_{p}^{2}

with p degrees of freedom. Theorem 2 (2) characterizes the ELLRFs, yielding the EL confidence region

{CI}_{ϑ} (θ) = \{θ : {\hat{ℓ}}_{j} (θ, {\hat{α}}_{p}) \leq c_{ϑ}^{(j)}\}, j = 1, 2,

where

c_{ϑ}^{(j)}

is the

1 - ϑ

quantile of the distribution

\sum_{k^{*} = 1}^{p} ρ_{j, k^{*}} χ_{1, k^{*}}^{2}

, and

{ρ_{j, k^{*}}}_{k^{*} = 1}^{p}

are the eigenvalues of

V_{j}^{- 1} Ω_{j}

.
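Because the limiting law is a weighted sum of independent chi-squared variates, the critical value has no closed form in general and is usually approximated numerically. The sketch below uses plain Monte Carlo; the eigenvalues passed in the example are hypothetical placeholders for plug-in estimates, and the number of draws and the seed are arbitrary choices.

```python
import random

def weighted_chi2_quantile(rhos, level, n_draws=20000, seed=0):
    """Monte Carlo estimate of the `level` quantile of sum_k rho_k * chi2_1."""
    rng = random.Random(seed)
    draws = sorted(
        sum(r * rng.gauss(0.0, 1.0) ** 2 for r in rhos) for _ in range(n_draws)
    )
    idx = min(n_draws - 1, int(level * n_draws))
    return draws[idx]

# Hypothetical eigenvalues standing in for estimated ones.
critical_value = weighted_chi2_quantile([1.3, 0.9, 0.6], level=0.95)

# Sanity check against the chi-squared law with 1 degree of freedom.
chi2_1_q95 = weighted_chi2_quantile([1.0], level=0.95)
```

A candidate parameter value would then be placed inside the EL confidence region whenever its estimated log-likelihood ratio does not exceed `critical_value`. When all eigenvalues equal one, the procedure reduces to the usual chi-squared calibration.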

3.2. Double Robustness

From Theorem 2, we know that if the model

π (x, Y; α)

is correctly specified, the proposed estimator

{\hat{θ}}_{1}

is unbiased and consistent under certain regularity conditions. However, verifying this specification is challenging, and the misspecification of

π (x, Y; α)

may result in biased estimates and reduced efficiency unless additional assumptions on the data are imposed. To address these limitations, a doubly robust estimation procedure has been developed for NMAR settings. Specifically, following Miao and Tchetgen [26] and Liu and Yuan [27], the conditional density function of

(z, Y, δ)

can be factorized as

f (z, Y, δ | u) = c (u) exp \{(1 - δ) OR (Y | u)\} P (δ | Y = 0, u) f (z, Y | δ = 1, u),

where

c (u) = P (δ = 1 | u) / P (δ = 1 | Y = 0, u)

,

P (δ = 1 | Y = 0, u)

is the baseline propensity score,

f (z, Y | δ = 1, u)

is the baseline outcome density, and

OR (Y | u) = log \{\frac{P (δ = 0 | Y, u) P (δ = 1 | Y = 0, u)}{P (δ = 0 | Y = 0, u) P (δ = 1 | Y, u)}\}

is the log of the conditional odds ratio function relating Y and

δ

given

u

.

When focusing on the estimation of the response mean,

θ = E (Y)

, we have

φ (x_{i}, Y_{i}; θ) = Y_{i} - θ

. As demonstrated by Liu and Yuan [27], if

OR (Y | u)

is correctly specified, the estimator

{\hat{θ}}_{1}

is unbiased and consistent if either

f (z, Y | δ = 1, u)

or

P (δ = 1 | Y = 0, u)

is correctly specified. Therefore, by selecting an appropriate model of the log odds ratio from a set of candidate models, the proposed estimation procedure is recommended within the EL framework for nonlinear regression. This approach helps mitigate potential biases arising from the misspecification of the missingness mechanism.

Moreover, if both

π (x, Y; α)

and the moment functions

m_{φ}^{0} (x_{i}; θ, α)

are correctly specified, the proposed estimator

{\hat{θ}}_{2}

remains unbiased and consistent under certain regularity conditions. Following Zhao et al. [28], we show that the moment functions

φ_{2} (x_{i}, Y_{i}; θ, α)

possess the double robustness property when the missingness mechanism, as specified in model (2), is modeled parametrically. The double robustness property is summarized in the following Proposition 1.

Proposition 1.

(1) Regardless of the choice of

m_{φ}^{0} (x_{i}; θ, α)

,

φ_{2} (x_{i}, Y_{i}; θ, α)

has mean zero, provided that the model for

π (x, Y; α)

is correctly specified. (2) If the true missingness mechanism is a parametric logistic model

logit {π (x, Y; α^{*})} = F (x_{i}; α^{*}) + q (Y_{i})

, where

F (\cdot)

is a smooth function with an unknown parameter vector

α^{*}

, and

q (\cdot)

is an arbitrary user-specified function that measures the deviation from the ignorable missing-data mechanism assumption, then the AIPW moment functions

φ_{2} (x_{i}, Y_{i}; θ, α)

still have mean zero, even if the model for

F (x_{i}; α_{0})

is misspecified.
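Part (1) can be checked in one line. The sketch below assumes that the AIPW moment function has the usual form, namely the IPW term plus the augmentation term weighted by one minus the inverse-probability weight, and that the expectation of φ is zero at the true θ; both are read off the surrounding text rather than restated definitions.

```latex
\begin{aligned}
E\{\varphi_{2}(x_i, Y_i; \theta, \alpha)\}
&= E\{\varphi(x_i, Y_i; \theta)\}
 + E\Big[\Big\{\frac{\delta_i}{\pi(x_i, Y_i; \alpha)} - 1\Big\}
   \big\{\varphi(x_i, Y_i; \theta) - m_{\varphi}^{0}(x_i; \theta, \alpha)\big\}\Big] \\
&= 0 + E\Big[\Big\{\frac{E(\delta_i \mid x_i, Y_i)}{\pi(x_i, Y_i; \alpha)} - 1\Big\}
   \big\{\varphi(x_i, Y_i; \theta) - m_{\varphi}^{0}(x_i; \theta, \alpha)\big\}\Big] = 0 .
\end{aligned}
```

The second equality uses iterated expectations given (x_i, Y_i), and the last step uses the fact that the conditional response probability equals π when the propensity model is correctly specified, so the factor in braces vanishes whatever function is used for the augmentation term.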

3.3. Dimension Reduction

In many practical applications, the covariate dimension can be large, making it challenging to obtain an appropriate estimator for

m_{φ}^{0} (x_{i}; θ, α)

using kernel-smoothing imputation methods. To address this issue, let S be a continuous function from

R^{d_{x}}

to

R

such that

E {φ (x_{i}, Y_{i}; θ) | S_{i}, δ_{i} = 0} = E {φ (x_{i}, Y_{i}; θ) | x_{i}, δ_{i} = 0}

with

S_{i} = S (x_{i})

. Under this assumption, we have

E [\frac{δ_{i}}{π (x_{i}, Y_{i}; α)} φ (x_{i}, Y_{i}; θ) + \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α)}\} m_{φ}^{0} (S_{i}; θ, α)] = 0,

where

m_{φ}^{0} (S_{i}; θ, α) = E {δ_{i} φ (x_{i}, Y_{i}; θ) O (x_{i}, Y_{i}; α) | S_{i}} / E {δ_{i} O (x_{i}, Y_{i}; α) | S_{i}}

. Consequently, the kernel-assisted estimating equations can be constructed as

{\hat{φ}}^{*} (x_{i}, Y_{i}; θ, α) = \frac{δ_{i}}{π (x_{i}, Y_{i}; α)} φ (x_{i}, Y_{i}; θ) + \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α)}\} m_{φ}^{0} (S_{i}; θ, α),

where

m_{φ}^{0} (S_{i}; θ, α)

is structurally identical to

m_{φ}^{0} (x_{i}; θ, α)

except that

x

is replaced by S. Given

{\hat{α}}_{p}

, one can obtain a semiparametric dimension reduction EL estimator

{\hat{θ}}^{*}

based on

{\hat{φ}}^{*} (x_{i}, Y_{i}; θ, α)

. It is common to assume that the working index

S = S (x, γ^{*})

involves an unknown parameter vector

γ^{*}

. If an estimator

{\hat{γ}}^{*}

is available, following the arguments of Hu et al. [29], we can show that the resulting EL estimator based on

{\hat{φ}}^{*} (x_{i}, Y_{i}; θ, α)

is asymptotically equivalent to

{\hat{θ}}_{2}

when

{\hat{γ}}^{*} - γ^{*} = O_{p} (n^{- 1 / 2})

.

3.4. Asymptotic Variance Estimation

In order to construct confidence regions for the proposed estimators, it is essential to estimate their asymptotic variances

Σ_{1}

and

Σ_{2}

consistently from the sample

{(x_{i}, Y_{i}, δ_{i}) : i = 1, \dots, n}

. To achieve this, we employ the plug-in method in the simulation studies. For instance, the sample-based estimate

{\hat{V}}_{1}

of

V_{1}

is

{\hat{V}}_{1} = \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i}}{π^{2} (x_{i}, Y_{i}; {\hat{α}}_{p})} ϖ^{- 4} (x_{i}) \nabla_{i} {({\hat{θ}}_{1})}^{\otimes 2} {\hat{ε}}_{i}^{2} .

Other estimates for

Γ

,

V_{2}

,

B_{1}

,

B_{2}

,

Ω_{1}

and

Ω_{2}

can be obtained in a similar manner. We omit the details here for brevity.

While the plug-in method is effective in NMAR settings, it can be computationally intensive due to the complexity of the asymptotic variances involved. As an alternative, particularly when dealing with large datasets, the bootstrap procedure provides an effective approach for approximating these asymptotic variances. This method, which has been explored in studies such as those of Zhao et al. [30] and Jiang et al. [31] for NMAR data, can help alleviate computational challenges and provide more practical estimations in large-scale applications.
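As a rough illustration of the bootstrap alternative, the following minimal sketch resamples the observed triples with replacement and reports the standard deviation of the re-computed estimates. The function name `bootstrap_se` and the callable `estimator` are hypothetical; in the present setting `estimator` would have to refit both the nonresponse mechanism and the regression parameters on each resample, which is not shown here.

```python
import random
import statistics


def bootstrap_se(sample, estimator, n_boot=200, seed=0):
    """Nonparametric bootstrap standard error of a scalar estimator.

    sample    : list of observations, e.g. tuples (x_i, Y_i, delta_i)
    estimator : hypothetical callable mapping a list of observations to a float
    n_boot    : number of bootstrap replications
    """
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    # The spread of the replicated estimates approximates the standard error.
    return statistics.stdev(estimates)
```

For a vector-valued estimator, the same loop can be run componentwise or the sample covariance matrix of the replicated estimates can be used instead.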

4. Simulation Study

Finite-sample performance of the proposed methods is evaluated through Monte Carlo experiments. For bandwidth selection, we implement the data-driven approaches of Zhou et al. [32] and Tang et al. [23], adopting the rule-of-thumb bandwidth:

h_{n} = {\hat{σ}}_{X} n^{- 1 / 3}

, where

{\hat{σ}}_{X}

denotes the sample standard deviation of the observed covariate X. This practical bandwidth selector balances asymptotic optimality and computational simplicity.
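The rule-of-thumb bandwidth is simple to compute. The sketch below is only an illustration of the formula above; the function name is hypothetical, and the sample standard deviation is taken with the usual n - 1 divisor, which is an assumption since the divisor is not stated.

```python
import statistics


def rule_of_thumb_bandwidth(x_values):
    """Return h_n = sd(X) * n^(-1/3) for a list of covariate values."""
    n = len(x_values)
    if n < 2:
        raise ValueError("at least two observations are required")
    # Sample standard deviation (n - 1 divisor; assumed, not stated in the text).
    sigma_hat = statistics.stdev(x_values)
    return sigma_hat * n ** (-1.0 / 3.0)
```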

4.1. Simulation 1

Let

f (x_{i}; θ) = exp (X_{i 1} θ_{1} + X_{i 2} θ_{2})

, and

ϖ (x_{i}) = \sqrt{exp (X_{i 1} X_{i 2})}

. The true parameter is set as

θ_{0} = {(0.8, 1)}^{⊤}

, and the error term

ε_{i} \sim N (0, 0 . 5^{2})

. The covariates are generated under two scenarios: In Model A,

X_{i 1}

and

X_{i 2}

are independently sampled from

U (0, 1)

; in Model B,

X_{i 1} \sim U (0, 1)

while

X_{i 1} = X_{i 2} + ε_{i}^{*}

(

ε_{i}^{*} \sim U (- 1, 1)

), allowing us to examine the impact of covariate dependence. We use a sample size of

n = 150

, with response variables generated following the specification in model (1).

The missing indicator follows the nonignorable mechanism

δ_{i} \sim Bernoulli (π (x_{i}, Y_{i}; α)), π (x_{i}, Y_{i}; α) = \frac{exp (α_{0} + α_{y} Y_{i})}{1 + exp (α_{0} + α_{y} Y_{i})},

where

α = {(0.05, 0.05)}^{⊤}

. The covariates

X_{i 1}

and

X_{i 2}

serve as instrumental variables. The parameter

α

is estimated using the penalized semiparametric likelihood method, incorporating the following auxiliary information matrix:

\begin{matrix} Δ (x_{i}, Y_{i}; α) = (\begin{matrix} δ_{i} π^{- 1} (x_{i}, Y_{i}; α) (X_{i 1} - \bar{X_{1}}) \\ δ_{i} π^{- 1} (x_{i}, Y_{i}; α) (X_{i 2} - \bar{X_{2}}) \end{matrix}), \end{matrix}

where

\bar{X_{1}} = n^{- 1} \sum_{i = 1}^{n} X_{i 1}

and

\bar{X_{2}} = n^{- 1} \sum_{i = 1}^{n} X_{i 2}

. We adopt the product Gaussian kernel

K (u_{1}, u_{2}) = e^{- (u_{1}^{2} + u_{2}^{2}) / 2} / (2 π)

and set the bandwidth as

h = {\hat{σ}}_{X_{1}} n^{- 1 / 3}

following Tang et al. [23].
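The data-generating design of this simulation can be sketched as follows. This is an illustration under stated assumptions, not the code used for the reported results: model (1) is taken to have the form Y_i = f(x_i; θ) + ϖ(x_i) ε_i, the function name `simulate_sim1` is hypothetical, and a missing response is stored as `None`.

```python
import math
import random


def simulate_sim1(n=150, model="A", theta=(0.8, 1.0), alpha=(0.05, 0.05), seed=0):
    """Generate (X1, X2, Y or None, delta) under Model A or Model B."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0)
        if model == "A":
            x2 = rng.uniform(0.0, 1.0)          # independent covariates
        else:
            e_star = rng.uniform(-1.0, 1.0)
            x2 = x1 - e_star                    # equivalent to X1 = X2 + e*
        eps = rng.gauss(0.0, 0.5)               # N(0, 0.5^2) error
        mean = math.exp(x1 * theta[0] + x2 * theta[1])
        scale = math.sqrt(math.exp(x1 * x2))
        y = mean + scale * eps                  # assumed form of model (1)
        # Nonignorable logistic response probability depending on Y only.
        pi = 1.0 / (1.0 + math.exp(-(alpha[0] + alpha[1] * y)))
        delta = 1 if rng.random() < pi else 0
        data.append((x1, x2, y if delta == 1 else None, delta))
    return data
```

Because the response probability depends on Y but not directly on the covariates, both covariates act as instruments in this design.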

Based on 500 independent replications, the proposed penalized method achieves an average IV selection rate of 92.8% for

X_{1}

and

X_{2}

, demonstrating its effectiveness. For Model A and Model B, the 95% confidence regions for parameter

θ

and their coverage probabilities are calculated based on the EL methods

EL ({\hat{θ}}_{1})

and

EL ({\hat{θ}}_{2})

, as well as the normal approximation approaches

NA ({\hat{θ}}_{1})

and

NA ({\hat{θ}}_{2})

. The simulation results for the confidence regions are displayed in Figure 1.

The left panel of Figure 1 presents the simulated confidence regions for Model A based on the four aforementioned approaches, whereas the right panel displays the corresponding results for Model B. Several notable findings emerge from Figure 1. First, the confidence regions constructed using EL approaches are smaller than those based on NA methods, indicating the superior efficiency of EL-based inference. Second, the EL-based confidence region for

{\hat{θ}}_{2}

is smaller than that for

{\hat{θ}}_{1}

, highlighting the enhanced efficiency of the AIPW estimator relative to the IPW estimator. Third, even in the presence of correlation between covariates in Model B, the EL and NA approaches yield confidence regions similar to those in Model A, implying the stability of these methods. The coverage probabilities for all four approaches are comparable in models A and B, closely aligning with the nominal 95% level. Overall, the EL-based approaches demonstrate superior efficiency relative to the NA-based methods, and the AIPW-based estimation is shown to be more efficient than the IPW-based estimation.

4.2. Simulation 2

We consider the regression model with nonlinear components

Y_{i} = θ_{0} + \sum_{k = 1}^{4} θ_{k} X_{i k} + exp (θ_{5} X_{i 5}) + 0.5 ε_{i}, i = 1, \dots, n,

where the true parameter vector is

θ_{0} = {(2.5, 0.5, - 1.5, 0.5, - 1, 0.5)}^{⊤}

. The covariates

x_{i} = {(X_{i 1}, \dots, X_{i 5})}^{⊤}

follow the multivariate normal distribution

N_{5} (0, Σ)

with covariance matrix

Σ = {(0 . 5^{| i - j |})}_{5 \times 5}

. The error terms

ε_{i}

are independently distributed as

N (0, 1)

.

The nonresponse mechanism follows the nonignorable logistic model

π (x_{i}, Y_{i}; α) = {\{1 + exp (- α_{0} - \sum_{k = 1}^{5} α_{k} X_{i k} - α_{y} Y_{i})\}}^{- 1},

where

α = {(0.5, 0, 0, 0.8, 0, 0, 0.5)}^{⊤}

. The IV

z = (X_{1}, X_{2}, X_{4}, X_{5})

is identified through the proposed penalized estimation process. To address high-dimensional challenges, we employ MAR-based propensity score estimation

\tilde{s} (x, {\hat{γ}}^{*}) = {\{1 + exp (- {\hat{γ}}_{0}^{*} - {\hat{γ}}_{1}^{* ⊤} x)\}}^{- 1},

where

{\hat{γ}}^{*}

denotes the maximum likelihood estimates. This enables the construction of a semiparametric estimator

{\hat{m}}_{φ}^{0} (\tilde{s}, α, θ) = \frac{\sum_{i = 1}^{n} δ_{i} O (x_{i}, Y_{i}; α) F_{h} (\tilde{s} - {\tilde{s}}_{i}) φ (x_{i}, Y_{i}; θ)}{\sum_{i = 1}^{n} δ_{i} O (x_{i}, Y_{i}; α) F_{h} (\tilde{s} - {\tilde{s}}_{i})},

where

F_{h} (\cdot)

represents a univariate kernel density function. For each data generating mechanism, we generate 500 Monte Carlo random samples of sizes 150 and 250.
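The kernel estimator above is a weighted Nadaraya-Watson average over respondents. The following minimal sketch shows the computation for a scalar component of φ. It assumes a Gaussian kernel for F_h and takes the odds O(x_i, Y_i; α) as precomputed inputs (for instance the odds of nonresponse, {1 - π}/π, which is an assumption here); all function names are hypothetical.

```python
import math


def gaussian_kernel(u, h):
    """Scaled Gaussian kernel F_h(u) = K(u / h) / h."""
    return math.exp(-0.5 * (u / h) ** 2) / (h * math.sqrt(2.0 * math.pi))


def m_hat_phi(s, scores, deltas, odds, phis, h):
    """Odds-weighted kernel regression of phi on the estimated score.

    s      : evaluation point (estimated propensity score)
    scores : estimated scores s_i for all subjects
    deltas : response indicators (1 = observed)
    odds   : values O(x_i, Y_i; alpha); only used for respondents
    phis   : values phi(x_i, Y_i; theta); may be None for nonrespondents
    h      : bandwidth
    """
    num = 0.0
    den = 0.0
    for s_i, d_i, o_i, phi_i in zip(scores, deltas, odds, phis):
        if d_i == 0:
            continue  # nonrespondents contribute nothing to either sum
        weight = o_i * gaussian_kernel(s - s_i, h)
        num += weight * phi_i
        den += weight
    if den == 0.0:
        raise ZeroDivisionError("no respondents with positive kernel weight")
    return num / den
```

Since the score is one-dimensional, a single bandwidth suffices regardless of the number of covariates, which is the point of the dimension reduction.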

Table 1 summarizes the finite-sample performance of the proposed method, presenting three key metrics for nonzero components in

α

: empirical bias (Bias), root mean square (RMS) error, and standard deviation (SD). The variable selection outcomes are quantified through two measures: “T” (mean count of correctly excluded irrelevant variables) and “F” (mean count of erroneously excluded significant variables). Table 2 compares the estimation accuracy of regression coefficients between IPW and AIPW approaches, reporting their respective bias, RMS, and SD values.
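For concreteness, the summary measures can be computed from the Monte Carlo replicates as in the sketch below. The function names are hypothetical, and the tolerance used to decide whether a coefficient has been shrunk to zero is an assumption.

```python
import math
import statistics


def summarize(estimates, truth):
    """Bias, SD and RMS of a list of scalar estimates of a known true value."""
    bias = statistics.fmean(estimates) - truth
    sd = statistics.stdev(estimates)
    rms = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    return {"Bias": bias, "SD": sd, "RMS": rms}


def selection_counts(alpha_hats, alpha_true, tol=1e-8):
    """Return (T, F): mean counts of correctly and wrongly zeroed coefficients."""
    t_total = 0
    f_total = 0
    for alpha_hat in alpha_hats:
        for est, true in zip(alpha_hat, alpha_true):
            if abs(est) <= tol:        # coefficient estimated as zero
                if true == 0:
                    t_total += 1       # truly irrelevant, correctly excluded
                else:
                    f_total += 1       # truly relevant, wrongly excluded
    reps = len(alpha_hats)
    return t_total / reps, f_total / reps
```

Because the squared RMS is approximately the squared SD plus the squared bias, SD and RMS values that nearly coincide indicate a negligible bias.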

The principal findings emerge as follows:

(1) Variable Selection Efficacy: The penalized semiparametric likelihood method demonstrates robust variable selection capability in the nonresponse mechanism model, effectively distinguishing between relevant and irrelevant covariates.

(2) Estimation Precision: For active components in

α

, the small bias and the close agreement between the SD and RMS values indicate that the estimator is accurate.

(3) MELE Performance Consistency: For both

{\hat{θ}}_{1}

and

{\hat{θ}}_{2}

, the SD and RMS are nearly identical, indicating that the bias of the MELEs is negligible when the nonresponse mechanism is estimated by the penalized semiparametric likelihood approach.

(4) Sample Size Effects: Estimation efficiency improves with larger samples for both the nonresponse mechanism model and the regression model.

5. Application to the ACTG 175 Data

We demonstrate the proposed methodology using data from the AIDS Clinical Trials Group Protocol 175 (ACTG 175) involving 2139 HIV-infected participants (Hammer et al. [33]). Following the established analytical approaches of Davidian et al. [34], Tsiatis et al. [35], and Han [8], we classify treatments into two groups: zidovudine (ZDV) monotherapy (532 subjects) versus combined therapies (1607 subjects). The analysis focuses on CD4 counts at

96 \pm 5

weeks post-baseline (

Y = CD 4_{96}

) as the primary endpoint, with the following covariates:

Treatment assignment ( $X_{1}$ : 0 = ZDV monotherapy)
Baseline CD4 count ( $X_{2}$ : $CD 4_{0}$ )
Demographic covariates: age ( $X_{3}$ ), weight ( $X_{4}$ ), race ( $X_{5}$ : 0 = White), gender ( $X_{6}$ : 0 = Female)
Clinical covariates: antiretroviral history ( $X_{7}$ : 0 = naive), early treatment termination ( $X_{8}$ : 0 = completed)

The binary indicator variable r encodes the missingness status of the response Y, where

r_{i} = 1

indicates an observed outcome, and

r_{i} = 0

denotes a missing value. Previous studies of Davidian et al. [34], Tsiatis et al. [35], and Han [8] assumed that the missingness mechanism depends solely on covariates through a MAR framework. Our penalized semiparametric likelihood approach enhances robustness by incorporating shrinkage estimation within the nonresponse mechanism model. Specifically, shrinkage of the response variable coefficient toward zero provides formal evidence supporting the MAR assumption, while a nonzero estimate suggests NMAR.

To facilitate direct comparison with Han [8], we specialize the general model (1) to a linear regression framework

Y = θ_{1} + \sum_{l = 2}^{9} θ_{l} X_{l - 1} + ε, E (ε | X) = 0,

(6)

where

X_{1}

–

X_{8}

represent the baseline covariates defined previously. The nonresponse mechanism is parameterized via logistic regression

P (r = 1 | X, Y) = \frac{exp (α_{0} + \sum_{k = 1}^{8} α_{k} X_{k} + α_{9} Y)}{1 + exp (α_{0} + \sum_{k = 1}^{8} α_{k} X_{k} + α_{9} Y)},

with parameter vector

α = {(α_{0}, \dots, α_{9})}^{⊤}

. To address dimensionality challenges, we implement the dimension-reduction strategy described in Section 4.2, constructing consistent estimators for

m_{φ}^{0} (x; θ, α)

through MAR-based nonresponse mechanism weighting.

The penalized semiparametric likelihood estimates are presented in Table 3, with p-values calculated using 200 bootstrap replications (Efron and Tibshirani [36]). Age (

α_{3}

) and weight (

α_{4}

) do not contribute significantly to the nonresponse mechanism: their coefficients are shrunk to zero, with p-values exceeding 0.1. The significant coefficient for CD4 counts at

96 \pm 5

weeks (

α_{9}

) indicates an NMAR mechanism in this dataset.

Table 4 presents the analysis results for model (6), with standard errors estimated through 200 bootstrap replications. Comparative results from the complete-case analysis and the multiply robust method of Han [8] are also included. The nonsignificant predictors include age, weight, and gender. The analysis reveals five critical clinical insights:

(1) Treatment Superiority: Combination antiretroviral therapies (Trt = 1) demonstrate significantly higher CD4 counts at

96 \pm 5

weeks compared to ZDV monotherapy, establishing the enhanced therapeutic effectiveness of newer regimens.

(2) Baseline Predictive Power: Baseline CD4 counts (CD4₀) show significant positive association with follow-up counts, confirming their prognostic value in HIV management.

(3) Racial Disparity: White patients maintain a clinically significant CD4 count advantage over nonwhite patients, suggesting differential disease progression trajectories.

(4) Treatment History Impact: Antiretroviral-experienced patients exhibit substantially reduced CD4 counts compared to naive patients, indicating potential cumulative treatment effects.

(5) Adherence Consequences: Early treatment discontinuation is associated with a marked reduction in CD4 count, underscoring the critical importance of sustained therapeutic engagement.

6. Conclusions and Future Work

We developed a penalized semiparametric likelihood approach that resolves the identification challenges in nonignorable missing data analysis. The proposed estimator achieves the oracle properties under appropriate tuning parameter selection as established in our theoretical framework. The construction of profile EL ratio functions incorporated IPW and AIPW estimating equations. Our analysis demonstrated that when using consistently estimated nonresponse mechanism parameters, the ELLRFs follow an asymptotic weighted

χ^{2}

distribution. Furthermore, we systematically established the asymptotic normality of regression parameter estimators. Simulation studies and real-data applications confirmed the method’s practical effectiveness in both parameter estimation and variable selection. Comparative analyses revealed superior performance over existing approaches in handling nonignorable nonresponse data.

In practical applications, nonlinear regression models often involve high-dimensional covariates, which can lead to sparsity within the model. The direct application of the proposed estimation procedure in such contexts may lead to biased estimates. One potential approach to address this challenge is the application of the penalized EL method, as studied by Ren and Zhang [37], for model selection. It could effectively balance the model complexity and goodness of fit, thereby reducing the bias induced by high-dimensional covariates. This extension requires a systematic and separate investigation within the NMAR framework. A detailed exploration of this important issue will be undertaken in future research.

Author Contributions

Conceptualization, X.D. and X.L.; methodology, X.D. and X.L.; validation, X.D.; formal analysis, X.L.; investigation, X.D.; writing— original draft, X.D.; writing—review and editing, X.D. and X.L.; supervision, X.L.; project administration, X.D.; funding acquisition, X.D. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Numbers 12426666 and 12426668), Zhongwu Young Teachers Program for the Innovative Talents of Jiangsu University of Technology, and Doctoral Research Project of Yuncheng University (YQ-2023074).

Data Availability Statement

The real data that are used to illustrate the proposed methods are available at https://github.com/dingxianwen-dxw/ACTG175 (accessed on 22 March 2025).

Acknowledgments

The authors wish to thank the Editor-in-Chief, the Associate Editor and two reviewers for their many helpful and insightful comments and suggestions that greatly improved the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

To establish the proofs for Theorem 1, we first introduce some essential notations and supporting lemmas.

The log-likelihood

ℓ (α, W)

(after profiling out the

p_{i}

values) can be rewritten as

\begin{matrix} ℓ (α, W, v) = ℓ_{1} (α, W, v) + ℓ_{2} (W), \end{matrix}

where

v = {(v_{1}^{⊤}, v_{2})}^{⊤}

with

v_{1} = λ_{1}

and

v_{2} = λ_{2} - 1 / W

. The components of the log-likelihood function are defined as follows:

ℓ_{1} (α, W, v) = - \sum_{i = 1}^{n_{1}} \log \{1 + v^{⊤} h (x_{i}, Y_{i}; α, W)\},

ℓ_{2} (W) = n_{1} \log W + (n - n_{1}) \log (1 - W),

where the function

h (x_{i}, Y_{i}; α, W)

is given by

h (x_{i}, Y_{i}; α, W) = {(\frac{W Δ^{⊤} (x_{i})}{π (x_{i}, Y_{i}; α)}, \frac{W}{π (x_{i}, Y_{i}; α)} \{W - π (x_{i}, Y_{i}; α)\})}^{⊤} .

Following the arguments presented by Qin et al. [21], it can be shown that

\begin{matrix} \sum_{i = 1}^{n_{1}} \frac{h (x_{i}, Y_{i}; α, W)}{1 + v^{⊤} h (x_{i}, Y_{i}; α, W)} = 0 . \end{matrix}

(A1)

It is worthwhile to note that

E_{c} {h (x_{i}, Y_{i}; α)} = 0

.

Lemma A1.

Assume that

{sup}_{ξ \in ℵ} | | h (x_{i}, Y_{i}; ξ) | | \leq W (x_{i}, Y_{i})

, where

E {W (x_{i}, Y_{i})}^{κ} < \infty

for some constant

κ > 2

. Then, for any

1 / κ < ℏ < 1 / 2

and

Λ_{n_{1}} = {v : | | v | | \leq n_{1}^{- ℏ}}

, we have

sup_{ξ \in ℵ, v \in Λ_{n_{1}}, 1 \leq i \leq n_{1}} | v^{⊤} h (x_{i}, Y_{i}; ξ) | = o_{p} (1) .

Proof of Lemma A1.

By Markov’s inequality, we obtain

max_{1 \leq i \leq n_{1}} sup_{ξ \in ℵ} | | h (x_{i}, Y_{i}; ξ) | | = O_{p} (n_{1}^{\frac{1}{κ}}) .

Then, applying the Cauchy–Schwarz inequality, we have

sup_{ξ \in ℵ, v \in Λ_{n_{1}}, 1 \leq i \leq n_{1}} | v^{⊤} h (x_{i}, Y_{i}; ξ) | \leq n_{1}^{- ℏ} O_{p} (n_{1}^{1 / κ}) = O_{p} (n_{1}^{\frac{1}{κ} - ℏ}) = o_{p} (1) .

This completes the proof. □
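The first step of the proof can be spelled out as follows. This is a sketch that assumes the observations are identically distributed and combines the union bound with Markov's inequality applied to the κ-th power of the envelope function; here W(x_i, Y_i) denotes the envelope in the statement of Lemma A1. For any constant M > 0,

```latex
\begin{aligned}
P\Big\{\max_{1 \le i \le n_1} W(x_i, Y_i) > M n_1^{1/\kappa}\Big\}
&\le \sum_{i=1}^{n_1} P\big\{W(x_i, Y_i)^{\kappa} > M^{\kappa} n_1\big\} \\
&\le n_1 \cdot \frac{E\{W(x_i, Y_i)^{\kappa}\}}{M^{\kappa} n_1}
 = \frac{E\{W(x_i, Y_i)^{\kappa}\}}{M^{\kappa}} .
\end{aligned}
```

The right-hand side can be made arbitrarily small by taking M large, which gives the stated stochastic order for the maximum of the norms of h, since each norm is bounded by the envelope.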

Lemma A2.

Under condition A(1), let

ξ = {(α^{⊤}, W)}^{⊤}

denote the parameter vector, and let its true value be

ξ_{0} = {(α_{0}^{⊤}, W_{0})}^{⊤}

. Define

H (ξ) = E {h (x_{i}, Y_{i}; ξ) h^{⊤} (x_{i}, Y_{i}; ξ)}

and assume

H (ξ)

is a positive definite matrix. Then, for all

ξ \in {ξ : | | ξ - ξ_{0} | | = O_{p} (n_{1}^{- 1 / 2})}

, we have

\begin{matrix} (1) v (ξ) = H^{- 1} (ξ) \frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} h (x_{i}, Y_{i}; ξ) + o_{p} (n_{1}^{- 1 / 2}); (2) v (\hat{ξ}) = O_{p} (n_{1}^{- 1 / 2}) . \end{matrix}

Proof of Lemma A2.

We begin by considering the first part of Lemma A2. By Lemma A1, applying the Taylor series expansion to Equation (A1) yields

\sum_{i = 1}^{n_{1}} {1 - v^{⊤} h (x_{i}, Y_{i}; ξ) (1 + o_{p} (1))} h (x_{i}, Y_{i}; ξ) = 0

, which establishes the desired result. The second part of Lemma A2 follows directly from Owen [13]. □

Proof of Theorem 1.

We begin by considering the first part of Theorem 1. By noting that

ξ = {(α^{⊤}, W)}^{⊤}

, we have

ℓ_{p} (α, W) = ℓ_{p} (ξ, v) = ℓ_{1} (ξ, v) + ℓ_{2} (W) - n_{1} \sum_{j = 1}^{d} g_{γ} (| α_{j} |)

. Let

ϱ_{n_{1}} = {n_{1}}^{- 1 / 2}

. Following the arguments of Fan and Li [20], it is necessary to show that for any given

ϵ > 0

, there exists a sufficiently large constant C such that

P [\sup_{| | b | | = C} ℓ_{p} (ξ_{0} + ϱ_{n_{1}} b) < ℓ_{p} (ξ_{0})] \geq 1 - ϵ .

This result implies the existence of a local maximum

\hat{ξ}

of

ξ

in the ball

{ξ_{0} + ϱ_{n_{1}} b : | | b | | \leq C}

.

From the condition

g_{γ} (0) = 0

, we have

\begin{matrix} ℓ_{p} (ξ_{0} + ϱ_{n_{1}} b, v) - ℓ_{p} (ξ_{0}, v) & \leq & ℓ_{1} (ξ_{0} + ϱ_{n_{1}} b, v) - ℓ_{1} (ξ_{0}, v) + ℓ_{2} (W_{0} + ϱ_{n_{1}} b) - ℓ_{2} (W_{0}) \\ & & - n_{1} \sum_{j = 1}^{k} \{g_{γ} (| α_{0 j} + ϱ_{n_{1}} b_{j} |) - g_{γ} (| α_{0 j} |)\} \\ & : = & I_{1} + I_{2} + I_{3}, \end{matrix}

where k is the number of components of

α_{10}

. Taking the Taylor series expansion of

ℓ_{1} (ξ)

around

ξ_{0}

yields

\begin{matrix} ℓ_{1} (ξ, v) = - \sum_{i = 1}^{n_{1}} \log {1 + v^{⊤} (ξ) h (x_{i}, Y_{i}; ξ)} = - \sum_{i = 1}^{n_{1}} v^{⊤} (ξ) h (x_{i}, Y_{i}; ξ) {1 + o_{p} (1)} . \end{matrix}

Following the results of Lemma A2, we have

\begin{matrix} I_{1} & = & \{- n_{1} {\bar{h}}^{⊤} (ξ) H^{- 1} (ξ) \bar{h} (ξ) + n_{1} {\bar{h}}^{⊤} (ξ_{0}) H^{- 1} (ξ_{0}) \bar{h} (ξ_{0})\} \{1 + o_{p} (1)\} \\ & = & - n_{1} F^{⊤} [H^{- 1} (ξ_{0}) \{1 + o_{p} (1)\}] F + n_{1} {\bar{h}}^{⊤} (ξ_{0}) H^{- 1} (ξ_{0}) \bar{h} (ξ_{0}) \\ & = & - 2 n_{1} ϱ_{n_{1}} {\bar{h}}^{⊤} (ξ_{0}) H^{- 1} (ξ_{0}) U^{⊤} (ξ_{0}) b \{1 + o_{p} (1)\} \\ & & - n_{1} ϱ_{n_{1}}^{2} b^{⊤} U (ξ_{0}) H^{- 1} (ξ_{0}) U^{⊤} (ξ_{0}) b \{1 + o_{p} (1)\} \\ & : = & I_{11} + I_{12}, \end{matrix}

where

F = {{\bar{h}}^{⊤} (ξ_{0}) + U^{⊤} (ξ_{0}) {1 + o_{p} (1)} ϱ_{n_{1}} b}

,

\bar{h} (ξ) = 1 / n_{1} \sum_{i = 1}^{n_{1}} h (x_{i}, Y_{i}; ξ)

and

U (ξ_{0}) = E {\partial h (x, Y; ξ_{0}) / \partial α}

. Because

ℓ_{2} (W)

is the log binomial likelihood, we have

I_{2} = ℓ_{2} (W_{0} + ϱ_{n_{1}} b) - ℓ_{2} (W_{0}) < 0 .

It follows from the Taylor expansion that

\begin{matrix} I_{3} & = & - \sum_{j = 1}^{k} n_{1} ϱ_{n_{1}} g_{γ}^{'} (| α_{0 j} |) sign (α_{0 j}) b_{j} - \frac{1}{2} \sum_{j = 1}^{k} n_{1} ϱ_{n_{1}}^{2} g_{γ}^{″} (| α_{0 j} |) b_{j}^{2} \{1 + o_{p} (1)\} \\ & : = & I_{31} + I_{32} . \end{matrix}

Note that

\begin{matrix} | I_{31} | & \leq & \sum_{j = 1}^{k} | n_{1} ϱ_{n_{1}} g_{γ}^{'} (| α_{0 j} |) sign (α_{0 j}) b_{j} | \leq \sqrt{k} n_{1} ϱ_{n_{1}} \max_{1 \leq j \leq k} g_{γ}^{'} (| α_{0 j} |) | | b | | \leq n_{1} ϱ_{n_{1}}^{2} | | b | |, \\ | I_{32} | & \leq & \sum_{j = 1}^{k} | n_{1} ϱ_{n_{1}}^{2} g_{γ}^{″} (| α_{0 j} |) b_{j}^{2} \{1 + o_{p} (1)\} | \leq \max_{1 \leq j \leq k} \{| g_{γ}^{″} (| α_{0 j} |) | : α_{0 j} \neq 0\} n_{1} ϱ_{n_{1}}^{2} {| | b | |}^{2} . \end{matrix}

When

| | b | |

is chosen to be large enough such that

I_{12}

dominates the other terms

I_{11}

,

I_{31}

and

I_{32}

, and taking into account the negative term

I_{2}

, we conclude that

ℓ_{p} (ξ_{0} + ϱ_{n_{1}} b, v) - ℓ_{p} (ξ_{0}, v)

is negative with probability at least 1 - ϵ. Thus, the first part of Theorem 1 holds.

Now, we proceed to prove Theorem 1 (ii).

By Lemma A1, we have

v^{⊤} (ξ) h (x_{i}, Y_{i}; ξ) = o_{p} (1)

. Taking the Taylor series expansion of the first partial derivative of

ℓ_{p} (ξ, v)

at

α_{j} (j \notin B)

yields

\begin{matrix} \frac{1}{n_{1}} \frac{\partial ℓ_{p} (ξ)}{\partial α_{j}} & = & \frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} v^{⊤} (ξ) \frac{\partial h (x_{i}, Y_{i}; ξ)}{\partial α_{j}} \{1 + o_{p} (1)\} - g_{γ}^{'} (| α_{j} |) sign (α_{j}) \\ & = & \frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} v^{⊤} (ξ) \{\frac{\partial h (x_{i}, Y_{i}; ξ_{0})}{\partial α_{j}} + \frac{\partial^{2} h (x_{i}, Y_{i}; ξ_{0})}{\partial α_{j} \partial α^{⊤}} {(α - α_{0})}^{⊤}\} \{1 + o_{p} (1)\} \\ & & - g_{γ}^{'} (| α_{j} |) sign (α_{j}) \\ & : = & T_{1 j} + T_{2 j} + T_{3 j} + o_{p} (n_{1}^{- 1 / 2}) . \end{matrix}

Let

U_{j} (α)

denote the jth column vector of the matrix

U (α) = E {\partial h (x, Y; α, W) / \partial α}

. It follows from Assumptions A(7)–A(9) and Lemma A2 that

\begin{matrix} \max_{j \notin B} (| T_{1 j} |) & \leq & \max_{j \notin B} [| v^{⊤} U_{j} (α_{0}) | + |v^{⊤} \{\frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} \frac{\partial h (x_{i}, Y_{i}; α_{0}, W)}{\partial α_{j}} - U_{j} (α_{0})\}|] \\ & \leq & \max_{j \notin B} | v^{⊤} U_{j} (α_{0}) | + | | v | | ||\frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} \frac{\partial h (x_{i}, Y_{i}; α_{0}, W)}{\partial α_{j}} - U_{j} (α_{0})|| \\ & = & O_{p} (n_{1}^{- 1 / 2}) . \end{matrix}

Furthermore, by Assumption A(9), we have

\begin{matrix} \max_{j \notin B} (| T_{2 j} |) \leq C \frac{1}{\sqrt{n_{1}}} \sum_{l = 1}^{k} |v^{⊤} \frac{\partial^{2} h (x_{i}, Y_{i}; ξ_{0})}{\partial α_{j} \partial α_{l}}| = O_{p} \{{(\frac{1}{\sqrt{n_{1}}})}^{2}\} = o_{p} (\frac{1}{\sqrt{n_{1}}}) . \end{matrix}

So we obtain

\frac{1}{n_{1}} \frac{\partial ℓ_{p} (ξ)}{\partial α_{j}} = γ \{- g_{γ}^{'} (| α_{j} |) sign (α_{j}) / γ + O_{p} (1 / (\sqrt{n} γ))\}

, which implies that the sign of

1 / n_{1} \partial ℓ_{p} (ξ) / \partial α_{j}

is dominated by the sign of

α_{j}

. Thus, for any

j \notin B

and

n \to \infty

, we have

1 / n_{1} \partial ℓ_{p} (ξ) / \partial α_{j} < 0

when

α_{j} \in (0, C / \sqrt{n})

and

1 / n_{1} \partial ℓ_{p} (ξ) / \partial α_{j} > 0

when

α_{j} \in (- C / \sqrt{n}, 0)

with probability tending to one. This result implies that

{\hat{α}}_{z} = 0

with probability tending to one. Therefore, Theorem 1 (ii) holds.

Now we proceed to prove Theorem 1 (iii).

For simplicity of notation, we temporarily denote

h^{i} = h (x_{i}, Y_{i}; ξ)

. Let

I_{d} = {(H_{1}^{⊤}, H_{2}^{⊤})}^{⊤}

denote the d × d identity matrix, where

H_{1} \in R^{| B | \times d}

and

H_{2} \in R^{(d - | B |) \times d}

with

| B |

being the cardinality of

B

. Let

\begin{matrix} S (v, α, τ) & = & \frac{1}{n_{1}} \{- \sum_{i = 1}^{n_{1}} \log (1 + v^{⊤} h^{i}) + n_{1} \log W + (n - n_{1}) \log (1 - W)\} \\ & & - \sum_{j = 1}^{d} g_{γ} (| α_{j} |) - τ^{⊤} H_{2} α, \end{matrix}

where

τ \in R^{d - | B |}

is another Lagrange multiplier vector. The penalized likelihood

ℓ_{p} (v, α, τ)

can be rewritten as

ℓ_{p} (v, α, τ) = n_{1} S (v, α, τ)

. Let

\begin{matrix} S_{1} (v, α, τ) & = & \frac{\partial S (v, α, τ)}{\partial v} = - \frac{1}{n_{1}} \sum_{i = 1}^{n_{1}} \frac{h^{i}}{1 + v^{⊤} h^{i}} = - \sum_{i = 1}^{n_{1}} π_{i}^{*} h^{i}, \\ S_{2} (v, α, τ) & = & \frac{\partial S (v, α, τ)}{\partial α} = - \sum_{i = 1}^{n_{1}} π_{i}^{*} \partial_{α}^{⊤} h^{i} v - w (α) - H_{2}^{⊤} τ, \\ S_{3} (v, α, τ) & = & \frac{\partial S (v, α, τ)}{\partial τ} = - H_{2} α, \end{matrix}

where

w (α) = {(g_{γ}^{'} (| α_{1} |) sign (α_{1}), \dots, g_{γ}^{'} (| α_{k} |) sign (α_{k}), 0, \dots, 0)}^{⊤}

,

π_{i}^{*} = {n_{1} (1 + v^{⊤} h^{i})}^{- 1}

, and

\partial_{α} h^{i} = \partial h^{i} / \partial α^{⊤}

. Thus,

\hat{v}, \hat{α}

and

\hat{τ}

satisfy

S_{t} (\hat{v}, \hat{α}, \hat{τ}) = 0

for

t = 1, 2, 3

. Let

\hat{H} (α_{0}) = 1 / n_{1} \sum_{i = 1}^{n_{1}} h (x_{i}, Y_{i}; α_{0}, W) h^{⊤} (x_{i}, Y_{i}; α_{0}, W)

and

\hat{U} (α_{0}) = 1 / n_{1} \sum_{i = 1}^{n_{1}} \partial_{α}^{⊤} h (x_{i}, Y_{i}; α_{0}, W)

, we obtain

\begin{matrix} S_{11} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial v \partial v^{⊤} = \hat{H} (α_{0}), \\ S_{12} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial v \partial α^{⊤} = - {\hat{U}}^{⊤} (α_{0}), \\ S_{13} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial v \partial τ = 0, S_{21} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial α \partial v^{⊤} = - \hat{U} (α_{0}), \\ S_{22} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial α \partial α^{⊤} = 0, S_{23} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial α \partial τ^{⊤} = - H_{2}^{⊤}, \\ S_{31} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial τ \partial v^{⊤} = 0, S_{32} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial τ \partial α^{⊤} = - H_{2}, \\ S_{33} (0, α_{0}, 0) = \partial^{2} S (0, α_{0}, 0) / \partial τ \partial τ^{⊤} = 0 . \end{matrix}

Let

H = E \hat{H} (α_{0})

and

U = E \hat{U} (α_{0})

. Taking the Taylor expansion of

S_{t} (\hat{v}, \hat{α}, \hat{τ}) = 0

(

t = 1, 2, 3

) at

(0, α_{0}, 0)

yields

(\begin{matrix} - S_{1} (0, α_{0}, 0) \\ 0 \\ 0 \end{matrix}) = (\begin{matrix} H & - U^{⊤} & 0 \\ - U & 0 & - H_{2}^{⊤} \\ 0 & - H_{2} & 0 \end{matrix}) (\begin{matrix} \hat{v} - 0 \\ \hat{α} - α_{0} \\ \hat{τ} - 0 \end{matrix}) + o_{p} (n^{- \frac{1}{2}}) .

Define the matrix Q as follows:

Q = (\begin{matrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{matrix}),

where

Q_{11} = H

,

Q_{12} = (U^{⊤}, 0)

,

Q_{21} = Q_{12}^{⊤}

, and

Q_{22} = (\begin{matrix} 0 & - H_{2}^{⊤} \\ - H_{2} & 0 \end{matrix}) .

Additionally, let

Ξ = {(α^{⊤}, τ^{⊤})}^{⊤}

. Then, we have

(\begin{matrix} \hat{v} - 0 \\ \hat{Ξ} - Ξ_{0} \end{matrix}) = Q^{- 1} \{(\begin{matrix} - S_{1} (0, α_{0}, 0) \\ 0 \end{matrix}) + o_{p} (n^{- \frac{1}{2}})\} .

Let

\mathcal{Q} = Q_{22} - Q_{21} Q_{11}^{- 1} Q_{12}

denote the Schur complement of the block Q_{11} in Q. By applying the block matrix inversion formula, we obtain

\hat{Ξ} - Ξ_{0} = - {\mathcal{Q}}^{- 1} Q_{21} Q_{11}^{- 1} S_{1} (0, α_{0}, 0) + o_{p} (n^{- \frac{1}{2}})

. Define

H_{i} (α_{0})

as the appropriate subvector of

δ_{i} {\mathcal{Q}}^{- 1} Q_{21} Q_{11}^{- 1} h (x_{i}, Y_{i}; ξ_{0})

. Then, we have

\sqrt{n} ({\hat{α}}_{p} - α_{0}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} H_{i} (α_{0}) + o_{p} (1) .

(A2)

By invoking the Lindeberg–Feller conditions, we conclude that

\sqrt{n} ({\hat{α}}_{p} - α_{0}) \overset{L}{\to} N (0, M)

, where

M = Var (H_{i} (α_{0}))

. □

Lemma A3.

Suppose Conditions A(1)–A(9) hold; if α is estimated by the penalized likelihood method,

\hat{α} = {\hat{α}}_{p}

, then as

n \to \infty

, we have

\begin{matrix} (1) & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) \overset{L}{\to} N (0, Ω_{1}), \frac{1}{n} \sum_{i = 1}^{n} {{\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})}^{\otimes 2} \overset{P}{\to} V_{1}, \\ & \frac{1}{n} \sum_{i = 1}^{n} \partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) \overset{P}{\to} Γ, \\ (2) & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) \overset{L}{\to} N (0, Ω_{2}), \frac{1}{n} \sum_{i = 1}^{n} {{\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})}^{\otimes 2} \overset{P}{\to} V_{2}, \\ & \frac{1}{n} \sum_{i = 1}^{n} \partial_{θ} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) \overset{P}{\to} Γ . \end{matrix}

Proof of Lemma A3.

We begin by proving part (1). Expanding

1 / \sqrt{n} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})

at

α = α_{0}

using a Taylor series gives

\begin{matrix} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) & = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} φ_{1} (x_{i}, Y_{i}; θ_{0}, α_{0}) \\ & & + \frac{1}{n} \sum_{i = 1}^{n} \frac{\partial {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, α_{0})}{\partial α^{⊤}} \sqrt{n} ({\hat{α}}_{p} - α_{0}) + o_{p} (1) . \end{matrix}

We observe that

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} \frac{\partial {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, α_{0})}{\partial α^{⊤}} & = & - \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i}}{π^{2} (x_{i}, Y_{i}; α_{0})} φ (x_{i}, Y_{i}; θ_{0}) \frac{\partial π (x_{i}, Y_{i}; α_{0})}{\partial α^{⊤}} \\ & = & - \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})} \{1 - π (x_{i}, Y_{i}; α_{0})\} φ (x_{i}, Y_{i}; θ_{0}) \partial logit \{π (x_{i}, Y_{i}; α_{0})\} / \partial α^{⊤} \\ & \overset{P}{\to} & - E [φ_{1} (x, Y; θ_{0}) \{δ - π (x, Y; α_{0})\} \partial logit \{π (x, Y; α_{0})\} / \partial α^{⊤}] \\ & = & - Cov \{φ_{1} (x, Y; θ_{0}), D (x, Y; α_{0})\} = - B_{1} . \end{matrix}

Thus, we obtain

\begin{matrix} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) = \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {φ_{1} (x_{i}, Y_{i}; θ_{0}, α_{0}) - B_{1} H_{i} (α_{0})} + o_{p} (1) \overset{L}{\to} N (0, Ω_{1}) . \end{matrix}

By direct calculation, we obtain

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} {{\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})}^{\otimes 2} = \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i}}{π^{2} (x_{i}, Y_{i}; α_{0})} \nabla_{i} {(θ_{0})}^{\otimes 2} ϖ^{- 4} (x_{i}) ε_{i}^{2} + o_{p} (1) \overset{P}{\to} V_{1} . \end{matrix}

We note that

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} \partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) \\ = \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i} ϖ^{- 2} (x_{i})}{π (x_{i}, Y_{i}; α_{0})} \ddot{f} (x_{i}; θ_{0}) \{Y_{i} - f (x_{i}; θ_{0})\} - \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i} ϖ^{- 2} (x_{i})}{π (x_{i}, Y_{i}; α_{0})} {\nabla_{i} (θ_{0})}^{\otimes 2} + o_{p} (1) \\ \overset{P}{\to} Γ, \end{matrix}

where

\ddot{f} (x_{i}; θ)

is defined in Assumption A(4). This completes the proof of Lemma A3 (1).

Now, we proceed to prove the second part of Lemma A3.

We observe that

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) = \frac{1}{n} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, α_{0}) + O (θ_{0}, α_{0}) ({\hat{α}}_{p} - α_{0}) + o_{p} (\| {\hat{α}}_{p} - α_{0} \|), \end{matrix}

where

O (θ_{0}, α_{0}) = \frac{1}{n} \sum_{i = 1}^{n} \partial {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, α_{0}) / \partial α^{⊤}

.

Through direct computation, we obtain

\begin{matrix} O (θ_{0}, α_{0}) = & - \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i}}{π^{2} (x_{i}, Y_{i}; α_{0})} \{φ (x_{i}, Y_{i}; θ_{0}) - m_{φ}^{0} (x_{i}; θ_{0}, α_{0})\} \frac{\partial π (x_{i}, Y_{i}; α_{0})}{\partial α^{⊤}} \\ & + \frac{1}{n} \sum_{i = 1}^{n} \frac{δ_{i}}{π^{2} (x_{i}, Y_{i}; α_{0})} \{{\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) - m_{φ}^{0} (x_{i}; θ_{0}, α_{0})\} \frac{\partial π (x_{i}, Y_{i}; α_{0})}{\partial α^{⊤}} \\ & + \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{\partial {\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0})}{\partial α^{⊤}} \\ : = & T_{n 1} + T_{n 2} + T_{n 3} . \end{matrix}

By the consistency of the kernel regression estimator, we have

T_{n 2} = o_{p} (1)

. Define

z (x_{i}, Y_{i}; α_{0}) = \partial logit {π (x_{i}, Y_{i}; α_{0})} / \partial α

. Additionally, let

m_{z}^{0} (x; α_{0}) = E \{z (x, Y; α_{0}) | x, δ = 0\},

m_{z φ}^{0} (x; α_{0}) = E \{z (x, Y; α_{0}) φ (x, Y; θ_{0}) | x, δ = 0\} .

Using the kernel regression method, we obtain the following estimators:

\begin{matrix} {\hat{m}}_{z}^{0} (x_{i}; α_{0}) = \frac{\sum_{j = 1}^{n} δ_{j} O (x_{j}, Y_{j}) K_{h} (x_{j} - x_{i}) z (x_{j}, Y_{j}; α_{0})}{\sum_{j = 1}^{n} δ_{j} O (x_{j}, Y_{j}) K_{h} (x_{j} - x_{i})}, \\ {\hat{m}}_{z φ}^{0} (x_{i}; α_{0}) = \frac{\sum_{j = 1}^{n} δ_{j} O (x_{j}, Y_{j}) K_{h} (x_{j} - x_{i}) z (x_{j}, Y_{j}; α_{0}) φ (x_{j}, Y_{j}; θ_{0})}{\sum_{j = 1}^{n} δ_{j} O (x_{j}, Y_{j}) K_{h} (x_{j} - x_{i})} . \end{matrix}

Consequently, we have

\begin{matrix} \frac{\partial {\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0})}{\partial α} = {\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) {\hat{m}}_{z}^{0} (x_{i}; α_{0}) - {\hat{m}}_{z φ}^{0} (x_{i}; α_{0}) . \end{matrix}
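
The identity above can be checked numerically. The following is a minimal Python sketch, not the authors' code. It assumes that O (x, Y) denotes the nonresponse odds {1 - π} / π under a linear logistic response model, so that O = exp(- z^⊤ α) with z = (1, x, Y)^⊤, and that {\hat{m}}_{φ}^{0} is the same δ O K_{h}-weighted kernel average as the two estimators displayed above. The simulated data, the Gaussian kernel, the bandwidth, and all function names are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=n)
phi = x * (y - 1.0 - 0.5 * x)            # toy estimating function phi(x, Y; theta_0)

alpha0 = np.array([0.4, 0.3, -0.5])      # hypothetical response-model parameter
Z = np.column_stack([np.ones(n), x, y])  # rows are z_j = d logit{pi} / d alpha
pi0 = 1.0 / (1.0 + np.exp(-(Z @ alpha0)))
delta = rng.binomial(1, pi0)             # response indicators

def kernel(u, h):
    """Gaussian kernel K_h(u)."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def m_hat(alpha, x0, values, h=0.4):
    """sum_j d_j O_j K_h(x_j - x0) values_j / sum_j d_j O_j K_h(x_j - x0)."""
    odds = np.exp(-(Z @ alpha))          # O_j = (1 - pi_j) / pi_j under the assumed logit model
    w = delta * odds * kernel(x - x0, h) # only respondents (delta_j = 1) contribute
    return (w[:, None] * values).sum(axis=0) / w.sum()

x0 = 0.3
m_phi = m_hat(alpha0, x0, phi[:, None])[0]
m_z = m_hat(alpha0, x0, Z)
m_zphi = m_hat(alpha0, x0, Z * phi[:, None])
analytic = m_phi * m_z - m_zphi          # right-hand side of the displayed identity

# Central finite difference of m_hat_phi with respect to each component of alpha.
eps = 1e-6
numeric = np.zeros(3)
for k in range(3):
    step = np.zeros(3)
    step[k] = eps
    up = m_hat(alpha0 + step, x0, phi[:, None])[0]
    down = m_hat(alpha0 - step, x0, phi[:, None])[0]
    numeric[k] = (up - down) / (2.0 * eps)

max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```

Under these assumptions the two gradients agree up to finite-difference error, because differentiating the weight δ_{j} O_{j} K_{h} with respect to α simply multiplies it by - z_{j}.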

Let

Λ_{n} (x_{i}) = \hat{G} (x_{i}) - G (x_{i})

and

z_{j} (α_{0}) = z (x_{j}, Y_{j}; α_{0})

. For notational simplicity, we temporarily denote

φ_{i} = φ (x_{i}, Y_{i}; θ_{0}), m_{φ}^{0} (x_{i}) = m_{φ}^{0} (x_{i}; θ_{0}, α_{0})

, and

{\hat{m}}_{φ}^{0} (x_{i}) = {\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0})

. By performing a further decomposition of

T_{n 3}

, we obtain

\begin{matrix} T_{n 3} & = & \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{\partial {\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0})}{\partial α} \\ & = & \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \{{\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) {\hat{m}}_{z}^{0} (x_{i}; α_{0}) - {\hat{m}}_{z φ}^{0} (x_{i}; α_{0})\} \\ & = & \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \{{\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) {\hat{m}}_{z}^{0} (x_{i}; α_{0}) - m_{φ}^{0} (x_{i}) m_{z}^{0} (x_{i}; α_{0})\} \\ & & - \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \{{\hat{m}}_{z φ}^{0} (x_{i}; α_{0}) - m_{φ}^{0} (x_{i}) m_{z}^{0} (x_{i}; α_{0})\} \\ & : = & T_{n 31} - T_{n 32} . \end{matrix}

For

T_{n 31}

, we have

\begin{matrix} T_{n 31} & = & \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \{{\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) {\hat{m}}_{z}^{0} (x_{i}; α_{0}) - m_{φ}^{0} (x_{i}) m_{z}^{0} (x_{i}; α_{0})\} \\ & = & \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} m_{z}^{0} (x_{i}; α_{0}) \{{\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) - m_{φ}^{0} (x_{i})\} \\ & & + \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \{{\hat{m}}_{z}^{0} (x_{i}; α_{0}) - m_{z}^{0} (x_{i}; α_{0})\} \{{\hat{m}}_{φ}^{0} (x_{i}; θ_{0}, α_{0}) - m_{φ}^{0} (x_{i})\} \\ & & + \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} m_{φ}^{0} (x_{i}) \{{\hat{m}}_{z}^{0} (x_{i}; α_{0}) - m_{z}^{0} (x_{i}; α_{0})\} \\ & : = & T_{n 311} + T_{n 312} + T_{n 313} . \end{matrix}

By applying standard arguments, we obtain

T_{n 31 j} = o_{p} (n^{- 1 / 2})

for

j = 1, 2, 3

. For

T_{n 32}

, we have

\begin{matrix} T_{n 32} \\ = \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \{{\hat{m}}_{z φ}^{0} (x_{i}; α_{0}) - m_{φ}^{0} (x_{i}) m_{z}^{0} (x_{i}; α_{0})\} \\ = \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{\frac{1}{n} \sum_{j = 1}^{n} δ_{j} O_{j} (α_{0}) K_{h} (x_{j} - x_{i}) \{z_{j} (α_{0}) φ_{j} - m_{φ}^{0} (x_{j}) m_{z}^{0} (x_{j})\}}{G (x_{i})} \\ + \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{\frac{1}{n} \sum_{j = 1}^{n} δ_{j} O_{j} (α_{0}) K_{h} (x_{j} - x_{i}) \{m_{φ}^{0} (x_{j}) m_{z}^{0} (x_{j}) - m_{φ}^{0} (x_{i}) m_{z}^{0} (x_{i})\}}{G (x_{i})} \\ - \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{\{{\hat{m}}_{z φ}^{0} (x_{i}) \hat{G} (x_{i}) - m_{z φ}^{0} (x_{i}) G (x_{i})\}}{G^{2} (x_{i})} Λ_{n} (x_{i}) \\ + \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{{\hat{m}}_{z φ}^{0} (x_{i})}{G^{2} (x_{i})} Λ_{n}^{2} (x_{i}) - \frac{1}{n} \sum_{i = 1}^{n} (1 - δ_{i}) \frac{m_{z φ}^{0} (x_{i}) G (x_{i})}{G^{2} (x_{i})} Λ_{n} (x_{i}) \\ + \frac{1}{n} \sum_{i = 1}^{n} \{1 - \frac{δ_{i}}{π (x_{i}, Y_{i}; α_{0})}\} \frac{m_{φ}^{0} (x_{i}) m_{z}^{0} (x_{i}) Λ_{n} (x_{i})}{G (x_{i})} \\ : = T_{n 321} + T_{n 322} + T_{n 323} + T_{n 324} + T_{n 325} + T_{n 326} . \end{matrix}

Standard arguments can also be employed to conclude that

T_{n 32 j} = o_{p} (n^{- 1 / 2})

for

j = 1, \dots, 6

. Combining the above results, we obtain

T_{n 2} = o_{p} (1)

and

T_{n 3} = o_{p} (1)

.

Next, we consider

T_{n 1}

. A straightforward calculation yields

\begin{matrix} \partial π (x, Y; α) / \partial α = π (x, Y; α) \{1 - π (x, Y; α)\} z (x, Y; α) . \end{matrix}
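
As a numerical sanity check on this identity, the following minimal Python sketch compares the analytic gradient π (1 - π) z with a central finite difference. It assumes, purely for illustration, a linear logistic response model logit{π (x, Y; α)} = α_{0} + α_{1} x + α_{2} Y, so that z (x, Y; α) = (1, x, Y)^⊤. The function names and parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def pi_logistic(alpha, x, y):
    """Response probability with logit{pi} = a0 + a1*x + a2*y (assumed form)."""
    eta = alpha[0] + alpha[1] * x + alpha[2] * y
    return 1.0 / (1.0 + np.exp(-eta))

def z_logistic(x, y):
    """z = d logit{pi} / d alpha for the linear logit above."""
    return np.array([1.0, x, y])

def grad_pi_analytic(alpha, x, y):
    """Analytic gradient pi * (1 - pi) * z."""
    p = pi_logistic(alpha, x, y)
    return p * (1.0 - p) * z_logistic(x, y)

def grad_pi_numeric(alpha, x, y, eps=1e-6):
    """Central finite-difference gradient of pi with respect to alpha."""
    alpha = np.asarray(alpha, dtype=float)
    g = np.zeros_like(alpha)
    for k in range(alpha.size):
        step = np.zeros_like(alpha)
        step[k] = eps
        g[k] = (pi_logistic(alpha + step, x, y)
                - pi_logistic(alpha - step, x, y)) / (2.0 * eps)
    return g

alpha0 = np.array([0.5, -0.3, 0.8])   # hypothetical parameter value
analytic = grad_pi_analytic(alpha0, x=1.2, y=-0.7)
numeric = grad_pi_numeric(alpha0, x=1.2, y=-0.7)
max_abs_diff = float(np.max(np.abs(analytic - numeric)))
```

The same identity is what turns the factor π^{- 2} \partial π / \partial α^{⊤} into π^{- 1} (1 - π) z^{⊤} in the expressions for the derivatives of the estimating functions.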

On the other hand,

\begin{matrix} E \{\frac{δ}{π (x, Y; α_{0})} \{1 - π (x, Y; α_{0})\} z (x, Y; α_{0}) \{φ (x, Y; θ_{0}) - m_{φ}^{0} (x; θ_{0}, α_{0})\}\} \\ = E \{\frac{δ}{π (x, Y; α_{0})} \{δ - π (x, Y; α_{0})\} z (x, Y; α_{0}) \{φ (x, Y; θ_{0}) - m_{φ}^{0} (x; θ_{0}, α_{0})\}\} \\ = E \{\frac{δ}{π (x, Y; α_{0})} \{φ (x, Y; θ_{0}) - m_{φ}^{0} (x; θ_{0}, α_{0})\} D (x, Y; α_{0})\} \\ = E \{[\frac{δ}{π (x, Y; α_{0})} \{φ (x, Y; θ_{0}) - m_{φ}^{0} (x; θ_{0}, α_{0})\} + m_{φ}^{0} (x; θ_{0}, α_{0})] D (x, Y; α_{0})\} \\ = \mathrm{Cov} \{{\hat{φ}}_{2} (x, Y; θ_{0}, α_{0}), D (x, Y; α_{0})\} . \end{matrix}

The third equality holds because

\begin{matrix} E [\{δ - π (x, Y; α_{0})\} z (x, Y; α_{0}) | x] = E \{E [\{δ - π (x, Y; α_{0})\} z (x, Y; α_{0}) | x, Y] | x\} = 0, \end{matrix}

which results in

E \{m_{φ}^{0} (x; θ_{0}, α_{0}) D (x, Y; α_{0})\} = E \{m_{φ}^{0} (x; θ_{0}, α_{0}) E [\{δ - π (x, Y; α_{0})\} z (x, Y; α_{0}) | x]\} = 0 .

Then, for

T_{n 1}

, we have

T_{n 1} = - \mathrm{Cov} \{{\hat{φ}}_{2} (x, Y; θ_{0}, α_{0}), D (x, Y; α_{0})\} + o_{p} (1) .

Furthermore, we have

\begin{matrix} \frac{1}{n} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) = \frac{1}{n} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, α_{0}) - B_{2} ({\hat{α}}_{p} - α_{0}) + o_{p} (n^{- \frac{1}{2}}), \end{matrix}

which is equivalent to

\begin{matrix} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) & = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, α_{0}) - B_{2} \sqrt{n} ({\hat{α}}_{p} - α_{0}) + o_{p} (1) \\ = & \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} \{{\hat{φ}}_{2} (x_{i}, Y_{i}; θ_{0}, α_{0}) - B_{2} H_{i} (α_{0})\} + o_{p} (1) \\ \overset{L}{\to} & N (0, Ω_{2}) . \end{matrix}

The second and third parts of Lemma A3 (2) can be proved using arguments similar to those used for the corresponding parts of Lemma A3 (1). Thus, the proof of Lemma A3 is complete. □

Proof of Theorem 2.

We begin by considering the first part of Theorem 2. Let

{\hat{θ}}_{1}

and

{\hat{ν}}_{n}

be the solutions to the following equations:

\begin{matrix} Q_{n 1} (θ, ν_{n}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{{\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})}{1 + ν_{n}^{⊤} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})} = 0, \\ Q_{n 2} (θ, ν_{n}) = \frac{1}{n} \sum_{i = 1}^{n} \frac{ν_{n}^{⊤} \partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})}{1 + ν_{n}^{⊤} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ, {\hat{α}}_{p})} = 0 . \end{matrix}

Taking the Taylor series expansion of

Q_{n 1} ({\hat{θ}}_{1}, {\hat{ν}}_{n})

and

Q_{n 2} ({\hat{θ}}_{1}, {\hat{ν}}_{n})

at

(θ_{0}, 0)

, we obtain

\begin{matrix} 0 & = & Q_{n 1} ({\hat{θ}}_{1}, {\hat{ν}}_{n}) = Q_{n 1} (θ_{0}, 0) + \frac{\partial Q_{n 1} (θ_{0}, 0)}{\partial θ} ({\hat{θ}}_{1} - θ_{0}) + \frac{\partial Q_{n 1} (θ_{0}, 0)}{\partial ν_{n}^{⊤}} ({\hat{ν}}_{n} - 0) + o_{p} (σ_{n}), \\ 0 & = & Q_{n 2} ({\hat{θ}}_{1}, {\hat{ν}}_{n}) = Q_{n 2} (θ_{0}, 0) + \frac{\partial Q_{n 2} (θ_{0}, 0)}{\partial θ} ({\hat{θ}}_{1} - θ_{0}) + \frac{\partial Q_{n 2} (θ_{0}, 0)}{\partial ν_{n}^{⊤}} ({\hat{ν}}_{n} - 0) + o_{p} (σ_{n}), \end{matrix}

where

σ_{n} = \| {\hat{θ}}_{1} - θ_{0} \| + \| {\hat{ν}}_{n} \|

.

Through direct calculation, we obtain

\begin{matrix} \frac{\partial Q_{n 1} (θ_{0}, 0)}{\partial θ} = \frac{1}{n} \sum_{i = 1}^{n} \partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}), & \frac{\partial Q_{n 1} (θ_{0}, 0)}{\partial ν_{n}^{⊤}} = - \frac{1}{n} \sum_{i = 1}^{n} {{\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})}^{\otimes 2}, \\ \frac{\partial Q_{n 2} (θ_{0}, 0)}{\partial θ} = 0, & \frac{\partial Q_{n 2} (θ_{0}, 0)}{\partial ν_{n}^{⊤}} = \frac{1}{n} \sum_{i = 1}^{n} {\{\partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})\}}^{⊤} . \end{matrix}

Then, we have

(\begin{matrix} {\hat{ν}}_{n} \\ {\hat{θ}}_{1} - θ_{0} \end{matrix}) = S_{n}^{- 1} (\begin{matrix} - Q_{n 1} (θ_{0}, 0) + o_{p} (σ_{n}) \\ o_{p} (σ_{n}) \end{matrix}),

where

S_{n} = (\begin{matrix} - \frac{1}{n} \sum_{i = 1}^{n} {{\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})}^{\otimes 2} & \frac{1}{n} \sum_{i = 1}^{n} \partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) \\ \frac{1}{n} \sum_{i = 1}^{n} {\{\partial_{θ} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})\}}^{⊤} & 0 \end{matrix}) .

From Lemma A3, we obtain the following convergence result for

S_{n}

:

S_{n} \overset{P}{\to} S = (\begin{matrix} - V_{1} & Γ \\ Γ^{⊤} & 0 \end{matrix}) .

Additionally, from Lemma A3, we have

Q_{n 1} (θ_{0}, 0) = \frac{1}{n} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) = O_{p} (n^{- 1 / 2})

, which implies that

σ_{n} = O_{p} (n^{- 1 / 2})

. Thus, we obtain

\begin{matrix} \sqrt{n} ({\hat{θ}}_{1} - θ_{0}) = - {(Γ^{⊤} V_{1}^{- 1} Γ)}^{- 1} Γ^{⊤} V_{1}^{- 1} \frac{1}{\sqrt{n}} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}) + o_{p} (1) . \end{matrix}

Therefore, we have

\sqrt{n} ({\hat{θ}}_{1} - θ_{0}) \overset{L}{\to} N (0, Σ_{1})

. Following the same procedure as outlined above, we can also establish that

\sqrt{n} ({\hat{θ}}_{2} - θ_{0}) \overset{L}{\to} N (0, Σ_{2}) .
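
To make the structure of the limiting covariance concrete, the following minimal Python sketch evaluates the sandwich form implied by the expansion above together with Lemma A3, namely A Ω A^⊤ with A = (Γ^⊤ V^{- 1} Γ)^{- 1} Γ^⊤ V^{- 1}. Reading Σ_{1} off the expansion in this way is an assumption of the sketch; the definition given with Theorem 2 remains the authoritative one. The matrices below are hypothetical plug-in values, and `sandwich_cov` is an illustrative name.

```python
import numpy as np

def sandwich_cov(Gamma, V, Omega):
    """Return A @ Omega @ A.T with A = (Gamma' V^{-1} Gamma)^{-1} Gamma' V^{-1}."""
    Vinv_Gamma = np.linalg.solve(V, Gamma)                    # V^{-1} Gamma
    A = np.linalg.solve(Gamma.T @ Vinv_Gamma, Vinv_Gamma.T)   # relies on V being symmetric
    return A @ Omega @ A.T

# Hypothetical plug-in values, chosen only for illustration.
Gamma = np.array([[-2.0, 0.3], [0.3, -1.5]])
V = np.array([[1.0, 0.2], [0.2, 0.8]])
Omega = np.array([[0.9, 0.1], [0.1, 0.6]])

Sigma = sandwich_cov(Gamma, V, Omega)

# When Gamma is square and invertible, A reduces to Gamma^{-1}.
Gamma_inv = np.linalg.inv(Gamma)
Sigma_square_case = Gamma_inv @ Omega @ Gamma_inv.T
```

In the just-identified case, where the estimating function has the same dimension as θ and Γ is invertible, V drops out of the sandwich, which is why `Sigma` and `Sigma_square_case` coincide here.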

We now consider the second part of Theorem 2. Using the same argument as in Tang et al. [23], we obtain

\begin{matrix} {\hat{ℓ}}_{1} (θ_{0}, {\hat{α}}_{p}) & = & Z^{⊤} {\{\frac{1}{n} \sum_{i = 1}^{n} {\hat{φ}}_{1} {(x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})}^{\otimes 2}\}}^{- 1} Z + o_{p} (1), \end{matrix}

where

Z = 1 / \sqrt{n} \sum_{i = 1}^{n} {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p})

. Applying Lemma A3, we obtain the desired result. The asymptotic distribution of

{\hat{ℓ}}_{2} (θ_{0}, {\hat{α}}_{p})

can be derived by following the same reasoning as in the proof of

{\hat{ℓ}}_{1} (θ_{0}, {\hat{α}}_{p})

. This completes the proof of Theorem 2. □
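
The quadratic approximation used in the second part of the proof can be illustrated numerically. The following minimal Python sketch computes an empirical log-likelihood ratio 2 \sum_{i} log(1 + ν^⊤ g_{i}), where ν solves \sum_{i} g_{i} / (1 + ν^⊤ g_{i}) = 0, and compares it with Z^⊤ {n^{- 1} \sum_{i} g_{i}^{\otimes 2}}^{- 1} Z. Here g_{i} stands in for {\hat{φ}}_{1} (x_{i}, Y_{i}; θ_{0}, {\hat{α}}_{p}); the simulated g_{i}, the damped Newton solver, and the name `el_log_ratio` are assumptions of the sketch rather than the authors' implementation.

```python
import numpy as np

def el_log_ratio(g, max_iter=100, tol=1e-10):
    """Return (2 * sum log(1 + nu' g_i), nu), with nu solving sum g_i / (1 + nu' g_i) = 0."""
    n, q = g.shape
    nu = np.zeros(q)
    for _ in range(max_iter):
        denom = 1.0 + g @ nu
        gw = g / denom[:, None]
        grad = gw.sum(axis=0)             # gradient of sum log(1 + nu' g_i)
        if np.linalg.norm(grad) < tol:
            break
        hess = -(gw.T @ gw)               # Hessian, negative definite
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Damping: halve the step until every 1 + nu' g_i stays positive.
        while np.any(1.0 + g @ (nu - t * step) <= 1e-8):
            t *= 0.5
        nu = nu - t * step
    return 2.0 * np.sum(np.log(1.0 + g @ nu)), nu

rng = np.random.default_rng(1)
n = 2000
g = rng.normal(size=(n, 2))               # placeholder for the estimating functions at theta_0

ell, nu_hat = el_log_ratio(g)
Z = g.sum(axis=0) / np.sqrt(n)
V_hat = g.T @ g / n
quad = float(Z @ np.linalg.solve(V_hat, Z))   # quadratic approximation Z' V_hat^{-1} Z
residual = float(np.linalg.norm((g / (1.0 + g @ nu_hat)[:, None]).sum(axis=0)))
```

In this toy setting the two quantities should be close for large n. In the setting of Theorem 2, Lemma A3 gives Z a limiting N (0, Ω_{1}) distribution while the normalizing matrix converges to V_{1}, so the quadratic form behaves like a weighted sum of independent chi-squared variables with one degree of freedom, with weights given by the eigenvalues of V_{1}^{- 1} Ω_{1}; this is consistent with the weighted chi-squared limit stated in the abstract.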

References

1. Jennrich, R.I. Asymptotic properties of non-linear least squares estimators. Ann. Math. Stat. 1969, 40, 633–643.
2. Wu, C.F. Asymptotic theory of nonlinear least squares estimation. Ann. Stat. 1981, 9, 501–513.
3. Fekedulegn, D.; Mac Siurtain, M.P.; Colbert, J.J. Parameter estimation of nonlinear growth models in forestry. Silva Fenn. 1999, 33, 327–336.
4. Ivanov, A.V. Asymptotic Theory of Nonlinear Regression; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1997.
5. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: New York, NY, USA, 2019.
6. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
7. Robins, J.M.; Rotnitzky, A.; Zhao, L. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
8. Han, P. Multiply robust estimation in regression analysis with missing data. J. Am. Stat. Assoc. 2014, 109, 1159–1173.
9. Xue, L.; Xie, J. Efficient robust estimation for single-index mixed effects models with missing observations. Stat. Pap. 2024, 65, 827–864.
10. Sharghi, S.; Stoll, K.; Ning, W. Statistical inferences for missing response problems based on modified empirical likelihood. Stat. Pap. 2024, 65, 4079–4120.
11. Li, W.; Luo, S.; Xu, W. Calibrated regression estimation using empirical likelihood under data fusion. Comput. Stat. Data Anal. 2024, 190, 107871.
12. Tang, N.; Zhao, P. Empirical likelihood-based inference in nonlinear regression models with missing responses at random. Statistics 2013, 47, 1141–1159.
13. Owen, A.B. Empirical likelihood ratio confidence regions. Ann. Stat. 1990, 18, 90–120.
14. Yang, Z.; Tang, N. Empirical likelihood for nonlinear regression models with nonignorable missing responses. Can. J. Stat. 2020, 48, 386–416.
15. Wang, S.; Shao, J.; Kim, J.K. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat. Sin. 2014, 24, 1097–1116.
16. Wang, L.; Shao, J.; Fang, F. Propensity model selection with nonignorable nonresponse and instrument variable. Stat. Sin. 2021, 31, 647–672.
17. Chen, J.; Shao, J.; Fang, F. Instrument search in pseudo-likelihood approach for nonignorable nonresponse. Ann. Inst. Stat. Math. 2021, 73, 519–533.
18. Du, J.; Li, Y.; Cui, X. Identification and estimation of generalized additive partial linear models with nonignorable missing response. Commun. Math. Stat. 2024, 12, 113–156.
19. Beppu, K.; Morikawa, K. Verifiable identification condition for nonignorable nonresponse data with categorical instrumental variables. Stat. Theory Relat. Fields 2024, 8, 40–50.
20. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
21. Qin, J.; Leung, D.; Shao, J. Estimation with survey data under nonignorable nonresponse or informative sampling. J. Am. Stat. Assoc. 2002, 97, 193–200.
22. Qin, J.; Lawless, J.F. Empirical likelihood and general estimating equations. Ann. Stat. 1994, 22, 300–325.
23. Tang, N.; Zhao, P.; Zhu, H. Empirical likelihood for estimating equations with nonignorably missing data. Stat. Sin. 2014, 24, 723–747.
24. Ding, X.; Tang, N. Adjusted empirical likelihood estimation of distribution function and quantile with nonignorable missing data. J. Syst. Sci. Complex. 2018, 31, 820–840.
25. Morikawa, K.; Kano, Y. Statistical inference with different missing-data mechanisms. arXiv 2014, arXiv:1407.4971.
26. Miao, W.; Tchetgen, E.J. On varieties of doubly robust estimators under missingness not at random with a shadow variable. Biometrika 2016, 103, 475–482.
27. Liu, T.; Yuan, X. Doubly robust augmented-estimating-equations estimation with nonignorable nonresponse data. Stat. Pap. 2020, 61, 2241–2270.
28. Zhao, P.; Tang, N.; Zhu, H. Generalized empirical likelihood inferences for nonsmooth moment functions with nonignorable missing values. Stat. Sin. 2020, 30, 217–249.
29. Hu, Z.; Follmann, D.A.; Qin, J. Semiparametric dimension reduction estimation for mean response with missing data. Biometrika 2010, 97, 305–319.
30. Zhao, P.; Tang, N.; Qu, A.; Jiang, D. Semiparametric estimating equations inference with nonignorable missing data. Stat. Sin. 2017, 27, 89–113.
31. Jiang, D.; Zhao, P.; Tang, N. A propensity score adjustment method for regression models with nonignorable missing covariates. Comput. Stat. Data Anal. 2016, 94, 98–119.
32. Zhou, Y.; Wan, A.T.K.; Wang, X. Estimating equations inference with missing data. J. Am. Stat. Assoc. 2008, 103, 1187–1199.
33. Hammer, S.M.; Katzenstein, D.A.; Hughes, M.D.; Gundacker, H.; Schooley, R.T.; Haubrich, R.H.; Henry, W.K.; Lederman, M.M.; Phair, J.P.; Niu, M.; et al. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. N. Engl. J. Med. 1996, 335, 1081–1090.
34. Davidian, M.; Tsiatis, A.A.; Leon, S. Semiparametric estimation of treatment effect in a pretest–posttest study with missing data. Statist. Sci. 2005, 20, 261–301.
35. Tsiatis, A.A.; Davidian, M.; Zhang, M.; Lu, X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Stat. Med. 2008, 27, 4658–4677.
36. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; Chapman & Hall: New York, NY, USA, 1993.
37. Ren, Y.; Zhang, X. Variable selection using penalized empirical likelihood. Sci. China Math. 2011, 54, 1829–1845.

Figure 1. Simulation 1 results comparing EL and NA methods. Line specifications: EL(${\hat{θ}}_{2}$) (red solid), EL(${\hat{θ}}_{1}$) (green dotted), NA(${\hat{θ}}_{2}$) (black dash-dot), NA(${\hat{θ}}_{1}$) (blue thick solid).

Table 1. Simulation results on the estimation performance of $α$ in Simulation 2.

	$n = 150$					$n = 250$
Est.	Bias	SD	RMS	T	F	Bias	SD	RMS	T	F
${\hat{α}}_{0}$	0.0549	0.1021	0.1158	3.69	0	0.0447	0.0703	0.0833	3.79	0
${\hat{α}}_{3}$	0.0211	0.2313	0.2321	–	–	0.0038	0.1612	0.1612	–	–
${\hat{α}}_{y}$	0.0031	0.1397	0.1397	–	–	0.0118	0.0984	0.0990	–	–

Table 2. Simulation results on the estimation performance of $θ$ in Simulation 2.

		IPW			AIPW
$n$	Est.	Bias	SD	RMS	Bias	SD	RMS
150	${\hat{θ}}_{0}$	0.0014	0.0442	0.0443	0.0007	0.0439	0.0439
	${\hat{θ}}_{1}$	0.0015	0.0519	0.0520	0.0012	0.0520	0.0522
	${\hat{θ}}_{2}$	0.0015	0.0569	0.0570	0.0008	0.0576	0.0576
	${\hat{θ}}_{3}$	0.0006	0.0596	0.0596	0.0001	0.0601	0.0601
	${\hat{θ}}_{4}$	0.0023	0.0573	0.0574	0.0020	0.0574	0.0574
	${\hat{θ}}_{5}$	0.0011	0.0305	0.0305	0.0009	0.0305	0.0305
250	${\hat{θ}}_{0}$	0.0006	0.0382	0.0382	0.0008	0.0379	0.0379
	${\hat{θ}}_{1}$	0.0012	0.0399	0.0399	0.0012	0.0401	0.0402
	${\hat{θ}}_{2}$	0.0015	0.0466	0.0466	0.0013	0.0466	0.0466
	${\hat{θ}}_{3}$	0.0010	0.0450	0.0450	0.0013	0.0449	0.0449
	${\hat{θ}}_{4}$	0.0016	0.0409	0.0409	0.0021	0.0409	0.0409
	${\hat{θ}}_{5}$	0.0005	0.0227	0.0227	0.0007	0.0225	0.0225

Table 3. Estimation of response model parameters $α$.

Est.	Estimate	p-Value	Est.	Estimate	p-Value
${\hat{α}}_{0}$	0.64	<0.001	${\hat{α}}_{5}$	0.0068	<0.001
${\hat{α}}_{1}$	−0.0007	<0.001	${\hat{α}}_{6}$	0.0002	0.002
${\hat{α}}_{2}$	0.0011	0.003	${\hat{α}}_{7}$	0.0010	<0.001
${\hat{α}}_{3}$	0	0.574	${\hat{α}}_{8}$	−0.6299	<0.001
${\hat{α}}_{4}$	0	0.191	${\hat{α}}_{9}$	−0.0010	<0.001

Table 4. Results of the analysis on the ACTG 175 data.

	Complete-Case Analysis			Han’s Method
	Estimate	s.e.	p-Value	Estimate	s.e.	p-Value
Intercept	21.50	27.44	0.433	65.53	34.06	0.054
Trt	63.68	9.09	<0.001	52.72	10.34	<0.001
$CD 4_{0}$	0.76	0.04	<0.001	0.73	0.05	<0.001
Age	0.10	0.45	0.816	0.14	0.55	0.796
Weight	0.54	0.28	0.054	0.27	0.33	0.417
Race	−20.60	8.51	0.015	−18.30	9.66	0.058
Gender	−10.73	10.79	0.320	−16.54	11.34	0.145
History	−42.02	7.62	<0.001	−41.45	8.65	<0.001
Offtrt	−80.72	9.62	<0.001	−86.87	10.31	<0.001
	IPW			AIPW
	Estimate	s.e.	p-Value	Estimate	s.e.	p-Value
Intercept	33.15	30.66	0.2796	34.28	30.77	0.2651
Trt	62.14	9.52	<0.001	61.77	9.60	<0.001
$CD 4_{0}$	0.76	0.05	<0.001	0.76	0.05	<0.001
Age	0.18	0.53	0.7278	0.18	0.54	0.7326
Weight	0.42	0.32	0.1797	0.42	0.32	0.1884
Race	−22.07	10.10	0.0288	−22.01	10.13	0.0297
Gender	−9.38	12.07	0.4369	−9.33	12.10	0.4402
History	−41.34	8.67	<0.001	−41.24	8.70	<0.001
Offtrt	−74.74	11.62	<0.001	−74.44	11.64	<0.001

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
