Article

Penalized Exponentially Tilted Likelihood for Growing Dimensional Models with Missing Data

Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming 650050, China
* Author to whom correspondence should be addressed.
Entropy 2025, 27(2), 146; https://doi.org/10.3390/e27020146
Submission received: 21 December 2024 / Revised: 20 January 2025 / Accepted: 21 January 2025 / Published: 1 February 2025

Abstract

This paper develops a penalized exponentially tilted (ET) likelihood to simultaneously estimate unknown parameters and select variables for growing dimensional models with responses missing at random. The inverse probability weighted approach is employed to compensate for missing information and to ensure the consistency of parameter estimators. Based on the penalized ET likelihood, we construct an ET likelihood ratio statistic to test the contrast hypothesis of parameters. Under some mild conditions, we obtain the consistency, asymptotic properties, and oracle properties of parameter estimators and show that the constrained penalized ET likelihood ratio statistic for testing the contrast hypothesis possesses the Wilks' property. Simulation studies are conducted to validate the finite sample performance of the proposed methodologies. Thyroid data taken from the First People's Hospital of Yunnan Province are employed to illustrate the proposed methodologies.

1. Introduction

Estimating equations (EEs) are widely used to make statistical inference on data whose distribution is unknown but satisfies moment conditions. Generally, they can be formulated as $g(x,y;\beta) = (g_1(x,y;\beta), g_2(x,y;\beta), \ldots, g_r(x,y;\beta))^\top$, satisfying the moment restriction $E[g(x,y;\beta)] = 0$, where $\{x,y\}$ are response variables or covariates and $\beta$ is an unknown p-dimensional parameter vector. There is considerable literature on the statistical inference of EEs. For example, Godambe [1] discussed the properties (e.g., unbiasedness) of unknown parameter estimators in EEs. Carroll et al. [2] applied EEs to measurement error models. Hardin & Hilbe [3] extended EEs to a more general case. For fixed and finite values of p and r, various methods have been developed to make statistical inference on EEs; see, for example, the generalized method of moments (GMM) [4], the empirical likelihood (EL) method [5,6,7], and the exponentially tilted (ET) likelihood method [8]. In particular, Newey & Smith [9] showed that EL- and ET-likelihood-based estimators of unknown parameters in EEs belong to a class of generalized EL (GEL) estimators. The GEL estimator, as an alternative to the GMM estimator, possesses the Wilks' property, Bartlett correction, and well-defined higher-order asymptotic properties, which are superior to those of GMM estimation [9,10].
Estimating equations for growing dimensional models have extensive applications in statistics, economics, finance, and other related fields. When p and r grow to infinity, it is usually assumed that these models are sparse, meaning that only a small number of covariates contribute to the response variable. Hence, many penalized approaches have been developed to simultaneously estimate unknown parameters and select important covariates. Introducing penalty functions into the likelihood is a widely adopted approach; there are many choices for penalty functions, such as the least absolute shrinkage and selection operator (Lasso) [11], adaptive Lasso [12], group Lasso, smoothly clipped absolute deviation (SCAD) [13], and folded concave penalty (FCP) [14], among others. For example, Hastie et al. [15] showed that introducing penalty functions into the likelihood can adjust the trade-off between bias and variance. Lam & Fan [16] studied the profile-kernel likelihood method in linear regression models. Zou & Zhang [17] established the oracle property of the adaptive elastic-net under the divergent parameter model. Caner & Zhang [18] extended Zou & Zhang's work to GMMs. Tang et al. [8] proposed an exponentially tilted (ET) likelihood inference for EEs in growing dimensional models, studied the asymptotic properties of ET likelihood estimators, and showed that, under some mild conditions, the ET likelihood estimators of unknown parameters are robust to model misspecification and the ET likelihood ratio statistic for testing parameters possesses the Wilks' property. The aforementioned literature only focuses on situations where the data are completely observed. However, missing data are widely encountered in many applications, such as biomedical studies and clinical trials, due to various reasons, such as data collection errors, non-response, or technical limitations.
In data analysis, the imputation of missing data is critical when the missing data mechanism is not missing completely at random, in that complete case analysis, which uses only the observations with completely observed variables, may lead to severe bias. To address this issue, many methods have been developed to make statistical inference on missing data models via the imputation of missing data in various fields, including economics, social sciences, healthcare, and machine learning. For example, Wang et al. [19] investigated the estimation of regression coefficients in linear models with missing covariates. Ref. [20] applied locally weighted kernel polynomial regression methods to generalized linear models with missing predictors. Scharfstein et al. [21] studied the nonignorable drop-out problem in semiparametric nonresponse models. FitzGerald [22] proposed a weighted approach to addressing missing data in generalized estimating equations with binary outcomes. Wang et al. [23] presented a general approach to handling missing data in longitudinal studies based on expected EEs. Zhou et al. [24] redefined the EE imputation method using kernel regression to mitigate the biasing effects of missing data. Tang et al. [25] developed an EL inference on parameters in generalized EEs with nonignorable missing response data. Tang et al. [8] proposed a nonparametric imputation method based on the propensity score in a general class of semiparametric models for missing data. Qi et al. [26] reformulated the EE imputation method via kernel regression to handle nonignorable missing data and utilized GMM to simultaneously estimate model parameters and tilting parameters in propensity score functions. Shao & Wang [27] investigated nonignorable missing data with a semiparametric exponential tilting propensity and applied the inverse propensity weighting approach to estimating population parameters. Tang et al. [28] proposed a new feature screening procedure in ultrahigh-dimensional partially linear models with responses missing at random for longitudinal data based on the profile marginal kernel-assisted EE imputation technique. Liu and Yuan [29] proposed adaptive EL estimators to improve the estimation efficiency of existing adaptive estimators of the mean response with nonignorable nonresponse data. The aforementioned literature on the imputation of missing data mainly focuses on EL inference for fixed-dimensional EEs. Although Tang et al. [8] considered ET likelihood inference on EEs in growing dimensional models to address the shortcoming that EL inference is not robust to the misspecification of EEs, to our knowledge, there is little work on ET likelihood inference for growing dimensional models with responses missing at random.
In this paper, we conduct statistical inference on growing dimensional models with missing data and propose a penalized ET likelihood to simultaneously estimate parameters and select variables. We consider the case where the responses are missing at random, compensate for the missing information via the inverse probability weighted (IPW) approach, and develop ET likelihood estimators of the parameters based on the EE imputation method together with the penalized ET likelihood. When the penalty function satisfies certain conditions, we obtain the oracle property and asymptotic properties of the parameter estimators and show that the constrained penalized ET likelihood ratio statistic for testing the contrast hypothesis asymptotically follows a central chi-squared distribution. Based on the test statistics, we construct confidence intervals for the parameters. We conduct simulation studies under several settings to verify the finite sample performance of the proposed methodologies, including the robustness and oracle properties of the parameter estimators. Thyroid data taken from the First People's Hospital of Yunnan Province are used to select important variables associated with thyroid nodule disease.
The rest of this paper is organized as follows. Section 2 constructs the penalized ET likelihood and develops the ET likelihood estimators of parameters and the penalized ET likelihood ratio statistic for testing the contrast hypothesis. Section 3 studies the oracle properties and asymptotic properties of parameter estimators and shows the Wilks' property of the asymptotic distribution of the constrained penalized ET likelihood ratio statistic for testing the contrast hypothesis under some regularity conditions. Section 4 conducts simulation studies to assess the finite sample performance of the proposed methodologies. An example is illustrated in Section 5. A brief discussion is given in Section 6.

2. Model and Notation

Let $\{X_i, Y_i\}_{i=1}^n$ be n independent and identically distributed observations of random variables $\{X, Y\}$ coming from an unknown distribution $F(x,y)$, where the $X_i$'s are $d_x$-dimensional vectors and the $Y_i$'s are $d_y$-dimensional vectors. It is assumed that the $X_i$'s are always observed, while the $Y_i$'s are subject to missingness for $i = 1,\ldots,n$. Let $\delta_i$ be the missingness indicator of $Y_i$, i.e., $\delta_i = 1$ if $Y_i$ is observed and $\delta_i = 0$ if $Y_i$ is missing, for $i = 1,\ldots,n$. It is assumed that $\delta_i$ and $\delta_j$ are independent for any $i \neq j$. Here, we assume a missing at random (MAR) mechanism for the missing values of $Y_i$, i.e., $\delta_i$ only depends on $X_i$, such that $\Pr(\delta_i = 1 \mid X_i) \equiv \pi_i(X_i;\gamma)$ for $i = 1,\ldots,n$, where $\gamma$ is a q-dimensional unknown parameter vector.
To meet the identifiability condition of the considered model, the idea of instrumental variables [30] is adopted. Thus, it is assumed that $X_i$ can be decomposed into $X_i = (U_i^\top, Z_i^\top)^\top$, where $Z_i$ is a $d_z$-dimensional vector of instrumental variables. Under this assumption, the missingness data mechanism model can be expressed as $\Pr(\delta_i = 1 \mid X_i) = \Pr(\delta_i = 1 \mid U_i) = \pi_i(U_i;\gamma)$, which can be formulated by a logistic regression. That is,
$\delta_i \mid U_i \sim B(\pi_i(U_i;\gamma)),$  (1)
where $B(a)$ represents the Bernoulli distribution with success probability a, and
$\mathrm{logit}\{\pi_i(U_i;\gamma)\} = \gamma_c + U_i^\top \gamma_u,$  (2)
where $\mathrm{logit}(a) = \log\{a/(1-a)\}$ and $\gamma = (\gamma_c, \gamma_u^\top)^\top = (\gamma_1, \ldots, \gamma_q)^\top$ is the q-dimensional vector of unknown parameters associated with the propensity score function $\pi_i(U_i;\gamma)$. To estimate the unknown parameters, we consider the following calibration conditions:
$\varphi(U_i, Z_i;\gamma) = \left\{\dfrac{\delta_i}{\pi_i(U_i;\gamma)} - 1\right\} d(Z_i), \quad i = 1, 2, \ldots, n,$  (3)
where $d(Z_i)$ is any user-specified function or vector of $Z_i$. If $\pi_i(U_i;\gamma)$ is correctly specified, we have $E[\varphi(U_i, Z_i;\gamma)] = 0$. Stock [31] discussed the conditions for valid instrumental variables, along with some common pitfalls in the application of instrumental variables.
Let $\beta_0$ be the p-dimensional vector of true parameters satisfying $E\{g(X_i, Y_i;\beta_0)\} = 0$. Here, $g(X,Y;\beta) = (g_1(X,Y;\beta), \ldots, g_r(X,Y;\beta))^\top$ represents r unconditional moment restrictions with $r \geq p$. To estimate the parameter vector $\gamma$ associated with the missingness data mechanism, we assume that $\varphi(U_i, Z_i;\gamma) = (\varphi_1(U_i, Z_i;\gamma), \ldots, \varphi_m(U_i, Z_i;\gamma))^\top$ satisfies $E\{\varphi(U_i, Z_i;\gamma_0)\} = 0$ for $i = 1,\ldots,n$, where $\gamma_0$ is the true value of the parameter $\gamma$. Our key purpose is to make statistical inference on $\beta$ and $\gamma$ for growing dimensional estimating equations and unknown parameters, i.e., $\{r, m, p, q\}$ are allowed to diverge with the sample size n.
To make statistical inference on unknown parameter vectors β and γ , we consider the inverse probability weighted (IPW) method, which is an important method for handling missing data and has been widely applied to various missing data analyses. For example, Rosenbaum & Rubin [32] provided a detailed argument on the application of propensity scores for adjusting weights and reducing the bias of selection. Imai & Ratkovic [33] developed the IPW method to estimate causal effects. Robins et al. [34] proposed a semiparametric approach based on inverse probability weighted EEs to handle missing data. Robins & Rotnitzky [35] discussed the application of the IPW method for improving parameter estimation efficiency in the presence of missing data. Hirano et al. [36] applied the IPW method to efficiently estimate the average effect of treatments in the presence of missing data. Inspired by the aforementioned literature, we consider the following IPW-based EEs:
$g^{ipw}(X_i, Y_i, U_i;\beta,\gamma) = \dfrac{\delta_i}{\pi_i(U_i;\gamma)}\, g(X_i, Y_i;\beta).$  (4)
Combining the calibration conditions (3) and (4) leads to the following modified estimating equations:
$G(D_i;\beta,\gamma) = \left(g_1^{ipw}(D_i;\beta,\gamma), \ldots, g_r^{ipw}(D_i;\beta,\gamma), \varphi_1(U_i;\gamma), \ldots, \varphi_m(U_i;\gamma)\right)^\top,$  (5)
where $D_i = \{X_i, Y_i, U_i\}$ for $i = 1,\ldots,n$.
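To make the construction concrete, the following minimal NumPy sketch stacks the IPW moments (4) and the calibration moments (3) into the combined function $G(D_i;\theta)$ of (5). All function names, the callable `g`, and the choice $d(Z_i) = Z_i$ are our illustrative assumptions, not part of the paper.

```python
import numpy as np

def propensity(U, gamma):
    """Logistic propensity score pi_i(U_i; gamma) = expit(gamma_c + U_i' gamma_u)."""
    eta = gamma[0] + U @ gamma[1:]
    return 1.0 / (1.0 + np.exp(-eta))

def modified_ee(X, Y, U, Z, delta, beta, gamma, g):
    """Stack the IPW moments g^ipw and the calibration moments phi, Eq. (5).

    g(X, Y, beta) must return an (n, r) array of moment contributions; missing
    rows of Y may hold any placeholder, since delta = 0 zeroes them out.
    """
    pi = propensity(U, gamma)                      # (n,)
    g_ipw = (delta / pi)[:, None] * g(X, Y, beta)  # (n, r) IPW-corrected moments
    phi = (delta / pi - 1.0)[:, None] * Z          # (n, m) calibration with d(Z) = Z
    return np.hstack([g_ipw, phi])                 # (n, t), t = r + m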
Let $t = r + m$ denote the dimension of the modified estimating equations, and let $\theta = (\beta^\top, \gamma^\top)^\top$ be the s-dimensional vector of parameters associated with them, where $s = p + q$. When $t \leq s$, estimators of $\theta$ can in principle be obtained by directly solving the estimating Equation (5). This is not feasible when $t > s$, i.e., when the number of estimating equations exceeds the number of parameters; the extra estimating equations provide additional available information, which leads to the over-identification problem. Hansen [4] showed that this additional information can improve estimation efficiency in the presence of over-identification. Because the ET likelihood is more robust than the EL likelihood in the presence of model misspecification [8], we utilize the ET likelihood method to estimate $\theta$. Following the literature [8], the ET likelihood for $\theta$ based on the data $\{D_i : i = 1,\ldots,n\}$ can be defined as
$L(\theta) = \inf\left\{\sum_{i=1}^n w_i \log(w_i) : w_i \geq 0, \ \sum_{i=1}^n w_i = 1, \ \sum_{i=1}^n w_i G(D_i;\theta) = 0\right\}.$  (6)
With the Lagrange multiplier method, we obtain
$w_i = \dfrac{\exp\{\lambda^\top G(D_i;\theta)\}}{\sum_{j=1}^n \exp\{\lambda^\top G(D_j;\theta)\}},$
where $\lambda$ is the Lagrange multiplier. Without the moment constraint, the global minimizer is attained at $w_i = n^{-1}$. Thus, the profiled log-ET likelihood ratio function of $\theta$ based on the data $\{D_i : i = 1,\ldots,n\}$ can be expressed as
$\ell_n(\lambda,\theta) = -\{L(\theta) + \log(n)\} = \log\left[\dfrac{1}{n}\sum_{i=1}^n \exp\{\lambda^\top G(D_i;\theta)\}\right].$  (7)
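To see why the second equality in (7) holds, substitute the weights $w_i$ into $L(\theta)$ and use the constraint $\sum_{i=1}^n w_i G(D_i;\theta) = 0$:
$\sum_{i=1}^n w_i \log(w_i) = \sum_{i=1}^n w_i\left[\lambda^\top G(D_i;\theta) - \log\sum_{j=1}^n \exp\{\lambda^\top G(D_j;\theta)\}\right] = -\log(n) - \log\left[\dfrac{1}{n}\sum_{i=1}^n \exp\{\lambda^\top G(D_i;\theta)\}\right],$
so that $-\{L(\theta) + \log(n)\}$ reduces to the logarithm of the sample average of the exponential tilts.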
Let
$U_{n1}(\lambda,\theta) = \partial \ell_n(\lambda,\theta)/\partial\lambda = n^{-1}\sum_{i=1}^n \exp\{\lambda^\top G(D_i;\theta)\}\, G(D_i;\theta),$
$U_{n2}(\lambda,\theta) = \partial \ell_n(\lambda,\theta)/\partial\theta = n^{-1}\sum_{i=1}^n \exp\{\lambda^\top G(D_i;\theta)\}\, \{\partial G(D_i;\theta)/\partial\theta\}^\top \lambda,$
and
$\hat\theta_{\mathrm{ET}} = \arg\max_{\theta\in\Theta} \inf_{\lambda\in\hat\Lambda_n(\theta)} \ell_n(\lambda,\theta),$
where $\hat\Lambda_n(\theta) = \{\lambda : \lambda^\top G(D_i;\theta) \in \varepsilon, \ i = 1,\ldots,n\}$, in which $\varepsilon$ is an open ball containing zero. Under some regularity conditions, $\hat\theta_{\mathrm{ET}}$ can be obtained by simultaneously solving $U_{n1}(\lambda,\theta) = 0$ and $U_{n2}(\lambda,\theta) = 0$, and the Lagrange multiplier $\lambda$ satisfies $U_{n1}(\lambda, \hat\theta_{\mathrm{ET}}) = 0$.
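As a computational illustration, a rough SciPy sketch of this saddle-point problem (inner infimum over $\lambda$, outer maximum over $\theta$) might look as follows. Here `moments` is an assumed user-supplied callable returning the $n \times t$ matrix of $G(D_i;\theta)$, and the unconstrained BFGS inner step ignores the $\hat\Lambda_n(\theta)$ restriction, which is a simplification.

```python
import numpy as np
from scipy.optimize import minimize

def log_et(lmbda, G):
    # l_n(lambda, theta) = log( n^{-1} sum_i exp(lambda' G_i) ), Eq. (7)
    return np.log(np.mean(np.exp(G @ lmbda)))

def et_estimate(theta_init, moments):
    """theta_hat_ET = argmax over theta of inf over lambda of l_n(lambda, theta)."""
    def profiled(theta):
        G = moments(theta)                  # (n, t) matrix of G(D_i; theta)
        inner = minimize(log_et, np.zeros(G.shape[1]), args=(G,), method="BFGS")
        return inner.fun                    # inf over lambda for this theta
    outer = minimize(lambda th: -profiled(th), theta_init, method="Nelder-Mead")
    return outer.x
```

The inner problem is convex in $\lambda$, so a gradient method is reliable there; the outer problem is generally non-convex, so a derivative-free method and several starting values are prudent.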
In growing dimensional estimating equations, there is a well-known ill-posed problem [8]. To address this issue, we impose a sparsity assumption on the parameter vectors $\beta$ and $\gamma$ so that only a very small portion of their coordinates are non-zero, meaning that only a small number of covariates contribute to the response variable. Under the sparsity assumption, let $A_\beta = \{j : \beta_{0j} \neq 0\}$ and let $d_\beta = |A_\beta|$ be the cardinality of $A_\beta$, where $\beta_{0j}$ represents the jth component of the true value $\beta_0$ of the parameter vector $\beta$. Without loss of generality, let $\beta = (\beta_1^\top, \beta_2^\top)^\top$, where $\beta_1 \in \mathbb{R}^{d_\beta}$ and $\beta_2 \in \mathbb{R}^{p-d_\beta}$ correspond to the non-zero and zero components of $\beta$, respectively. Similarly, let $A_\gamma = \{j : \gamma_{0j} \neq 0\}$ with cardinality $d_\gamma = |A_\gamma|$, and $\gamma = (\gamma_1^\top, \gamma_2^\top)^\top$, where $\gamma_1 \in \mathbb{R}^{d_\gamma}$ and $\gamma_2 \in \mathbb{R}^{q-d_\gamma}$ correspond to the non-zero and zero components of $\gamma$, respectively. Also, let $A_\theta = \{j : \theta_{0j} \neq 0\}$, $d_\theta = |A_\theta| = d_\beta + d_\gamma$, and $\theta = (\theta_1^\top, \theta_2^\top)^\top$, where $\theta_1 \in \mathbb{R}^{d_\theta}$ and $\theta_2 \in \mathbb{R}^{s-d_\theta}$. Here, we consider simultaneously selecting the non-zero components of both the parameter $\gamma$ associated with the missingness data mechanism and the parameter $\beta$ related to the growing dimensional EEs. To this end, we define the following penalized log-ET likelihood ratio function of $\theta$ based on the data $\{D_i : i = 1,\ldots,n\}$:
$\ell_{pn}(\theta) = \log\left[\dfrac{1}{n}\sum_{i=1}^n \exp\{\lambda^\top G(D_i;\theta)\}\right] - \sum_{j=1}^p p_{\nu_1}(|\beta_j|) - \sum_{k=1}^q p_{\nu_2}(|\gamma_k|),$  (8)
where $p_\nu(\cdot)$ is some proper penalty function, with the tuning parameter $\nu$ controlling the trade-off between bias and model complexity. There are many choices for penalty functions, such as Lasso, adaptive Lasso, SCAD, FCP, and MCP. When the penalty function is non-convex, computing the penalized ET likelihood estimator of $\theta$ is challenging [37]. In particular, for the SCAD penalty function, a nested optimization algorithm is adopted to optimize the objective function $\ell_{pn}(\theta)$ given in (8) by using the local quadratic approximation method of Fan & Li [13] to approximate the nonconvex penalty $p_\nu(|\theta_j|)$ for $j = 1,\ldots,p+q$. The computational complexity of optimizing Equation (8) is $O(nts^3T)$, where T is the number of iterations.
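For reference, the SCAD derivative and the local quadratic approximation weights can be coded in a few lines. This is a sketch of the standard Fan–Li construction (with the conventional choice $a = 3.7$ and a small `eps` guard, both our assumptions), not of any implementation supplied with the paper.

```python
import numpy as np

def scad_deriv(theta, nu, a=3.7):
    """First derivative p_dot_nu(|theta|) of the SCAD penalty (Fan & Li, 2001)."""
    t = np.abs(theta)
    return nu * ((t <= nu) + np.maximum(a * nu - t, 0.0) / ((a - 1.0) * nu) * (t > nu))

def lqa_weights(theta, nu, eps=1e-8):
    """Local quadratic approximation: p_nu(|t|) ~ const + 0.5*(pdot(|t0|)/|t0|) t^2,
    giving the diagonal elements V_0jj used later in Lemma 1."""
    return scad_deriv(theta, nu) / (np.abs(theta) + eps)
```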
For the selection of tuning parameter ν k ( k = 1 , 2 ), there are many widely used methods, such as Akaike information criterion (AIC), Bayesian information criterion (BIC), cross validation (CV), and generalized CV, among others. Similarly to [8], we consider the Bayesian information criterion,
$\mathrm{BIC}(\nu) = -2\,\ell(\hat\theta_\nu) + B_n \log(n)\, df_\nu$
to select the tuning parameters $\nu_k$, where $\nu = \{\nu_1, \nu_2\}$, $\ell(\theta) = \ell_n(\lambda,\theta)$ is given in Equation (7), $\hat\theta_\nu$ is the penalized ET estimator of $\theta$ for the tuning parameter $\nu$, $df_\nu$ is the number of non-zero coefficients in $\theta$, interpreted as the "degrees of freedom" of the estimated EEs, and $B_n$ is a scaling factor that diverges to infinity at a slow rate as $s \to \infty$. Similarly to [8], when s is fixed, we simply take $B_n = 1$; otherwise, we set $B_n = \max(\log\log s, 1)$. In this criterion, the increasing $B_n$ offsets the effect of the growing dimension s on parameter estimation.
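A direct transcription of this criterion for one candidate $\nu$ might be the following sketch (the sign of the first term follows the reconstruction above, and the zero tolerance `tol` is our assumption):

```python
import numpy as np

def bic(ell_hat, theta_hat, n, s, tol=1e-6):
    """BIC(nu) = -2 l(theta_hat_nu) + B_n log(n) df_nu for one candidate nu."""
    df = int(np.sum(np.abs(theta_hat) > tol))    # number of non-zero coefficients
    B_n = max(np.log(max(np.log(s), 1.0)), 1.0)  # B_n = max(log log s, 1); B_n = 1 if s is fixed
    return -2.0 * ell_hat + B_n * np.log(n) * df
```

In practice, one evaluates this over a grid of candidate $\nu$ values and keeps the minimizer.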

3. Asymptotic Properties

3.1. Asymptotic Properties for Correctly Specified EEs

This section focuses on the consistency and oracle properties of the parameter estimator $\hat\theta_{\mathrm{PET}}$, and on testing the contrast hypothesis. To this end, we need the following assumptions.
Assumption 1. (i) The probability density function $f(x,y)$ is bounded for any $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ are the supports of x and y, respectively. (ii) The second derivatives of $f(x,y)$ are continuous and bounded.
Assumption 2. (i) The propensity score function $\pi(u;\gamma)$ satisfies $\pi(u;\gamma) > c_0$ for any $u \in \mathcal{U}$, where $\mathcal{U}$ is the support of u and $c_0 > 0$ is a constant. (ii) The second derivatives of $\pi(u;\gamma)$ are continuous and bounded.
Assumption 3.
The parameter vector $\theta = (\beta^\top, \gamma^\top)^\top$ is identifiable; that is, there exists a unique solution $\theta_0 = (\beta_0^\top, \gamma_0^\top)^\top \in \Theta$ to $E[G(D_i;\theta)] = 0$ for $D_i = \{X_i, Y_i, U_i\}$, $i = 1,\ldots,n$.
Assumption 4.
Let $D_{n\beta} = \{\beta : \|\beta - \beta_0\| \leq C\sqrt{r/n}\}$, $D_{n\gamma} = \{\gamma : \|\gamma - \gamma_0\| \leq C\sqrt{m/n}\}$, and $D_{n\theta} = \{\theta : \|\theta - \theta_0\| \leq C\sqrt{t/n}\}$, where $C > 0$ is a proper constant. There are three constants, $C_{1g} > 0$, $K_{1g} > 0$, and $K_{2g} > 0$, and two functions, $K_{1g}(\tilde D_i)$ and $K_{2g}(\tilde D_i)$, satisfying $E[K_{1g}(\tilde D_i)^2] \leq K_{1g}$ and $E[K_{2g}(\tilde D_i)^2] \leq K_{2g}$, such that for $\beta \in D_{n\beta}$: (i) $\|E[n^{-1}\sum_{i=1}^n g(\tilde D_i;\beta)\, g(\tilde D_i;\beta)^\top]\| \leq C_{1g}$, with probability tending to one; (ii) $\sup_j E[g_j(\tilde D_i;\beta)^2] \leq C_{1g}$ and $\sup_{j,l} E[g_j(\tilde D_i;\beta)\, g_l(\tilde D_i;\beta)]^2 \leq C_{1g}$; (iii) $\sup_{j,l,\beta} |\partial g_j(\tilde D_i;\beta)/\partial\beta_l| \leq K_{1g}(\tilde D_i)$ and $\sup_{j,l_1,l_2,\beta} |\partial^2 g_j(\tilde D_i;\beta)/\partial\beta_{l_1}\partial\beta_{l_2}| \leq K_{2g}(\tilde D_i)$, where $\tilde D_i = \{X_i, Y_i\}$.
Assumption 5.
$t = o(n^{1/2 - 1/\iota})$ for some $\iota > 2$, and $t = O(s)$.
Assumptions 1 and 2 are widely used conditions in the missing data literature. Assumption 1 implies that the probability density functions $f(u)$ and $f(z)$ are bounded for any $u \in \mathcal{U}$ and $z \in \mathcal{Z}$, respectively, where $\mathcal{Z}$ is the support of z. Assumption 3 ensures the existence and consistency of the solution obtained from (7). Assumption 4 is a common condition on sample moments. Also, it follows from Assumptions 1(i) and 2 and Equation (3) that the EEs $\varphi(U_i;\gamma)$ and $g^{ipw}(D_i;\theta)$ satisfy Assumption 4 for $\gamma \in D_{n\gamma}$ and $\theta \in D_{n\theta}$: there are constants $C_{1G} > 0$, $K_{1G} > 0$, $K_{2G} > 0$ and functions $K_{1G}(D_i)$ and $K_{2G}(D_i)$ satisfying $E[K_{1G}(D_i)^2] \leq K_{1G}$ and $E[K_{2G}(D_i)^2] \leq K_{2G}$, such that, for $\theta \in D_{n\theta}$, (i) $\|E[n^{-1}\sum_{i=1}^n G(D_i;\theta)\, G(D_i;\theta)^\top]\| \leq C_{1G}$ with probability tending to one; (ii) $\sup_j E[G_j(D_i;\theta)^2] \leq C_{1G}$ and $\sup_{j,l} E[G_j(D_i;\theta)\, G_l(D_i;\theta)]^2 \leq C_{1G}$; and (iii) $\sup_{j,l,\theta} |\partial G_j(D_i;\theta)/\partial\theta_l| \leq K_{1G}(D_i)$ and $\sup_{j,l_1,l_2,\theta} |\partial^2 G_j(D_i;\theta)/\partial\theta_{l_1}\partial\theta_{l_2}| \leq K_{2G}(D_i)$. These conditions ensure the boundedness of the EEs $G(D_i;\theta)$ and their derivatives [8]. Here, t and s are allowed to diverge as $n \to \infty$, but Assumption 5 restricts their divergence rates.
Lemma 1.
Under Assumptions 1–5, we have
$\ell_{pn}(\theta) = -\dfrac{1}{2}(\theta - \dot\theta)^\top V (\theta - \dot\theta) + o_p(t/n),$
where $V = V_0 + M^\top \Sigma^{-1} M$, $\dot\theta = V^{-1}(V_0\theta_0 + M^\top\Sigma^{-1}M\,\hat\theta_{\mathrm{ET}})$, $\hat\theta_{\mathrm{ET}}$ is the ET likelihood estimator of $\theta$, and $V_0$ is a diagonal matrix with jth diagonal element $V_{0jj}$ for $j = 1,\ldots,p+q$.
Similarly to [8], the jth non-zero diagonal component $V_{0jj}$ can be evaluated with the following quadratic approximation of $p_\nu(|\theta_j|)$ at $\theta_{0j}$: $p_\nu(|\theta_j|) = p_\nu(|\theta_{0j}|) + \dot p_\nu(|\theta_j^\xi|)\,\theta_j^\xi(\theta_j - \theta_{0j})/|\theta_j^\xi| \approx p_\nu(|\theta_{0j}|) + \dot p_\nu(|\theta_j^\xi|)(\theta_j^2 - \theta_{0j}^2)/(2|\theta_j^\xi|)$ for $\theta_j \in \{\theta_j : |\theta_j - \theta_{0j}| \leq C\sqrt{t/n}\}$, which yields $V_{0jj} = \dot p_\nu(|\theta_j^\xi|)/|\theta_j^\xi|$, where $\theta_j^\xi$ is a point lying on the line segment between $\theta_j$ and $\theta_{0j}$. Lemma 1 provides a quadratic form of the penalized ET likelihood ratio function, which is approximately an ellipsoid with center $\dot\theta$. Lemma 1 can be used to obtain the consistency and asymptotic properties of the estimator of $\theta$. To obtain the oracle property of the estimator of $\theta$, we also need the following assumptions:
Assumption 6.
For any $\epsilon > 0$, $\inf_{\beta\in\Theta_\beta^W} \|E[g(\tilde D_i;\beta)]\| \geq \psi_1(d_\beta, r)\,\psi_2(\epsilon) > 0$ and $\inf_{\gamma\in\Theta_\gamma^W} \|E[\varphi(U_i;\gamma)]\| \geq \psi_1(d_\gamma, m)\,\psi_2(\epsilon) > 0$, where $\Theta_\beta^W = \{\beta\in\Theta_\beta : \|\beta - \beta_0\| \geq \epsilon\}$, $\Theta_\gamma^W = \{\gamma\in\Theta_\gamma : \|\gamma - \gamma_0\| \geq \epsilon\}$, $\Theta_\beta$ is the projection of the $\theta$-parameter space $\Theta$ onto $\beta$, $\Theta_\gamma$ is the projection of $\Theta$ onto $\gamma$, $\psi_2(\cdot)$ is some proper positive function, and the positive function $\psi_1(\cdot,\cdot)$ satisfies $\liminf_{d_\beta, r\to\infty} \psi_1(d_\beta, r) > 0$.
Assumption 7.
Let $\Sigma_g(\beta) = E[\{g_i(\beta) - \bar g(\beta)\}\{g_i(\beta) - \bar g(\beta)\}^\top]$. There are constants $C_{2g}$ and $C_{3g}$ such that $0 < C_{2g} \leq \mathrm{EV}_1\{\Sigma_g(\beta)\} \leq \mathrm{EV}_t\{\Sigma_g(\beta)\} \leq C_{3g} < \infty$ for $\beta \in D_{n\beta}$, where $g_i(\beta) = g(\tilde D_i;\beta)$, $\bar g(\beta) = E[g(\tilde D_i;\beta)]$, and $\mathrm{EV}_j(A)$ represents the jth eigenvalue of a matrix A.
Assumption 8.
There is a positive function $\eta(\nu) > 0$ such that $\eta(\nu) = O(\nu)$. For $\theta_0 \in \Theta$, $\max_j |\theta_{0j}| < \infty$, and $\eta(\nu)$ satisfies $\eta(\nu)/\min_{j\in A_\theta} |\theta_{0j}| \to 0$, where $\theta_{0j}$ represents the jth component of the true value $\theta_0$ of the parameter vector $\theta$.
Assumption 9.
The penalty function $p_\nu(\cdot)$ satisfies $\max_j \dot p_\nu(|\theta_{0j}|) = o\{(ns)^{-1/2}\}$ and $\max_j \ddot p_\nu(|\theta_{0j}|) = o(s^{-1/2})$, as well as $\max_j \dot p_\nu(|\theta_j|)/\eta(\nu) \to 0$ and $s/[\eta(\nu)\min_j \dot p_\nu(|\theta_j|)] \to 0$ for $\theta \in D_{n\theta}$ as $n\to\infty$, where $\dot p_\nu(\cdot)$ and $\ddot p_\nu(\cdot)$ are the first- and second-order derivatives of $p_\nu(\cdot)$, respectively.
Assumption 6 is the identification condition for the parameter space, which grows to infinity. Assumption 7 ensures that $\Sigma_g(\beta)$ is a strictly positive definite matrix. It follows from Assumptions 1(i) and 2(i) and Equation (3) that $\Sigma_\varphi(\gamma) = E[\{\varphi_i(\gamma) - \bar\varphi(\gamma)\}\{\varphi_i(\gamma) - \bar\varphi(\gamma)\}^\top]$ and $\Sigma_{g^{ipw}}(\theta) = E[\{g_i^{ipw}(\theta) - \bar g^{ipw}(\theta)\}\{g_i^{ipw}(\theta) - \bar g^{ipw}(\theta)\}^\top]$ are also strictly positive definite matrices, where $\varphi_i(\gamma) = \varphi(U_i;\gamma)$, $\bar\varphi(\gamma) = E[\varphi(U_i;\gamma)]$, $g_i^{ipw}(\theta) = g^{ipw}(D_i;\theta)$, and $\bar g^{ipw}(\theta) = E[g^{ipw}(D_i;\theta)]$. The tuning parameter $\nu$ associated with the penalty function controls the variable selection process to some extent. Without loss of generality, we assume that there exists a positive function $\eta(\nu)$ related to $\nu$ such that $\hat A = \{j : |\hat\theta_j^{\mathrm{PET}}| > \eta(\nu)\}$, where $\hat\theta_j^{\mathrm{PET}}$ is the jth component of the penalized ET estimator $\hat\theta_{\mathrm{PET}}$. In addition, if components of the parameter vector $\theta$ converge to zero too quickly, the parameter estimators are subject to certain disturbances, which results in relatively large biases and variances. Therefore, Assumption 8 limits the convergence rate of the parameter vector $\theta$. Assumption 9 has been adopted in the regularization literature; widely used penalty functions, such as the SCAD penalty, satisfy it.
Theorem 1.
Let $\check\theta_{\mathrm{PET}} = (\check\theta_1^{\mathrm{PET}\top}, \check\theta_2^{\mathrm{PET}\top})^\top$ be the penalized ET likelihood estimator of $\theta$ for a correctly specified model. Under Assumptions 1–9, we have
(i) 
$\|\check\theta_1^{\mathrm{PET}} - \theta_{10}\| \xrightarrow{P} 0$ and $\check\theta_2^{\mathrm{PET}} \xrightarrow{P} 0$ as $n\to\infty$, where $\xrightarrow{P}$ denotes convergence in probability.
(ii) 
$\sqrt{n}\, A_n \Gamma_1^{-1/2}(\check\theta_{\mathrm{PET}} - \theta_0) \xrightarrow{L} N(0, V)$, where $A_n \in \mathbb{R}^{s\times t}$ satisfies $A_n A_n^\top \to V$ as $n\to\infty$, V is an $s\times s$ nonnegative symmetric matrix, and $\xrightarrow{L}$ denotes convergence in distribution.
When the model is correctly specified, Theorem 1 shows that the penalized ET likelihood estimator $\check\theta_{\mathrm{PET}}$ of $\theta$ is consistent for the non-zero components of $\theta_0$, and the zero components of $\theta_0$ are estimated as zero with probability tending to one by the penalized ET likelihood method. Also, Theorem 1 shows that the distribution of $\check\theta_{\mathrm{PET}}$ is asymptotically normal. In particular, it follows from $\check\theta_{\mathrm{PET}} = (\check\beta_{\mathrm{PET}}^\top, \check\gamma_{\mathrm{PET}}^\top)^\top$ that $\sqrt{n}\, A_{n,\beta}\Gamma_{1,\beta}^{-1/2}(\check\beta_{\mathrm{PET}} - \beta_0) \xrightarrow{L} N(0, V_\beta)$ and $\sqrt{n}\, A_{n,\gamma}\Gamma_{1,\gamma}^{-1/2}(\check\gamma_{\mathrm{PET}} - \gamma_0) \xrightarrow{L} N(0, V_\gamma)$, where $A_{n,\beta}$, $\Gamma_{1,\beta}$, and $V_\beta$ are the projections of $A_n$, $\Gamma_1$, and V onto the $\beta$ space, respectively, and $A_{n,\gamma}$, $\Gamma_{1,\gamma}$, and $V_\gamma$ are the corresponding projections onto the $\gamma$ space.
We consider the following null and alternative hypotheses for testing the contrast of parameters θ :
$H_0 : C_n\theta = 0 \quad \text{versus} \quad H_1 : C_n\theta \neq 0,$
where $C_n \in \mathbb{R}^{e\times s}$ satisfies $C_n C_n^\top = I_e$ for some fixed integer e, and $I_e$ is the $e\times e$ identity matrix. This hypothesis includes some special cases, for example, $H_{0j} : \theta_j = 0$ for $j \in \{1,\ldots,s\}$, or $H_0 : \theta_j - \theta_k = 0$ for $j \neq k \in \{1,\ldots,s\}$, and so on. The penalized ET likelihood ratio statistic for testing the null hypothesis $H_0 : C_n\theta = 0$ is given as
$\tilde l(C_n) = 2n\left[\ell_{pn}(\hat\theta_{\mathrm{PETO}}) - \max_{\theta : C_n\theta = 0} \ell_{pn}(\theta)\right],$
where $\hat\theta_{\mathrm{PETO}} = \arg\max_{\theta\in\Theta}\inf_{\lambda\in\hat\Lambda_n(\theta)} \ell_{pn}(\theta)$. Similarly to the empirical likelihood literature, we can show that $\tilde l(C_n)$ converges to a $\chi^2$ distribution as $n\to\infty$ under some regularity conditions.
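Operationally, the test amounts to fitting the model twice, with and without the constraint; a hedged sketch, assuming the two fitted log-ET-likelihood-ratio values are already available, is:

```python
from scipy.stats import chi2

def et_lr_test(l_unconstrained, l_constrained, n, e):
    """Statistic 2n{ l_pn(theta_hat_PETO) - max_{C_n theta = 0} l_pn(theta) }
    with its asymptotic chi^2_e p-value."""
    stat = 2.0 * n * (l_unconstrained - l_constrained)
    return stat, chi2.sf(stat, df=e)
```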

3.2. Asymptotic Properties for Misspecified EEs

Section 3.1 discusses the asymptotic properties of estimators for parameters θ when EEs are correctly specified. However, in practical applications, it is rather difficult to correctly specify EEs. That is, misspecified EEs are usually encountered in some applications. Thus, the aforementioned asymptotic properties do not hold for misspecified EEs. To address this issue, here we investigate the asymptotic properties of estimators for parameters θ obtained with misspecified EEs. To this end, we need the following assumptions.
Assumption 10. (i) The EEs $g(X_i, Y_i;\beta)$ and the penalty function $p_\nu(|\theta|)$ are continuous. (ii) The second-order derivatives of $g(X_i, Y_i;\beta)$ and $\pi(U_i;\gamma)$ are continuous with respect to $\beta$ and $\gamma$ in the neighbourhoods $N_{\beta^*}$ and $N_{\gamma^*}$ of $\beta^*$ and $\gamma^*$, respectively, where $\beta^*$ and $\gamma^*$ are the pseudo-true values of $\beta$ and $\gamma$, which are interior points of $\Theta_\beta$ and $\Theta_\gamma$, respectively.
Assumption 11.
There are a constant $K_{3G} < \infty$ and a function $K_{3G}(D_i)$ satisfying $E\{K_{3G}(D_i)^2\} \leq K_{3G}$, such that $\sup_{\theta\in\Theta}\sup_{\lambda\in\hat\Lambda_n(\theta)} \exp\{\lambda^\top G(D_i;\theta)\} < K_{3G}(D_i)$, where $\hat\Lambda_n(\theta)$ is a compact set such that $\lambda^*(\theta) \in \mathrm{int}(\hat\Lambda_n(\theta))$.
Assumption 12. (i) For all $\theta\in\mathbb{R}^s$, the conditional expectation $E\{G(D_i;\theta) \mid G(D_i;\theta_0) = G(d;\theta_0)\} = g(d;\theta_0)$ exists. (ii) The expectation function $E[\exp\{\lambda^*(\theta)^\top G(D_i;\theta)\}]$ is convex and is maximized at the pseudo-true value $\theta^*$ of $\theta$, where $\theta^*$ is an interior point of $\Theta$.
Assumption 13.
There is a function h ( D i ) satisfying
$E\left[\sup_{\theta\in N_{\theta^*}}\sup_{\lambda\in\hat\Lambda_n(\theta)} \exp\{l_1\,\lambda^\top G(D_i;\theta)\}\, h^{l_2}(D_i)\right] < \infty$
for $l_1, l_2 = 0, 1, 2$, where $N_{\theta^*}$ is a neighbourhood of $\theta^*$ such that $\|G(D_i;\theta)\| \leq h(D_i)$, $\|\partial_\theta G(D_i;\theta)\| \leq h(D_i)$, and $\|\partial_{\theta_{j_1}}\partial_{\theta_{j_2}} G(D_i;\theta)\| \leq h(D_i)$ for any $\theta\in N_{\theta^*}$ and $j_1, j_2 = 1,\ldots,s$, where $\partial_\theta G(D_i;\theta) \equiv \partial G(D_i;\theta)/\partial\theta^\top$ and $\partial_{\theta_{j_1}}\partial_{\theta_{j_2}} G(D_i;\theta) \equiv \partial^2 G(D_i;\theta)/\partial\theta_{j_1}\partial\theta_{j_2}$.
Assumption 14. (i) As $n\to\infty$ and $\nu\to 0$, $\liminf_{\theta\to 0^+} \dot p_\nu(\theta)/\nu > 0$ and $\|\lambda(\theta)\| = o(\nu)$ for $\lambda(\theta)\in\mathrm{int}(\hat\Lambda_n(\theta))$. (ii) There is a constant $C_p$ such that $\max_{j\in A_\theta} \dot p_\nu(|\theta_{0j}|) \leq C_p\nu$ and $\max_{j\in A_\theta} \ddot p_\nu(|\theta_{0j}|) \leq C_p\nu$.
Assumptions 10–13 are regularity conditions on the EEs $g(X_i, Y_i;\beta)$ and $G(D_i;\theta)$ and on the propensity score function $\pi(U_i;\gamma)$ required for the consistency and asymptotic normality of the penalized ET likelihood estimator when the EEs are misspecified [8]. Assumptions 11–13 are slightly stronger than Assumptions 1 and 2; in fact, corresponding assumptions on $g(X_i, Y_i;\beta)$ and $\pi(U_i;\gamma)$, together with Assumptions 1 and 2 and Equation (3), yield Assumptions 11–13. Assumption 14 relates the order of the Lagrange multiplier of the ET likelihood function to the order of the tuning parameter of the penalty function, and constrains the first- and second-order derivatives of the penalty function in terms of the tuning parameter $\nu$, which ensures the sparsity of the penalized ET likelihood estimator in the presence of EE misspecification.
Theorem 2.
Let $\hat\theta_{\mathrm{PET}} = (\hat\theta_1^{\mathrm{PET}\top}, \hat\theta_2^{\mathrm{PET}\top})^\top$ be the penalized ET likelihood estimator of $\theta$ in the presence of EE misspecification. Under Assumptions 10–14, as $n\to\infty$, we have
(i) 
(Consistency) $\hat\theta_{\mathrm{PET}} \xrightarrow{P} \theta^*$, where $\xrightarrow{P}$ denotes convergence in probability;
(ii) 
(Oracle property) $\hat\theta_2^{\mathrm{PET}} = 0$ with probability tending to one;
(iii) 
(Asymptotic normality) $\sqrt{n}(\hat\zeta_{\mathrm{PET}} - \zeta^*) \xrightarrow{L} N(0, \Delta^{-1}\Omega\Delta^{-\top})$, where $\zeta = (\lambda^\top, \theta^\top, \tau^\top)^\top$, $\Delta = E\{\partial_\zeta \tilde S(\lambda,\theta,\tau)\}|_{\zeta=\zeta^*}$, $\Omega = E\{\tilde S(\lambda^*,\theta^*,\tau^*)\,\tilde S(\lambda^*,\theta^*,\tau^*)^\top\}$, $\tau$ is the Lagrange multiplier, the vector $\tilde S(\lambda,\theta,\tau)$ is the first-order derivative of the penalized log-ET likelihood ratio function for misspecified EEs under the constraint $M_2\theta = \theta_2 = 0$, in which $M_2$ is the projection matrix, and $\xrightarrow{L}$ denotes convergence in distribution.
Theorem 2 indicates that, under misspecification of the EEs, the penalized ET likelihood estimator of $\theta$ converges to the pseudo-true value $\theta^*$ rather than to a true value. Also, the estimator is shown to possess the oracle property under some regularity conditions. The consistency of the estimator holds only when the penalized log-ET likelihood ratio function converges to its population counterpart, satisfying some continuity and boundedness assumptions. Moreover, part (iii) of Theorem 2 establishes the asymptotic normality of the estimators of the non-zero components of $\theta$.
Theorem 3.
Under Assumptions 10–14, we have $\tilde l(C_n)\,\tilde M \xrightarrow{L} \chi_e^2$, where $\tilde M = K_1(S_1 - S_2)^{-1}$, with $K_1$, $S_1$, and $S_2$ defined in Appendix A, and $\chi_e^2$ denotes the chi-squared distribution with e degrees of freedom.
Theorem 3 extends the results for the growing dimensional unconditional moment models given in [8] to the MAR case. Theorem 3 can be used to test hypotheses and construct confidence regions for $\theta$. In particular, the $100(1-\alpha)\%$ approximate confidence region for $C_n\theta$ based on the constrained penalized ET likelihood ratio statistic can be written as
$CR_\alpha = \left\{\varsigma : 2n\tilde M\left[\ell_{pn}(\hat\theta) - \max_{\theta : C_n\theta = \varsigma} \ell_{pn}(\theta)\right] \leq \chi_e^2(1-\alpha)\right\},$
where $\chi_e^2(1-\alpha)$ is the $(1-\alpha)$-quantile of the chi-squared distribution with e degrees of freedom.
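For a scalar contrast ($e = 1$), the region can be traced by test inversion over a grid of candidate values; in the sketch below, `stat_at` is an assumed callable returning the constrained statistic $2n[\ell_{pn}(\hat\theta) - \max_{\theta: C_n\theta=\varsigma}\ell_{pn}(\theta)]$ at a candidate $\varsigma$, and $\tilde M$ must be estimated as in Theorem 3.

```python
import numpy as np
from scipy.stats import chi2

def profile_confidence_set(stat_at, grid, M_tilde, e=1, alpha=0.05):
    """Keep the grid points whose calibrated statistic stays below the chi2_e quantile."""
    crit = chi2.ppf(1.0 - alpha, df=e)
    return np.array([s for s in grid if M_tilde * stat_at(s) <= crit])
```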

4. Simulation Studies

In this section, some simulation studies are conducted to investigate the finite sample performance of the proposed methodologies.
Experiment 1 (Linear Regression Model). The data $\{Y_i : i = 1,\ldots,n\}$ are generated from the following linear regression model: $Y_i = \beta_c + X_i^\top\beta_x + \epsilon_i$, where $X_i = (U_i^\top, Z_i^\top)^\top$ is the p-dimensional covariate vector, $\beta_c$ is the constant term, $\beta_x$ is the p-dimensional vector of regression coefficients, the instrumental variables $Z_i = (z_{i,1},\ldots,z_{i,p-q})^\top$ are drawn from the multivariate normal distribution $N(0,\Sigma)$, i.e., $Z_i \sim N(0,\Sigma)$, the covariates $U_i = (u_{i,1},\ldots,u_{i,q})^\top$ satisfy $u_{i,j} \sim N(10 z_{i,j}, 1)$ for $j = 1,\ldots,q$ (with $2q = p$), and the $\epsilon_i$'s follow the standard normal distribution $N(0,1)$ for $i = 1,\ldots,n$. The true value $\beta_0$ of $\beta = (\beta_c, \beta_x^\top)^\top \in \mathbb{R}^{p+1}$ is taken as $\beta_0 = (1.0, 0.6, 0.5, 0.0, \ldots, 0.0)^\top$, which implies that there are two non-zero elements and $p-2$ zero elements in $\beta_x$, corresponding to two active and $p-2$ inactive covariates. The true value of the $(k,l)$th component of $\Sigma = (\rho_{kl})$ is taken to be $\rho_{kl} = 0.5^{|k-l|}$. Here, we assume that the covariates $X_i$ are completely observed, while the responses $Y_i$ are subject to missingness. To this end, we regard $\delta_i$ as the missingness indicator for $Y_i$, i.e., $\delta_i = 0$ if $Y_i$ is missing and $\delta_i = 1$ if $Y_i$ is observed, and denote $\pi_i(U_i;\gamma) = \Pr(\delta_i = 1 \mid U_i)$. To create missing data for the responses, we consider the following propensity score functions:
(M1) $\mathrm{logit}\{\pi_i(U_i;\gamma)\} = \gamma_c + U_i^\top\gamma_u$, where $U_i = (u_{i1},\ldots,u_{iq})^\top$ is simulated from the normal distribution above, $\gamma = (\gamma_c, \gamma_u^\top)^\top = (\gamma_0, \gamma_1, \ldots, \gamma_q)^\top$, and $\mathrm{logit}(b) = \log\{b/(1-b)\}$;
(M2) $\pi_i(U_i;\gamma) = 0.5$ if $\gamma_c + U_i^\top\gamma_u \leq a$, and $\pi_i(U_i;\gamma) = 0.5$ if $\gamma_c + U_i^\top\gamma_u > a$;
(M3) $\pi_i(U_i;\gamma) = \Phi(\gamma_c + U_i^\top\gamma_u)$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution.
The true value $\gamma_0$ of $\gamma \in \mathbb{R}^{q+1}$ is taken as $\gamma_0 = (1.0, 0.8, 0.5, 0.0, \ldots, 0.0)^\top$, which indicates that there are three non-zero components and $q-2$ zero elements in $\gamma_0$, corresponding to two active predictors and $q-2$ inactive predictors. To illustrate the proposed method, we consider the following unconditional moments: $g(D_i;\beta) = Z_i(Y_i - \beta_c - Z_i^\top\beta_x)$ and
$\varphi(U_i, Z_i;\gamma) = \left\{\dfrac{\delta_i}{\pi_i(U_i;\gamma)} - 1\right\} Z_i, \quad i = 1,\ldots,n,$
with $\pi_i(U_i;\gamma)$ specified as in model (M1). Thus, (M1) is a correctly specified model, while (M2) and (M3) are misspecified models used to show the robustness of the proposed method, and $E\{g(D_i;\beta_0)\} = 0$ and $E\{\varphi(U_i, Z_i;\gamma_0)\} = 0$.
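A minimal generator for one replication of this design under mechanism (M1) is sketched below; the function name, the seed handling, and the return layout are our own choices, not part of the paper.

```python
import numpy as np

def simulate_exp1(n, p, q, rng=None):
    """One synthetic dataset following Experiment 1 (linear model, MAR response)."""
    rng = np.random.default_rng(rng)
    k = np.arange(q)
    Sigma = 0.5 ** np.abs(np.subtract.outer(k, k))       # rho_kl = 0.5^{|k-l|}
    Z = rng.multivariate_normal(np.zeros(q), Sigma, n)   # instruments, p - q = q columns
    U = 10.0 * Z + rng.standard_normal((n, q))           # u_ij ~ N(10 z_ij, 1)
    X = np.hstack([U, Z])                                # p-dimensional covariates
    beta0 = np.r_[1.0, 0.6, 0.5, np.zeros(p - 2)]        # (beta_c, beta_x)
    Y = beta0[0] + X @ beta0[1:] + rng.standard_normal(n)
    gamma0 = np.r_[1.0, 0.8, 0.5, np.zeros(q - 2)]       # (gamma_c, gamma_u)
    pi = 1.0 / (1.0 + np.exp(-(gamma0[0] + U @ gamma0[1:])))
    delta = rng.binomial(1, pi)                          # MAR indicator under (M1)
    return X, Y, U, Z, delta
```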
We consider five combinations of $(n, p+1, q+1)$: $(50, 9, 5)$, $(100, 13, 7)$, $(200, 17, 9)$, $(500, 23, 12)$, and $(500, 101, 51)$, where p is the even-integer part of $(5n)^{2/5}$ or $9n^{2/5} - 8$, and $q = p/2$. The tuning parameter $\nu$ is selected with the BIC, and the SCAD is taken as the penalty function $p_\nu(\cdot)$. For each of the five combinations of $(n, p, q)$, we simulate 1000 datasets, which are used to estimate the parameters $\theta = \{\beta, \gamma\}$ via the proposed method. We calculate the bias, standard deviation (SD), and root mean square error (RMSE) of the estimates of $\theta$ to assess the accuracy of the proposed method, where $\mathrm{bias} = T^{-1}\sum_{t=1}^T \theta_j^{(t)} - \theta_{0j}$ (i.e., the difference between the mean of the estimates over $T = 1000$ replications and the true value), $\mathrm{SD} = \{T^{-1}\sum_{t=1}^T (\theta_j^{(t)} - \bar\theta_j)^2\}^{1/2}$ (i.e., the standard deviation of the $T = 1000$ estimates), and $\mathrm{RMSE} = \{T^{-1}\sum_{t=1}^T (\theta_j^{(t)} - \theta_{0j})^2\}^{1/2}$ (i.e., the root mean square error of the estimates about the true value), with $\bar\theta_j = T^{-1}\sum_{t=1}^T \theta_j^{(t)}$, in which $\theta_j^{(t)}$ is the penalized ET likelihood estimate of the jth component of $\theta$ for the tth replication, $j = 1,\ldots,p+q+2$. To assess the oracle property of the proposed variable selection method, we evaluate the average number of correctly estimated zero components (denoted 'TP') and the average number of incorrectly estimated zero components (denoted 'FP'). Also, to assess the model fitting effect, we calculate the proportions of under-fitting (denoted 'UF'), correct fitting (denoted 'CF'), and over-fitting (denoted 'OF') over the 1000 repeated simulations, where under-fitting refers to the case that some non-zero components are incorrectly estimated as zero, over-fitting refers to the case that some zero components are incorrectly estimated as non-zero while all non-zero components are correctly estimated as non-zero, and correct fitting refers to the case that all zero components are correctly estimated as zero and all non-zero components are correctly estimated as non-zero. The results for the bias, SD, and RMSE values of the non-zero parameters in $\theta$ and for variable selection are given in Table 1. An examination of Table 1 shows that (i) the bias values of all parameter estimates are relatively small, and the SD and RMSE values are almost identical; (ii) as the sample size n increases, the average number of correctly estimated zero components tends towards 100% and the average number of incorrectly estimated zero components tends towards 0, implying that the proposed variable selection method performs well; (iii) as the sample size n increases, UF and OF tend towards 0 and CF tends towards 100%; and (iv) although the UF, CF, and OF values corresponding to the misspecified models (M2 and M3) differ from those corresponding to the correctly specified model (M1), the differences are relatively small, implying that the proposed penalized ET likelihood method is robust to model misspecification.
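The three summary measures defined above reduce to a few NumPy lines; `estimates` collects the T replicated estimates of a single component:

```python
import numpy as np

def summarize(estimates, theta0_j):
    """Bias, SD, and RMSE of T replicated estimates of one component theta_j."""
    est = np.asarray(estimates)
    bias = est.mean() - theta0_j
    sd = est.std(ddof=0)
    rmse = np.sqrt(np.mean((est - theta0_j) ** 2))
    return bias, sd, rmse
```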
Experiment 2 (Logistic Model). The data $\{Y_i : i = 1,\ldots,n\}$ are generated from the following logistic model: $Y_i \sim \mathrm{Bernoulli}(p_i)$ with $\mathrm{logit}(p_i) = \beta_c + X_i^\top\beta_x$, where $X_i = (U_i^\top, Z_i^\top)^\top$ is the p-dimensional covariate vector, the instrumental variables satisfy $Z_i \sim N_p(0, I)$, where I is a $p\times p$ identity matrix, and $\beta = (\beta_c, \beta_x^\top)^\top$. The q-dimensional covariates $U_i = (u_{i,1},\ldots,u_{i,q})^\top$ satisfy $u_{i,j} \sim N(z_{i,j}, 100)$ for $j = 1,\ldots,q$ (with $2q = p$). Similarly to Experiment 1, we assume that the $X_i$'s are fully observed, while the $Y_i$'s are subject to missingness, and the missingness indicators $\delta_i$ for the responses $Y_i$ are simulated from the Bernoulli distribution with propensity score function $\pi_i(U_i;\gamma)$ and $\mathrm{logit}\{\pi_i(U_i;\gamma)\} = \gamma_c + U_i^\top\gamma_u$, where $\gamma = (\gamma_c, \gamma_u^\top)^\top$. Here, we consider the same true values of $\{\beta, \gamma\}$ as those given in Experiment 1 and four combinations of $(n, p+1, q+1)$: $(50, 9, 5)$, $(100, 13, 7)$, $(200, 17, 9)$, and $(500, 101, 51)$, where p is the even-integer part of $(5n)^{2/5}$ or $9n^{2/5} - 8$ and $q = p/2$. As an illustration of the proposed method, we consider the following unconditional moment constraints:
$g(D_i;\beta) = (1, Z_i^\top)^\top\left\{Y_i - \dfrac{\exp(\beta_c + Z_i^\top\beta_x)}{1 + \exp(\beta_c + Z_i^\top\beta_x)}\right\}, \quad i = 1,\ldots,n.$
The missingness mechanism model for fitting the data is given in Equation (1). We select the tuning parameter $\nu$ with the BIC criterion and take the SCAD as the penalty function. For simplicity, let $\omega_1 = n^{-1}\sum_{i=1}^n (1, Z_i^\top)^\top \log(\tilde p_i/(1-\tilde p_i))$ and $\omega_2 = n^{-1}\sum_{i=1}^n (1, U_i^\top)^\top \log(\tilde\pi_i/(1-\tilde\pi_i))$, and let $\omega = (\omega_1^\top, \omega_2^\top)^\top$ be an s-dimensional vector, where $\tilde p_i = n^{-1}\sum_{i=1}^n y_i$, $\tilde\pi_i = n^{-1}\sum_{i=1}^n \delta_i$, $\omega_j$ is the jth component of $\omega$, $I(A)$ is the indicator function of the event A, and $\{B\}_+ = \max(0, B)$. For comparison, we consider the hard-threshold (HT) estimator $\hat\theta_j^{\mathrm{HT}} = \omega_j I(|\omega_j| > \varsigma_1)$, the soft-threshold (ST) estimator $\hat\theta_j^{\mathrm{ST}} = \mathrm{sign}(\omega_j)\{|\omega_j| - \varsigma_2\}_+$, and the penalized least squares estimator with the SCAD penalty function (denoted 'PLS'). The proposed method is denoted 'PETL'; the tuning parameters $\{\varsigma_1, \varsigma_2\}$ are evaluated with the five-fold cross-validation method. The corresponding results for the non-zero parameters in $\beta$ and $\gamma$ and for variable selection over 1000 replications are given in Table 2.
An examination of Table 2 shows that (i) the proposed method and the competing methods have relatively small bias values and negligible differences between SD and RMSE values, regardless of the sample size and the dimensionality, which implies that all four compared methods can estimate the non-zero parameters well; (ii) the proposed method has larger TP and CF values and smaller FP and UF values than the three competing methods, regardless of the sample size and the dimensionality, which implies that the proposed method behaves better than its competitors; (iii) as the sample size increases, the TP values of the proposed method approach $p - 3$, its FP, UF, and OF values approach 0, and its CF values approach one; (iv) the HT and ST methods perform poorly in variable selection due to the lack of a penalty function; and (v) when the sample size is small, the PLS method behaves unsatisfactorily in variable selection, even though the SCAD penalty function is used. In short, the proposed method outperforms the other three methods in terms of both parameter estimation and variable selection.
Experiment 3 (Structural Equation Model). To investigate the performance of the proposed method in a more complicated setting, we consider a structural equation model, which is specified by
$Y_i = B X_i + \epsilon_i, \quad X_i = C X_i + \varepsilon_i, \quad i = 1,\ldots,n,$
where $Y_i$ is an e-dimensional vector of manifest variables and $X_i = (U_i^\top, Z_i^\top)^\top$ is an f-dimensional vector of latent variables, in which $U_i = (u_{i,1},\ldots,u_{i,h})^\top$ is an h-dimensional vector and $Z_i = (z_{i,1},\ldots,z_{i,f-h})^\top$ is an $(f-h)$-dimensional vector of instrumental variables with $f = 2h$; B is an $e\times f$ factor loading matrix; C is an $f\times f$ coefficient matrix used to identify the dependence structure between the latent variables; and $\epsilon_i$ and $\varepsilon_i$ are measurement errors. It is assumed that the $\epsilon_i$'s follow the multivariate normal distribution with zero mean and covariance $\Sigma_\epsilon = \mathrm{diag}(\alpha_1,\ldots,\alpha_e)$, i.e., $\epsilon_i \sim N(0,\Sigma_\epsilon)$, and $\varepsilon_i = (\varepsilon_{u,i}^\top, \varepsilon_{z,i}^\top)^\top$, where $\varepsilon_{u,i}$ is an h-dimensional zero vector and the $\varepsilon_{z,i}$'s follow the multivariate normal distribution $N(0,\Sigma_\varepsilon)$ with $\Sigma_\varepsilon = \mathrm{diag}(\zeta_1,\ldots,\zeta_{f-h})$, and $e = 2f$. The dataset $\{Y_i\}_{i=1}^n$ is generated from the structural equation model introduced above, with the following specification of B and C:
$B = \begin{pmatrix} \beta_{1,1} & \cdots & \beta_{1,f} \\ \vdots & \ddots & \vdots \\ \beta_{e,1} & \cdots & \beta_{e,f} \end{pmatrix}, \quad C = \begin{pmatrix} 0 & C_u \\ 0 & C_z \end{pmatrix},$
where
$C_u = \begin{pmatrix} 1 & 0.5 & 0.0 & 0.0 \\ 0.5 & 1 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1 & 0.5 \\ 0.0 & 0.0 & 0.5 & 1 \end{pmatrix}, \quad C_z = \begin{pmatrix} 1 & 0.8 & 0.0 & 0.0 \\ 0.8 & 1 & 0.0 & 0.0 \\ 0.0 & 0.0 & 1 & 0.8 \\ 0.0 & 0.0 & 0.8 & 1 \end{pmatrix}$
are the $h\times(f-h)$ and $(f-h)\times(f-h)$ matrices, respectively. Here, we take the true values of $\zeta_m$ and $\alpha_k$ to be 0.8 for $m = 1,\ldots,f-h$ and $k = 1,\ldots,e$, and set the true value of $\beta_{j_1,j_2}$ to be 0.6 for $|j_1 - j_2| = 1$ and 0.0 otherwise, which indicates that the number of non-zero parameters is $2(f-1)$. Because $\beta_{k,l}$ measures the dependence of the manifest variable $Y_{i,k}$ on the latent variable $X_{i,l}$, we regard only the $\beta_{k,l}$'s as parameters of interest, which indicates that $p = ef = 2f^2$. To create missing values in the components of $Y_i$ for $i = 1,\ldots,n$, we consider the missingness data mechanism model given in Equation (1). The true value of the q-dimensional parameter vector $\gamma = (\gamma_c, \gamma_u^\top)^\top$ in Equation (1) is taken as $\gamma = (0.8, 0.8, 0.0, \ldots, 0.0)^\top$ with $q = h+1$. As an illustration of the proposed method, we consider the following unconditional moment constraints:
$g(Y_i;\beta) = \mathrm{vech}\{Y_i Y_i^\top - B(I-C)^{-1}\Sigma_\varepsilon (I-C)^{-\top} B^\top - \Sigma_\epsilon\},$
which satisfies $E[g(Y_i;\beta_0)] = 0$ for $i = 1,\ldots,n$, where $\beta_0$ is the true value of $\beta$ and $\mathrm{vech}(H)$ represents the half-vectorization of the matrix H. In this case, the number of unconditional moment constraints is $t = f(2f+1) + h$, which shows that the $g(Y_i;\beta)$'s are over-identified conditions (i.e., $t > s$). Under the settings specified above, we consider the following three combinations of the number of parameters of interest p, the number of parameters q in the missingness data mechanism model (1), and the sample size n: $(n, p, q) = (117, 8, 2)$, $(241, 32, 3)$, and $(602, 72, 4)$, where n is taken as the integer part of $((p+80)/18)^3$. For comparison, we also consider two competing methods: the generalized method of moments (GMM) and penalized empirical likelihood (PEL). For each combination of $(n, p, q)$, we generate 1000 datasets. For each of the 1000 simulated datasets, we utilize the proposed method and the two competing methods to estimate parameters and perform variable selection. In implementing the proposed method, we use the BIC criterion to select the tuning parameter $\nu$. The bias, standard deviation (SD), and root mean square error (RMSE) values of the non-zero components of C over 1000 replications are reported in Table 3. Also, the TP, FP, UF, CF, and OF values for identifying the non-zero components of C are given in Table 4.
An examination of Table 3 and Table 4 shows that (i) the SD values of the parameter estimates are almost the same as their corresponding RMSE values for all three methods, regardless of the sample size, which implies that all three methods can estimate the parameters well; (ii) the RMSE value decreases as the sample size increases for all three methods, which indicates that increasing the sample size improves the performance of all three estimation methods; (iii) the proposed penalized ET likelihood estimation behaves better than the two competing methods, in that its total RMSE value is smaller, while the GMM estimation performs the same as the PEL estimation, in that they have the same total RMSE value, indicating that the GMM and PEL methods share the same asymptotic properties [24]; and (iv) the PETL method behaves better than the other two methods in identifying the non-zero components of C, in that it has smaller FP, UF, and OF values and a larger CF value, and its TP value is closer to the number of zero components in C (i.e., $(f-1)^2 + 1$), regardless of the sample size.

5. Real Examples

As an illustration of the proposed method, here we consider thyroid data taken from the First People's Hospital of Yunnan Province. In these data, a thyroid nodule refers to any growth of thyroid cells that forms a lump within the thyroid. Most thyroid nodules do not cause any symptoms, such as pain, difficulty swallowing or breathing, hoarseness, or symptoms of hyperthyroidism; such nodules are usually classified as benign. Here, we explore the relationship between malignant thyroid nodules and variables obtained with ultrasound (US) instruments together with clinical characteristics, namely nodular echogenicity ($x_1$), thyroid background echogenicity ($x_2$), anterior cervical muscle echogenicity ($x_3$), echo ratio between the nodule and anterior cervical muscles ($x_4$), nodule size ($x_5$), longitudinal diameter ($x_6$), horizontal diameter ($x_7$), aspect ratio ($x_8$), age ($x_9$), sex (1 = male, 0 = female; $x_{10}$), number of nodules (1 = more, 0 = less; $x_{11}$), shape (1 = irregular, 0 = regular; $x_{12}$), margins (1 = unclear, 0 = clear; $x_{13}$), margins (1 = angled, 0 = non-angled; $x_{14}$), echogenicity (1 = uniform, 0 = uneven; $x_{15}$), internal structure solid (1 = yes, 0 = no; $x_{16}$), internal structure cystic (1 = yes, 0 = no; $x_{17}$), internal structure cystic–solid mixture (1 = yes, 0 = no; $x_{18}$), sound halo (1 = yes, 0 = no; $x_{19}$), rear attenuation (1 = yes, 0 = no; $x_{20}$), lateral acoustic shadow (1 = yes, 0 = no; $x_{21}$), bloodstream (1 = yes, 0 = no; $x_{22}$), blood supply branch (1 = yes, 0 = no; $x_{23}$), blood supply strip (1 = yes, 0 = no; $x_{24}$), blood supply punctate (1 = yes, 0 = no; $x_{25}$), and calcification (1 = yes, 0 = no; $x_{26}$). Thus, we take the aforementioned 26 variables as covariates, denoted $x = (x_1,\ldots,x_{26})^\top$, and regard the indicator of malignant thyroid nodules (1 = malignant, 0 = benign) as the response variable y, i.e., $y = 1$ if the patient has malignant thyroid nodules and $y = 0$ otherwise. In this study, the sample size is $n = 2430$ and the covariates in x are fully observed, while roughly 40% of the observations of the response variable y are subject to missingness. Here, we assume that the missingness data mechanism for y is missing at random, in that y can be unobserved due to the presence of some covariates rather than due to the judgement of benign or malignant thyroid nodules. Under this assumption, we consider the following logistic regression model:
$\mathrm{logit}\{\Pr(y = 1 \mid x)\} = \beta_0 + x_1\beta_1 + \cdots + x_{26}\beta_{26},$
which leads to the following unconditional moment constraints:
$g(x_i, y_i;\theta) = (1, x_i^\top)^\top\left\{y_i - \dfrac{\exp(\beta_0 + x_{i1}\beta_1 + \cdots + x_{i,26}\beta_{26})}{1 + \exp(\beta_0 + x_{i1}\beta_1 + \cdots + x_{i,26}\beta_{26})}\right\},$
which satisfies $E\{g(x_i, y_i;\theta)\} = 0$, where $\theta = (\beta_0, \beta_1, \ldots, \beta_{26})^\top$ and $x_i = (x_{i,1},\ldots,x_{i,26})^\top$ for $i = 1,\ldots,n$. Let $U_i = (x_{i,1}, x_{i,4}, x_{i,8}, x_{i,14}, x_{i,15}, x_{i,18}, x_{i,25})^\top$, and take the other components of $x_i$ as instrumental variables (denoted $Z_i$), in that the components of $U_i$ can be indirectly determined by the components of $Z_i$; for example, when the blood supply is neither branching nor strip-shaped, it can be determined to be point-shaped. For the missing values of the $y_i$'s, we consider a logistic regression model for the propensity score function $\pi_i(U_i;\gamma)$, i.e., $\mathrm{logit}\{\pi_i(U_i;\gamma)\} = \gamma_c + U_i^\top\gamma_u$ with $\gamma = (\gamma_c, \gamma_1, \ldots, \gamma_q)^\top$ and $q = 7$. To apply the proposed method to this dataset, we consider the following unconditional moments for estimating the unknown parameters in $\gamma$:
$\varphi(U_i, Z_i;\gamma) = \left\{\dfrac{\delta_i}{\pi_i(U_i;\gamma)} - 1\right\} Z_i, \quad i = 1,\ldots,n.$
In utilizing the proposed method, we select the tuning parameter $\nu$ with the BIC criterion and consider three penalty functions, Lasso, adaptive Lasso (denoted ALasso), and SCAD, for comparison. We calculate the PETL estimates of the unknown parameters and their corresponding 95% confidence intervals.
An examination of Table 5 shows that the three penalty functions consistently identify $\beta_2$, $\beta_4$, $\beta_5$, $\beta_6$, $\beta_7$, $\beta_{10}$, $\beta_{16}$, $\beta_{17}$, $\beta_{18}$, $\beta_{19}$, and $\beta_{21}$ as zero, in that their corresponding 95% CIs contain zero, and consistently detect $\beta_{11}$, $\beta_{12}$, $\beta_{13}$, $\beta_{14}$, $\beta_{15}$, $\beta_{20}$, $\beta_{22}$, and $\beta_{26}$ as non-zero, in that their corresponding 95% CIs do not contain zero (i.e., number of nodules, shape, margins, echogenicity, rear attenuation, bloodstream, and calcification are positively associated with malignant thyroid nodules), while $\beta_3$ and $\beta_9$ are identified as having no effect on malignant thyroid nodules by the Lasso and ALasso penalty functions but as having a negative effect by SCAD only. Also, $\beta_{23}$, $\beta_{24}$, and $\beta_{25}$ are detected as having no effect on malignant thyroid nodules only by the Lasso. The PETL estimates of the non-zero parameters obtained with the three penalty functions are almost identical. These results indicate that the Lasso penalty function tends to shrink more parameters to zero, which helps identify important variables, but it may also over-shrink and thereby cause important variables to be ignored. Combining the above results leads to the following conclusions. First, it follows from the negative estimate of $\beta_1$ that there is a negative correlation between nodular echogenicity and malignant thyroid nodules, i.e., hypoechogenicity often indicates that a thyroid nodule is malignant. In addition, it follows from $\hat\beta_8$, $\hat\beta_{13}$, $\hat\beta_{14}$, and $\hat\beta_{26}$ that the aspect ratio, margins, and microcalcifications are closely related to malignant thyroid nodules, which is consistent with the conclusion of [38]. Second, by $\hat\beta_9$ obtained with the SCAD penalty function, there is a positive correlation between age and malignant thyroid nodules. The coefficients associated with nodule size ($x_5$), sex ($x_{10}$), and internal structure (solid $x_{16}$, cystic $x_{17}$, cystic–solid mixture $x_{18}$) are shrunk to zero, implying that these covariates are not associated with malignant thyroid nodules, which is consistent with the conclusion regarding the role of predicting telomerase reverse transcriptase (TERT) promoter mutations in follicular thyroid cancer [39,40]. Third, there are correlations between certain data types, but their interaction effects have not been explored well, in that some of these data types represent certain interaction effects; for example, the aspect ratio represents the interaction effect between the longitudinal and horizontal diameters. By $\hat\beta_6$, $\hat\beta_7$, and $\hat\beta_8$, although the longitudinal diameter ($x_6$) and horizontal diameter ($x_7$) are not correlated with malignant thyroid nodules, their interaction effect (aspect ratio $x_8$) has a negative effect on malignant thyroid nodules.

6. Discussion

This paper studies growing dimensional unconditional moment models in the presence of responses missing at random. It constructs modified unconditional moment models via the inverse probability weighted approach to deal with the missing data, utilizes logistic regression models to describe the propensity score functions, employs calibration conditions to establish estimating equations for the parameters in the logistic regression models, and proposes a penalized log-ET likelihood ratio function, combining the modified unconditional moment models and the calibration conditions, to estimate the unknown parameters in the growing dimensional unconditional moment models and the logistic regression models. To test contrast hypotheses, we construct a constrained penalized ET likelihood ratio statistic, which is theoretically shown to asymptotically follow a central chi-squared distribution; this extends Wilks' theorem to growing dimensional unconditional moment models with responses missing at random. Empirical results demonstrate that the proposed method has better estimation properties and more accurate variable selection than competing methods, such as the GMM and penalized empirical likelihood methods. Also, the empirical results show that the proposed penalized ET likelihood method is robust to misspecified moment models. The thyroid data taken from the First People's Hospital of Yunnan Province are used to illustrate the proposed method, and the parameter estimates are consistent with [38,39,40].
The proposed method has the following merits. First, it can simultaneously estimate the unknown parameters associated with the growing dimensional unconditional moment models and the propensity score functions and select important variables. Second, novel unified estimating equations are constructed to jointly, rather than separately, estimate the unknown parameters by incorporating the growing dimensional unconditional moment models and the calibration conditions, which largely enhances the efficiency of the algorithm. Third, under some regularity conditions, we establish the oracle and asymptotic properties of the parameter estimators.
This paper focuses on responses missing at random and on logistic regression models for the propensity score functions. It would be possible to extend the proposed method to nonignorably missing responses together with propensity score functions of unknown form, for example, by approximating the unknown propensity score functions with neural networks.

Author Contributions

Methodology, X.S., P.Z. and N.T.; Formal analysis, X.S.; Investigation, P.Z.; Writing—original draft, X.S.; Writing—review & editing, P.Z. and N.T.; Supervision, N.T.; Funding acquisition, N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grants from the National Key R&D Program of China (No. 2022YFA1003701) and the National Natural Science Foundation of China (No. 12271472).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets generated and analysed are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no competing interests.

Appendix A. Proofs of Theorems

In what follows, we present details for the proofs of the lemmas and theorems. For simplicity, let P n { f ( α ) } = n − 1 ∑ i = 1 n f ( Z i ; α ) , M ( θ ) = E { ∂ G ( D i ; θ ) / ∂ θ ⊤ } , Σ ( θ ) = E [ { G ( D i ; θ ) − E [ G ( D i ; θ ) ] } { G ( D i ; θ ) − E [ G ( D i ; θ ) ] } ⊤ ] , Σ = Σ ( θ 0 ) , M = M ( θ 0 ) , υ i = λ ⊤ G ( D i ; θ ) for i = 1 , … , n , and λ ¯ = arg inf λ ∈ Λ n l n ( λ , θ ) , where Λ n = { λ ∈ R t : || λ || ≤ o p ( t 1 / 2 n − 1 / ι ) } for some ι > 2 , and t = r + m is the number of modified estimating equations, comprising the growing-dimensional unconditional moment model and the calibration conditions.
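For ease of reference in the proofs below, we restate the log-ET likelihood and its penalized version in this notation, as they are used in the main text:
\[
l_n(\lambda,\theta) = \log \mathbb{P}_n\big[\exp\{\lambda^{\top}G(D_i;\theta)\}\big], \qquad
l_{pn}(\theta) = l_n(\lambda,\theta) - \sum_{j=1}^{s} p_{\nu}(|\theta_j|).
\]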
Proof of Lemma 1.
Taking the Taylor expansion of l ( λ ¯ , θ ) at λ = 0 , we have
\[
l(\bar{\lambda},\theta) = l(0,\theta) + \bar{\lambda}^{\top}\mathbb{P}_n[G(\theta)] + \frac{1}{2}\,\bar{\lambda}^{\top}\Big\{\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2} l(\dot{v}^{\top}G(D_i;\theta))}{\partial \upsilon_i^{2}}\,G(D_i;\theta)G(D_i;\theta)^{\top}\Big\}\bar{\lambda} \;\ge\; -\|\bar{\lambda}\|\cdot\|\mathbb{P}_n[G(\theta)]\| + C\|\bar{\lambda}\|^{2}, \tag{A1}
\]
where P n { G ( θ ) } = n − 1 ∑ i = 1 n G ( D i ; θ ) . On the other hand,
\[
l(\bar{\lambda},\theta) \le l(0,\theta) = 0. \tag{A2}
\]
Combining (A1) and (A2) yields || λ ¯ || ≤ C − 1 || P n [ G ( θ ) ] || . It follows from Assumptions 1, 2, and 4 that the estimating equations G ( D i ; θ ) belong to the P-Glivenko–Cantelli class and the collection of components of G ( D i ; θ ) belongs to the P-Donsker class; then,
\[
\|\mathbb{P}_n[G(\theta)] - E[G(D_i;\theta)]\| = O_p(\sqrt{t/n}), \quad
\|\mathbb{P}_n[G(\theta)G(\theta)^{\top}] - \Sigma(\theta)\| = O_p(\sqrt{t/n}), \quad
\|\mathbb{P}_n[\partial G(\theta)/\partial\theta^{\top}] - M(\theta)\| = O_p(\sqrt{ts/n}). \tag{A3}
\]
In addition, by the uniform modulus of continuity results, we have
\[
\big\|\mathbb{P}_n[G(\theta)] - E[G(D_i;\theta)] - \mathbb{P}_n[G(\theta_0)] + E[G(D_i;\theta_0)]\big\| = o_p(\sqrt{t/n}) \tag{A4}
\]
for the estimating equations G ( D ; θ ) uniformly in θ ∈ Θ n . Combining Assumption 5 and Equation (A4) yields || P n [ G ( θ ) ] || = O p ( √ ( t / n ) ) and || λ ¯ || = O p ( √ ( t / n ) ) = o p ( t 1 / 2 n − 1 / ι ) for some ι > 2 . Therefore, λ ¯ ∈ int ( Λ n ) ∩ Λ ^ n ( θ ) , which indicates ∂ l ( λ ¯ , θ ) / ∂ λ = 0 . By the convexity of l n ( λ , θ ) and Λ ^ n ( θ ) , we have λ ¯ = λ ¯ ( θ ) , and arg inf λ ∈ Λ ^ n ( θ ) l n ( λ , θ ) exists. The first-order partial derivative of l n ( λ , θ ) with respect to λ is given by
\[
\frac{\partial l_n(\lambda,\theta)}{\partial\lambda}
= \frac{\frac{1}{n}\sum_{i=1}^{n}\exp\{\lambda^{\top}G(D_i;\theta)\}\,G(D_i;\theta)}{\frac{1}{n}\sum_{i=1}^{n}\exp\{\lambda^{\top}G(D_i;\theta)\}}
= \frac{\sum_{i=1}^{n}\exp\{\lambda^{\top}G(D_i;\theta)\}\,G(D_i;\theta)}{\sum_{i=1}^{n}\exp\{\lambda^{\top}G(D_i;\theta)\}}
= \frac{\sum_{i=1}^{n}\big[1+\lambda^{\top}G(D_i;\theta)(1+o_p(1))\big]\,G(D_i;\theta)}{\sum_{i=1}^{n}\exp\{\lambda^{\top}G(D_i;\theta)\}} = 0,
\]
where the third equality holds because max 1 ≤ i ≤ n sup θ | λ ⊤ G ( D i ; θ ) | = o p ( 1 ) for λ ∈ Λ n . By Assumption 5 and Equation (A3), we have
\[
\bar{\lambda}(\theta) = -\Sigma^{-1}(\theta)\,\mathbb{P}_n[G(\theta)] + o_p(\sqrt{t/n}),
\]
and then the ET likelihood l n ( λ , θ ) can be written as
\[
l_n(\lambda,\theta) = -\tfrac{1}{2}\,\mathbb{P}_n[G(\theta)]^{\top}\Sigma^{-1}(\theta)\,\mathbb{P}_n[G(\theta)] + o_p(t/n). \tag{A5}
\]
Moreover, it follows from Equation (A3) that the expansion of P n [ G ( θ ) ] at θ 0 is given by
\[
\mathbb{P}_n[G(\theta)] = \mathbb{P}_n[G(\theta_0)] + M(\theta-\theta_0) + o_p(t/n) \tag{A6}
\]
for any θ ∈ D n θ . Thus, combining Equations (A5) and (A6) yields
\[
l_n(\lambda,\theta) = -\tfrac{1}{2}(\theta-\theta_0)^{\top}M^{\top}\Sigma^{-1}M(\theta-\theta_0) - (\theta-\theta_0)^{\top}M^{\top}\Sigma^{-1}\mathbb{P}_n[G(\theta_0)] - \tfrac{1}{2}\,\mathbb{P}_n[G(\theta_0)]^{\top}\Sigma^{-1}\mathbb{P}_n[G(\theta_0)] + o_p(t/n), \tag{A7}
\]
and
\[
\hat{\theta}_{ET} - \theta_0 = -(M^{\top}\Sigma^{-1}M)^{-1}M^{\top}\Sigma^{-1}\mathbb{P}_n[G(\theta_0)], \tag{A8}
\]
where θ ^ ET is the ET likelihood estimator of θ . Combining Equations (A7) and (A8) yields
\[
\begin{aligned}
l_n(\lambda,\theta) &= -\tfrac{1}{2}(\theta-\theta_0)^{\top}M^{\top}\Sigma^{-1}M(\theta-\theta_0) + (\theta-\theta_0)^{\top}M^{\top}\Sigma^{-1}M(\hat{\theta}_{ET}-\theta_0) - \tfrac{1}{2}\,\mathbb{P}_n[G(\theta_0)]^{\top}\Sigma^{-1}\mathbb{P}_n[G(\theta_0)] + o_p(t/n) \\
&= -\tfrac{1}{2}(\theta-\theta_0)^{\top}M^{\top}\Sigma^{-1}M(\theta-2\hat{\theta}_{ET}+\theta_0) - \tfrac{1}{2}\,\mathbb{P}_n[G(\theta_0)]^{\top}\Sigma^{-1}\mathbb{P}_n[G(\theta_0)] + o_p(t/n) \\
&= -\tfrac{1}{2}(\theta-\hat{\theta}_{ET})^{\top}M^{\top}\Sigma^{-1}M(\theta-\hat{\theta}_{ET}) + o_p(t/n). \tag{A9}
\end{aligned}
\]
By the local quadratic approximation to the penalty function p ν ( · ) , we obtain
\[
p_{\nu}(\theta) = \tfrac{1}{2}(\theta-\theta_0)^{\top}V_0(\theta-\theta_0) + o_p(t/n). \tag{A10}
\]
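For intuition, the local quadratic approximation invoked here is the standard one of Fan & Li [13]: for θ j close to θ 0 j , each penalty component is approximated by a quadratic, so that V 0 is diagonal. As a sketch of that standard construction (not a statement of the authors' exact tuning), and with the SCAD derivative from [13] (with the conventional choice a = 3.7) as one concrete example of ṗ ν :
\[
p_{\nu}(|\theta_j|) \approx p_{\nu}(|\theta_{0j}|) + \frac{1}{2}\,\frac{\dot p_{\nu}(|\theta_{0j}|)}{|\theta_{0j}|}\big(\theta_j^{2}-\theta_{0j}^{2}\big),
\qquad
V_0 = \mathrm{Diag}\Big\{\frac{\dot p_{\nu}(|\theta_{01}|)}{|\theta_{01}|},\dots,\frac{\dot p_{\nu}(|\theta_{0s}|)}{|\theta_{0s}|}\Big\},
\]
\[
\dot p_{\nu}(t) = \nu\Big\{I(t\le\nu) + \frac{(a\nu - t)_{+}}{(a-1)\nu}\,I(t>\nu)\Big\}, \qquad t>0,\ a=3.7.
\]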
Combining Equations (A9) and (A10) yields
\[
\begin{aligned}
l_{pn}(\theta) = l_n(\lambda,\theta) - p_{\nu}(\theta)
&= -\tfrac{1}{2}(\theta-\hat{\theta}_{ET})^{\top}M^{\top}\Sigma^{-1}M(\theta-\hat{\theta}_{ET}) - \tfrac{1}{2}(\theta-\theta_0)^{\top}V_0(\theta-\theta_0) + R_n \\
&= -\tfrac{1}{2}(\theta-\dot{\theta})^{\top}V(\theta-\dot{\theta}) + C_{l_{pn}} + o_p(t/n),
\end{aligned}
\]
where C l p n = − ½ θ 0 ⊤ V 0 θ 0 − ½ θ ^ ET ⊤ M ⊤ Σ − 1 M θ ^ ET + ½ θ ˙ ⊤ V θ ˙ is a constant, V = V 0 + M ⊤ Σ − 1 M , and θ ˙ = V − 1 ( V 0 θ 0 + M ⊤ Σ − 1 M θ ^ ET ) . Hence, the penalized log-ET likelihood ratio function can be expressed as
\[
l_{pn}(\theta) = -\tfrac{1}{2}(\theta-\dot{\theta})^{\top}V(\theta-\dot{\theta}) + o_p(t/n). \tag{A11}
\]
Thus, we complete the proof of Lemma 1. □
Proof of Theorem 1.
Let θ ˇ PET = ( θ ˇ 1 PET , θ ˇ 2 PET ) be the penalized ET likelihood estimator of θ when the model is correctly specified.
(i) First, we show that || θ ˇ 1 PET − θ 10 || → 0 as n → ∞ . Let θ * = ( θ 1 ⊤ , 0 ⊤ ) ⊤ ; for the constrained subspace { θ ∈ R s : θ A θ c = 0 } , it follows from p ν ( 0 ) = 0 and L ( θ ) = L ( θ * ) that l p n ( θ ) = l p n ( θ * ) . Let θ ˇ * = ( θ ˇ 1 PET ⊤ , 0 ⊤ ) ⊤ be the maximizer of l p n ( θ ) on the constrained subspace { θ ∈ R s : θ A θ c = 0 } . Thus, it follows from the proof of Lemma 1 that || P n [ G ( θ ˇ * ) ] || = O p ( √ ( d θ / n ) ) . Next, we draw the conclusion by contradiction: if || θ ˇ 1 PET − θ 10 || does not converge to zero in probability, then there exists a subsequence { n f , d θ f , t f } such that || θ ˇ 1 n f − θ 10 || ≥ ϵ 1 with probability tending to one for some constant ϵ 1 > 0 . On the other hand, from Assumption 6 we have lim inf d θ , t → ∞ ψ 1 ( d θ , t ) > 0 and || E [ G ( D i ; θ ˇ n f * ) ] || = o p { ψ 1 ( d θ f , t f ) } + O p ( √ ( d θ f / n f ) ) , which conflicts with || E [ G ( D i ; θ ˇ n f * ) ] || ≥ ψ 1 ( d θ f , t f ) ψ 2 ( ϵ 1 ) . Hence, || θ ˇ 1 PET − θ 10 || → 0 as n → ∞ . In addition, Assumption 4 indicates that || P n [ G ( θ ˇ * ) ] − P n [ G ( θ 0 ) ] || ≥ C || θ ˇ 1 PET − θ 10 || with probability tending to one for some constant C ; thus, we have || θ ˇ 1 PET − θ 10 || = O p ( √ ( d θ / n ) ) .
Next, we show that θ ˇ 2 PET = 0 with probability tending to one. Let A ˇ θ = { j : θ ˇ j PET ≠ 0 } be the index set of the non-zero components of the penalized ET likelihood estimator θ ˇ PET under the correctly specified model, with complement A ˇ θ c = { j : θ ˇ j PET = 0 } . Hence, we need to show that Pr ( A ˇ θ c ≠ A θ c ) → 0 , or equivalently Pr ( A ˇ θ ≠ A θ ) → 0 , as n → ∞ . The tuning parameter ν in the penalty function controls the variable selection process to some extent; without loss of generality, there exists a positive function η ( ν ) related to ν such that A ˇ θ = { j : | θ ˇ j PET | > η ( ν ) } . It is not hard to see that the event { A ˇ θ ≠ A θ } is equivalent to the event { | θ ˇ j PET | ≤ η ( ν ) for some j ∈ A θ } ∪ { | θ ˇ j PET | > η ( ν ) for some j ∈ A θ c } . Note that
\[
\begin{aligned}
\Pr(\{|\check{\theta}_j^{PET}| \le \eta(\nu) \text{ for some } j \in A_{\theta}\})
&\le \sum_{j \in A_{\theta}} \Pr(|\check{\theta}_j^{PET}| \le \eta(\nu))
= \sum_{j \in A_{\theta}} \Pr(|\theta_{0j}| - |\check{\theta}_j^{PET}| \ge |\theta_{0j}| - \eta(\nu)) \\
&\le \sum_{j \in A_{\theta}} \Pr(|\theta_{0j} - \check{\theta}_j^{PET}| \ge |\theta_{0j}| - \eta(\nu))
\le \sum_{j \in A_{\theta}} \Pr\Big(|\theta_{0j} - \check{\theta}_j^{PET}| \ge \min_{j \in A_{\theta}}|\theta_{0j}| - \eta(\nu)\Big) \\
&\le \sum_{j \in A_{\theta}} \Pr(|\check{\theta}_j^{PET} - \theta_{0j}| \ge \eta(\nu)),
\end{aligned}
\]
where the last inequality holds due to Assumption 8. Similarly, we have Pr ( { | θ ˇ j PET | > η ( ν ) for some j ∈ A θ c } ) ≤ ∑ j ∈ A θ c Pr ( | θ ˇ j PET − θ 0 j | ≥ η ( ν ) ) . Combining the above inequalities leads to Pr ( { A ˇ θ ≠ A θ } ) ≤ ∑ j = 1 s Pr ( | θ ˇ j PET − θ 0 j | ≥ η ( ν ) ) . In addition, from the proof of Lemma 1, we have θ ˙ = ( M ⊤ Σ − 1 M + V 0 ) − 1 ( M ⊤ Σ − 1 M θ ˇ PET ) and θ ˇ PET = − ( M ⊤ Σ − 1 M ) − 1 M ⊤ Σ − 1 P n [ G ( θ 0 ) ] + θ 0 ; it follows that
\[
\dot{\theta} - \theta_0 = -(M^{\top}\Sigma^{-1}M+V_0)^{-1}V_0\theta_0 - (M^{\top}\Sigma^{-1}M+V_0)^{-1}M^{\top}\Sigma^{-1}\mathbb{P}_n[G(\theta_0)],
\]
which yields E [ θ ˙ − θ 0 ] = − ( M ⊤ Σ − 1 M + V 0 ) − 1 V 0 θ 0 = μ and var [ θ ˙ − θ 0 ] = ( M ⊤ Σ − 1 M + V 0 ) − 1 M ⊤ Σ − 1 M ( M ⊤ Σ − 1 M + V 0 ) − 1 = σ . Let μ m = max j { | μ j | } and σ m = EV max { σ } , where EV max { σ } represents the maximum eigenvalue of σ . Then, for some η ( ν ) > 0 ,
\[
\sum_{j=1}^{s}\Pr\big(|\check{\theta}_j^{PET}-\theta_{0j}| \ge \eta(\nu)\big)
= \sum_{j=1}^{s}\Pr\big((|\check{\theta}_j^{PET}-\theta_{0j}|-|\mu_j|)/\sigma_m \ge (\eta(\nu)-|\mu_j|)/\sigma_m\big)
\le \sum_{j=1}^{s}\Pr\big(|\check{\theta}_j^{PET}-\theta_{0j}-\mu_j|/\sigma_{jj} \ge (\eta(\nu)-\mu_j)/\sigma_m\big)
\le s\,\Pr\big(|z| \ge (\eta(\nu)-\mu_m)/\sigma_m\big),
\]
where μ j is the jth component of μ , σ j j is the jth diagonal element of σ , and z is a standard normal random variable. By the Chebyshev inequality, we have
\[
\sum_{j=1}^{s}\Pr\big(|\check{\theta}_j^{PET}-\theta_{0j}| \ge \eta(\nu)\big) \le \frac{s\,\sigma_m}{(\eta(\nu)-\mu_m)^{2}}. \tag{A12}
\]
By some algebraic operations,
\[
\sigma = (M^{\top}\Sigma^{-1}M+V_0)^{-1}M^{\top}\Sigma^{-1}M(M^{\top}\Sigma^{-1}M+V_0)^{-1}
= (M^{\top}\Sigma^{-1}M+V_0)^{-1}(M^{\top}\Sigma^{-1}M+V_0-V_0)(M^{\top}\Sigma^{-1}M+V_0)^{-1}
= (M^{\top}\Sigma^{-1}M+V_0)^{-1} - V_0(M^{\top}\Sigma^{-1}M+V_0)^{-2}.
\]
Thus, we have
\[
\sigma_m = \mathrm{EV}_{\max}\big\{(M^{\top}\Sigma^{-1}M+V_0)^{-1} - V_0(M^{\top}\Sigma^{-1}M+V_0)^{-2}\big\} \le \mathrm{EV}_{\max}\big\{(M^{\top}\Sigma^{-1}M+V_0)^{-1}\big\} \le \varepsilon_n^{-1},
\]
where ε n is the smallest eigenvalue of the matrix M ⊤ Σ − 1 M + ϕ I and ϕ is a diagonal element of V 0 . It follows from Assumptions 4 and 7 that M ⊤ Σ − 1 M is a positive definite matrix, which leads to ε n ≥ ϕ and then σ m ≤ ϕ − 1 . Note that the kth diagonal element of the matrix V 0 is V 0 k k = p ˙ ν ( | θ k ξ | ) / | θ k ξ | , where θ k ξ is a point lying on the line segment between θ k and θ 0 k , for θ k ∈ { θ k : | θ k − θ 0 k | ≤ C √ ( t / n ) } . It follows from Assumption 8 that V 0 k k = p ˙ ν ( | θ k ξ | ) / | θ k ξ | ≥ p ˙ ν ( | θ k | ) / η ( ν ) ; hence, ϕ ≥ min k V 0 k k = O { min k p ˙ ν ( | θ k | ) / η ( ν ) } and then σ m ≤ O { η ( ν ) / min k p ˙ ν ( | θ k | ) } . In addition, it is not difficult to obtain that
\[
\mu_m = \big\|(M^{\top}\Sigma^{-1}M+V_0)^{-1}V_0\theta_0\big\| = \big\|(M^{\top}\Sigma^{-1}M+\phi I)^{-1}\phi\,\theta_0\big\|.
\]
Implementing the eigenvalue decomposition of the matrix M ⊤ Σ − 1 M , we have M ⊤ Σ − 1 M = Q J Q ⊤ , where Q is an orthogonal matrix and J = Diag ( j 1 , … , j s ) is a diagonal matrix. Subsequently, μ m = || Q J ˜ Q ⊤ θ 0 || , where J ˜ = Diag ( ϕ / ( j 1 + ϕ ) , … , ϕ / ( j s + ϕ ) ) . Further, we obtain
\[
\|Q\tilde{J}Q^{\top}\theta_0\| = \|\tilde{J}Q^{\top}\theta_0\| \le \frac{\phi}{j_{*}+\phi}\,\|\theta_0\| \le \frac{\max_k V_{0kk}}{j_{*}+\max_k V_{0kk}}\,\|\theta_0\|,
\]
where j * represents the minimum diagonal element of the matrix J . Since all components of θ 0 are finite, Assumptions 8(ii) and 9(ii) give μ m ≤ O { η ( ν ) } . Substituting μ m ≤ O { η ( ν ) } and σ m ≤ O { η ( ν ) / min k p ˙ ν ( | θ k | ) } back into (A12) yields
\[
\sum_{j=1}^{s}\Pr\big(|\check{\theta}_j^{PET}-\theta_{0j}| \ge \eta(\nu)\big) \le \frac{s\,\sigma_m}{(\eta(\nu)-\mu_m)^{2}} \le \frac{s\,O\{\eta(\nu)/\min_k \dot p_{\nu}(|\theta_k|)\}}{[\eta(\nu)-O\{\eta(\nu)\}]^{2}} \le O\Big\{\frac{s}{\eta(\nu)\,\min_k \dot p_{\nu}(|\theta_k|)}\Big\}. \tag{A13}
\]
Combining Equation (A13) and Assumption 9(iii), we obtain that Pr ( { A ˇ θ ≠ A θ } ) → 0 as n → ∞ , which indicates that θ ˇ 2 PET = 0 with probability tending to one.
Below, we prove part (ii); let
\[
H_{n1}(\lambda,\theta) = \frac{\partial l_{pn}(\lambda,\theta)}{\partial\lambda} = \frac{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\,G(\theta)\big]}{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\big]}, \qquad
H_{n2}(\lambda,\theta) = \frac{\partial l_{pn}(\lambda,\theta)}{\partial\theta} = \frac{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\,\{\partial_{\theta}G(\theta)\}^{\top}\lambda\big]}{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\big]},
\]
θ ˇ PET = arg max θ ∈ Θ inf λ ∈ Λ ^ n ( θ ) l p n ( θ ) and λ ˇ PET = arg inf λ ∈ Λ ^ n ( θ ) max θ ∈ Θ l p n ( θ ) under the correctly specified model; hence, H n 1 ( λ ˇ PET , θ ˇ PET ) = 0 and H n 2 ( λ ˇ PET , θ ˇ PET ) = 0 . Subsequently, taking partial derivatives of H n 1 ( λ , θ ) and H n 2 ( λ , θ ) with respect to θ and λ , respectively, we have
\[
\begin{aligned}
H_{n11}(0,\theta_0) &= \partial H_{n1}(0,\theta_0)/\partial\lambda^{\top} = \mathbb{P}_n[G(\theta_0)G(\theta_0)^{\top}] - \mathbb{P}_n[G(\theta_0)]\,\mathbb{P}_n[G(\theta_0)]^{\top}, \\
H_{n12}(0,\theta_0) &= \partial H_{n1}(0,\theta_0)/\partial\theta^{\top} = \mathbb{P}_n[\partial_{\theta^{\top}}G(\theta_0)], \\
H_{n21}(0,\theta_0) &= \partial H_{n2}(0,\theta_0)/\partial\lambda^{\top} = \mathbb{P}_n[\partial_{\theta^{\top}}G(\theta_0)]^{\top}, \\
H_{n22}(0,\theta_0) &= \partial H_{n2}(0,\theta_0)/\partial\theta^{\top} = 0.
\end{aligned}
\]
Taking the Taylor expansion of H n 1 ( λ ˇ PET , θ ˇ PET ) = 0 and H n 2 ( λ ˇ PET , θ ˇ PET ) = 0 at ( 0 , θ 0 ) yields
\[
-\begin{pmatrix} H_{n1}(0,\theta_0) \\ 0 \end{pmatrix}
= \begin{pmatrix} K_1 & K_2 \\ K_2^{\top} & 0 \end{pmatrix}
\begin{pmatrix} \check{\lambda}^{PET} - 0 \\ \check{\theta}^{PET} - \theta_0 \end{pmatrix} + R_n, \tag{A14}
\]
where K 1 = E { P n [ G ( θ 0 ) G ( θ 0 ) ⊤ ] } , K 2 = E { P n [ ∂ θ ⊤ G ( θ 0 ) ] } , R n = ∑ k = 1 5 R k n , R n = ( R n ( 1 ) ⊤ , R n ( 2 ) ⊤ ) ⊤ , R 1 n = ( R 1 n ( 1 ) ⊤ , R 1 n ( 2 ) ⊤ ) ⊤ , R 1 n ( 1 ) = ( R 1 n , 1 ( 1 ) , … , R 1 n , t ( 1 ) ) ⊤ , and R 1 n ( 2 ) = ( R 1 n , 1 ( 2 ) , … , R 1 n , s ( 2 ) ) ⊤ , with
\[
R_{1n,j}^{(1)} = \tfrac{1}{2}(\check{\kappa}^{PET}-\kappa_0)^{\top}\,\partial^{2}_{\kappa}H_{n1,j}(\kappa^{*})\,(\check{\kappa}^{PET}-\kappa_0), \quad j = 1,\dots,t; \qquad
R_{1n,j}^{(2)} = \tfrac{1}{2}(\check{\kappa}^{PET}-\kappa_0)^{\top}\,\partial^{2}_{\kappa}H_{n2,j}(\kappa^{*})\,(\check{\kappa}^{PET}-\kappa_0), \quad j = 1,\dots,s,
\]
in which κ = ( λ ⊤ , θ ⊤ ) ⊤ , ∂ κ 2 H n l , j = ∂ 2 H n l , j / ∂ κ ∂ κ ⊤ , H n l , j represents the jth component of H n l for l = 1 , 2 , and κ * = ( λ * ⊤ , θ * ⊤ ) ⊤ satisfies || κ * − ( 0 ⊤ , θ 0 ⊤ ) ⊤ || ≤ || κ ˇ PET − ( 0 ⊤ , θ 0 ⊤ ) ⊤ || . Applying the Cauchy–Schwarz inequality to R 1 n , j ( 1 ) yields
\[
\|R_{1n,j}^{(1)}\|^{2} \le C\,\|\check{\kappa}^{PET}-\kappa_0\|^{4} \sum_{i,k \le s+t}\big\|\partial^{2}H_{n1,j}/\partial\kappa_i\,\partial\kappa_k\big\|^{2} = O_p\big(t^{2}(s+t)^{2}n^{-2}\big)
\]
for j = 1 , … , s + t ; it follows from Assumption 5 that || R 1 n , j ( 1 ) || 2 = o p ( 1 / n ) . The same argument applies to R 1 n , j ( 2 ) , giving || R 1 n , j ( 2 ) || 2 = o p ( 1 / n ) and hence || R 1 n , j || 2 = o p ( 1 / n ) ; furthermore, || R 1 n || = o p ( n − 1 / 2 ) . Let W ( θ ) = ( p ˙ ν ( | θ 1 | ) sign ( θ 1 ) , p ˙ ν ( | θ 2 | ) sign ( θ 2 ) , … , p ˙ ν ( | θ s | ) sign ( θ s ) ) ⊤ . It follows from Assumption 9(i) that || R 2 n || = || ( 0 ⊤ , W ( θ 0 ) ⊤ ) ⊤ || = o p ( n − 1 / 2 ) and || R 3 n || = || ( 0 ⊤ , { W ˙ ( θ ϱ ) ( θ ˇ PET − θ 0 ) } ⊤ ) ⊤ || = o p ( n − 1 / 2 ) , where W ˙ ( · ) is the first-order derivative of W ( · ) and θ ϱ is a vector lying on the line segment between θ ˇ PET and θ 0 . By Assumptions 4 and 5, we have || R 4 n || = || ( { ( P n [ G ( θ 0 ) G ( θ 0 ) ⊤ ] − K 1 ) λ ˇ PET + ( P n [ ∂ θ ⊤ G ( θ 0 ) ] − K 2 ) ( θ ˇ PET − θ 0 ) } ⊤ , 0 ⊤ ) ⊤ || = o p ( n − 1 / 2 ) and || R 5 n || = || ( 0 ⊤ , { λ ˇ PET ⊤ ( P n [ ∂ θ ⊤ G ( θ 0 ) ] − K 2 ) } ⊤ ) ⊤ || = o p ( n − 1 / 2 ) . Combining the above results yields || R n || = o p ( n − 1 / 2 ) .
By Equation (A14), we have
\[
\begin{pmatrix} \check{\lambda}^{PET} - 0 \\ \check{\theta}^{PET} - \theta_0 \end{pmatrix}
= -\,S^{-1}\left\{\begin{pmatrix} H_{n1}(0,\theta_0) \\ 0 \end{pmatrix} + R_n\right\},
\]
where
\[
S = \begin{pmatrix} K_1 & K_2 \\ K_2^{\top} & 0 \end{pmatrix}.
\]
Inverting the block matrix S , write
\[
S^{-1} = \begin{pmatrix} S^{11} & S^{12} \\ S^{21} & S^{22} \end{pmatrix},
\]
where
\[
S^{11} = K_1^{-1} - K_1^{-1}K_2(K_2^{\top}K_1^{-1}K_2)^{-1}K_2^{\top}K_1^{-1}, \quad
S^{12} = K_1^{-1}K_2(K_2^{\top}K_1^{-1}K_2)^{-1}, \quad
S^{21} = (K_2^{\top}K_1^{-1}K_2)^{-1}K_2^{\top}K_1^{-1}, \quad
S^{22} = -(K_2^{\top}K_1^{-1}K_2)^{-1}.
\]
It follows that
\[
\check{\theta}^{PET} - \theta_0 = -(K_2^{\top}K_1^{-1}K_2)^{-1}K_2^{\top}K_1^{-1}\big\{H_{n1}(0,\theta_0) + R_n^{(2)}\big\},
\]
where H n 1 ( 0 , θ 0 ) = P n [ G ( θ 0 ) ] and || R n ( 2 ) || = o p ( n − 1 / 2 ) . Let Γ 1 = ( K 2 ⊤ K 1 − 1 K 2 ) − 1 , Γ 2 = Γ 1 K 2 ⊤ K 1 − 1 , and M n i = n − 1 / 2 A n Γ 1 − 1 / 2 Γ 2 G ( X i , Y i ; θ 0 ) . For any ϵ 2 > 0 , by the Cauchy–Schwarz inequality,
\[
\sum_{i=1}^{n} E\big[\|M_{ni}\|^{2} I(\|M_{ni}\| > \epsilon_2)\big] = n\,E\big[\|M_{ni}\|^{2} I(\|M_{ni}\| > \epsilon_2)\big] \le n\,\big\{E\|M_{ni}\|^{4}\big\}^{1/2}\big\{\Pr(\|M_{ni}\| > \epsilon_2)\big\}^{1/2}.
\]
Since A n A n ⊤ → V , we have Pr ( || M n i || > ϵ 2 ) ≤ E [ || M n i || 2 ] / ϵ 2 2 = O p ( n − 1 ) by the Markov inequality. Moreover,
\[
E[\|M_{ni}\|^{4}] = \frac{1}{n^{2}}\,E\big[G(X_i,Y_i;\theta)^{\top}\Gamma_2^{\top}\Gamma_1^{-1/2}A_n^{\top}A_n\Gamma_1^{-1/2}\Gamma_2\,G(X_i,Y_i;\theta)\big]^{2}
\le \frac{1}{n^{2}}\,\mathrm{EV}_{\max}^{2}(A_n^{\top}A_n)\,\mathrm{EV}_{\max}^{2}(\Gamma_1^{-1})\,E\big[G(X_i,Y_i;\theta)^{\top}G(X_i,Y_i;\theta)\big]^{2} = O_p(s^{2}n^{-2}),
\]
which leads to ∑ i = 1 n E [ || M n i || 2 I ( || M n i || > ϵ 2 ) ] = O p ( n − 1 / 2 ) = o p ( 1 ) . Furthermore, ∑ i = 1 n cov ( M n i ) = n cov ( M n i ) = A n A n ⊤ → V as n → ∞ , and − Γ 2 H n 1 ( 0 , θ 0 ) = θ ˇ PET − θ 0 . By the Lindeberg–Feller central limit theorem, we have √ n A n Γ 1 − 1 / 2 ( θ ˇ PET − θ 0 ) → L N ( 0 , V ) . □
Proof of Theorem 2.
Let θ ^ PET = ( θ ^ 1 PET , θ ^ 2 PET ) be the penalized ET likelihood estimator of θ in the presence of EE misspecification.
(i) By Assumptions 1, 2, and 4 and Equations (3) and (4), we have that G ( D i ; θ ) is continuous in θ and sup j E [ G j ( D i ; θ ) ] 2 ≤ C 1 G ; then, sup θ ∈ Θ || G ( D i ; θ ^ PET ) − E [ G ( D i ; θ ) ] || → P 0 , where → P denotes convergence in probability. Subsequently, it follows from Assumptions 10–12 and 14 and Equations (3), (4), and (8) that θ ^ PET → P θ * .
(ii) The first-order partial derivative of the penalized ET likelihood l p n ( θ ) with respect to θ j for j ∉ A θ is given by
\[
\begin{aligned}
\frac{\partial l_{pn}(\theta)}{\partial\theta_j}
&= \frac{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\,\{\partial_{\theta_j}G(\theta)\}^{\top}\lambda\big]}{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\big]} - \dot p_{\nu}(|\theta_j|)\operatorname{sign}(\theta_j)
= \frac{E\big[\exp\{\lambda^{\top}G(\theta)\}\,\{\partial_{\theta_j}G(\theta)\}^{\top}\lambda\big] + O_p(n^{-1/2})}{E\big[\exp\{\lambda^{\top}G(\theta)\}\big] + O_p(n^{-1/2})} - \dot p_{\nu}(|\theta_j|)\operatorname{sign}(\theta_j) \\
&\le O_p(1)\,\|\lambda\| - \dot p_{\nu}(|\theta_j|)\operatorname{sign}(\theta_j)
= \nu\Big\{\frac{\|\lambda\|}{\nu}\,O_p(1) - \frac{\dot p_{\nu}(|\theta_j|)}{\nu}\operatorname{sign}(\theta_j)\Big\}
= \nu\Big\{-\frac{\dot p_{\nu}(|\theta_j|)}{\nu}\operatorname{sign}(\theta_j) + o_p(1)\Big\}, \tag{A17}
\end{aligned}
\]
where the third inequality above is derived from Assumptions 13 and 14. It is not difficult to see from Equation (A17) that the sign of ∂ l p n ( θ ) / ∂ θ j is dominated by the sign of θ j . In other words, we have ∂ l p n ( θ ) / ∂ θ j > 0 when θ j < 0 and ∂ l p n ( θ ) / ∂ θ j < 0 when θ j > 0 , with probability tending to one as n → ∞ , for any j ∉ A θ ; therefore, θ ^ 2 PET = 0 with probability tending to one.
(iii) Denote by M 1 and M 2 the projection matrices such that M 1 θ = θ 1 and M 2 θ = θ 2 . Then, the penalized log-ET likelihood ratio function under the constraint M 2 θ = θ 2 = 0 is given by
\[
S_n(\lambda,\theta,\tau) = \log \mathbb{P}_n\big[\exp\{\lambda^{\top}G(D_i;\theta)\}\big] - \sum_{j=1}^{s}p_{\nu}(|\theta_j|) + \tau^{\top}M_2\theta.
\]
Taking partial derivatives of S n ( λ , θ , τ ) with respect to λ , θ , and τ , respectively, gives
\[
\begin{aligned}
S_{n1}(\lambda,\theta,\tau) &= \partial S_n(\lambda,\theta,\tau)/\partial\lambda = \frac{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(D_i;\theta)\}\,G(D_i;\theta)\big]}{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(D_i;\theta)\}\big]}, \\
S_{n2}(\lambda,\theta,\tau) &= \partial S_n(\lambda,\theta,\tau)/\partial\theta = \frac{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(D_i;\theta)\}\,\{\partial_{\theta}G(D_i;\theta)\}^{\top}\lambda\big]}{\mathbb{P}_n\big[\exp\{\lambda^{\top}G(D_i;\theta)\}\big]} - W(\theta) + M_2^{\top}\tau, \\
S_{n3}(\lambda,\theta,\tau) &= \partial S_n(\lambda,\theta,\tau)/\partial\tau = M_2\theta,
\end{aligned}
\]
where W ( θ ) = ( p ˙ ν ( | θ 1 | ) sign ( θ 1 ) , p ˙ ν ( | θ 2 | ) sign ( θ 2 ) , … , p ˙ ν ( | θ s | ) sign ( θ s ) ) ⊤ . Let ζ ^ PET = ( λ ^ PET ⊤ , θ ^ PET ⊤ , τ ^ PET ⊤ ) ⊤ be the penalized ET estimator of ζ = ( λ ⊤ , θ ⊤ , τ ⊤ ) ⊤ , which solves S n 1 ( λ , θ , τ ) = 0 , S n 2 ( λ , θ , τ ) = 0 , and S n 3 ( λ , θ , τ ) = 0 . Taking second partial derivatives of S n ( λ , θ , τ ) with respect to ζ , each component of these derivatives takes the form $\tilde{\zeta}\exp\{l_1\lambda^{\top}G(D_i;\theta)\}\,G^{l_2}(\partial_{\theta}G)^{l_3}(\partial^{2}_{\theta_{j_1}\theta_{j_2}}G)^{l_4}$ for l 1 = 0 , 1 , j 1 , j 2 = 1 , … , s , and 0 ≤ l 2 + l 3 + l 4 ≤ 2 , where G , ∂ θ G , and ∂ θ j 1 θ j 2 2 G denote combinations of components of G ( D i ; θ ) , ∂ θ G ( D i ; θ ) , and ∂ θ j 1 θ j 2 2 G ( D i ; θ ) , respectively, and ζ ˜ denotes a product of elements of ζ . Let S ˜ ( λ , θ , τ ) = ( S n 1 ( λ , θ , τ ) ⊤ , S n 2 ( λ , θ , τ ) ⊤ , S n 3 ( λ , θ , τ ) ⊤ ) ⊤ ; it then follows from Assumption 13 that
\[
E\Big[\sup_{\zeta\in N_{\zeta^{*}}}\big\|\partial\tilde{S}(\lambda,\theta,\tau)/\partial\zeta^{\top}\big\|\Big] < \infty, \tag{A18}
\]
where N ζ * is a neighbourhood of ζ * = ( λ * ⊤ , θ * ⊤ , τ * ⊤ ) ⊤ . In addition, the expectation
\[
E\Big[\sup_{\zeta\in N_{\zeta^{*}}}\tilde{\zeta}\exp\{l_1\lambda^{\top}G(D_i;\theta)\}\,|G|^{l_2}|\partial_{\theta}G|^{l_3}|\partial^{2}_{\theta_{j_1}\theta_{j_2}}G|^{l_4}\Big]
\le E\Big[\sup_{\zeta\in N_{\zeta^{*}}}\tilde{\zeta}\exp\{l_1\lambda^{\top}G(D_i;\theta)\}\,|h(D_i)|^{l_2+l_3+l_4}\Big] < \infty,
\]
where the inequalities above are due to Assumptions 10–13. The elements of the matrix S ˜ ( λ , θ , τ ) S ˜ ( λ , θ , τ ) ⊤ take the same form; hence,
\[
E\big[\tilde{S}(\lambda,\theta,\tau)\,\tilde{S}(\lambda,\theta,\tau)^{\top}\big] < \infty. \tag{A19}
\]
Combining inequalities (A18) and (A19) with Theorem 3.4 of Newey and McFadden (1994), we have √ n ( ζ ^ PET − ζ * ) → L N ( 0 , Δ − 1 Ω ( Δ ⊤ ) − 1 ) , where Δ = E [ ∂ S ˜ ( λ , θ , τ ) / ∂ ζ ⊤ | ζ = ζ * ] , Ω = E [ S ˜ ( λ * , θ * , τ * ) S ˜ ( λ * , θ * , τ * ) ⊤ ] , and → L denotes convergence in distribution. □
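As an illustration of how the limiting covariance Δ − 1 Ω ( Δ ⊤ ) − 1 in Theorem 2 might be estimated in practice, the following minimal sketch forms the plug-in sandwich estimator from an estimated Jacobian and an outer-product matrix. The helper name sandwich_cov and the toy inputs are hypothetical assumptions for illustration, not part of the paper's software.

```python
import numpy as np

def sandwich_cov(Delta_hat, Omega_hat, n):
    """Plug-in sandwich covariance for a sqrt(n)-normalized estimator:
    cov(zeta_hat) is approximated by Delta^{-1} Omega Delta^{-T} / n."""
    Dinv = np.linalg.inv(Delta_hat)
    return Dinv @ Omega_hat @ Dinv.T / n

# Hypothetical estimated Jacobian (Delta) and score outer product (Omega).
Delta_hat = np.array([[2.0, 0.3], [0.1, 1.5]])
Omega_hat = np.array([[1.0, 0.2], [0.2, 0.8]])
print(sandwich_cov(Delta_hat, Omega_hat, n=500))
```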
Proof of Theorem 3.
Analogously to the proof of Theorem 1, we obtain
\[
\hat{\lambda}^{PET} = -\big\{K_1^{-1} - K_1^{-1}K_2\Gamma_1 K_2^{\top}K_1^{-1}\big\}\,\mathbb{P}_n[G(\theta)] + \tilde{R}_n^{(1)}, \qquad
\hat{\theta}^{PET} - \theta_0 = -\Gamma_1 K_2^{\top}K_1^{-1}\,\mathbb{P}_n[G(\theta)] + \tilde{R}_n^{(2)},
\]
where || R ˜ n ( 1 ) || = || R ˜ n ( 2 ) || = o p ( n − 1 / 2 ) . Taking the Taylor expansion of l n ( λ , θ ) at ω ^ i = λ ^ PET ⊤ G ( X i , Y i ; θ ^ PET ) yields l n ( λ , θ ) = P n [ ω ^ ( 1 + o p ( 1 ) ) ] . Substituting λ ^ PET and θ ^ PET into l n ( λ , θ ) yields − 2 n l n ( λ ^ PET , θ ^ PET ) = n P n [ G ( θ ) ] ⊤ { K 1 − 1 − K 1 − 1 K 2 Γ 1 K 2 ⊤ K 1 − 1 } P n [ G ( θ ) ] + o p ( 1 ) . Let λ ˘ and θ ˘ maximize l p n ( θ ) under the null hypothesis H 0 : C n θ = 0 . The constrained penalized ET likelihood under the null hypothesis H 0 is given by
\[
l_{cpn}(\lambda,\theta,\upsilon) = \log \mathbb{P}_n\big[\exp\{\lambda^{\top}G(\theta)\}\big] - \sum_{j=1}^{s} p_{\nu}(|\theta_j|) + \upsilon^{\top}C_n\theta,
\]
where υ is a Lagrange multiplier. Since C n θ = 0 and C n C n ⊤ = I e , through some algebraic operations we obtain
\[
\breve{\lambda} = -\big\{K_1^{-1} - K_1^{-1}K_2\Gamma_3 K_2^{\top}K_1^{-1}\big\}\,\mathbb{P}_n[G(\theta)] + \breve{R}_{n1},
\]
where Γ 3 = Γ 1 − Γ 1 C n ⊤ ( C n Γ 1 C n ⊤ ) − 1 C n Γ 1 . Subsequently, the constrained penalized ET likelihood satisfies − 2 n l c p n ( λ ˘ , θ ˘ ) = n P n [ G ( θ ) ] ⊤ { K 1 − 1 − K 1 − 1 K 2 Γ 3 K 2 ⊤ K 1 − 1 } P n [ G ( θ ) ] + o p ( 1 ) , where the o p ( 1 ) term includes the penalty function term. Combining the above equations, we obtain the constrained penalized ET likelihood ratio statistic under the null hypothesis H 0 as follows:
\[
\tilde{l}(C_n) = 2n\Big\{l_{pn}(\hat{\theta}) - \max_{\theta,\,C_n\theta=0} l_{pn}(\theta)\Big\} = 2n\big\{l_{pn}(\hat{\lambda}^{PET},\hat{\theta}^{PET}) - l_{pn}(\breve{\lambda},\breve{\theta})\big\} = n\,\mathbb{P}_n[G(\theta)]^{\top}K_1^{-1/2}(S_1 - S_2)K_1^{-1/2}\,\mathbb{P}_n[G(\theta)] + o_p(1),
\]
where S 1 = K 1 − 1 / 2 K 2 Γ 1 K 2 ⊤ K 1 − 1 / 2 and S 2 = K 1 − 1 / 2 K 2 Γ 3 K 2 ⊤ K 1 − 1 / 2 are two symmetric idempotent matrices. Hence, there is an e × t matrix S ˜ such that S 1 − S 2 = S ˜ ⊤ S ˜ and S ˜ S ˜ ⊤ = I e . By the central limit theorem, we obtain that √ n S ˜ K 1 − 1 / 2 P n [ G ( θ ) ] → L N ( 0 , I e ) , which leads to n P n [ G ( θ ) ] ⊤ K 1 − 1 / 2 ( S 1 − S 2 ) K 1 − 1 / 2 P n [ G ( θ ) ] → L ρ 1 χ 1 2 + ⋯ + ρ e χ 1 2 , where the weights ρ 1 ≥ ⋯ ≥ ρ e are the eigenvalues of the matrix ( S 1 − S 2 ) K 1 − 1 . Let M ˜ = K 1 ( S 1 − S 2 ) − , with ( S 1 − S 2 ) − a generalized inverse of S 1 − S 2 ; applying this transformation to the constrained penalized ET likelihood ratio statistic yields l ˜ ( C n ) → L χ e 2 . This completes our proof. □
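When the estimating equations are misspecified, the proof above yields a weighted chi-squared limit ρ 1 χ 1 2 + ⋯ + ρ e χ 1 2 . A critical value for such a limit can be approximated by Monte Carlo once the weights are estimated; the following minimal sketch, with hypothetical weights standing in for the eigenvalues of ( S 1 − S 2 ) K 1 − 1 , illustrates one way to do so.

```python
import numpy as np

def weighted_chisq_quantile(rho, q=0.95, mc=200_000, seed=1):
    """Monte Carlo q-quantile of sum_i rho_i * chi^2_1, using chi^2_1 = Z^2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(mc, len(rho)))
    return np.quantile((np.asarray(rho) * z**2).sum(axis=1), q)

# Hypothetical eigenvalue weights.
print(weighted_chisq_quantile([1.0, 0.6, 0.3]))
```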

References

  1. Godambe, V.P. Estimating Functions, 1st ed.; Oxford University: Oxford, UK, 1991. [Google Scholar]
  2. Carroll, R.; Ruppert, D.; Stefanski, L.; Crainiceanu, C. Nonlinear Measurement Error Models, A Modern Perspective, 2nd ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2006. [Google Scholar]
  3. Hardin, J.W.; Hilbe, J.M. Generalized Estimating Equations, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2003. [Google Scholar]
  4. Hansen, L.P. Large Sample Properties of Generalized Method of Moments Estimators. Econometrica 1982, 50, 1029–1054. [Google Scholar] [CrossRef]
  5. Kitamura, Y. Empirical likelihood methods with weakly dependent processes. Ann. Stat. 1997, 25, 2084–2102. [Google Scholar] [CrossRef]
  6. Owen, A.B. Empirical likelihood ratio confidence intervals for a single functional. Biometrika 1988, 75, 237–249. [Google Scholar] [CrossRef]
  7. Qin, J.; Lawless, J. Empirical likelihood and general estimating Equations. Ann. Stat. 1994, 22, 300–325. [Google Scholar] [CrossRef]
  8. Tang, N.; Yan, X.; Zhao, P. Exponentially tilted likelihood inference on growing dimensional unconditional moment models. J. Econom. 2018, 202, 57–74. [Google Scholar] [CrossRef]
  9. Newey, W.; Smith, R.J. Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 2004, 72, 219–255. [Google Scholar] [CrossRef]
  10. Anatolyev, S. GMM, GEL, serial correlation, and asymptotic bias. Econometrica 2005, 73, 983–1002. [Google Scholar] [CrossRef]
  11. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. b-Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  12. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
  13. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  14. Lv, J.; Fan, Y. A unified approach to model selection and sparse recovery using regularized least squares. Ann. Stat. 2009, 37, 3498–3528. [Google Scholar] [CrossRef]
  15. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Nature: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  16. Lam, C.; Fan, J. Profile-kernel likelihood inference with diverging number of parameters. Ann. Stat. 2008, 36, 2232–2260. [Google Scholar] [CrossRef] [PubMed]
  17. Zou, H.; Zhang, H.H. On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 2009, 37, 1733–1751. [Google Scholar] [CrossRef] [PubMed]
  18. Caner, M.; Zhang, H.H. Adaptive elastic net for generalized methods of moments. J. Bus. Econ. Stat. 2014, 32, 30–47. [Google Scholar] [CrossRef]
  19. Wang, C.Y.; Wang, S.; Zhao, L.-P.; Ou, S.-T. Weighted semiparametric estimation in regression analysis with missing covariate data. J. Am. Stat. Assoc. 1997, 92, 512–525. [Google Scholar] [CrossRef]
  20. Wang, C.Y.; Wang, S.; Gutierrez, R.G.; Carroll, R.J. Local linear regression for generalized linear models with missing data. Ann. Stat. 1998, 26, 1028–1050. [Google Scholar]
  21. Scharfstein, D.O.; Rotnitzky, A.; Robins, J.M. Adjusting for nonignorable drop-out using semiparametric nonresponse models. J. Am. Stat. Assoc. 1999, 94, 1096–1146. [Google Scholar] [CrossRef]
  22. FitzGerald, E.B. Extended generalized estimating equations for binary familial data with incomplete families. Biometrics 2002, 58, 718–726. [Google Scholar] [CrossRef] [PubMed]
  23. Wang, C.Y.; Huang, Y.; Chao, E.C.; Jeffcoat, M.K. Expected estimating equations for missing data, measurement error and misclassification, with application to longitudinal nonignorable missing data. Biometrics 2008, 64, 85–95. [Google Scholar] [CrossRef] [PubMed]
  24. Zhou, Y.; Liang, H. Statistical inference for semiparametric varying coefficient partially linear models with generated regressors. Ann. Stat. 2008, 37, 427–458. [Google Scholar]
  25. Tang, N.; Zhao, P.; Zhu, H.T. Empirical likelihood for estimating equations with nonignorably missing data. Stat. Sin. 2014, 24, 723–747. [Google Scholar] [CrossRef] [PubMed]
  26. Qi, L.; Zhang, X.; Sun, Y.; Wang, L.; Zhao, Y. Weighted estimating equations for additive hazards models with missing covariates. Ann. Inst. Stat. Math. 2018, 71, 365–387. [Google Scholar] [CrossRef]
  27. Shao, J.; Wang, L. Semiparametric inverse propensity weighting for nonignorable missing data. Biometrika 2016, 103, 175–187. [Google Scholar] [CrossRef]
  28. Tang, N.; Xia, L.; Yan, X. Feature screening in ultrahigh-dimensional partially linear models with missing responses at random. Comput. Stat. Data Anal. 2019, 133, 208–227. [Google Scholar] [CrossRef]
  29. Liu, T.; Yuan, X. Adaptive empirical likelihood estimation with nonignorable nonresponse data. Statistics 2019, 54, 1–22. [Google Scholar] [CrossRef]
  30. Wang, S.; Shao, J.; Kim, J.K. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat. Sin. 2014, 24, 1097–1116. [Google Scholar] [CrossRef]
  31. Stock, J.H. Instrumental variables in statistics and econometrics. In International Encyclopedia of the Social & Behavioral Sciences, 2nd ed.; Elsevier: Amsterdam, The Netherlands, 2015; pp. 205–209. [Google Scholar]
  32. Rosenbaum, P.R.; Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983, 70, 41–55. [Google Scholar] [CrossRef]
  33. Imai, K.; Ratkovic, M. Robust estimation of inverse probability weights for marginal structural models. J. Am. Stat. Assoc. 2015, 110, 1013–1023. [Google Scholar] [CrossRef]
  34. Robins, J.; Rotnitzky, A.; Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866. [Google Scholar] [CrossRef]
  35. Robins, J.; Rotnitzky, A.; Zhao, L.P. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J. Am. Stat. Assoc. 1995, 90, 106–121. [Google Scholar] [CrossRef]
  36. Hirano, K.; Imbens, G.; Ridder, G. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 2003, 71, 1161–1189. [Google Scholar] [CrossRef]
  37. Owen, A.B. Empirical Likelihood, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2001. [Google Scholar]
  38. Papini, E.; Guglielmi, R.; Bianchini, A.; Crescenzi, A.; Taccogna, S.; Nardi, F.; Panunzi, C.; Rinaldi, R.; Toscano, V.; Pacella, C.M. Risk of malignancy in nonpalpable thyroid nodules: Predictive value of ultrasound and color-Doppler features. J. Clin. Endocrinol. Metab. 2002, 87, 1941–1946. [Google Scholar] [CrossRef] [PubMed]
  39. Kim, M.K.; Park, H.; Oh, Y.L.; Shin, J.H.; Kim, T.H.; Hahn, S.Y. Role of ultrasound in predicting telomerase reverse transcriptase (TERT) promoter mutation in follicular thyroid carcinoma. Sci. Rep. 2024, 14, 15323. [Google Scholar] [CrossRef] [PubMed]
  40. Xu, T.; Gu, J.; Ye, X.; Xu, S.; Wu, Y.; Shao, X.; Liu, D.; Lu, W.; Hua, F.; Shi, B.; et al. Thyroid nodule sizes influence the diagnostic performance of TIRADS and ultrasound patterns of 2015 ATA guidelines: A multicenter retrospective study. Sci. Rep. 2017, 7, 43183. [Google Scholar] [CrossRef]
Table 1. Bias, SD, and RMSE values of non-zero parameters and performance of variable selection in Experiment 1.

(n, p+1) | Case | Bias (β̂c, β̂x,1, β̂x,2) | SD (β̂c, β̂x,1, β̂x,2) | RMSE (β̂c, β̂x,1, β̂x,2) | TP | FP | UF | CF | OF
(50,9) | M1 | −0.0256 0.0811 −0.0285 | 0.841 0.863 0.313 | 0.842 0.867 0.315 | 5.779 | 0.141 | 0.196 | 0.697 | 0.107
(50,9) | M2 | −0.0465 −0.0255 0.1355 | 0.537 0.501 0.870 | 0.539 0.502 0.880 | 5.576 | 0.165 | 0.334 | 0.563 | 0.103
(50,9) | M3 | −0.0001 −0.0132 0.0427 | 0.851 0.793 0.822 | 0.851 0.793 0.823 | 5.689 | 0.139 | 0.262 | 0.639 | 0.099
(100,13) | M1 | 0.0464 −0.0228 −0.0195 | 0.750 0.146 0.135 | 0.751 0.148 0.137 | 9.785 | 0.101 | 0.174 | 0.752 | 0.074
(100,13) | M2 | −0.0714 −0.0357 −0.0548 | 0.378 0.741 0.708 | 0.384 0.742 0.710 | 9.584 | 0.142 | 0.297 | 0.611 | 0.092
(100,13) | M3 | 0.0487 −0.0118 −0.0230 | 0.803 0.221 0.210 | 0.805 0.221 0.211 | 9.752 | 0.095 | 0.196 | 0.736 | 0.068
(200,17) | M1 | −0.0175 −0.0044 −0.0042 | 0.153 0.088 0.088 | 0.154 0.088 0.089 | 13.905 | 0.039 | 0.082 | 0.881 | 0.037
(200,17) | M2 | −0.0293 −0.0193 −0.0158 | 0.186 0.141 0.125 | 0.188 0.143 0.126 | 13.811 | 0.081 | 0.137 | 0.794 | 0.069
(200,17) | M3 | −0.0234 −0.0288 −0.0081 | 0.168 0.218 0.118 | 0.169 0.220 0.118 | 13.850 | 0.068 | 0.116 | 0.825 | 0.059
(500,23) | M1 | 0.0000 0.0000 0.0001 | 0.005 0.005 0.005 | 0.005 0.005 0.005 | 19.975 | 0.000 | 0.022 | 0.978 | 0.000
(500,23) | M2 | −0.0050 0.0001 0.0014 | 0.094 0.084 0.077 | 0.094 0.084 0.077 | 19.929 | 0.015 | 0.061 | 0.927 | 0.012
(500,23) | M3 | −0.0003 −0.0004 0.0003 | 0.009 0.009 0.009 | 0.009 0.009 0.009 | 19.970 | 0.000 | 0.028 | 0.972 | 0.000
(500,101) | M1 | −0.0004 −0.0005 −0.0001 | 0.050 0.010 0.010 | 0.050 0.010 0.010 | 97.934 | 0.000 | 0.040 | 0.960 | 0.000
(500,101) | M2 | −0.0068 −0.0013 −0.0036 | 0.096 0.073 0.069 | 0.096 0.074 0.069 | 97.703 | 0.024 | 0.145 | 0.834 | 0.021
(500,101) | M3 | −0.0044 0.0043 −0.0031 | 0.077 0.076 0.046 | 0.077 0.076 0.046 | 97.902 | 0.012 | 0.063 | 0.925 | 0.012

(n, q+1) | Case | Bias (γ̂c, γ̂u,1, γ̂u,2) | SD (γ̂c, γ̂u,1, γ̂u,2) | RMSE (γ̂c, γ̂u,1, γ̂u,2) | TP | FP | UF | CF | OF
(50,5) | M1 | −0.0950 0.0261 −0.0159 | 0.369 0.786 0.296 | 0.381 0.786 0.297 | 1.905 | 0.176 | 0.093 | 0.747 | 0.160
(50,5) | M2 | −0.0680 −0.0516 −0.0246 | 0.570 0.526 0.513 | 0.574 0.528 0.513 | 1.864 | 0.174 | 0.128 | 0.732 | 0.140
(50,5) | M3 | 0.0055 −0.0391 −0.0298 | 0.786 0.481 0.459 | 0.786 0.483 0.460 | 1.871 | 0.103 | 0.123 | 0.790 | 0.087
(100,7) | M1 | −0.0410 0.0457 0.0849 | 0.229 0.795 0.791 | 0.232 0.796 0.796 | 3.923 | 0.112 | 0.075 | 0.825 | 0.100
(100,7) | M2 | −0.0368 0.0230 −0.0212 | 0.445 0.852 0.410 | 0.446 0.852 0.411 | 3.818 | 0.141 | 0.161 | 0.727 | 0.112
(100,7) | M3 | −0.0687 −0.1052 −0.0268 | 0.727 0.712 0.322 | 0.730 0.720 0.323 | 3.900 | 0.138 | 0.094 | 0.785 | 0.121
(200,9) | M1 | −0.0263 −0.0240 −0.0074 | 0.199 0.138 0.163 | 0.200 0.140 0.163 | 5.955 | 0.066 | 0.042 | 0.893 | 0.065
(200,9) | M2 | −0.0324 0.0002 0.0036 | 0.556 0.551 0.544 | 0.556 0.551 0.544 | 5.920 | 0.062 | 0.072 | 0.873 | 0.055
(200,9) | M3 | −0.0229 −0.0188 −0.0089 | 0.175 0.148 0.110 | 0.177 0.150 0.110 | 5.954 | 0.066 | 0.040 | 0.899 | 0.061
(500,12) | M1 | 0.0001 −0.0005 −0.0000 | 0.006 0.006 0.006 | 0.006 0.006 0.006 | 8.981 | 0.000 | 0.019 | 0.981 | 0.000
(500,12) | M2 | −0.0143 −0.0147 −0.0127 | 0.119 0.111 0.080 | 0.120 0.112 0.081 | 8.971 | 0.058 | 0.028 | 0.917 | 0.055
(500,12) | M3 | −0.0002 0.0001 −0.0003 | 0.008 0.008 0.008 | 0.008 0.008 0.008 | 8.989 | 0.000 | 0.011 | 0.989 | 0.000
(500,51) | M1 | −0.0195 −0.0138 −0.0090 | 0.134 0.107 0.065 | 0.135 0.108 0.065 | 47.957 | 0.052 | 0.035 | 0.915 | 0.050
(500,51) | M2 | −0.0068 −0.0073 −0.0049 | 0.092 0.092 0.060 | 0.093 0.092 0.060 | 47.862 | 0.026 | 0.083 | 0.894 | 0.023
(500,51) | M3 | −0.0173 −0.0160 −0.0023 | 0.168 0.109 0.053 | 0.169 0.110 0.053 | 47.900 | 0.032 | 0.070 | 0.901 | 0.029
Table 2. Bias, SD, and RMSE values of non-zero parameters and performance of variable selection in Experiment 2.

(n, p+1) | Case | Bias (β̂c, β̂x,1, β̂x,2) | SD (β̂c, β̂x,1, β̂x,2) | RMSE (β̂c, β̂x,1, β̂x,2) | TP | FP | UF | CF | OF
(50,9) | PETL | −0.0200 −0.0327 −0.0237 | 0.843 0.218 0.220 | 0.844 0.221 0.222 | 5.760 | 0.124 | 0.208 | 0.698 | 0.094
(50,9) | HT | 0.0013 −0.0486 0.0221 | 0.958 0.591 0.910 | 0.958 0.593 0.911 | 4.103 | 0.330 | 0.717 | 0.200 | 0.083
(50,9) | ST | −0.0346 −0.0349 0.0358 | 0.923 0.848 0.884 | 0.924 0.849 0.885 | 2.961 | 0.222 | 0.854 | 0.114 | 0.032
(50,9) | PLS | −0.1264 −0.0767 −0.0232 | 0.762 0.974 0.732 | 0.772 0.977 0.732 | 4.507 | 0.365 | 0.699 | 0.214 | 0.087
(100,13) | PETL | 0.0477 −0.0238 −0.0195 | 0.749 0.227 0.218 | 0.750 0.228 0.219 | 9.792 | 0.089 | 0.167 | 0.767 | 0.066
(100,13) | HT | −0.1052 −0.0493 0.0476 | 0.482 0.421 0.813 | 0.494 0.424 0.815 | 7.994 | 0.238 | 0.530 | 0.365 | 0.105
(100,13) | ST | −0.0732 −0.0083 −0.0501 | 0.538 0.497 0.489 | 0.543 0.497 0.491 | 4.475 | 0.157 | 0.774 | 0.196 | 0.030
(100,13) | PLS | −0.0250 −0.0364 0.0230 | 0.868 0.565 0.832 | 0.869 0.566 0.833 | 8.799 | 0.326 | 0.523 | 0.342 | 0.135
(200,17) | PETL | 0.0063 0.0006 0.0017 | 0.165 0.081 0.080 | 0.165 0.081 0.080 | 13.909 | 0.000 | 0.074 | 0.926 | −0.000
(200,17) | HT | −0.0558 −0.0372 −0.0278 | 0.277 0.479 0.200 | 0.283 0.480 0.202 | 12.867 | 0.143 | 0.287 | 0.616 | 0.097
(200,17) | ST | −0.0303 −0.0310 −0.0268 | 0.191 0.375 0.376 | 0.193 0.377 0.377 | 8.295 | 0.090 | 0.603 | 0.363 | 0.034
(200,17) | PLS | −0.0275 −0.0138 −0.0102 | 0.182 0.136 0.123 | 0.184 0.137 0.123 | 13.485 | 0.083 | 0.206 | 0.734 | 0.060
(500,101) | PETL | 0.0000 −0.0004 −0.0001 | 0.008 0.008 0.008 | 0.008 0.008 0.008 | 97.984 | 0.000 | 0.012 | 0.988 | 0.000
(500,101) | HT | −0.0286 −0.0124 −0.0181 | 0.181 0.121 0.119 | 0.183 0.122 0.120 | 94.261 | 0.096 | 0.138 | 0.781 | 0.081
(500,101) | ST | −0.0075 −0.0018 −0.0054 | 0.118 0.080 0.081 | 0.118 0.080 0.081 | 74.997 | 0.021 | 0.450 | 0.536 | 0.014
(500,101) | PLS | −0.0537 −0.0404 −0.0346 | 0.236 0.158 0.200 | 0.242 0.163 0.203 | 95.283 | 0.195 | 0.305 | 0.565 | 0.130

(n, q+1) | Case | Bias (γ̂c, γ̂u,1, γ̂u,2) | SD (γ̂c, γ̂u,1, γ̂u,2) | RMSE (γ̂c, γ̂u,1, γ̂u,2) | TP | FP | UF | CF | OF
(50,5) | PETL | −0.0852 0.0307 −0.0106 | 0.331 0.785 0.247 | 0.342 0.785 0.247 | 1.904 | 0.152 | 0.094 | 0.768 | 0.138
(50,5) | HT | −0.1180 −0.0948 −0.0058 | 0.622 0.565 0.694 | 0.633 0.572 0.694 | 1.332 | 0.331 | 0.505 | 0.342 | 0.153
(50,5) | ST | −0.0448 −0.0687 −0.0474 | 0.885 0.683 0.649 | 0.886 0.687 0.650 | 1.076 | 0.201 | 0.691 | 0.250 | 0.059
(50,5) | PLS | −0.1135 −0.0745 −0.0625 | 0.776 0.762 0.763 | 0.784 0.766 0.766 | 1.586 | 0.274 | 0.345 | 0.489 | 0.166
(100,7) | PETL | −0.0476 −0.0422 −0.0232 | 0.299 0.267 0.247 | 0.302 0.270 0.248 | 3.923 | 0.127 | 0.074 | 0.814 | 0.112
(100,7) | HT | −0.0459 −0.0020 −0.0367 | 0.785 0.802 0.395 | 0.787 0.802 0.397 | 3.203 | 0.233 | 0.461 | 0.424 | 0.115
(100,7) | ST | 0.0010 −0.0369 −0.0412 | 0.810 0.800 0.485 | 0.810 0.801 0.487 | 1.767 | 0.191 | 0.765 | 0.201 | 0.034
(100,7) | PLS | −0.1247 −0.1587 −0.1148 | 0.870 0.794 0.778 | 0.879 0.810 0.786 | 3.548 | 0.343 | 0.316 | 0.459 | 0.225
(200,9) | PETL | −0.0161 −0.0239 −0.0121 | 0.144 0.140 0.097 | 0.145 0.142 0.097 | 5.955 | 0.058 | 0.040 | 0.902 | 0.058
(200,9) | HT | −0.0386 −0.0296 −0.0288 | 0.261 0.233 0.225 | 0.264 0.235 0.227 | 5.433 | 0.125 | 0.276 | 0.635 | 0.089
(200,9) | ST | −0.0331 −0.0295 −0.0151 | 0.205 0.179 0.131 | 0.208 0.182 0.132 | 4.041 | 0.103 | 0.540 | 0.420 | 0.040
(200,9) | PLS | −0.0366 −0.0219 −0.0215 | 0.207 0.164 0.132 | 0.210 0.166 0.133 | 5.772 | 0.106 | 0.150 | 0.764 | 0.086
(500,51) | PETL | −0.0009 0.0003 0.0021 | 0.009 0.009 0.033 | 0.009 0.009 0.033 | 47.991 | 0.000 | 0.007 | 0.993 | 0.000
(500,51) | HT | −0.0259 −0.0283 −0.0235 | 0.167 0.161 0.113 | 0.169 0.163 0.116 | 46.242 | 0.108 | 0.127 | 0.779 | 0.094
(500,51) | ST | −0.0366 −0.0220 −0.0045 | 0.334 0.132 0.082 | 0.336 0.134 0.082 | 35.787 | 0.052 | 0.447 | 0.527 | 0.026
(500,51) | PLS | −0.0434 −0.0359 −0.0240 | 0.226 0.179 0.127 | 0.230 0.183 0.130 | 46.535 | 0.146 | 0.298 | 0.602 | 0.100
Table 3. Bias, SD, and RMSE values of parameter estimates for penalized ET likelihood (PETL), GMM, and PEL in Experiment 3.

(n, p) | Par | PETL: Bias SD RMSE AL CP | GMM: Bias SD RMSE AL CP | PEL: Bias SD RMSE AL CP
(117,8) | β̂12 | −0.0132 0.142 0.143 1.462 97.4 | −0.0323 0.316 0.317 2.646 88.4 | 0.0065 0.549 0.549 1.279 81.2
(117,8) | β̂21 | −0.0208 0.154 0.155 1.952 97.8 | −0.0349 0.320 0.322 2.659 88.7 | −0.0253 0.155 0.157 1.501 93.7
(117,8) | β̂32 | −0.0233 0.161 0.162 1.945 97.5 | −0.0245 0.331 0.332 1.197 81.0 | −0.0229 0.151 0.152 1.269 93.5
(241,32) | β̂12 | −0.0064 0.069 0.069 0.878 97.1 | −0.0319 0.262 0.264 2.912 91.6 | −0.0101 0.105 0.105 1.312 93.5
(241,32) | β̂21 | −0.0029 0.068 0.068 0.618 97.6 | −0.0290 0.146 0.148 2.994 94.4 | −0.0147 0.109 0.110 1.936 94.9
(241,32) | β̂23 | −0.0043 0.066 0.066 0.612 96.8 | −0.0360 0.160 0.164 2.945 92.5 | −0.0135 0.208 0.208 1.960 96.2
(241,32) | β̂32 | 0.0032 0.174 0.174 1.177 97.7 | −0.0214 0.259 0.259 1.981 94.4 | −0.0095 0.099 0.099 1.908 93.8
(241,32) | β̂34 | 0.0000 0.172 0.172 0.884 95.7 | −0.0332 0.149 0.153 1.959 93.2 | −0.0131 0.114 0.115 1.885 92.5
(241,32) | β̂43 | 0.0103 0.177 0.178 1.178 97.6 | −0.0242 0.267 0.268 1.942 92.5 | −0.0034 0.232 0.232 1.935 94.8
(241,32) | β̂54 | −0.0046 0.076 0.076 0.875 96.2 | −0.0240 0.136 0.138 2.915 91.7 | −0.0098 0.231 0.231 1.927 94.3
(602,72) | β̂12 | −0.0007 0.029 0.029 0.598 99.8 | −0.0106 0.117 0.117 1.394 92.9 | −0.0082 0.073 0.074 1.427 95.1
(602,72) | β̂21 | 0.0009 0.030 0.030 0.595 99.3 | −0.0053 0.058 0.058 1.384 92.3 | −0.0120 0.084 0.085 1.429 95.2
(602,72) | β̂23 | 0.0006 0.030 0.030 0.599 99.7 | −0.0106 0.078 0.079 0.579 90.2 | −0.0071 0.066 0.066 1.008 95.0
(602,72) | β̂32 | −0.0001 0.029 0.029 0.442 98.1 | −0.0059 0.061 0.061 0.584 91.8 | −0.0084 0.073 0.074 0.999 93.7
(602,72) | β̂34 | −0.0001 0.031 0.031 0.315 99.8 | −0.0068 0.063 0.064 0.986 93.0 | −0.0061 0.060 0.060 0.601 94.5
(602,72) | β̂43 | 0.0005 0.030 0.030 0.450 99.8 | −0.0073 0.118 0.118 0.600 94.3 | −0.0056 0.060 0.061 1.003 94.7
(602,72) | β̂45 | 0.0061 0.132 0.132 0.443 91.2 | −0.0057 0.061 0.061 0.591 92.8 | −0.0087 0.071 0.071 0.606 94.9
(602,72) | β̂54 | −0.0021 0.129 0.129 0.439 91.4 | −0.0082 0.069 0.069 1.426 95.1 | −0.0072 0.066 0.066 1.009 95.0
(602,72) | β̂56 | 0.0010 0.030 0.030 0.439 97.5 | −0.0069 0.063 0.064 0.990 93.4 | −0.0083 0.073 0.074 1.003 94.2
(602,72) | β̂65 | 0.0033 0.029 0.030 0.442 98.3 | −0.0042 0.051 0.051 1.386 92.4 | −0.0044 0.050 0.050 1.440 96.0
(602,72) | β̂76 | 0.0006 0.029 0.029 0.440 98.1 | −0.0075 0.066 0.066 0.994 93.4 | −0.0047 0.057 0.057 1.007 95.2
Table 4. TP, FP, UF, CF, and OF values of variable selection for penalized ET likelihood (PETL), GMM, and PEL in Experiment 3.

(n, p, q) | PETL: TP FP UF CF OF | GMM: TP FP UF CF OF | PEL: TP FP UF CF OF
(117,8,2) | 4.878 0.114 0.112 0.794 0.094 | 4.703 0.156 0.256 0.630 0.114 | 4.783 0.138 0.195 0.696 0.109
(241,32,3) | 24.840 0.034 0.122 0.851 0.027 | 24.404 0.358 0.362 0.441 0.197 | 24.354 0.141 0.343 0.566 0.091
(602,72,4) | 60.866 0.000 0.081 0.919 0.000 | 60.277 0.118 0.225 0.689 0.086 | 60.253 0.128 0.228 0.686 0.086
Table 5. PETL estimates (Est) and 95% confidence intervals (CI) of parameters under three penalty functions in the real example.

Par. | Lasso: Est (CI) | ALasso: Est (CI) | SCAD: Est (CI)
β̂1 | −2.864 (−3.515, −2.073) | −2.850 (−3.248, −2.455) | −2.929 (−3.223, −2.630)
β̂2 | 0.000 (−0.744, 0.771) | −0.195 (−0.869, 0.375) | −0.012 (−0.116, 0.081)
β̂3 | −0.082 (−0.927, 0.801) | −0.046 (−0.256, 0.152) | −0.134 (−0.241, −0.031)
β̂4 | 0.000 (−0.523, 0.584) | 0.040 (−0.640, 0.614) | 0.084 (−0.435, 0.599)
β̂5 | 0.000 (−0.471, 0.597) | 0.000 (−0.656, 0.600) | 0.000 (−0.296, 0.302)
β̂6 | 0.000 (−0.699, 0.801) | 0.128 (−0.505, 0.727) | 0.111 (−0.191, 0.406)
β̂7 | 0.000 (−0.381, 0.404) | 0.282 (−0.282, 0.871) | 0.283 (−0.011, 0.582)
β̂8 | −3.118 (−3.643, −2.520) | −3.104 (−3.728, −2.491) | −3.139 (−3.636, −2.629)
β̂9 | 0.000 (−1.056, 0.742) | 0.429 (−0.223, 1.006) | 0.469 (0.169, 0.769)
β̂10 | 0.000 (−0.592, 0.625) | 0.000 (−0.416, 0.407) | 0.000 (−0.505, 0.496)
β̂11 | 5.920 (5.374, 6.541) | 5.801 (5.607, 5.984) | 5.865 (5.371, 6.371)
β̂12 | 6.205 (5.570, 6.824) | 6.343 (5.817, 6.897) | 6.207 (5.904, 6.505)
β̂13 | 4.952 (4.428, 5.586) | 4.893 (4.530, 5.304) | 4.939 (4.641, 5.243)
β̂14 | 4.284 (3.599, 5.111) | 4.338 (3.789, 4.867) | 4.481 (3.987, 4.990)
β̂15 | 5.617 (5.211, 6.096) | 5.777 (5.179, 6.373) | 5.727 (5.628, 5.820)
β̂16 | 0.000 (−0.789, 0.741) | 0.000 (−0.389, 0.405) | 0.000 (−0.300, 0.306)
β̂17 | 0.000 (−0.357, 0.442) | 0.000 (−0.433, 0.393) | 0.000 (−0.097, 0.095)
β̂18 | 0.000 (−0.770, 0.797) | 0.000 (−0.425, 0.412) | 0.000 (−0.527, 0.497)
β̂19 | 0.000 (−0.512, 0.566) | 0.000 (−0.615, 0.584) | 0.000 (−0.099, 0.093)
β̂20 | 1.633 (1.078, 2.248) | 1.231 (0.813, 1.618) | 1.230 (0.700, 1.736)
β̂21 | 0.000 (−0.784, 0.797) | 0.000 (−0.614, 0.563) | 0.000 (−0.103, 0.100)
β̂22 | 8.851 (8.037, 9.757) | 8.603 (7.998, 9.211) | 8.478 (8.173, 8.781)
β̂23 | 0.000 (−0.390, 0.413) | 2.993 (2.369, 3.586) | 2.774 (2.248, 3.263)
β̂24 | 0.000 (−0.363, 0.467) | 2.901 (2.345, 3.468) | 2.635 (2.112, 3.134)
β̂25 | 0.000 (−0.364, 0.354) | 3.118 (2.753, 3.498) | 2.971 (2.867, 3.090)
β̂26 | 5.647 (5.060, 6.238) | 5.452 (5.055, 5.836) | 5.461 (4.923, 5.955)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
