Estimating the Complier Average Causal Effect with Non-Ignorable Missing Outcomes Using Likelihood Analysis

Jierui Du; Gao Wen; Xin Liang

doi:10.3390/math12091300

¹ School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China

² School of Mathematics and Statistics, Guangxi Normal University, Guilin 541004, China

* Author to whom correspondence should be addressed.

Mathematics 2024, 12(9), 1300; https://doi.org/10.3390/math12091300

This article belongs to the Special Issue Advances in Statistical AI and Causal Inference


Abstract

Missing data problems arise in randomized trials, which complicates the inference of causal effects if the missing mechanism is non-ignorable. We tackle the challenge of identifying and estimating the complier average causal effect parameters under non-ignorable missingness by including covariates to mitigate the sensitivity to the violation of specific identification assumptions. The missing data mechanism is assumed to follow a logistic model, wherein the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates. We establish the identifiability of the models under mild conditions by assuming that the outcome follows a normal distribution. We develop a computational method to estimate model parameters through a two-step likelihood estimation approach, employing subgroup analysis. The bootstrap method is employed for variance estimation, and the effectiveness of our approach is confirmed through simulation. We apply the proposed method to analyze the household income dataset from the Chinese Household Income Project Survey 2013.

Keywords: complier average causal effect; non-ignorable missingness; identifiability; likelihood; subgroup analysis

MSC: 62D10; 62D20

1. Introduction

Non-compliance and missing data commonly occur in studies of economics, medicine, and public health. Non-compliance arises when certain participants do not adhere to their prescribed treatments, while missing data arise when the researchers are unable to gather information for some participants [1]. Both problems may complicate the inference of causal effects, particularly if the missing mechanism is non-ignorable [2,3]. The missingness is called ignorable if it depends on the observed data only; otherwise, it is called non-ignorable [4,5]. Identifying the complier average causal effect becomes challenging in the presence of both non-compliance and non-ignorable missing values, as it is impossible to identify the full data distribution or the causal effect without additional assumptions.

To address non-compliance, one can assume that the response follows a specific distribution, such as a member of an exponential family, and use maximum likelihood estimation. The authors of [6,7,8] estimated the complier average causal effect (CACE) by maximum likelihood using the EM algorithm [9]. The maximum likelihood method has the advantage of relaxing the exclusion restrictions, which can often be unrealistic, particularly in natural experiment scenarios [10]. Another approach to addressing non-compliance is the instrumental variable method [11], which is a viable alternative when stringent assumptions about the response distribution could result in mis-specification.

The shadow variable strategy is a mainstream solution for establishing identifiability in cases of non-ignorable missingness [12]. Analogous approaches also employ instrumental variables, as suggested by [13]. Nonetheless, the selection of appropriate instrumental or shadow variables poses challenges, particularly among a multitude of covariates [14]. The authors of [15] illustrate that stronger assumptions regarding the response mechanism enable the establishment of identifiability derived from the distribution of the observed data. When the response adheres to exponential families, as noted by [16,17], the identifiability under non-ignorable missingness is readily attainable without the use of instrumental variables.

It is critically important to integrate causal inference with research on non-ignorable missing data [18,19]. The authors of [20,21,22] introduced estimators for the complier average causal effect parameters, assuming that the missing data are ignorable. However, if the missingness is non-ignorable, these methods may yield biased estimators. The recent literature has introduced strategies for addressing missing covariates or outcomes in the context of non-ignorable missingness, which includes but is not limited to the following papers. The study in [1] initially explored the identifiability of parameters in randomized clinical trials characterized by non-compliance and non-ignorable missing binary outcome variables. The study in [2] examined semi-parametric identifiability and developed an estimation method for the complier average causal effect in randomized clinical trials, addressing the challenge of non-ignorable missingness in continuous outcomes. The study in [23] formulated a method using a shadow variable to identify and estimate the complier average causal effect parameters arising from outcomes that represent non-ignorable missingness. The study in [19] tackled the challenge of non-ignorable missingness confounders in causal analysis from observational data, demonstrating that causal effects are ascertainable in scenarios wherein the missingness mechanism is outcome-independent, conditional on the treatment and potential missing confounders. The study in [24] developed semi-parametric estimators to determine the average causal effect, accounting for non-ignorable missing confounders, based on the premise that the missingness of data is independent of the outcome. The study in [25] addressed the problem of identifying the treatment benefit rate and treatment harm rate in scenarios wherein treatment, endpoints, or covariates are missing. 
The study in [26] investigated a treatment-independent missingness assumption, which facilitates the identification of causal effects in situations wherein confounders are subject to non-ignorable missingness.

In the existing literature, limited attention has been given to situations wherein a general missing mechanism model is selected to address non-ignorable missingness. The study in [2] utilized an outcome-dependent non-ignorable missingness model without incorporating an auxiliary variable. The study in [3] proposed using an auxiliary variable that serves as both a shadow variable for non-ignorable missingness and an instrumental variable for causal effects. However, the models used in the above approach assume the absence of covariates, despite their common presence in practical applied research. In this paper, we examine the complier average causal effect parameters under non-ignorable missingness by including covariates to mitigate the sensitivity to the violation of specific identification assumptions. The missing data mechanism is assumed to follow a logistic model, wherein the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates. We establish the identifiability of the models under mild conditions by assuming that the outcome follows a normal distribution. We develop a computational method to estimate model parameters through a two-step likelihood estimation approach, employing subgroup analysis. The bootstrap method is employed for variance estimation, and the effectiveness of our approach is confirmed through simulation. The proposed method is applied to analyze the household income dataset from the Chinese Household Income Project Survey 2013.

The rest of this article is organized as follows. Section 2 introduces the general modeling framework, with notation and assumptions. In Section 3, we give the theoretical results on the identifiability of the parameters and the estimation approach. The performance of the proposed method is evaluated through simulation studies in Section 4. The application to CHIP data is presented in Section 5. Concluding remarks are provided in Section 6.

2. Notation and Assumptions

Let $Y_i$ denote the individual outcome and let $X_i$ represent the individual covariate vector, where $Y_i \in \mathbb{R}$ and $X_i \equiv (1, X_{1,i}, \dots, X_{p,i})^{\top} \in \mathbb{R}^{p+1}$. $Z_i$ denotes the randomized treatment assignment for the $i$th unit in the study, where $i$ ranges from 1 to $n$. We set $Z_i = 1$ if the $i$th individual is allocated to the treatment group and $Z_i = 0$ if it is assigned to the control group. Moreover, $D_i$ represents the treatment received by the $i$th individual, with $D_i = 1$ indicating treatment and $D_i = 0$ indicating no treatment. $D(z)$ and $Y(z)$ denote the potential treatment received and the potential outcome under the assigned treatment $Z = z$, respectively. We define $R(z)$ as the binary response indicator for $Y(z)$, where $R(z) = 1$ if $Y(z)$ is observed and $R(z) = 0$ if $Y(z)$ is missing. Additionally, we define $Y(z, d)$ as the potential outcome under the assigned treatment $Z = z$ and actual treatment $D = d$.

Following [6], we define $U_i$ as the compliance status of the $i$th patient, which is determined as follows:

$$
U_i = \begin{cases}
c & \text{if } D_i(0) = 0 \text{ and } D_i(1) = 1 \\
n & \text{if } D_i(0) = 0 \text{ and } D_i(1) = 0 \\
a & \text{if } D_i(0) = 1 \text{ and } D_i(1) = 1 \\
d & \text{if } D_i(0) = 1 \text{ and } D_i(1) = 0
\end{cases}
$$

where the values c, n, a, and d stand for complier, never-taker, always-taker, and defier, respectively. The complier average causal effect given the predictors $X = x$ is

$$
CACE(x) = E\{Y(1) - Y(0) \mid U = c, X = x\}.
$$

Next, we provide the assumptions needed to ensure the identifiability of $CACE(x)$ under non-ignorable missingness. These are formalized as follows [1,11]:

Assumption 1 (Stable unit treatment value assumption, SUTVA). If $z_i = z_i'$, then $D_i(z) = D_i(z')$. And, if $z_i = z_i'$ and $d_i = d_i'$, then $Y_i(z, d) = Y_i(z', d')$.

Assumption 2 (Randomization). The treatment assignment $Z$ is randomized.

Assumption 3 (Exclusion restrictions). $P(Y(1, d) = Y(0, d) \mid X = x) = 1$ for $d \in \{0, 1\}$.

Assumption 4 (First-stage). Given $X = x$, there is a non-zero average causal effect of $Z$ on $D$, i.e., $E\{D(1) \mid X = x\} \neq E\{D(0) \mid X = x\}$.

Assumption 5 (Monotonicity). $D_i(1) \geq D_i(0)$ for all $i = 1, \dots, N$. Defining $\omega_u = P(U = u)$, where $u \in \{c, n, a, d\}$, we then have $\omega_d = 0$.

Assumption 6 (Compound exclusion restrictions). $P[\{Y(1, d), R(1)\} = \{Y(0, d), R(0)\} \mid X = x] = 1$ for $d \in \{0, 1\}$.

Assumption 7 (Non-ignorable missingness). $E\{R(z) \mid Y(z) = y, D(z) = d, U = u, X = x\} = E\{R(z) \mid Y(z) = y, D(z) = d, X = x\}$ and $E\{R(1) \mid Y(1) = y, D(1) = d, X = x\} = E\{R(0) \mid Y(0) = y, D(0) = d, X = x\}$.

Assumption 8 (Subgroup). $U$ is independent of the covariate vector $X$.

Assumption 9 (Normal). The conditional density of the outcome variable $Y$ has the following normal form:

$$
p(y \mid x; \theta_{zu}) = p(y \mid Z = z, U = u, X = x; \beta_{zu}, \sigma_{zu}^{2}) = \frac{1}{\sqrt{2\pi}\,\sigma_{zu}} \exp\left\{-\frac{(y - \beta_{zu}^{\top} x)^{2}}{2\sigma_{zu}^{2}}\right\}
$$

where $u \in \{c, n, a\}$ is the compliance status and $\theta_{zu} = (\beta_{zu}^{\top}, \sigma_{zu}^{2})^{\top}$.

Assumption 1 implies that $Y_i = D_i Y_i(z, 1) + (1 - D_i) Y_i(z, 0)$ and $D_i = Z_i D_i(1) + (1 - Z_i) D_i(0)$, indicating that the observed outcome equals the potential outcome evaluated at the observed treatment value, and the observed treatment equals the potential treatment evaluated at the assigned treatment [27]. This assumption is typically reasonable when dealing with randomly sampled units. Assumption 2 [1] means that $Z$ is as good as randomly assigned. Assumption 3 asserts that, given $X$, $Y(d) = Y(1, d) = Y(0, d)$. This assumption encapsulates the fundamental principle of instrumental variable procedures: any influence of $Z$ on $Y$ must be channeled through an impact of $Z$ on $D$ [28]. Under Assumption 3, with $X$ fixed, the conditional density of $Y$ does not depend on $Z$ for never-takers and always-takers. Assumption 4 stipulates that, for every stratum defined by $X = x$, $\omega_c$ is strictly positive. Assumption 5 asserts the absence of defiers within the population. Assumptions 1–5 are standard assumptions in causal effect models.

Assumption 6 [2] is credible in a double-blinded clinical trial, as patients are unaware of the treatment assigned to them, and the treatment assignment is performed through randomization. Assumption 7 states that the missingness of the outcome variable is explained by the outcome itself, the treatment received, and the covariates. It implies that the expectation of $R(z)$ given $Y(z) = y$, $D(z) = d$, $U = u$, and $X = x$ is equal to the expectation of $R$ given $Y = y$, $D = d$, and $X = x$. The missing data mechanism is assumed to follow a logistic model, expressed as

$$
E(R \mid Y = y, D = d, X = x) = E_d(R \mid Y = y, X = x) = \operatorname{logit}^{-1}(\alpha_d y + \phi_d^{\top} x) = \pi_d(\alpha_d, \phi_d), \quad d \in \{0, 1\},
$$

where $\operatorname{logit}^{-1}(\cdot) \equiv \exp(\cdot) / \{1 + \exp(\cdot)\}$. Assumption 8 asserts independence between $X$ and $U$, so that the compliance-type proportions $\omega_u$ do not depend on the covariates; this simplifies the two-step maximum likelihood estimation approach using subgroup analysis. The purpose of Assumption 9 is to establish the identifiability of the models under mild conditions, facilitating the creation of a computational method to estimate model parameters.
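The logistic missingness mechanism can be illustrated with a minimal sketch. The function names and the numerical values of $\alpha_d$ and $\phi_d$ below are illustrative assumptions, not quantities taken from the paper; the covariate vector is assumed to carry the leading 1 for the intercept.

```python
import math
import random


def inv_logit(t):
    """Numerically stable inverse logit exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def response_prob(y, x, alpha_d, phi_d):
    """pi_d = logit^{-1}(alpha_d * y + phi_d' x); x includes the leading 1."""
    linear = alpha_d * y + sum(p * xi for p, xi in zip(phi_d, x))
    return inv_logit(linear)


def draw_response(y, x, alpha_d, phi_d, rng):
    """Draw R in {0, 1}; R = 1 means the outcome is observed."""
    return 1 if rng.random() < response_prob(y, x, alpha_d, phi_d) else 0


# Illustrative values only (hypothetical, not from the paper).
rng = random.Random(0)
x = [1.0, 0.5]
p_low = response_prob(-1.0, x, alpha_d=0.8, phi_d=[0.2, -0.3])
p_high = response_prob(2.0, x, alpha_d=0.8, phi_d=[0.2, -0.3])
r = draw_response(2.0, x, 0.8, [0.2, -0.3], rng)
```

With a positive $\alpha_d$, larger outcomes are more likely to be observed, which is what makes the missingness depend on the possibly unobserved value of $Y$ itself.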

3. Identifiability and Estimation

In this section, we initially construct a model for non-compliant data and missing data, outlining the steps to solve the model. Subsequently, we theoretically establish the identifiability of the model involved in each step. Given that the second step entails a system of non-linear equations and integral operations, which can be challenging to handle analytically, we present a numerical calculation method to address this complexity.

We introduce a convenient two-stage estimation process based on the random sample $(R_i, D_i, Z_i, X_i, Y_i)$, $i = 1, \dots, N$. Given $X = x$, the joint density function of $(y, z, u)$ can be expressed as follows:

$$
f(y, z, u \mid x) = f(z \mid x) f(u \mid z, x) f(y \mid z, u, x) = f(z) f(u \mid z) f(y \mid z, u, x).
$$

Assumptions 2 and 8 ensure the validity of the equation above. Let $\xi = P(Z = 1)$, $\omega_a = P(U = a) = P(D = 1 \mid Z = 0)$, $\omega_n = P(U = n) = P(D = 0 \mid Z = 1)$, $\omega_c = 1 - \omega_n - \omega_a$, and denote $\eta = (\xi, \omega_a, \omega_n)^{\top}$. Assumption 3 implies that $f(y \mid x; \theta_{1a}) = f(y \mid x; \theta_{0a})$, denoted as $f(y \mid x; \theta_a)$, and $f(y \mid x; \theta_{1n}) = f(y \mid x; \theta_{0n})$, denoted as $f(y \mid x; \theta_n)$. Let $n_z = \#\{i : Z_i = z\}$ and $n_{zd} = \#\{i : Z_i = z, D_i = d\}$ for $z \in \{0, 1\}$ and $d \in \{0, 1\}$. As stated in [2], the full likelihood for $(\eta, \alpha, \phi, \theta)$ can be expressed as

$$
L(\eta, \alpha, \phi, \theta) \propto L(\eta)\, L(\alpha_1, \phi_1, \theta_a, \theta_{1c})\, L(\alpha_1, \phi_1, \theta_a)\, L(\alpha_0, \phi_0, \theta_n, \theta_{0c})\, L(\alpha_0, \phi_0, \theta_n) \tag{1}
$$

where

$$
L(\eta) = \xi^{n_1} (1 - \xi)^{n_0} (1 - \omega_n)^{n_{11}} \omega_n^{n_{10}} \omega_a^{n_{01}} (1 - \omega_a)^{n_{00}} \tag{2}
$$

$$
\begin{aligned}
L(\alpha_1, \phi_1, \theta_a, \theta_{1c}) = \prod_{i : (Z_i, D_i) = (1, 1)} & \left[\pi_1(\alpha_1, \phi_1) \left\{\frac{\omega_a}{\omega_a + \omega_c}\, p(Y_i \mid X_i; \theta_a) + \frac{\omega_c}{\omega_a + \omega_c}\, p(Y_i \mid X_i; \theta_{1c})\right\}\right]^{R_i} \\
& \times \left[\int \{1 - \pi_1(\alpha_1, \phi_1)\} \left\{\frac{\omega_a}{\omega_a + \omega_c}\, p(y \mid X_i; \theta_a) + \frac{\omega_c}{\omega_a + \omega_c}\, p(y \mid X_i; \theta_{1c})\right\} dy\right]^{1 - R_i}
\end{aligned} \tag{3}
$$

$$
L(\alpha_1, \phi_1, \theta_a) = \prod_{i : (Z_i, D_i) = (0, 1)} \left\{\pi_1(\alpha_1, \phi_1)\, p(Y_i \mid X_i; \theta_a)\right\}^{R_i} \left\{\int \{1 - \pi_1(\alpha_1, \phi_1)\}\, p(y \mid X_i; \theta_a)\, dy\right\}^{1 - R_i} \tag{4}
$$

$$
L(\alpha_0, \phi_0, \theta_n) = \prod_{i : (Z_i, D_i) = (1, 0)} \left\{\pi_0(\alpha_0, \phi_0)\, p(Y_i \mid X_i; \theta_n)\right\}^{R_i} \left\{\int \{1 - \pi_0(\alpha_0, \phi_0)\}\, p(y \mid X_i; \theta_n)\, dy\right\}^{1 - R_i} \tag{5}
$$

$$
\begin{aligned}
L(\alpha_0, \phi_0, \theta_n, \theta_{0c}) = \prod_{i : (Z_i, D_i) = (0, 0)} & \left[\pi_0(\alpha_0, \phi_0) \left\{\frac{\omega_n}{\omega_n + \omega_c}\, p(Y_i \mid X_i; \theta_n) + \frac{\omega_c}{\omega_n + \omega_c}\, p(Y_i \mid X_i; \theta_{0c})\right\}\right]^{R_i} \\
& \times \left[\int \{1 - \pi_0(\alpha_0, \phi_0)\} \left\{\frac{\omega_n}{\omega_n + \omega_c}\, p(y \mid X_i; \theta_n) + \frac{\omega_c}{\omega_n + \omega_c}\, p(y \mid X_i; \theta_{0c})\right\} dy\right]^{1 - R_i}
\end{aligned} \tag{6}
$$

It is evident from (2) that the maximum likelihood estimator for $\eta$ is $(\hat{\xi}, \hat{\omega}_a, \hat{\omega}_n) = (n_1 / n, n_{01} / n_0, n_{10} / n_1)$, which coincides with the moment estimator, thus ensuring its identifiability. Next, we propose a two-stage estimation process for $(\alpha_d, \phi_d, \theta_{zu})$ and establish the identifiability of the models involved in these two steps.
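Because (2) factorizes into three binomial-type terms, each component of $\eta$ is maximized by a sample proportion within the relevant assignment arm, using $\omega_a = P(D = 1 \mid Z = 0)$ and $\omega_n = P(D = 0 \mid Z = 1)$. The following sketch computes these proportions from the counts $n_1$, $n_0$, $n_{01}$, and $n_{10}$; the function name and the toy data are hypothetical.

```python
def estimate_eta(z, d):
    """Return (xi, omega_a, omega_n, omega_c) from 0/1 lists z (assigned) and d (received)."""
    n = len(z)
    n1 = sum(z)                      # number assigned to treatment
    n0 = n - n1                      # number assigned to control
    n01 = sum(1 for zi, di in zip(z, d) if zi == 0 and di == 1)  # always-takers seen in control arm
    n10 = sum(1 for zi, di in zip(z, d) if zi == 1 and di == 0)  # never-takers seen in treatment arm
    xi = n1 / n
    omega_a = n01 / n0               # estimate of P(D = 1 | Z = 0)
    omega_n = n10 / n1               # estimate of P(D = 0 | Z = 1)
    omega_c = 1.0 - omega_a - omega_n
    return xi, omega_a, omega_n, omega_c


# Toy data (hypothetical): four units per arm.
z = [1, 1, 1, 1, 0, 0, 0, 0]
d = [1, 1, 1, 0, 1, 0, 0, 0]
xi_hat, omega_a_hat, omega_n_hat, omega_c_hat = estimate_eta(z, d)
```

In this toy sample, one of the four control units takes the treatment and one of the four treated units does not, so both non-complier proportions are estimated as 0.25 and the complier proportion as 0.5.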

Step 1. Based on Assumptions 3 and 7, the conditional density of the outcome $Y$ for never-takers and always-takers, as well as the missing data models, do not depend on $Z$. The likelihood function computed on always-takers with $(Z_i, D_i) = (0, 1)$ is (4), and the likelihood function computed on never-takers with $(Z_i, D_i) = (1, 0)$ is (5). Maximizing (4) and (5) yields the maximum likelihood estimators $(\hat{\alpha}_1, \hat{\phi}_1, \hat{\theta}_a)$ and $(\hat{\alpha}_0, \hat{\phi}_0, \hat{\theta}_n)$ of $(\alpha_1, \phi_1, \theta_a)$ and $(\alpha_0, \phi_0, \theta_n)$, respectively.

Step 2. We substitute $(\hat{\xi}, \hat{\omega}_a, \hat{\omega}_n)$, $(\hat{\alpha}_0, \hat{\phi}_0, \hat{\theta}_n)$, and $(\hat{\alpha}_1, \hat{\phi}_1, \hat{\theta}_a)$ into (3) and (6) and use the moments of the normal mixture distribution together with the inverse probability weighting method (more details can be found in [29]) to obtain the estimators of $\theta_{1c}$ and $\theta_{0c}$, denoted as $\hat{\theta}_{1c}$ and $\hat{\theta}_{0c}$.

Non-ignorable missingness in $Y$ presents challenges to the identifiability of the observed likelihood functions (4) and (5), as emphasized by [12]. The next theorem shows that the parameters are all identifiable under mild conditions.

Theorem 1. Suppose there exists one continuous covariate; if the sign of some element of $(\alpha_d, \phi_d^{\top})^{\top}$ is known, then the vector of parameters $(\alpha_d, \phi_d, \theta_{zu})$ is identifiable, where $d \in \{0, 1\}$ and $zu \in \{1c, 0c, n, a\}$.

Proof. (i) First, we establish the identifiability of the parameters $(\alpha_1, \phi_1, \theta_a)$. Suppose there exist two sets of parameters $(\alpha_1, \phi_1, \theta_a)$ and $(\alpha_1^{*}, \phi_1^{*}, \theta_a^{*})$ such that

$$
\log \pi(y, x; \alpha_1, \phi_1) + \log p(y \mid x; \theta_a) = \log \pi(y, x; \alpha_1^{*}, \phi_1^{*}) + \log p(y \mid x; \theta_a^{*}) \tag{7}
$$

which holds for all $(y, x)$. The authors of [12] showed that $(\alpha_1, \phi_1, \theta_a)$ is identifiable if (7) implies that

$$
\alpha_1 = \alpha_1^{*}, \quad \phi_1 = \phi_1^{*}, \quad \theta_a = \theta_a^{*}.
$$

Without loss of generality, suppose $X_1$ is a continuous variable that can take any real value, and let the covariates $X_2, \dots, X_p$ be omitted, since we can consider $X_2, \dots, X_p$ as fixed while $X_1$ varies. Because the proof for the case $\phi_{1,1} = 0$ can be obtained by mimicking the following process, here we consider the case $\phi_{1,1} \neq 0$. Let $g(\cdot) = \log \pi(\cdot)$; then, $g(\alpha_1 y + \phi_1^{\top} x) + \log p(y \mid x; \theta_a) = g(\alpha_1^{*} y + \phi_1^{*\top} x) + \log p(y \mid x; \theta_a^{*})$ can be written in the following form:

$$
g(\alpha_1 y + \phi_{1,0} + \phi_{1,1} x_1) + \log p(y \mid x_1; \beta_{a,0}, \beta_{a,1}, \sigma_a^{2}) = g(\alpha_1^{*} y + \phi_{1,0}^{*} + \phi_{1,1}^{*} x_1) + \log p(y \mid x_1; \beta_{a,0}^{*}, \beta_{a,1}^{*}, \sigma_a^{2*}) \tag{8}
$$

Applying the operation $\partial / \partial x_1$ on both sides of (8) yields

$$
g'(\alpha_1 y + \phi_{1,0} + \phi_{1,1} x_1)\, \phi_{1,1} + \frac{y - \beta_{a,0} - \beta_{a,1} x_1}{\sigma_a^{2}}\, \beta_{a,1} = g'(\alpha_1^{*} y + \phi_{1,0}^{*} + \phi_{1,1}^{*} x_1)\, \phi_{1,1}^{*} + \frac{y - \beta_{a,0}^{*} - \beta_{a,1}^{*} x_1}{\sigma_a^{2*}}\, \beta_{a,1}^{*} \tag{9}
$$

Applying the operation $\partial^{2} / \partial y^{2}$ on both sides of (9) yields

$$
g^{(3)}(\alpha_1 y + \phi_{1,0} + \phi_{1,1} x_1)\, \phi_{1,1} \alpha_1^{2} = g^{(3)}(\alpha_1^{*} y + \phi_{1,0}^{*} + \phi_{1,1}^{*} x_1)\, \phi_{1,1}^{*} \alpha_1^{*2}. \tag{10}
$$

If $\phi_{1,1}^{*} = 0$ or $\alpha_1^{*} = 0$, then (10) reduces to $g^{(3)}(\alpha_1 y + \phi_{1,0} + \phi_{1,1} x_1)\, \phi_{1,1} \alpha_1^{2} = 0$ and then $\phi_{1,1} \alpha_1^{2} = 0$, which contradicts the assumption $\phi_{1,1} \neq 0$ and $\alpha_1 \neq 0$. Now, we consider the case $\phi_{1,1}^{*} \neq 0$ and $\alpha_1^{*} \neq 0$. Since $g(\cdot) = \log \operatorname{logit}^{-1}(\cdot)$, the roots of the derivatives of $g(t)$ are as follows:

$$
\begin{cases}
g'(t) \text{ has no roots} \\
g''(t) \text{ has no roots} \\
g^{(3)}(t) \text{ has one root: } 0 \\
g^{(4)}(t) \text{ has two roots: } \log(2 + \sqrt{3}),\ \log(2 - \sqrt{3}) \\
g^{(5)}(t) \text{ has three roots: } 0,\ \log(5 + 2\sqrt{6}),\ \log(5 - 2\sqrt{6})
\end{cases}
$$

Suppose that $\phi_{1,1} \alpha_1^{*} \neq \phi_{1,1}^{*} \alpha_1$; we show that this leads to a contradiction. In this case, the line $\alpha_1 y + \phi_{1,0} + \phi_{1,1} x_1$ intersects with $\alpha_1^{*} y + \phi_{1,0}^{*} + \phi_{1,1}^{*} x_1$ at some point $(\dot{x}_1, \dot{y})$; then, we have $\alpha_1 \dot{y} + \phi_{1,0} + \phi_{1,1} \dot{x}_1 = \alpha_1^{*} \dot{y} + \phi_{1,0}^{*} + \phi_{1,1}^{*} \dot{x}_1 = t$. If $t = 0$, by applying the operations $\partial / \partial y$ and $\partial / \partial x_1$ on both sides of (10) at $(\dot{x}_1, \dot{y})$, we have $\phi_{1,1} \alpha_1^{3} = \phi_{1,1}^{*} \alpha_1^{*3}$ and $\phi_{1,1}^{2} \alpha_1^{2} = \phi_{1,1}^{*2} \alpha_1^{*2}$, and then $\phi_{1,1} \alpha_1^{*} = \phi_{1,1}^{*} \alpha_1$, which is a contradiction. If $t$ is equal to $\log(2 + \sqrt{3})$ or $\log(2 - \sqrt{3})$, by (10) and by applying the operation $\partial^{2} / \partial x_1 \partial y$ on both sides of (10) at $(\dot{x}_1, \dot{y})$, we have $\phi_{1,1} \alpha_1^{2} = \phi_{1,1}^{*} \alpha_1^{*2}$ and $\phi_{1,1}^{2} \alpha_1^{3} = \phi_{1,1}^{*2} \alpha_1^{*3}$, and then $\phi_{1,1} \alpha_1^{*} = \phi_{1,1}^{*} \alpha_1$, which is a contradiction. If $t \notin \{0, \log(2 + \sqrt{3}), \log(2 - \sqrt{3})\}$, by (10) and by applying the operation $\partial / \partial y$ on both sides of (10) at $(\dot{x}_1, \dot{y})$, we have $\phi_{1,1} \alpha_1^{2} = \phi_{1,1}^{*} \alpha_1^{*2}$ and $\phi_{1,1} \alpha_1^{3} = \phi_{1,1}^{*} \alpha_1^{*3}$, which means that $\phi_{1,1} = \phi_{1,1}^{*}$ and $\alpha_1 = \alpha_1^{*}$; then, $\phi_{1,1} \alpha_1^{*} = \phi_{1,1}^{*} \alpha_1$, which is a contradiction. Thus, $\phi_{1,1} \alpha_1^{*} = \phi_{1,1}^{*} \alpha_1$.

Let $\phi_{1,1} / \phi_{1,1}^{*} = \alpha_1 / \alpha_1^{*} = k$; then (10) reduces to

$$
g^{(3)}(k s + \phi_{1,0})\, k^{3} = g^{(3)}(s + \phi_{1,0}^{*}) \tag{11}
$$

with $s = \alpha_1^{*} y + \phi_{1,1}^{*} x_1$. If $k \neq 1$, the line $k s + \phi_{1,0}$ intersects with $s + \phi_{1,0}^{*}$ at some $\dot{s}$; then, $k \dot{s} + \phi_{1,0} = \dot{s} + \phi_{1,0}^{*} = t$. If $t \neq 0$, let $s = \dot{s}$ in (11); then, $k = 1$, which is a contradiction. If $t = 0$, by applying the operation $\partial / \partial s$ on both sides of (11) at $\dot{s}$, we have $k^{4} = 1$, that is, $k = 1$ or $k = -1$. If $k = -1$, then $\alpha_1 = -\alpha_1^{*}$, $\phi_{1,1} = -\phi_{1,1}^{*}$, and $-\dot{s} + \phi_{1,0} = \dot{s} + \phi_{1,0}^{*} = 0$, which means that $\phi_{1,0} = -\phi_{1,0}^{*}$, that is, $(\alpha_1^{*}, \phi_{1,0}^{*}, \phi_{1,1}^{*}) = -(\alpha_1, \phi_{1,0}, \phi_{1,1})$. Recall the condition that the sign of some element of $(\alpha_1, \phi_{1,0}, \phi_{1,1})$ is assumed to be known. Therefore, the case $t = 0$ and $k = -1$ is impossible. Hence, we have $k = 1$; then, (11) reduces to

$$
g^{(3)}(s + \phi_{1,0}) = g^{(3)}(s + \phi_{1,0}^{*})
$$

and we have $\phi_{1,0} = \phi_{1,0}^{*}$ because $g^{(3)}(\cdot)$ has only one maximum point. Now, we have $(\alpha_1^{*}, \phi_{1,0}^{*}, \phi_{1,1}^{*}) = (\alpha_1, \phi_{1,0}, \phi_{1,1})$, and the condition in (8) reduces to

$$
\log p(y \mid x_1; \beta_{a,0}, \beta_{a,1}, \sigma_a^{2}) = \log p(y \mid x_1; \beta_{a,0}^{*}, \beta_{a,1}^{*}, \sigma_a^{2*}) \tag{12}
$$

We can readily obtain $(\beta_{a,0}, \beta_{a,1}, \sigma_a^{2}) = (\beta_{a,0}^{*}, \beta_{a,1}^{*}, \sigma_a^{2*})$ from (12). Thus, $(\alpha_1, \phi_1, \theta_a)$ is identifiable. In an analogous manner, $(\alpha_0, \phi_0, \theta_n)$ is identifiable.

(ii) As in part (i), we now only need to demonstrate the following: if

$$
g(\alpha_1, \phi_1) + \log \left\{\frac{\omega_a}{\omega_a + \omega_c}\, p(y \mid x; \theta_a) + \frac{\omega_c}{\omega_a + \omega_c}\, p(y \mid x; \theta_{1c})\right\} \tag{13}
$$

$$
= g(\alpha_1^{*}, \phi_1^{*}) + \log \left\{\frac{\omega_a}{\omega_a + \omega_c}\, p(y \mid x; \theta_a^{*}) + \frac{\omega_c}{\omega_a + \omega_c}\, p(y \mid x; \theta_{1c}^{*})\right\} \tag{14}
$$

then $\theta_{1c} = \theta_{1c}^{*}$, so that $\theta_{1c}$ is identifiable. Substituting $(\alpha_1, \phi_1, \theta_a) = (\alpha_1^{*}, \phi_1^{*}, \theta_a^{*})$, which was established in part (i), into (13) and (14), the equality reduces to

$$
p(y \mid x; \theta_{1c}) = p(y \mid x; \theta_{1c}^{*}) \tag{15}
$$

and we can readily obtain $\theta_{1c} = \theta_{1c}^{*}$; then, $\theta_{1c}$ is identifiable. Likewise, $\theta_{0c}$ is identifiable.

Therefore, the conclusion of Theorem 1 follows. □
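The roots of the derivatives of $g(t) = \log \operatorname{logit}^{-1}(t)$ used in the proof can be checked numerically. The sketch below relies on closed forms that follow from repeated differentiation with $p = \operatorname{logit}^{-1}(t)$ and $dp/dt = p(1 - p)$; these closed forms and the function names are our own working, not expressions quoted from the paper.

```python
import math


def expit(t):
    """Inverse logit; adequate for the moderate |t| used here."""
    return 1.0 / (1.0 + math.exp(-t))


def g1(t):
    return 1.0 - expit(t)                      # g'(t) > 0 everywhere


def g2(t):
    p = expit(t)
    return -p * (1 - p)                        # g''(t) < 0 everywhere


def g3(t):
    p = expit(t)
    return -p * (1 - p) * (1 - 2 * p)          # root at p = 1/2, i.e. t = 0


def g4(t):
    p = expit(t)
    return -p * (1 - p) * (1 - 6 * p + 6 * p * p)


def g5(t):
    p = expit(t)
    return -p * (1 - p) * (1 - 2 * p) * (1 - 12 * p + 12 * p * p)


roots_g4 = [math.log(2 + math.sqrt(3)), math.log(2 - math.sqrt(3))]
roots_g5 = [0.0, math.log(5 + 2 * math.sqrt(6)), math.log(5 - 2 * math.sqrt(6))]

# Residuals at the claimed roots; each should be zero up to rounding error.
residuals = [g3(0.0)] + [g4(r) for r in roots_g4] + [g5(r) for r in roots_g5]
```

The factorized forms also make the structure of the argument visible: every derivative from the third onward carries the factor $p(1 - p)$, which never vanishes, so the roots come only from the polynomial factors in $p$.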

The sign of some element of $(\alpha_d, \phi_d^{\top})^{\top}$ is often easy to determine in practice. According to [30], factors such as respondents’ cognitive level, motivation, and social status influence non-response probability. Leveraging this insight, we can speculate on the trend of non-response probability and infer the sign of the parameters in the missing mechanism model. For example, in a household income survey, low-income individuals may be less inclined to disclose their true income, implying $\alpha_d > 0$.

Given the similarity between the methods for estimating $(\alpha_0, \phi_0, \theta_n)$ and $\theta_{0c}$ and those for estimating $(\alpha_1, \phi_1, \theta_a)$ and $\theta_{1c}$, we describe only the estimation of $(\alpha_1, \phi_1, \theta_a)$ and $\theta_{1c}$, in two separate steps. The likelihood function computed on always-takers with $(Z_i, D_i) = (0, 1)$ is (4). Thus, maximizing (4) with respect to the parameters $(\alpha_1, \phi_1, \theta_a)$ is equivalent to finding $\alpha_1$, $\phi_1$, and $\theta_a$ that maximize

$$
l_n(\alpha_1, \phi_1, \theta_a) = \sum_{i = 1}^{n} l_{ni}(\alpha_1, \phi_1, \theta_a) \tag{16}
$$

with

$$
l_{ni}(\alpha_1, \phi_1, \theta_a) = r_i \log \pi(Y_i, X_i; \alpha_1, \phi_1) + r_i \log p_a(Y_i \mid X_i; \theta_a) + (1 - r_i) \log \int \{1 - \pi(y, X_i; \alpha_1, \phi_1)\}\, p_a(y \mid X_i; \theta_a)\, dy
$$

and

$$
p_a(Y_i \mid X_i; \theta_a) = \frac{1}{\sqrt{2\pi}\,\sigma_a} \exp\left\{-\frac{(Y_i - \beta_a^{\top} X_i)^{2}}{2\sigma_a^{2}}\right\}
$$

where $\theta_a = (\beta_a^{\top}, \sigma_a^{2})^{\top}$.
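To make the structure of the log-likelihood contribution $l_{ni}$ concrete, the following sketch evaluates it for a single always-taker. The integral for a missing outcome is approximated here by a plain trapezoid rule on a grid of width ten standard deviations around the conditional mean; this quadrature is our own illustrative choice and is not the Monte Carlo scheme developed below. All function names and parameter values are hypothetical.

```python
import math


def inv_logit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def normal_pdf(y, mean, sigma2):
    return math.exp(-(y - mean) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def missing_prob(x, alpha, phi, beta, sigma2, grid=2001, width=10.0):
    """Trapezoid approximation of the integral of {1 - pi(y, x)} p_a(y | x) over y."""
    mean, sd, lin = dot(beta, x), math.sqrt(sigma2), dot(phi, x)
    lo = mean - width * sd
    h = 2.0 * width * sd / (grid - 1)
    total = 0.0
    for k in range(grid):
        y = lo + k * h
        weight = 0.5 if k in (0, grid - 1) else 1.0   # trapezoid end-point weights
        total += weight * (1.0 - inv_logit(alpha * y + lin)) * normal_pdf(y, mean, sigma2)
    return total * h


def loglik_i(r, y, x, alpha, phi, beta, sigma2):
    """Contribution l_ni: observed part if r == 1, integrated part if r == 0."""
    if r == 1:
        return (math.log(inv_logit(alpha * y + dot(phi, x)))
                + math.log(normal_pdf(y, dot(beta, x), sigma2)))
    return math.log(missing_prob(x, alpha, phi, beta, sigma2))


# Hypothetical parameter values for illustration.
x, phi, beta, sigma2 = [1.0, 0.3], [0.4, -0.2], [1.0, 2.0], 1.5
ll_observed = loglik_i(1, 2.1, x, 0.5, phi, beta, sigma2)
ll_missing = loglik_i(0, None, x, 0.5, phi, beta, sigma2)
```

A useful sanity check is the case $\alpha_1 = 0$: the response probability then no longer depends on $y$, and the integral collapses to $1 - \operatorname{logit}^{-1}(\phi_1^{\top} x)$.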

For convenience, we denote $\pi(y, X_i; \alpha_1, \phi_1)$ by $\pi_{1,i}$. The score equations can be derived by taking the derivatives of (16) with respect to $\alpha_1$, $\phi_1$, and $\theta_a$, yielding

$$
\sum_{i = 1}^{n} U_i(\alpha_1, \phi_1, \theta_a) = \begin{pmatrix}
\sum_{i = 1}^{n} U_{\alpha_1, i}(\alpha_1, \phi_1, \theta_a) \\
\sum_{i = 1}^{n} U_{\phi_1, i}(\alpha_1, \phi_1, \theta_a) \\
\sum_{i = 1}^{n} U_{\beta_a, i}(\alpha_1, \phi_1, \theta_a) \\
\sum_{i = 1}^{n} U_{\sigma_a^{2}, i}(\alpha_1, \phi_1, \theta_a)
\end{pmatrix} = 0 \tag{17}
$$

where

$$
\begin{aligned}
U_{\alpha_1, i}(\alpha_1, \phi_1, \theta_a) &= R_i (1 - \pi_{1,i}) Y_i - (1 - R_i) \frac{E\{\pi_{1,i} (1 - \pi_{1,i}) Y_i \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)} \\
U_{\phi_1, i}(\alpha_1, \phi_1, \theta_a) &= \left[R_i (1 - \pi_{1,i}) - (1 - R_i) \frac{E\{\pi_{1,i} (1 - \pi_{1,i}) \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)}\right] X_i \\
U_{\beta_a, i}(\alpha_1, \phi_1, \theta_a) &= \frac{1}{\sigma_a^{2}} \left[X_i \left\{R_i Y_i + (1 - R_i) \frac{E\{(1 - \pi_{1,i}) Y_i \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)}\right\} - X_i X_i^{\top} \beta_a\right]
\end{aligned}
$$

and

$$
U_{\sigma_a^{2}, i}(\alpha_1, \phi_1, \theta_a) = \frac{1}{2\sigma_a^{4}} \left[\left\{R_i (Y_i - X_i^{\top} \beta_a)^{2} + (1 - R_i) \frac{E\{(1 - \pi_{1,i}) (Y_i - X_i^{\top} \beta_a)^{2} \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)}\right\} - \sigma_a^{2}\right]
$$

However, computing the integrals involved in (17) is challenging. To circumvent this difficulty, we propose utilizing Monte Carlo approximation. Given $\theta_a$, we can sample $Y$ directly from the conditional distribution with density function $p(y \mid X_i; \theta_a)$. Let $Y_{i1}, Y_{i2}, \dots, Y_{im}$ denote a sample of size $m$, and write $\pi_{1,ij} = \pi(Y_{ij}, X_i; \alpha_1, \phi_1)$. We use the approximations

$$
\begin{aligned}
\frac{E\{(1 - \pi_{1,i}) Y_i \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)} &\approx \frac{\sum_{j = 1}^{m} (1 - \pi_{1,ij}) Y_{ij}}{\sum_{j = 1}^{m} (1 - \pi_{1,ij})} \equiv I_{1,i}(\alpha_1, \phi_1) \\
\frac{E\{\pi_{1,i} (1 - \pi_{1,i}) \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)} &\approx \frac{\sum_{j = 1}^{m} \pi_{1,ij} (1 - \pi_{1,ij})}{\sum_{j = 1}^{m} (1 - \pi_{1,ij})} \equiv I_{2,i}(\alpha_1, \phi_1) \\
\frac{E\{(1 - \pi_{1,i}) Y_i^{2} \mid X_i\}}{E(1 - \pi_{1,i} \mid X_i)} &\approx \frac{\sum_{j = 1}^{m} (1 - \pi_{1,ij}) Y_{ij}^{2}}{\sum_{j = 1}^{m} (1 - \pi_{1,ij})} \equiv I_{3,i}(\alpha_1, \phi_1)
\end{aligned}
$$
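The three Monte Carlo ratios can be computed in a single pass over the simulated outcomes. The sketch below is a minimal illustration under the normal outcome model of Assumption 9; the function name `mc_ratios`, the seed, and the parameter values are hypothetical.

```python
import math
import random


def inv_logit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def mc_ratios(x, alpha, phi, beta, sigma2, m, rng):
    """Return Monte Carlo approximations (I_1, I_2, I_3) for one covariate vector x."""
    mean = sum(b * xi for b, xi in zip(beta, x))
    sd = math.sqrt(sigma2)
    lin = sum(p * xi for p, xi in zip(phi, x))
    num1 = num2 = num3 = den = 0.0
    for _ in range(m):
        y = rng.gauss(mean, sd)               # Y_ij drawn from p(y | X_i; theta_a)
        pi = inv_logit(alpha * y + lin)       # response probability at the draw
        den += 1.0 - pi
        num1 += (1.0 - pi) * y
        num2 += pi * (1.0 - pi)
        num3 += (1.0 - pi) * y * y
    return num1 / den, num2 / den, num3 / den


# Hypothetical values for illustration.
rng = random.Random(42)
I1, I2, I3 = mc_ratios([1.0, 0.5], alpha=0.7, phi=[0.2, -0.3],
                       beta=[0.5, 1.0], sigma2=1.0, m=5000, rng=rng)
```

Each ratio is a weighted average of the simulated draws with weights proportional to the probability of being missing, so $I_1$ can be read as an approximate conditional mean of $Y$ among non-respondents with covariates $X_i$.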

As stated in [17], the additional variability introduced by the Monte Carlo approximation becomes negligible when the resample size $m$ is sufficiently large, for example, $m = O(n^{1 + \epsilon})$, where $\epsilon$ is a small positive constant; this justifies the use of $\approx$ above. Denote the parameter values at iteration $t$ by $(\alpha_1^{t}, \phi_1^{t}, \theta_a^{t})$; the values $(\alpha_1^{t + 1}, \phi_1^{t + 1}, \theta_a^{t + 1})$ at iteration $t + 1$ are then obtained by solving

$$
\begin{aligned}
\sum_{i = 1}^{n} \left[R_i (1 - \pi_{1,i}) Y_i - (1 - R_i)\, I_{1,i}(\alpha_1^{t}, \phi_1^{t})\right] &= 0 \\
\sum_{i = 1}^{n} \left[R_i (1 - \pi_{1,i}) - (1 - R_i)\, I_{2,i}(\alpha_1^{t}, \phi_1^{t})\right] X_i &= 0 \\
\sum_{i = 1}^{n} \left[X_i \left\{R_i Y_i + (1 - R_i)\, I_{1,i}(\alpha_1^{t}, \phi_1^{t})\right\} - X_i X_i^{\top} \beta_a\right] &= 0
\end{aligned}
$$

and

$$
\sum_{i = 1}^{n} \left[R_i (Y_i - X_i^{\top} \beta_a)^{2} + (1 - R_i) \left\{I_{3,i}(\alpha_1^{t}, \phi_1^{t}) - 2 I_{1,i}(\alpha_1^{t}, \phi_1^{t})\, X_i^{\top} \beta_a + (X_i^{\top} \beta_a)^{2}\right\} - \sigma_a^{2}\right] = 0
$$

The above equations are iterated until the following is satisfied:

$$
\left\| \left(\alpha_1^{t} - \alpha_1^{t + 1},\ (\phi_1^{t} - \phi_1^{t + 1})^{\top},\ (\theta_a^{t} - \theta_a^{t + 1})^{\top}\right)^{\top} \right\| < tol
$$

where $tol$ is a given small positive constant. Ultimately, we obtain the estimator $(\hat{\alpha}_1, \hat{\phi}_1, \hat{\theta}_a)$ of the parameters $(\alpha_1, \phi_1, \theta_a)$. In the same manner, we can obtain $(\hat{\alpha}_0, \hat{\phi}_0, \hat{\theta}_n)$, where $\hat{\theta}_a = (\hat{\beta}_a^{\top}, \hat{\sigma}_a^{2})^{\top}$ and $\hat{\theta}_n = (\hat{\beta}_n^{\top}, \hat{\sigma}_n^{2})^{\top}$. In practice, the above computation procedure is straightforward to implement, as it relies on the use of the empirical distribution.

According to the assumption of Condition 7, the absence of Y is unrelated to Z. Then, we can focus on the individuals having

(Z_{i}, D_{i}) = (1, 1)

and, using the inverse probability weighting method, obtain the following estimating equations:

\begin{matrix} \sum_{i = 1}^{n} \frac{Y_{i} - X_{i}^{⊤} β_{M}}{π (Y_{i}, X_{i}; {\hat{α}}_{1}, {\hat{ϕ}}_{1})} = 0 \\ \sum_{i = 1}^{n} \frac{{(Y_{i} - X_{i}^{⊤} β_{M})}^{2}}{π (Y_{i}, X_{i}; {\hat{α}}_{1}, {\hat{ϕ}}_{1})} - σ_{M}^{2} = 0 \end{matrix}

Solving these estimating equations yields the estimators

{\hat{β}}_{M}

and

{\hat{σ}}_{M}^{2}

of

β_{M}

and

σ_{M}^{2}

. Using the mean and variance of a two-component normal mixture, we have

\begin{matrix} {\hat{β}}_{M} = \frac{{\hat{ω}}_{a}}{{\hat{ω}}_{a} + {\hat{ω}}_{1 c}} {\hat{β}}_{a} + \frac{{\hat{ω}}_{1 c}}{{\hat{ω}}_{a} + {\hat{ω}}_{1 c}} β_{1 c} \\ {\hat{σ}}_{M}^{2} = \frac{{\hat{ω}}_{a}}{{\hat{ω}}_{a} + {\hat{ω}}_{1 c}} {\hat{σ}}_{a}^{2} + \frac{{\hat{ω}}_{1 c}}{{\hat{ω}}_{a} + {\hat{ω}}_{1 c}} σ_{1 c}^{2} + \frac{{\hat{ω}}_{a} {\hat{ω}}_{1 c}}{{({\hat{ω}}_{a} + {\hat{ω}}_{1 c})}^{2}} {({({\hat{β}}_{a} - β_{1 c})}^{⊤} E X)}^{2} \end{matrix}

By solving the two equations above, we can derive estimators for

β_{1 c}

and

σ_{1 c}^{2}

, denoted as

{\hat{β}}_{1 c}

and

{\hat{σ}}_{1 c}^{2}

. In the same manner, we can obtain

{\hat{θ}}_{0 c}

. Here,

{\hat{θ}}_{1 c} = {({\hat{β}}_{1 c}^{⊤}, {\hat{σ}}_{1 c}^{2})}^{⊤}, {\hat{θ}}_{0 c} = {({\hat{β}}_{0 c}^{⊤}, {\hat{σ}}_{0 c}^{2})}^{⊤}

. By denoting

β_{c} = β_{1 c} - β_{0 c}

, we arrive at the complier average causal effect given the predictors

X = x

as

\begin{matrix} CACE (x) = E {Y (1) - Y (0) | U = c, X = x} = {(β_{1 c} - β_{0 c})}^{⊤} x = β_{c}^{⊤} x, \\ \widehat{CACE} (x) = {({\hat{β}}_{1 c} - {\hat{β}}_{0 c})}^{⊤} x = {\hat{β}}_{c}^{⊤} x \end{matrix}

(18)

Considering the variance of the mixed normal distribution, the corresponding variance is

σ_{c}^{2} = κ σ_{1 c}^{2} + (1 - κ) σ_{0 c}^{2} + κ (1 - κ) {(μ_{1 c} - μ_{0 c})}^{2}

, where

κ = 0.5, μ_{1 c} = β_{1 c}^{⊤} X, μ_{0 c} = β_{0 c}^{⊤} X

. Thus,

\begin{matrix} {\hat{σ}}_{c}^{2} = 0.5 {\hat{σ}}_{1 c}^{2} + 0.5 {\hat{σ}}_{0 c}^{2} + 0.25 {({\hat{β}}_{c}^{⊤} E X)}^{2} \end{matrix}

In practice, we can utilize the bootstrap method to approximate the sampling variance of the estimator of CACE.
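The second estimation step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used for the reported results: the function names `ipw_fit` and `complier_from_mixture` are hypothetical, the inverse probability weighted score is multiplied by the covariate vector and restricted to observed outcomes, the weights are normalized when estimating the variance, and the between-component term of the mixture variance carries the usual weight p(1 − p), where p is the share of always-takers in the group with (Z, D) = (1, 1).

```python
import numpy as np

def ipw_fit(X, y, r, pi):
    """Inverse-probability-weighted least squares using rows with r == 1."""
    w = r / pi                                   # weight is 0 for missing outcomes
    y0 = np.where(r == 1, y, 0.0)                # placeholder where y is missing
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * y0))
    resid = np.where(r == 1, y0 - X @ beta, 0.0)
    sigma2 = np.sum(w * resid**2) / np.sum(w)    # normalized weights (assumption)
    return beta, sigma2

def complier_from_mixture(beta_m, sigma2_m, beta_a, sigma2_a, w_a, w_c, ex):
    """Invert the two-component mixture moments for the complier component."""
    p = w_a / (w_a + w_c)                        # share of always-takers
    beta_c = (beta_m - p * beta_a) / (1.0 - p)
    gap = float((beta_a - beta_c) @ ex)          # mean difference evaluated at E(X)
    sigma2_c = (sigma2_m - p * sigma2_a - p * (1.0 - p) * gap**2) / (1.0 - p)
    return beta_c, sigma2_c

# Usage with made-up inputs: mixture and always-taker estimates in, complier out.
beta_1c, sigma2_1c = complier_from_mixture(
    beta_m=np.array([0.8, 1.2]), sigma2_m=1.1,
    beta_a=np.array([1.0, 1.0]), sigma2_a=1.0,
    w_a=0.3, w_c=0.4, ex=np.array([1.0, 1.0]),
)
print(beta_1c, sigma2_1c)
```

Because the inversion divides by 1 − p, the complier estimates inherit the sampling error of the mixture and always-taker estimates, inflated when compliers form a small share of the group.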

4. Simulations

In this section, we perform simulations to assess the effectiveness of the outlined estimation procedure. Assuming

P (Z = 1) = ξ = 0.5

and

P (U = a) = ω_{a} = 0.3, P (U = n) = ω_{n} = 0.3

, we divide all samples into six groups based on

(Z = z, U = u)

, with the number in each group denoted as

n_{z u}

. The covariates

X

in all groups follow a bivariate normal distribution with the following mean vector and covariance matrix:

\begin{matrix} μ = (\begin{matrix} 1 \\ 0 \end{matrix}), Σ = (\begin{matrix} 1 & 1 \\ 1 & 2 \end{matrix}) \end{matrix}

Conditional on

X

, if

(Z_{i}, U_{i}) = (1, a)

or

(0, a)

, the response Y is generated from

\begin{matrix} Y | X \sim N (β_{a, 0} + β_{a, 1} X_{1} + β_{a, 2} X_{2}, σ_{a}^{2}) \end{matrix}

where

{(β_{a, 0}, β_{a, 1}, β_{a, 2}, σ_{a}^{2})}^{⊤} = {(1, 1, 2, 1)}^{⊤}

. Instead, if

(Z_{i}, U_{i}) = (1, n)

or

(0, n)

, we use

\begin{matrix} Y | X \sim N (β_{n, 0} + β_{n, 1} X_{1} + β_{n, 2} X_{2}, σ_{n}^{2}) \end{matrix}

where

{(β_{n, 0}, β_{n, 1}, β_{n, 2}, σ_{n}^{2})}^{⊤} = {(- 1, - 1, 2, 1)}^{⊤}

. If

(Z_{i}, U_{i}) = (1, c)

, we simulate Y via the conditional distribution

\begin{matrix} Y | X \sim N (β_{1 c, 0} + β_{1 c, 1} X_{1} + β_{1 c, 2} X_{2}, σ_{1 c}^{2}) \end{matrix}

where

{(β_{1 c, 0}, β_{1 c, 1}, β_{1 c, 2}, σ_{1 c}^{2})}^{⊤} = {(0.5, 1.5, 2, 1)}^{⊤}

, while if

(Z_{i}, U_{i}) = (0, c)

, we have

\begin{matrix} Y | X \sim N (β_{0 c, 0} + β_{0 c, 1} X_{1} + β_{0 c, 2} X_{2}, σ_{0 c}^{2}) \end{matrix}

where

{(β_{0 c, 0}, β_{0 c, 1}, β_{0 c, 2}, σ_{0 c}^{2})}^{⊤} = {(- 0.5, - 1.5, 2, 1)}^{⊤}

. The missing mechanism model in the group with

D = 1

is as follows:

\begin{matrix} P (R = 1 | Y, X) = {logit}^{- 1} {α_{1} Y + ϕ_{1, 0} + ϕ_{1, 1} X_{1}} \end{matrix}

where

{logit}^{- 1} (\cdot) = exp (\cdot) / {1 + exp (\cdot)}

and

{(α_{1}, ϕ_{1, 0}, ϕ_{1, 1})}^{⊤} = {(0.2, 1.5, 0.3)}^{⊤}

. Instead, if

D_{i} = 0

, we use

\begin{matrix} P (R = 1 | Y, X) = {logit}^{- 1} {α_{0} Y + ϕ_{0, 0} + ϕ_{0, 1} X_{1}} \end{matrix}

where

{(α_{0}, ϕ_{0, 0}, ϕ_{0, 1})}^{⊤} = {(0.2, 2, 0.2)}^{⊤}

. The missing rate for the outcome is

13.4 %

in group

D = 1

and

15.7 %

in group

D = 0

.
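The data-generating process described above can be sketched in Python as follows. The function name `simulate`, the seed handling, and the independent draws of Z and U are assumptions of this sketch; the complier proportion 0.4 is implied by ω_a = ω_n = 0.3.

```python
import numpy as np

def simulate(n=3000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.5, size=n)                            # P(Z = 1) = 0.5
    u = rng.choice(["a", "n", "c"], size=n, p=[0.3, 0.3, 0.4])  # compliance type
    x = rng.multivariate_normal([1.0, 0.0], [[1.0, 1.0], [1.0, 2.0]], size=n)
    # Treatment received: always-takers take it, never-takers do not,
    # compliers follow their assignment.
    d = np.where(u == "a", 1, np.where(u == "n", 0, z))
    # Regression coefficients (beta_0, beta_1, beta_2); all variances equal 1.
    beta = {"a": (1.0, 1.0, 2.0), "n": (-1.0, -1.0, 2.0),
            "1c": (0.5, 1.5, 2.0), "0c": (-0.5, -1.5, 2.0)}
    key = np.where(u == "c", np.where(z == 1, "1c", "0c"), u)
    b = np.array([beta[k] for k in key])
    y = b[:, 0] + b[:, 1] * x[:, 0] + b[:, 2] * x[:, 1] + rng.normal(size=n)
    # Response model: the intercept and slope depend on the treatment received.
    alpha = 0.2
    phi0 = np.where(d == 1, 1.5, 2.0)
    phi1 = np.where(d == 1, 0.3, 0.2)
    p_obs = 1.0 / (1.0 + np.exp(-(alpha * y + phi0 + phi1 * x[:, 0])))
    r = rng.binomial(1, p_obs)                                  # r = 1: Y observed
    return {"z": z, "u": u, "d": d, "x": x, "y": y, "r": r}

sim = simulate()
for dval in (1, 0):
    print(dval, 1.0 - sim["r"][sim["d"] == dval].mean())       # empirical missing rate
```

The printed missing rates vary with the seed and the sample size, so they will only approximate the rates quoted in the text.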

The Monte Carlo samples have a size of

n = 3000

. Table 1 and Table 2 report the empirical bias, standard deviation obtained by means of non-parametric bootstrap with

B = 1000

replications, and coverage of

95 %

Wald-type confidence intervals. The numerical results of Table 1 demonstrate a highly accurate estimation of

η

, the missing mechanism model parameters, the always-taker parameters, and the never-taker parameters, which indicates that the proposed method can accurately estimate the parameters of the causal model under non-ignorable missingness. The numerical results in Table 2 demonstrate an accurate estimation of the complier parameters and CACE components. However, there is some under-coverage in the Wald confidence intervals for

β_{1 c, 0}

and

β_{c, 0}

.
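The quantities reported in the tables (empirical bias, bootstrap standard deviation, and Wald coverage) can be computed along the following lines. This is a sketch with a stand-in estimator: the sample mean of normal data replaces the two-step likelihood estimator, and the function names, the number of replications, and the reduced number of bootstrap draws (200 instead of B = 1000) are assumptions chosen to keep the example fast.

```python
import numpy as np

def bootstrap_sd(data, estimator, n_boot=200, rng=None):
    """Non-parametric bootstrap standard deviation of estimator(data)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    stats = [estimator(data[rng.integers(0, n, size=n)]) for _ in range(n_boot)]
    return np.std(stats, ddof=1)

def wald_coverage(truth, estimator, sampler, n_rep=200, n_boot=200, seed=0):
    """Empirical bias and coverage of 95% Wald intervals over Monte Carlo samples."""
    rng = np.random.default_rng(seed)
    hits, errors = 0, []
    for _ in range(n_rep):
        data = sampler(rng)
        est = estimator(data)
        sd = bootstrap_sd(data, estimator, n_boot, rng)
        hits += int(est - 1.96 * sd <= truth <= est + 1.96 * sd)
        errors.append(est - truth)
    return float(np.mean(errors)), hits / n_rep

# Stand-in example: estimate the mean of N(1, 1) from samples of size 200.
bias, cover = wald_coverage(1.0, np.mean, lambda rng: rng.normal(1.0, 1.0, size=200))
print(bias, cover)
```

With only a few hundred replications, the empirical coverage itself carries a Monte Carlo error of roughly one to two percentage points, which is worth keeping in mind when reading coverage columns.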

Table 1. Empirical bias (Bias), bootstrap standard deviation (Std. dev.) and

95 %

Wald coverage probability (

95 %

Cover) of the estimators for

η

, the missing mechanism model parameters, the always-taker parameters, and the never-taker parameters.

Table 2. Empirical bias (Bias), bootstrap standard deviation (Std. dev.) and

95 %

Wald coverage probability (

95 %

Cover) of the estimators for the complier parameters and the CACE components.

The estimators for

η

exhibit substantially smaller bias and variance than those of the other parameters, which is attributed to the larger sample size used for their estimation. The estimators for the always-taker and never-taker parameters show similar bias and variance, owing to their comparable missing rates. The less accurate results for the complier parameters arise from several factors, including their indirect recovery from the moments of the normal mixture distribution, which we aim to improve in future work.

5. Real Data Analysis

The CHIP study monitors income distribution and economic factors among rural, rural-to-urban migrant, and urban households in China [31]. The purpose of this study is to investigate whether there is a significant change in the return rate of education when the male population transitions from rural to migrant status. The survey captures details such as employment status, education level, age, income, and other relevant information. Data collection involves systematic random sampling from various regions to ensure geographic representation. The survey includes data from cities and towns in fifteen provinces, representing different regions across the country. These provinces include Liaoning, Shanxi, Jiangsu, Shandong, Guangdong, Anhui, Henan, Sichuan, Hunan, Hubei, Gansu, Xinjiang, Yunnan, Beijing, and Chongqing, covering the north, eastern coastal areas, interior regions, and western regions of China.

Each respondent’s actual annual income serves as a proxy for the individual’s earnings E, utilized in the response of the Mincer earnings function [32]. We compute the number of years since leaving school as

age - years of schooling - 6

, as Chinese children typically start school at 7 years old. The years since the onset of labor market experience are calculated as

age - 16

, as Chinese individuals who have reached the age of 16 can legally participate in the labor market.

The rural population’s migration for work primarily depends on two factors: the economic development status of the region and the distance between their home and work. If the region’s economic development is favorable, migration may be less likely. Conversely, shorter distances between the home and work place increase the likelihood of migration. Hence, leveraging Chinese administrative division data, we assign

Z_{i} = 1

if the individual’s household head’s registered residence is Beijing, Shanxi, Guangdong, Hubei, Chongqing, or Liaoning, and

Z_{i} = 0

otherwise. The sample sizes of the observed data are shown in Table 3. In the 2013 wave of the study, the rural and migrant sub-samples accounted for 62.97% and 4.54% of the full dataset (which also includes urban individuals), respectively. However, in China in 2013, rural and migrant households made up 45.77% and 13.30% of all households, respectively [33]. To align the sample with this distribution, we adjust the sample weights. Specifically, each migrant respondent is weighted four times as much as a rural one, since

(62.97 / 4.54) / (45.77 / 13.30) \approx 4

.
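The weighting ratio can be checked with a few lines of Python. The dictionary and variable names are ours; the shares are the percentages quoted above.

```python
sample_share = {"rural": 62.97, "migrant": 4.54}       # percent of the CHIP 2013 sample
population_share = {"rural": 45.77, "migrant": 13.30}  # percent of households in 2013

# The weight per respondent is proportional to population share / sample share.
weight = {g: population_share[g] / sample_share[g] for g in sample_share}
relative_weight = weight["migrant"] / weight["rural"]
print(round(relative_weight, 2))  # 4.03
```

The ratio of the two weights equals (62.97 / 4.54) / (45.77 / 13.30), so rounding it to 4 reproduces the factor used for migrant respondents.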

Table 3. Sample sizes of different groups. The initials in brackets identify never-takers (n), always-takers (a), and compliers (c).

The Mincer earnings function is a single-equation model that explains wage income as a function of education and work experience, which can be expressed as follows:

\begin{matrix} log (1 + E) = β_{0} + β_{1} S + β_{2} Exper + β_{3} {Exper}^{2} + ε \end{matrix}

(19)

In the specification, E represents earnings (in CNY), S indicates years of schooling, and

Exper

stands for potential work experience.

ε

is an unobserved normal random error with mean 0 and variance

σ^{2} > 0

. Without considering the cost of schooling, the returns to education are given by

\partial log (1 + E) / \partial S = β_{1}

. The missing data mechanism is modeled using the following model:

\begin{matrix} P (R = 1 | E, S, Exper) = {logit}^{- 1} (α log (1 + E) \\ + ϕ_{0} + ϕ_{1} log S + ϕ_{2} log Exper + ϕ_{3} log {Exper}^{2}) \end{matrix}

(20)

Under Assumption 9 of Section 2, model (19) implies a normal distribution of the outcome

Y = log (1 + E)

:

\begin{matrix} p (y | x; θ_{z u}) = \frac{1}{\sqrt{2 π} σ_{z u}} exp [- \frac{{(y - β_{z u, 0} - β_{z u, 1} s - β_{z u, 2} exper - β_{z u, 3} exper^{2})}^{2}}{2 σ_{z u}^{2}}] \end{matrix}

(21)

where

x = {(s, exper)}^{⊤}

is the realization of the covariate vector

X = {(S, Exper)}^{⊤}

. Thus, the complier average causal effect which captures the causal effect of migrant work on the earnings of compliers is estimated by

\begin{matrix} CACE (s, exper; {\hat{β}}_{c}) = {\hat{β}}_{c, 0} + {\hat{β}}_{c, 1} s + {\hat{β}}_{c, 2} exper + {\hat{β}}_{c, 3} exper^{2} \end{matrix}

with

{\hat{β}}_{c} = {({\hat{β}}_{c, 0}, {\hat{β}}_{c, 1}, {\hat{β}}_{c, 2}, {\hat{β}}_{c, 3})}^{⊤} = {({\hat{β}}_{1 c, 0} - {\hat{β}}_{0 c, 0}, {\hat{β}}_{1 c, 1} - {\hat{β}}_{0 c, 1}, {\hat{β}}_{1 c, 2} - {\hat{β}}_{0 c, 2}, {\hat{β}}_{1 c, 3} - {\hat{β}}_{0 c, 3})}^{⊤}

. Specifically, the coefficient

{\hat{β}}_{c, 1}

gives the estimated causal effect of migrant work on the returns to education of a complier.
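As a sketch of how the variables entering the Mincer earnings function (19) can be assembled from raw survey columns, consider the following Python fragment. The function name `mincer_variables` and the argument names are hypothetical, and taking potential experience to be the number of years since leaving school (age − years of schooling − 6) is our reading of the definitions given earlier in this section.

```python
import numpy as np

def mincer_variables(income, age, schooling):
    """Return the response log(1 + E), the design matrix, and years since age 16."""
    income = np.asarray(income, dtype=float)
    age = np.asarray(age, dtype=float)
    schooling = np.asarray(schooling, dtype=float)
    y = np.log1p(income)                   # log(1 + E)
    exper = age - schooling - 6            # years since leaving school
    labor_years = age - 16                 # years since reaching legal working age
    design = np.column_stack([np.ones_like(y), schooling, exper, exper**2])
    return y, design, labor_years

# Two made-up respondents, for illustration only.
y, design, labor_years = mincer_variables(
    income=[30000.0, 52000.0], age=[30, 45], schooling=[9, 12]
)
print(y, design, labor_years, sep="\n")
```

The columns of `design` follow the order of the coefficients in (19): intercept, schooling, experience, and squared experience.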

The analysis of the 2013 CHIP data was conducted in R version 4.3.2; R was first released in August 1993 by the statisticians Ross Ihaka and Robert Gentleman of the University of Auckland, New Zealand. Applying the estimation methods outlined in Section 3 yields the results presented in Table 4 and Table 5. The results in Table 4 show that log-income has a significant positive effect on the probability that the outcome is observed, as evidenced by

α_{1} = 0.4130

and

α_{0} = 0.5284

. Table 5 illustrates that the estimated returns to education are

{\hat{β}}_{1 c, 1} = 2.81 %

for migrant compliers and

{\hat{β}}_{0 c, 1} = - 0.85 %

for rural compliers. The difference,

{\hat{β}}_{c, 1} = 3.66 %

, suggests that migrant work enhances the returns to education for compliers, although the 95% confidence interval for this difference in Table 5 includes zero. Additionally, based on the observation that

{\hat{β}}_{c, 0} > 0

, we infer that the initial wages of migrant compliers are higher than those of rural compliers, which suggests that migrant work also contributes to higher initial wages. In conclusion, the estimated returns to education vary among the target groups. However, it is noteworthy that the estimated returns to education of migrant workers and rural residents are consistently lower than those of urban residents, with an average difference of 8.4%, as documented by [34]. This finding supports the social and economic significance of migrant work from a human capital perspective, offering a basis for decision-making aimed at improving the conditions of Chinese rural labor.
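To make the use of the complier estimates concrete, the following Python sketch evaluates the estimated CACE at a given covariate value from the point estimates reported in Table 5. The coefficient differences are recomputed from the two complier rows of that table; the function name `cace` and the covariate values s = 9 and exper = 10 are hypothetical choices for illustration.

```python
import numpy as np

# Point estimates from Table 5, ordered as (intercept, S, Exper, Exper^2).
beta_1c = np.array([10.2684, 0.0281, 0.0149, -0.0005])   # (Z, U) = (1, c)
beta_0c = np.array([9.0075, -0.0085, 0.0310, -0.0012])   # (Z, U) = (0, c)
beta_c = beta_1c - beta_0c                                # CACE coefficients

def cace(s, exper):
    """Estimated CACE on the log(1 + E) scale at schooling s and experience exper."""
    return float(beta_c @ np.array([1.0, s, exper, exper**2]))

print(beta_c)
print(cace(9, 10))
```

The returned value is a point estimate only; its uncertainty would have to come from the bootstrap, since the individual confidence intervals in Table 5 do not determine the interval for a linear combination.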

Table 4. Parameter estimators, bootstrap standard deviation (Std. dev.), and empirical

95 %

Wald confidence intervals (

95 %

CI) for

η

, the missing mechanism model parameters, the always-taker parameters, and the never-taker parameters.

Table 5. Parameter estimators, bootstrap standard deviation (Std. dev.), and empirical

95 %

Wald confidence intervals (

95 %

CI) for the complier parameters and the CACE components.

6. Conclusions

In this study, the challenge of identifying and estimating complier average causal effect parameters under non-ignorable missingness is tackled by increasing covariates to mitigate the sensitivity to the violation of specific identification assumptions. The missing data mechanism is assumed to follow a logistic model, wherein the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates, which is less restrictive than the assumptions adopted in [2,3]. The identifiability of the models is established under mild conditions by assuming that the outcome follows a normal distribution. A computational method is developed to estimate model parameters through a two-step likelihood estimation approach, utilizing subgroup analysis.

Some studies in the literature discuss the consistency of parameter estimation in the presence of non-ignorable missing and non-compliant data. The authors of [23] demonstrated that parameter estimators are consistent under the assumption of a correct missing mechanism model. The authors of [26] obtained the consistency of parameters of interest even when confounders are missing not at random. The authors of [3] established the asymptotic results of the estimators when the missing outcome depends only on itself. Estimating the parameters of interest based on subgroup analysis poses significant challenges when the absence of the outcome is explained by the outcome itself, the treatment received, and the covariates. We plan to address this aspect in future research endeavors to enhance our methodologies.

There are many directions worthy of further research. A possible extension in this research area involves utilizing instrumental variables to transform the identifiability of the observation likelihood into the identifiability of the parameters of interest. Indeed, we propose a relaxation of the exclusion restriction based on likelihood analysis, resulting in a parametric model characterized by mixtures of distributions. Furthermore, we can adopt a semi-parametric model for theoretical modeling, incorporating more sophisticated structures into the missing mechanism models and regression models.

Author Contributions

J.D.: methodology, software, writing—original draft preparation; G.W.: data curation, software, validation; X.L.: methodology, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Social Science Foundation of Guangxi Province in China, grant number 23BTJ001.

Data Availability Statement

Data are contained within the article.

Acknowledgments

We thank the Editor, Associate Editor, and referees, as well as our financial sponsors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chen, H.; Geng, Z.; Zhou, X. Identifiability and estimation of causal effects in randomized trials with noncompliance and completely nonignorable missing data. Biometrics 2009, 65, 675–682.
2. Chen, H.; Ding, P.; Geng, Z.; Zhou, X. Semiparametric Inference of the Complier Average Causal Effect with Nonignorable Missing Outcomes. ACM Trans. Intell. Syst. Technol. (TIST) 2015, 7, 1–15.
3. Zheng, R. Causal Inference with Unmeasured Confounding from Nonignorable Missing Outcomes. arXiv 2023, arXiv:2305.07226.
4. Zhao, J.; Ma, Y. A versatile estimation procedure without estimating the nonignorable missingness mechanism. J. Am. Stat. Assoc. 2022, 117, 1916–1930.
5. Du, J.; Cui, X. Semiparametric estimation in generalized additive partial linear models with nonignorable nonresponse data. Stat. Pap. 2023, 1–25.
6. Imbens, G.W.; Rubin, D.B. Bayesian inference for causal effects in randomized experiments with noncompliance. Ann. Stat. 1997, 25, 305–327.
7. Zhou, J.; Hodges, J.S.; Suri, M.F.K.; Chu, H. A Bayesian hierarchical model estimating CACE in meta-analysis of randomized clinical trials with noncompliance. Biometrics 2019, 75, 978–987.
8. Park, S.; Kürüm, E. A two-stage joint modeling method for causal mediation analysis in the presence of treatment noncompliance. J. Causal Inference 2020, 8, 131–149.
9. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
10. Mercatanti, A. A Likelihood-Based Analysis for Relaxing the Exclusion Restriction in Randomized Experiments with Noncompliance. Aust. N. Z. J. Stat. 2013, 55, 129–153.
11. Abadie, A. Semiparametric instrumental variable estimation of treatment response models. J. Econom. 2003, 113, 231–263.
12. Wang, S.; Shao, J.; Kim, J.K. An instrumental variable approach for identification and estimation with nonignorable nonresponse. Stat. Sin. 2014, 24, 1097–1116.
13. Tchetgen Tchetgen, E.J.; Wirth, K.E. A general instrumental variable framework for regression analysis with outcome missing not at random. Biometrics 2017, 73, 1123–1131.
14. Wang, L.; Shao, J.; Fang, F. Propensity model selection with nonignorable nonresponse and instrument variable. Stat. Sin. 2021, 31, 647–672.
15. Morikawa, K.; Kim, J.K. Semiparametric optimal estimation with nonignorable nonresponse data. Ann. Stat. 2021, 49, 2991–3014.
16. Cui, X.; Guo, J.; Yang, G. On the identifiability and estimation of generalized linear models with parametric nonignorable missing data mechanism. Comput. Stat. Data Anal. 2017, 107, 64–80.
17. Du, J.; Li, Y.; Cui, X. Identification and Estimation of Generalized Additive Partial Linear Models with Nonignorable Missing Response. Commun. Math. Stat. 2024, 12, 113–156.
18. Ding, P.; Li, F. Causal inference: A missing data perspective. Stat. Sci. 2018, 33, 214–237.
19. Yang, S.; Wang, L.; Ding, P. Causal inference with confounders missing not at random. Biometrika 2019, 106, 875–888.
20. Frangakis, C.E.; Rubin, D.B. Principal stratification in causal inference. Biometrics 2002, 58, 21–29.
21. Taylor, L.; Zhou, X. Multiple imputation methods for treatment noncompliance and nonresponse in randomized clinical trials. Biometrics 2009, 65, 88–95.
22. Nguyen, T.Q.; Carlson, M.C.; Stuart, E.A. Identification of complier and noncomplier average causal effects in the presence of latent missing-at-random (LMAR) outcomes: A unifying view and choices of assumptions. arXiv 2023, arXiv:2312.11136.
23. Li, W.; Zhou, X. Identifiability and estimation of causal mediation effects with missing data. Stat. Med. 2017, 36, 3948–3965.
24. Sun, Z.; Liu, L. Semiparametric inference of causal effect with nonignorable missing confounders. Stat. Sin. 2021, 31, 1669–1688.
25. He, Y.; Zheng, L.; Luo, P. Treatment Benefit and Treatment Harm Rates with Nonignorable Missing Covariate, Endpoint, or Treatment. Mathematics 2023, 11, 4459.
26. Sun, J.; Fu, B. Identification and Estimation of Causal Effects with Confounders Missing Not at Random. arXiv 2023, arXiv:2303.05878.
27. Rubin, D.B. Randomization analysis of experimental data: The Fisher randomization test comment. J. Am. Stat. Assoc. 1980, 75, 591–593.
28. Ogburn, E.L.; Rotnitzky, A.; Robins, J.M. Doubly robust estimation of the local average treatment effect curve. J. R. Stat. Soc. Ser. B 2015, 77, 373–396.
29. Seaman, S.R.; White, I.R. Review of inverse probability weighting for dealing with missing data. Stat. Methods Med. Res. 2013, 22, 278–295.
30. Krosnick, J.A.; Holbrook, A.L.; Berent, M.K.; et al. The impact of “no opinion” response options on data quality: Non-attitude reduction or an invitation to satisfice? Public Opin. Q. 2002, 66, 371–403.
31. Sicular, T.; Li, S.; Yue, X.; Sato, H. Changing Trends in China’s Inequality: Evidence, Analysis, and Prospects; Oxford University Press: New York, NY, USA, 2020.
32. Mincer, J. Schooling, Experience and Earnings; Columbia University Press: New York, NY, USA, 1974.
33. Luo, C.; Li, S.; Yue, X. An analysis of changes in the extent of income disparity in China (2013–2018). Soc. Sci. China 2021, 1, 33–54.
34. Li, H.; Liu, P.W.; Zhang, J. Estimating returns to education using twins in urban China. J. Dev. Econ. 2012, 97, 494–504.

Table 1. Empirical bias (Bias), bootstrap standard deviation (Std. dev.) and

95 %

Wald coverage probability (

95 %

Cover) of the estimators for

η

, the missing mechanism model parameters, the always-taker parameters, and the never-taker parameters.

	Parameter	Bias	Std. dev.	$95 %$ Cover
$η$	$ξ$	0.0004	0.009	0.983
	$ω_{a}$	0.0001	0.012	0.915
	$ω_{n}$	0.0001	0.012	0.906
$D = 1$	$α_{1}$	−0.0009	0.074	0.972
	$ϕ_{1, 0}$	0.0115	0.205	0.978
	$ϕ_{1, 1}$	0.0231	0.270	0.992
$D = 0$	$α_{0}$	0.0035	0.073	0.957
	$ϕ_{0, 0}$	0.0329	0.311	0.991
	$ϕ_{0, 1}$	−0.0038	0.148	0.985
$U = a$	$β_{a, 0}$	0.0035	0.093	0.926
	$β_{a, 1}$	−0.0014	0.073	0.944
	$β_{a, 2}$	−0.0006	0.050	0.960
	$σ_{a}^{2}$	−0.0084	0.071	0.933
$U = n$	$β_{n, 0}$	−0.0020	0.089	0.937
	$β_{n, 1}$	0.0012	0.072	0.923
	$β_{n, 2}$	0.0003	0.054	0.952
	$σ_{n}^{2}$	−0.0047	0.074	0.918

Table 2. Empirical bias (Bias), bootstrap standard deviation (Std. dev.) and

95 %

Wald coverage probability (

95 %

Cover) of the estimators for the complier parameters and the CACE components.

	Parameter	Bias	Std. dev.	$95 %$ Cover
$(Z, U) = (1, c)$	$β_{1 c, 0}$	−0.059	0.258	0.947
	$β_{1 c, 1}$	−0.0026	0.139	0.881
	$β_{1 c, 2}$	0.0014	0.107	0.937
	$σ_{1 c}^{2}$	0.0025	0.074	0.987
$(Z, U) = (0, c)$	$β_{0 c, 0}$	0.0039	0.134	0.949
	$β_{0 c, 1}$	−0.0040	0.110	0.969
	$β_{0 c, 2}$	−0.0001	0.077	0.989
	$σ_{0 c}^{2}$	−0.0031	0.124	0.907
CACE	$β_{c, 0}$	−0.0065	0.193	0.883
	$β_{c, 1}$	0.0054	0.155	0.936
	$β_{c, 2}$	0.0025	0.106	0.991
	$σ_{c}^{2}$	−0.0559	0.166	0.908

Table 3. Sample sizes of different groups. The initials in brackets identify never-takers (n), always-takers (a), and compliers (c).

	$(Z, D) = (1, 1)$	$(Z, D) = (1, 0)$	$(Z, D) = (0, 1)$	$(Z, D) = (0, 0)$
$R = 1$	472 (a, c)	229 (n)	4052 (a)	5464 (c, n)
$R = 0$	37 (a, c)	13 (n)	1059 (a)	1857 (c, n)

Table 4. Parameter estimators, bootstrap standard deviation (Std. dev.), and empirical

95 %

Wald confidence intervals (

95 %

CI) for

η

, the missing mechanism model parameters, the always-taker parameters, and the never-taker parameters.

	Parameter	Estimator	Std. dev.	$95 %$ CI
$η$	$ξ$	0.1946	0.0031	[0.1886, 0.2006]
	$ω_{a}$	0.1169	0.0027	[0.1117, 0.1222]
	$ω_{n}$	0.7154	0.0060	[0.7036, 0.7271]
$D = 1$	$α_{1}$	0.4130	0.1521	[0.1150, 0.7111]
	$ϕ_{1, 0}$	−7.0218	1.5790	[−10.1167, −3.9269]
	$ϕ_{1, 1}$	0.3597	0.0627	[0.2369, 0.4826]
	$ϕ_{1, 2}$	0.3003	0.0391	[0.2238, 0.3769]
	$ϕ_{1, 3}$	−0.0060	0.0008	[−0.0076, −0.0045]
$D = 0$	$α_{0}$	0.5284	0.2365	[0.0647, 0.9920]
	$ϕ_{0, 0}$	−5.2175	2.2725	[−9.6715, −0.7635]
	$ϕ_{0, 1}$	0.2046	0.0707	[0.0659, 0.3432]
	$ϕ_{0, 2}$	0.0624	0.0335	[−0.0032, 0.1281]
	$ϕ_{0, 3}$	−0.0018	0.0006	[−0.0031, −0.0006]
$U = a$	$β_{a, 0}$	9.1899	0.1464	[8.9029, 9.4771]
	$β_{a, 1}$	0.0356	0.0084	[0.0191, 0.0521]
	$β_{a, 2}$	0.0894	0.0079	[0.0739, 0.1050]
	$β_{a, 3}$	−0.0020	0.0002	[−0.0023, −0.0016]
	$σ_{a}^{2}$	0.5331	0.0282	[0.4777, 0.5884]
$U = n$	$β_{n, 0}$	9.3902	0.1468	[9.1024, 9.6780]
	$β_{n, 1}$	0.0498	0.0101	[0.0300, 0.0696]
	$β_{n, 2}$	0.0393	0.0072	[0.0252, 0.0534]
	$β_{n, 3}$	−0.0009	0.0001	[−0.0012, −0.0006]
	$σ_{n}^{2}$	0.5513	0.0382	[0.4764, 0.6262]

Table 5. Parameter estimators, bootstrap standard deviation (Std. dev.), and empirical

95 %

Wald confidence intervals (

95 %

CI) for the complier parameters and the CACE components.

	Parameter	Estimator	Std. dev.	$95 %$ CI
$(Z, U) = (1, c)$	$β_{1 c, 0}$	10.2684	0.1981	[9.8800, 10.6567]
	$β_{1 c, 1}$	0.0281	0.0131	[0.0023, 0.0537]
	$β_{1 c, 2}$	0.0149	0.0132	[−0.0111, 0.0408]
	$β_{1 c, 3}$	−0.0005	0.0003	[−0.0011, 0.0001]
	$σ_{1 c}^{2}$	0.5972	0.0518	[0.4958, 0.6987]
$(Z, U) = (0, c)$	$β_{0 c, 0}$	9.0075	1.1509	[6.7518, 11.2633]
	$β_{0 c, 1}$	−0.0085	0.1009	[−0.2063, 0.1893]
	$β_{0 c, 2}$	0.0310	0.0472	[−0.0616, 0.1236]
	$β_{0 c, 3}$	−0.0012	0.0011	[−0.0033, 0.0009]
	$σ_{0 c}^{2}$	2.2993	1.0110	[0.3177, 4.2809]
CACE	$β_{c, 0}$	1.2609	1.1994	[−1.0899, 3.6116]
	$β_{c, 1}$	0.0366	0.1026	[−0.1646, 0.2378]
	$β_{c, 2}$	−0.0162	0.0490	[−0.1122, 0.0799]
	$β_{c, 3}$	0.0007	0.0011	[−0.0015, 0.0029]
	$σ_{c}^{2}$	2.2835	0.5075	[1.2888, 3.2782]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
