# Consequences of Model Misspecification for Maximum Likelihood Estimation with Missing Data


## Abstract


## 1. Introduction

**Missing Data Problem.** The missing data problem is prevalent throughout economics (Abrevaya and Donald 2017; Breunig 2019; Fomby and Hill 1998; McDonough and Millimet 2016; Miller 2010; Wooldridge 2004). Further, missing data is ubiquitous in other fields of science, engineering (Markovsky 2017), and machine learning (Leke and Marwala 2019). This includes clinical trials and health sciences analyses (e.g., Enders 2010; Little et al. 2012; Molenberghs and Kenward 2007; Zhou et al. 2014), survey data analysis (e.g., Gmel 2001; Troxel et al. 1998a), regression analysis (e.g., Graham et al. 1997; Greenland and Finkle 1995), verification bias (e.g., Harel and Zhou 2006; Kosinski and Barnhart 2003a, 2003b), hierarchical modeling (e.g., Agresti 2002, chp. 12), and mixed modeling (e.g., Verbeke and Lesaffre 1997). Moreover, latent variable models arising in factor analysis and structural equation modeling contexts (e.g., Arminger and Sobel 1990; Gallini 1983), hidden Markov chain models (e.g., McLachlan and Krishnan 1997; Visser 2011), mixed Markov field models (e.g., Fridman 2003) and hidden Markov random field models (HMRF) (e.g., Ryden and Titterington 1998) are interpretable as missing data models where the “hidden states” correspond to the missing data. Additionally, unsupervised and temporal reinforcement learning methods relevant for building sophisticated behavioral learning process models are naturally represented as partially observable Markov decision processes (e.g., Littman 2009).

**Model Misspecification.** The problems of estimation and inference in the presence of model misspecification are important for several reasons. First, model misspecification may be present in many, if not most, situations; and so robust methods that address the assumption of correct specification are necessary (White 1980, 1982, 1994; Golden 1995, 1996, 2000, 2003). While a correctly specified model is always desirable, in many fields such as econometrics, medicine, and psychology, some degree of model misspecification may be inevitable despite the researcher’s best efforts (e.g., White 1980, 1982, 1994). Thus, the development and application of robust methods (Golden et al. 2013, 2016; Henley et al. 2019) that address the challenges posed by model misspecification (e.g., White 1980, 1982, 1994) has been and continues to be an active area of research (e.g., see Fomby and Hill 2003; Hardin 2003, for relevant reviews). Second, situations arise where the Quasi-Maximum Likelihood Estimates (QMLE) converge to the true parameter value despite the presence of model misspecification. For example, the QMLE can be shown to be consistent to the true parameter value for both linear and nonlinear exponential family regression models even though only the conditional expectation of the response variable given the predictors (covariates) is correctly specified (e.g., Gourieroux et al. 1984; Royall 1986; Wedderburn 1974; Wei 1998; White 1994, Corollary 5.5, p. 67). Consistent parameter estimation of the true parameter values of the researcher’s model in the complete data case may also occur for misspecified models where: (i) heteroscedasticity is present (e.g., Verbeek 2008, sec. 6.3), (ii) the random effects distribution is misspecified in linear hierarchical models (e.g., Verbeke and Lesaffre 1997), or (iii) correlations among dependent observations are misspecified (e.g., Hosmer and Lemeshow 2000, pp. 315–17; Liang and Zeger 1986; Wall et al. 2005; Vittinghoff et al. 2012).
Third, in more complicated missing data situations, consistent estimation of the true parameter values is possible in linear structural equation models even though only the first two moments have been correctly specified (e.g., Arminger and Sobel 1990), and in longitudinal time-series modeling even though dependent observations are approximately modeled as independent (Parzen et al. 2006; Troxel et al. 1998b; Zhao et al. 1996).

#### 1.1. Maximum Likelihood Estimation for Models with Partially Observable Data

**Representing Partially Observable Data Generating Processes.** In the selection model framework for representing partially observable data generating processes (Rubin 1976; Little 1994; Molenberghs et al. 1998; Little and Rubin 2002), it is assumed that nature creates a complete-data record (observation) by sampling from the complete-data Data Generating Process (DGP). The complete-data record containing the observation’s values is then decimated by a pattern of missingness sampled from the missing-data mechanism, thus hiding those values in the complete-data record. Rubin (1976) defined three types of missing-data mechanisms. A missing-data mechanism is termed Missing At Random (MAR) when the probability distribution of the pattern of missingness is functionally dependent only upon the observed data. A special case of MAR, called Missing Completely at Random (MCAR), occurs when the probability distribution of the pattern of missingness is not functionally dependent on either observed or unobserved data. Missing data generating processes that are not MAR are termed Missing Not At Random (MNAR), also called NMAR. The probability distribution of the pattern of missingness for an MNAR missing-data mechanism is functionally dependent on unobservable data.
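The three mechanisms are easy to contrast in simulation. The following is a minimal sketch, assuming a two-variable record in which only the second component is subject to missingness and a logistic selection probability; both choices are illustrative and not part of the framework above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Complete-data record x = (x1, x2); only x2 is subject to missingness,
# so x1 plays the role of the always-observed data.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(size=n)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# h2 = True means "x2 is observable" (the indicator-record convention above).
h2_mcar = rng.random(n) > 0.3            # MCAR: missingness ignores all data
h2_mar = rng.random(n) > sigmoid(x1)     # MAR: depends only on observed x1
h2_mnar = rng.random(n) > sigmoid(x2)    # MNAR: depends on the hidden x2 itself
```

Under the MAR and MNAR rules, the records with a missing `x2` are systematically different from the fully observed ones, which is exactly what the MCAR rule excludes.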

**Overview of Parameter Estimation in the Presence of Partially Observable Data.** If the missing-data mechanism is MCAR, then maximum likelihood estimation can be utilized by first applying listwise deletion (Allison 2001; King et al. 2001), also known as complete-case analysis (Little and Rubin 2002), which involves simply removing data records (observations) containing missing values from the data set (also see Groenwold et al. 2012). The resulting dataset, containing no missing values, is then used for statistical modeling. A problem with using listwise deletion to handle an MCAR missing-data mechanism is that the standard errors of the parameter estimates for the researcher’s observable-data model may be larger because the information contained in records with missing values has been removed from the data set. However, a more serious issue with the listwise deletion method is that for MAR data the maximum likelihood estimates may be biased (e.g., Allison 2001; Ibrahim et al. 2005; King et al. 2001).
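The bias of complete-case analysis under MAR can be seen in a small simulation; a minimal sketch in which the particular MAR selection rule (x2 observed only when x1 is non-negative) is a hypothetical choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.normal(size=n)                  # always observed
x2 = 2.0 + x1 + rng.normal(size=n)       # E[x2] = 2 is the estimand
observed = x1 >= 0.0                     # MAR: missingness depends only on x1

# Listwise deletion keeps only fully observed records. Under this MAR rule,
# records with small x1 (hence small x2) are discarded, biasing the mean up.
cc_mean = x2[observed].mean()            # complete-case estimate
full_mean = x2.mean()                    # estimate with no missingness
```

Here the complete-case mean concentrates near 2 + sqrt(2/pi), not near the estimand 2, even as n grows; under MCAR the same deletion step would only inflate the standard error.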

#### 1.2. Prior Work on Misspecification in Missing Data Models

#### 1.3. A Framework for Understanding Misspecification in Missing Data Models

## 2. Assumptions

#### 2.1. Data Generating Process Assumptions

**Assumption 1.** **I.I.D. Partially Observable Data Generating Process.** Let $({X}_{i},{H}_{i})$, $i=1,2,\dots$ be a sequence of independent and identically distributed (i.i.d.) random vectors where $({X}_{i},{H}_{i})$ has a common Radon–Nikodým probability density ${p}_{x,h}:{\mathrm{R}}^{d}\times {\{0,1\}}^{d}\to \left[0,\infty \right)$ defined with respect to a sigma-finite measure ${\nu}_{x,h}$.

In a regression modeling application, the first element of ${x}_{i}$ (a realization of ${X}_{i}$) is a value of the outcome variable for a regression model associated with the ith data record, while the remaining elements of ${x}_{i}$ are values for the predictor variables associated with the ith data record, i = 1,…,n. The ith observed data indicator record ${h}_{i}$ (a realization of ${H}_{i}$) is a d-dimensional binary vector defined so that its jth element is 1 if the jth element of ${x}_{i}$ is observable and 0 otherwise, i = 1,…,n. Let ${x}^{n}\equiv \left[{x}_{1},\dots ,{x}_{n}\right]$, ${x}^{n}\in {\mathrm{R}}^{dn}$. Let ${h}^{n}\equiv \left[{h}_{1},\dots ,{h}_{n}\right]$, ${h}^{n}\in {\mathrm{L}}^{dn}$. Given the full partially observable record $(x,h)$, let the number of observable elements of $x$ be given by $\rho (h)\equiv {\left({1}_{d}\right)}^{T}h$ for all $h\in {\mathrm{L}}^{d}$, where $\rho :{\mathrm{L}}^{d}\to \{0,1,2,\dots ,d\}$ and ${1}_{d}$ denotes a d-dimensional column vector of ones. For convenience, let ${\rho}_{h}\equiv \rho \left(h\right)$. Also define the observable-data selection matrix $s(h)$ generated by $h$ as a matrix with ${\rho}_{h}$ rows and d columns such that the kth element of the jth row of $s(h)$ is equal to 1 if the jth non-zero element in $h$ is the kth element in $h$; and set the jkth element of $s(h)$ equal to 0 otherwise. Let ${y}_{h}:{\mathrm{R}}^{d}\to {\mathrm{R}}^{{\rho}_{h}}$ be defined such that ${y}_{h}\left(x\right)=s\left(h\right)x$ for $h\in {\mathrm{L}}^{d}\backslash {0}_{d}$. The ith observable-data record component ${y}_{i}\equiv {y}_{{h}_{i}}\left({x}_{i}\right)$ is thus a ${\rho}_{{h}_{i}}$-dimensional column vector generated from the realization of the full partially observable record $({X}_{i},{H}_{i})$, i = 1,…,n. Let ${Y}_{i}\equiv {y}_{{H}_{i}}\left({X}_{i}\right)$, i = 1,…,n. The sequence of random variables ${Y}_{1},\dots ,{Y}_{n}$ is i.i.d. because $\left({X}_{1},{H}_{1}\right),\dots ,\left({X}_{n},{H}_{n}\right)$ are i.i.d. by Assumption 1. Let ${y}^{n}\equiv \left\{{y}_{1},\dots ,{y}_{n}\right\}$ and ${h}^{n}\equiv \left\{{h}_{1},\dots ,{h}_{n}\right\}$ be realizations of ${Y}^{n}\equiv \left\{{Y}_{1},\dots ,{Y}_{n}\right\}$ and ${H}^{n}\equiv \left\{{H}_{1},\dots ,{H}_{n}\right\}$ respectively. Let $\left({y}^{n},{h}^{n}\right)$ denote the observed data sample.

Similarly, define the unobservable-data selection matrix $\overline{s}(h)$ generated by $h$ as a matrix with $d-{\rho}_{h}$ rows and d columns such that the kth element of the jth row of $\overline{s}(h)$ is equal to 1 if the jth zero element in $h$ is the kth element in $h$; and set the jkth element of $\overline{s}(h)$ equal to 0 otherwise. Let ${z}_{h}:{\mathrm{R}}^{d}\to {\mathrm{R}}^{d-{\rho}_{h}}$ be defined such that ${z}_{h}\left(x\right)=\overline{s}\left(h\right)x$ for $h\in {\mathrm{L}}^{d}\backslash {1}_{d}$. Thus, the $\left(d-{\rho}_{{h}_{i}}\right)$-dimensional column vector ${z}_{i}\equiv {z}_{{h}_{i}}\left({x}_{i}\right)$ contains the unobservable components associated with the ith observed data record, i = 1,…,n. Let ${Z}_{i}\equiv {z}_{{H}_{i}}\left({X}_{i}\right)$, i = 1,…,n.

The Radon–Nikodým density ${p}_{x,h}$ in Assumption 1 is applicable not only for situations where the complete data vector ${X}_{i}$ consists of discrete or absolutely continuous random variables, but also for situations where the complete data vector ${X}_{i}$ includes both discrete and absolutely continuous random variables. In fact, the Radon–Nikodým density ${p}_{x,h}$ is also applicable to situations where the elements of ${X}_{i}$ are constructed from combinations of both discrete and absolutely continuous random variables. In the special case where ${X}_{i}$ is a vector consisting of only discrete random variables, ${p}_{x,h}$ may be interpreted as a probability mass function.

Given the joint density ${p}_{x,h}$, define ${p}_{x}(\cdot )\equiv {\displaystyle \int {p}_{x,h}(\cdot ,h)d{\nu}_{h}(h)}$ and ${p}_{h}(\cdot )\equiv {\displaystyle \int {p}_{x,h}(x,\cdot )d{\nu}_{x}(x)}$. Let the observable-data density ${p}_{{y}_{h}}\left({y}_{h}\left(x\right)\right)\equiv {\displaystyle \int {p}_{x}}\left(x\right)d{\nu}_{{z}_{h}}\left({z}_{h}\left(x\right)\right)$, which can be rewritten using a more implicit compact notation as ${p}_{{y}_{h}}\left({y}_{h}\right)\equiv {\displaystyle \int {p}_{x}}\left(x\right)d{\nu}_{{z}_{h}}\left({z}_{h}\right)$. The density ${p}_{{y}_{h}}:{\mathrm{R}}^{{\rho}_{h}}\to \left[0,\infty \right)$ specifies the conditional probability distribution of the random vector ${y}_{h}\left(X\right)=s(h)X$ given a particular observed data indicator record $h$. The observable-data density ${p}_{{y}_{h},h}\left({y}_{h}\left(x\right),h\right)\equiv {\displaystyle \int {p}_{x,h}}\left(x,h\right)d{\nu}_{{z}_{h}}\left({z}_{h}\left(x\right)\right)$ specifies the joint probability distribution of the observed data record $\left(Y,H\right)$ that includes the pattern of missingness $H$ as well as the observable data component $Y$.

#### 2.2. Probability Model Assumptions


**Assumption 2.** **Parametric Densities.** (i) Let $\Theta$ be a compact and non-empty subset of ${\mathrm{R}}^{r}$, $r\in \mathbb{N}$. (ii) Let $f:{\mathrm{R}}^{d}\times \Theta \to [0,\infty )$. For each $\theta$ in $\Theta$, $f(\cdot ;\theta )$ is a density with respect to ${\nu}_{x}$ and, for each $x\in \mathrm{supp}\hspace{0.17em}X$, $f(x;\cdot )$ is continuous on $\Theta$. (iii) $f(x;\cdot )$ is continuously differentiable on $\Theta$ for each $x\in \mathrm{supp}\hspace{0.17em}X$. (iv) $f(x;\cdot )$ is twice continuously differentiable on $\Theta$ for each $x\in \mathrm{supp}\hspace{0.17em}X$.

Each density $f\left(\cdot ;\theta \right)$ is indexed by a parameter vector $\theta$ in the parameter space $\Theta$. A set of complete-data densities indexed by the parameter vector $\theta$ specifies the researcher’s complete-data model: ${\mathrm{M}}_{c}\equiv \left\{f\left(x;\theta \right):\theta \in \Theta \right\}$.

**Assumption 3.** **Ignorable Missing-Data Mechanism.** Let ${q}_{h|x}:{\mathrm{L}}^{d}\times {\mathrm{R}}^{d}\to \left[0,\infty \right)$ be a measurable function. (i) For each $x\in \mathrm{supp}\hspace{0.17em}X$, ${q}_{h|x}\left(\cdot |x\right)$ is a density with respect to ${\nu}_{h}$. (ii) ${q}_{h|x}$ is MAR.

The researcher’s observable-data model ${\mathrm{M}}_{o}$ is specified by the complete-data density $f\left(\cdot ;\theta \right)$ for each $\theta$ in $\Theta$ together with the approximating missing-data mechanism ${q}_{h|x}$. In many practical missing data applications, it is common practice to only implicitly specify the missing-data probability model, since the researcher explicitly provides only the complete-data density $f\left(\cdot ;\theta \right)$ and implicitly assumes an ignorable missing-data mechanism.

**Definition 1.** **Misspecified Model.** (i) The complete-data model ${\mathrm{M}}_{c}$ is called a correctly specified complete-data model if the complete-data DGP density satisfies ${p}_{x}\in {\mathrm{M}}_{c}$ ${\upsilon}_{x}$-a.e.; otherwise ${\mathrm{M}}_{c}$ is a misspecified complete-data model. (ii) The observable-data model ${\mathrm{M}}_{o}$ is called a correctly specified observable-data model if the observable-data DGP density satisfies ${p}_{{y}_{h}}\in {\mathrm{M}}_{o}$ ${\upsilon}_{{y}_{h}}$-a.e. for all $h\in \mathrm{H}$; otherwise ${\mathrm{M}}_{o}$ is a misspecified observable-data model.

In a regression modeling application, the complete-data record $x$ is commonly partitioned such that $x=\left[R,u\right]$ where $R$ is the regression model response variable and $u$ is the vector of predictor variables for the regression model. The complete-data probability model is specified by $f$. Typically, $f\left(x;\theta \right)$ is factored such that $f\left(x;\theta \right)={f}_{R|u}\left(R|u;{\theta}_{R|u}\right){f}_{u}\left(u;{\theta}_{u}\right)$ where $\theta =\left[{\theta}_{R|u},{\theta}_{u}\right]\in {\Theta}_{R|u}\times {\Theta}_{u}$. Thus, misspecification of the researcher’s complete-data probability model in a regression modeling application may be due to misspecification of either (or both) the regression model and the conditional missing predictor variable model. In practice, the researcher’s conditional missing predictor model is specified by densities of the form ${f}_{{u}_{miss}|{u}_{obs}}\equiv {f}_{u}/{f}_{{u}_{obs}}$ where ${f}_{{u}_{obs}}$ is the marginal distribution for the predictors that are fully observable according to the researcher’s missing-data probability model. Additional discussion of conditional missing predictor models may be found in Chen (2004) and Ibrahim et al. (1999).
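For instance, for a hypothetical bivariate Gaussian complete-data model with $u\sim N(0,1)$ and $R|u\sim N(\beta u,{\sigma}^{2})$, the factored form ${f}_{R|u}\cdot {f}_{u}$ reproduces the joint density of $(R,u)$ exactly; a minimal sketch in which the particular densities and parameter values are illustrative:

```python
import numpy as np

def normal_pdf(v, mean, var):
    """Univariate normal density."""
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical complete-data model for x = [R, u]:
# u ~ N(0, 1) and R | u ~ N(beta * u, sigma2), theta = (beta, sigma2).
beta, sigma2 = 1.5, 0.25
u, R = 0.7, 1.2

# Factored complete-data density f(x; theta) = f_{R|u}(R|u) * f_u(u).
f_joint = normal_pdf(R, beta * u, sigma2) * normal_pdf(u, 0.0, 1.0)
```

The same value is obtained by evaluating the joint Gaussian density of $(R,u)$ with covariance matrix $\left[\begin{smallmatrix}{\beta}^{2}+{\sigma}^{2} & \beta \\ \beta & 1\end{smallmatrix}\right]$ directly.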

#### 2.3. Likelihood Functions, Pseudo-True Parameter Values, and True Parameter Values

**Definition 2.** **Complete-Data Likelihood Function.** Assume Assumptions 1, 2(i), and 2(ii) hold. Given a data sample ${x}^{n}$, the complete-data likelihood function ${L}_{n}^{x}:\Theta \times {\mathrm{R}}^{dn}\to \left[0,\infty \right)$ is defined such that ${L}_{n}^{x}(\theta ;{x}^{n})={\displaystyle \prod _{i=1}^{n}f\left({x}_{i};\theta \right)}$ for all $\theta \in \Theta$. The complete-data negative average log-likelihood ${\overline{l}}_{n}^{x}:\Theta \times {\mathrm{R}}^{dn}\to \left[0,\infty \right)$ is defined such that ${\overline{l}}_{n}^{x}\left(\theta ;{x}^{n}\right)=-{n}^{-1}\mathrm{log}{L}_{n}^{x}(\theta ;{x}^{n})$ for all $\theta \in \Theta$. The complete-data expected negative average log-likelihood ${l}^{x}:\Theta \to \left[0,\infty \right)$ is defined (when it exists) such that ${l}^{x}\left(\theta \right)=-{\displaystyle \int {p}_{x}\left(x\right)}\mathrm{log}\left(f\left(x;\theta \right)\right)d{\nu}_{x}\left(x\right)$. The complete-data Kullback–Leibler Information Criterion (KLIC) ${\ddot{l}}^{x}:\Theta \to \left[0,\infty \right)$ is defined (when it exists) such that ${\ddot{l}}^{x}\left(\theta \right)={l}^{x}\left(\theta \right)+{\displaystyle \int {p}_{x}\left(x\right)}\mathrm{log}\left({p}_{x}\left(x\right)\right)d{\nu}_{x}\left(x\right)$.
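As a concrete instance of Definition 2, the following is a minimal sketch for a univariate Gaussian complete-data model $f(x;\theta )$ with $\theta =(\mu ,\sigma )$; the Gaussian family is an illustrative assumption, not part of the definition:

```python
import numpy as np

def neg_avg_loglik(theta, xs):
    """Complete-data negative average log-likelihood for a Gaussian model
    f(x; theta) with theta = (mu, sigma)."""
    mu, sigma = theta
    logf = -0.5 * np.log(2 * np.pi * sigma ** 2) - (xs - mu) ** 2 / (2 * sigma ** 2)
    return -logf.mean()

rng = np.random.default_rng(2)
xs = rng.normal(loc=1.0, scale=2.0, size=50_000)

# The Gaussian QMLE (sample mean, uncorrected sample std) is the exact
# minimizer of the negative average log-likelihood over (mu, sigma).
qmle = (xs.mean(), xs.std())
```

Minimizing this criterion over $\theta$ is the estimation principle used throughout: the global minimizer is the quasi-maximum likelihood estimator of Definition 3.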

**Definition 3.** **Complete-Data True Parameter Value.** Assume that Assumptions 1, 2(i), and 2(ii) hold. A global minimizer of the complete-data negative average log-likelihood function ${\overline{l}}_{n}^{x}\left(\cdot ;{X}^{n}\right):\Theta \to \left[0,\infty \right)$ on the parameter space $\Theta$ is called a complete-data quasi-maximum likelihood estimator. A global minimizer of the complete-data expected negative average log-likelihood ${l}^{x}:\Theta \to \left[0,\infty \right)$ is called a complete-data pseudo-true parameter value ${\theta}_{x}^{*}$. A parameter value ${\theta}_{0}\in \Theta$ such that $f\left(x;{\theta}_{0}\right)={p}_{x}\left(x\right)$ for all $x\in \mathrm{supp}\hspace{0.17em}X$ is called a complete-data true parameter value.

When the researcher assumes an ignorable missing-data mechanism but the data generating process is MNAR, estimation of $\theta$ is quasi-maximum likelihood estimation (e.g., White 1982, 1994). These remarks thus motivate the following definition of the observable-data likelihood function (e.g., Schafer 1997, pp. 11–12; Little and Rubin 2002, p. 119) that is central to the objectives of this article.

**Definition 4.** **Observable-Data Likelihood Function.** Assume that Assumptions 1, 2(i), 2(ii), and 3 hold. Let ${\mathrm{Y}}^{n}\equiv \underset{i=1}{\overset{n}{\times}}{\mathrm{R}}^{{\rho}_{{h}_{i}}}$. Given an observable data sample $\left({y}^{n},{h}^{n}\right)$, the observable-data likelihood function ${L}_{n}^{y}:\Theta \times {\mathrm{Y}}^{n}\times {\mathrm{R}}^{dn}\to \left[0,\infty \right)$ is defined such that ${L}_{n}^{y}(\theta ;{y}^{n},{h}^{n})={\displaystyle \prod _{i=1}^{n}{q}_{{y}_{{h}_{i}}}\left({y}_{i};\theta \right)}$ where ${q}_{{y}_{h}}\left({y}_{h}\left(x\right);\theta \right)={\displaystyle \int f\left(x;\theta \right)}d{\nu}_{{z}_{h}}\left({z}_{h}\left(x\right)\right)$ for all $\theta \in \Theta$. The observable-data negative average log-likelihood ${\overline{l}}_{n}:\Theta \times {\mathrm{Y}}^{n}\times {\mathrm{R}}^{dn}\to \left[0,\infty \right)$ is defined such that ${\overline{l}}_{n}\left(\theta ;{y}^{n},{h}^{n}\right)=-{n}^{-1}\mathrm{log}{L}_{n}^{y}(\theta ;{y}^{n},{h}^{n})$ for all $\theta \in \Theta$. The observable-data expected negative log-likelihood $l:\Theta \to \left[0,\infty \right)$ is defined (when it exists) such that $l\left(\theta \right)=-{\displaystyle \int {p}_{{y}_{h},h}\left({y}_{h},h\right)}\mathrm{log}\left({q}_{{y}_{h}}\left({y}_{h};\theta \right)\right)d{\nu}_{{y}_{h},h}\left({y}_{h},h\right)$.
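When $f(\cdot ;\theta )$ is multivariate Gaussian, the integral defining ${q}_{{y}_{h}}$ is available in closed form: integrating out ${z}_{h}(x)$ simply selects the observed subvector of the mean and submatrix of the covariance. A minimal sketch of this special case (the Gaussian model is an illustrative assumption):

```python
import numpy as np

def mvn_pdf(v, mean, cov):
    """Multivariate normal density, written out with basic linear algebra."""
    diff = np.asarray(v, dtype=float) - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt(((2 * np.pi) ** mean.size) * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def observable_density(y, h, mu, Sigma):
    """q_{y_h}(y; theta) for a Gaussian complete-data model: marginalizing a
    Gaussian over the unobserved coordinates keeps the observed subvector of
    the mean and the corresponding submatrix of the covariance."""
    obs = np.flatnonzero(h)
    return mvn_pdf(y, mu[obs], Sigma[np.ix_(obs, obs)])

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
h = np.array([1, 0, 1])     # the second coordinate of x is unobserved
q = observable_density([0.2, 1.9], h, mu, Sigma)
```

Outside such conjugate cases the integral over ${z}_{h}(x)$ generally requires numerical or Monte Carlo integration, which is why the observable-data likelihood is typically harder to work with than the complete-data likelihood.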

**Definition 5.** **Observable-Data True Parameter Value.** Assume that Assumptions 1, 2(i), and 2(ii) hold. A global minimizer of the observable-data negative average log-likelihood function ${\overline{l}}_{n}$ on the parameter space $\Theta$ is called an observable-data quasi-maximum likelihood estimator. A global minimizer of the observable-data expected negative log-likelihood $l$ is called an observable-data pseudo-true parameter value ${\theta}^{*}$. A parameter value ${\theta}_{0}^{*}\in \Theta$ such that ${q}_{{y}_{h}}\left({y}_{h};{\theta}_{0}^{*}\right)={p}_{{y}_{h}}({y}_{h})$ for all ${y}_{h}\in \mathrm{supp}\hspace{0.17em}{Y}_{h}$ and for each $h\in \mathrm{H}$ is called an observable-data true parameter value.

#### 2.4. Moment Assumptions

**Assumption 4.** **Domination Conditions.** For each $h\in \mathrm{H}\cup \left\{{1}_{d}\right\}$:

- (i)
  - (a) $\mathrm{log}\hspace{0.17em}{q}_{{y}_{h}}$ is dominated on $\Theta$ with respect to ${p}_{{y}_{h}}$;
  - (b) each element of $\nabla \mathrm{log}\hspace{0.17em}{q}_{{y}_{h}}$ is dominated on $\Theta$ with respect to ${p}_{{y}_{h}}$;
  - (c) ${\Vert \nabla \mathrm{log}\hspace{0.17em}{q}_{{y}_{h}}\Vert}^{2}$ is dominated on $\Theta$ with respect to ${p}_{{y}_{h}}$;
  - (d) each element of ${\nabla}^{2}\mathrm{log}\hspace{0.17em}{q}_{{y}_{h}}$ is dominated on $\Theta$ with respect to ${p}_{{y}_{h}}$; and
- (ii) there exists a finite positive number $K$ such that for all $x\in \mathrm{supp}\hspace{0.17em}X$ and for all $\theta \in \Theta$: $f\left(x;\theta \right)\le K{p}_{x}\left(x\right)$.

#### 2.5. Solution Assumptions

**Assumption 5.** **Uniqueness.** (i) For some ${\theta}^{*}\in \Theta$, $l$ has a unique minimum at ${\theta}^{*}$. (ii) ${\theta}^{*}$ is interior to $\Theta$.

**Assumption 6.** **Positive Definiteness.** (i) ${A}^{*}$ is positive definite. (ii) ${B}^{*}$ is positive definite.

## 3. Theorems

#### 3.1. Quasi-Maximum Likelihood Estimation for Possibly Misspecified Missing Data Models

**Proposition 1.** **Missing-Data Average Negative Log-Likelihood Function and Gradient Estimation.** Assume that Assumptions 1, 2(i), 2(ii), 2(iii), 4(i)(a), and 5 hold. Then as $n\to \infty$, ${\overline{l}}_{n}\left(\cdot ;{Y}^{n},{H}^{n}\right)\to l$ and ${\overline{g}}_{n}\left(\cdot ;{Y}^{n},{H}^{n}\right)\to g$ uniformly on $\Theta$ with probability one. In addition, $l$ and $g$ are continuous on $\Theta$.

**Theorem 1.** **Estimator Consistency.** Assume that Assumptions 1, 2(i), 2(ii), 4(i)(a), and 5 hold. Then as $n\to \infty$, ${\widehat{\theta}}_{n}\to {\theta}^{*}$ with probability one.

#### 3.2. QMLE Asymptotic Distribution for Possibly Misspecified Missing Data Models

**Theorem 2.** **Asymptotic Distribution of Quasi-Maximum Likelihood Estimates.** Assume that Assumptions 1, 2, 4, 5, and 6 hold. (i) As $n\to \infty$, $\sqrt{n}\left({\widehat{\theta}}_{n}-{\theta}^{*}\right)$ converges in distribution to a zero-mean Gaussian random vector with non-singular covariance matrix ${C}^{*}\equiv {\left({A}^{*}\right)}^{-1}{B}^{*}{\left({A}^{*}\right)}^{-1}$. (ii) If, in addition, the observable-data probability model ${\mathrm{M}}_{o}$ is correctly specified, then ${A}^{*}={B}^{*}$.
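The sandwich covariance ${C}^{*}$ can be estimated by plugging sample analogues of ${A}^{*}$ and ${B}^{*}$ into ${\left({A}^{*}\right)}^{-1}{B}^{*}{\left({A}^{*}\right)}^{-1}$. A minimal scalar sketch: the working model below is a unit-variance Gaussian for the mean, deliberately misspecified because the hypothetical DGP variance is 4, so ${A}^{*}\ne {B}^{*}$ and only the sandwich matches the actual sampling variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical DGP: N(0, 4). Working model (misspecified): N(theta, 1).
xs = rng.normal(loc=0.0, scale=2.0, size=n)
theta_hat = xs.mean()                 # QMLE of theta under the working model

# Per-observation negative log-likelihood: 0.5 * (x - theta)**2 + const.
grads = -(xs - theta_hat)             # first derivative in theta
A_hat = 1.0                           # second derivative (constant here)
B_hat = np.mean(grads ** 2)           # OPG estimator of B*
C_hat = B_hat / A_hat ** 2            # scalar sandwich (A*)^{-1} B* (A*)^{-1}

# C_hat is near 4, the actual asymptotic variance of sqrt(n)*(theta_hat - 0),
# while the Hessian-only estimate 1/A_hat = 1 understates it fourfold.
```

Under correct specification the three estimates would agree asymptotically, which is the content of Theorem 2(ii).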

#### 3.3. Validity of Missing Information Principles When Model Misspecification Is Present

**Proposition 2.**

**Theorem 3.**

#### 3.4. Detection of Model Misspecification in the Presence of Missing Data

#### 3.5. Estimating the Fraction of Missing Information with Possible Model Misspecification

**Definition 6.** **Fraction of Information Loss Functions.** (i) The Hessian fraction of information loss function ${\xi}_{A}:\Theta \to \mathrm{R}$ is defined such that for all $\theta \in \Theta$: ${\xi}_{A}\left(\theta \right)={\lambda}_{\mathrm{max}}\left[{\left(\tilde{A}\left(\theta \right)\right)}^{-1}\overset{\frown}{A}\left(\theta \right)\right]$ when ${\left(\tilde{A}\left(\theta \right)\right)}^{-1}$ exists. The quantity ${\xi}_{A}^{*}\equiv {\xi}_{A}\left({\theta}^{*}\right)$ is called the Hessian fraction of information loss. (ii) The OPG fraction of information loss function ${\xi}_{B}:\Theta \to \mathrm{R}$ is defined such that for all $\theta \in \Theta$: ${\xi}_{B}\left(\theta \right)={\lambda}_{\mathrm{max}}\left[{\left(\tilde{B}\left(\theta \right)\right)}^{-1}\overset{\frown}{B}\left(\theta \right)\right]$ when ${\left(\tilde{B}\left(\theta \right)\right)}^{-1}$ exists. The quantity ${\xi}_{B}^{*}\equiv {\xi}_{B}\left({\theta}^{*}\right)$ is called the OPG fraction of information loss. (iii) The robust fraction of information loss function ${\xi}_{C}:\Theta \to \mathrm{R}$ is defined such that for all $\theta \in \Theta$: ${\xi}_{C}\left(\theta \right)={\lambda}_{\mathrm{max}}\left[\tilde{C}\left(\theta \right)\overset{\frown}{C}\left(\theta \right)\right]$ where $\overset{\frown}{C}\left(\theta \right)\equiv {\tilde{C}}^{-1}\left(\theta \right)-{\overline{C}}^{-1}\left(\theta \right)$ when $\overset{\frown}{C}\left(\theta \right)$ exists. The quantity ${\xi}_{C}^{*}\equiv {\xi}_{C}\left({\theta}^{*}\right)$ is called the robust fraction of information loss.
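Computing a fraction of information loss reduces to a largest-eigenvalue calculation. The following is a minimal numeric sketch of Definition 6(i) in which $\tilde{A}$ and $\overset{\frown}{A}$ are hypothetical 2×2 matrices standing in for the complete-data and missing-data information terms:

```python
import numpy as np

def hessian_fraction_of_info_loss(A_complete, A_missing):
    """xi_A = lambda_max[(A_complete)^{-1} A_missing], per Definition 6(i)."""
    M = np.linalg.solve(A_complete, A_missing)
    return float(np.max(np.linalg.eigvals(M).real))

# Hypothetical matrices chosen purely for illustration.
A_tilde = np.array([[4.0, 1.0], [1.0, 3.0]])   # complete-data term
A_frown = np.array([[1.0, 0.2], [0.2, 0.5]])   # missing-data term
xi_A = hessian_fraction_of_info_loss(A_tilde, A_frown)
# Here xi_A < 1, the regime relevant to the convexity condition of Theorem 4.
```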

**Theorem 4.**

- (i) Let ${\theta}^{\dagger}$ be a point in the interior of $\Theta$. Assume that $\tilde{A}\left({\theta}^{\dagger}\right)$ is positive definite. Both ${\xi}_{A}\left({\theta}^{\dagger}\right)\le 1$ and ${\ddot{\xi}}_{A}\left({\theta}^{\dagger}\right)\le 1$ if and only if there exists a non-empty open convex subset $\Gamma$ of $\Theta$ which contains ${\theta}^{\dagger}$ such that $l$ is convex on $\Gamma$. In addition, the range of ${\xi}_{A}$ and ${\ddot{\xi}}_{A}$ on $\Gamma$ is the set of non-negative real numbers.
- (ii) Assume that $\tilde{A}$ is positive definite on a non-empty open convex subset $\Gamma$ of $\Theta$. Both ${\xi}_{A}\left(\theta \right)\le 1$ and ${\ddot{\xi}}_{A}\left(\theta \right)\le 1$ for all $\theta \in \Gamma$ if and only if $l$ is convex on $\Gamma$. In addition, the range of ${\xi}_{A}$ and ${\ddot{\xi}}_{A}$ on $\Gamma$ is the set of non-negative real numbers.

**Proposition 3.** **Identifiability.** Assume that Assumptions 1, 2(i), 2(ii), and 4(i)(a) hold. Let $\Gamma$ be a non-empty open subset of the parameter space $\Theta$. Assume that the observable-data expected negative log-likelihood $l$ is a convex function on $\Gamma$. Let ${\theta}^{*}\in \Gamma$ be a strict local minimizer of $l$. Then the following assertions hold.

- (i) The minimizer ${\theta}^{*}$ is the unique global minimizer of $l$ on $\Gamma$.
- (ii) If the missing-data mechanism ${p}_{h|x}$ is MAR and the observable-data model is correctly specified on $\Gamma$, then the unique global minimizer ${\theta}^{*}$ is the unique observable-data true parameter value for $l$ on $\Gamma$.
- (iii) If the missing-data mechanism ${p}_{h|x}$ is MAR and the complete-data model is correctly specified on $\Gamma$, then the unique global minimizer ${\theta}^{*}$ is both the observable-data true and complete-data true parameter value for $l$ on $\Gamma$.

## 4. Summary and Conclusions

**Estimation.** Theorem 1 establishes that the quasi-maximum likelihood estimator (QMLE) is a consistent estimator of the pseudo-true parameter values of the observable-data model with an assumed ignorable missing-data mechanism (MCAR, MAR). Further, QMLEs are shown to be consistent estimators of the observable-data model pseudo-true parameter values even in the presence of an MNAR missing-data mechanism. Our framework not only characterizes the asymptotic behavior of quasi-maximum likelihood estimators in the presence of model misspecification and missing data, but also provides conditions under which those estimators converge to the true parameter values for the complete-data model. When the amount of missing data as measured by the Fraction of Information Loss is small and the complete-data model negative log-likelihood is strictly convex on a convex region of the parameter space, the observable-data negative log-likelihood will be convex on the same region of the parameter space (Theorem 4). This key result supports the Identifiability Proposition 3, which shows when the observable-data model or the complete-data model true parameter values may be estimated.

**Inference.** In our framework, correct specification of the complete-data model always implies correct specification of the observable-data model. Therefore, a key theoretical result provided in Theorem 2 is that if the complete-data probability model is correctly specified and the missing-data mechanism is correctly specified as ignorable, then either the robust missing-data sandwich covariance matrix estimator ${\widehat{C}}_{n}^{-1}$, the missing-data Hessian covariance matrix estimator ${\widehat{A}}_{n}^{-1}$, or the missing-data OPG covariance matrix estimator ${\widehat{B}}_{n}^{-1}$ may be used for the purpose of estimating the covariance matrix of the observable-data pseudo-true parameter estimates. However, in general, only the missing-data sandwich covariance matrix estimator ${\widehat{C}}_{n}^{-1}$ yields consistent estimates of the covariance matrix of the observable-data pseudo-true parameter estimates. For this reason, it is recommended that researchers always use the robust missing-data sandwich covariance matrix estimator rather than the missing-data Hessian covariance matrix estimator or the missing-data OPG covariance matrix estimator.

**Specification Analysis.** Our theory supports a new approach for the detection of model misspecification in missing data problems using the results of Theorem 2(ii) with the Generalized Information Matrix Test (GIMT) methods of Golden et al. (2013, 2016). Under the assumption that the missing-data mechanism is possibly misspecified but postulated as ignorable (MAR), the GIMT method for specification testing can be used to detect the presence of model misspecification in the observable-data model. In practice, these results serve to elucidate how ignorable and nonignorable missing-data mechanisms may affect a complete-data model, which may be either correctly specified or misspecified (White 1982, 1994), when the researcher has postulated an ignorable mechanism. Table 2 depicts the relationships of missing-data mechanisms to the specification of the complete-data model when the observable-data model has been determined to be misspecified and the researcher’s model of the missing-data mechanism is postulated as ignorable. As shown, it can be concluded that the complete-data model is misspecified when the observable-data model is misspecified in the presence of a MAR mechanism. Notably, in the special case where the complete-data model is known to be correctly specified, the presence of an MNAR missing-data mechanism may be detected within this framework. This result follows as a consequence of an ignorable missing-data mechanism not affecting the specification of the observable-data model. Thus, rejecting the null hypothesis for a specification test on the observable-data model evidences the presence of a nonignorable missing-data mechanism (MNAR). Such a test may also be viewed as testing the missing at random (MAR) hypothesis (Jaeger 2006; Lu and Copas 2004; Rhoads 2012). Finally, when the complete-data model may possibly be misspecified, a determination via specification testing that the observable-data model is misspecified leads to the conclusion that either the complete-data model is misspecified or the missing-data mechanism is MNAR.

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Proofs of Theorems and Propositions

**Proof of Proposition 1.**

**Proof of Theorem 1.**

**Proof of Theorem 2.**

**Proof of Proposition 2.**

**Proof of Theorem 3.**

**Proof of Theorem 4.**

Two r-dimensional square matrices $Q$ and $R$ will satisfy $\lambda \left(Q\right)=\lambda \left(R\right)$ if there exists a non-singular matrix $T$ such that ${T}^{-1}QT=R$ (Franklin 1968, Theorem 1, p. 76). Theorem 4(ii) will be proved first and then used to prove Theorem 4(i).

Note that the matrix $A$ is positive semidefinite if and only if ${\tilde{A}}^{-1/2}A{\tilde{A}}^{-1/2}$ is positive semidefinite. Thus, using (A10) implies that the matrix $A$ is positive semidefinite if and only if all eigenvalues of ${\tilde{A}}^{-1}\overset{\frown}{B}$ (or ${\tilde{A}}^{-1}\overset{\frown}{A}$) are less than or equal to one. Finally, note that the Hessian of $l$, $A$, is positive semidefinite on the non-empty open convex set $\Gamma$ if and only if $l$ is convex on $\Gamma$ (see Proposition 5 of Luenberger 1984, p. 180). This establishes the first part of Theorem 4(ii).

**Proof of Proposition 3.**

- (i) Since l is convex on the non-empty open convex set $\Gamma $, and $\theta^{*}$ is a strict local minimizer of l on $\Gamma $, then $\theta^{*}$ is the unique global minimizer of l on $\Gamma $ (Bazarra et al. 2006, pp. 125–26).
- (ii) If the observable-data model is correctly specified on $\Gamma $, then the observable-data true parameter value is in $\Gamma $. Since the missing DGP density is MAR, every observable-data true parameter value is a global minimizer of l on $\Gamma $. By Proposition 3(i), $\theta^{*}$ is the unique global minimizer of l on $\Gamma $, which implies the global minimizer $\theta^{*}$ is the unique observable-data true parameter value.
- (iii) If the complete-data model is correctly specified on $\Gamma $, then the complete-data true parameter value is in $\Gamma $. If there exists a complete-data true parameter value $\theta_{0}$ such that $f\left(x;\theta_{0}\right)=p_{x}\left(x\right)\ \left(a.e.-\nu_{x}\right)$, then $\int f\left(x;\theta_{0}\right)\,d\nu_{z_{h}}\left(z_{h}\right)=\int p_{x}\left(x\right)\,d\nu_{z_{h}}\left(z_{h}\right)$ and thus $q_{y_{h}}\left(y_{h}\left(x\right);\theta_{0}\right)=p_{y_{h}}\left(y_{h}\left(x\right)\right)$ for all **x** in the support of **X** and for all $h\in H$. Thus, correct specification of the complete-data model on $\Gamma $ implies correct specification of the observable-data model on $\Gamma $. By the assumption that the missing DGP density is MAR, the correct specification of the observable-data model, Proposition 3(i), and Proposition 3(ii), it follows that $\theta_{0}$ is the unique global minimizer $\theta^{*}$ of l on $\Gamma $. □

## References

- Abrevaya, Jason, and Stephen G. Donald. 2017. A GMM approach for dealing with missing data on regressors. The Review of Economics and Statistics 99: 657–662. [Google Scholar] [CrossRef]
- Agresti, Alan. 2002. Categorical Data Analysis, 2nd ed. New York: Wiley. [Google Scholar]
- Allison, Paul D. 2001. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences, 07–136; Thousand Oaks: Sage. [Google Scholar]
- Arminger, Gerhard, and Michael E. Sobel. 1990. Pseudo-maximum likelihood estimation of mean and covariance structure with missing data. Journal of the American Statistical Association 85: 195–203. [Google Scholar] [CrossRef]
- Bartle, Robert G. 1966. The Elements of Integration. New York: Wiley. [Google Scholar]
- Bazarra, Mokhtar S., Hanif D. Sherali, and C. M. Shetty. 2006. Nonlinear Programming: Theory and Algorithms. Hoboken: Wiley. [Google Scholar]
- Berndt, Ernst K., Bronwyn H. Hall, Robert E. Hall, and Jerry A. Hausman. 1974. Estimation and inference in nonlinear structural models. Annals of Economic and Social Measurement 3: 653–65. [Google Scholar]
- Breunig, Christoph. 2019. Testing Missing at Random Using Instrumental Variables. Journal of Business & Economic Statistics 2017: 223–34. [Google Scholar]
- Chen, Hua Yun. 2004. Nonparametric and semiparametric models for missing covariates in parametric regression. Journal of the American Statistical Association 99: 1176–89. [Google Scholar] [CrossRef]
- Chen, Xiaohong, and Norman R. Swanson. 2013. Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis. New York: Springer. [Google Scholar]
- Cho, Jin Seo, and Halbert White. 2014. Testing the Equality of Two Positive-Definite Matrices with Application to Information Matrix Testing. In Advances in Econometrics: Essays in Honor of Peter C. B. Phillips. Edited by Yoosoon Chang, Thomas B. Fomby and Joon Park. West Yorkshire: Emerald Group Publishing Limited, vol. 33, pp. 491–556. [Google Scholar]
- Cho, Jin Seo, and Peter C.B. Phillips. 2018. Pythagorean generalization of testing the equality of two symmetric positive definite matrices. Journal of Econometrics 202: 45–56. [Google Scholar] [CrossRef]
- Clayton, David, David Spiegelhalter, Graham Dunn, and Andrew Pickles. 1998. Analysis of longitudinal binary data from multiphase sampling. Journal of the Royal Statistical Society Series B 60: 71–87. [Google Scholar] [CrossRef]
- Dempster, Arthur. P., Nan. M. Laird, and Donald. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B 39: 1–38. [Google Scholar] [CrossRef]
- Dobson, Annette J. 2002. An Introduction to Generalized Linear Models. New York: CRC Press. [Google Scholar]
- Efron, Bradley. 1994. Missing data, imputation, and the bootstrap. Journal of the American Statistical Association 89: 463–75. [Google Scholar] [CrossRef]
- Enders, Craig K. 2010. Applied Missing Data Analysis, 1st ed. New York: The Guilford Press. [Google Scholar]
- Fomby, Thomas B., and R. Carter Hill. 2003. Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later. New York: Elsevier. [Google Scholar]
- Fomby, Thomas B., and R. Carter Hill, eds. 1998. Messy Data—Missing Observations, Outliers, and Mixed-Frequency Data (Advances in Econometrics). Advances in Econometrics, No. 13. Bingley: Emerald Group Publishing Limited. [Google Scholar]
- Franklin, Joel N. 1968. Matrix Theory. Upper Saddle River: Prentice-Hall. [Google Scholar]
- Fridman, Arthur. 2003. Mixed Markov models. Proceedings of the National Academy of Sciences of the United States of America 100: 8092–96. [Google Scholar] [CrossRef]
- Gallini, Joan. 1983. Misspecifications that can result in path analysis structures. Applied Psychological Measurement 7: 125–37. [Google Scholar] [CrossRef]
- Gmel, Gerhard. 2001. Imputation of missing values in the case of a multiple item instrument measuring alcohol consumption. Statistics in Medicine 20: 2369–81. [Google Scholar] [CrossRef]
- Golden, Richard M. 1995. Making correct statistical inferences using a wrong probability model. Journal of Mathematical Psychology 39: 3–20. [Google Scholar] [CrossRef]
- Golden, Richard M. 1996. Mathematical Methods for Neural Network Analysis and Design. Cambridge: MIT Press. [Google Scholar]
- Golden, Richard M. 2000. Statistical tests for comparing possibly misspecified and nonnested models. Journal of Mathematical Psychology 44: 153–70. [Google Scholar] [CrossRef]
- Golden, Richard M. 2003. Discrepancy risk model selection test theory for comparing possibly misspecified or nonnested models. Psychometrika 68: 165–332. [Google Scholar] [CrossRef]
- Golden, Richard M., Steven S. Henley, Halbert White, and T. Michael Kashner. 2013. New directions in information matrix testing: Eigenspectrum tests. In Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis. Edited by Xiaohong Chen and Norman R. Swanson. New York: Springer, pp. 145–77. [Google Scholar]
- Golden, Richard M., Steven S. Henley, Halbert White, and T. Michael Kashner. 2016. Generalized information matrix tests for detecting model misspecification. Econometrics 4: 46. [Google Scholar] [CrossRef]
- Gourieroux, Christian S., Alain Monfort, and Alain Trognon. 1984. Pseudo-maximum likelihood methods: Theory. Econometrica 52: 681–700. [Google Scholar] [CrossRef]
- Graham, John W., Scott M. Hofer, Stewart I. Donaldson, David P. MacKinnon, and Joseph L. Schafer. 1997. Analysis with missing data in prevention research. In New Methodological Approaches to Alcohol Prevention Research. Edited by Kendall J. Bryant, Michael Windle and Stephen G. West. Washington, DC: American Psychological Association. [Google Scholar]
- Greenland, Sander, and William D. Finkle. 1995. A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology 142: 1255–64. [Google Scholar] [CrossRef]
- Groenwold, Rolf H.H., Ian R. White, A. Rogier T. Donders, James R. Carpenter, Douglas G. Altman, and Karel G.M Moons. 2012. Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. CMAJ 184: 1265–69. [Google Scholar] [CrossRef]
- Hardin, James W. 2003. The sandwich estimate of variance. In Maximum Likelihood Estimation of Misspeciifed Models: Twenty Years Later. Edited by Thomas B. Fomby and R. Carter Hill. New York: Elsevier, pp. 45–73. [Google Scholar]
- Harel, Ofer, and Xiao-Hu Zhou. 2006. Multiple imputation for correcting verification bias. Statistics in Medicine 25: 3769–86. [Google Scholar] [CrossRef]
- Heitjan, Daniel F. 1994. Ignorability in general incomplete-data models. Biometrika 81: 701–8. [Google Scholar] [CrossRef]
- Henley, Steven S., Richard M. Golden, and T. Michael Kashner. 2019. Statistical Modeling Methods: Challenges and Strategies. Biostatistics & Epidemiology, 1–35. [Google Scholar] [CrossRef]
- Hosmer, David W., and Stanley Lemeshow. 2000. Applied Logistic Regression, 2nd ed. New York: Wiley. [Google Scholar]
- Huang, Wanling, and Artem Prokhorov. 2014. A Goodness-of-Fit Test for Copulas. Econometric Reviews 33: 751–71. [Google Scholar] [CrossRef]
- Huber, Peter J. 1967. The behavior of maximum likelihood estimates under non-standard conditions. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, vol. 1, pp. 221–33. [Google Scholar]
- Ibragimov, Rustam, and Artem Prokhorov. 2017. Heavy Tails And Copulas: Topics In Dependence Modelling In Economics and Finance. Hackensack: World Scientific Publishing. [Google Scholar]
- Ibrahim, Joseph G., Chen Ming-Hui, Stuart R. Lipsitz, and Amy H. Herring. 2005. Missing-Data Methods for Generalized Linear Models: A Comparative Review. Journal of the American Statistical Association 100: 332–46. [Google Scholar] [CrossRef]
- Ibrahim, Joseph G., Stuart R. Lipsitz, and Ming-Hui Chen. 1999. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. Journal of The Royal Statistical Society Series B 61: 173–90. [Google Scholar] [CrossRef]
- Jaeger, Manfred. 2006. On Testing the Missing at Random Assumption. In Machine Learning: ECML 2006. Lecture Notes in Computer Science. Edited by Johannes Fürnkranz, Tobias Scheffer and Myra Spiliopoulou. Berlin/Heidelberg: Springer, vol. 4212, pp. 671–78. [Google Scholar]
- Jamshidian, Mortaza, and Robert I. Jennrich. 2000. Standard errors for EM estimation. Journal of The Royal Statistical Society Series B 62: 257–70. [Google Scholar] [CrossRef]
- Jank, Wolfgang, and James Booth. 2003. Efficiency of Monte Carlo EM and Simulated Maximum Likelihood in Two-Stage Hierarchical Models. Journal of Computational and Graphical Statistics 12: 214–29. [Google Scholar] [CrossRef]
- Jennrich, Robert I. 1969. Asymptotic properties of nonlinear least squares estimators. Annals of Mathematical Statistics 40: 633–43. [Google Scholar] [CrossRef]
- Kashner, T. Michael, Steven S. Henley, Richard M. Golden, John M. Byrne, Sheri A. Keitz, Grant W. Cannon, Barbara K. Chang, Gloria J. Holland, David C. Aron, Elaine A. Muchmore, and et al. 2010. Studying the Effects of ACGME duty hours limits on resident satisfaction: Results from VA Learner’s Survey. Academic Medicine 85: 1130–39. [Google Scholar] [CrossRef]
- Kass, Robert E., and Paul W. Voss. 1997. Geometric Foundations of Asymptotic Inference. New York: Wiley. [Google Scholar]
- Kenward, Michael G., and Geert Molenberghs. 1998. Likelihood based frequentist inference when data are missing at random. Statistical Science 13: 236–47. [Google Scholar] [CrossRef]
- King, Gary, James Honaker, Anne Joseph, and Kenneth Scheve. 2001. Analyzing incomplete political science data: An alternative algorithm for multiple imputation. American Political Science Association 95: 49–69. [Google Scholar] [CrossRef]
- Kosinski, Andrzej S., and Huiman X. Barnhart. 2003a. A global sensitivity analysis of performance of a medical diagnostic test when verification bias is present. Statistics in Medicine 22: 2711–21. [Google Scholar] [CrossRef]
- Kosinski, Andrzej S., and Huiman X. Barnhart. 2003b. Accounting for Nonignorable Verification Bias in Assessment of Diagnostic Tests. Biometrics 59: 163–71. [Google Scholar] [CrossRef]
- Kullback, Solomon, and Richard A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics 22: 79–86. [Google Scholar] [CrossRef]
- Leke, Collins Achepsah, and Tshilidzi Marwala. 2019. Deep Learning and Missing Data in Engineering Systems, 1st ed. Cham: Springer Nature Switzerland. [Google Scholar]
- Liang, Kung-Yee, and Scott L. Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22. [Google Scholar] [CrossRef]
- Little, Roderick J. A., and Donald B. Rubin. 2002. Statistical Analysis with Missing Data, 2nd ed. New York: Wiley. [Google Scholar]
- Little, Roderick J., Ralph D’Agostino, Michael L. Cohen, Kay Dickersin, Scott S. Emerson, John T. Farrar, Constantine Frangakis, Joseph W. Hogan, Geert Molenberghs, Susan A. Murphy, and et al. 2012. The prevention and treatment of missing data in clinical trials. The New England Journal of Medicine 367: 1355–60. [Google Scholar] [CrossRef]
- Little, Roderick J.A. 1994. A class of pattern-mixture models for multivariate incomplete data. Biometrika 81: 471–83. [Google Scholar] [CrossRef]
- Little, Roderick J.A. 1988. A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association 83: 1198–202. [Google Scholar] [CrossRef]
- Littman, Michael L. 2009. A tutorial on partially observable Markov decision processes. Journal of Mathematical Psychology 53: 119–25. [Google Scholar] [CrossRef]
- Louis, Thomas A. 1982. Finding the Observed Information Matrix when Using the EM Algorithm. Journal of The Royal Statistical Society Series B 44: 226–33. [Google Scholar] [CrossRef]
- Lu, Guobing, and John B. Copas. 2004. Missing at Random, Likelihood Ignorability and Model Completeness. The Annals of Statistics 32: 754–65. [Google Scholar]
- Luenberger, David G. 1984. Linear and Nonlinear Programming, 2nd ed. Massachusetts: Addison-Wesley. [Google Scholar]
- Markovsky, Ivan. 2017. A Missing Data Approach to Data-Driven Filtering and Control. IEEE Transactions on Automatic Control 62: 1972–78. [Google Scholar] [CrossRef]
- McCullagh, P., and John A. Nelder. 1989. Generalized Linear Models. New York: Chapman and Hall. [Google Scholar]
- McDonough, Ian K., and Daniel L. Millimet. 2016. Missing Data, Imputation, and Endogeneity. Bonn: IZA Institute of Labor Economics. [Google Scholar]
- McLachlan, Geoffrey, and Thriyambakam Krishnan. 1997. The EM Algorithm and Extensions. New York: Wiley. [Google Scholar]
- Meng, Xiao-Li, and Donald B. Rubin. 1991. Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association 86: 899–909. [Google Scholar] [CrossRef]
- Miller, J. Isaac. 2010. Cointegrating regressions with messy regressors and an application to mixed-frequency series. Journal of Time Series Analysis 31: 255–77. [Google Scholar]
- Molenberghs, Geert, and Michael Kenward. 2007. Missing Data in Clinical Studies. New York: Wiley. [Google Scholar]
- Molenberghs, Geert, Bart Michiels, Michael G. Kenward, and P.J. Diggle. 1998. Missing data mechanisms and pattern-mixture models. Statistica Neerlandica 52: 153–61. [Google Scholar] [CrossRef]
- Molenberghs, Geert, Caroline Beunckens, and Cristina Sotto. 2008. Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of The Royal Statistical Society Series B 70: 371–88. [Google Scholar] [CrossRef]
- Molenberghs, Geert, Garrett Fitzmaurice, Michael G. Kenward, Anastasios Tsiatis, and Geert Verbeke. 2014. Handbook of Missing Data Methodology, 1st ed. London: Chapman & Hal, Boca Raton: CRC. [Google Scholar]
- Murray, Gordon D. 1977. Contribution to the discussion of paper by A. P. Dempster, N. M. Laird, and D. B. Rubin. Journal of The Royal Statistical Society Series B 39: 27–28. [Google Scholar]
- Nielsen, Søren Feodor. 1997. Inference and missing data: Asymptotic results. Scandinavian Journal of Statistics 24: 261–74. [Google Scholar] [CrossRef]
- Orchard, Terence, and Max A. Woodbury. 1972. A missing information principle: Theory and applications. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability 1: 697–715. [Google Scholar]
- Parzen, Michael, Stuart R. Lipsitz, Garrett M. Fitzmaurice, Joseph G. Ibrahim, and Andrea Troxel. 2006. Pseudo-likelihood methods for longitudinal binary data with nonignorable missing responses and covariates. Statistics in Medicine 25: 2784–96. [Google Scholar] [CrossRef]
- Prokhorov, Artem, Ulf Schepsmeier, and Yajing Zhu. 2019. Generalized Information Matrix Tests for Copulas. Econometric Reviews 25: 1024–54. [Google Scholar] [CrossRef]
- Rhoads, Christopher H. 2012. Problems with Tests of the Missingness Mechanism in Quantitative Policy Studies. Statistics, Politics, and Policy 3: 6. [Google Scholar] [CrossRef]
- Robins, James M., and Naisyin Wang. 2000. Inference for imputation estimators. Biometrika 87: 113–24. [Google Scholar] [CrossRef]
- Royall, Richard M. 1986. Model robust confidence intervals using maximum likelihood estimators. International Statistical Review 54: 221–26. [Google Scholar] [CrossRef]
- Rubin, Donald B. 1976. Inference and missing data. Biometrika 63: 581–92. [Google Scholar] [CrossRef]
- Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York: Wiley. [Google Scholar]
- Rubin, Donald B. 1996. Multiple imputation after 18+ years. Journal of the American Statistical Association 91: 473–89. [Google Scholar] [CrossRef]
- Ryden, Tobias, and D. M. Titterington. 1998. Computational Bayesian analysis of hidden Markov models. Journal of Computational and Graphical Statistics 7: 194–211. [Google Scholar]
- Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. New York: Chapman and Hall. [Google Scholar]
- Schenker, Nathaniel, and A. H. Welsh. 1988. Asymptotic results for multiple imputation. Annals of Statistics 16: 1550–66. [Google Scholar] [CrossRef]
- Schepsmeier, Ulf. 2015. Efficient information based goodness-of-fit tests for vine copula models with fixed margins: A comprehensive review. Journal of Multivariate Analysis 138: 34–52. [Google Scholar] [CrossRef]
- Schepsmeier, Ulf. 2016. A goodness-of-fit test for regular vine copula models. Econometric Reviews 38: 25–46. [Google Scholar] [CrossRef]
- Serfling, Robert J. 1980. Approximation Theorems of Mathematical Statistics, 2nd ed. New York: Wiley-Interscience. [Google Scholar]
- Sung, Yun Ju, and Charles J. Geyer. 2007. Monte Carlo likelihood inference for missing data models. The Annals of Statistics 35: 990–1011. [Google Scholar] [CrossRef]
- Troxel, Andrea B., Diane L. Fairclough, Desmond Curran, and Elizabeth A. Hahn. 1998. Statistical Analysis of Quality of Life with Missing Data in Cancer Clinical Trials. Statistics in Medicine 17: 653–66. [Google Scholar] [CrossRef]
- Troxel, Andrea B., Stuart R. Lipsitz, and David P. Harrington. 1998. Marginal models for the analysis of longitudinal measurements with nonignorable nonmontone missing data. Biometrika 85: 661–72. [Google Scholar] [CrossRef]
- Verbeek, Marno. 2008. A Guide to Modern Econometrics. New York: Wiley. [Google Scholar]
- Verbeke, Geert, and Emmanuel Lesaffre. 1997. The effect of misspecifying the random-effects distribution in linear mixed models for longitudinal data. Computational Statistics & Data Analysis 23: 541–56. [Google Scholar]
- Visser, Ingmar. 2011. Seven things to remember about hidden Markov models: A tutorial on Markovian models for time series. Journal of Mathematical Psychology 55: 403–15. [Google Scholar] [CrossRef]
- Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models, 2nd ed. New York: Springer. [Google Scholar]
- Wall, Melanie M., Yu Dai, and Lynn E. Eberly. 2005. GEE estimation of a misspecified time-varying covariate: An example with the effect of alcoholism treatment on medical utilization. Statistics in Medicine 24: 925–39. [Google Scholar] [CrossRef]
- Wang, Naisyin, and James M. Robins. 1998. Large-sample theory for parametric multiple imputation procedures. Biometrika 85: 935–48. [Google Scholar] [CrossRef]
- Wedderburn, Robert William Maclagan. 1974. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61: 439–47. [Google Scholar]
- Wei, Bo-Cheng. 1998. Exponential Family Nonlinear Models. New York: Springer. [Google Scholar]
- White, Halbert. 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48: 817–38. [Google Scholar] [CrossRef]
- White, Halbert. 1982. Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25. [Google Scholar] [CrossRef]
- White, Halbert. 1984. Asymptotic Theory for Econometricians. New York: Academic Press. [Google Scholar]
- White, Halbert. 1994. Estimation, Inference and Specification Analysis. New York: Cambridge University Press. [Google Scholar]
- Woodbury, Max. A. 1971. Contribution to the discussion of “The analysis of incomplete data” by Herman. O. Hartley and Ronald. R. Hocking. Biometrics 27: 808–13. [Google Scholar]
- Wooldridge, Jeffrey M. 2004. Inverse Probability Weighted Estimation for General Missing Data Problems. Cemmap Working Paper, No. CWP05/04. London: Centre for Microdata Methods and Practice (cemmap). [Google Scholar]
- Yuan, Ke-Hai. 2009. Normal distribution based pseudo ML for missing data: With applications to mean and covariance structure analysis. Journal of Multivariate Analysis 100: 1900–18. [Google Scholar] [CrossRef]
- Zhao, Lue Ping, Lipsitz Stuart, and Danika Lew. 1996. Regression analysis with missing covariate data using estimating equations. Biometrics 52: 1165–82. [Google Scholar] [CrossRef]
- Zhou, Xiao-Hua, Chuan Zhou, Danping Lui, and Xaiobo Ding. 2014. Applied Missing Data Analysis in the Health Sciences, 1st ed. Statistics in Practice. New York: Wiley. [Google Scholar]
- Zhu, Yajing. 2017. Dependence Modelling and Testing: Copula and Varying Coefficient Model with Missing Data. Ph.D. thesis, Concordia University, Montreal, QC, Canada. [Google Scholar]

**Table 1.**Key theoretical results for probability models with assumed ignorable missing-data mechanisms.

Result | Description |
---|---|
Consistency Theorem T1 | The QMLE is a consistent estimator of the pseudo-true parameter values for observable-data probability models with an assumed ignorable missing-data mechanism in the presence of a missing DGP specified by a MAR or MNAR missing-data mechanism. |
Asymptotic Distribution Theorem T2(i) | The asymptotic distribution of the QMLE is Gaussian with covariance matrix ${C}^{*}={\left({A}^{*}\right)}^{-1}{B}^{*}{\left({A}^{*}\right)}^{-1}$ for observable-data probability models with an assumed ignorable missing-data mechanism in the presence of a missing DGP specified by a MAR or MNAR missing-data mechanism. |
Misspecification Detection Theorem T2(ii) | A GIMT may be used to detect the presence of misspecification in the observable-data probability model with an assumed ignorable missing-data mechanism in the presence of a missing DGP that is a MAR or MNAR missing-data mechanism. If this observable-data probability model is misspecified, this implies the complete-data probability model is misspecified when the missing-data mechanism is possibly misspecified but correctly specified as ignorable. |
Missing Information Principles Theorem T3 | Let ${\overline{l}}_{n}\left(\theta \right)=-{n}^{-1}{\displaystyle \sum _{i=1}^{n}\mathrm{log}\left({q}_{{y}_{{h}_{i}}}\left({y}_{i};\theta \right)\right)}$ denote the observable-data negative average log-likelihood. The Hessian of ${\overline{l}}_{n}\left(\theta \right)$ in the presence of possible model misspecification may be estimated using either ${\overline{A}}_{n}={\tilde{A}}_{n}-{\stackrel{\frown}{A}}_{n}$ or ${\overline{A}}_{n}={\tilde{A}}_{n}-{\stackrel{\frown}{B}}_{n}$. If, in addition, either the observable-data or the complete-data model is correctly specified, then the Hessian of ${\overline{l}}_{n}\left(\theta \right)$ may also be estimated using either ${\overline{A}}_{n}={\tilde{B}}_{n}-{\stackrel{\frown}{B}}_{n}$ or ${\overline{A}}_{n}={\tilde{B}}_{n}-{\stackrel{\frown}{A}}_{n}$. |
Identifiability Proposition P3 | Assume that the observable-data negative log-likelihood is convex on a convex region, $\Gamma $, of the parameter space with a unique global minimizer in the interior of $\Gamma $. If the observable-data model is correctly specified and the missing-data mechanism is correctly specified as ignorable, then that global minimizer is the observable-data model true parameter value. If, in addition, the complete-data model is correctly specified on $\Gamma $, then the unique global minimizer on $\Gamma $ is the complete-data model true parameter value. |
Fraction of Information Loss Theorem T4 | If the amount of missing data as measured by the Fraction of Information Loss is small and the complete-data model negative log-likelihood is strictly convex on a convex region of the parameter space, then, under appropriate regularity conditions, the observable-data negative log-likelihood will be convex on that region of the parameter space. |
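The missing information principle summarized in Theorem T3 (observable-data information equals complete-data information minus missing information) can be checked exactly in a simple Gaussian case. The sketch below is a hypothetical illustration, not the paper's estimators: a bivariate normal mean with known covariance, where the second coordinate is missing.

```python
import numpy as np

# Numerical sketch of the missing information principle:
# observable-data information = complete-data information - missing information.
# Toy case: bivariate normal mean (mu1, mu2) with known covariance Sigma, where
# the second coordinate y2 is missing. All matrices are per-observation
# expected informations (an illustration, not the paper's notation).
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])

# Complete-data information for the mean vector.
A_tilde = np.linalg.inv(Sigma)

# Missing information: information in the conditional density of y2 given y1.
b = Sigma[1, 0] / Sigma[0, 0]                    # regression slope of y2 on y1
v = Sigma[1, 1] - Sigma[1, 0]**2 / Sigma[0, 0]   # conditional variance of y2 | y1
A_missing = np.array([[b**2, -b], [-b, 1.0]]) / v

# Observable-data information: only the marginal of y1 is seen,
# which is informative for mu1 alone.
A_bar = np.array([[1.0 / Sigma[0, 0], 0.0], [0.0, 0.0]])

assert np.allclose(A_bar, A_tilde - A_missing)   # missing information principle
```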

**Table 2.**Consequences of missing-data mechanism specification: detecting and interpreting misspecification in the observable-data model.

Missing-Data Mechanism ^{1} | Complete-Data Model | Conclusion when Observable-Data Model ^{2} is Misspecified ^{3} |
---|---|---|
MAR | Possibly Misspecified | Complete-Data Model is Misspecified. |
MNAR or MAR | Correctly Specified | Missing-Data Mechanism is MNAR ^{4}. |
MNAR or MAR | Possibly Misspecified | Either the Complete-Data Model is Misspecified OR the Missing-Data Mechanism is MNAR. |

^{1} The researcher’s model of the missing-data mechanism is postulated as ignorable.

^{2} Maximum likelihood estimates are obtained by minimizing the observable-data negative log-likelihood (Equation (3)).

^{3} Generalized Information Matrix Tests (GIMT) (Golden et al. 2013, 2016) may be applied to detect misspecification in the observable-data model (Theorem 2(ii)).

^{4} The GIMT provides a statistical test for detecting whether the partially observable DGP has an MNAR missing-data mechanism in situations where the complete-data model is known to be correctly specified.
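The decision logic of Table 2 can be expressed compactly. The helper below is a hypothetical encoding for illustration; the function and argument names are not from the paper.

```python
# Hypothetical helper encoding the decision logic of Table 2: given the result
# of a specification test on the observable-data model (under a postulated
# ignorable mechanism) and prior knowledge of the complete-data model, return
# the conclusion that the table supports. Names are illustrative only.
def table2_conclusion(observable_misspecified: bool,
                      complete_model_known_correct: bool) -> str:
    if not observable_misspecified:
        return "No evidence of misspecification from this test."
    if complete_model_known_correct:
        # An ignorable (MAR) mechanism cannot induce misspecification in the
        # observable-data model, so the mechanism must be MNAR.
        return "Missing-data mechanism is MNAR."
    return ("Either the complete-data model is misspecified "
            "or the missing-data mechanism is MNAR.")

print(table2_conclusion(True, True))
```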

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Golden, R.M.; Henley, S.S.; White, H.; Kashner, T.M. Consequences of Model Misspecification for Maximum Likelihood Estimation with Missing Data. *Econometrics* **2019**, *7*, 37.
https://doi.org/10.3390/econometrics7030037
