1. Introduction
Information theoretic estimators have been receiving increasing attention in the econometric-statistics literature [1,2,3,4,5,6,7]. In other work, [3] proposed an information theoretic estimator based on minimization of the Kullback-Leibler Information Criterion as an alternative to optimally-weighted generalized method of moments estimation. This estimator handles weakly dependent data generating mechanisms, and under reasonable regularity assumptions it is consistent and asymptotically normally distributed. Subsequently, [1] proposed an information theoretic estimator based on minimization of the Cressie-Read discrepancy statistic as an alternative approach to inference in moment condition models. The author of [1] identified a special case of the Cressie-Read statistic, the Kullback-Leibler Information Criterion (e.g., maximum entropy), as being preferred over other estimators (e.g., empirical likelihood) because of its efficiency and robustness properties. Special issues of the Journal of Econometrics (March 2002) and Econometric Reviews (May 2008) were devoted to this particular topic of information estimators.
Historically, information theoretic estimators have been motivated in several ways. The Cressie-Read statistic directly minimizes an information-based measure of closeness between the estimated and empirical distributions [1]. Alternatively, the maximum entropy principle is based on an axiomatic approach that defines a unique objective function for measuring the uncertainty of a collection of events [8,9,10]. Interest in maximum entropy estimators stems from the prospect of recovering and processing information when the underlying sampling model is incompletely or incorrectly known and the data are limited, partial, or incomplete [10]. To date, the principle of maximum entropy has been applied in an abundance of circumstances, including the fields of econometrics and statistics [11,12,13,14,15,16,17], economic theory and applications [18,19,20,21,22,23,24], accounting and finance [25,26,27], and resources and agricultural economics [28,29,30,31,32]. Moreover, widely used econometric software packages now incorporate procedures to calculate maximum entropy estimators in their latest releases (e.g., SAS, SHAZAM, and GAUSSX).
Rigorous investigation of the small and large sample properties of information theoretic estimators has generally lagged far behind empirical applications [3]. Exceptions include [1,2,3], who examined information theoretic alternatives to generalized method of moments estimation; [14], who derived the statistical properties of the generalized maximum entropy estimator in the context of modeling multinomial response data; and [10], who provided asymptotic properties of the moment-constrained generalized maximum entropy (GME) estimator for the general linear model (showing it is asymptotically equivalent to ordinary least squares). An alternative information theoretic estimator of the general linear model (GLM), yet to be rigorously investigated but arising in empirical applications (e.g., [24]), is the purely data-constrained formulation of the generalized maximum entropy estimator [10]. In a purely data-constrained formulation, the regression model itself, as opposed to moment conditions derived from it, represents the constraining function for the entropy objective. In the maximum entropy framework, unlike the ordinary least squares or maximum likelihood estimators of the GLM, moment constraints are not necessary to uniquely identify parameter estimates. Moreover, there exist distinct differences between the data- and moment-constrained versions of the GME estimator for the GLM. For example, [10] have shown the data-constrained GME estimator to be mean square error superior to the moment-constrained GME estimator of the GLM in selected Monte Carlo experiments.
Our paper contributes to the econometric literature in several ways. First, regularity conditions are identified that provide a solid foundation from which to develop statistical properties of the data-constrained GME estimator of the GLM and hypothesis tests on model parameters. Given the regularity conditions, we define a conditional maximum entropy function to rigorously prove consistency and asymptotic normality. As demonstrated in this paper, the data-constrained GME estimator is not asymptotically equivalent to the moment-constrained GME estimator or the ordinary least squares estimator. However, the GME estimator is shown to be nearly asymptotically efficient. Moreover, we derive formulae to compute the asymptotic variance of the proposed estimator, which allows us to define classical Wald, Likelihood Ratio, and Lagrange Multiplier tests for testing hypotheses about model parameters.
Second, theoretical extensions to unbiased, cross entropy, and Bayesian estimation are identified. Further, we demonstrate that the GME specification can be extended from finite-discrete parameter and error spaces to infinite-continuous parameter and error spaces. Alternative formulations of the data-constrained GME estimator of the GLM under selected regularity conditions, and the implications for the properties of the estimator, are also discussed.
Third, to complement the theoretical results, Monte Carlo experiments are used to compare the performance of the data-constrained GME estimator to least squares estimates for small and medium sample sizes. The performance of the GME estimator is evaluated with respect to selected error distributions, to the user-supplied supports for the parameters and errors, and to its robustness to model misspecification. Monte Carlo experiments are also performed to examine the size and power of the Wald, Likelihood Ratio, and Lagrange Multiplier test statistics.
Fourth, insight into computational efficiency and practical guidelines for setting the boundaries of parameter and error support spaces are provided. The conditional maximum entropy formulation utilized in the proof of asymptotic properties provides the basis for a new, computationally efficient method of calculating GME estimates. The approach involves a nonlinear search over a K-vector of coefficient parameters, which is much more efficient than numerical approaches proposed elsewhere in the literature. Finally, practical guidelines for setting the boundaries of parameter and error support spaces are analyzed and discussed.
2. The Data-Constrained GME Formulation
Let Y = Xβ + ε represent the general linear model, with Y being an N × 1 dependent variable vector, X being a fixed N × K matrix of explanatory variables, β being a K × 1 vector of parameters, and ε being an N × 1 vector of disturbance terms. (All of our results can be extended to stochastic X. For example, if x_n is iid with E(x_n x_n′) = Q, a positive definite matrix, then the asymptotic properties are identical to those developed below.) The GME rule for defining the estimator of the unknown β in the general linear model formulation is given by b = Zp̂, with p̂ derived from the following constrained maximum entropy problem:

max over (p, w): H(p, w) = −p′ ln(p) − w′ ln(w)

subject to:

Y = XZp + Vw, with each p_k and each w_i summing to unity.
In the preceding formulation, the matrices Z and V are K × KM and N × NJ block-diagonal matrices of support points for the β and ε vectors, respectively, as:

Z = diag(z_1′, ..., z_K′) and V = diag(v_1′, ..., v_N′)

where z_k is an M × 1 vector such that z_k1 ≤ β_k and β_k ≤ z_kM, and similarly v_i is a J × 1 vector such that v_i1 ≤ ε_i and ε_i ≤ v_iJ (in their original formulation, [10] required ε_i to be contained in a fixed interval with arbitrarily high probability; here we assume such an event occurs with probability one). The M × 1 vectors p_k and the J × 1 vectors w_i are weight vectors having nonnegative elements that sum to unity and are used to represent the β and ε vectors as β_k = z_k′ p_k for k = 1, ..., K, and ε_i = v_i′ w_i for i = 1, ..., N.
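To make the formulation concrete, the following is a minimal numerical sketch of the data-constrained GME problem, solved directly in the (p, w) space with a generic constrained optimizer. The simulated data, the M = J = 3 support vectors, and the use of SciPy's SLSQP routine are illustrative assumptions, not the paper's design; the concentrated search described in Section 6 is far more efficient.

```python
# Minimal sketch of the data-constrained GME estimator of the GLM, solved
# directly over (p, w). All numbers below are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
N, K, M, J = 20, 2, 3, 3
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

z = np.array([-5.0, 0.0, 5.0])            # assumed common support for each beta_k
v = np.array([-3.0, 0.0, 3.0])            # assumed common support for each eps_i
Z = block_diag(*([z] * K))                # K x KM block-diagonal support matrix
V = block_diag(*([v] * N))                # N x NJ block-diagonal support matrix

def neg_entropy(q):                       # negated objective: sum q*ln(q)
    q = np.clip(q, 1e-12, 1.0)
    return np.sum(q * np.log(q))

def data_constraint(q):                   # Y = X Z p + V w
    p, w = q[:K * M], q[K * M:]
    return y - X @ (Z @ p) - V @ w

def adding_up(q):                         # each p_k and each w_i sums to one
    p, w = q[:K * M], q[K * M:]
    return np.concatenate([p.reshape(K, M).sum(1) - 1,
                           w.reshape(N, J).sum(1) - 1])

q0 = np.concatenate([np.full(K * M, 1 / M), np.full(N * J, 1 / J)])
res = minimize(neg_entropy, q0, method="SLSQP",
               bounds=[(0, 1)] * (K * M + N * J),
               constraints=[{"type": "eq", "fun": data_constraint},
                            {"type": "eq", "fun": adding_up}])
beta_gme = Z @ res.x[:K * M]              # recover beta as Z p
print("GME estimate:", beta_gme)
```

Even at this toy scale, the (p, w) space has KM + NJ = 66 unknowns, which previews why the concentrated K-dimensional search discussed later is attractive.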
The basic principle underlying the estimator for β is to choose an estimate that contains only the information available. In this way the maximum entropy estimator is not constrained by any extraneous assumptions. The information used is the observed information contained in the data, the information contained in the constraints on the admissible values of β, and the information inherent in the structure of the model, including the choice of the supports for the β_k's and ε_i's. In effect, the information set used in estimation is shrunk to the boundary of the observed data and the parameter constraint information. Because the objective function value increases as the weights in p and w are more uniformly distributed, any deviation from uniformity represents the effect of the data constraints on the weighting of the support points used for representing β and ε. This fact also motivates the interpretation of the GME as a shrinkage-type estimator that, in the absence of constraints on β, will shrink to the centers of the supports defined in the specification of Z. We next establish consistency and asymptotic normality results for the GME estimator under general regularity conditions on the specification of the estimation problem.
3. Consistency and Asymptotic Normality of the GME Estimator
Regularity Conditions. To establish asymptotic results for the GME estimator, we utilize the following regularity conditions for the problem of estimating β in Y = Xβ + ε.
- R1 The ε_i's are iid, with support contained in [−c + δ, c − δ] with probability one, for some δ > 0 and large enough finite positive c.
- R2 The pdf of ε_i is symmetric around 0 with finite variance σ².
- R3 β_k ∈ (z_k1, z_kM) for finite z_k1 and z_kM, k = 1, ..., K.
- R4 X has full column rank.
- R5 N⁻¹X′X is O(1), and the smallest eigenvalue of N⁻¹X′X exceeds some ε > 0 for all N ≥ N**, where N** is some positive integer.
- R6 N⁻¹X′X → Q, a finite positive definite symmetric matrix.
Note that condition R1 states that the support of ε_i is contained in the interior of some large enough closed finite interval [−c, c]. Condition R3 states that the true value of each parameter β_k can be enclosed within some open interval (z_k1, z_kM). The conditions R4-R6 on X are familiar analogues to typical assumptions made in the least squares context for establishing asymptotic properties of the least squares estimator of β. We utilize condition R6 to simplify the demonstration of asymptotic normality, but the result can be established under weaker conditions, as alluded to in the proof. Finally, our proof of the asymptotic results utilizes symmetry of the disturbance distribution, which is the content of condition R2.
Reformulated GME Rule. The asymptotic results are derived within the context of the following representation of the GME model, stated in scalar notation to facilitate exposition of the proof. The GME representation described below is completely consistent with the formulation in Section 2 under the condition that the support points represented by the vector v_i are chosen to be symmetrically dispersed around 0. We use the same vector of support points for each of the ε_i's, consistent with the iid nature of the disturbances, and so henceforth v_j refers to the common j-th scalar support point in the development below. The representation is also more general than the representation in Section 2 in the sense that different numbers of support points can be used for the representation of different parameters. The constrained maximum entropy problem is as follows:

max over (p, w): −Σ_k Σ_m p_km ln(p_km) − Σ_i Σ_j w_ij ln(w_ij)   (2)
subject to:
- C1
- C2
- C3
- C4
- C5
- C6
As will become apparent, the nonnegativity restrictions on the p_km's and w_ij's are inherently enforced by the structure of the optimization problem itself, and thus need not be explicitly incorporated into the constraint set.
Asymptotic Properties. The following theorem establishes the consistency and asymptotic normality of the GME estimator of β in the GLM.
Theorem. Under regularity conditions R1-R5, the GME estimator is a consistent estimator of β. With the addition of regularity condition R6, the GME estimator is asymptotically normally distributed, with √N(b − β) converging in distribution to a mean-zero multivariate normal vector, for an appropriate definition of the asymptotic covariance matrix.
Proof. Define the maximized entropy function, conditional on τ, as:
The optimal value of w_ij in the conditionally-maximized entropy function is given by:

which is the maximizing solution to the Lagrangian:
The optimal value of w_ij is then:

where λ_i(e_i) is the optimal value of the Lagrangian multiplier λ_i under the condition Σ_j w_ij v_j = e_i, and e_i = y_i − x_i′τ. It follows from the symmetry of the v_j's around zero that:
Similarly, the optimal value of p_km in the conditionally-maximized entropy function is given by:

which is the maximizing solution to the Lagrangian:

The optimal value of p_km is then:

where η_k(τ_k) is the optimal value of the Lagrangian multiplier η_k under the condition Σ_m p_km z_km = τ_k.
Substituting the optimal solutions for the w_ij's and p_km's into (2) obtains the conditional maximum value function:
Define the gradient vector of F(τ) so that:

and thus the gradient can be expressed in terms of the Lagrangian multipliers, where λ and η are the N × 1 and K × 1 vectors of Lagrangian multipliers associated with the error and parameter constraints, respectively. It follows that the Hessian matrix of F(τ) is given by:
Regarding the functional form of the derivatives of the Lagrangian multipliers appearing in the definition of the Hessian, it follows from (C2) that:

so that, from (3):

Then, from (C2):

and thus:

Also, based on (C1):

so that:
Because the denominators of the terms in the definition of these derivatives are positive valued, it follows that the Hessian matrix of F(τ) is negative definite, given that X′X is positive definite.
Now consider the case where τ = β, so that:

are iid with mean zero, and thus:

are iid with mean zero. Because ε_i is bounded in the interior of [−c, c], the range of λ_i is bounded as well. In addition, λ_i is symmetrically distributed around zero because the ε_i's are so distributed, and, from (4):

It follows that the summands of the gradient are iid with mean zero and finite variance. The limiting normal distribution of the scaled gradient then follows from a multivariate version of Liapounov's central limit theorem, given condition R6. (Asymptotic normality can be established without regularity condition R6; in fact, the boundedness properties on the X-matrix stated in R5 would be sufficient. See [33] for a related proof under weaker regularity conditions.)
3.1. Consistency
For any τ, represent the conditional maximum value function, F(τ), by a second order Taylor series around β as:

where τ* lies between τ and β. The value of the quadratic term in the expansion can be bounded by:

where ξ denotes the smallest eigenvalue of the Hessian matrix [34]. The smallest eigenvalue exhibits a positive lower bound whatever the value of τ*.

The value of the linear term in the expansion is bounded in probability; that is, for any given probability level, there exists a finite constant such that:

because the scaled gradient at β is bounded in probability. It follows from Equations (7)–(9) that, for all τ bounded away from β, the quadratic term eventually dominates the linear term. Thus the maximizer of the conditional maximum value function converges in probability to β, and the GME estimator of β is consistent.
3.2. Asymptotic Normality
Expand G(b) in a Taylor series around β, where b is the GME estimator of β, to obtain:

where b* is between b and β. In general, different b* points will be required to represent the different coordinate functions in G. At the optimum, G(b) = 0, and b is a consistent estimator of β; therefore b* converges in probability to β, and:

where ≈ denotes equivalence of limiting distributions. Using the gradient representation derived above, note that:

where the summands are iid. It follows from R6 that the normalized gradient converges in distribution to a mean-zero normal limit with finite covariance matrix. Recalling the probability limit of the Hessian, Slutsky's Theorem [34] implies that:
Note that, holding the support of ε constant, one can reduce the interval (c1, cJ). As the interval shrinks, the asymptotic variance of the GME estimator may tend to zero, but it cannot grow without bound. For example, if the interval is reduced at a rate such that the implied disturbance variance vanishes, then the asymptotic variance of the estimator tends to zero as well.
Also note that, for large samples, the parameters' reliance on the supports vanishes. In contrast, the supports on the errors influence the computed covariance matrix. Finally, for non-homogeneous errors, the covariance matrix estimator could be adjusted following a standard White's covariance correction.
3.3. Cross-Entropy Extensions
To extend the previous asymptotic results to the case of cross-entropy maximization [10], first suppose that some of the support points coincide, so that z_km = z_kn and/or v_j = v_l for some m ≠ n or j ≠ l. Let the distinct values among the z_km's and v_j's be indexed separately, and let the multiplicities of those distinct values be recorded. From Equations (3) and (5), coinciding support points receive identical optimal weights. Thus, the maximization problem given by Equation (2) and Conditions C1-C6 is equivalent to:
with obvious changes being made to C1-C6. The only alterations needed to the preceding proof are:

More generally, the same representation (11)-(13) applies for any choice of reference weights. Furthermore, Equations (12) and (13) are homogeneous of degree zero in their respective reference weights. Thus, without loss of generality, the normalization conditions:

can be imposed.

Using Equations (11), (12), and (13), we have characterized the maximum cross-entropy solution. Upon substitution of Equations (11)–(13) into the appropriate arguments, all results, including the results in the next section on statistical testing, apply to the maximum cross-entropy paradigm.
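To illustrate the cross-entropy variant computationally, the sketch below exponentially tilts a reference distribution q over the support points instead of the implicit uniform weights. The support vector, reference weights, and bracketing interval for the multiplier are illustrative assumptions.

```python
# Sketch of the inner cross-entropy problem: minimize sum p*ln(p/q) subject
# to the mean constraint p @ z = tau_k and sum(p) = 1. The solution is an
# exponentially tilted version of the reference weights q.
import numpy as np
from scipy.optimize import brentq

def cross_entropy_weights(tau_k, z, q):
    def p_of(eta):
        s = eta * z + np.log(q)          # tilt the reference weights
        w = np.exp(s - s.max())          # numerically stabilized exponentials
        return w / w.sum()
    eta = brentq(lambda t: p_of(t) @ z - tau_k, -50.0, 50.0)
    return p_of(eta)

z = np.array([-5.0, 0.0, 5.0])           # illustrative support
q = np.array([0.2, 0.6, 0.2])            # illustrative reference distribution
print(cross_entropy_weights(1.0, z, q))  # tilted toward the positive support point
```

With uniform q, this reduces to the maximum entropy weights, which is the sense in which GME is a special case of generalized cross entropy.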
5. Monte Carlo Simulations
A Monte Carlo experiment was conducted to explore the sampling behavior of the test statistics based on the generalized maximum entropy estimator. The data were generated from a linear model containing an intercept term, a dichotomous explanatory variable, and two continuously measured explanatory variables. The results of the Monte Carlo experiment also add perspective to the simulation results on the bias and mean square error of the maximum entropy estimator generated previously by [10].
The linear model Y = Xβ + ε is specified with true coefficient vector β = (2, 1, −1, 3)′, where x2 is a discrete random variable such that x2 ~ Bernoulli(.5), observations on the pair of explanatory random variables (x3, x4) are generated as iid outcomes censored at the mean ±3 standard deviations, and outcomes of the disturbance term are defined as a transformation of U, where U ~ Uniform(0,1). The support points for the disturbance terms were specified as V = (−10, 0, 10)′ (recall C2 and C3) for all experiments. Three different sets of support points were specified for the β-vector, given by ZI, ZII, and ZIII (recall C1). The support points in ZI were chosen to be most favorable to the GME estimator, with the elements of the true β-vector located at the centers of their respective supports and the widths of the supports relatively narrow. The supports represented by ZII are tilted to the left of β1 and β2 and to the right of β3 and β4 by 1 unit, with the widths of the supports being the same as their counterparts in ZI. The last set of supports, represented by ZIII, are wider and effectively define an upper bound of 10 on the absolute values of each of the elements of β.
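The following sketch reproduces the experimental design in outline. The text leaves the exact distribution of (x3, x4), the uniform-to-disturbance transformation, and the numeric entries of the Z matrices unstated, so the normal draws, the centered-uniform error, the half-width of 2 for ZI, and the (−10, 0, 10) rows of ZIII below are explicit assumptions.

```python
# Sketch of the Monte Carlo design. True coefficients follow Table 2;
# items marked "assumed" are not specified in the text.
import numpy as np

rng = np.random.default_rng(42)
beta = np.array([2.0, 1.0, -1.0, 3.0])         # true beta (Table 2)

def draw_sample(n):
    x2 = rng.binomial(1, 0.5, size=n)          # x2 ~ Bernoulli(.5)
    x34 = rng.normal(size=(n, 2))              # assumed N(0,1) draws for (x3, x4)
    x34 = np.clip(x34, -3.0, 3.0)              # censored at the mean +/- 3 sd
    eps = 6.0 * (rng.uniform(size=n) - 0.5)    # assumed transform of U ~ Uniform(0,1)
    X = np.column_stack([np.ones(n), x2, x34])
    return X, X @ beta + eps

# Assumed support matrices consistent with the verbal description:
half = 2.0                                     # assumed half-width for Z_I
Z_I = np.array([[b - half, b, b + half] for b in beta])   # centered on true beta
tilt = np.array([-1.0, -1.0, 1.0, 1.0])        # left for beta1, beta2; right for beta3, beta4
Z_II = Z_I + tilt[:, None]                     # same widths, mis-centered by 1 unit
Z_III = np.tile([-10.0, 0.0, 10.0], (4, 1))    # wide supports bounding |beta_k| by 10
V = np.array([-10.0, 0.0, 10.0])               # error support used in all experiments

X, y = draw_sample(25)
```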
To explore the respective sizes of the various tests presented in Section 4, the hypothesis β2 = c was tested using the TZ test, and the joint hypothesis β2 = c and β3 = d was tested using the Wald, pseudo-likelihood, and Lagrange Multiplier tests, with c and d set equal to the true values of β2 and β3, i.e., c = 1 and d = −1. Critical values of the tests were based on their respective asymptotic distributions and a 0.05 level of significance. An observation on the power of the respective tests was obtained by performing a test of significance whereby c = d = 0 in the preceding hypotheses. All scenarios were analyzed using 10,000 Monte Carlo repetitions, and sample sizes of n = 25, 100, 400, and 1,600 were examined. In the course of calculating values of the test statistics, both unrestricted and restricted (by β2 = c and/or β3 = d) GME estimators needed to be calculated. Therefore, bias and mean square error measures relating to these and the least squares estimators were calculated as well. Monte Carlo results for the test statistics and for the unrestricted GME and OLS estimators are presented in Table 1 and Table 2, respectively, while results relating to the restricted GME and OLS estimators are presented in Table 3. Because the choice of which asymptotic covariance matrix to use in calculating the TZ and Wald tests was inconsequential, only the results for the second suggested covariance matrix representation are presented here.
Regarding properties of the test statistics, their behavior under a true H0 is consistent with the behavior expected from the respective asymptotic distributions when n is large (sample size of 1,600), their sizes being approximately 0.05 regardless of the choice of support for β. The sizes of the tests remain within 0.01 of their asymptotic size when n decreases to 400, except for the Lagrange Multiplier test under support ZII, which has a slightly larger size. Across all support choices, and ranging over all sample sizes from small to large, the sizes of the TZ and Wald tests remain in the 0 to 0.10 range; for ZI supports and small sample sizes, the sizes of the tests are substantially less than 0.05. Results were similar for the pseudo-likelihood and Lagrange Multiplier tests, except for the cases of ZII support and n ≤ 100, where the size of the test increased to as high as 0.36 for the pseudo-likelihood test and 0.73 for the Lagrange Multiplier test when n = 25.
Table 1. Rejection Probabilities for True and False Hypotheses.
| Supports | Tz: β2 = 1 | Tz: β2 = 0 | Wald: β2 = 1, β3 = −1 | Wald: β2 = 0, β3 = 0 | Pseudo-Likelihood: β2 = 1, β3 = −1 | Pseudo-Likelihood: β2 = 0, β3 = 0 | Lagrange Multiplier: β2 = 1, β3 = −1 | Lagrange Multiplier: β2 = 0, β3 = 0 |
|---|---|---|---|---|---|---|---|---|
| ZI | | | | | | | | |
| n = 25 | 0.000 | 0.825 | 0.004 | 0.998 | 0.021 | 1.000 | 0.059 | 1.000 |
| n = 100 | 0.017 | 0.999 | 0.022 | 1.000 | 0.038 | 1.000 | 0.056 | 1.000 |
| n = 400 | 0.041 | 1.000 | 0.042 | 1.000 | 0.048 | 1.000 | 0.053 | 1.000 |
| n = 1600 | 0.047 | 1.000 | 0.046 | 1.000 | 0.049 | 1.000 | 0.050 | 1.000 |
| ZII | | | | | | | | |
| n = 25 | 0.101 | 0.047 | 0.080 | 0.894 | 0.357 | 0.980 | 0.734 | 0.995 |
| n = 100 | 0.085 | 0.996 | 0.067 | 1.000 | 0.114 | 1.000 | 0.172 | 1.000 |
| n = 400 | 0.053 | 1.000 | 0.048 | 1.000 | 0.058 | 1.000 | 0.066 | 1.000 |
| n = 1600 | 0.052 | 1.000 | 0.052 | 1.000 | 0.055 | 1.000 | 0.057 | 1.000 |
| ZIII | | | | | | | | |
| n = 25 | 0.038 | 0.670 | 0.070 | 0.967 | 0.097 | 0.980 | 0.088 | 0.972 |
| n = 100 | 0.045 | 0.999 | 0.050 | 1.000 | 0.057 | 1.000 | 0.052 | 1.000 |
| n = 400 | 0.045 | 1.000 | 0.050 | 1.000 | 0.051 | 1.000 | 0.050 | 1.000 |
| n = 1600 | 0.051 | 1.000 | 0.051 | 1.000 | 0.052 | 1.000 | 0.051 | 1.000 |
The powers of the tests were all substantial in rejecting false null hypotheses except for the TZ test in the case of ZII support and the smallest sample size, the latter result being indicative of a notably biased test. Overall, the choice of support did impact the power of tests for rejecting the errant hypotheses, although the effect was small for all but the TZ test.
In the case of unrestricted estimators and the most favorable support choice (ZI), the GME estimator dominated the OLS estimator in terms of MSE, and the GME superiority was substantial for sample sizes of n ≤ 100 (Table 2). The GME-ZI estimator and, of course, the OLS estimator were unbiased, with the GME-ZI estimator exhibiting substantially smaller variances for smaller n. The choice of support has a significant effect on the bias and MSE of the GME estimator for small sample sizes. Neither the GME-ZII nor the GME-ZIII estimator dominates the OLS estimator, although the GME-ZIII estimator is generally the better estimator across the various sample sizes. When n = 25, the GME-ZII estimator offers notable improvement over OLS for estimating three of the four elements of β, but is significantly worse for estimating β2. For larger sample sizes, the GME-ZII estimator is generally inferior to the OLS estimator. Although the centers of the ZIII supports are on average further from the true β's than are the centers of the ZII supports, the wider widths of the former result in a superior GME estimator.
The results for the restricted GME estimators in Table 3 indicate that, under the errant constraints β2 = β3 = 0, the GME estimator dominates the OLS estimator for all sample sizes and all support choices. The superiority of the GME estimator is substantial for smaller sample sizes, but dissipates as the sample size increases. The results suggest a misspecification robustness of the GME estimator that deserves further investigation.
Table 2. Mean and Mean Square Error Measures for Unrestricted Estimators.
| Estimator | Mean (β1 = 2) | MSE | Mean (β2 = 1) | MSE | Mean (β3 = −1) | MSE | Mean (β4 = 3) | MSE |
|---|---|---|---|---|---|---|---|---|
| GME-ZI | | | | | | | | |
| n = 25 | 2.000 | 0.015 | 1.001 | 0.038 | −1.001 | 0.028 | 3.000 | 0.006 |
| n = 100 | 2.003 | 0.034 | 1.003 | 0.026 | −1.000 | 0.011 | 2.999 | 0.004 |
| n = 400 | 2.000 | 0.032 | 1.001 | 0.009 | −1.000 | 0.003 | 3.000 | 0.002 |
| n = 1600 | 2.000 | 0.014 | 1.000 | 0.002 | −1.000 | 0.001 | 3.000 | 0.001 |
| GME-ZII | | | | | | | | |
| n = 25 | 1.022 | 0.977 | 0.484 | 0.309 | −0.840 | 0.058 | 3.182 | 0.040 |
| n = 100 | 1.306 | 0.519 | 0.826 | 0.056 | −0.966 | 0.013 | 3.139 | 0.023 |
| n = 400 | 1.672 | 0.141 | 0.960 | 0.010 | −0.996 | 0.003 | 3.066 | 0.006 |
| n = 1600 | 1.892 | 0.026 | 0.991 | 0.002 | −1.000 | 0.001 | 3.022 | 0.001 |
| GME-ZIII | | | | | | | | |
| n = 25 | 1.278 | 0.757 | 0.946 | 0.131 | −0.881 | 0.069 | 3.092 | 0.028 |
| n = 100 | 1.709 | 0.252 | 0.995 | 0.037 | −0.978 | 0.014 | 3.046 | 0.011 |
| n = 400 | 1.914 | 0.068 | 0.999 | 0.010 | −0.996 | 0.003 | 3.015 | 0.003 |
| n = 1600 | 1.978 | 0.017 | 0.999 | 0.002 | −0.999 | 0.001 | 3.004 | 0.001 |
| OLS | | | | | | | | |
| n = 25 | 1.997 | 1.342 | 1.002 | 0.181 | −1.002 | 0.066 | 3.001 | 0.065 |
| n = 100 | 2.009 | 0.283 | 1.003 | 0.041 | −1.000 | 0.014 | 2.998 | 0.014 |
| n = 400 | 2.001 | 0.068 | 1.001 | 0.010 | −1.000 | 0.003 | 3.000 | 0.003 |
| n = 1600 | 2.000 | 0.017 | 1.000 | 0.003 | −1.000 | 0.001 | 3.000 | 0.001 |
Table 3. Mean and Mean Square Error Measures for Restricted Estimators under the Errant Restriction β2 = β3 = 0.
| Estimator | Mean (β1 = 2) | MSE | Mean (β4 = 3) | MSE |
|---|---|---|---|---|
| GME-ZI | | | | |
| n = 25 | 2.078 | 0.041 | 2.681 | 0.011 |
| n = 100 | 2.340 | 0.191 | 2.630 | 0.142 |
| n = 400 | 2.689 | 0.537 | 2.600 | 0.196 |
| n = 1600 | 2.898 | 0.832 | 2.520 | 0.232 |
| GME-ZII | | | | |
| n = 25 | 1.064 | 0.915 | 2.885 | 0.018 |
| n = 100 | 1.603 | 0.234 | 2.772 | 0.056 |
| n = 400 | 2.330 | 0.169 | 2.630 | 0.140 |
| n = 1600 | 2.776 | 0.628 | 2.543 | 0.210 |
| GME-ZIII | | | | |
| n = 25 | 1.686 | 0.589 | 2.750 | 0.084 |
| n = 100 | 2.468 | 0.542 | 2.601 | 0.172 |
| n = 400 | 2.842 | 0.823 | 2.530 | 0.225 |
| n = 1600 | 2.958 | 0.948 | 2.508 | 0.243 |
| OLS | | | | |
| n = 25 | 3.011 | 3.342 | 2.497 | 0.342 |
| n = 100 | 3.013 | 1.575 | 2.497 | 0.274 |
| n = 400 | 3.005 | 1.138 | 2.499 | 0.256 |
| n = 1600 | 2.999 | 1.030 | 2.500 | 0.251 |
Asymmetric Error Supports
We present further Monte Carlo simulations to show that regularity condition R2, which assumes symmetry of the disturbance distribution, is not a necessary condition for identification of the GME slope parameters. It is demonstrated below that if the supports of the error distribution are asymmetric, then only the intercept term of the GME regression estimator is asymptotically biased.
The Monte Carlo experiments that follow are identical to those above except for the specification of the user-supplied support points for the error terms and the underlying true error distribution. To illustrate the impact of asymmetric errors, experiments are based on one set of support points symmetric about zero, VI, and two sets of support points not symmetric about zero, VII and VIII. The support VII is a simple translation of VI by five positive units, retaining symmetry about 5. The asymmetric support VIII translates the truncation points by five positive units, but retains the center support point of 0. The true error distribution is generated in two ways: a symmetric distribution specified as a N(0,1) distribution truncated at (−3,3), and an asymmetric distribution specified as a Beta(3,2) distribution translated and scaled from support (0,1) to (−3,3), with mean 0.6. Supports on the coefficients are retained as ZI, providing support points symmetric about the true coefficient values.
Monte Carlo experiments presented in Table 4 and Table 5 are generated for sample sizes 25, 100, and 400, with 1,000 replications for each sample size. Consider first the case where the true distribution is symmetric about zero. Slope coefficients for error supports that are not symmetric about zero appear biased in smaller sample sizes. However, the bias and MSE of the slope coefficients decrease as the sample size increases. Next, suppose the true distribution is asymmetric. For both symmetric and asymmetric supports, only the intercept terms are persistently biased, diverging from the true parameter values as the sample size increases. These results demonstrate the robustness of the GME slope coefficients to asymmetric error distributions and user-supplied supports.
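A sketch of the modified design follows. The numeric entries of VI are not given in the text, so the (−10, 0, 10) support below, matching the earlier experiments, is an assumption; VII and VIII are then derived exactly as described, and the two error distributions follow the stated truncated-normal and rescaled-Beta specifications.

```python
# Sketch of the asymmetric-error-support experiment.
import numpy as np

rng = np.random.default_rng(7)
V_I = np.array([-10.0, 0.0, 10.0])      # assumed symmetric support about 0
V_II = V_I + 5.0                        # translated by +5, symmetric about 5
V_III = np.array([-5.0, 0.0, 15.0])     # endpoints shifted +5, center kept at 0

def eps_symmetric(n):
    """N(0,1) truncated at (-3, 3), drawn by rejection."""
    e = rng.normal(size=n)
    bad = np.abs(e) >= 3.0
    while bad.any():                    # redraw any value outside the bounds
        e[bad] = rng.normal(size=bad.sum())
        bad = np.abs(e) >= 3.0
    return e

def eps_asymmetric(n):
    """Beta(3,2) rescaled from (0,1) to (-3,3); mean = -3 + 6*0.6 = 0.6."""
    return -3.0 + 6.0 * rng.beta(3.0, 2.0, size=n)
```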
Table 4. Mean and MSE of 1,000 Monte Carlo Simulations with True Distribution Symmetric. Symmetric and Asymmetric Error Supports and Coefficient Support ZI.
| Estimator | E(β1) (β1 = 2) | MSE | E(β2) (β2 = 1) | MSE | E(β3) (β3 = −1) | MSE | E(β4) (β4 = 3) | MSE |
|---|---|---|---|---|---|---|---|---|
| GME-ZI,VI | | | | | | | | |
| n = 25 | 2.002 | 0.016 | 1.003 | 0.042 | −1.000 | 0.030 | 2.997 | 0.007 |
| n = 100 | 2.000 | 0.033 | 1.001 | 0.026 | −1.002 | 0.011 | 3.002 | 0.004 |
| n = 400 | 2.000 | 0.035 | 1.001 | 0.010 | −0.998 | 0.003 | 2.999 | 0.002 |
| GME-ZI,VII | | | | | | | | |
| n = 25 | 1.259 | 0.585 | 0.815 | 0.101 | −1.009 | 0.048 | 2.209 | 0.636 |
| n = 100 | 0.208 | 3.258 | 0.804 | 0.071 | −0.944 | 0.020 | 2.381 | 0.389 |
| n = 400 | −1.144 | 9.903 | 0.868 | 0.028 | −0.959 | 0.005 | 2.640 | 0.132 |
| GME-ZI,VIII | | | | | | | | |
| n = 25 | 1.506 | 0.271 | 0.875 | 0.069 | −1.005 | 0.038 | 2.476 | 0.282 |
| n = 100 | 0.752 | 1.598 | 0.875 | 0.045 | −0.961 | 0.015 | 2.602 | 0.163 |
| n = 400 | −0.235 | 5.024 | 0.925 | 0.015 | −0.977 | 0.004 | 2.794 | 0.044 |
| OLS | | | | | | | | |
| n = 25 | 2.014 | 1.321 | 1.007 | 0.204 | −0.998 | 0.069 | 2.993 | 0.065 |
| n = 100 | 1.999 | 0.280 | 1.001 | 0.042 | −1.002 | 0.014 | 3.002 | 0.014 |
| n = 400 | 2.001 | 0.075 | 1.001 | 0.011 | −0.997 | 0.003 | 2.999 | 0.003 |
Table 5. Mean and MSE of 1,000 Monte Carlo Simulations with True Distribution Asymmetric. Symmetric and Asymmetric Error Supports and Coefficient Support ZI.
| Estimator | E(β1) (β1 = 2) | MSE | E(β2) (β2 = 1) | MSE | E(β3) (β3 = −1) | MSE | E(β4) (β4 = 3) | MSE |
|---|---|---|---|---|---|---|---|---|
| GME-ZI,VI | | | | | | | | |
| n = 25 | 2.089 | 0.031 | 1.038 | 0.060 | −1.005 | 0.041 | 3.094 | 0.018 |
| n = 100 | 2.233 | 0.108 | 1.023 | 0.033 | −1.006 | 0.016 | 3.071 | 0.010 |
| n = 400 | 2.427 | 0.229 | 1.015 | 0.012 | −1.004 | 0.005 | 3.033 | 0.004 |
| GME-ZI,VII | | | | | | | | |
| n = 25 | 1.358 | 0.449 | 0.843 | 0.103 | −1.021 | 0.057 | 2.305 | 0.496 |
| n = 100 | 0.410 | 2.583 | 0.826 | 0.073 | −0.966 | 0.019 | 2.463 | 0.294 |
| n = 400 | −0.860 | 8.209 | 0.890 | 0.025 | −0.966 | 0.006 | 2.700 | 0.092 |
| GME-ZI,VIII | | | | | | | | |
| n = 25 | 1.597 | 0.190 | 0.905 | 0.075 | −1.019 | 0.049 | 2.574 | 0.193 |
| n = 100 | 0.964 | 1.129 | 0.889 | 0.055 | −0.967 | 0.020 | 2.674 | 0.112 |
| n = 400 | 0.126 | 3.553 | 0.946 | 0.016 | −0.981 | 0.005 | 2.835 | 0.030 |
| OLS | | | | | | | | |
| n = 25 | 2.600 | 2.324 | 1.041 | 0.261 | −1.009 | 0.097 | 2.998 | 0.099 |
| n = 100 | 2.616 | 0.813 | 1.001 | 0.052 | −0.999 | 0.020 | 2.997 | 0.021 |
| n = 400 | 2.610 | 0.471 | 1.003 | 0.013 | −1.000 | 0.005 | 2.997 | 0.005 |
6. Further Results
Unbiased GME Estimation. It is apparent from the proof of the theorem in Section 3 that the entropy terms associated with the parameter weights are asymptotically uninformative. It is instructive to note that if these terms are deleted from the GME objective function and the resulting objective function is then maximized through choosing b and w subject to constraints C2–C4 and C6, the resulting GME estimator is in fact unbiased for estimating β. This follows because the ε_i's are iid, mean zero, and symmetrically distributed around zero, and the new estimator, say b̃, is such that b̃ − β is a symmetric function of the ε_i's.
Bayesian Analogues. As pointed out by [35], maximum entropy methods can be motivated as an empirical Bayes rule. We expand on their analogy by noting a strong formal parallel to the traditional Bayesian framework of inference. In particular, one can view the entropy of the parameter weights as the maximum entropy analogue to the log of a non-normalized Bayesian prior, and the entropy of the error weights as the maximum entropy analogue to the non-normalized log of the probability density kernel or log-likelihood function. For any given set of support points Z and V, we can define corresponding density functions by:

Then, for the implied error density, the maximum likelihood estimator of β maximizes the associated log-likelihood, and if one adds the implied prior for β, the resulting estimator is the Bayesian posterior mode estimator of β. We note the following consequences of these equivalences. First, if the support points can be chosen so that the implied error density is very close to the true distribution of ε, then the GME estimator should be nearly asymptotically efficient. Second, in finite samples the prior information influences the estimator, so that it is generally not unbiased. Third, the support points used in the GME estimator have no particular relationship to the points of support of the distribution of a discrete random variable; the implied distributions are absolutely continuous for any choice of Z and V.
The previous Monte Carlo results illustrate the Bayesian-like character of the maximum entropy approach. The GME estimator with reasonably narrow supports centered on the true values of β dominated the OLS estimator and was sometimes far better. On the other hand, the GME estimator performed poorly when the supports were similarly narrow and mis-centered by only one-eighth of the range of the supports. In the latter case, mean squared errors were often much worse than OLS and biases were often substantial. Finally, results for wider supports, even though they were the most mis-centered of the cases examined, were quite similar to OLS results for moderate to large sample sizes, and provided some improvement over OLS for small samples.
Finally, the GME approach is a special case of generalized cross entropy, which incorporates a reference probability distribution over the support points. This allows a direct method of including prior information, akin to a Bayesian framework, although in a classical sense the empirical estimation strategies are inherently different.
GME Calculation Method. The conditional maximum entropy formulation (2) utilized in the proof of the asymptotic results provides the basis for a computationally efficient method of obtaining GME estimates. In particular, maximizing F(τ) through choice of τ involves a nonlinear search over a vector of relatively low dimension (K), as opposed to searching over the (KM + NJ)-dimensional space of (p,w) values. In the process of concentrating the objective function, note that the needed Lagrange multiplier functions can be expressed as elementary functions for three or fewer support points, and still exist in closed form (using inverse hyperbolic functions) for support vectors having five elements. As a point of comparison, the calculation of GME estimates in the Monte Carlo experiment with N = 1,600 was completed in a matter of seconds on a 133 MHz personal computer. Such a calculation would be far more burdensome, if feasible at all, in the space of (p,w) values. We note further that the dual algorithm of [10] would still involve a search over a space of dimension N = 1,600, which would be infeasible here and in other problems in which the number of data points is large.
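The sketch below illustrates this concentrated strategy under simplifying assumptions: a symmetric three-point error support, for which the inner error problem reduces to the closed-form root of a quadratic equation, a common three-point coefficient support handled by a one-dimensional root solve, and an illustrative simulated dataset. None of the specific numbers come from the paper.

```python
# Concentrated GME: search over the K-vector tau only, solving the inner
# entropy problems analytically (errors) or by 1-D root finding (coefficients).
import numpy as np
from scipy.optimize import minimize, brentq

rng = np.random.default_rng(1)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, K - 1))])
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.uniform(-1, 1, size=N)   # bounded disturbances

c = 5.0                                  # assumed error support (-c, 0, c)
Z = np.tile([-5.0, 0.0, 5.0], (K, 1))    # assumed common coefficient support

def w_entropy(e):
    """Entropy of the optimal w for support (-c, 0, c): with a = exp(lambda*c),
    the mean constraint reduces to (1-m)a^2 - m*a - (1+m) = 0, m = e/c."""
    m = e / c
    a = (m + np.sqrt(4.0 - 3.0 * m ** 2)) / (2.0 * (1.0 - m))
    w = np.vstack([1.0 / a, np.ones_like(a), a]) / (1.0 / a + 1.0 + a)
    return -(w * np.log(w)).sum(axis=0)

def p_entropy(tau_k, z):
    """Entropy of the optimal p_k: find eta such that sum_m p_m z_m = tau_k."""
    def p_of(eta):
        s = eta * z
        w = np.exp(s - s.max())          # numerically stabilized weights
        return w / w.sum()
    eta = brentq(lambda t: p_of(t) @ z - tau_k, -50.0, 50.0)
    p = p_of(eta)
    return float(-(p * np.log(p)).sum())

def neg_F(tau):
    """Negative of the conditional maximum value function F(tau)."""
    if np.any(tau <= Z[:, 0]) or np.any(tau >= Z[:, -1]):
        return np.inf                    # tau must stay inside its support
    e = y - X @ tau
    if np.any(np.abs(e) >= c):
        return np.inf                    # residuals must stay inside (-c, c)
    return -(w_entropy(e).sum() + sum(p_entropy(t, z) for t, z in zip(tau, Z)))

res = minimize(neg_F, np.zeros(K), method="Nelder-Mead")
print("concentrated GME estimate:", res.x)
```

For three-point error supports, the quadratic-root step replaces a general root solve for each observation, which is what keeps the search over τ alone cheap even when N is large.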