Open Access: this article is freely available and re-usable.

*Entropy* **2009**, *11*(4), 917-930; https://doi.org/10.3390/e11040917

Article

A Weighted Generalized Maximum Entropy Estimator with a Data-driven Weight

Department of Agricultural Economics, Texas A&M University, College Station, TX 77843-2124, USA

Received: 24 September 2009 / Accepted: 16 November 2009 / Published: 26 November 2009

## Abstract

The method of Generalized Maximum Entropy (GME), proposed in Golan, Judge and Miller (1996), is an information-theoretic approach that is robust to multicollinearity. It uses an objective function that is the sum of the entropies of the coefficient distributions and the disturbance distributions. This method can be generalized to the weighted GME (W-GME), where different weights are assigned to the two entropies in the objective function. We propose a data-driven method to select the weights in the entropy objective function, using least squares cross validation to derive the optimal weights. Monte Carlo simulations demonstrate that the proposed W-GME estimator is comparable to and often outperforms the conventional GME estimator, which places equal weights on the entropies of the coefficient and disturbance distributions.

Keywords: maximum entropy; generalized maximum entropy method; cross validation

## 1. Introduction

Jaynes’ Principle of Maximum Entropy provides a method of constructing distributions based on limited information. This approach, and its generalization through minimization of the cross entropy by Kullback, Leibler and others, have found widespread application in various fields of science. See, for example, [1] and references therein. As illustrated by the famous Jaynes’ die problem, this principle provides a solution to the “ill-posed” inverse problem. Golan, Judge and Miller ([2], GJM henceforth) generalize this method to the regression framework. In particular, they reparameterize the coefficients and disturbances in a linear regression model as discrete random variables on bounded supports. The sum of the entropies of the distributions of the coefficients and disturbances is maximized subject to model consistency constraints. The coefficients of interest are then calculated as the expectations of random variables on the prescribed supports under the distributions derived from the entropy maximization. They further generalize this so-called Generalized Maximum Entropy (GME) method to a weighted one, in which different weights are assigned to the entropies of the coefficient and disturbance distributions.

Although the specifications of the coefficient and disturbance supports can be guided by non-sample information and preliminary estimates, there is no clear guidance on how to select the weights placed on the entropies of the coefficient and disturbance distributions. In this study, we propose a data-driven method of selecting this weight parameter, which balances the two components in the entropy maximization objective function in an automatic, objective manner. We use least squares cross validation in our implementation of the proposed method. The results are shown to improve on the conventional GME estimator under various scenarios.

## 2. Generalized Maximum Entropy Estimator

In this section, we briefly review the literature on information entropy, the principle of maximum entropy and its applications to possibly ill-posed inverse problems. We then discuss the generalized maximum entropy estimator for linear regressions and its statistical properties.

#### 2.1. The ME principle

Let X be a random variable with possible outcome values $x_k, k=1,\cdots,K$ and probabilities $p_k$ such that $\sum_{k=1}^{K} p_k = 1$. Reference [3] defined the information entropy of the distribution of probabilities, $\mathit{p}=\{p_k\}_{k=1}^{K}$, as the measure

$$H(\mathit{p})=-\sum_{k=1}^{K} p_k \log p_k$$

where $0\log 0=0$. The entropy measures the uncertainty of a distribution and reaches a maximum when $p_1=p_2=\cdots=p_K=1/K$ or, in other words, when the probabilities are uniform.
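As a small illustration of the entropy measure defined above, the following Python sketch evaluates $H(\mathit{p})$ for a few distributions on $K=4$ outcomes. The function name `entropy` and the example distributions are illustrative choices and do not come from the article.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_k p_k log p_k, using the convention 0 log 0 = 0."""
    return sum(-pk * math.log(pk) for pk in p if pk > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
degenerate = [1.0, 0.0, 0.0, 0.0]

print(entropy(uniform))     # equals log(4), the maximum for K = 4
print(entropy(skewed))      # strictly smaller than log(4)
print(entropy(degenerate))  # zero: the outcome is certain
```

The uniform distribution attains the upper bound $\log K$, and the degenerate distribution attains the lower bound of zero.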

Reference [4] proposed using the entropy concept in choosing the unknown distribution of probabilities. Under what Jaynes called the maximum entropy principle, one chooses the distribution for which the information (data) is just sufficient to determine the probability assignment. More precisely, one chooses the distribution, among those distributions consistent with known information, that maximizes the entropy. This maximum entropy formulation that is based on the work of [3] and [4] has been extended by [5], [6] and many others who are identified in the collection of papers in [7]. Axiomatic arguments for the justification of the ME principle have been made by [1], [8], [9] and [10]. See GJM for an in-depth review of this literature.

Suppose that $E\left[X\right]=y$. According to the ME principle, one can construct a density of X by maximizing

$$H(\mathit{p})=-\mathit{p}'\log\mathit{p}$$

subject to the data consistency and normalization-additivity requirements

$$\begin{aligned} y &= \mathit{X}'\mathit{p}\\ \mathit{p}'\mathbf{1} &= 1 \end{aligned}$$

where $\mathit{X},\mathit{p}$ are $K\times 1$ vectors, and $\mathbf{1}$ is a $K\times 1$ vector of ones. The analytical solution to the entropy maximization problem can be obtained by the Lagrangian function

$$\mathcal{L}=-\mathit{p}'\log\mathit{p}+\lambda(y-\mathit{X}'\mathit{p})+\mu(1-\mathit{p}'\mathbf{1})$$

with optimality conditions

$$\begin{aligned} \partial\mathcal{L}/\partial\mathit{p} &= -\log\widehat{\mathit{p}}-\mathbf{1}-\mathit{X}'\widehat{\lambda}-\widehat{\mu}=\mathit{0}\\ \partial\mathcal{L}/\partial\lambda &= y-\mathit{X}'\widehat{\mathit{p}}=0\\ \partial\mathcal{L}/\partial\mu &= 1-\widehat{\mathit{p}}'\mathbf{1}=0 \end{aligned}$$

We can then solve for $\widehat{\mathit{p}}$ in terms of $\widehat{\lambda}$ to get

$$\widehat{\mathit{p}}=\exp(-\mathit{X}'\widehat{\lambda})/\Omega(\widehat{\lambda}) \tag{1}$$

where

$$\Omega(\widehat{\lambda})=\sum_{k}\exp(-x_k'\widehat{\lambda})$$

is a normalization factor that converts the relative probabilities into absolute probabilities.

Solution (1) establishes a unique non-linear relation between $\widehat{\mathit{p}}$ and y through $\widehat{\lambda}$. Unlike conventional regression methods such as the least squares estimator, the ME method can be used for inferences in the so-called ill-posed problem. For instance, let us look at the famous Jaynes’ die problem. Suppose that one is given a six-sided die that can take on the values $k=1,2,\cdots,6$, and asked to estimate the probabilities for each possible outcome given that the average outcome from a large number of independent rolls of the die was y. The ME formulation of this problem is as follows:

$$\max H(\mathit{p})=-\sum_{k=1}^{6} p_k \log p_k$$

subject to

$$\begin{aligned} \sum_{k=1}^{6} p_k k &= y\\ \sum_{k=1}^{6} p_k &= 1 \end{aligned}$$

This is an inverse problem with one observation (the mean) and six unknowns and thus clearly ill-posed. Using the ME framework, one is able to assign unique probability to each possible outcome. For example, when the average outcome is 3.5, the ME method assigns equal weights to all six outcomes. If the average outcome is larger/smaller than 3.5, the ME method “tilts” the distribution smoothly such that the weight to each side of the die increases/decreases with number of dots on it.
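The die problem described above can be solved numerically in a few lines. The sketch below is one possible approach, not the article's code: it uses the exponential form $p_k \propto \exp(-k\lambda)$ and finds $\lambda$ by bisection, which is an assumed solver that works because the implied mean is strictly decreasing in $\lambda$. The function names and the bracketing interval are hypothetical.

```python
import math

def die_probs(lam, faces=range(1, 7)):
    """ME probabilities p_k = exp(-k * lam) / Omega(lam) for a given multiplier."""
    weights = [math.exp(-k * lam) for k in faces]
    omega = sum(weights)  # normalization factor Omega(lam)
    return [w / omega for w in weights]

def mean_outcome(p, faces=range(1, 7)):
    return sum(k * pk for k, pk in zip(faces, p))

def solve_die(y, lo=-50.0, hi=50.0, tol=1e-12):
    """Find lam such that the ME distribution has mean y, by bisection."""
    if not 1.0 < y < 6.0:
        raise ValueError("the mean must lie strictly between 1 and 6")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_outcome(die_probs(mid)) > y:
            lo = mid  # mean too high: a larger lam tilts weight to low faces
        else:
            hi = mid
    return die_probs(0.5 * (lo + hi))

fair = solve_die(3.5)    # uniform probabilities
tilted = solve_die(4.5)  # probabilities increase with the number of dots
print([round(pk, 4) for pk in fair])
print([round(pk, 4) for pk in tilted])
```

With a mean of 3.5 the multiplier is zero and all faces receive equal probability; with a mean of 4.5 the probabilities rise monotonically from face 1 to face 6.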

#### 2.2. The GME estimator

GJM generalize the ME solution to the inverse problem to the regression framework. Consider the linear model

$$\mathit{y}=\mathit{X}\mathit{\beta}+\mathit{e} \tag{2}$$

where **y** is a T-dimensional vector of observables, **X** is a $T\times K$ design matrix, and **β** is a K-dimensional vector of unknown parameters. The unobservable disturbance vector **e** may represent one or more sources of noise in the observed system, including sample and non-sample errors in the data, randomness in the behavior of the economic agents, and specification or modeling errors.

GJM reparameterize model (2) such that **β** are represented by expectations of random variables with compact supports. In particular, one can parameterize $\beta_k$ as a discrete random variable with a compact support and M possible outcomes $\mathit{z}_k=[z_{k1},\cdots,z_{kM}]'$, where $2\le M<\infty$, and $z_{k1}$ and $z_{kM}$ are the plausible extreme values (lower and upper bounds) of $\beta_k$. We can express $\beta_k$ as a convex combination

$$\beta_k=\mathit{z}_k'\mathit{p}_k$$

where $\mathit{p}_k=[p_{k1},\cdots,p_{kM}]'$ is an M-dimensional vector of positive weights that sum to one. Further, these convex combinations may be assembled in matrix form so that **β** may be written as

$$\mathit{\beta}=\mathit{Z}\mathit{p}=\left[\begin{array}{cccc}\mathit{z}_1' & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathit{z}_2' & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathit{z}_K'\end{array}\right]\left[\begin{array}{c}\mathit{p}_1\\ \mathit{p}_2\\ \vdots\\ \mathit{p}_K\end{array}\right]$$

where **Z** is a $K\times KM$ matrix and **p** is a $KM$-dimensional vector of weights.

Further assuming that **e** is a random vector with finite location and scale parameters, one can represent one's uncertainty about the outcome of the error process by representing each $e_t$ as a finite and discrete random variable with $2\le J<\infty$ possible outcomes. Suppose that there exist sets of error bounds, $v_{t1}$ and $v_{tJ}$, for each $e_t$ so that $1-Pr[v_{t1}<e_t<v_{tJ}]$ may be made arbitrarily small. One can then write

$$e_t=\mathit{v}_t'\mathit{w}_t$$

where $\mathit{v}_t=[v_{t1},\cdots,v_{tJ}]'$ is a finite support for $e_t$, and $\mathit{w}_t=[w_{t1},\cdots,w_{tJ}]'$ is a J-dimensional vector of positive weights that sum to one. The T unknown disturbances may be written in matrix form as

$$\mathit{e}=\mathit{V}\mathit{w}=\left[\begin{array}{cccc}\mathit{v}_1' & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathit{v}_2' & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathit{v}_T'\end{array}\right]\left[\begin{array}{c}\mathit{w}_1\\ \mathit{w}_2\\ \vdots\\ \mathit{w}_T\end{array}\right]$$

where **V** is a $T\times TJ$ matrix and **w** is a $TJ$-dimensional vector of weights, which are strictly positive and sum to one for each t.

Using the reparameterized unknowns, $\mathit{\beta}=\mathit{Z}\mathit{p}$ and $\mathit{e}=\mathit{V}\mathit{w}$, one can rewrite model (2) as

$$\mathit{y}=\mathit{X}\mathit{\beta}+\mathit{e}=\mathit{X}\mathit{Z}\mathit{p}+\mathit{V}\mathit{w}$$

The Generalized Maximum Entropy (GME) estimator is then defined by

$$\max H(\mathit{p},\mathit{w})=-\mathit{p}'\log\mathit{p}-\mathit{w}'\log\mathit{w} \tag{3}$$

subject to

$$\begin{aligned} \mathit{y} &= \mathit{X}\mathit{Z}\mathit{p}+\mathit{V}\mathit{w}\\ \mathbf{1}_K &= (\mathit{I}_K\otimes\mathbf{1}_M')\mathit{p}\\ \mathbf{1}_T &= (\mathit{I}_T\otimes\mathbf{1}_J')\mathit{w} \end{aligned}$$

This optimization problem can be solved using the Lagrangian method. The Lagrangian equation takes the form

$$\mathcal{L}=H(\mathit{p},\mathit{w})+\mathit{\lambda}'[\mathit{y}-\mathit{X}\mathit{Z}\mathit{p}-\mathit{V}\mathit{w}]+\mathit{\theta}'[\mathbf{1}_K-(\mathit{I}_K\otimes\mathbf{1}_M')\mathit{p}]+\mathit{\tau}'[\mathbf{1}_T-(\mathit{I}_T\otimes\mathbf{1}_J')\mathit{w}]$$

where $\mathit{\lambda},\mathit{\theta},\mathit{\tau}$ are $T\times 1,K\times 1,T\times 1$ vectors of Lagrangian multipliers respectively. Solving the first order conditions yields

$$\begin{aligned} \widehat{p}_{km} &= \frac{\exp(-z_{km}X_k'\widehat{\mathit{\lambda}})}{\Omega_k(\widehat{\mathit{\lambda}})}\\ \widehat{w}_{tj} &= \frac{\exp(-v_{tj}\widehat{\lambda}_t)}{\Psi_t(\widehat{\mathit{\lambda}})} \end{aligned}$$

where

$$\begin{aligned} \Omega_k(\widehat{\mathit{\lambda}}) &= \sum_{m=1}^{M}\exp(-z_{km}X_k'\widehat{\mathit{\lambda}})\\ \Psi_t(\widehat{\mathit{\lambda}}) &= \sum_{j=1}^{J}\exp(-v_{tj}\widehat{\lambda}_t) \end{aligned}$$

with $X_k$ the $k^{th}$ column of **X** and $\widehat{\lambda}_t$ the $t^{th}$ element of $\widehat{\mathit{\lambda}}$. Furthermore, this constrained optimization problem can be rewritten as an unconstrained one, in which the objective function takes the form

$$\mathcal{L}(\mathit{\lambda})=\mathit{y}'\mathit{\lambda}+\sum_{k=1}^{K}\log(\Omega_k(\mathit{\lambda}))+\sum_{t=1}^{T}\log(\Psi_t(\mathit{\lambda}))\equiv\mathcal{M}(\mathit{\lambda})$$

The minimal value function, $\mathcal{M}(\mathit{\lambda})$, may be interpreted as a constrained expected log-likelihood function. This dual version of the GME problem simplifies the estimation considerably. The analytical gradient of the dual problem

$$\nabla_{\lambda}\mathcal{M}(\mathit{\lambda})=\mathit{y}-\mathit{X}\mathit{Z}\mathit{p}-\mathit{V}\mathit{w}$$

is simply the model consistency constraint. The Hessian matrix of $\mathcal{M}(\mathit{\lambda})$ takes the form

$$\nabla_{\lambda\lambda'}\mathcal{M}(\mathit{\lambda})=-\mathit{X}\mathit{Z}\nabla_{\lambda'}\mathit{p}(\mathit{\lambda})-\mathit{V}\nabla_{\lambda'}\mathit{w}(\mathit{\lambda})=\mathit{X}\Sigma_{\mathit{Z}}(\mathit{\lambda})\mathit{X}'+\Sigma_{\mathit{V}}(\mathit{\lambda}) \tag{5}$$

where $\Sigma_{\mathit{Z}}(\mathit{\lambda})$ and $\Sigma_{\mathit{V}}(\mathit{\lambda})$ are covariance matrices for distributions $\mathit{p}(\mathit{\lambda})$ and $\mathit{w}(\mathit{\lambda})$ respectively. Both covariance matrices are strictly positive definite for any interior solution, $(\widehat{\mathit{p}},\widehat{\mathit{w}})$, which ensures the uniqueness of the solution.
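The dual formulation lends itself to a compact numerical sketch. The Python code below is an illustrative implementation under stated assumptions, not the article's code: every coefficient shares one support vector `z`, every disturbance shares one support vector `v`, and the dual is coded with the sign convention under which its gradient equals the consistency residual $\mathit{y}-\mathit{X}\mathit{Z}\mathit{p}-\mathit{V}\mathit{w}$. The damped Newton solver, the function names, and the bounded uniform noise in the usage example (chosen so that the toy problem stays feasible under a $3\widehat{\sigma}$ error support) are all assumptions made for illustration.

```python
import numpy as np

def softmax_lse(scores):
    """Row-wise softmax and the sum of row-wise log-sum-exp terms, computed stably."""
    shift = scores.max(axis=1, keepdims=True)
    expd = np.exp(scores - shift)
    total = expd.sum(axis=1, keepdims=True)
    return expd / total, float((shift + np.log(total)).sum())

def dual_state(lam, y, X, z, v):
    """Dual value M(lam) and the implied probabilities P (K x M) and W (T x J)."""
    P, log_omega = softmax_lse(-np.outer(X.T @ lam, z))  # p_km ~ exp(-z_m X_k' lam)
    W, log_psi = softmax_lse(-np.outer(lam, v))          # w_tj ~ exp(-v_j lam_t)
    return float(y @ lam) + log_omega + log_psi, P, W

def gme_dual_newton(y, X, z, v, max_iter=100, tol=1e-8):
    """Minimize the dual with damped Newton steps; returns (beta_hat, lam_hat)."""
    lam = np.zeros(len(y))
    val, P, W = dual_state(lam, y, X, z, v)
    for _ in range(max_iter):
        beta, e = P @ z, W @ v
        grad = y - X @ beta - e                  # model consistency residual
        if np.max(np.abs(grad)) < tol:
            break
        var_z = P @ z**2 - beta**2               # diagonal of Sigma_Z
        var_v = W @ v**2 - e**2                  # diagonal of Sigma_V
        hess = (X * var_z) @ X.T + np.diag(var_v)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        new = dual_state(lam - t * step, y, X, z, v)
        while new[0] > val - 1e-4 * t * (grad @ step) and t > 1e-10:
            t *= 0.5                             # backtrack until M decreases enough
            new = dual_state(lam - t * step, y, X, z, v)
        lam = lam - t * step
        val, P, W = new
    return P @ z, lam

# Toy usage with hypothetical data.
rng = np.random.default_rng(0)
T = 30
X = rng.standard_normal((T, 4))
beta_true = np.array([2.0, 1.0, -3.0, 2.0])
y = X @ beta_true + rng.uniform(-1.0, 1.0, T)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma_hat = np.sqrt(resid @ resid / (T - 4))
z = 10.0 * np.linspace(-1.0, 1.0, 5)             # five-point coefficient support
v = 3.0 * sigma_hat * np.linspace(-1.0, 1.0, 5)  # three-sigma error support
beta_gme, lam_hat = gme_dual_newton(y, X, z, v)
print(beta_gme)
```

Only the T multipliers are optimized, rather than the $KM+TJ$ probabilities of the primal problem, which is the simplification the dual provides.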

#### 2.3. Statistical properties of GME

Under some mild regularity conditions, GJM establish large sample properties of the GME estimation. They also analyze its small sample properties, both analytically for some special cases and numerically using Monte Carlo simulations.

The noise term, $\mathit{V}\mathit{w}$, effectively “loosens” the model constraints for a given set of observations, and thus an interior solution is more likely. On the other hand, because of the presence of $\Sigma_{\mathit{V}}(\mathit{\lambda})$, which is positive definite, in the Hessian matrix (5), the GME estimator behaves like the ridge estimator in the sense that all coefficients are shrunk toward zero. Consider, for simplicity, the case where $\text{var}(\mathit{e})=\sigma^2\mathit{I}_T$ and **X** is orthogonal. The approximate covariance matrix of the GME estimate $\widehat{\mathit{\beta}}$ is

$$\sigma^2\Sigma_Z(\Sigma_Z+\Sigma_V)^{-2}\Sigma_Z$$

The finite sample performance of this estimator clearly depends on the specification of the error support **V**. Intuitively, the wider is **V**, the larger is the degree of shrinkage toward zero. GJM proposed to use the 3σ rule for the error support, where σ refers to the standard deviation of the disturbance. In practice, σ is replaced by its consistent estimator, such as that based on the OLS regression.

A second factor that may influence the finite sample performance of the GME estimator is the specification of the coefficient support, **Z**. The restrictions imposed on the parameter space through **Z** reflect prior knowledge about the unknown parameters. However, such knowledge is not always available, and researchers may want to entertain a variety of plausible bounds on **β**. As the parameter supports are widened, the GME risk functions modestly shift upward reflecting the reduced constraints on the parameter space. Hence, wide bounds may be used without extreme risk consequences, if one’s knowledge is minimal, to ensure that **Z** contains **β**. Intuitively, widening the bounds increases the impact of the data and decreases the impact of the support. On the other hand, narrowing the parameter supports only improves the risk as long as the true parameter vector is well in the interior of the support. GJM conducted Monte Carlo simulations on the impact of **Z** by using different supports. They found modest impacts of varying the parameter support on the estimation.

For both the coefficient and error supports, we need to select the number of points, M and J, respectively. Since the variances of the distributions $\mathit{p}(\mathit{\lambda})$ and $\mathit{w}(\mathit{\lambda})$ depend on the specifications of the supports, the dimension of the supports may affect the sampling properties of the estimator. Adding more points to the support of **Z** should decrease the variance of the associated point estimator. On the other hand, it increases the computational burden of the optimization problem. GJM reported an experiment showing that the estimator improves as the number of support points M increases for small and modest M. The greatest improvement is observed when M is increased from three to five.

GJM demonstrate various merits of the GME estimator, especially its resistance to the multicollinearity problem. The implementation of the GME estimation, however, requires several “human” decisions which are not required in the OLS. GJM provide some guidance on the specifications of these factors. First, non-sample information can be useful. For example, it is not uncommon in practice that the sign, range or approximate magnitude of the coefficients in question is known a priori. This information provides useful guidance on the specification of the coefficient support. Similarly, non-sample information regarding the error distribution is sometimes available. For instance, it is well known that the error distributions in financial studies have fat tails. Accordingly, one can use a wider error support than the usual 3σ rule does.

A second useful principle is adaptation. Generally, any consistent estimator can provide useful information on the coefficients and the distribution of the disturbance. Thus, one can tailor the specification of the coefficient and error supports based on preliminary consistent estimators. For example, to use the usual 3σ rule, one can replace σ with a consistent estimator. In the spirit of adaptive estimation, one can further tailor the error support such that it reflects characteristics of the error distribution, such as skewness, fat-tailedness, and so on.

Lastly, the maximum entropy problem can be further generalized to the minimum cross entropy problem. The cross entropy, or Kullback-Leibler information criterion, for two distributions, **p** and **q** (with a common support) is defined as

$$D(\mathit{p},\mathit{q})=\sum_{k} p_k \log(p_k/q_k)$$

The cross entropy measures the discrepancy between **p** and **q**. Suppose that, in addition to the model consistency requirement, prior information is available in the form of distributions of probabilities on the discrete supports for the coefficients and disturbances. It can be incorporated into the estimation by minimizing the cross entropy subject to the model consistency and additivity constraints. The ME principle is a special case of the minimum cross entropy principle, with the prior distributions set to constant.

## 3. The Weighted GME Estimator with a Data-driven Weight

#### 3.1. The weighted GME estimator (W-GME)

As discussed above, the specifications of the coefficient and error supports may affect the GME estimation results. In addition, the specification of the dual loss objective function (3) can also influence the estimator. By accounting for the unknown signal and noise components in the consistency relations, the GME estimates of the unknown parameter **β** and disturbances **e** are jointly determined. As a result, the entropy based objective function reflects statistical losses in the sample space (prediction) and in the parameter space (precision). Note, however, that the objective function (3) implicitly places equal weights on the parameter and error entropies.

To avoid arbitrarily assigning weights to the two loss components, GJM suggested a weighted GME (W-GME) estimator with the following objective function

$$H(\mathit{p},\mathit{w};\gamma)=-(1-\gamma)\mathit{p}'\log\mathit{p}-\gamma\mathit{w}'\log\mathit{w}$$

where $\gamma\in(0,1)$ controls the weights given to the two entropies. The corresponding unconstrained weighted GME(γ) objective function is

$$\mathcal{M}(\mathit{\lambda};\gamma)=\mathit{y}'\mathit{\lambda}+(1-\gamma)\sum_{k=1}^{K}\log(\Omega_k(\mathit{\lambda};\gamma))+\gamma\sum_{t=1}^{T}\log(\Psi_t(\mathit{\lambda};\gamma))$$

One can then show that

$$\begin{aligned} \widehat{p}_{km} &= \frac{\exp(-z_{km}X_k'\widehat{\mathit{\lambda}}/(1-\gamma))}{\Omega_k(\widehat{\mathit{\lambda}};\gamma)}\\ \widehat{w}_{tj} &= \frac{\exp(-v_{tj}\widehat{\lambda}_t/\gamma)}{\Psi_t(\widehat{\mathit{\lambda}};\gamma)} \end{aligned}$$

where $\widehat{\mathit{\lambda}}$ are functions of γ, and

$$\begin{aligned} \Omega_k(\widehat{\mathit{\lambda}};\gamma) &= \sum_{m=1}^{M}\exp(-z_{km}X_k'\widehat{\mathit{\lambda}}/(1-\gamma))\\ \Psi_t(\widehat{\mathit{\lambda}};\gamma) &= \sum_{j=1}^{J}\exp(-v_{tj}\widehat{\lambda}_t/\gamma) \end{aligned}$$
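As a consistency check on the weighted solution (a worked remark added for illustration, not taken from GJM), consider $\gamma=1/2$. The weighted objective is then half of the unweighted entropy sum, so the same $(\mathit{p},\mathit{w})$ maximize both problems and only the multipliers are rescaled:

$$H(\mathit{p},\mathit{w};1/2)=\tfrac{1}{2}\left(-\mathit{p}'\log\mathit{p}-\mathit{w}'\log\mathit{w}\right),\qquad \widehat{p}_{km}=\frac{\exp(-2z_{km}X_k'\widehat{\mathit{\lambda}})}{\sum_{m=1}^{M}\exp(-2z_{km}X_k'\widehat{\mathit{\lambda}})}$$

Writing $\widehat{\mathit{\lambda}}_{GME}=2\widehat{\mathit{\lambda}}$ recovers the unweighted GME probabilities, so the conventional GME estimator corresponds to the W-GME estimator with $\gamma=1/2$.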

GJM illustrated that the entropy optimization results are affected by γ. Furthermore, they reported that the effect of the weight on the estimation results cannot be determined unambiguously even for some very simple cases.

#### 3.2. W-GME with a data-driven weight

GJM show that one can use non-sample information and preliminary estimates to aid the specification of the coefficient and error supports. The prior distributions on these supports can be further “tilted” exponentially by non-uniform prior distributions incorporated through the minimum cross entropy framework. On the other hand, there is no clear guidance on the selection of γ in the W-GME problem. In this section, we propose a data-driven method to select γ for the W-GME estimator.

Specifically, we use the method of least squares cross-validation (LSCV), which is widely used in nonparametric estimation. This method is implemented as follows:

- Given the coefficient support $\mathit{Z}$, disturbance support $\mathit{V}$, and weight $\gamma\in(0,1)$, estimate **β** using the W-GME method (6), on $T-1$ observations, with the $t^{th}$ observation omitted for $t=1,\cdots,T$. Denote each estimate $\widehat{\mathit{\beta}}_{-t}(\gamma)$. For simplicity, we use uniform prior distributions for $\mathit{Z}$ and $\mathit{V}$.
- Calculate the squared prediction error $\widehat{s}_t(\gamma)=(y_t-\mathit{x}_t\widehat{\mathit{\beta}}_{-t}(\gamma))^2$ for each t.
- Select γ such that it minimizes the sum of the squared prediction errors $\sum_{t=1}^{T}\widehat{s}_t(\gamma)$.
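The three steps above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not from the article: the W-GME problem is solved through its dual by a damped Newton method, all coefficients share the support `z` and all disturbances share the support `v`, and the weight is chosen from a small hypothetical grid rather than by a line search over $(0,1)$. The names `wgme` and `lscv_gamma` and the toy data are illustrative.

```python
import numpy as np

def softmax_lse(scores):
    """Row-wise softmax and the sum of row-wise log-sum-exp terms, computed stably."""
    shift = scores.max(axis=1, keepdims=True)
    expd = np.exp(scores - shift)
    total = expd.sum(axis=1, keepdims=True)
    return expd / total, float((shift + np.log(total)).sum())

def wgme(y, X, z, v, gamma, max_iter=100, tol=1e-8):
    """W-GME estimate of beta for a weight gamma in (0, 1), via the dual problem."""
    def state(lam):
        P, lp = softmax_lse(-np.outer(X.T @ lam, z) / (1.0 - gamma))
        W, lw = softmax_lse(-np.outer(lam, v) / gamma)
        return float(y @ lam) + (1.0 - gamma) * lp + gamma * lw, P, W

    lam = np.zeros(len(y))
    val, P, W = state(lam)
    for _ in range(max_iter):
        beta, e = P @ z, W @ v
        grad = y - X @ beta - e                        # consistency residual
        if np.max(np.abs(grad)) < tol:
            break
        var_z = (P @ z**2 - beta**2) / (1.0 - gamma)   # scaled coefficient variances
        var_v = (W @ v**2 - e**2) / gamma              # scaled disturbance variances
        step = np.linalg.solve((X * var_z) @ X.T + np.diag(var_v), grad)
        t = 1.0
        new = state(lam - t * step)
        while new[0] > val - 1e-4 * t * (grad @ step) and t > 1e-10:
            t *= 0.5                                   # backtracking line search
            new = state(lam - t * step)
        lam = lam - t * step
        val, P, W = new
    return P @ z

def lscv_gamma(y, X, z, v, grid):
    """Return the grid value of gamma with the smallest leave-one-out squared error."""
    T = len(y)
    scores = []
    for gamma in grid:
        sse = 0.0
        for t in range(T):
            keep = np.arange(T) != t                   # omit observation t
            beta_minus_t = wgme(y[keep], X[keep], z, v, gamma)
            sse += float(y[t] - X[t] @ beta_minus_t) ** 2
        scores.append(sse)
    return grid[int(np.argmin(scores))], scores

# Toy usage with hypothetical data.
rng = np.random.default_rng(1)
T = 12
X = rng.standard_normal((T, 2))
y = X @ np.array([2.0, -3.0]) + rng.uniform(-1.0, 1.0, T)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma_hat = np.sqrt(resid @ resid / (T - 2))
z = 20.0 * np.linspace(-1.0, 1.0, 5)
v = 3.0 * sigma_hat * np.linspace(-1.0, 1.0, 5)
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
gamma_hat, scores = lscv_gamma(y, X, z, v, grid)
print(gamma_hat)
```

Each candidate weight requires T separate W-GME fits, so the cost grows with both the sample size and the number of candidate weights.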

## 4. Monte Carlo Simulations

To investigate the finite sample performance of the proposed W-GME method, we conducted some Monte Carlo simulations. The purpose of these experiments is to compare the W-GME with the conventional GME, which places equal weights on the entropy of the coefficient distribution and that of the disturbance distribution. It is not intended as an investigation of the GME method, where careful selection of the support for the coefficient and the disturbance is crucial to its performance. For simplicity, we choose not to use non-sample information in the specification of coefficient and error support, and to use the simple GME approach, or, uniform prior distributions in the minimum cross entropy framework.

Following GJM’s Monte Carlo simulation setup, we investigate the performance of the W-GME on linear models where the design matrices vary in degree of multicollinearity. Recall that the GME is similar to the ridge regression as a robust estimator against multicollinearity. Thus we are interested in its performance in the presence of multicollinearity.

We measure a matrix’s multicollinearity using its condition number, which is the ratio between its largest and smallest eigenvalues. Let **X** be a $T\times 4$ matrix which is generated randomly from an i.i.d. standard normal distribution. To form a design matrix with a desired condition number, $\kappa(\mathit{X}'\mathit{X})=\mu$, the singular value decomposition of $\mathit{X}=\mathit{Q}\mathit{L}\mathit{R}$ was recovered. Then, the singular values in **L** were replaced with the vector

$$a=\left[\sqrt{\frac{2}{1+\mu}},1,1,\sqrt{\frac{2\mu}{1+\mu}}\right]$$

which has length $K=4$. The new design matrix, $\mathit{X}_a=\mathit{Q}\mathit{L}_a\mathit{R}$, is characterized by $\kappa(\mathit{X}_a'\mathit{X}_a)=\mu$, and the condition number may be specified a priori.1 We then set

$$\mathit{y}=\mathit{X}_a\mathit{\beta}+\mathit{e}$$

where $\mathit{\beta}=[2,1,-3,2]'$, and **e** are T i.i.d. random errors.

In the Monte Carlo simulations, we consider three estimators: the OLS, GME and W-GME. We consider sample size $n=30$ and $n=50$. 500 samples are generated for each case. We use the LSCV to select the optimal weight γ. In particular, we use a line search over the interval $(0,1)$ to locate the γ that minimizes the squared prediction errors.2
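The design-matrix construction described above can be sketched with NumPy as follows. The function name `design_with_condition` and the seed are illustrative assumptions; the steps follow the text: draw a standard normal matrix, take its singular value decomposition, and replace the singular values with the vector $a$.

```python
import numpy as np

def design_with_condition(T, mu, rng):
    """Return a T x 4 design matrix X_a with condition number kappa(X_a' X_a) = mu."""
    X = rng.standard_normal((T, 4))
    Q, _, R = np.linalg.svd(X, full_matrices=False)  # X = Q diag(L) R
    a = np.array([np.sqrt(2.0 / (1.0 + mu)), 1.0, 1.0,
                  np.sqrt(2.0 * mu / (1.0 + mu))])
    return Q @ np.diag(a) @ R                        # singular values replaced by a

rng = np.random.default_rng(0)
X_a = design_with_condition(30, 20.0, rng)
eigvals = np.linalg.eigvalsh(X_a.T @ X_a)            # eigenvalues of X_a' X_a
print(eigvals.max() / eigvals.min())                 # condition number, equal to mu
```

The eigenvalues of $\mathit{X}_a'\mathit{X}_a$ are the squares of the entries of $a$, namely $2/(1+\mu)$, 1, 1 and $2\mu/(1+\mu)$, so their ratio is $\mu$ and their sum is 4 for any $\mu$.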

#### 4.1. Regressions with normal errors

Firstly, we assume that $\mathit{e}$ are iid standard normal random errors. For both GME estimators, we use a five-point support $\mathit{Z}=[-z,-z/2,0,z/2,z]$ for $z=10,20,30,50,100$ respectively. For the error support, we set $\mathit{V}=[-\widehat{\sigma},-\widehat{\sigma}/2,0,\widehat{\sigma}/2,\widehat{\sigma}]\times 3$, where $\widehat{\sigma}$ is the standard error of the OLS residuals. We report the Mean Squared Errors (MSE) of coefficient estimates, $\left|\right|\widehat{\mathit{\beta}}-\mathit{\beta}{\left|\right|}^{2}$, in Table 1. When $\mu =1$, the MSE of the OLS is close to 4, its theoretical value, in all cases. Not surprisingly, when $\mathit{X}$ is orthogonal, the OLS outperforms both GME estimators, which are shrinkage estimators and thus biased. On the other hand, in most cases where $\mu >1$, the two GME estimators have smaller MSEs than the OLS does. This is consistent with the famous Stein’s phenomenon that the OLS is dominated by some shrinkage estimators in multiple linear regressions.
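The theoretical OLS value quoted above can be checked directly (a worked remark added for illustration, assuming unit error variance and the singular values in the vector $a$). Since the eigenvalues of $\mathit{X}_a'\mathit{X}_a$ are the squared entries of $a$,

$$E\left\|\widehat{\mathit{\beta}}_{OLS}-\mathit{\beta}\right\|^2=\sigma^2\,\mathrm{tr}\left[(\mathit{X}_a'\mathit{X}_a)^{-1}\right]=\frac{1+\mu}{2}+1+1+\frac{1+\mu}{2\mu}$$

which equals 4 for $\mu=1$ and approximately 8.05, 13.03 and 28.01 for $\mu=10$, 20 and 50, in line with the OLS columns of Table 1.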

Comparing the GME and W-GME, we note that when $z=10$, or the coefficient support is defined on $[-10,10]$, the MSEs of the GME are smaller than those of the W-GME. This result suggests that when a relatively precise coefficient support is used, the GME estimator has a smaller risk. Intuitively, with a narrow support for the coefficients that covers the true values, the coefficients can be estimated precisely, regardless of the choice of the weight γ in a weighted GME framework. On the other hand, the potential benefit of the W-GME is largely offset by the additional variation entailed by the data-driven method of selecting the entropy weight γ. However, in practice, the improvement due to narrow coefficient supports is only obtainable if the supports contain the true unknown coefficient values. Without prior or non-sample information, using a narrow coefficient support increases the risk of missing the true values and renders the estimator inconsistent.

When $z\ge 20$, the W-GME outperforms the GME considerably. Furthermore, the performance of the W-GME relative to that of the GME improves with both the width of coefficient support and the condition number. The average ratios between the MSEs of the W-GME and those of the GME across two sample sizes are respectively [1.11, 0.87, 0.78, 0.69, 0.70] for $z=[10,20,30,50,100]$, while these ratios are respectively [1.00, 0.87, 0.76, 0.69] across the condition numbers $\mu =[1,10,20,50]$. In addition, it is noted that the performance of the W-GME appears to stabilize for $z\ge 50$. In other words, its performance seems to be affected little when a wide coefficient support is further widened. In contrast, the MSE of the GME increases with the coefficient support and reaches the level of that of the OLS for $z\ge 50$. Given the fact that a narrow coefficient support increases the risk of inconsistency, the stability of the W-GME under a wide range of coefficient supports is highly desirable.

| z | κ | OLS, $n=30$ | W-GME($\widehat{\gamma}$), $n=30$ | GME, $n=30$ | OLS, $n=50$ | W-GME($\widehat{\gamma}$), $n=50$ | GME, $n=50$ |
|---|---|---|---|---|---|---|---|
| 10 | 1 | 3.84 | 4.06 (0.26) | 3.57 | 3.84 | 4.26 (0.24) | 3.67 |
| 10 | 10 | 7.38 | 5.97 (0.24) | 5.37 | 6.57 | 5.77 (0.23) | 4.91 |
| 10 | 20 | 10.83 | 7.29 (0.24) | 6.78 | 10.63 | 7.06 (0.22) | 6.49 |
| 10 | 50 | 19.85 | 8.28 (0.23) | 7.99 | 21.29 | 8.15 (0.23) | 7.50 |
| 20 | 1 | 3.94 | 4.25 (0.08) | 4.03 | 4.07 | 4.28 (0.08) | 4.18 |
| 20 | 10 | 8.35 | 6.81 (0.11) | 7.59 | 7.55 | 6.55 (0.09) | 7.27 |
| 20 | 20 | 13.00 | 8.72 (0.11) | 10.69 | 13.85 | 8.63 (0.10) | 11.13 |
| 20 | 50 | 27.58 | 12.81 (0.14) | 16.05 | 25.91 | 11.88 (0.12) | 16.27 |
| 30 | 1 | 3.86 | 4.00 (0.04) | 3.92 | 4.10 | 4.42 (0.04) | 4.53 |
| 30 | 10 | 7.73 | 6.07 (0.05) | 7.49 | 8.11 | 6.73 (0.05) | 8.21 |
| 30 | 20 | 13.31 | 8.29 (0.07) | 12.11 | 12.35 | 7.90 (0.06) | 11.33 |
| 30 | 50 | 26.89 | 12.67 (0.08) | 21.09 | 26.70 | 12.75 (0.08) | 21.01 |
| 50 | 1 | 4.16 | 3.89 (0.02) | 4.28 | 3.92 | 3.98 (0.02) | 4.35 |
| 50 | 10 | 7.44 | 5.55 (0.02) | 7.58 | 8.70 | 6.91 (0.02) | 9.90 |
| 50 | 20 | 13.14 | 8.19 (0.03) | 12.98 | 12.78 | 8.21 (0.03) | 13.58 |
| 50 | 50 | 28.40 | 13.87 (0.05) | 26.65 | 31.14 | 15.29 (0.04) | 28.83 |
| 100 | 1 | 4.02 | 3.82 (0.02) | 4.15 | 3.74 | 4.39 (0.01) | 4.74 |
| 100 | 10 | 7.71 | 6.04 (0.02) | 7.93 | 8.03 | 6.64 (0.01) | 8.81 |
| 100 | 20 | 12.21 | 7.93 (0.02) | 12.14 | 12.77 | 8.32 (0.02) | 13.47 |
| 100 | 50 | 26.33 | 13.52 (0.03) | 26.38 | 26.98 | 12.84 (0.02) | 27.20 |

Next we turn our attention to the empirically determined weight $\widehat{\gamma}$ in the weighted entropy objective function. For each experiment, the average $\widehat{\gamma}$ is reported in parentheses for the W-GME estimator. We observe two noteworthy features. First, $\widehat{\gamma}$ generally increases with μ. Recall that γ is the weight placed on the entropy of the disturbance distributions. Thus the more severe the “ill-posed” problem is, the larger is the weight selected by the LSCV. In other words, the data-driven method automatically relaxes the model consistency constraints when the underlying linear inverse problem associated with the OLS becomes problematic. Second, $\widehat{\gamma}$ decreases with the width of the coefficient support across all condition numbers. Intuitively, the wider is the coefficient support, the weaker are the restrictions imposed by the GME estimation procedure. Correspondingly, the smaller is the need to regulate the entropy, or uncertainty, of the disturbance distribution.3

Lastly, we note that the overall performance of the estimators in question remains quite stable when the sample size is increased from 30 to 50. The average ratios, across all cases, in the MSE between the W-GME and GME are 0.834 and 0.828 respectively for $n=30$ and 50. It is well-known that data-driven methods normally require a sizeable sample to attain their theoretical advantages. Nonetheless our results demonstrate that the W-GME can outperform the GME with quite small sample sizes under various scenarios.

| z | κ | OLS ($n=30$) | W-GME($\widehat{\gamma}$) ($n=30$) | GME ($n=30$) | OLS ($n=50$) | W-GME($\widehat{\gamma}$) ($n=50$) | GME ($n=50$) |
|---|---|---|---|---|---|---|---|
| 10 | 1 | 3.98 | 3.41 (0.42) | 2.81 | 4.08 | 3.43 (0.44) | 2.73 |
| 10 | 10 | 8.07 | 5.64 (0.41) | 4.18 | 7.75 | 5.06 (0.45) | 3.67 |
| 10 | 20 | 12.17 | 6.50 (0.40) | 4.99 | 12.16 | 5.88 (0.43) | 4.69 |
| 10 | 50 | 29.33 | 7.70 (0.44) | 5.66 | 27.46 | 7.09 (0.44) | 5.55 |
| 20 | 1 | 4.03 | 3.54 (0.18) | 3.18 | 3.92 | 3.39 (0.19) | 2.92 |
| 20 | 10 | 8.53 | 5.66 (0.19) | 5.40 | 7.79 | 5.26 (0.20) | 4.83 |
| 20 | 20 | 12.12 | 7.27 (0.18) | 6.76 | 12.58 | 7.16 (0.21) | 6.08 |
| 20 | 50 | 28.15 | 11.22 (0.22) | 9.23 | 28.61 | 10.82 (0.22) | 8.29 |
| 30 | 1 | 4.21 | 3.62 (0.11) | 3.44 | 4.34 | 3.60 (0.10) | 3.46 |
| 30 | 10 | 8.09 | 5.66 (0.12) | 5.82 | 8.41 | 5.85 (0.12) | 6.05 |
| 30 | 20 | 13.60 | 7.58 (0.12) | 9.00 | 13.76 | 7.91 (0.13) | 8.76 |
| 30 | 50 | 26.99 | 11.45 (0.15) | 13.06 | 25.87 | 10.95 (0.15) | 11.57 |
| 50 | 1 | 4.11 | 3.46 (0.05) | 3.35 | 3.90 | 3.23 (0.04) | 3.14 |
| 50 | 10 | 8.58 | 5.48 (0.05) | 6.79 | 7.76 | 5.36 (0.05) | 6.01 |
| 50 | 20 | 13.15 | 6.97 (0.06) | 10.27 | 13.34 | 7.39 (0.06) | 9.89 |
| 50 | 50 | 27.81 | 11.16 (0.08) | 17.84 | 28.01 | 11.28 (0.08) | 17.47 |
| 100 | 1 | 3.90 | 3.02 (0.02) | 3.45 | 4.00 | 2.94 (0.01) | 3.25 |
| 100 | 10 | 7.55 | 4.70 (0.02) | 6.35 | 8.54 | 5.12 (0.02) | 6.65 |
| 100 | 20 | 13.62 | 6.86 (0.02) | 11.49 | 13.52 | 6.94 (0.03) | 10.81 |
| 100 | 50 | 28.27 | 11.29 (0.04) | 22.14 | 29.26 | 12.03 (0.04) | 21.93 |

| z | κ | OLS ($n=30$) | W-GME($\widehat{\gamma}$) ($n=30$) | GME ($n=30$) | OLS ($n=50$) | W-GME($\widehat{\gamma}$) ($n=50$) | GME ($n=50$) |
|---|---|---|---|---|---|---|---|
| 10 | 1 | 3.84 | 4.09 (0.47) | 3.61 | 3.36 | 4.71 (0.44) | 4.02 |
| 10 | 10 | 7.01 | 5.68 (0.44) | 4.98 | 6.35 | 6.47 (0.42) | 5.49 |
| 10 | 20 | 10.65 | 6.41 (0.44) | 5.47 | 10.31 | 7.31 (0.42) | 5.91 |
| 10 | 50 | 20.90 | 7.03 (0.44) | 6.28 | 20.58 | 7.98 (0.43) | 6.93 |
| 20 | 1 | 3.84 | 4.11 (0.21) | 4.54 | 3.70 | 6.91 (0.18) | 6.92 |
| 20 | 10 | 7.46 | 6.07 (0.21) | 6.30 | 7.18 | 8.12 (0.20) | 8.95 |
| 20 | 20 | 11.02 | 6.83 (0.21) | 7.48 | 11.11 | 9.70 (0.20) | 10.69 |
| 20 | 50 | 24.31 | 9.98 (0.24) | 9.37 | 24.20 | 11.64 (0.22) | 12.03 |
| 30 | 1 | 4.60 | 4.33 (0.21) | 5.39 | 3.59 | 6.23 (0.18) | 6.50 |
| 30 | 10 | 7.83 | 5.91 (0.21) | 7.82 | 7.19 | 9.00 (0.20) | 10.77 |
| 30 | 20 | 12.03 | 7.73 (0.21) | 9.84 | 11.38 | 9.71 (0.20) | 13.32 |
| 30 | 50 | 23.92 | 10.75 (0.24) | 13.89 | 25.65 | 15.82 (0.22) | 19.98 |
| 50 | 1 | 5.64 | 5.29 (0.05) | 7.15 | 3.78 | 8.32 (0.04) | 8.76 |
| 50 | 10 | 7.86 | 5.74 (0.05) | 9.28 | 7.80 | 12.54 (0.05) | 17.37 |
| 50 | 20 | 13.41 | 8.18 (0.06) | 13.91 | 11.89 | 9.69 (0.05) | 16.53 |
| 50 | 50 | 24.07 | 11.05 (0.08) | 21.30 | 25.82 | 20.27 (0.08) | 32.39 |
| 100 | 1 | 3.96 | 4.01 (0.02) | 5.19 | 3.80 | 11.75 (0.01) | 12.88 |
| 100 | 10 | 6.92 | 5.33 (0.02) | 8.68 | 8.31 | 14.55 (0.02) | 19.93 |
| 100 | 20 | 14.25 | 7.78 (0.02) | 17.64 | 11.64 | 17.64 (0.02) | 25.50 |
| 100 | 50 | 32.09 | 14.24 (0.03) | 33.82 | 28.35 | 41.85 (0.04) | 59.32 |

#### 4.2. Regressions with non-normal errors

Next we investigate the performance of the proposed estimator when the errors are generated from non-normal distributions. Using the same sample design outlined above, we instead generated the errors from a ${\chi}^{2}(4)$ and a $t(3)$ distribution. The ${\chi}^{2}(4)$ errors were centered by subtracting the mean (i.e., 4), and all draws were scaled to have unit variance by dividing each by the associated standard deviation ($\sqrt{8}$ for the ${\chi}^{2}(4)$ distribution and $\sqrt{3}$ for the $t(3)$ distribution). Under the ${\chi}^{2}$ error distribution, we set the disturbance support to $\mathit{V}=[-\widehat{\sigma},-\widehat{\sigma}/2,0,\widehat{\sigma},2\widehat{\sigma}]\times 3$ to account for the skewness of the ${\chi}^{2}$ distribution.^{4} When the disturbance terms were generated from the t distribution, instead of using the 3σ rule, we set $\mathit{V}=[-\widehat{\sigma},-\widehat{\sigma}/2,0,\widehat{\sigma}/2,\widehat{\sigma}]\times 5$ to account for the fat tails of the error distribution.
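The error design can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code used for the simulations. The function names are ours, the ${\chi}^{2}$ and $t$ draws are built from gamma and normal variates, and the scaling uses the known variances $2\cdot 4=8$ for ${\chi}^{2}(4)$ and $3/(3-2)=3$ for $t(3)$.

```python
import math
import random


def chi2_error(rng, df=4):
    """One centred, unit-variance chi-square(df) draw."""
    # chi-square(df) equals Gamma(shape=df/2, scale=2): mean df, variance 2*df.
    draw = rng.gammavariate(df / 2.0, 2.0)
    return (draw - df) / math.sqrt(2.0 * df)


def t_error(rng, df=3):
    """One unit-variance Student t(df) draw (requires df > 2)."""
    # t(df) = N(0,1) / sqrt(chi-square(df) / df); its variance is df / (df - 2).
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return (z / math.sqrt(chi2 / df)) / math.sqrt(df / (df - 2.0))


def disturbance_support(sigma_hat, kind):
    """Five-point disturbance support V for the two non-normal designs."""
    if kind == "chi2":   # asymmetric: longer right arm for the skewed errors
        points, scale = [-1.0, -0.5, 0.0, 1.0, 2.0], 3.0
    elif kind == "t":    # symmetric: multiplier 5 instead of 3 for the fat tails
        points, scale = [-1.0, -0.5, 0.0, 0.5, 1.0], 5.0
    else:
        raise ValueError("kind must be 'chi2' or 't'")
    return [scale * sigma_hat * p for p in points]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
chi2_draws = [chi2_error(rng) for _ in range(20000)]
mean = sum(chi2_draws) / len(chi2_draws)
var = sum((e - mean) ** 2 for e in chi2_draws) / len(chi2_draws)
print(mean, var)  # should be close to 0 and 1 respectively
print(disturbance_support(1.0, "chi2"))
print(disturbance_support(1.0, "t"))
```

The sample variance of the $t(3)$ draws is not checked here because the $t(3)$ distribution has no finite fourth moment, so its sample variance converges slowly.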

The estimation results for the ${\chi}^{2}$ case are reported in Table 2. The overall pattern is similar to that with normal errors. With narrow coefficient supports ($z\le 20$), the GME has a smaller MSE than the W-GME. When $z\ge 30$, the W-GME outperforms the GME except at $\kappa=1$ with $z=30$ or $z=50$, and the performance gap generally increases with the condition number. The performance is quite similar between $n=30$ and $n=50$. The average $\widehat{\gamma}$ is larger than in the normal-error case, indicating a heavier penalty on the uncertainty in the error distribution when it is skewed.

Lastly, Table 3 reports the results for the t error distribution. With $n=30$, the overall pattern is again similar to that of the first two cases. A noteworthy difference is that the MSEs of the two GME estimators increase substantially when the sample size is raised to 50. In contrast, the OLS does not seem to be affected by the change in sample size. Nonetheless, except when the condition number is small or a very wide coefficient support is used (i.e., $z=100$), the W-GME still outperforms the OLS. We also note that the selected weight $\widehat{\gamma}$ is larger than that chosen under normal errors.

## 5. Concluding Remarks

The Generalized Maximum Entropy (GME) estimator is a robust estimator that is resistant to multicollinearity. Like other robust estimators, it requires the specification of some “tuning” parameters. In particular, it requires users to specify discrete supports for the coefficients and disturbances. In the more general weighted GME framework, one also needs to specify a parameter that determines the relative weight placed on the entropies of the coefficient and error distributions. Although the specification of the coefficient and error supports can be guided by non-sample information and preliminary estimates, there is no clear guidance on the selection of the weight in a weighted GME estimation.

In this study, we have presented a weighted GME estimator with a data-driven weight. The conventional GME estimator places equal weights on the entropies of the coefficient and disturbance distributions. Instead, we propose to use the method of least squares cross validation to select this weight in a data-driven manner. We demonstrate numerically that the proposed W-GME estimator often outperforms the conventional GME estimator under various scenarios. Combining the data-driven selection of the weight with automatic specification of the supports for the coefficients and errors, to achieve adaptiveness and further improvement, is of interest for future studies.
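The selection loop behind a cross-validated weight can be sketched generically. The code below is an illustration under stated assumptions, not the estimator studied in this paper. It assumes the criterion is the leave-one-out sum of squared prediction errors, uses a single regressor, and replaces the W-GME fit with a hypothetical stand-in (`shrunken_slope`, an OLS slope shrunk by a factor $1-\gamma$) so that the example stays self-contained. The names `lscv_select` and `shrunken_slope` are ours.

```python
def lscv_select(xs, ys, grid, fit):
    """Return (gamma, score) minimising leave-one-out squared prediction error.

    fit(train_x, train_y, gamma) must return a slope coefficient.
    """
    best_gamma, best_score = None, float("inf")
    for gamma in grid:
        score = 0.0
        for i in range(len(xs)):
            # Drop observation i, fit on the rest, then predict observation i.
            train_x = xs[:i] + xs[i + 1:]
            train_y = ys[:i] + ys[i + 1:]
            beta = fit(train_x, train_y, gamma)
            score += (ys[i] - xs[i] * beta) ** 2
        if score < best_score:
            best_gamma, best_score = gamma, score
    return best_gamma, best_score


def shrunken_slope(train_x, train_y, gamma):
    """Hypothetical stand-in for the W-GME fit: OLS slope times (1 - gamma)."""
    sxy = sum(x * y for x, y in zip(train_x, train_y))
    sxx = sum(x * x for x in train_x)
    return (1.0 - gamma) * sxy / sxx


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2x, so any shrinkage only hurts
best_gamma, best_score = lscv_select(xs, ys, [0.01, 0.5, 0.99], shrunken_slope)
print(best_gamma, best_score)
```

Passing the estimator as the `fit` argument keeps the cross-validation loop independent of how the coefficients are obtained, so a W-GME routine could be substituted without changing the selection logic.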

## Notes

^{1.} A high condition number indicates a high degree of multicollinearity, and vice versa. A condition number of one signifies that the columns of the matrix in question are orthogonal to each other.

^{2.} We searched over an equally spaced grid $\mathit{\rho}=[\log(0.01),\log(0.01)+h,\log(0.01)+2h,\cdots ,\log(0.99)]$, where $h=(\log(0.99)-\log(0.01))/15$, and γ is set to $\exp(\rho )$.

^{3.} Recall that the GME estimator implicitly assumes a uniform prior distribution over the error support. Since we use a symmetric error support centered at zero, a uniform distribution over this support leads to a zero disturbance. A wider coefficient support means less restrictive constraints on **β** and thus a smaller $e=y-x\beta $ in absolute value. With the error terms more likely to be close to zero, there is less need to regulate the entropy of the error term distributions.

^{4.} A non-uniform prior distribution $\mathit{u}={[4/15,4/15,1/5,2/15,2/15]}^{\prime}$ is used for the error support such that the prior distribution is centered at zero.
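The search grid of Note 2 and the centering claim of Note 4 can both be checked with a short standard-library sketch. The function name `gamma_grid` is ours. The support points $[-1,-1/2,0,1,2]$ are those of the ${\chi}^{2}$ disturbance support in Section 4.2, expressed in units of $3\widehat{\sigma}$.

```python
import math
from fractions import Fraction


def gamma_grid(lo=0.01, hi=0.99, steps=15):
    """Candidate weights equally spaced on the log scale, endpoints included."""
    h = (math.log(hi) - math.log(lo)) / steps
    return [math.exp(math.log(lo) + i * h) for i in range(steps + 1)]


grid = gamma_grid()
print(len(grid), [round(g, 4) for g in grid])

# Note 4: prior mean of the chi-square disturbance support, using exact fractions.
points = [Fraction(-1), Fraction(-1, 2), Fraction(0), Fraction(1), Fraction(2)]
prior = [Fraction(4, 15), Fraction(4, 15), Fraction(1, 5),
         Fraction(2, 15), Fraction(2, 15)]
prior_mean = sum(p * v for p, v in zip(prior, points))
print(sum(prior), prior_mean)
```

With a step of $h=(\log(0.99)-\log(0.01))/15$ the grid has 16 points, which are denser near 0.01 than near 0.99 once mapped back through the exponential.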

## References

1. Skilling, J. The axioms of maximum entropy. In Maximum Entropy and Bayesian Methods in Science and Engineering; Skilling, J., Ed.; Kluwer: Dordrecht, The Netherlands, 1989; pp. 173–187.
2. Golan, A.; Judge, G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; Wiley: Chichester, UK, 1996.
3. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
4. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. **1957**, 106, 620–630.
5. Kullback, S. Information Theory and Statistics; John Wiley: New York, NY, USA, 1959.
6. Levine, R.D. An information theoretic approach to inversion problems. J. Phys. A-Math. Gen. **1980**, 13, 91–108.
7. Levine, R.D.; Tribus, M. The Maximum Entropy Formalism; MIT Press: Cambridge, MA, USA, 1979.
8. Shore, J.E.; Johnson, R.W. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theory **1980**, 26, 26–37.
9. Jaynes, E.T. Prior information and ambiguity in inverse problems. In Inverse Problems; McLaughlin, D.W., Ed.; SIAM Proceedings, American Mathematical Society: Providence, RI, USA, 1984; pp. 151–166.
10. Csiszár, I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. **1991**, 19, 2032–2066.
11. Hall, P. Large sample optimality of least squares cross-validation in density estimation. Ann. Stat. **1983**, 11, 1156–1174.
12. Stone, C.J. An asymptotically optimal window selection rule for kernel density estimates. Ann. Stat. **1984**, 12, 1285–1297.
13. Hall, P.; Marron, J.S. Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation. Probab. Theory Rel. **1987**, 74, 567–581.

© 2009 by the author; licensee Molecular Diversity Preservation International, Basel, Switzerland. This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license http://creativecommons.org/licenses/by/3.0/.