
An Entropic Estimator for Linear Inverse Problems

by Amos Golan 1 and Henryk Gzyl 2,*
Department of Economics, Info-Metrics Institute, American University, 4400 Massachusetts Ave., Washington, DC 20016, USA
Centro de Finanzas, IESA, Caracas 1010, Venezuela
Author to whom correspondence should be addressed.
Entropy 2012, 14(5), 892-923;
Received: 29 February 2012 / Revised: 2 April 2012 / Accepted: 17 April 2012 / Published: 10 May 2012
(This article belongs to the Special Issue Concepts of Entropy and Their Applications)


In this paper we examine an Information-Theoretic method for solving noisy linear inverse estimation problems, one that encompasses a whole class of estimation methods under a single framework. Under this framework, prior information about the unknown parameters (when such information exists) and constraints on the parameters can be incorporated into the statement of the problem. The method builds on the basics of the maximum entropy principle and consists of transforming the original problem into the estimation of a probability density on an appropriate space naturally associated with the statement of the problem. This estimation method is generic in the sense that it provides a framework for analyzing non-normal models; it is easy to implement and is suitable for all types of inverse problems, such as those with small, ill-conditioned or noisy data. First order approximations, large sample properties and convergence in distribution are developed as well. Analytical examples and statistics for model comparison and evaluation, which are inherent to this method, are discussed and complemented with explicit examples.

1. Introduction

Researchers in all disciplines are often faced with small and/or ill-conditioned data. Unless much is known, or assumed, about the underlying process generating these data (the signal and the noise), such data lead to ill-posed, noisy (inverse) problems. Traditionally, these problems are solved by using parametric and semi-parametric estimators such as least squares, regularization and non-likelihood methods. In this work, we propose a semi-parametric information theoretic method for solving these problems while allowing the researcher to impose prior knowledge in a non-Bayesian way. The model developed here provides a major extension of the Generalized Maximum Entropy model of Golan, Judge and Miller [1] and provides new statistical results for the estimators discussed in Gzyl and Velásquez [2].
The overall purpose of this paper is fourfold. First, we develop a generic information theoretic method for solving linear, noisy inverse problems that uses minimal distributional assumptions. This method is generic in the sense that it provides a framework for analyzing non-normal models and it allows the user to incorporate prior knowledge in a non-Bayesian way. Second, we provide detailed analytic solutions for a number of possible priors. Third, using the concentrated (unconstrained) model, we are able to compare our estimator to other estimators, such as the Least Squares, regularization and Bayesian methods. Our proposed model is easy to apply and suitable for analyzing a whole class of linear inverse problems across the natural and social sciences. Fourth, we provide the large sample properties of our estimator.
To achieve our goals, we build on the current Information-Theoretic (IT) literature, which is founded on the Maximum Entropy (ME) principle (Jaynes [3,4]) and on Shannon’s [5] information measure (entropy), as well as on other generalized entropy measures. To clarify the relationship between the familiar linear statistical model and the approach we take here, we now briefly define our basic problem, discuss its traditional solution, and provide the basic logic and related literature we use to solve that problem.
Consider the basic (linear) problem of estimating the K-dimensional location parameter vector (signal, input) β given an N-dimensional observed sample (response) vector y and an N × K design (transfer) matrix X such that y = Xβ + ε, where ε is an N-dimensional random vector with E[ε] = 0 and some positive definite covariance matrix with a scale parameter σ². The statistical nature of the unobserved noise term is supposed to be known, and we suppose that the second moments of the noise are finite. The researcher’s objective is to estimate the unknown vector β with minimal assumptions on ε. Recall that under the traditional regularity conditions for the linear model (and for X of rank K), the least squares (LS) unconstrained estimator is β̂_LS = (X^tX)^{−1}X^ty, with covariance σ²(X^tX)^{−1}, where “t” stands for transpose.
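As a point of reference, the LS baseline can be computed directly. The following minimal sketch simulates a well-posed instance of the model; the design, sample size and noise scale are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated linear model y = X beta + eps (N = 50 observations, K = 3 parameters).
N, K = 50, 3
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares estimator: beta_hat = (X^t X)^{-1} X^t y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated covariance of beta_hat: sigma^2 (X^t X)^{-1},
# with sigma^2 estimated from the residuals.
resid = y - X @ beta_ls
sigma2_hat = resid @ resid / (N - K)
cov_ls = sigma2_hat * np.linalg.inv(X.T @ X)

print(beta_ls)
```

For well-conditioned data such as this, LS recovers β accurately; the estimator proposed in the paper targets the under-determined and ill-conditioned cases where this baseline breaks down.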
Consider now the problem of estimating β and ε simultaneously while imposing minimal assumptions on the likelihood structure and while incorporating certain constraints on the signal, and perhaps on the noise. Further, rather than following the tradition of employing point estimators, consider estimating the empirical distributions of the unknown quantities βk and εn with the joint objectives of maximizing in-sample and out-of-sample prediction.
With these objectives, the problem is inherently under-determined and cannot be solved with the traditional least squares or likelihood approaches. Therefore, one must resort to a different principle. In the work done here, we follow the Maximum Entropy (ME) principle that was developed by Jaynes [3,4] for similar problems. The classical ME method consists of using a variational method to choose a probability distribution from a class of probability distributions having pre-assigned generalized moments.
In more general terms, consider the problem of estimating an unknown discrete probability distribution from a finite and possibly noisy set of observed generalized (sample) moments, that is, arbitrary functions of the data. These moments (and the fact that the distribution is proper) are supposed to be the only available information. Regardless of the level of noise in these observed moments, if the dimension of the unknown distribution is larger than the number of observed moments, there are infinitely many proper probability distributions satisfying this information. Such a problem is called an under-determined problem. Which one of the infinitely many solutions that satisfy the data should one choose? Within the class of information-theoretic (IT) methods, the chosen solution is the one that maximizes an information criterion: entropy. The procedure that we propose below to solve the estimation problem described above fits within that framework.
We construct our proposed estimator for solving the noisy, inverse, linear problem in two basic steps. In the first step, each unknown parameter (βk and εn) is constructed as the expected value of a certain random variable. That is, we view the possible values of the unknown parameters as values of random variables whose distributions are to be determined. We assume that the range of each such random variable contains the true unknown value of βk or εn respectively. This step involves two specifications. The first is the pre-specified support space for the two sets of parameters (finite/infinite and/or bounded/unbounded); at the outset of Section 2 we do this as part of the mathematical statement of the problem. The second is that any further information we may have about the parameters is incorporated into the choice of a prior (reference) measure on these supports. Since a model for the noise is usually supposed to be known, the statistical nature of the noise is incorporated at this stage; as far as the signal goes, this is an auxiliary construction.
In our second step, because imposing minimal assumptions on the likelihood implies that the problem is under-determined, we resort to the ME principle. This means that we need to convert the under-determined problem into a well-posed, constrained optimization. As in the classical ME method, the objective function in that constrained optimization problem is composed of K + N entropy functions: one for each of the K + N proper probability distributions (one for each signal component βk and one for each noise component εn). The constraints are just the observed information (data) and the requirement that all probability distributions are proper. Maximizing (simultaneously) the K + N entropies subject to the constraints yields the desired solution. This optimization yields a unique solution in terms of a unique set of proper probability distributions, which in turn yields the desired point estimates of βk and εn. Once the constrained model is solved, we construct the concentrated (unconstrained) model. In the method proposed here, we also allow the introduction of different priors corresponding to one’s beliefs about the data generating process and the structure of the unknown β’s.
Our proposed estimator is a member of the IT family of estimators. The members of this family include the Empirical Likelihood (EL), the Generalized EL (GEL), the Generalized Method of Moments (GMM), the Bayesian Method of Moments (BMOM), the Generalized Maximum Entropy (GME), and the Maximum Entropy in the Mean (MEM), and all are related to the classical Maximum Entropy (ME) (e.g., Owen [6,7]; Qin and Lawless [8]; Smith [9]; Newey and Smith [10]; Kitamura and Stutzer [11]; Imbens et al. [12]; Zellner [13,14]; Zellner and Tobias [15]; Golan, Judge and Miller [1]; Gamboa and Gassiat [16]; Gzyl [17]; Golan and Gzyl [18]). See also Gzyl and Velásquez [2], which builds upon Golan and Gzyl [18], where the synthesis was first proposed. If, in addition, the data are ill-conditioned, one often has to resort to the class of regularization methods (e.g., Hoerl and Kennard [19]; O’Sullivan [20]; Breiman [21]; Tibshirani [22]; Titterington [23]; Donoho et al. [24]; Besnerais et al. [25]). A reference for regularization in statistics is Bickel and Li [26]. If some prior information on the data generation process or on the model is available, Bayesian methods are often used. For a detailed review of the IT family of estimators, a historical perspective and a synthesis, see Golan [27]. For other background and related entropy and IT methods of estimation see the special volume of Advances in Econometrics (Fomby and Hill [28]) and the two special issues of the Journal of Econometrics [29,30]. For additional mathematical background see Mynbaev [31] and Asher, Borchers and Thurber [32].
Our proposed generic IT method provides an estimator for the parameters of the linear statistical model that reconciles some of the objectives achieved by each of the above methods. Like the philosophy behind the EL, we do not assume a pre-specified likelihood, but rather recover the (natural) weight of each observation via the optimization procedure (e.g., Owen [7]; Qin and Lawless [8]). Similar to regularization methods used for ill-behaved data, we follow the GME logic and use the pre-specified support space for each of the unknown parameters as a form of regularization (e.g., Golan, Judge and Miller [1]); the estimated parameters must fall within that space. However, unlike the GME, our method allows for infinitely large support spaces and continuous prior distributions. Like Bayesian approaches, we do use prior information, but we use these priors in a different way: in a way consistent with the basics of information theory and in line with the Kullback–Leibler entropy discrepancy measure. In that way, we are able to combine ideas from the different methods described above to yield a consistent IT estimator that is statistically and computationally efficient and easy to apply.
In Section 2, we lay out the basic formulation and then develop our basic model. In Section 3, we provide detailed closed-form examples for the case of normal priors and for other priors. In Section 4, we develop the basic statistical properties of our estimator, including a first order approximation. In Section 5, we compare our method with Least Squares, regularization and Bayesian methods, including the Bayesian Method of Moments; the comparisons are done under normal priors. An additional set of analytical examples, providing the formulation and solution for four basic priors (bounded, unbounded and a combination of both), is developed in Section 6. In Section 7, we comment on model comparison; detailed closed-form formulations for that section appear in an Appendix. We conclude in Section 8. The Appendices provide the proofs and detailed analytical formulations.

2. Problem Statement and Solution

2.1. Notation and Problem Statement

Consider the linear statistical model
y = Xβ + ε    (1)
where β is an unknown K-dimensional signal vector that cannot be directly measured but is required to satisfy some convex constraints expressed as β ∈ Cs, where Cs is a closed convex set. For example, Cs = {β : ak ≤ βk ≤ bk, k = 1,…,K} with constants ak < bk. (These constraints may come from constraints on Entropy 14 00892 i008, and may have a natural reason for being imposed.) X is a known N × K linear operator (design matrix) that can be either fixed or stochastic, y is the vector of noisy observations, and ε is a noise vector. Throughout this paper we assume that the components of the noise vector ε are i.i.d. random variables with zero mean and variance σ² with respect to a probability law dQn(v) on ℝ. We denote by Qs and Qn the prior probability measures reflecting our knowledge about β and ε respectively.
Given the indirect noisy observations y, our objective is to simultaneously recover β and the residuals ε so that Equation (1) holds. For that, we convert problem (1) into a generalized moment problem and consider the estimated β and ε as expected values of random variables z and v with respect to an unknown probability law P. Note that z is an auxiliary random variable, whereas v is the actual model for the noise perturbing the measurements. Formally:
Assumption 2.1.
The range of z is the constraint set Cs embodying the constraints that the unknown β is to satisfy. Similarly, we assume that the range of v is a closed convex set Cn where “s” and “n” stand for signal and noise respectively. Unless otherwise specified, and in line with tradition, it is assumed that v is symmetric about zero.
It is reasonable to assume that Cn is convex and symmetric in ℝ^N. Further, in some cases the researcher may know the statistical model of the noise; in that case, this model should be used. As stated earlier, Qs and Qn are the prior probability measures for β and ε respectively. To ensure that the expected values of z and v fall in C = Cs × Cn we need the following assumption.
Assumption 2.2.
The closures of the convex hulls of the supports of Qs and Qn are respectively Cs and Cn and we set dQ = dQs × dQn.
This assumption implies that for any strictly positive density ρ(z,v) we have:
∫_C (z, v) ρ(z, v) dQ(z, v) ∈ Cs × Cn
To solve problems like (1) with minimal assumptions one has to (i) incorporate some prior knowledge, or constraints, on the solution, (ii) specify a certain criterion to choose among the infinitely many solutions, or (iii) use both approaches. The different criteria used within the IT methods are all directly related to Shannon’s information (entropy) criterion (Golan [33]). The criterion used in the method developed and discussed here is Shannon’s entropy. For a detailed discussion and further background see, for example, the two special issues of the Journal of Econometrics [29,30].

2.2. The Solution

In what follows we explain how to transform the original linear problem into a generalized moment problem, or how to transform any constrained linear model like (1) into a problem consisting of finding an unknown density.
Instead of searching directly for the point estimates (β, ε)^t, we view this vector as the expected value of an auxiliary random vector (z, v)^t that takes values in the convex set Cs × Cn and is distributed according to some unknown auxiliary probability law dP(z, v). Thus:
(β, ε)^t = EP[(z, v)^t] = ∫_C (z, v)^t dP(z, v)    (2)
where EP denotes the expected value with respect to P.
To obtain P, we introduce the reference measure dQ(z, v) = dQs(z) dQn(v) on the Borel subsets of the product space C = Cs × Cn. Again, note that while C is binding, Qs describes one’s own belief/knowledge about the unknown β, whereas Qn describes the actual model for ε. With the above specification, problem (1) becomes:
Problem (1) restated:
We search for a density ρ(z, v) such that dP = ρdQ is a probability law on C and the linear relations:
y = EP[Xz + v] = Xβ* + ε*    (3)
are satisfied, where:
β* = EP[z] = ∫_C z ρ(z, v) dQ(z, v)  and  ε* = EP[v] = ∫_C v ρ(z, v) dQ(z, v)
Under this construction, β* is a random estimator of the unknown parameter vector β and ε* is an estimator of the noise.
Using dQ(z, v) = dQs(z) dQn(v) amounts to assuming an a priori independence of signal and noise. This is a natural assumption as the signal part is a mathematical artifact and the noise part is the actual model of the randomness/noise.
There are potentially many candidate densities ρ that satisfy (3). To find one (the least informative one given the data), we set up the following variational problem: find the ρ*(z, v) that maximizes the entropy functional SQ(ρ) defined by:
SQ(ρ) = −∫_C ρ(z, v) ln ρ(z, v) dQ(z, v)    (4)
on the following admissible class of densities:
P(C) = {ρ : C → [0, ∞) | dP = ρdQ is a proper probability satisfying (3)}    (5)
where “ln” stands for the natural logarithm. As usual, we extend x ln x to equal 0 at x = 0. If the maximization problem has a solution, the estimates satisfy the constraints and Equations (1) or (3). The familiar and classical answer to the problem of finding such a ρ* is expressed in the following lemma.
Lemma 2.1.
Assume that ρ is any positive density with respect to dQ and that ln ρ is integrable with respect to dP = ρdQ; then SQ(P) ≤ 0.
By the concavity of the logarithm and Jensen’s inequality it is immediate to verify that:
SQ(P) = ∫_C ln(1/ρ) dP ≤ ln ∫_C (1/ρ) dP = ln Q(C) = 0    (6)
Before applying this result to our model, we define A = [X I] as the N × (K + N) matrix obtained by juxtaposing X and the N × N identity matrix I. We now work with the matrix A, which allows us to consider the larger space rather than just the more traditional moment space. This is shown and discussed explicitly in the examples and derivations of Section 4, Section 5 and Section 6. For practical purposes, when facing a relatively small sample, the researcher may prefer working with A rather than with the sample moments, because for a finite sample the total information captured by using A is larger than when using the sample’s moments.
To apply Lemma 2.1 to our model, let ρ be any member of the exponential (parametric) family:
ρ(λ, z, v) = exp⟨λ, Aξ⟩ / Ω(λ),  with ξ = (z, v)^t    (7)
where ⟨a, b⟩ denotes the Euclidean scalar (inner) product of the vectors a and b, and λ ∈ ℝ^N collects the N free parameters that will play the role of Lagrange multipliers (one multiplier for each observation). The quantity Ω(λ) is the normalization function:
Ω(λ) = ∫_C exp⟨λ, Aξ⟩ dQ(ξ) = ω(A^tλ)    (8)
where ω(τ) = ∫_C exp⟨τ, ξ⟩ dQ(ξ) is the Laplace transform of Q. Next, taking logs in (7) and defining:
∑(λ) = ln Ω(λ) − ⟨λ, y⟩    (9)
Lemma 2.1 implies that ∑(λ) ≥ SQ(ρ) for any λ ∈ ℝ^N and for any ρ in the class of probability laws P(C) defined in (5). However, the problem is that we do not know whether the solution ρ*(λ, z, v) is a member of P(C) for some λ. Therefore, we search for a λ* such that ρ* = ρ(λ*) is in P(C) and λ* is a minimum. If such a λ* is found, then we have found a density (the unique one, for SQ is strictly convex in ρ) that maximizes the entropy, and by using the fact that β* = EP* [z] and ε* = EP* [v], the solution to (1), which is consistent with the data (3), is found. Formally, the result is contained in the following theorem. (Note that Kullback’s measure (Kullback [34]) is a particular case of SQ(P), with a sign change, when both P and Q have densities.)
Theorem 2.1.
Assume that Entropy 14 00892 i028 has a non-empty interior and that the minimum of the (convex) function ∑(λ) is achieved at λ*. Then, dP* = ρ(λ*, z, v)dQ satisfies the set of constraints (3) (equivalently (1)) and maximizes the entropy.
Consider the gradient of ∑(λ) at λ*. The equation to be solved to determine λ* is ∇∑(λ*) = 0, which coincides with Equation (3) when the gradient is written out explicitly.
Note that this is equivalent to minimizing (9), which is the concentrated likelihood-entropy function. Notice as well that ∑(λ*) = SQ(ρ*).
This theorem is practically equivalent to representing the estimator in terms of its estimating equations. Estimating equations (or functions) are the underlying equations from which the roots or solutions are derived. The logic for using these equations is that (i) they have a simpler form (e.g., a linear form for the LS estimator) than their roots, and (ii) they preserve the sampling properties of their roots (Durbin [35]). To see the direct relationship between estimating equations and the dual/concentrated model (extremum estimator), note that the estimating equations are the first order conditions of the respective extremum problem. The choice of estimating equations is appropriate whenever the first order conditions characterize the global solution to the (extremum) optimization problem, which is the case in the model discussed here.
Theorem 2.1 can be summarized as follows: in order to determine β and ε from (1), it is easier to transform the algebraic problem into the problem of obtaining the minimum of the convex function ∑(λ), and then use β* = EP*[z] and ε* = EP*[v] to compute the estimates β* and ε*. The above procedure is designed in such a way that y = Xβ* + ε* is automatically satisfied. Since the actual measurement noise is unknown, it is treated as a quantity to be determined, and (mathematically) both β and ε are treated as unknowns. The interpretations of the reconstructed residual ε* and the reconstructed β* are different: the latter is the unknown parameter vector we are after, while the former is the residual (reconstructed error) such that the linear Equation (1), y = Xβ* + ε*, is satisfied. With that background, we now discuss the basic properties of our model. For a detailed comparison of a large number of IT estimation methods see Golan [27] and the text of Mittelhammer, Judge and Miller [36].

3. Closed Form Examples

With the above formulation, we now turn to a number of relatively simple analytical examples. These examples demonstrate the advantages of our method and its simplicity. In Section 6 we provide additional closed form examples.

3.1. Normal Priors

In this example the index d takes the possible values (dimensions) K, N, or K + N depending on whether it relates to Cs (or z), to Cn (or v), or to both. Assume the reference prior dQ is that of a normal random vector with d × d [i.e., K × K, N × N or (N + K) × (N + K)] covariance matrix D, whose law has density (2π)^{−d/2}(det D)^{−1/2} exp(−(ξ − m)^tD^{−1}(ξ − m)/2), where m is the vector of prior means and is specified by the researcher. Next, we define the Laplace transform, ω(τ), of the normal prior. This transform involves the diagonal covariance matrix for the noise and signal models:
ω(τ) = exp(⟨τ, m⟩ + τ^tDτ/2)    (10)
Since Ω(λ) = ω(A^tλ), replacing τ by either X^tλ (for the signal) or by λ (for the noise vector) verifies that Ω(λ) is of a quadratic form in the exponent, and therefore the problem of minimizing ∑(λ) is just a quadratic minimization problem. In this case, no bounds are specified on the parameters. Instead, normal priors are used.
From (10) we get the concentrated model:
∑(λ) = ⟨λ, Am⟩ + λ^t(ADA^t)λ/2 − ⟨λ, y⟩
with a minimum at λ*, satisfying:
(ADA^t)λ* = y − Am
If M^# denotes the generalized inverse of M = ADA^t, then λ* = M^#(y − Am) and therefore:
(z*, v*)^t = EP*[(z, v)^t] = m + DA^tλ* = m + DA^tM^#(y − Am)
For the general case A = [X I] and:
D = diag(D1, D2) (with D1 the K × K signal prior covariance and D2 the N × N noise prior covariance),
the generalized entropy solution for the traditional linear model is:
β* = ms + D1X^tλ*
ε* = mn + D2λ*
and finally:
β* = ms + D1X^t(XD1X^t + D2)^#(y − Xms − mn)
Here m = (ms, mn)^t. See Appendix 2 for a detailed derivation.
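The normal-prior recipe above can be sketched numerically in a few lines. The snippet below assumes the reading of the formulas in which λ* = M^#(y − Am) with M = ADA^t and the posterior means are ξ* = m + DA^tλ*; the zero prior means, the prior variances and the simulated data are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Model y = X beta + eps with a normal reference prior N(m, D) on the
# concatenated vector xi = (z, v) of signal and noise variables.
N, K = 40, 3
X = rng.normal(size=(N, K))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

A = np.hstack([X, np.eye(N)])          # A = [X I], N x (K+N)

# Prior means and (diagonal) prior covariance: zero-mean priors here,
# with prior variances tau2 on the signal and sigma2 on the noise.
m = np.zeros(K + N)
tau2, sigma2 = 10.0, 0.1
D = np.diag(np.concatenate([np.full(K, tau2), np.full(N, sigma2)]))

# With a normal prior, ln Omega(lambda) is quadratic, so the concentrated
# entropy Sigma(lambda) is minimized at lambda* = M^# (y - A m), M = A D A^t.
M = A @ D @ A.T
lam = np.linalg.pinv(M) @ (y - A @ m)

# Posterior (maximum entropy) means: xi* = m + D A^t lambda*.
xi = m + D @ A.T @ lam
beta_star, eps_star = xi[:K], xi[K:]

# By construction the data constraint y = X beta* + eps* holds.
print(np.max(np.abs(X @ beta_star + eps_star - y)))
```

With zero prior means this reduces to a ridge-type estimator with penalty ratio sigma2/tau2, which is one way to see the connection to regularization discussed in Section 5.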

3.2. Discrete Uniform Priors — A GME Model

Consider now uniform priors, which yield essentially the GME method (Golan, Judge and Miller [1]). Jaynes’s classical ME estimator (Jaynes [3,4]) is a special case of the GME. Let the components of z take discrete values, and let Entropy 14 00892 i046 for 1 ≤ k ≤ K. Note that we allow the cardinality of each of these sets to vary. Next, define Entropy 14 00892 i047. A similar construction may be proposed for the noise terms, namely we put Entropy 14 00892 i048. Since the spaces are discrete, the information is described by the obvious σ-algebras, and both the prior and post-data measures will be discrete. As a prior on the signal space, we may consider:
Entropy 14 00892 i049
where a similar expression may be specified for the priors on Cn. Finally, we get:
Entropy 14 00892 i050
together with a similar expression for the Laplace transform of the noise prior. Notice that since the noise and signal are independent in the priors, this is also true for the post-data, so:
Entropy 14 00892 i051
Finally, Entropy 14 00892 i052 and Entropy 14 00892 i053. For detailed derivations and discussion of the GME see Golan, Judge and Miller [1].
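The discrete (GME-type) case can be illustrated by minimizing the concentrated entropy ∑(λ) = ln Ω(λ) − ⟨λ, y⟩ directly, exploiting the fact that the product prior makes ln Ω a sum of per-coordinate log partition functions. The support points, simulated data and sign convention below are illustrative assumptions, not specifications from the paper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(2)

# Simulated model y = X beta + eps with discrete uniform priors.
N, K = 30, 2
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

Z = np.linspace(-5.0, 5.0, 5)   # support points for each beta_k (uniform prior)
V = np.linspace(-1.0, 1.0, 3)   # support points for each eps_n (uniform prior)

def Sigma(lam):
    # Concentrated entropy Sigma(lam) = ln Omega(lam) - <lam, y>.  The prior
    # factorizes, so ln Omega splits into per-coordinate log partition
    # functions evaluated at tau = A^t lam = (X^t lam, lam).
    tau_s = X.T @ lam
    log_omega = (logsumexp(np.outer(tau_s, Z), axis=1) - np.log(Z.size)).sum()
    log_omega += (logsumexp(np.outer(lam, V), axis=1) - np.log(V.size)).sum()
    return log_omega - lam @ y

lam = minimize(Sigma, np.zeros(N), method="BFGS").x

# Point estimates = posterior means over the support points.
p_s = softmax(np.outer(X.T @ lam, Z), axis=1)   # K x |Z| signal probabilities
p_n = softmax(np.outer(lam, V), axis=1)         # N x |V| noise probabilities
beta_star, eps_star = p_s @ Z, p_n @ V
```

At the minimizer, the first order condition ∇∑(λ*) = 0 reproduces the data constraint y = Xβ* + ε* up to the solver tolerance.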

3.3. Signal and Noise Bounded Above and Below

Consider the case in which both β and ε are bounded above and below. This time we place a Bernoulli measure on the constraint space Cs and the noise space Cn. Let Cs = Π_j [aj, bj] and Cn = [−e, e]^N for the signal and noise bounds aj, bj and e respectively. The Bernoulli a priori measure on C = Cs × Cn is:
dQ(z, v) = Π_{j=1…K} ½[δ_{aj}(dzj) + δ_{bj}(dzj)] × Π_{l=1…N} ½[δ_{−e}(dvl) + δ_{e}(dvl)]
where δc(dz) denotes the (Dirac) unit point mass at the point c. Recalling that A = [X I], we now compute the Laplace transform ω(τ) of Q, which in turn yields Ω(λ) = ω(A^tλ):
Ω(λ) = Π_{j=1…K} ½(e^{τj aj} + e^{τj bj}) × Π_{l=1…N} cosh(eλl),  with τ = X^tλ
The concentrated entropy function is:
∑(λ) = Σ_{j=1…K} ln[(e^{τj aj} + e^{τj bj})/2] + Σ_{l=1…N} ln cosh(eλl) − ⟨λ, y⟩
The minimizer of this function is the Lagrange multiplier vector λ*. Once it has been found, then β*j = aj p*j(aj) + bj p*j(bj) and ε*l = e[p*l(e) − p*l(−e)]. Explicitly:
p*j(aj) = e^{τj aj} / (e^{τj aj} + e^{τj bj})
p*j(bj) = e^{τj bj} / (e^{τj aj} + e^{τj bj})
and τ = X^tλ*. Similarly:
p*l(e) = e^{eλ*l} / (e^{eλ*l} + e^{−eλ*l})
p*l(−e) = e^{−eλ*l} / (e^{eλ*l} + e^{−eλ*l})
These are, respectively, the maximum entropy probabilities that the auxiliary random variables zj attain the values aj or bj, and that the auxiliary random variables vl describing the error terms attain the values ±e. The point estimates can also be obtained as the expected values of z and v with respect to the post-data measure P*(λ, dξ) given by:
dP*(λ, ξ) = [exp⟨λ, Aξ⟩ / Ω(λ)] dQ(ξ)
Note that this model is the continuous version of the discrete GME model described earlier.
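A numerical sketch of this bounded (Bernoulli-prior) case, under the same assumed sign convention and with illustrative bounds and simulated data, shows how the estimates are forced to respect the bounds:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

N, K = 25, 2
X = rng.normal(size=(N, K))
beta_true = np.array([0.8, -0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

a = np.array([-1.0, -1.0])   # lower bounds a_j for beta_j
b = np.array([1.0, 1.0])     # upper bounds b_j for beta_j
e = 1.0                      # noise support {-e, +e}

def Sigma(lam):
    # Concentrated entropy for the Bernoulli prior: per-signal terms
    # ln[(exp(tau_j a_j) + exp(tau_j b_j))/2] with tau = X^t lam, per-noise
    # terms ln cosh(e lam_l), minus <lam, y>.
    tau = X.T @ lam
    sig = np.logaddexp(tau * a, tau * b) - np.log(2.0)
    noi = np.logaddexp(e * lam, -e * lam) - np.log(2.0)   # = ln cosh(e lam_l)
    return sig.sum() + noi.sum() - lam @ y

lam = minimize(Sigma, np.zeros(N), method="BFGS").x
tau = X.T @ lam

# Maximum entropy probabilities that z_j = b_j, and the point estimates.
p_b = 1.0 / (1.0 + np.exp(tau * (a - b)))        # P(z_j = b_j)
beta_star = a * (1.0 - p_b) + b * p_b            # always inside [a_j, b_j]
eps_star = e * np.tanh(e * lam)                  # always inside (-e, e)
```

Because β* is a convex combination of aj and bj, and ε*l = e tanh(eλ*l), the bounds are satisfied by construction, whatever the data.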

4. Main Results

4.1. Large Sample Properties

In this section we develop the basic statistical results. To develop these results for our generic IT estimator, we need tools that differ from the standard tools used for developing asymptotic theories (e.g., Mynbaev [31] or Mittelhammer et al. [36]).

4.1.1. Notations and First Order Approximation

Denote by Entropy 14 00892 i065 the estimator of the true β when the sample size is N. Throughout this section we add a subscript N to all quantities introduced in Section 2 to remind us that the size of the data set is N. We want to show that Entropy 14 00892 i066 and Entropy 14 00892 i067 as N → ∞ in some appropriate way (for some covariance V). We state here the basic notations, assumptions and results and leave the details to the Appendix. A complication is that when N varies, we are dealing with problems of different sizes (recall that λ is of dimension N in our generic model). To reduce all problems to the same size, let:
(1/N)X^ty = [(1/N)X^tX]β + (1/N)X^tε    (15)
The modified data vector and the modified error terms are K-dimensional (moment) vectors, and the modified design matrix is a K×K-matrix. Problem (15), call it the moment, or the stochastic moment, problem, can be solved using the above generic IT approach which reduces to minimizing the modified concentrated (dual) entropy function:
Entropy 14 00892 i069
where Entropy 14 00892 i070 and Entropy 14 00892 i071.
Assumption 4.1.
Assume that there exists an invertible K × K symmetric and positive definite matrix W such that Entropy 14 00892 i072. More precisely, assume that Entropy 14 00892 i073 as N → ∞. Assume as well that, for any N-vector v, Entropy 14 00892 i074 as N → ∞.
Recall that in finite dimensions all norms are equivalent, so convergence in any norm is equivalent to componentwise convergence. This implies that under Assumption 4.1 the vectors Entropy 14 00892 i075 converge to 0 in L2, and therefore in probability. To see the logic of that statement, recall that the vector εN has covariance matrix σ²IN. Therefore, Entropy 14 00892 i076, and Assumption 4.1 yields the above conclusion. (To keep notations simple, and without loss of generality, we discuss here the case of σ²IN.)
Corollary 4.1.
By Equation (15), Entropy 14 00892 i077. Let Entropy 14 00892 i078, where β is the true but unknown vector of parameters. Then Entropy 14 00892 i079 as N → ∞ (the proof is immediate).
Lemma 4.1.
Suppose Assumption 4.1 holds and that, for real a, Entropy 14 00892 i080. Then, for Entropy 14 00892 i070, Entropy 14 00892 i081 as N → ∞. Equivalently, Entropy 14 00892 i082 as N → ∞ weakly in Entropy 14 00892 i083 with respect to the appropriate induced measure.
Proof of lemma 4.1.
Note that for Entropy 14 00892 i070:
Entropy 14 00892 i084
This is equivalent to the assertion of the lemma.
Lemma 4.2.
Let Entropy 14 00892 i078. Then, under Assumption 4.1:
Entropy 14 00892 i085
Observe that the μ* that minimizes Entropy 14 00892 i086 satisfies Entropy 14 00892 i087. Since W is invertible, β admits the representation Entropy 14 00892 i088. Note that this last identity can be written as Entropy 14 00892 i089.
Next, we define the function:
Entropy 14 00892 i090
Assumption 4.2.
The function θ(τ) is invertible and continuously differentiable.
Observe that we also have Entropy 14 00892 i091. To relate the solution of problem (1) to that of problem (15), observe that Entropy 14 00892 i092 as well as Entropy 14 00892 i093, where ΩN and ∑N are the functions introduced in Section 2 for a problem of size N. To relate the solution of the problem of size K to that of the problem of size N, we have:
Lemma 4.3.
If Entropy 14 00892 i094 denotes the minimizer of Entropy 14 00892 i095, then Entropy 14 00892 i096 is the minimizer of ∑N(λ).
Proof of Lemma 4.3.
Recall that:
Entropy 14 00892 i097
From this, the desired result follows after a simple computation.
We write the post-data probability that solves problem (15) (or (1)) as:
Entropy 14 00892 i098
Recalling that Entropy 14 00892 i099 is the solution for the N-dimensional (data) problem and Entropy 14 00892 i100 is the solution for the moment problem, we have the following result:
Corollary 4.2.
With the notations introduced above and by Lemma 4.3 we have Entropy 14 00892 i101.
To state Lemma 4.4 we must consider the functions Entropy 14 00892 i102 defined by:
Entropy 14 00892 i103
Denote by Entropy 14 00892 i104 the measure with density Entropy 14 00892 i105 with respect to Q. The invertibility of the functions defined above is related to the non-singularity of their Jacobian matrices, which are the Entropy 14 00892 i104-covariances of ξ. These functions are invertible as long as those covariances are positive definite. The relationship among the above quantities is expressed in the following lemma:
Lemma 4.4.
With the notations introduced above and in (8), and recalling that we suppose D2 = σ2IN, we have: Entropy 14 00892 i106 and Entropy 14 00892 i107 where Entropy 14 00892 i108, Entropy 14 00892 i109 and θ' is the first derivative of θ.
The block structure of the covariance matrix results from the independence of the signal and the noise components in both the prior measure dQ and the post-data (maximum entropy) probability measure dP*.
Following the above, we assume:
Assumption 4.3.
The eigenvalues of the Hessian matrix Entropy 14 00892 i110 are uniformly (with respect to N and μ) bounded below away from zero.
Proposition 4.1.
Let ψN(y) and ψ(y) denote the compositional inverses of φN(μ) and φ(μ), respectively. Then, as N → ∞, (i) φN(μ) → φ(μ) and (ii) ψN(y) → ψ(y).
The proof is presented in the Appendix.

4.1.2. First Order Unbiasedness

Lemma 4.5.
(First Order Unbiasedness). With the notations introduced above and under Assumptions 4.1–4.3, assume furthermore that Entropy 14 00892 i111 as N →∞. Then up to o(1/N), Entropy 14 00892 i065 is an unbiased estimator of β.
The proof is presented in the Appendix.

4.1.3. Consistency

The following lemma and proposition provide results related to the large sample behavior of our generalized entropy estimator. For simplicity of the proof and without loss of generality, we suppose here that Entropy 14 00892 i112.
Lemma 4.6.
(Consistency in square mean). Under the same assumptions as in Lemma 4.5, since E[ε] = 0 and the ε are homoskedastic, Entropy 14 00892 i113 in square mean as N → ∞.
Next, we provide our main result of convergence in distribution.
Proposition 4.2.
(Convergence in distribution). Under the same assumptions as in Lemma 4.5 we have
Entropy 14 00892 i114 as N → ∞,
Entropy 14 00892 i115 as N → ∞,
where Entropy 14 00892 i116 stands for convergence in distribution (or law).
Both proofs are presented in the Appendix.

4.2. Forecasting

Once the Generalized Entropy (GE) estimated vector β* has been found, we can use it to predict future, as yet unobserved, values. If the additive noise (ε or v) is distributed according to the same prior Qn, and if future observations are determined by the design matrix Xf, then the possible future observations are described by a random variable yf given by Entropy 14 00892 i117. For example, if vf is centered (on 0), then Entropy 14 00892 i118 and:
Entropy 14 00892 i119
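As a minimal numerical sketch of this point forecast (the estimated vector and the future design matrix below are hypothetical stand-ins, not the paper's data), under a zero-centered noise prior the forecast is simply Xf β*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GE point estimate beta* and future design matrix X_f
beta_star = np.array([1.5, -0.7, 2.0])
X_f = rng.normal(size=(5, 3))

# With the noise prior v_f centered on zero, E[y_f] = X_f @ beta*
y_f_hat = X_f @ beta_star
```

The post-data density of the noise would additionally give prediction intervals around this point forecast.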
In the next section we contrast our estimator with other estimators. Then, in Section 6 we provide more analytic solutions for different priors.

5. Method Comparison

In this section we contrast our IT estimator with other estimators that are often used for estimating the location vector β in the noisy, inverse linear problem. We start with the least squares (LS) model, continue with the generalized LS (GLS), and then discuss the regularization method often used for ill-posed problems. We then contrast our estimator with a Bayesian one and with the Bayesian Method of Moments (BMOM). We also show the exact correspondence between our estimator and the other estimators under normal priors.

5.1. The Least Squares Methods

5.1.1. The General Case

We first consider the purely geometric/algebraic approach for solving the linear model (1). A traditional method consists of solving the variational problem:
Entropy 14 00892 i120
The rationale here is that, because of the noise ε, the data Entropy 14 00892 i121 may fall outside the range Entropy 14 00892 i122 of X, so the objective is to minimize that discrepancy. The minimizer Entropy 14 00892 i123 of (16) provides us with the LS estimates, which minimize the sum of squared errors, i.e., the squared distance from the data to Entropy 14 00892 i124. When (XtX) is invertible, Entropy 14 00892 i125. The reconstruction error Entropy 14 00892 i126 can be thought of as our estimate of the "minimal error in quadratic norm" of the measurement errors, or of the noise present in the measurements.
The optimization (16) can be carried out with respect to different norms. In particular, we could have considered Entropy 14 00892 i127. In this case we get the GLS solution Entropy 14 00892 i128 for any general (covariance) matrix D with blocks D1 and D2.
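To make the two estimators concrete, here is a small numerical sketch (the data are simulated, not the paper's): ordinary LS solves the normal equations, and GLS reweights them by the inverse of the noise covariance, taken here, purely for illustration, to be diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# LS: minimize ||y - X b||^2  =>  b = (X'X)^{-1} X'y  (X'X assumed invertible)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: minimize (y - X b)' D2^{-1} (y - X b), with an illustrative diagonal D2
D2_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=N))
beta_gls = np.linalg.solve(X.T @ D2_inv @ X, X.T @ D2_inv @ y)
```

At the respective minima the (weighted) residual is orthogonal to the columns of X, which is an easy property to verify numerically.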
If, on the other hand, our objective is to reconstruct simultaneously both the signal and the noise, we can rewrite (1) as:
Entropy 14 00892 i129
where A and ξ are as defined in Section 2. Since Entropy 14 00892 i130, Entropy 14 00892 i131 and the matrix A is of dimension N × (N + K), there are infinitely many solutions that satisfy the observed data in (1) (or (17)). To choose a single solution we solve the following model:
Entropy 14 00892 i132
In the more general case we can incorporate the covariance matrix to weigh the different components of ξ:
Entropy 14 00892 i133
where Entropy 14 00892 i134 is a weighted norm in the extended signal-noise space (C = Cs × Cn) and D can be taken to be the full covariance matrix composed of both D1 and D2 defined in Section 3.1. Under the assumption that M ≡ (ADAt) is invertible, the solution to the variational problem (19) is given by Entropy 14 00892 i135. This solution coincides with our Generalized Entropy formulation when normal priors centered about zero (c0 = 0) are imposed, as developed explicitly in Equation (14).
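A small sketch of this joint signal-noise reconstruction (simulated data; a diagonal D stands in for the covariance blocks D1, D2): the solution ξ* = DAt(ADAt)⁻¹y reproduces the data exactly, and it has the smallest D⁻¹-weighted norm among all vectors satisfying Aξ = y.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 4, 3
X = rng.normal(size=(N, K))
A = np.hstack([X, np.eye(N)])             # A = [X  I], of dimension N x (N + K)
y = rng.normal(size=N)

# Illustrative diagonal weight matrix standing in for D = diag(D1, D2)
D = np.diag(rng.uniform(0.5, 2.0, size=N + K))

M = A @ D @ A.T                            # assumed invertible
xi_star = D @ A.T @ np.linalg.solve(M, y)  # minimum D^{-1}-norm solution
```

Adding any element of the null space of A to ξ* keeps the data constraint satisfied but strictly increases the weighted norm.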
If, on the other hand, the problem is ill-posed (e.g., X is not invertible), then the solution is not unique, and a combination of the above two methods (16 and 18) can be used. This yields the regularization method consisting of finding β such that:
Entropy 14 00892 i136
is achieved (see, for example, Donoho et al. [25] for a nice discussion of regularization within the ME formulation). Traditionally, the positive penalization parameter α is specified to favor small-sized reconstructions, meaning that out of all possible reconstructions with a given discrepancy, those with the smallest norms are chosen. The norms in (20) can be chosen to be weighted, so that the model can be generalized to:
Entropy 14 00892 i137
The solution is:
Entropy 14 00892 i138
where D1 and D2 can be replaced by any weight matrices of interest. Using the first component of Entropy 14 00892 i139, we can state the following.
Lemma 5.1.
With the above notations, Entropy 14 00892 i140 for α = 1.
Proof of Lemma 5.1.
The condition Entropy 14 00892 i140 amounts to:
Entropy 14 00892 i141
independently of y. For this equality to hold, we must have α = 1.
The above result shows that if we weigh the discrepancy between the observed data (y) and its true value (Xβ) by the prior covariance matrix D2, the penalized GLS and our entropy solutions coincide for α = 1 under normal priors.
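As a numerical sanity check on the weighted penalized solution (again with simulated data and illustrative diagonal weights standing in for D1⁻¹ and D2⁻¹), the minimizer of the weighted objective satisfies the first-order stationarity condition:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 30, 4
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
alpha = 1.0

# Illustrative diagonal weights standing in for D1^{-1} and D2^{-1}
D1_inv = np.diag(rng.uniform(0.5, 2.0, size=K))
D2_inv = np.diag(rng.uniform(0.5, 2.0, size=N))

# Minimizer of (y - Xb)' D2^{-1} (y - Xb) + alpha * b' D1^{-1} b
beta_reg = np.linalg.solve(X.T @ D2_inv @ X + alpha * D1_inv,
                           X.T @ D2_inv @ y)
```

Setting α = 1 corresponds to the case in which, per Lemma 5.1, the penalized GLS solution matches the entropy solution under normal priors.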
The comparison of Entropy 14 00892 i142 with Entropy 14 00892 i143 is stated below.
Lemma 5.2.
With the above notations, Entropy 14 00892 i142 = Entropy 14 00892 i143 when the constraints are in terms of pure moments (zero moments).
Proof of Lemma 5.2.
If Entropy 14 00892 i142 = Entropy 14 00892 i143, then Entropy 14 00892 i144 for all y, which implies the following chain of identities:
Entropy 14 00892 i145
Clearly there are only two possibilities. First, if the noise components are not constant, D2 is invertible and therefore Xt must vanish (a trivial but uninteresting case). Second, if the variance of the noise component is zero, (1) becomes a pure linear inverse problem (i.e., we solve y = Xβ).

5.1.2. The Moments’ Case

Up to now, the comparison was done where the Generalized Entropy, GE, estimator was optimized over a larger space (A) than the other LS or GLS estimators. In other words, the constraints in the GE estimator are the data points rather than the moments. The comparison is easier if one performs it over similar spaces, namely using the sample's moments. This can easily be done if XtX is invertible, by re-specifying A to be the generic matrix A = [XtX Xt] rather than A = [X I]. Now, let y' ≡ Xty, X' ≡ XtX, and ε' ≡ Xtε; the problem is then represented as y' = X'β + ε'. In that case the condition for Entropy 14 00892 i142 = Entropy 14 00892 i143 is the trivial condition XtD2X = 0.
In general, when XtX is invertible, it is easy to verify that the solutions to variational problems of the type y' ≡ Xty = XtXβ are of the form (XtX)−1Xty. In one case, the problem is to find:
Entropy 14 00892 i146
while in the other case the solution consists of finding:
Entropy 14 00892 i147
Under this “moment” specification, the solutions to the three different methods described above (16, 23 and 24) coincide.
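A quick numerical check (simulated data) that the "moment" specification y′ = X′β, with y′ = Xty and X′ = XtX, gives back the familiar LS solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 40, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

y_m = X.T @ y            # y' = X'y
X_m = X.T @ X            # X' = X'X, assumed invertible

beta_moment = np.linalg.solve(X_m, y_m)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

The two vectors agree to machine precision, which is the sense in which the methods coincide under the moment specification.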

5.2. The Basic Bayesian Method

Under the Bayesian approach we may think of our problem in the following way. Assume, as before, that Cn and Cs are closed convex subsets of Entropy 14 00892 i148 and Entropy 14 00892 i083 respectively and that Entropy 14 00892 i149 and Entropy 14 00892 i150. For the rest of this section, the priors gs(z), gn(v) will have their usual Bayesian interpretation. For a given z, we think of y = Xz + v as a realization of the random variable Y = Xz + V. Then, Entropy 14 00892 i151. The joint density gy,z(y,z) of Y and Z, where Z is distributed according to the prior Qs, is:
Entropy 14 00892 i152
The marginal distribution of y is Entropy 14 00892 i153 and therefore by Bayes Theorem the posterior (post-data) conditional Entropy 14 00892 i154 is Entropy 14 00892 i155 from which:
Entropy 14 00892 i156
As usual Entropy 14 00892 i157 minimizes Entropy 14 00892 i158 where Z and Y are distributed according to gy,z (y,z). The conditional covariance matrix:
Entropy 14 00892 i159
is such that:
Entropy 14 00892 i160
where Var(Z|y) is the total variance of the K random variates z in Z. Finally, it is important to emphasize here that the Bayesian approach provides us with a whole range of tools for inference, forecasting, model averaging, posterior intervals, etc. In this paper, however, the focus is on estimation and on the basic comparison of our GE method with other methods under the notations and formulations developed here. Extensions to testing and inference are left for future work.

5.2.1. A Standard Example: Normal Priors

As before, we view β and ε as realizations of random variables Z and V having the informative normal “a priori” (priors for signal and noise) distributions:
Entropy 14 00892 i161
Entropy 14 00892 i162
For notational convenience we assume that both Z and V are centered on zero and independent, and both covariance matrices D1 and D2 are strictly positive definite. For comparison purposes, we are using the same notation as in Section 3. The randomness is propagated to the data Y such that the conditional density (or the conditional priors on y) of Y is:
Entropy 14 00892 i163
Then, the marginal distribution of Y is Entropy 14 00892 i164. The conditional distribution of Z given Y is easy to obtain under the normal setup. Thus, the post-data distribution of the signal, β, given the data y is:
Entropy 14 00892 i165
where Entropy 14 00892 i166 and Entropy 14 00892 i167. That is, the posterior (post-data) distribution of Z has been updated by the data, relative to the prior. Finally, the post-data expected value of Z is given by:
Entropy 14 00892 i168
This is the traditional Bayesian solution for the linear regression using the support spaces for both signal and noise within the framework developed here. As before, one can compare this Bayesian solution with our Generalized Entropy solution. Equation (28) is comparable with our solution (14) for z0 = 0 which is the Generalized Entropy method with normal priors and center of supports equal zero. In addition, it is easy to see that the Bayesian solution (28) coincides with the penalized GLS (model (24)) for α = 1.
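The normal post-data mean can be written in two algebraically equivalent forms, related by the matrix inversion lemma. The sketch below, with simulated data and illustrative diagonal D1, D2, checks that they agree; this is a generic identity for centered normal priors, not a transcription of Equation (28).

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 20, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

# Illustrative prior covariances for the signal (D1) and the noise (D2)
D1 = np.diag(rng.uniform(0.5, 2.0, size=K))
D2 = np.diag(rng.uniform(0.5, 2.0, size=N))

# "Wiener filter" form of the posterior mean: D1 X'(X D1 X' + D2)^{-1} y
mean_data_form = D1 @ X.T @ np.linalg.solve(X @ D1 @ X.T + D2, y)

# Equivalent "information" form: (X'D2^{-1}X + D1^{-1})^{-1} X'D2^{-1} y
mean_info_form = np.linalg.solve(
    X.T @ np.linalg.inv(D2) @ X + np.linalg.inv(D1),
    X.T @ np.linalg.inv(D2) @ y)
```

The second form makes the connection to the penalized GLS solution with α = 1 transparent.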
A few comments on these brief comparisons are in order. First, under both approaches the complete posterior (or post-data) density is estimated, and not only the posterior mean, though under the GE estimator the post-data density is related to the pre-specified spaces and priors. (Recall that the Bayesian posterior means are specific to a particular loss function.) Second, the agreement between the Bayesian result and the minimizer of (24) with α = 1 assumes a known value of Entropy 14 00892 i169, which is contained in D2. In the Bayesian result Entropy 14 00892 i169 is marginalized, so it is not conditional on that parameter. Therefore, with a known value of Entropy 14 00892 i169, both estimators are the same.
There are two reasons for the equivalence of the three methods (GE, Bayes and Penalized GLS). The first is that there are no binding constraints imposed on the signal and the noise. The second is the choice of imposing the normal densities as informative priors for both signal and noise. In fact, this result is standard in inverse problem theory where L is known as the Wiener filter (see for example Bertero and Boccacci [37]). In that sense, the Bayesian technique and the GE technique have some procedural ingredients in common, but the distinguishing factor is the way the posterior (post-data) is obtained. (Note that “posterior” for the entropy method, means the “post data” distribution which is based on both the priors and the data, obtained via the optimization process). In one case it is obtained by maximizing the entropy functional while in the Bayesian approach it is obtained by a direct application of Bayes theorem. For more background and related derivation of the ME and Bayes rule see Zellner [38,39].

5.3. Comparison with the Bayesian Method of Moments (BMOM)

The basic idea behind Zellner’s BMOM is to avoid a likelihood function. This is done by maximizing the continuous (Shannon) entropy subject to the empirical moments of the data. This yields the most conservative (closest to uniform) post data density (Zellner [14,40,41,42,43]; Zellner and Tobias [15]). In that way the BMOM uses only assumptions on the realized error terms which are used to derive the post data density.
Building on the above references, assume (XtX)−1 exists; then the LS solution to (1) is Entropy 14 00892 i170, which is taken to be the post data mean with respect to the (as yet unknown) distribution (likelihood). This is equivalent to assuming Entropy 14 00892 i171 (the columns of X are orthogonal to the N × 1 vector E[V|Data]). To find g(z|Data), or in Zellner's notation g(β|Data), one applies the classical ME with the following constraints (information):
Entropy 14 00892 i172
Entropy 14 00892 i173
where Entropy 14 00892 i174 is based on the assumption that Entropy 14 00892 i175, or similarly under Zellner’s notation Entropy 14 00892 i176, and σ2 is a positive parameter. Then, the maximum entropy density satisfying these two constraints (and the requirement that it is a proper density) is:
Entropy 14 00892 i177
This is the BMOM post data density for the parameter vector with mean Entropy 14 00892 i178 under the two side conditions used here. If more side conditions are used, the density function g will not be normal. Information other than moments can also be incorporated within the BMOM.
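The fact that mean and covariance side conditions pin down a normal post-data density reflects a classical result: among all densities with a given variance, the normal has maximal differential entropy. A one-dimensional sketch, comparing the standard closed-form entropies of a normal and a variance-matched uniform (both formulas are textbook facts, not taken from the paper):

```python
import numpy as np

sigma = 1.3   # common standard deviation for both densities

# Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2)
h_normal = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

# A uniform on an interval of width w has variance w^2/12; matching the
# variance gives w = sigma*sqrt(12), with differential entropy ln(w)
h_uniform = np.log(sigma * np.sqrt(12.0))

# The normal dominates, as the maximum entropy principle requires
gap = h_normal - h_uniform
```

The gap (about 0.176 nats) is independent of σ, since differential entropy shifts by ln σ under scaling for both densities.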
Comparing Zellner’s BMOM with our Generalized Entropy method we note that the BMOM produces the post data density from which one can compute the vector of point estimates Entropy 14 00892 i178 of the unconstrained problem (1). Under the GE model, the solution Entropy 14 00892 i142 satisfies the data/constraints within the joint support space C. Further, for the GE construction there is no need to impose exact moment constraints, meaning it provides a more flexible post data density. Finally, under both methods one can use the post data densities to calculate the uncertainties around future (unobserved) observations.

6. More Closed Form Examples

In Section 3, we formulated three relatively simple closed form cases. In the current section, we extend our analytic solutions to a host of potential priors. This section demonstrates the capabilities of our proposed estimator. We do not intend here to formulate our model under all possible priors. We present our examples in such a way that the priors can be assigned for either the signal or the noise components. The different priors we discuss correspond to different prior beliefs: unbounded (unconstrained), bounded below, bounded above, or bounded below and above. The following set of examples, together with those in Section 3, represents different cases of commonly used prior distributions, and their corresponding partition functions. Specifically, the different cases are the Laplace (bilateral exponential), which is symmetric but with heavy tails, the Gamma distribution, which is bounded below and non-symmetric, the continuous and discrete uniform distributions, and the Bernoulli distribution, which allows an easy specification of a prior mean that is not at the center of the pre-specified supports. In all the examples below, the index d takes the possible values (dimensions) K, N, or K+N, depending on whether it relates to Cs (or z), to Cn (or v), or to both.
We note that information theoretic procedures were also used for producing priors (e.g., Jeffreys’, Berger and Bernardo’s, Zellner, etc.). In future work we will try to relate them to the procedure developed here. In these approaches, β is not always viewed as the mean, as given in Equation (2). For example, Jeffreys, Zellner (e.g., Zellner [41]) and others have used Cauchy priors and unbounded measure priors, for which the mean does not exist.

6.1. The Basic Formulation

Case 1. Bilateral Exponential—Laplace Distribution. Like the normal distribution, another possible unconstrained model is obtained if we take as reference measure a bilateral exponential, or Laplace distribution. This is useful for modeling distributions with tails heavier than the normal. The following derivation holds for both our generic model (captured via the generic matrix A) and for just the signal or noise parts separately. We only provide here the solution for the signal part.
In this case, the density of dQ is Entropy 14 00892 i179 The parameters Entropy 14 00892 i180 are the prior means and 1/2σj is the variance of each component. The Laplace transform of dQ is:
Entropy 14 00892 i181
Next, we use the relationship τ = Xtλ. (Note that under the generic formulation, instead of Xt, we can work with X*, which stands for either Xt, I or At.) We compute Ω(λ) via the Laplace transformation (8). It then follows that Entropy 14 00892 i182 where ω(t) is always finite and positive. For this relationship to be satisfied, Entropy 14 00892 i183 for all j = 1, 2, …, d. Finally, replacing τ by Xtλ yields D(Ω).
Next, minimizing the concentrated entropy function:
Entropy 14 00892 i184
by equating its gradient with respect to λ to 0, we obtain that at the minimum:
Entropy 14 00892 i185
Entropy 14 00892 i186
Finally, having solved for the optimal vector λ* that minimizes ∑(λ), and such that the previous identity holds, we can rewrite our model as:
Entropy 14 00892 i187
where rather than solve (31) directly, we make use of λ*, that minimizes ∑(λ) and satisfies (30).
As expected, the post-data has a well-defined Laplace distribution (Kotz et al. [44]), but this distribution is no longer symmetric, and the decay rate is modified by the data. Specifically:
Entropy 14 00892 i188
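The one-component fact behind Case 1 can be checked numerically. Assuming the common normalization with density (σ/2)e^{−σ|z−m|} (the paper's parameterization may differ by scaling), the transform is e^{mτ}σ²/(σ² − τ²), finite precisely when |τ| < σ:

```python
import numpy as np

m, sigma, tau = 0.3, 2.0, 0.8    # |tau| < sigma, so the transform is finite

# One-component Laplace density (sigma/2) * exp(-sigma * |z - m|)
z = np.linspace(m - 30.0, m + 30.0, 400_001)
density = 0.5 * sigma * np.exp(-sigma * np.abs(z - m))

# Trapezoid-rule quadrature of E[exp(tau * Z)]
vals = np.exp(tau * z) * density
dz = z[1] - z[0]
omega_numeric = dz * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

omega_closed = np.exp(m * tau) * sigma**2 / (sigma**2 - tau**2)
```

As τ approaches σ the integrand's tail decays ever more slowly and the transform blows up, which is the finiteness condition noted in the text.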
Case 2. Lower Bounds—Gamma Distribution. Suppose that the β’s are all bounded below by theory. Then, we can specify a random vector Z with values in the positive orthant translated by the lower-bound K-dimensional vector l, so Entropy 14 00892 i189, with each component Zj taking values in [lj,∞). Like related methods, we assume that each component Zj of Z is distributed on [lj,∞) according to a translated Entropy 14 00892 i190. With this in mind, a direct calculation yields:
Entropy 14 00892 i191
where the particular case of bj = 0 corresponds to the standard exponential distribution defined on [lj,∞). Finally, when τ is replaced by Xtλ, we get:
Entropy 14 00892 i192
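For the exponential special case (bj = 0), assuming the density σe^{−σ(z−l)} on [l, ∞) (an assumed parameterization; the paper's may differ), the transform is e^{lτ}σ/(σ − τ) for τ < σ, which a quick quadrature confirms:

```python
import numpy as np

l, sigma, tau = -1.0, 1.5, 0.5   # tau < sigma keeps the transform finite

# Translated exponential density sigma * exp(-sigma * (z - l)) on [l, inf)
z = np.linspace(l, l + 40.0, 400_001)
density = sigma * np.exp(-sigma * (z - l))

# Trapezoid-rule quadrature of E[exp(tau * Z)]
vals = np.exp(tau * z) * density
dz = z[1] - z[0]
omega_numeric = dz * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

omega_closed = np.exp(l * tau) * sigma / (sigma - tau)
```

The tail of the integrand decays like e^{−(σ−τ)(z−l)}, so truncating the integral at l + 40 introduces negligible error here.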
Case 3. Bounds on Signal and Noise. Consider the case where each component of Z and V takes values in some bounded interval [aj, bj]. A common choice for the bounds of the error supports in this case is the three-sigma rule (Pukelsheim [44]), where “sigma” is the empirical standard deviation of the sample analyzed (see, for example, Golan, Judge and Miller [1] for a detailed discussion). In this situation we provide two simple (and extreme) choices for the reference measure. The first is a uniform measure on [aj, bj], and the second is a Bernoulli distribution supported on aj and bj.

6.1.2. Uniform Reference Measure

In this case the reference (prior) measure dQ(z) is distributed according to the uniform density Entropy 14 00892 i193 and the Laplace transform of this density is:
Entropy 14 00892 i194
and ω(τ) is finite for every vector τ.

6.1.3. Bernoulli Reference Measure

In this case the reference measure is singular (with respect to the volume measure) and is given by dQ(z) = Entropy 14 00892 i195, where δc(dz) denotes the (Dirac) unit point mass at some point c, and where pj and qj do not have to sum up to one, yet they determine the weight within the bounded interval [aj, bj]. The Laplace transform of dQ is:
Entropy 14 00892 i196
where, again, ω(τ) is finite for all τ.
In this case, there is no common criterion that can be used to decide which a priori reference measure to choose. In many specific cases, we have noticed that a reconstruction with the discrete Bernoulli prior of Entropy 14 00892 i197 yields estimates that are very similar to the continuous uniform prior.
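For a single component on [a, b], the two bounded reference measures have the standard one-dimensional transforms (e^{bτ} − e^{aτ})/(τ(b − a)) for the uniform and pe^{aτ} + qe^{bτ} for the Bernoulli; (35) and (36) are their d-fold products. A sketch comparing them and checking the uniform formula by quadrature, with illustrative parameter values:

```python
import numpy as np

a, b, tau = -1.0, 2.0, 0.7

# Uniform reference measure on [a, b]
omega_unif = (np.exp(b * tau) - np.exp(a * tau)) / (tau * (b - a))

# Bernoulli reference measure with weights p, q on the endpoints a, b
p, q = 0.5, 0.5
omega_bern = p * np.exp(a * tau) + q * np.exp(b * tau)

# Quadrature check of the uniform transform (finite for every tau)
z = np.linspace(a, b, 200_001)
vals = np.exp(tau * z) / (b - a)
dz = z[1] - z[0]
omega_unif_num = dz * (vals.sum() - 0.5 * (vals[0] + vals[-1]))
```

Because the support is bounded, both transforms are finite for all τ, unlike the Laplace and Gamma cases above.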
Case 4. Symmetric Bounds. This is a special case of Case 3 above for aj= −cj and bj = cj for positive cj’s. The corresponding versions of (35) and (36), the uniform and Bernoulli, are respectively:
Entropy 14 00892 i198
Entropy 14 00892 i199

6.2. The Full Model

Having developed the basic formulations and building blocks of our model, we note that the list can be amplified considerably and these building blocks can be assembled into a variety of combinations. We already demonstrated such a case in Section 3.3. We now provide such an example.

6.2.1. Bounded Parameters and Normally Distributed Errors

Consider the common case of naturally bounded signal and normally distributed errors. This case combines Case 2 of Section 6.1 together with the normal case discussed in Section 3.1. Let Entropy 14 00892 i200, but we impose no constraints on the ε. From Section 2, Entropy 14 00892 i201 with Entropy 14 00892 i202, and Entropy 14 00892 i203. The signal component was formulated earlier, while Entropy 14 00892 i204. Using A = [X I] we have Entropy 14 00892 i205 for the N-dimensional vector λ, and therefore Entropy 14 00892 i206. The maximal entropy probability measures (post-data) are:
Entropy 14 00892 i207
where λ* is found by minimizing the concentrated entropy function:
Entropy 14 00892 i208
Entropy 14 00892 i209 and l determines the “shift” for each coordinate. (For example, if Entropy 14 00892 i210, then Entropy 14 00892 i211, or for the simple heteroscedastic case, we have Entropy 14 00892 i211). Finally, once λ* is found, we get:
Entropy 14 00892 i212
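To illustrate how a concentrated entropy function of this kind is minimized in practice, here is a toy sketch (not the paper's algorithm or data): the signal prior is uniform on [0, 1] for each coordinate, the noise prior is N(0, σ²), and λ* is found by plain gradient descent on the dual Σ(λ) = ln E_Q[e^{⟨λ, Aξ⟩}] − ⟨λ, y⟩ (one standard sign convention), whose gradient is the fitted mean E_{P*}[Aξ] minus y. All names and tuning constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 25, 2
X = rng.normal(size=(N, K))
beta_true = np.array([0.7, 0.3])      # lies inside the uniform support [0, 1]
sigma = 0.05
y = X @ beta_true + sigma * rng.normal(size=N)

def signal_mean(tau):
    # d/dtau ln omega(tau) for a uniform prior on [0, 1]; tends to 1/2 at 0
    t = np.where(np.abs(tau) < 1e-8, 1e-8, tau)
    return np.exp(t) / np.expm1(t) - 1.0 / t

def dual_grad(lam):
    # Gradient of Sigma(lambda): fitted mean E_{P*}[X z + v] minus the data
    tau = X.T @ lam
    return X @ signal_mean(tau) + sigma**2 * lam - y

lam = np.zeros(N)
for _ in range(20_000):               # plain gradient descent (illustrative)
    lam -= 0.01 * dual_grad(lam)

beta_star = signal_mean(X.T @ lam)    # post-data mean of the signal
```

At the optimum the fitted moments match the data along the directions relevant to the signal, and β* recovers the true coefficients up to the noise level, while staying inside the prior support by construction.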

7. A Comment on Model Comparison

So far we have described our model, its properties and specified some closed form examples. The next question facing the researcher is how to decide on the most appropriate prior/model to use for a given set of data. In this section, we briefly comment on a few possible model comparison techniques.
A possible criterion for comparing estimations (reconstructions) resulting from different priors should be based on a comparison of the post-data entropies associated with the proposed setup. Implicit in the choice of priors is the choice of supports (Z and V), which in turn is dictated by the constraints imposed on the β’s. Assumption 2.1 means that these constraints are properly specified, namely there is no arbitrariness in the choice of Cs. The choice of a specific model for the noise involves two assumptions. The first one is about the support that reflects the actual range of the errors. The second is the choice of a prior describing the distribution of the noise within that support. To contrast two possible priors, we want to compare the reconstructions provided by the different models for the signal and noise variables. Within the information theoretic approach taken here, comparing the post-data entropies seems a reasonable choice.
From a practical point of view, the post-data entropies depend on the priors and the data in an explicit but nonlinear way. All we can say for certain is that for all models (or priors) the optimal solution is:
Entropy 14 00892 i213
where λ* has to be computed by minimizing the concentrated entropy function ∑(λ), and it is clear that the total entropy difference between the post-data and the priors is just the entropy difference for the signal plus the entropy difference for the noise. Note that Entropy 14 00892 i214 where d is the dimension of λ. This is the entropy ratio statistic, which is similar in nature to the empirical likelihood ratio statistic (e.g., Golan [27]). Rather than discussing this statistic here, we provide in Appendix 3 analytic formulations of Equation (39) for a large number of prior distributions. These formulations are based on the examples of earlier sections. Lastly, we note that in some cases, where the competing models are of different dimensions, a normalization of both statistics is necessary.

8. Conclusions

In this paper we developed a generic information theoretic method for solving a noisy, linear inverse problem. This method uses minimal a priori assumptions, and allows us to incorporate constraints and priors in a natural way for a whole class of linear inverse problems across the natural and social sciences. This inversion method is generic in the sense that it provides a framework for analyzing non-normal models, and it also performs well for data that are not of full rank.
We provided detailed analytic solutions for a large class of priors. We developed the first order properties as well as the large sample properties of that estimator. In addition, we compared our model to other methods such as the Least Squares, Penalized LS, Bayesian and the Bayesian Method of Moments.
The proposed model's main advantage over other LS and ML methods is its better performance (more stable, with lower variances) in (possibly small) finite samples. The smaller the sample and/or the more ill-behaved (e.g., collinear) the sample is, the better this method performs. However, if one knows the underlying distribution, and the sample is well behaved and large enough, traditional ML is the correct model to use. The other advantages of our proposed model (relative to the GME and other IT estimators) are that (i) we can impose different priors (discrete or continuous) for the signal and the noise, (ii) we estimate the full distribution of each of the two sets of unknowns (signal and noise), and (iii) our model is based on minimal assumptions.
In future research, we plan to study the small sample properties as well as develop statistics to evaluate the performance of the competing priors and models. We conclude by noting that the same framework developed here can be easily extended for nonlinear estimation problems. This is because all the available information enters as stochastic constraints within the constrained optimization problem.


Acknowledgments
We thank Bertrand Clarke, George Judge, Doug Miller, Arnold Zellner and Ed Greenberg as well as participants of numerous conferences and seminars for their comments and suggestions on earlier versions of this paper. Golan thanks the Edwin T. Jaynes International Center for Bayesian Methods and Maximum Entropy for partially supporting this project. We also want to thank the referees for their comments, for their careful reading of the manuscript and for pointing out reference [26] to us. Their comments improved the presentation considerably.


References
  1. Golan, A.; Judge, G.G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; John Wiley & Sons: New York, NY, USA, 1996. [Google Scholar]
  2. Gzyl, H.; Velásquez, Y. Linear Inverse Problems: The Maximum Entropy Connection; World Scientific Publishers: Singapore, 2011. [Google Scholar]
  3. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
  4. Jaynes, E.T. Information theory and statistical mechanics II. Phys. Rev. 1957, 108, 171–190. [Google Scholar] [CrossRef]
  5. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar] [CrossRef]
  6. Owen, A. Empirical likelihood for linear models. Ann. Stat. 1991, 19, 1725–1747. [Google Scholar] [CrossRef]
  7. Owen, A. Empirical Likelihood; Chapman & Hall/CRC: Boca Raton, FL, USA, 2001. [Google Scholar]
  8. Qin, J.; Lawless, J. Empirical likelihood and general estimating equations. Ann. Stat. 1994, 22, 300–325. [Google Scholar] [CrossRef]
  9. Smith, R.J. Alternative semi parametric likelihood approaches to GMM estimations. Econ. J. 1997, 107, 503–510. [Google Scholar] [CrossRef]
  10. Newey, W.K.; Smith, R.J. Higher order properties of GMM and generalized empirical likelihood estimators. Department of Economics, MIT: Cambridge, MA, USA, Unpublished work, 2002. [Google Scholar]
  11. Kitamura, Y.; Stutzer, M. An information-theoretic alternative to generalized method of moment estimation. Econometrica 1997, 66, 861–874. [Google Scholar] [CrossRef]
  12. Imbens, G.W.; Johnson, P.; Spady, R.H. Information-theoretic approaches to inference in moment condition models. Econometrica 1998, 66, 333–357. [Google Scholar] [CrossRef]
  13. Zellner, A. Bayesian Method of Moments/Instrumental Variables (BMOM/IV) analysis of mean and regression models. In Prediction and Modeling Honoring Seymour Geisser; Lee, J.C., Zellner, A., Johnson, W.O., Eds.; Springer Verlag: New York, NY, USA, 1996. [Google Scholar]
  14. Zellner, A. The Bayesian Method of Moments (BMOM): Theory and applications. In Advances in Econometrics; Fomby, T., Hill, R., Eds.; JAI Press: Greenwich, CT, USA, 1997; Volume 12, pp. 85–105. [Google Scholar]
  15. Zellner, A.; Tobias, J. Further results on the Bayesian method of moments analysis of multiple regression model. Int. Econ. Rev. 2001, 107, 1–15. [Google Scholar] [CrossRef]
  16. Gamboa, F.; Gassiat, E. Bayesian methods and maximum entropy for ill-posed inverse problems. Ann. Stat. 1997, 25, 328–350. [Google Scholar] [CrossRef]
  17. Gzyl, H. Maxentropic reconstruction in the presence of noise. In Maximum Entropy and Bayesian Studies; Erickson, G., Ryckert, J., Eds.; Kluwer: Dordrecht, The Netherlands, 1998. [Google Scholar]
  18. Golan, A.; Gzyl, H. A generalized maxentropic inversion procedure for noisy data. Appl. Math. Comput. 2002, 127, 249–260. [Google Scholar] [CrossRef]
  19. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for non-orthogonal problems. Technometrics 1970, 1, 55–67. [Google Scholar] [CrossRef]
  20. O’Sullivan, F. A statistical perspective on ill-posed inverse problems. Stat. Sci. 1986, 1, 502–527. [Google Scholar] [CrossRef]
  21. Breiman, L. Better subset regression using the nonnegative garrote. Technometrics 1995, 37, 373–384. [Google Scholar] [CrossRef]
  22. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar]
  23. Titterington, D.M. Common structures of smoothing techniques in statistics. Int. Stat. Rev. 1985, 53, 141–170. [Google Scholar] [CrossRef]
  24. Donoho, D.L.; Johnstone, I.M.; Hoch, J.C.; Stern, A.S. Maximum entropy and the nearly black object. J. R. Stat. Soc. Ser. B 1992, 54, 41–81. [Google Scholar]
  25. Besnerais, G.L.; Bercher, J.F.; Demoment, G. A new look at entropy for solving linear inverse problems. IEEE Trans. Inf. Theory 1999, 45, 1565–1578. [Google Scholar] [CrossRef]
  26. Bickel, P.; Li, B. Regularization methods in statistics. Test 2006, 15, 271–344. [Google Scholar] [CrossRef]
  27. Golan, A. Information and entropy econometrics—A review and synthesis. Found. Trends Econometrics 2008, 2, 1–145. [Google Scholar] [CrossRef]
  28. Fomby, T.B.; Hill, R.C. Advances in Econometrics; JAI Press: Greenwich, CT, USA, 1997. [Google Scholar]
  29. Golan, A. (Ed.) Special Issue on Information and Entropy Econometrics (Journal of Econometrics); Elsevier: Amsterdam, The Netherlands, 2002; Volume 107, Issues 1–2, pp. 1–376.
  30. Golan, A.; Kitamura, Y. (Eds.) Special Issue on Information and Entropy Econometrics: A Volume in Honor of Arnold Zellner (Journal of Econometrics); Elsevier: Amsterdam, The Netherlands, 2007; Volume 138, Issue 2, pp. 379–586.
31. Mynbayev, K.T. Short-Memory Linear Processes and Econometric Applications; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
32. Aster, R.C.; Borchers, B.; Thurber, C.H. Parameter Estimation and Inverse Problems; Elsevier: Amsterdam, The Netherlands, 2003. [Google Scholar]
  33. Golan, A. Information and entropy econometrics—Editor’s view. J. Econom. 2002, 107, 1–15. [Google Scholar] [CrossRef]
  34. Kullback, S. Information Theory and Statistics; John Wiley & Sons: New York, NY, USA, 1959. [Google Scholar]
  35. Durbin, J. Estimation of parameters in time-series regression models. J. R. Stat. Soc. Ser. B 1960, 22, 139–153. [Google Scholar]
  36. Mittelhammer, R.; Judge, G.; Miller, D. Econometric Foundations; Cambridge Univ. Press: Cambridge, UK, 2000. [Google Scholar]
  37. Bertero, M.; Boccacci, P. Introduction to Inverse Problems in Imaging; CRC Press: Boca Raton, FL, USA, 1998. [Google Scholar]
  38. Zellner, A. Optimal information processing and Bayes theorem. Am. Stat. 1988, 42, 278–284. [Google Scholar]
  39. Zellner, A. Information processing and Bayesian analysis. J. Econom. 2002, 107, 41–50. [Google Scholar] [CrossRef]
  40. Zellner, A. Bayesian Method of Moments (BMOM) Analysis of Mean and Regression Models. In Modeling and Prediction; Lee, J.C., Johnson, W.D., Zellner, A., Eds.; Springer: New York, NY, USA, 1994; pp. 17–31. [Google Scholar]
  41. Zellner, A. Models, prior information, and Bayesian analysis. J. Econom. 1996, 75, 51–68. [Google Scholar] [CrossRef]
42. Zellner, A. Bayesian Analysis in Econometrics and Statistics: The Zellner View and Papers; Edward Elgar Publishing Ltd.: Cheltenham, UK, 1997; pp. 291–304, 308–318. [Google Scholar]
43. Kotz, S.; Kozubowski, T.; Podgórski, K. The Laplace Distribution and Generalizations; Birkhäuser: Boston, MA, USA, 2001. [Google Scholar]
  44. Pukelsheim, F. The three sigma rule. Am. Stat. 1994, 48, 88–91. [Google Scholar]

Appendix 1: Proofs

Proof of Proposition 4.1.
From Assumptions 4.1–4.3 we have:
Entropy 14 00892 i215
Note that if the Entropy 14 00892 i220-covariance of the noise component of ξ is Entropy 14 00892 i216, then the Entropy 14 00892 i217-covariance of ξ is the (N + K) × (N + K) matrix given by Entropy 14 00892 i218. Here Entropy 14 00892 i219 is the Entropy 14 00892 i220-covariance of the signal component of ξ. Again, from Assumptions 4.1–4.3 it follows that Entropy 14 00892 i221, which is the covariance of the signal component of ξ with respect to the limit probability Entropy 14 00892 i222. Therefore, φ is also invertible and Entropy 14 00892 i223. To verify the uniform convergence of ψN (y) towards ψ (y), note that:
Entropy 14 00892 i224
Proof of Lemma 4.5.
(First-order unbiasedness). Observe that for large N, keeping only the first term of the Taylor expansion, we have:
Entropy 14 00892 i225
after we drop the o(1/N) term. Keeping only the first term of the Taylor expansion, and invoking the assumptions of Lemma 4.5:
Entropy 14 00892 i226
Incorporating the model’s equations, we see that under the approximations made so far:
Entropy 14 00892 i227
Here we used the fact that Entropy 14 00892 i228 and Entropy 14 00892 i229 are the respective Jacobian matrices. First-order unbiasedness follows by taking expectations: Entropy 14 00892 i230
Proof of Lemma 4.6.
(Consistency in squared mean). With the same notation as above, consider Entropy 14 00892 i231. Using the representation of Lemma 4.5, Entropy 14 00892 i232, and computing the expected square norm indicated above, we obtain Entropy 14 00892 i233, which by Assumption 4.1 tends to 0 as N → ∞.
Proof of Proposition 4.2.
Part (a). The proof is based on Lemma 4.5. Notice that, under Qn, for any Entropy 14 00892 i234, Entropy 14 00892 i235, where Entropy 14 00892 i236 Since the components of ε are i.i.d. random variables, the standard approximations yield Entropy 14 00892 i237, where Entropy 14 00892 i238, and therefore the law of Entropy 14 00892 i239 concentrates at 0 asymptotically. This completes Part (a).
Part (b). This part is similar to the previous proof, except that now the Entropy 14 00892 i240 factor in the exponent changes the result to Entropy 14 00892 i241 as N → ∞, from which assertion (b) of the proposition follows by the standard continuity theorem.
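The lemmas above assert first-order unbiasedness and mean-square consistency of the entropic estimator. Since the closed-form expressions in this appendix survive only as equation images, the following Python sketch illustrates the qualitative content of Lemma 4.6 with a hypothetical stand-in: under normal priors the entropic solution reduces to a Tikhonov/ridge-type closed form (cf. Appendix 2), and its mean squared error shrinks as N grows. The function name, prior parameters, and simulation design are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropic_estimate(X, y, z0, tau2, sigma2):
    # Hypothetical stand-in: with normal priors N(z0, tau2*I) on beta and
    # N(0, sigma2*I) on the noise, the posterior-mode estimate takes the
    # familiar Tikhonov/ridge closed form.
    K = X.shape[1]
    M = X.T @ X / sigma2 + np.eye(K) / tau2
    return np.linalg.solve(M, X.T @ y / sigma2 + z0 / tau2)

beta_true = np.array([1.0, -2.0])
mses = []
for N in (50, 500, 5000):
    errs = []
    for _ in range(200):
        x = rng.uniform(-1, 1, N)
        X = np.column_stack([np.ones(N), x])
        y = X @ beta_true + rng.normal(0, 0.5, N)
        b = entropic_estimate(X, y, np.zeros(2), tau2=10.0, sigma2=0.25)
        errs.append(np.sum((b - beta_true) ** 2))
    mses.append(np.mean(errs))
print(mses)  # mean squared error shrinks as N grows
```

The decreasing sequence of mean squared errors mirrors the N → ∞ limit established in the proof of Lemma 4.6.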

Appendix 2: Normal Priors — Derivation of the Basic Linear Model

Consider the linear model yi = a + bxi + εi, where X = (1, x) and β = (a, b)ᵗ. We assume that (i) both Qs and Qn are normal and (ii) Qs and Qn are independent, so that the Laplace transform is just (10). Recall that t = Aᵗλ and that in the generic model A = [X I] = [1 x I], where A is an N × (N + 2) matrix (N × (N + K) in the general model with K > 2). The log of the normalization factor of the post-data, Ω(λ), is:
Entropy 14 00892 i242
Building on (11), the concentrated (dual) entropy function is:
Entropy 14 00892 i243
where Entropy 14 00892 i244. Solving for λ*, Entropy 14 00892 i245, yields:
Entropy 14 00892 i246
and finally, Entropy 14 00892 i247. Explicitly, M is:
Entropy 14 00892 i248
Entropy 14 00892 i249
and 11t = 1N.
Next, we solve for the optimal β and ε. Recalling that the optimal solution is Entropy 14 00892 i247 and Entropy 14 00892 i250, and following the derivations of Section 3, we get:
Entropy 14 00892 i251
Entropy 14 00892 i252
Entropy 14 00892 i253
Rewriting the exponent in the numerator as:
Entropy 14 00892 i254
and incorporating it in dP* yields:
Entropy 14 00892 i255
where the second right-hand side term equals 1. Finally:
Entropy 14 00892 i256
To check our solution, note that Entropy 14 00892 i257, so:
Entropy 14 00892 i258
and finally:
Entropy 14 00892 i259
which is (14), where B = y − Ac0. Within the basic model, it is clear that Entropy 14 00892 i260, or Entropy 14 00892 i261. In the natural case where the errors’ priors are centered at zero (v0 = 0), Entropy 14 00892 i262 and Entropy 14 00892 i263. If in addition z0 = 0, then Entropy 14 00892 i264.
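As a numerical sanity check on derivations of this kind: the equivalence between a "dual" solution of the sort obtained above (solve a small system in λ and map back) and a direct "primal" regularized solution rests, in the all-normal case, on the push-through (Woodbury-type) matrix identity. The sketch below verifies that identity numerically; D1 and D2 are generic stand-ins for the prior covariances of the signal and the noise, not the specific quantities of Equation (14).

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
X = rng.normal(size=(N, K))
D1 = np.diag(rng.uniform(0.5, 2.0, K))   # stand-in prior covariance of the signal
D2 = np.diag(rng.uniform(0.5, 2.0, N))   # stand-in prior covariance of the noise

# Push-through identity:
#   (D1^-1 + X' D2^-1 X)^-1 X' D2^-1  =  D1 X' (X D1 X' + D2)^-1
lhs = np.linalg.solve(np.linalg.inv(D1) + X.T @ np.linalg.inv(D2) @ X,
                      X.T @ np.linalg.inv(D2))
rhs = D1 @ X.T @ np.linalg.inv(X @ D1 @ X.T + D2)
ok = np.allclose(lhs, rhs)
print(ok)
```

The left-hand side works in the K-dimensional signal space while the right-hand side works in the N-dimensional data space; their agreement is what lets one pass between the concentrated (dual) problem in λ and the explicit solution for β.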

Appendix 3: Model Comparisons — Analytic Examples

Here we provide the detailed analytical formulations of the dual (concentrated) GE model for different priors. These can be used to derive the entropy-ratio statistic based on Equation (39).
Example 1. The priors for both the signal and the noise are normal. In that case, the final post-data entropy, computed in Section 3, is:
Entropy 14 00892 i265
This seems to be the only case amenable to full analytical computation.
Example 2. Laplace prior for the state space variables and a uniform prior (on [−e, e]) for the noise term. The full post-data entropy is:
Entropy 14 00892 i266
Example 3. Normal prior for the state space variables and a uniform prior (on [−e, e]) for the noise. The post-data entropy is:
Entropy 14 00892 i267
where z0 is the center of the normal priors and D1 is the covariance matrix of the state space variables.
Example 4. A Gamma prior for the state space variables and a uniform prior (on [−e, e]) for the noise term. In this case the post-data entropy is:
Entropy 14 00892 i268
Example 5. The priors for the state space variables are Laplace and the priors for the noise are normal. Here, the post-data entropy is:
Entropy 14 00892 i269
Example 6. Both signal and noise have bounded supports, and we assume uniform priors for both. The post-data entropy is:
Entropy 14 00892 i270
Finally, we complete the set of examples with what is probably the most common case.
Example 7. Uniform priors on bounded intervals for the signal components and normal priors for the noise. The post-data entropy is:
Entropy 14 00892 i271
We reemphasize that this model comparison can only be used to compare models after each model has been completely worked out, and for a given data set. Finally, we have presented here the case of comparing the total entropies of the post-data to the priors, but as Equation (39) shows, one can instead compare the post-data and pre-data entropies of the signal alone, Entropy 14 00892 i272, or of the noise alone, Entropy 14 00892 i273.
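To make the comparison concrete in the one fully analytic case (Example 1, both priors normal), note that a post-data-to-prior entropy of the Kullback–Leibler type has a closed form for Gaussians. The sketch below is purely illustrative: the means and covariances are hypothetical stand-ins, not the quantities of Equation (39), but it shows the mechanics of ranking two candidate priors by their divergence from a common post-data distribution.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    # Closed-form KL( N(m0, S0) || N(m1, S1) ) for multivariate normals.
    k = len(m0)
    S1inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# A hypothetical post-data distribution for a 2-dimensional signal,
# and two candidate priors for the same signal.
post_m, post_S = np.array([1.0, -2.0]), 0.1 * np.eye(2)
priorA = (np.zeros(2), 10.0 * np.eye(2))     # diffuse prior centered at 0
priorB = (np.array([1.0, -2.0]), np.eye(2))  # prior centered near the post-data mean
klA = kl_gauss(post_m, post_S, *priorA)
klB = kl_gauss(post_m, post_S, *priorB)
# The prior closer to the post-data distribution yields the smaller divergence.
print(klA, klB)
```

In the spirit of the entropy-ratio comparison, the prior with the smaller post-data-to-prior divergence is the one better supported by the data, holding the likelihood component fixed.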
