
An Entropic Estimator for Linear Inverse Problems

by Amos Golan 1 and Henryk Gzyl 2,*
Department of Economics, Info-Metrics Institute, American University, 4400 Massachusetts Ave., Washington, DC 20016, USA
Centro de Finanzas, IESA, Caracas 1010, Venezuela
Author to whom correspondence should be addressed.
Entropy 2012, 14(5), 892-923;
Received: 29 February 2012 / Revised: 2 April 2012 / Accepted: 17 April 2012 / Published: 10 May 2012
(This article belongs to the Special Issue Concepts of Entropy and Their Applications)


In this paper we examine an Information-Theoretic method for solving noisy linear inverse estimation problems, one that encompasses a whole class of estimation methods under a single framework. Under this framework, prior information about the unknown parameters (when such information exists) and constraints on the parameters can be incorporated into the statement of the problem. The method builds on the basics of the maximum entropy principle and consists of transforming the original problem into the estimation of a probability density on an appropriate space naturally associated with the statement of the problem. This estimation method is generic in the sense that it provides a framework for analyzing non-normal models; it is easy to implement and is suitable for all types of inverse problems, such as those with small, ill-conditioned or noisy data. First order approximations, large sample properties and convergence in distribution are developed as well. Analytical examples and statistics for model comparison and evaluation, which are inherent to this method, are discussed and complemented with explicit examples.

1. Introduction

Researchers in all disciplines are often faced with small and/or ill-conditioned data. Unless much is known, or assumed, about the underlying process generating these data (the signal and the noise), such data lead to ill-posed, noisy (inverse) problems. Traditionally, these problems are solved by using parametric and semi-parametric estimators such as least squares, regularization and non-likelihood methods. In this work, we propose a semi-parametric information theoretic method for solving these problems while allowing the researcher to impose prior knowledge in a non-Bayesian way. The model developed here provides a major extension of the Generalized Maximum Entropy model of Golan, Judge and Miller [1] and provides new statistical results for the estimators discussed in Gzyl and Velásquez [2].
The overall purpose of this paper is fourfold. First, we develop a generic information theoretic method for solving linear, noisy inverse problems that uses minimal distributional assumptions. This method is generic in the sense that it provides a framework for analyzing non-normal models and it allows the user to incorporate prior knowledge in a non-Bayesian way. Second, we provide detailed analytic solutions for a number of possible priors. Third, using the concentrated (unconstrained) model, we are able to compare our estimator to other estimators, such as the Least Squares, regularization and Bayesian methods. Our proposed model is easy to apply and suitable for analyzing a whole class of linear inverse problems across the natural and social sciences. Fourth, we provide the large sample properties of our estimator.
To achieve our goals, we build on the current Information-Theoretic (IT) literature, which is founded on the Maximum Entropy (ME) principle (Jaynes [3,4]) and on Shannon’s [5] information measure (entropy), as well as on other generalized entropy measures. To clarify the relationship between the familiar linear statistical model and the approach we take here, we now briefly define our basic problem, discuss its traditional solution, and provide the basic logic and related literature we use to solve that problem.
Consider the basic (linear) problem of estimating the K-dimensional location parameter vector (signal, input) β given an N-dimensional observed sample (response) vector y and an N × K design (transfer) matrix X such that y = Xβ + ε, where ε is an N-dimensional random vector with E[ε] = 0 and some positive definite covariance matrix with a scale parameter σ². The statistical nature of the unobserved noise term is supposed to be known, and we suppose that the second moments of the noise are finite. The researcher’s objective is to estimate the unknown vector β with minimal assumptions on ε. Recall that under the traditional regularity conditions for the linear model (and for X of rank K), the least squares (LS) unconstrained estimator is β̂_LS = (X^tX)^{−1}X^ty, with covariance σ²(X^tX)^{−1}, where “t” stands for transpose.
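As a point of reference, the LS baseline can be computed directly. The following minimal sketch simulates a well-posed instance of the model; the design, sample size and noise scale are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated linear model y = X beta + eps (N = 50 observations, K = 3 parameters).
N, K = 50, 3
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares estimator: beta_hat = (X^t X)^{-1} X^t y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Estimated covariance of beta_hat: sigma^2 (X^t X)^{-1},
# with sigma^2 estimated from the residuals.
resid = y - X @ beta_ls
sigma2_hat = resid @ resid / (N - K)
cov_ls = sigma2_hat * np.linalg.inv(X.T @ X)

print(beta_ls)
```

For well-conditioned data such as this, LS recovers β accurately; the estimator proposed in the paper targets the under-determined and ill-conditioned cases where this baseline breaks down.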
Consider now the problem of estimating β and ε simultaneously while imposing minimal assumptions on the likelihood structure and while incorporating certain constraints on the signal, and perhaps on the noise. Further, rather than following the tradition of employing point estimators, consider estimating the empirical distributions of the unknown quantities βk and εn with the joint objectives of maximizing in-sample and out-of-sample prediction.
With these objectives, the problem is inherently under-determined and cannot be solved with the traditional least squares or likelihood approaches. Therefore, one must resort to a different principle. In the work done here, we follow the Maximum Entropy (ME) principle that was developed by Jaynes [3,4] for similar problems. The classical ME method consists of using a variational method to choose a probability distribution from a class of probability distributions having pre-assigned generalized moments.
In more general terms, consider the problem of estimating an unknown discrete probability distribution from a finite and possibly noisy set of observed generalized (sample) moments, that is, arbitrary functions of the data. These moments (and the fact that the distribution is proper) are supposed to be the only available information. Regardless of the level of noise in these observed moments, if the dimension of the unknown distribution is larger than the number of observed moments, there are infinitely many proper probability distributions satisfying this information. Such a problem is called an under-determined problem. Which one of the infinitely many solutions that satisfy the data should one choose? Within the class of information-theoretic (IT) methods, the chosen solution is the one that maximizes an information criterion: entropy. The procedure that we propose below to solve the estimation problem described above fits within that framework.
We construct our proposed estimator for solving the noisy, inverse, linear problem in two basic steps. In the first step, each unknown parameter (βk and εn) is constructed as the expected value of a certain random variable. That is, we view the possible values of the unknown parameters as values of random variables whose distributions are to be determined. We assume that the range of each such random variable contains the true unknown value of βk or εn respectively. This step involves two specifications. The first is the pre-specified support space for the two sets of parameters (finite/infinite and/or bounded/unbounded); at the outset of Section 2 we do this as part of the mathematical statement of the problem. The second is that any further information we may have about the parameters is incorporated into the choice of a prior (reference) measure on these supports. Since a model for the noise is usually supposed to be known, the statistical nature of the noise is incorporated at this stage; as far as the signal goes, this is an auxiliary construction.
In our second step, because imposing minimal assumptions on the likelihood implies that the problem is under-determined, we resort to the ME principle. This means that we need to convert the under-determined problem into a well-posed, constrained optimization. As in the classical ME method, the objective function in that constrained optimization problem is composed of K + N entropy functions: one for each of the K + N proper probability distributions (one for each signal component βk and one for each noise component εn). The constraints are just the observed information (data) and the requirement that all probability distributions are proper. Maximizing (simultaneously) the K + N entropies subject to the constraints yields the desired solution. This optimization yields a unique solution in terms of a unique set of proper probability distributions, which in turn yields the desired point estimates of βk and εn. Once the constrained model is solved, we construct the concentrated (unconstrained) model. In the method proposed here, we also allow the introduction of different priors corresponding to one’s beliefs about the data generating process and the structure of the unknown β’s.
Our proposed estimator is a member of the IT family of estimators. The members of this family include the Empirical Likelihood (EL), the Generalized EL (GEL), the Generalized Method of Moments (GMM), the Bayesian Method of Moments (BMOM), the Generalized Maximum Entropy (GME), and the Maximum Entropy in the Mean (MEM), and all are related to the classical Maximum Entropy (ME) (e.g., Owen [6,7]; Qin and Lawless [8]; Smith [9]; Newey and Smith [10]; Kitamura and Stutzer [11]; Imbens et al. [12]; Zellner [13,14]; Zellner and Tobias [15]; Golan, Judge and Miller [1]; Gamboa and Gassiat [16]; Gzyl [17]; Golan and Gzyl [18]). See also Gzyl and Velásquez [2], which builds upon Golan and Gzyl [18], where the synthesis was first proposed. If, in addition, the data are ill-conditioned, one often has to resort to the class of regularization methods (e.g., Hoerl and Kennard [19]; O’Sullivan [20]; Breiman [21]; Tibshirani [22]; Titterington [23]; Donoho et al. [24]; Besnerais et al. [25]). A reference for regularization in statistics is Bickel and Li [26]. If some prior information on the data generation process or on the model is available, Bayesian methods are often used. For a detailed review of the IT family of estimators, a historical perspective and a synthesis, see Golan [27]. For other background and related entropy and IT methods of estimation see the special volume of Advances in Econometrics (Fomby and Hill [28]) and the two special issues of the Journal of Econometrics [29,30]. For additional mathematical background see Mynbaev [31] and Asher, Borchers and Thurber [32].
Our proposed generic IT method provides an estimator for the parameters of the linear statistical model that reconciles some of the objectives achieved by each of the above methods. Like the philosophy behind the EL, we do not assume a pre-specified likelihood, but rather recover the (natural) weight of each observation via the optimization procedure (e.g., Owen [7]; Qin and Lawless [8]). Similar to regularization methods used for ill-behaved data, we follow the GME logic and use the pre-specified support space for each of the unknown parameters as a form of regularization (e.g., Golan, Judge and Miller [1]); the estimated parameters must fall within that space. However, unlike the GME, our method allows for infinitely large support spaces and continuous prior distributions. Like Bayesian approaches, we do use prior information, but we use these priors in a different way: in a way consistent with the basics of information theory and in line with the Kullback–Leibler entropy discrepancy measure. In that way, we are able to combine ideas from the different methods described above to yield a consistent IT estimator that is statistically and computationally efficient and easy to apply.
In Section 2, we lay out the basic formulation and then develop our basic model. In Section 3, we provide detailed closed-form examples for the case of normal priors and for other priors. In Section 4, we develop the basic statistical properties of our estimator, including a first order approximation. In Section 5, we compare our method with Least Squares, regularization and Bayesian methods, including the Bayesian Method of Moments; the comparisons are done under normal priors. An additional set of analytical examples, providing the formulation and solution for four basic priors (bounded, unbounded and a combination of both), is developed in Section 6. In Section 7, we comment on model comparison; detailed closed-form formulations for that section appear in an Appendix. We conclude in Section 8. The Appendices provide the proofs and detailed analytical formulations.

2. Problem Statement and Solution

2.1. Notation and Problem Statement

Consider the linear statistical model
y = Xβ + ε    (1)
where β is an unknown K-dimensional signal vector that cannot be directly measured but is required to satisfy some convex constraints expressed as β ∈ Cs, where Cs is a closed convex set. For example, Cs = {β : ak ≤ βk ≤ bk, k = 1,…,K} with constants ak < bk. (These constraints may come from constraints on Entropy 14 00892 i008, and may have a natural reason for being imposed.) X is a known N × K linear operator (design matrix) that can be either fixed or stochastic, y is the vector of noisy observations, and ε is a noise vector. Throughout this paper we assume that the components of the noise vector ε are i.i.d. random variables with zero mean and variance σ² with respect to a probability law dQn(v) on ℝ. We denote by Qs and Qn the prior probability measures reflecting our knowledge about β and ε respectively.
Given the indirect noisy observations y, our objective is to simultaneously recover β and the residuals ε so that Equation (1) holds. For that, we convert problem (1) into a generalized moment problem and consider the estimated β and ε as expected values of random variables z and v with respect to an unknown probability law P. Note that z is an auxiliary random variable, whereas v is the actual model for the noise perturbing the measurements. Formally:
Assumption 2.1.
The range of z is the constraint set Cs embodying the constraints that the unknown β is to satisfy. Similarly, we assume that the range of v is a closed convex set Cn where “s” and “n” stand for signal and noise respectively. Unless otherwise specified, and in line with tradition, it is assumed that v is symmetric about zero.
It is reasonable to assume that Cn is convex and symmetric in ℝ^N. Further, in some cases the researcher may know the statistical model of the noise; in that case, this model should be used. As stated earlier, Qs and Qn are the prior probability measures for β and ε respectively. To ensure that the expected values of z and v fall in C = Cs × Cn we need the following assumption.
Assumption 2.2.
The closures of the convex hulls of the supports of Qs and Qn are respectively Cs and Cn and we set dQ = dQs × dQn.
This assumption implies that for any strictly positive density ρ(z,v) we have:
∫_C (z, v) ρ(z, v) dQ(z, v) ∈ Cs × Cn
To solve problems like (1) with minimal assumptions one has to (i) incorporate some prior knowledge, or constraints, on the solution, (ii) specify a certain criterion to choose among the infinitely many solutions, or (iii) use both approaches. The different criteria used within the IT methods are all directly related to Shannon’s information (entropy) criterion (Golan [33]). The criterion used in the method developed and discussed here is Shannon’s entropy. For a detailed discussion and further background see, for example, the two special issues of the Journal of Econometrics [29,30].

2.2. The Solution

In what follows we explain how to transform the original linear problem into a generalized moment problem, or how to transform any constrained linear model like (1) into a problem consisting of finding an unknown density.
Instead of searching directly for the point estimates (β, ε)^t, we view this vector as the expected value of an auxiliary random vector (z, v)^t that takes values in the convex set Cs × Cn and is distributed according to some unknown auxiliary probability law dP(z, v). Thus:
(β, ε)^t = EP[(z, v)^t] = ∫_C (z, v)^t dP(z, v)    (2)
where EP denotes the expected value with respect to P.
To obtain P, we introduce the reference measure dQ(z, v) = dQs(z) dQn(v) on the Borel subsets of the product space C = Cs × Cn. Again, note that while C is binding, Qs describes one’s own belief/knowledge about the unknown β, whereas Qn describes the actual model for ε. With the above specification, problem (1) becomes:
Problem (1) restated:
We search for a density ρ(z, v) such that dP = ρdQ is a probability law on C and the linear relations:
y = EP[Xz + v] = Xβ* + ε*    (3)
are satisfied, where:
β* = EP[z] = ∫_C z ρ(z, v) dQ(z, v)  and  ε* = EP[v] = ∫_C v ρ(z, v) dQ(z, v)
Under this construction, β* is a random estimator of the unknown parameter vector β and ε* is an estimator of the noise.
Using dQ(z, v) = dQs(z) dQn(v) amounts to assuming an a priori independence of signal and noise. This is a natural assumption as the signal part is a mathematical artifact and the noise part is the actual model of the randomness/noise.
There are potentially many candidate densities ρ that satisfy (3). To find one (the least informative one given the data), we set up the following variational problem: find the ρ*(z, v) that maximizes the entropy functional SQ(ρ) defined by:
SQ(ρ) = −∫_C ρ(z, v) ln ρ(z, v) dQ(z, v)    (4)
on the following admissible class of densities:
P(C) = {ρ : C → [0, ∞) | dP = ρdQ is a proper probability satisfying (3)}    (5)
where “ln” stands for the natural logarithm. As usual, we extend x ln x to equal 0 at x = 0. If the maximization problem has a solution, the estimates satisfy the constraints and Equations (1) or (3). The familiar and classical answer to the problem of finding such a ρ* is expressed in the following lemma.
Lemma 2.1.
Assume that ρ is any positive density with respect to dQ and that ln ρ is integrable with respect to dP = ρdQ; then SQ(P) ≤ 0.
By the concavity of the logarithm and Jensen’s inequality it is immediate to verify that:
SQ(P) = ∫_C ln(1/ρ) dP ≤ ln ∫_C (1/ρ) dP = ln Q(C) = 0    (6)
Before applying this result to our model, we define A = [X I] as the N × (K + N) matrix obtained by juxtaposing X and the N × N identity matrix I. We now work with the matrix A, which allows us to consider the larger space rather than just the more traditional moment space. This is shown and discussed explicitly in the examples and derivations of Section 4, Section 5 and Section 6. For practical purposes, when facing a relatively small sample, the researcher may prefer working with A rather than with the sample moments, because for a finite sample the total information captured by using A is larger than when using the sample’s moments.
To apply Lemma 2.1 to our model, let ρ be any member of the exponential (parametric) family:
ρ(λ, z, v) = exp⟨λ, Aξ⟩ / Ω(λ),  with ξ = (z, v)^t    (7)
where ⟨a, b⟩ denotes the Euclidean scalar (inner) product of the vectors a and b, and λ ∈ ℝ^N collects the N free parameters that will play the role of Lagrange multipliers (one multiplier for each observation). The quantity Ω(λ) is the normalization function:
Ω(λ) = ∫_C exp⟨λ, Aξ⟩ dQ(ξ) = ω(A^tλ)    (8)
where ω(τ) = ∫_C exp⟨τ, ξ⟩ dQ(ξ) is the Laplace transform of Q. Next, taking logs in (7) and defining:
∑(λ) = ln Ω(λ) − ⟨λ, y⟩    (9)
Lemma 2.1 implies that ∑(λ) ≥ SQ(ρ) for any λ ∈ ℝ^N and for any ρ in the class of probability laws P(C) defined in (5). However, the problem is that we do not know whether the solution ρ*(λ, z, v) is a member of P(C) for some λ. Therefore, we search for a λ* such that ρ* = ρ(λ*) is in P(C) and λ* is a minimum. If such a λ* is found, then we have found a density (the unique one, for SQ is strictly convex in ρ) that maximizes the entropy, and by using the fact that β* = EP* [z] and ε* = EP* [v], the solution to (1), which is consistent with the data (3), is found. Formally, the result is contained in the following theorem. (Note that Kullback’s measure (Kullback [34]) is a particular case of SQ(P), with a sign change, when both P and Q have densities.)
Theorem 2.1.
Assume that Entropy 14 00892 i028 has a non-empty interior and that the minimum of the (convex) function ∑(λ) is achieved at λ*. Then, dP* = ρ(λ*, z, v)dQ satisfies the set of constraints (3) (equivalently (1)) and maximizes the entropy.
Consider the gradient of ∑(λ) at λ*. The equation to be solved to determine λ* is ∇∑(λ*) = 0, which coincides with Equation (3) when the gradient is written out explicitly.
Note that this is equivalent to minimizing (9), which is the concentrated likelihood-entropy function. Notice as well that ∑(λ*) = SQ(ρ*).
This theorem is practically equivalent to representing the estimator in terms of its estimating equations. Estimating equations (or functions) are the underlying equations from which the roots or solutions are derived. The logic for using these equations is that (i) they have a simpler form (e.g., a linear form for the LS estimator) than their roots, and (ii) they preserve the sampling properties of their roots (Durbin [35]). To see the direct relationship between estimating equations and the dual/concentrated model (extremum estimator), note that the estimating equations are the first order conditions of the respective extremum problem. The choice of estimating equations is appropriate whenever the first order conditions characterize the global solution to the (extremum) optimization problem, which is the case in the model discussed here.
Theorem 2.1 can be summarized as follows: in order to determine β and ε from (1), it is easier to transform the algebraic problem into the problem of obtaining the minimum of the convex function ∑(λ), and then use β* = EP*[z] and ε* = EP*[v] to compute the estimates β* and ε*. The above procedure is designed in such a way that y = Xβ* + ε* is automatically satisfied. Since the actual measurement noise is unknown, it is treated as a quantity to be determined, and (mathematically) both β and ε are treated as unknowns. The interpretations of the reconstructed residual ε* and the reconstructed β* are different: the latter is the unknown parameter vector we are after, while the former is the residual (reconstructed error) such that the linear Equation (1), y = Xβ* + ε*, is satisfied. With that background, we now discuss the basic properties of our model. For a detailed comparison of a large number of IT estimation methods see Golan [27] and the text of Mittelhammer, Judge and Miller [36].

3. Closed Form Examples

With the above formulation, we now turn to a number of relatively simple analytical examples. These examples demonstrate the advantages of our method and its simplicity. In Section 6 we provide additional closed form examples.

3.1. Normal Priors

In this example the index d takes the possible values (dimensions) K, N, or K + N depending on whether it relates to Cs (or z), to Cn (or v), or to both. Assume the reference prior dQ is that of a normal random vector with d × d [i.e., K × K, N × N or (N + K) × (N + K)] covariance matrix D, whose law has density (2π)^{−d/2}(det D)^{−1/2} exp(−(ξ − m)^tD^{−1}(ξ − m)/2), where m is the vector of prior means and is specified by the researcher. Next, we define the Laplace transform, ω(τ), of the normal prior. This transform involves the diagonal covariance matrix for the noise and signal models:
ω(τ) = exp(⟨τ, m⟩ + τ^tDτ/2)    (10)
Since Ω(λ) = ω(A^tλ), replacing τ by either X^tλ (for the signal) or by λ (for the noise vector) verifies that Ω(λ) is of a quadratic form in the exponent, and therefore the problem of minimizing ∑(λ) is just a quadratic minimization problem. In this case, no bounds are specified on the parameters. Instead, normal priors are used.
From (10) we get the concentrated model:
∑(λ) = ⟨λ, Am⟩ + λ^t(ADA^t)λ/2 − ⟨λ, y⟩
with a minimum at λ*, satisfying:
(ADA^t)λ* = y − Am
If M^# denotes the generalized inverse of M = ADA^t, then λ* = M^#(y − Am) and therefore:
(z*, v*)^t = EP*[(z, v)^t] = m + DA^tλ* = m + DA^tM^#(y − Am)
For the general case A = [X I] and:
D = diag(D1, D2) (with D1 the K × K signal prior covariance and D2 the N × N noise prior covariance),
the generalized entropy solution for the traditional linear model is:
β* = ms + D1X^tλ*
ε* = mn + D2λ*
and finally:
β* = ms + D1X^t(XD1X^t + D2)^#(y − Xms − mn)
Here m = (ms, mn)^t. See Appendix 2 for a detailed derivation.
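The normal-prior recipe above can be sketched numerically in a few lines. The snippet below assumes the reading of the formulas in which λ* = M^#(y − Am) with M = ADA^t and the posterior means are ξ* = m + DA^tλ*; the zero prior means, the prior variances and the simulated data are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Model y = X beta + eps with a normal reference prior N(m, D) on the
# concatenated vector xi = (z, v) of signal and noise variables.
N, K = 40, 3
X = rng.normal(size=(N, K))
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

A = np.hstack([X, np.eye(N)])          # A = [X I], N x (K+N)

# Prior means and (diagonal) prior covariance: zero-mean priors here,
# with prior variances tau2 on the signal and sigma2 on the noise.
m = np.zeros(K + N)
tau2, sigma2 = 10.0, 0.1
D = np.diag(np.concatenate([np.full(K, tau2), np.full(N, sigma2)]))

# With a normal prior, ln Omega(lambda) is quadratic, so the concentrated
# entropy Sigma(lambda) is minimized at lambda* = M^# (y - A m), M = A D A^t.
M = A @ D @ A.T
lam = np.linalg.pinv(M) @ (y - A @ m)

# Posterior (maximum entropy) means: xi* = m + D A^t lambda*.
xi = m + D @ A.T @ lam
beta_star, eps_star = xi[:K], xi[K:]

# By construction the data constraint y = X beta* + eps* holds.
print(np.max(np.abs(X @ beta_star + eps_star - y)))
```

With zero prior means this reduces to a ridge-type estimator with penalty ratio sigma2/tau2, which is one way to see the connection to regularization discussed in Section 5.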

3.2. Discrete Uniform Priors — A GME Model

Consider now uniform priors, which yield essentially the GME method (Golan, Judge and Miller [1]). Jaynes’s classical ME estimator (Jaynes [3,4]) is a special case of the GME. Let the components of z take discrete values, and let Entropy 14 00892 i046 for 1 ≤ k ≤ K. Note that we allow the cardinality of each of these sets to vary. Next, define Entropy 14 00892 i047. A similar construction may be proposed for the noise terms, namely we put Entropy 14 00892 i048. Since the spaces are discrete, the information is described by the obvious σ-algebras, and both the prior and post-data measures will be discrete. As a prior on the signal space, we may consider:
Entropy 14 00892 i049
where a similar expression may be specified for the priors on Cn. Finally, we get:
Entropy 14 00892 i050
together with a similar expression for the Laplace transform of the noise prior. Notice that since the noise and signal are independent in the priors, this is also true for the post-data, so:
Entropy 14 00892 i051
Finally, Entropy 14 00892 i052 and Entropy 14 00892 i053. For detailed derivations and discussion of the GME see Golan, Judge and Miller [1].
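The discrete (GME-type) case can be illustrated by minimizing the concentrated entropy ∑(λ) = ln Ω(λ) − ⟨λ, y⟩ directly, exploiting the fact that the product prior makes ln Ω a sum of per-coordinate log partition functions. The support points, simulated data and sign convention below are illustrative assumptions, not specifications from the paper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(2)

# Simulated model y = X beta + eps with discrete uniform priors.
N, K = 30, 2
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

Z = np.linspace(-5.0, 5.0, 5)   # support points for each beta_k (uniform prior)
V = np.linspace(-1.0, 1.0, 3)   # support points for each eps_n (uniform prior)

def Sigma(lam):
    # Concentrated entropy Sigma(lam) = ln Omega(lam) - <lam, y>.  The prior
    # factorizes, so ln Omega splits into per-coordinate log partition
    # functions evaluated at tau = A^t lam = (X^t lam, lam).
    tau_s = X.T @ lam
    log_omega = (logsumexp(np.outer(tau_s, Z), axis=1) - np.log(Z.size)).sum()
    log_omega += (logsumexp(np.outer(lam, V), axis=1) - np.log(V.size)).sum()
    return log_omega - lam @ y

lam = minimize(Sigma, np.zeros(N), method="BFGS").x

# Point estimates = posterior means over the support points.
p_s = softmax(np.outer(X.T @ lam, Z), axis=1)   # K x |Z| signal probabilities
p_n = softmax(np.outer(lam, V), axis=1)         # N x |V| noise probabilities
beta_star, eps_star = p_s @ Z, p_n @ V
```

At the minimizer, the first order condition ∇∑(λ*) = 0 reproduces the data constraint y = Xβ* + ε* up to the solver tolerance.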

3.3. Signal and Noise Bounded Above and Below

Consider the case in which both β and ε are bounded above and below. This time we place a Bernoulli measure on the constraint space Cs and the noise space Cn. Let Cs = Π_j [aj, bj] and Cn = [−e, e]^N for the signal and noise bounds aj, bj and e respectively. The Bernoulli a priori measure on C = Cs × Cn is:
dQ(z, v) = Π_{j=1…K} ½[δ_{aj}(dzj) + δ_{bj}(dzj)] × Π_{l=1…N} ½[δ_{−e}(dvl) + δ_{e}(dvl)]
where δc(dz) denotes the (Dirac) unit point mass at the point c. Recalling that A = [X I], we now compute the Laplace transform ω(τ) of Q, which in turn yields Ω(λ) = ω(A^tλ):
Ω(λ) = Π_{j=1…K} ½(e^{τj aj} + e^{τj bj}) × Π_{l=1…N} cosh(eλl),  with τ = X^tλ
The concentrated entropy function is:
∑(λ) = Σ_{j=1…K} ln[(e^{τj aj} + e^{τj bj})/2] + Σ_{l=1…N} ln cosh(eλl) − ⟨λ, y⟩
The minimizer of this function is the Lagrange multiplier vector λ*. Once it has been found, then β*j = aj p*j(aj) + bj p*j(bj) and ε*l = e[p*l(e) − p*l(−e)]. Explicitly:
p*j(aj) = e^{τj aj} / (e^{τj aj} + e^{τj bj})
p*j(bj) = e^{τj bj} / (e^{τj aj} + e^{τj bj})
and τ = X^tλ*. Similarly:
p*l(e) = e^{eλ*l} / (e^{eλ*l} + e^{−eλ*l})
p*l(−e) = e^{−eλ*l} / (e^{eλ*l} + e^{−eλ*l})
These are, respectively, the maximum entropy probabilities that the auxiliary random variables zj attain the values aj or bj, and that the auxiliary random variables vl describing the error terms attain the values ±e. The point estimates can also be obtained as the expected values of z and v with respect to the post-data measure P*(λ, dξ) given by:
dP*(λ, ξ) = [exp⟨λ, Aξ⟩ / Ω(λ)] dQ(ξ)
Note that this model is the continuous version of the discrete GME model described earlier.
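A numerical sketch of this bounded (Bernoulli-prior) case, under the same assumed sign convention and with illustrative bounds and simulated data, shows how the estimates are forced to respect the bounds:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)

N, K = 25, 2
X = rng.normal(size=(N, K))
beta_true = np.array([0.8, -0.3])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

a = np.array([-1.0, -1.0])   # lower bounds a_j for beta_j
b = np.array([1.0, 1.0])     # upper bounds b_j for beta_j
e = 1.0                      # noise support {-e, +e}

def Sigma(lam):
    # Concentrated entropy for the Bernoulli prior: per-signal terms
    # ln[(exp(tau_j a_j) + exp(tau_j b_j))/2] with tau = X^t lam, per-noise
    # terms ln cosh(e lam_l), minus <lam, y>.
    tau = X.T @ lam
    sig = np.logaddexp(tau * a, tau * b) - np.log(2.0)
    noi = np.logaddexp(e * lam, -e * lam) - np.log(2.0)   # = ln cosh(e lam_l)
    return sig.sum() + noi.sum() - lam @ y

lam = minimize(Sigma, np.zeros(N), method="BFGS").x
tau = X.T @ lam

# Maximum entropy probabilities that z_j = b_j, and the point estimates.
p_b = 1.0 / (1.0 + np.exp(tau * (a - b)))        # P(z_j = b_j)
beta_star = a * (1.0 - p_b) + b * p_b            # always inside [a_j, b_j]
eps_star = e * np.tanh(e * lam)                  # always inside (-e, e)
```

Because β* is a convex combination of aj and bj, and ε*l = e tanh(eλ*l), the bounds are satisfied by construction, whatever the data.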

4. Main Results

4.1. Large Sample Properties

In this section we develop the basic statistical results. To develop these results for our generic IT estimator, we need tools that differ from the standard tools used for developing asymptotic theories (e.g., Mynbaev [31] or Mittelhammer et al. [36]).

4.1.1. Notations and First Order Approximation

Denote by Entropy 14 00892 i065 the estimator of the true β when the sample size is N. Throughout this section we add a subscript N to all quantities introduced in Section 2 to remind us that the size of the data set is N. We want to show that Entropy 14 00892 i066 and Entropy 14 00892 i067 as N → ∞ in some appropriate way (for some covariance V). We state here the basic notations, assumptions and results and leave the details to the Appendix. A complication is that when N varies, we are dealing with problems of different sizes (recall that λ is of dimension N in our generic model). To reduce all problems to the same size, let:
(1/N)X^ty = [(1/N)X^tX]β + (1/N)X^tε    (15)
The modified data vector and the modified error terms are K-dimensional (moment) vectors, and the modified design matrix is a K×K-matrix. Problem (15), call it the moment, or the stochastic moment, problem, can be solved using the above generic IT approach which reduces to minimizing the modified concentrated (dual) entropy function:
Entropy 14 00892 i069
where Entropy 14 00892 i070 and Entropy 14 00892 i071.
Assumption 4.1.
Assume that there exists an invertible K × K symmetric and positive definite matrix W such that Entropy 14 00892 i072. More precisely, assume that Entropy 14 00892 i073 as N → ∞. Assume as well that, for any N-vector v, Entropy 14 00892 i074 as N → ∞.
Recall that in finite dimensions all norms are equivalent, so convergence in any norm is equivalent to componentwise convergence. This implies that under Assumption 4.1 the vectors Entropy 14 00892 i075 converge to 0 in L2, and therefore in probability. To see the logic of that statement, recall that the vector εN has covariance matrix σ²IN. Therefore, Entropy 14 00892 i076, and Assumption 4.1 yields the above conclusion. (To keep notations simple, and without loss of generality, we discuss here the case of σ²IN.)
Corollary 4.1.
By Equation (15), Entropy 14 00892 i077. Let Entropy 14 00892 i078, where β is the true but unknown vector of parameters. Then Entropy 14 00892 i079 as N → ∞ (the proof is immediate).
Lemma 4.1.
Suppose Assumption 4.1 holds and that, for real a, Entropy 14 00892 i080. Then, for Entropy 14 00892 i070, Entropy 14 00892 i081 as N → ∞. Equivalently, Entropy 14 00892 i082 as N → ∞ weakly in Entropy 14 00892 i083 with respect to the appropriate induced measure.
Proof of lemma 4.1.
Note that for Entropy 14 00892 i070:
Entropy 14 00892 i084
This is equivalent to the assertion of the lemma.
Lemma 4.2.
Let Entropy 14 00892 i078. Then, under Assumption 4.1:
Entropy 14 00892 i085
Observe that the μ* that minimizes Entropy 14 00892 i086 satisfies Entropy 14 00892 i087. Since W is invertible, β admits the representation Entropy 14 00892 i088. Note that this last identity can be written as Entropy 14 00892 i089.
Next, we define the function:
Entropy 14 00892 i090
Assumption 4.2.
The function θ(τ) is invertible and continuously differentiable.
Observe that we also have Entropy 14 00892 i091. To relate the solution of problem (1) to that of problem (15), observe that Entropy 14 00892 i092 as well as Entropy 14 00892 i093, where ΩN and ∑N are the functions introduced in Section 2 for a problem of size N. To relate the solution of the problem of size K to that of the problem of size N, we have:
Lemma 4.3.
If Entropy 14 00892 i094 denotes the minimizer of Entropy 14 00892 i095, then Entropy 14 00892 i096 is the minimizer of ∑N(λ).
Proof of Lemma 4.3.
Recall that:
Entropy 14 00892 i097
From this, the desired result follows after a simple computation.
We write the post-data probability that solves problem (15) (or (1)) as:
Entropy 14 00892 i098
Recalling that Entropy 14 00892 i099 is the solution for the N-dimensional (data) problem and Entropy 14 00892 i100 is the solution for the moment problem, we have the following result:
Corollary 4.2.
With the notations introduced above and by Lemma 4.3 we have Entropy 14 00892 i101.
To state Lemma 4.4 we must consider the functions Entropy 14 00892 i102 defined by:
Entropy 14 00892 i103
Denote by Entropy 14 00892 i104 the measure with density Entropy 14 00892 i105 with respect to Q. The invertibility of the functions defined above is related to the non-singularity of their Jacobian matrices, which are the Entropy 14 00892 i104-covariances of ξ. These functions are invertible as long as those covariances are positive definite. The relationship among the above quantities is expressed in the following lemma:
Lemma 4.4.
With the notations introduced above and in (8), and recalling that we suppose D2 = σ2IN, we have: Entropy 14 00892 i106 and Entropy 14 00892 i107 where Entropy 14 00892 i108, Entropy 14 00892 i109 and θ' is the first derivative of θ.
The block structure of the covariance matrix results from the independence of the signal and the noise components in both the prior measure dQ and the post-data (maximum entropy) probability measure dP*.
Following the above, we assume:
Assumption 4.3.
The eigenvalues of the Hessian matrix Entropy 14 00892 i110 are uniformly (with respect to N and μ) bounded below away from zero.
Proposition 4.1.
Let ψN(y) and ψ(y) denote the compositional inverses of φN(μ) and φ(μ), respectively. Then, as N → ∞, (i) φN(μ) → φ(μ) and (ii) ψN(y) → ψ(y).
The proof is presented in the Appendix.

4.1.2. First Order Unbiasedness

Lemma 4.5.
(First Order Unbiasedness). With the notations introduced above and under Assumptions 4.1–4.3, assume furthermore that Entropy 14 00892 i111 as N →∞. Then up to o(1/N), Entropy 14 00892 i065 is an unbiased estimator of β.
The proof is presented in the Appendix.

4.1.3. Consistency

The following lemma and proposition provide results related to the large sample behavior of our generalized entropy estimator. For simplicity of the proof and without loss of generality, we suppose here that Entropy 14 00892 i112.
Lemma 4.6.
(Consistency in square mean). Under the same assumptions as in Lemma 4.5, since E[ε] = 0 and the ε are homoskedastic, Entropy 14 00892 i113 in square mean as N → ∞.
Next, we provide our main result of convergence in distribution.
Proposition 4.2.
(Convergence in distribution). Under the same assumptions as in Lemma 4.5 we have
Entropy 14 00892 i114 as N → ∞,
Entropy 14 00892 i115 as N → ∞,
where Entropy 14 00892 i116 stands for convergence in distribution (or law).
Both proofs are presented in the Appendix.

4.2. Forecasting

Once the Generalized Entropy (GE) estimated vector β* has been found, we can use it to predict future, as yet unobserved, values. If the additive noise (ε or v) is distributed according to the same prior Qn, and if future observations are determined by the design matrix Xf, then the possible future observations are described by a random variable yf given by Entropy 14 00892 i117. For example, if vf is centered (on 0), then Entropy 14 00892 i118 and:
Entropy 14 00892 i119
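As a minimal numerical sketch of this point forecast (the estimated vector and the future design matrix below are hypothetical stand-ins, not the paper's data), under a zero-centered noise prior the forecast is simply Xf β*:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical GE point estimate beta* and future design matrix X_f
beta_star = np.array([1.5, -0.7, 2.0])
X_f = rng.normal(size=(5, 3))

# With the noise prior v_f centered on zero, E[y_f] = X_f @ beta*
y_f_hat = X_f @ beta_star
```

The post-data density of the noise would additionally give prediction intervals around this point forecast.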
In the next section we contrast our estimator with other estimators. Then, in Section 6 we provide more analytic solutions for different priors.

5. Method Comparison

In this section we contrast our IT estimator with other estimators that are often used for estimating the location vector β in the noisy, inverse linear problem. We start with the least squares (LS) model, continue with the generalized LS (GLS), and then discuss the regularization method often used for ill-posed problems. We then contrast our estimator with a Bayesian one and with the Bayesian Method of Moments (BMOM). We also show the exact correspondence between our estimator and the other estimators under normal priors.

5.1. The Least Squares Methods

5.1.1. The General Case

We first consider the purely geometric/algebraic approach for solving the linear model (1). A traditional method consists of solving the variational problem:
Entropy 14 00892 i120
The rationale here is that, because of the noise ε, the data Entropy 14 00892 i121 may fall outside the range Entropy 14 00892 i122 of X, so the objective is to minimize that discrepancy. The minimizer Entropy 14 00892 i123 of (16) provides us with the LS estimates, which minimize the sum of squared errors, i.e., the squared distance from the data to Entropy 14 00892 i124. When (XtX) is invertible, Entropy 14 00892 i125. The reconstruction error Entropy 14 00892 i126 can be thought of as our estimate of the "minimal error in quadratic norm" of the measurement errors, or of the noise present in the measurements.
The optimization (16) can be carried out with respect to different norms. In particular, we could have considered Entropy 14 00892 i127. In this case we get the GLS solution Entropy 14 00892 i128 for any general (covariance) matrix D with blocks D1 and D2.
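To make the two estimators concrete, here is a small numerical sketch (the data are simulated, not the paper's): ordinary LS solves the normal equations, and GLS reweights them by the inverse of the noise covariance, taken here, purely for illustration, to be diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# LS: minimize ||y - X b||^2  =>  b = (X'X)^{-1} X'y  (X'X assumed invertible)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: minimize (y - X b)' D2^{-1} (y - X b), with an illustrative diagonal D2
D2_inv = np.diag(1.0 / rng.uniform(0.5, 2.0, size=N))
beta_gls = np.linalg.solve(X.T @ D2_inv @ X, X.T @ D2_inv @ y)
```

At the respective minima the (weighted) residual is orthogonal to the columns of X, which is an easy property to verify numerically.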
If, on the other hand, our objective is to reconstruct simultaneously both the signal and the noise, we can rewrite (1) as:
Entropy 14 00892 i129
where A and ξ are as defined in Section 2. Since Entropy 14 00892 i130, Entropy 14 00892 i131 and the matrix A is of dimension N × (N + K), there are infinitely many solutions that satisfy the observed data in (1) (or (17)). To choose a single solution we solve the following model:
Entropy 14 00892 i132
In the more general case we can incorporate the covariance matrix to weigh the different components of ξ:
Entropy 14 00892 i133
where Entropy 14 00892 i134 is a weighted norm in the extended signal-noise space (C = Cs × Cn) and D can be taken to be the full covariance matrix composed of both D1 and D2 defined in Section 3.1. Under the assumption that M ≡ (ADAt) is invertible, the solution to the variational problem (19) is given by Entropy 14 00892 i135. This solution coincides with our Generalized Entropy formulation when normal priors centered about zero (c0 = 0) are imposed, as developed explicitly in Equation (14).
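A small sketch of this joint signal-noise reconstruction (simulated data; a diagonal D stands in for the covariance blocks D1, D2): the solution ξ* = DAt(ADAt)⁻¹y reproduces the data exactly, and it has the smallest D⁻¹-weighted norm among all vectors satisfying Aξ = y.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 4, 3
X = rng.normal(size=(N, K))
A = np.hstack([X, np.eye(N)])             # A = [X  I], of dimension N x (N + K)
y = rng.normal(size=N)

# Illustrative diagonal weight matrix standing in for D = diag(D1, D2)
D = np.diag(rng.uniform(0.5, 2.0, size=N + K))

M = A @ D @ A.T                            # assumed invertible
xi_star = D @ A.T @ np.linalg.solve(M, y)  # minimum D^{-1}-norm solution
```

Adding any element of the null space of A to ξ* keeps the data constraint satisfied but strictly increases the weighted norm.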
If, on the other hand, the problem is ill-posed (e.g., X is not invertible), then the solution is not unique, and a combination of the above two methods (16 and 18) can be used. This yields the regularization method consisting of finding β such that:
Entropy 14 00892 i136
is achieved (see, for example, Donoho et al. [25] for a nice discussion of regularization within the ME formulation). Traditionally, the positive penalization parameter α is specified to favor small-sized reconstructions, meaning that out of all possible reconstructions with a given discrepancy, those with the smallest norms are chosen. The norms in (20) can be chosen to be weighted, so that the model can be generalized to:
Entropy 14 00892 i137
The solution is:
Entropy 14 00892 i138
where D1 and D2 can be replaced by any weight matrices of interest. Using the first component of Entropy 14 00892 i139, we can state the following.
Lemma 5.1.
With the above notations, Entropy 14 00892 i140 for α = 1.
Proof of Lemma 5.1.
The condition Entropy 14 00892 i140 amounts to:
Entropy 14 00892 i141
independently of y. For this equality to hold, we must have α = 1.
The above result shows that if we weigh the discrepancy between the observed data (y) and its true value (Xβ) by the prior covariance matrix D2, the penalized GLS and our entropy solutions coincide for α = 1 under normal priors.
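As a numerical sanity check on the weighted penalized solution (again with simulated data and illustrative diagonal weights standing in for D1⁻¹ and D2⁻¹), the minimizer of the weighted objective satisfies the first-order stationarity condition:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 30, 4
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
alpha = 1.0

# Illustrative diagonal weights standing in for D1^{-1} and D2^{-1}
D1_inv = np.diag(rng.uniform(0.5, 2.0, size=K))
D2_inv = np.diag(rng.uniform(0.5, 2.0, size=N))

# Minimizer of (y - Xb)' D2^{-1} (y - Xb) + alpha * b' D1^{-1} b
beta_reg = np.linalg.solve(X.T @ D2_inv @ X + alpha * D1_inv,
                           X.T @ D2_inv @ y)
```

Setting α = 1 corresponds to the case in which, per Lemma 5.1, the penalized GLS solution matches the entropy solution under normal priors.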
The comparison of Entropy 14 00892 i142 with Entropy 14 00892 i143 is stated below.
Lemma 5.2.
With the above notations, Entropy 14 00892 i142 = Entropy 14 00892 i143 when the constraints are in terms of pure moments (zero moments).
Proof of Lemma 5.2.
If Entropy 14 00892 i142 = Entropy 14 00892 i143, then Entropy 14 00892 i144 for all y, which implies the following chain of identities:
Entropy 14 00892 i145
Clearly there are only two possibilities. First, if the noise components are not constant, D2 is invertible and therefore Xt must vanish (a trivial but uninteresting case). Second, if the variance of the noise component is zero, (1) becomes a pure linear inverse problem (i.e., we solve y = Xβ).

5.1.2. The Moments’ Case

Up to now, the comparison was done where the Generalized Entropy, GE, estimator was optimized over a larger space (A) than the other LS or GLS estimators. In other words, the constraints in the GE estimator are the data points rather than the moments. The comparison is easier if one performs it over similar spaces, namely using the sample's moments. This can easily be done if XtX is invertible, by re-specifying A to be the generic matrix A = [XtX Xt] rather than A = [X I]. Now, let y' ≡ Xty, X' ≡ XtX, and ε' ≡ Xtε; the problem is then represented as y' = X'β + ε'. In that case the condition for Entropy 14 00892 i142 = Entropy 14 00892 i143 is the trivial condition XtD2X = 0.
In general, when XtX is invertible, it is easy to verify that the solutions to variational problems of the type y' ≡ Xty = XtXβ are of the form (XtX)−1Xty. In one case, the problem is to find:
Entropy 14 00892 i146
while in the other case the solution consists of finding:
Entropy 14 00892 i147
Under this “moment” specification, the solutions to the three different methods described above (16, 23 and 24) coincide.
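A quick numerical check (simulated data) that the "moment" specification y′ = X′β, with y′ = Xty and X′ = XtX, gives back the familiar LS solution:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 40, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

y_m = X.T @ y            # y' = X'y
X_m = X.T @ X            # X' = X'X, assumed invertible

beta_moment = np.linalg.solve(X_m, y_m)
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

The two vectors agree to machine precision, which is the sense in which the methods coincide under the moment specification.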

5.2. The Basic Bayesian Method

Under the Bayesian approach we may think of our problem in the following way. Assume, as before, that Cn and Cs are closed convex subsets of Entropy 14 00892 i148 and Entropy 14 00892 i083 respectively and that Entropy 14 00892 i149 and Entropy 14 00892 i150. For the rest of this section, the priors gs(z), gn(v) will have their usual Bayesian interpretation. For a given z, we think of y = Xz + v as a realization of the random variable Y = Xz + V. Then, Entropy 14 00892 i151. The joint density gy,z(y,z) of Y and Z, where Z is distributed according to the prior Qs, is:
Entropy 14 00892 i152
The marginal distribution of y is Entropy 14 00892 i153 and therefore by Bayes Theorem the posterior (post-data) conditional Entropy 14 00892 i154 is Entropy 14 00892 i155 from which:
Entropy 14 00892 i156
As usual Entropy 14 00892 i157 minimizes Entropy 14 00892 i158 where Z and Y are distributed according to gy,z (y,z). The conditional covariance matrix:
Entropy 14 00892 i159
is such that:
Entropy 14 00892 i160
where Var(Z|y) is the total variance of the K random variates z in Z. Finally, it is important to emphasize here that the Bayesian approach provides us with a whole range of tools for inference, forecasting, model averaging, posterior intervals, etc. In this paper, however, the focus is on estimation and on the basic comparison of our GE method with other methods under the notations and formulations developed here. Extensions to testing and inference are left for future work.

5.2.1. A Standard Example: Normal Priors

As before, we view β and ε as realizations of random variables Z and V having the informative normal “a priori” (priors for signal and noise) distributions:
Entropy 14 00892 i161
Entropy 14 00892 i162
For notational convenience we assume that both Z and V are centered on zero and independent, and both covariance matrices D1 and D2 are strictly positive definite. For comparison purposes, we are using the same notation as in Section 3. The randomness is propagated to the data Y such that the conditional density (or the conditional priors on y) of Y is:
Entropy 14 00892 i163
Then, the marginal distribution of Y is Entropy 14 00892 i164. The conditional distribution of Z given Y is easy to obtain under the normal setup. Thus, the post-data distribution of the signal, β, given the data y is:
Entropy 14 00892 i165
where Entropy 14 00892 i166 and Entropy 14 00892 i167. That is, the posterior (post-data) distribution of Z has been updated by the data, relative to the prior. Finally, the post-data expected value of Z is given by:
Entropy 14 00892 i168
This is the traditional Bayesian solution for the linear regression using the support spaces for both signal and noise within the framework developed here. As before, one can compare this Bayesian solution with our Generalized Entropy solution. Equation (28) is comparable with our solution (14) for z0 = 0 which is the Generalized Entropy method with normal priors and center of supports equal zero. In addition, it is easy to see that the Bayesian solution (28) coincides with the penalized GLS (model (24)) for α = 1.
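The normal post-data mean can be written in two algebraically equivalent forms, related by the matrix inversion lemma. The sketch below, with simulated data and illustrative diagonal D1, D2, checks that they agree; this is a generic identity for centered normal priors, not a transcription of Equation (28).

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 20, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

# Illustrative prior covariances for the signal (D1) and the noise (D2)
D1 = np.diag(rng.uniform(0.5, 2.0, size=K))
D2 = np.diag(rng.uniform(0.5, 2.0, size=N))

# "Wiener filter" form of the posterior mean: D1 X'(X D1 X' + D2)^{-1} y
mean_data_form = D1 @ X.T @ np.linalg.solve(X @ D1 @ X.T + D2, y)

# Equivalent "information" form: (X'D2^{-1}X + D1^{-1})^{-1} X'D2^{-1} y
mean_info_form = np.linalg.solve(
    X.T @ np.linalg.inv(D2) @ X + np.linalg.inv(D1),
    X.T @ np.linalg.inv(D2) @ y)
```

The second form makes the connection to the penalized GLS solution with α = 1 transparent.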
A few comments on these brief comparisons are in order. First, under both approaches the complete posterior (or post-data) density is estimated, and not only the posterior mean, though under the GE estimator the post-data density is related to the pre-specified spaces and priors. (Recall that the Bayesian posterior means are specific to a particular loss function.) Second, the agreement between the Bayesian result and the minimizer of (24) with α = 1 assumes a known value of Entropy 14 00892 i169, which is contained in D2. In the Bayesian result Entropy 14 00892 i169 is marginalized, so it is not conditional on that parameter. Therefore, with a known value of Entropy 14 00892 i169, both estimators are the same.
There are two reasons for the equivalence of the three methods (GE, Bayes and Penalized GLS). The first is that there are no binding constraints imposed on the signal and the noise. The second is the choice of imposing the normal densities as informative priors for both signal and noise. In fact, this result is standard in inverse problem theory where L is known as the Wiener filter (see for example Bertero and Boccacci [37]). In that sense, the Bayesian technique and the GE technique have some procedural ingredients in common, but the distinguishing factor is the way the posterior (post-data) is obtained. (Note that “posterior” for the entropy method, means the “post data” distribution which is based on both the priors and the data, obtained via the optimization process). In one case it is obtained by maximizing the entropy functional while in the Bayesian approach it is obtained by a direct application of Bayes theorem. For more background and related derivation of the ME and Bayes rule see Zellner [38,39].

5.3. Comparison with the Bayesian Method of Moments (BMOM)

The basic idea behind Zellner’s BMOM is to avoid a likelihood function. This is done by maximizing the continuous (Shannon) entropy subject to the empirical moments of the data. This yields the most conservative (closest to uniform) post data density (Zellner [14,40,41,42,43]; Zellner and Tobias [15]). In that way the BMOM uses only assumptions on the realized error terms which are used to derive the post data density.
Building on the above references, assume (XtX)−1 exists; then the LS solution to (1) is Entropy 14 00892 i170, which is taken to be the post data mean with respect to the (as yet unknown) distribution (likelihood). This is equivalent to assuming Entropy 14 00892 i171 (the columns of X are orthogonal to the N × 1 vector E[V|Data]). To find g(z|Data), or in Zellner's notation g(β|Data), one applies the classical ME with the following constraints (information):
Entropy 14 00892 i172
Entropy 14 00892 i173
where Entropy 14 00892 i174 is based on the assumption that Entropy 14 00892 i175, or similarly under Zellner’s notation Entropy 14 00892 i176, and σ2 is a positive parameter. Then, the maximum entropy density satisfying these two constraints (and the requirement that it is a proper density) is:
Entropy 14 00892 i177
This is the BMOM post data density for the parameter vector with mean Entropy 14 00892 i178 under the two side conditions used here. If more side conditions are used, the density function g will not be normal. Information other than moments can also be incorporated within the BMOM.
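The fact that mean and covariance side conditions pin down a normal post-data density reflects a classical result: among all densities with a given variance, the normal has maximal differential entropy. A one-dimensional sketch, comparing the standard closed-form entropies of a normal and a variance-matched uniform (both formulas are textbook facts, not taken from the paper):

```python
import numpy as np

sigma = 1.3   # common standard deviation for both densities

# Differential entropy of N(mu, sigma^2): 0.5 * ln(2*pi*e*sigma^2)
h_normal = 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

# A uniform on an interval of width w has variance w^2/12; matching the
# variance gives w = sigma*sqrt(12), with differential entropy ln(w)
h_uniform = np.log(sigma * np.sqrt(12.0))

# The normal dominates, as the maximum entropy principle requires
gap = h_normal - h_uniform
```

The gap (about 0.176 nats) is independent of σ, since differential entropy shifts by ln σ under scaling for both densities.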
Comparing Zellner’s BMOM with our Generalized Entropy method we note that the BMOM produces the post data density from which one can compute the vector of point estimates Entropy 14 00892 i178 of the unconstrained problem (1). Under the GE model, the solution Entropy 14 00892 i142 satisfies the data/constraints within the joint support space C. Further, for the GE construction there is no need to impose exact moment constraints, meaning it provides a more flexible post data density. Finally, under both methods one can use the post data densities to calculate the uncertainties around future (unobserved) observations.

6. More Closed Form Examples

In Section 3, we formulated three relatively simple closed form cases. In the current section, we extend our analytic solutions to a host of potential priors. This section demonstrates the capabilities of our proposed estimator. We do not intend here to formulate our model under all possible priors. We present our examples in such a way that the priors can be assigned for either the signal or the noise components. The different priors we discuss correspond to different prior beliefs: unbounded (unconstrained), bounded below, bounded above, or bounded below and above. The following set of examples, together with those in Section 3, represents different cases of commonly used prior distributions, and their corresponding partition functions. Specifically, the different cases are the Laplace (bilateral exponential), which is symmetric but with heavy tails, the Gamma distribution, which is bounded below and non-symmetric, the continuous and discrete uniform distributions, and the Bernoulli distribution, which allows an easy specification of a prior mean that is not at the center of the pre-specified supports. In all the examples below, the index d takes the possible values (dimensions) K, N, or K+N, depending on whether it relates to Cs (or z), to Cn (or v), or to both.
We note that information theoretic procedures were also used for producing priors (e.g., Jeffreys’, Berger and Bernardo’s, Zellner, etc.). In future work we will try to relate them to the procedure developed here. In these approaches, β is not always viewed as the mean, as given in Equation (2). For example, Jeffreys, Zellner (e.g., Zellner [41]) and others have used Cauchy priors and unbounded measure priors, for which the mean does not exist.

6.1. The Basic Formulation

Case 1. Bilateral Exponential—Laplace Distribution. Like the normal distribution, another possible unconstrained model is obtained if we take as reference measure a bilateral exponential, or Laplace distribution. This is useful for modeling distributions with tails heavier than the normal. The following derivation holds for both our generic model (captured via the generic matrix A) and for just the signal or noise parts separately. We only provide here the solution for the signal part.
In this case, the density of dQ is Entropy 14 00892 i179 The parameters Entropy 14 00892 i180 are the prior means and 1/2σj is the variance of each component. The Laplace transform of dQ is:
Entropy 14 00892 i181
Next, we use the relationship τ = Xtλ. (Note that under the generic formulation, instead of Xt, we can work with X*, which stands for either Xt, I or At.) We compute Ω(λ) via the Laplace transformation (8). It then follows that Entropy 14 00892 i182 where ω(t) is always finite and positive. For this relationship to be satisfied, Entropy 14 00892 i183 for all j = 1, 2, …, d. Finally, replacing τ by Xtλ yields D(Ω).
Next, minimizing the concentrated entropy function:
Entropy 14 00892 i184
by equating its gradient with respect to λ to 0, we obtain that at the minimum:
Entropy 14 00892 i185
Entropy 14 00892 i186
Finally, having solved for the optimal vector λ* that minimizes ∑(λ), and such that the previous identity holds, we can rewrite our model as:
Entropy 14 00892 i187
where rather than solve (31) directly, we make use of λ*, that minimizes ∑(λ) and satisfies (30).
As expected, the post-data has a well-defined Laplace distribution (Kotz et al. [44]), but this distribution is no longer symmetric, and the decay rate is modified by the data. Specifically:
Entropy 14 00892 i188
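The one-component fact behind Case 1 can be checked numerically. Assuming the common normalization with density (σ/2)e^{−σ|z−m|} (the paper's parameterization may differ by scaling), the transform is e^{mτ}σ²/(σ² − τ²), finite precisely when |τ| < σ:

```python
import numpy as np

m, sigma, tau = 0.3, 2.0, 0.8    # |tau| < sigma, so the transform is finite

# One-component Laplace density (sigma/2) * exp(-sigma * |z - m|)
z = np.linspace(m - 30.0, m + 30.0, 400_001)
density = 0.5 * sigma * np.exp(-sigma * np.abs(z - m))

# Trapezoid-rule quadrature of E[exp(tau * Z)]
vals = np.exp(tau * z) * density
dz = z[1] - z[0]
omega_numeric = dz * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

omega_closed = np.exp(m * tau) * sigma**2 / (sigma**2 - tau**2)
```

As τ approaches σ the integrand's tail decays ever more slowly and the transform blows up, which is the finiteness condition noted in the text.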
Case 2. Lower Bounds—Gamma Distribution. Suppose that the β’s are all bounded below by theory. Then, we can specify a random vector Z with values in the positive orthant translated by the lower-bound K-dimensional vector l, so Entropy 14 00892 i189, with each component Zj taking values in [lj,∞). Like related methods, we assume that each component Zj of Z is distributed on [lj,∞) according to a translated Entropy 14 00892 i190. With this in mind, a direct calculation yields:
Entropy 14 00892 i191
where the particular case of bj = 0 corresponds to the standard exponential distribution defined on [lj,∞). Finally, when τ is replaced by Xtλ, we get:
Entropy 14 00892 i192
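For the exponential special case (bj = 0), assuming the density σe^{−σ(z−l)} on [l, ∞) (an assumed parameterization; the paper's may differ), the transform is e^{lτ}σ/(σ − τ) for τ < σ, which a quick quadrature confirms:

```python
import numpy as np

l, sigma, tau = -1.0, 1.5, 0.5   # tau < sigma keeps the transform finite

# Translated exponential density sigma * exp(-sigma * (z - l)) on [l, inf)
z = np.linspace(l, l + 40.0, 400_001)
density = sigma * np.exp(-sigma * (z - l))

# Trapezoid-rule quadrature of E[exp(tau * Z)]
vals = np.exp(tau * z) * density
dz = z[1] - z[0]
omega_numeric = dz * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

omega_closed = np.exp(l * tau) * sigma / (sigma - tau)
```

The tail of the integrand decays like e^{−(σ−τ)(z−l)}, so truncating the integral at l + 40 introduces negligible error here.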
Case 3. Bounds on Signal and Noise. Consider the case where each component of Z and V takes values in some bounded interval [aj, bj]. A common choice for the bounds of the error supports in this case is the three-sigma rule (Pukelsheim [44]), where “sigma” is the empirical standard deviation of the sample analyzed (see, for example, Golan, Judge and Miller [1] for a detailed discussion). In this situation we provide two simple (and extreme) choices for the reference measure. The first is a uniform measure on [aj, bj], and the second is a Bernoulli distribution supported on aj and bj.

6.1.2. Uniform Reference Measure

In this case the reference (prior) measure dQ(z) is distributed according to the uniform density Entropy 14 00892 i193 and the Laplace transform of this density is:
Entropy 14 00892 i194
and ω(τ) is finite for every vector τ.

6.1.3. Bernoulli Reference Measure

In this case the reference measure is singular (with respect to the volume measure) and is given by dQ(z) = Entropy 14 00892 i195, where δc(dz) denotes the (Dirac) unit point mass at some point c, and where pj and qj do not have to sum up to one, yet they determine the weight within the bounded interval [aj, bj]. The Laplace transform of dQ is:
Entropy 14 00892 i196
where, again, ω(τ) is finite for all τ.
In this case, there is no common criterion that can be used to decide which a priori reference measure to choose. In many specific cases, we have noticed that a reconstruction with the discrete Bernoulli prior of Entropy 14 00892 i197 yields estimates that are very similar to the continuous uniform prior.
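For a single component on [a, b], the two bounded reference measures have the standard one-dimensional transforms (e^{bτ} − e^{aτ})/(τ(b − a)) for the uniform and pe^{aτ} + qe^{bτ} for the Bernoulli; (35) and (36) are their d-fold products. A sketch comparing them and checking the uniform formula by quadrature, with illustrative parameter values:

```python
import numpy as np

a, b, tau = -1.0, 2.0, 0.7

# Uniform reference measure on [a, b]
omega_unif = (np.exp(b * tau) - np.exp(a * tau)) / (tau * (b - a))

# Bernoulli reference measure with weights p, q on the endpoints a, b
p, q = 0.5, 0.5
omega_bern = p * np.exp(a * tau) + q * np.exp(b * tau)

# Quadrature check of the uniform transform (finite for every tau)
z = np.linspace(a, b, 200_001)
vals = np.exp(tau * z) / (b - a)
dz = z[1] - z[0]
omega_unif_num = dz * (vals.sum() - 0.5 * (vals[0] + vals[-1]))
```

Because the support is bounded, both transforms are finite for all τ, unlike the Laplace and Gamma cases above.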
Case 4. Symmetric Bounds. This is a special case of Case 3 above for aj= −cj and bj = cj for positive cj’s. The corresponding versions of (35) and (36), the uniform and Bernoulli, are respectively:
Entropy 14 00892 i198
Entropy 14 00892 i199

6.2. The Full Model

Having developed the basic formulations and building blocks of our model, we note that the list can be amplified considerably and these building blocks can be assembled into a variety of combinations. We already demonstrated such a case in Section 3.3. We now provide such an example.

6.2.1. Bounded Parameters and Normally Distributed Errors

Consider the common case of naturally bounded signal and normally distributed errors. This case combines Case 2 of Section 6.1 together with the normal case discussed in Section 3.1. Let Entropy 14 00892 i200, but we impose no constraints on the ε. From Section 2, Entropy 14 00892 i201 with Entropy 14 00892 i202, and Entropy 14 00892 i203. The signal component was formulated earlier, while Entropy 14 00892 i204. Using A = [X I] we have Entropy 14 00892 i205 for the N-dimensional vector λ, and therefore Entropy 14 00892 i206. The maximal entropy probability measures (post-data) are:
Entropy 14 00892 i207
where λ* is found by minimizing the concentrated entropy function:
Entropy 14 00892 i208
Entropy 14 00892 i209 and l determines the “shift” for each coordinate. (For example, if Entropy 14 00892 i210, then Entropy 14 00892 i211, or for the simple heteroscedastic case, we have Entropy 14 00892 i211). Finally, once λ* is found, we get:
Entropy 14 00892 i212
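To illustrate how a concentrated entropy function of this kind is minimized in practice, here is a toy sketch (not the paper's algorithm or data): the signal prior is uniform on [0, 1] for each coordinate, the noise prior is N(0, σ²), and λ* is found by plain gradient descent on the dual Σ(λ) = ln E_Q[e^{⟨λ, Aξ⟩}] − ⟨λ, y⟩ (one standard sign convention), whose gradient is the fitted mean E_{P*}[Aξ] minus y. All names and tuning constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 25, 2
X = rng.normal(size=(N, K))
beta_true = np.array([0.7, 0.3])      # lies inside the uniform support [0, 1]
sigma = 0.05
y = X @ beta_true + sigma * rng.normal(size=N)

def signal_mean(tau):
    # d/dtau ln omega(tau) for a uniform prior on [0, 1]; tends to 1/2 at 0
    t = np.where(np.abs(tau) < 1e-8, 1e-8, tau)
    return np.exp(t) / np.expm1(t) - 1.0 / t

def dual_grad(lam):
    # Gradient of Sigma(lambda): fitted mean E_{P*}[X z + v] minus the data
    tau = X.T @ lam
    return X @ signal_mean(tau) + sigma**2 * lam - y

lam = np.zeros(N)
for _ in range(20_000):               # plain gradient descent (illustrative)
    lam -= 0.01 * dual_grad(lam)

beta_star = signal_mean(X.T @ lam)    # post-data mean of the signal
```

At the optimum the fitted moments match the data along the directions relevant to the signal, and β* recovers the true coefficients up to the noise level, while staying inside the prior support by construction.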

7. A Comment on Model Comparison

So far we have described our model, its properties and specified some closed form examples. The next question facing the researcher is how to decide on the most appropriate prior/model to use for a given set of data. In this section, we briefly comment on a few possible model comparison techniques.
A possible criterion for comparing estimations (reconstructions) resulting from different priors should be based on a comparison of the post-data entropies associated with the proposed setup. Implicit in the choice of priors is the choice of supports (Z and V), which in turn is dictated by the constraints imposed on the β’s. Assumption 2.1 means that these constraints are properly specified, namely there is no arbitrariness in the choice of Cs. The choice of a specific model for the noise involves two assumptions. The first one is about the support that reflects the actual range of the errors. The second is the choice of a prior describing the distribution of the noise within that support. To contrast two possible priors, we want to compare the reconstructions provided by the different models for the signal and noise variables. Within the information theoretic approach taken here, comparing the post-data entropies seems a reasonable choice.
From a practical point of view, the post-data entropies depend on the priors and the data in an explicit but nonlinear way. All we can say for certain is that for all models (or priors) the optimal solution is:
Entropy 14 00892 i213
where λ* has to be computed by minimizing the concentrated entropy function ∑(λ), and it is clear that the total entropy difference between the post-data and the priors is just the entropy difference for the signal plus the entropy difference for the noise. Note that Entropy 14 00892 i214 where d is the dimension of λ. This is the entropy ratio statistic, which is similar in nature to the empirical likelihood ratio statistic (e.g., Golan [27]). Rather than discussing this statistic here, we provide in Appendix 3 analytic formulations of Equation (39) for a large number of prior distributions. These formulations are based on the examples of earlier sections. Lastly, we note that in some cases, where the competing models are of different dimensions, a normalization of both statistics is necessary.

8. Conclusions

In this paper we developed a generic information theoretic method for solving a noisy, linear inverse problem. This method uses minimal a priori assumptions, and allows us to incorporate constraints and priors in a natural way for a whole class of linear inverse problems across the natural and social sciences. This inversion method is generic in the sense that it provides a framework for analyzing non-normal models, and it also performs well for data that are not of full rank.
We provided detailed analytic solutions for a large class of priors. We developed the first order properties as well as the large sample properties of that estimator. In addition, we compared our model to other methods such as the Least Squares, Penalized LS, Bayesian and the Bayesian Method of Moments.
The proposed model's main advantage over other LS and ML methods is its better performance (more stable, with lower variances) in (possibly small) finite samples. The smaller the sample and/or the more ill-behaved (e.g., collinear) the sample is, the better this method performs. However, if one knows the underlying distribution, and the sample is well behaved and large enough, traditional ML is the correct model to use. The other advantages of our proposed model (relative to the GME and other IT estimators) are that (i) we can impose different priors (discrete or continuous) for the signal and the noise, (ii) we estimate the full distribution of each of the two sets of unknowns (signal and noise), and (iii) our model is based on minimal assumptions.
In future research, we plan to study the small sample properties as well as develop statistics to evaluate the performance of the competing priors and models. We conclude by noting that the same framework developed here can be easily extended for nonlinear estimation problems. This is because all the available information enters as stochastic constraints within the constrained optimization problem.


Acknowledgments
We thank Bertrand Clarke, George Judge, Doug Miller, Arnold Zellner and Ed Greenberg as well as participants of numerous conferences and seminars for their comments and suggestions on earlier versions of this paper. Golan thanks the Edwin T. Jaynes International Center for Bayesian Methods and Maximum Entropy for partially supporting this project. We also want to thank the referees for their comments, for their careful reading of the manuscript and for pointing out reference [26] to us. Their comments improved the presentation considerably.


References
  1. Golan, A.; Judge, G.G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; John Wiley & Sons: New York, NY, USA, 1996. [Google Scholar]
  2. Gzyl, H.; Velásquez, Y. Linear Inverse Problems: The Maximum Entropy Connection; World Scientific Publishers: Singapore, 2011. [Google Scholar]
  3. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
  4. Jaynes, E.T. Information theory and statistical mechanics II. Phys. Rev. 1957, 108, 171–190. [Google Scholar] [CrossRef]
  5. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar] [CrossRef]
  6. Owen, A. Empirical likelihood for linear models. Ann. Stat. 1991, 19, 1725–1747. [Google Scholar] [CrossRef]
  7. Owen, A. Empirical Likelihood; Chapman & Hall/CRC: Boca Raton, FL, USA, 2001. [Google Scholar]
  8. Qin, J.; Lawless, J. Empirical likelihood and general estimating equations. Ann. Stat. 1994, 22, 300–325. [Google Scholar] [CrossRef]
  9. Smith, R.J. Alternative semi parametric likelihood approaches to GMM estimations. Econ. J. 1997, 107, 503–510. [Google Scholar] [CrossRef]
  10. Newey, W.K.; Smith, R.J. Higher order properties of GMM and generalized empirical likelihood estimators. Department of Economics, MIT: Cambridge, MA, USA, Unpublished work, 2002. [Google Scholar]
  11. Kitamura, Y.; Stutzer, M. An information-theoretic alternative to generalized method of moment estimation. Econometrica 1997, 66, 861–874. [Google Scholar] [CrossRef]
  12. Imbens, G.W.; Johnson, P.; Spady, R.H. Information-theoretic approaches to inference in moment condition models. Econometrica 1998, 66, 333–357. [Google Scholar] [CrossRef]
  13. Zellner, A. Bayesian Method of Moments/Instrumental Variables (BMOM/IV) analysis of mean and regression models. In Prediction and Modeling Honoring Seymour Geisser; Lee, J.C., Zellner, A., Johnson, W.O., Eds.; Springer Verlag: New York, NY, USA, 1996. [Google Scholar]
  14. Zellner, A. The Bayesian Method of Moments (BMOM): Theory and applications. In Advances in Econometrics; Fomby, T., Hill, R., Eds.; JAI Press: Greenwich, CT, USA, 1997; Volume 12, pp. 85–105. [Google Scholar]
  15. Zellner, A.; Tobias, J. Further results on the Bayesian method of moments analysis of multiple regression model. Int. Econ. Rev. 2001, 107, 1–15. [Google Scholar] [CrossRef]
  16. Gamboa, F.; Gassiat, E. Bayesian methods and maximum entropy for ill-posed inverse problems. Ann. Stat. 1997, 25, 328–350. [Google Scholar] [CrossRef]
  17. Gzyl, H. Maxentropic reconstruction in the presence of noise. In Maximum Entropy and Bayesian Studies; Erickson, G., Ryckert, J., Eds.; Kluwer: Dordrecht, The Netherlands, 1998. [Google Scholar]
  18. Golan, A.; Gzyl, H. A generalized maxentropic inversion procedure for noisy data. Appl. Math. Comput. 2002, 127, 249–260. [Google Scholar] [CrossRef]
  19. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for non-orthogonal problems. Technometrics 1970, 1, 55–67. [Google Scholar] [CrossRef]
  20. O’Sullivan, F. A statistical perspective on ill-posed inverse problems. Stat. Sci. 1986, 1, 502–527. [Google Scholar] [CrossRef]
  21. Breiman, L. Better subset regression using the nonnegative garrote. Technometrics 1995, 37, 373–384. [Google Scholar] [CrossRef]
  22. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar]
  23. Titterington, D.M. Common structures of smoothing techniques in statistics. Int. Stat. Rev. 1985, 53, 141–170. [Google Scholar] [CrossRef]
  24. Donoho, D.L.; Johnstone, I.M.; Hoch, J.C.; Stern, A.S. Maximum entropy and the nearly black object. J. R. Stat. Soc. Ser. B 1992, 54, 41–81. [Google Scholar]
  25. Besnerais, G.L.; Bercher, J.F.; Demoment, G. A new look at entropy for solving linear inverse problems. IEEE Trans. Inf. Theory 1999, 45, 1565–1578. [Google Scholar] [CrossRef]
  26. Bickel, P.; Li, B. Regularization methods in statistics. Test 2006, 15, 271–344. [Google Scholar] [CrossRef]
  27. Golan, A. Information and entropy econometrics—A review and synthesis. Found. Trends Econometrics 2008, 2, 1–145. [Google Scholar] [CrossRef]
  28. Fomby, T.B.; Hill, R.C. Advances in Econometrics; JAI Press: Greenwich, CT, USA, 1997. [Google Scholar]
  29. Golan, A. (Ed.) Special Issue on Information and Entropy Econometrics (Journal of Econometrics); Elsevier: Amsterdam, The Netherlands, 2002; Volume 107, Issues 1–2, pp. 1–376.
  30. Golan, A.; Kitamura, Y. (Eds.) Special Issue on Information and Entropy Econometrics: A Volume in Honor of Arnold Zellner (Journal of Econometrics); Elsevier: Amsterdam, The Netherlands, 2007; Volume 138, Issue 2, pp. 379–586.
31. Mynbayev, K.T. Short-Memory Linear Processes and Econometric Applications; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
32. Aster, R.C.; Borchers, B.; Thurber, C.H. Parameter Estimation and Inverse Problems; Elsevier: Amsterdam, The Netherlands, 2003. [Google Scholar]
  33. Golan, A. Information and entropy econometrics—Editor’s view. J. Econom. 2002, 107, 1–15. [Google Scholar] [CrossRef]
  34. Kullback, S. Information Theory and Statistics; John Wiley & Sons: New York, NY, USA, 1959. [Google Scholar]
  35. Durbin, J. Estimation of parameters in time-series regression models. J. R. Stat. Soc. Ser. B 1960, 22, 139–153. [Google Scholar]
  36. Mittelhammer, R.; Judge, G.; Miller, D. Econometric Foundations; Cambridge Univ. Press: Cambridge, UK, 2000. [Google Scholar]
  37. Bertero, M.; Boccacci, P. Introduction to Inverse Problems in Imaging; CRC Press: Boca Raton, FL, USA, 1998. [Google Scholar]
  38. Zellner, A. Optimal information processing and Bayes theorem. Am. Stat. 1988, 42, 278–284. [Google Scholar]
  39. Zellner, A. Information processing and Bayesian analysis. J. Econom. 2002, 107, 41–50. [Google Scholar] [CrossRef]
  40. Zellner, A. Bayesian Method of Moments (BMOM) Analysis of Mean and Regression Models. In Modeling and Prediction; Lee, J.C., Johnson, W.D., Zellner, A., Eds.; Springer: New York, NY, USA, 1994; pp. 17–31. [Google Scholar]
  41. Zellner, A. Models, prior information, and Bayesian analysis. J. Econom. 1996, 75, 51–68. [Google Scholar] [CrossRef]
42. Zellner, A. Bayesian Analysis in Econometrics and Statistics: The Zellner View and Papers; Edward Elgar Publishing Ltd.: Cheltenham, UK, 1997; pp. 291–304, 308–318. [Google Scholar]
43. Kotz, S.; Kozubowski, T.; Podgórski, K. The Laplace Distribution and Generalizations; Birkhäuser: Boston, MA, USA, 2001. [Google Scholar]
  44. Pukelsheim, F. The three sigma rule. Am. Stat. 1994, 48, 88–91. [Google Scholar]

Appendix 1: Proofs

Proof of Proposition 4.1.
From Assumptions 4.1–4.3 we have:
Entropy 14 00892 i215
Note that if the Entropy 14 00892 i220-covariance of the noise component of ξ is Entropy 14 00892 i216, then the Entropy 14 00892 i217-covariance of ξ is the (N + K) × (N + K) matrix given by Entropy 14 00892 i218. Here Entropy 14 00892 i219 is the Entropy 14 00892 i220-covariance of the signal component of ξ. Again, from Assumptions 4.1–4.3 it follows that Entropy 14 00892 i221, which is the covariance of the signal component of ξ with respect to the limit probability Entropy 14 00892 i222. Therefore, φ is also invertible and Entropy 14 00892 i223. To verify the uniform convergence of ψN (y) towards ψ (y), note that:
Entropy 14 00892 i224
Proof of Lemma 4.5.
(First-order unbiasedness). Observe that for large N, keeping only the first term of the Taylor expansion, we have:
Entropy 14 00892 i225
after we drop the o(1/N) term. Keeping only the first term of the Taylor expansion, and invoking the assumptions of Lemma 4.5:
Entropy 14 00892 i226
Incorporating the model’s equations, we see that under the approximations made so far:
Entropy 14 00892 i227
Here we used the fact that Entropy 14 00892 i228 and Entropy 14 00892 i229 are the respective Jacobian matrices. First-order unbiasedness follows by taking expectations: Entropy 14 00892 i230
Proof of Lemma 4.6.
(Consistency in squared mean). With the same notation as above, consider Entropy 14 00892 i231. Using the representation of Lemma 4.5, Entropy 14 00892 i232, and computing the expected square norm indicated above, we obtain Entropy 14 00892 i233, which by Assumption 4.1 tends to 0 as N → ∞.
Proof of Proposition 4.2.
Part (a). The proof is based on Lemma 4.5. Notice that, under Qn, for any Entropy 14 00892 i234, Entropy 14 00892 i235, where Entropy 14 00892 i236 Since the components of ε are i.i.d. random variables, the standard approximations yield Entropy 14 00892 i237, where Entropy 14 00892 i238, and therefore the law of Entropy 14 00892 i239 concentrates at 0 asymptotically. This completes Part (a).
Part (b). This part is similar to the previous proof, except that now the Entropy 14 00892 i240 factor in the exponent changes the result to Entropy 14 00892 i241 as N → ∞, from which assertion (b) of the proposition follows by the standard continuity theorem.
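The lemmas above assert first-order unbiasedness and mean-square consistency of the entropic estimator. Since the closed-form expressions in this appendix survive only as equation images, the following Python sketch illustrates the qualitative content of Lemma 4.6 with a hypothetical stand-in: under normal priors the entropic solution reduces to a Tikhonov/ridge-type closed form (cf. Appendix 2), and its mean squared error shrinks as N grows. The function name, prior parameters, and simulation design are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropic_estimate(X, y, z0, tau2, sigma2):
    # Hypothetical stand-in: with normal priors N(z0, tau2*I) on beta and
    # N(0, sigma2*I) on the noise, the posterior-mode estimate takes the
    # familiar Tikhonov/ridge closed form.
    K = X.shape[1]
    M = X.T @ X / sigma2 + np.eye(K) / tau2
    return np.linalg.solve(M, X.T @ y / sigma2 + z0 / tau2)

beta_true = np.array([1.0, -2.0])
mses = []
for N in (50, 500, 5000):
    errs = []
    for _ in range(200):
        x = rng.uniform(-1, 1, N)
        X = np.column_stack([np.ones(N), x])
        y = X @ beta_true + rng.normal(0, 0.5, N)
        b = entropic_estimate(X, y, np.zeros(2), tau2=10.0, sigma2=0.25)
        errs.append(np.sum((b - beta_true) ** 2))
    mses.append(np.mean(errs))
print(mses)  # mean squared error shrinks as N grows
```

The decreasing sequence of mean squared errors mirrors the N → ∞ limit established in the proof of Lemma 4.6.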

Appendix 2: Normal Priors — Derivation of the Basic Linear Model

Consider the linear model yi = a + bxi + εi, where X = (1, x) and β = (a, b)ᵗ. We assume that (i) both Qs and Qn are normal and (ii) Qs and Qn are independent, so that the Laplace transform is just (10). Recall that t = Aᵗλ and that in the generic model A = [X I] = [1 x I], where A is an N × (N + 2) matrix (N × (N + K) in the general model with K > 2). The log of the normalization factor of the post-data, Ω(λ), is:
Entropy 14 00892 i242
Building on (11), the concentrated (dual) entropy function is:
Entropy 14 00892 i243
where Entropy 14 00892 i244. Solving for λ*, Entropy 14 00892 i245, yields:
Entropy 14 00892 i246
and finally, Entropy 14 00892 i247. Explicitly, M is:
Entropy 14 00892 i248
Entropy 14 00892 i249
and 11t = 1N.
Next, we solve for the optimal β and ε. Recalling that the optimal solution is Entropy 14 00892 i247 and Entropy 14 00892 i250, and following the derivations of Section 3, we get:
Entropy 14 00892 i251
Entropy 14 00892 i252
Entropy 14 00892 i253
Rewriting the exponent in the numerator as:
Entropy 14 00892 i254
and incorporating it in dP* yields:
Entropy 14 00892 i255
where the second right-hand side term equals 1. Finally:
Entropy 14 00892 i256
To check our solution, note that Entropy 14 00892 i257, so:
Entropy 14 00892 i258
and finally:
Entropy 14 00892 i259
which is (14), where B = y − Ac0. Within the basic model, it is clear that Entropy 14 00892 i260, or Entropy 14 00892 i261. In the natural case where the errors’ priors are centered at zero (v0 = 0), Entropy 14 00892 i262 and Entropy 14 00892 i263. If in addition z0 = 0, then Entropy 14 00892 i264.
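As a numerical sanity check on derivations of this kind: the equivalence between a "dual" solution of the sort obtained above (solve a small system in λ and map back) and a direct "primal" regularized solution rests, in the all-normal case, on the push-through (Woodbury-type) matrix identity. The sketch below verifies that identity numerically; D1 and D2 are generic stand-ins for the prior covariances of the signal and the noise, not the specific quantities of Equation (14).

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
X = rng.normal(size=(N, K))
D1 = np.diag(rng.uniform(0.5, 2.0, K))   # stand-in prior covariance of the signal
D2 = np.diag(rng.uniform(0.5, 2.0, N))   # stand-in prior covariance of the noise

# Push-through identity:
#   (D1^-1 + X' D2^-1 X)^-1 X' D2^-1  =  D1 X' (X D1 X' + D2)^-1
lhs = np.linalg.solve(np.linalg.inv(D1) + X.T @ np.linalg.inv(D2) @ X,
                      X.T @ np.linalg.inv(D2))
rhs = D1 @ X.T @ np.linalg.inv(X @ D1 @ X.T + D2)
ok = np.allclose(lhs, rhs)
print(ok)
```

The left-hand side works in the K-dimensional signal space while the right-hand side works in the N-dimensional data space; their agreement is what lets one pass between the concentrated (dual) problem in λ and the explicit solution for β.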

Appendix 3: Model Comparisons — Analytic Examples

Here we provide the detailed analytical formulations of the dual (concentrated) GE model for different priors. These can be used to derive the entropy-ratio statistic based on Equation (39).
Example 1. The priors for both the signal and the noise are normal. In that case, the final post-data entropy, computed in Section 3, is:
Entropy 14 00892 i265
This seems to be the only case amenable to full analytical computation.
Example 2. Laplace prior for the state space variables and a uniform prior (on [−e, e]) for the noise term. The full post-data entropy is:
Entropy 14 00892 i266
Example 3. Normal prior for the state space variables and a uniform prior (on [−e, e]) for the noise. The post-data entropy is:
Entropy 14 00892 i267
where z0 is the center of the normal priors and D1 is the covariance matrix of the state space variables.
Example 4. A Gamma prior for the state space variables and a uniform prior (on [−e, e]) for the noise term. In this case the post-data entropy is:
Entropy 14 00892 i268
Example 5. The priors for the state space variables are Laplace and the priors for the noise are normal. Here, the post-data entropy is:
Entropy 14 00892 i269
Example 6. Both signal and noise have bounded supports, and we assume uniform priors for both. The post-data entropy is:
Entropy 14 00892 i270
Finally, we complete the set of examples with what is probably the most common case.
Example 7. Uniform priors on bounded intervals for the signal components and normal priors for the noise. The post-data entropy is:
Entropy 14 00892 i271
We reemphasize that this model comparison can only be used to compare models after each model has been completely worked out, and for a given data set. Finally, we have presented here the case of comparing the total entropies of the post-data to the priors, but as Equation (39) shows, one can instead compare the post-data and pre-data entropies of the signal alone, Entropy 14 00892 i272, or of the noise alone, Entropy 14 00892 i273.
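To make the comparison concrete in the one fully analytic case (Example 1, both priors normal), note that a post-data-to-prior entropy of the Kullback–Leibler type has a closed form for Gaussians. The sketch below is purely illustrative: the means and covariances are hypothetical stand-ins, not the quantities of Equation (39), but it shows the mechanics of ranking two candidate priors by their divergence from a common post-data distribution.

```python
import numpy as np

def kl_gauss(m0, S0, m1, S1):
    # Closed-form KL( N(m0, S0) || N(m1, S1) ) for multivariate normals.
    k = len(m0)
    S1inv = np.linalg.inv(S1)
    d = m1 - m0
    return 0.5 * (np.trace(S1inv @ S0) + d @ S1inv @ d - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# A hypothetical post-data distribution for a 2-dimensional signal,
# and two candidate priors for the same signal.
post_m, post_S = np.array([1.0, -2.0]), 0.1 * np.eye(2)
priorA = (np.zeros(2), 10.0 * np.eye(2))     # diffuse prior centered at 0
priorB = (np.array([1.0, -2.0]), np.eye(2))  # prior centered near the post-data mean
klA = kl_gauss(post_m, post_S, *priorA)
klB = kl_gauss(post_m, post_S, *priorB)
# The prior closer to the post-data distribution yields the smaller divergence.
print(klA, klB)
```

In the spirit of the entropy-ratio comparison, the prior with the smaller post-data-to-prior divergence is the one better supported by the data, holding the likelihood component fixed.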
