An Entropic Estimator for Linear Inverse Problems

In this paper we examine an information-theoretic method for solving noisy linear inverse estimation problems which encompasses a whole class of estimation methods under a single framework. Under this framework, the prior information about the unknown parameters (when such information exists) and constraints on the parameters can be incorporated in the statement of the problem. The method builds on the basics of the maximum entropy principle and consists of transforming the original problem into the estimation of a probability density on an appropriate space naturally associated with the statement of the problem. This estimation method is generic in the sense that it provides a framework for analyzing non-normal models, it is easy to implement, and it is suitable for all types of inverse problems, such as those with small, ill-conditioned, or noisy data. First-order approximations, large sample properties and convergence in distribution are developed as well. Analytical examples and statistics for model comparison and evaluation, inherent to this method, are discussed and complemented with explicit examples.


Introduction
Researchers in all disciplines are often faced with small and/or ill-conditioned data. Unless much is known, or assumed, about the underlying process generating these data (the signal and the noise), such data lead to ill-posed, noisy (inverse) problems. Traditionally, these problems are solved using parametric and semi-parametric estimators such as least squares, regularization and non-likelihood methods. In this work, we propose a semi-parametric information-theoretic method for solving these problems while allowing the researcher to impose prior knowledge in a non-Bayesian way. The model developed here provides a major extension of the Generalized Maximum Entropy model of Golan, Judge and Miller [1] and provides new statistical results for the estimators discussed in Gzyl and Velásquez [2].
The overall purpose of this paper is fourfold. First, we develop a generic information-theoretic method for solving linear, noisy inverse problems that uses minimal distributional assumptions. This method is generic in the sense that it provides a framework for analyzing non-normal models, and it allows the user to incorporate prior knowledge in a non-Bayesian way. Second, we provide detailed analytic solutions for a number of possible priors. Third, using the concentrated (unconstrained) model, we compare our estimator to other estimators, such as least squares, regularization and Bayesian methods. Our proposed model is easy to apply and suitable for analyzing a whole class of linear inverse problems across the natural and social sciences. Fourth, we provide the large sample properties of our estimator.
To achieve our goals, we build on the current information-theoretic (IT) literature, which is founded on the Maximum Entropy (ME) principle (Jaynes [3,4]) and on Shannon's [5] information measure (entropy), as well as other generalized entropy measures. To clarify the relationship between the familiar linear statistical model and the approach we take here, we now briefly define our basic problem, discuss its traditional solution, and outline the basic logic and related literature we use to solve that problem.
Consider the basic (linear) problem of estimating the K-dimensional location parameter vector (signal, input) β given an N-dimensional observed sample (response) vector y and an N×K design (transfer) matrix X such that y = Xβ + ε, where ε is an N-dimensional random vector with E[ε] = 0 and some positive definite covariance matrix with a scale parameter σ².
The statistical nature of the unobserved noise term is supposed to be known, and we suppose that the second moments of the noise are finite. The researcher's objective is to estimate the unknown vector β with minimal assumptions on ε. Recall that under the traditional regularity conditions for the linear model (and for X of rank K), the least squares (LS), unconstrained, estimator is β̂ = (XᵗX)⁻¹Xᵗy, where "t" stands for transpose.
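As a concrete baseline, the unconstrained LS estimator above can be computed directly. The sketch below uses simulated data purely for illustration; the dimensions and coefficient values are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for the linear model y = X beta + eps (dimensions are illustrative).
N, K = 50, 3
X = rng.normal(size=(N, K))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Unconstrained least squares estimator: beta_hat = (X^t X)^{-1} X^t y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against the library solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(beta_hat, beta_lstsq)
```

Solving the normal equations with `np.linalg.solve` avoids forming an explicit inverse, which is the standard numerically stable choice when XᵗX is well-conditioned.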
Consider now the problem of estimating β and ε simultaneously while imposing minimal assumptions on the likelihood structure and while incorporating certain constraints on the signal, and perhaps on the noise. Further, rather than following the tradition of employing point estimators, consider estimating the empirical distributions of the unknown quantities β_k and ε_n with the joint objective of maximizing both in-sample and out-of-sample prediction.
With these objectives, the problem is inherently under-determined and cannot be solved with the traditional least squares or likelihood approaches. Therefore, one must resort to a different principle. In the work done here, we follow the Maximum Entropy (ME) principle that was developed by Jaynes [3,4] for similar problems. The classical ME method consists of using a variational method to choose a probability distribution from a class of probability distributions having pre-assigned generalized moments.
In more general terms, consider the problem of estimating an unknown discrete probability distribution from a finite and possibly noisy set of observed generalized (sample) moments, that is, arbitrary functions of the data. These moments (and the fact that the distribution is proper) are supposed to be the only available information. Regardless of the level of noise in these observed moments, if the dimension of the unknown distribution is larger than the number of observed moments, there are infinitely many proper probability distributions satisfying this information. Such a problem is called an under-determined problem. Which one of the infinitely many solutions that satisfy the data should one choose? Within the class of information-theoretic (IT) methods, the chosen solution is the one that maximizes an information criterion: entropy. The procedure that we propose below to solve the estimation problem described above fits within that framework.
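The classical ME recipe just described can be illustrated with Jaynes's die problem: a single observed mean constrains a six-point distribution, leaving an under-determined problem whose maximum-entropy solution has the exponential form p_i ∝ exp(λ x_i). A minimal numerical sketch (the target mean of 4.5 is an arbitrary illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Jaynes's die example: recover a distribution on {1,...,6} from one
# observed generalized moment (the mean), an under-determined problem
# whose unique maximum-entropy solution is p_i proportional to exp(lam * x_i).
x = np.arange(1, 7, dtype=float)
target_mean = 4.5  # observed generalized moment (illustrative)

# Concentrated (dual) objective: ln Z(lam) - lam * mu; convex in lam.
def dual(lam):
    return np.log(np.exp(lam * x).sum()) - lam * target_mean

lam = minimize_scalar(dual).x
p = np.exp(lam * x)
p /= p.sum()

assert np.isclose(p @ x, target_mean, atol=1e-5)  # moment constraint holds
assert np.isclose(p.sum(), 1.0)                   # proper distribution
```

Minimizing the one-dimensional dual instead of maximizing entropy over six probabilities is the same concentration device the paper uses throughout: one multiplier per observed moment.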
We construct our proposed estimator for solving the noisy, inverse, linear problem in two basic steps. In the first step, each unknown parameter (β_k and ε_n) is constructed as the expected value of a certain random variable. That is, we view the possible values of the unknown parameters as values of random variables whose distributions are to be determined. We will assume that the range of each such random variable contains the true unknown value of β_k and ε_n, respectively. This step actually involves two specifications. The first one is the pre-specified support space for the two sets of parameters (finite/infinite and/or bounded/unbounded). At the outset of Section 2 we shall do this as part of the mathematical statement of the problem. Any further information we may have about the parameters is incorporated into the choice of a prior (reference) measure on these supports. Since a model for the noise is usually supposed to be known, the statistical nature of the noise is incorporated at this stage. As far as the signal goes, this is an auxiliary construction. This constitutes our second specification.
In the second step, because imposing minimal assumptions on the likelihood implies that such a problem is under-determined, we resort to the ME principle. This means that we need to convert this under-determined problem into a well-posed, constrained optimization. As in the classical ME method, the objective function in that constrained optimization problem is composed of K + N entropy functions: one for each of the K + N proper probability distributions (one for each signal component β_k and one for each noise component ε_n). The constraints are just the observed information (data) and the requirement that all probability distributions are proper. Maximizing (simultaneously) the K + N entropies subject to the constraints yields the desired solution. This optimization yields a unique set of proper probability distributions, which in turn yields the desired point estimates β_k and ε_n. Once the constrained model is solved, we construct the concentrated (unconstrained) model. In the method proposed here, we also allow the introduction of different priors corresponding to one's beliefs about the data generating process and the structure of the unknown β's.
Our proposed estimator is a member of the IT family of estimators. The members of this family include the Empirical Likelihood (EL), the Generalized EL (GEL), the Generalized Method of Moments (GMM), the Bayesian Method of Moments (BMOM), the Generalized Maximum Entropy (GME), and the Maximum Entropy in the Mean (MEM), and are all related to the classical Maximum Entropy, ME (e.g., Owen [6,7]; Qin and Lawless [8]; Smith [9]; Newey and Smith [10]; Kitamura and Stutzer [11]; Imbens et al. [12]; Zellner [13,14]; Zellner and Tobias [15]; Golan, Judge and Miller [1]; Gamboa and Gassiat [16]; Gzyl [17]; Golan and Gzyl [18]). See also Gzyl and Velásquez [2], which builds upon Golan and Gzyl [18], where the synthesis was first proposed. If, in addition, the data are ill-conditioned, one often has to resort to the class of regularization methods (e.g., Hoerl and Kennard [19]; O'Sullivan [20]; Breiman [21]; Tibshirani [22]; Titterington [23]; Donoho et al. [24]; Besnerais et al. [25]). A reference for regularization in statistics is Bickel and Li [26]. If some prior information on the data generation process or on the model is available, Bayesian methods are often used. For a detailed review and synthesis of the IT family of estimators, with historical perspective, see Golan [27]. For other background and related entropy and IT methods of estimation, see the special volume of Advances in Econometrics (Fomby and Hill [28]) and the two special issues of the Journal of Econometrics [29,30]. For additional mathematical background see Mynbaev [31] and Asher, Borchers and Thurber [32].
Our proposed generic IT method provides an estimator for the parameters of the linear statistical model that reconciles some of the objectives achieved by each of the above methods. Like the philosophy behind the EL, we do not assume a pre-specified likelihood, but rather recover the (natural) weight of each observation via the optimization procedure (e.g., Owen [7]; Qin and Lawless [8]). Similar to regularization methods used for ill-behaved data, we follow the GME logic and use the pre-specified support space for each of the unknown parameters as a form of regularization (e.g., Golan, Judge and Miller [1]). The estimated parameters must fall within that space. However, unlike the GME, our method allows for infinitely large support spaces and continuous prior distributions. Like Bayesian approaches, we do use prior information, but we use these priors in a different way: in a way consistent with the basics of information theory and in line with the Kullback-Leibler entropy discrepancy measure. In that way, we combine ideas from the different methods described above to yield an IT estimator that is consistent, statistically and computationally efficient, and easy to apply.
In Section 2, we lay out the basic formulation and then develop our basic model. In Section 3, we provide detailed closed-form examples for the normal priors' case and other priors. In Section 4, we develop the basic statistical properties of our estimator, including a first-order approximation. In Section 5, we compare our method with least squares, regularization and Bayesian methods, including the Bayesian Method of Moments; the comparisons are done under normal priors. An additional set of analytical examples, providing the formulation and solution for four basic priors (bounded, unbounded and a combination of both), is developed in Section 6. In Section 7, we comment on model comparison; detailed closed-form formulations for that section are given in an Appendix. We conclude in Section 8. The Appendices provide the proofs and detailed analytical formulations.

Notation and Problem Statement
Consider the linear statistical model y = Xβ + ε, where β is an unknown K-dimensional signal vector that cannot be directly measured but is required to satisfy some convex constraints, expressed as β ∈ C_s, such that (1) holds. For that, we convert problem (1) into a generalized moment problem and consider the estimated β and ε as expected values of random variables z and v with respect to an unknown probability law P. Note that z is an auxiliary random variable, whereas v is the actual model for the noise perturbing the measurements. Formally:
Assumption 2.1. The range of z is the constraint set C_s embodying the constraints that the unknown β is to satisfy. Similarly, we assume that the range of v is a closed convex set C_n, where "s" and "n" stand for signal and noise, respectively. Unless otherwise specified, and in line with tradition, it is assumed that v is symmetric about zero.
Comment. This assumption implies that for any strictly positive density ρ(z, v) we have:

dQ(z, v) = dQ_s(z) dQ_n(v) and dP(z, v) = dP_s(z) dP_n(v).
To solve problems like (1) with minimal assumptions one has to (i) incorporate some prior knowledge, or constraints, on the solution, or (ii) specify a certain criterion to choose among the infinitely many solutions, or (iii) use both approaches. The different criteria used within the IT methods are all directly related to Shannon's information (entropy) criterion (Golan [33]). The criterion used in the method developed and discussed here is Shannon's entropy. For a detailed discussion and further background see, for example, the two special issues of the Journal of Econometrics [29,30].

The Solution
In what follows we explain how to transform the original linear problem into a generalized moment problem, or how to transform any constrained linear model like (1) into a problem consisting of finding an unknown density.
Instead of searching directly for the point estimates (β, ε)ᵗ, we view them as the expected value of auxiliary random variables (z, v)ᵗ that take values in the convex set C_s × C_n and are distributed according to some unknown auxiliary probability law dP(z, v). Thus:
(β, ε)ᵗ = E_P[(z, v)ᵗ],  (2)
where E_P denotes the expected value with respect to P.
Taking z and v to be independent amounts to assuming an a priori independence of signal and noise. This is a natural assumption, as the signal part is a mathematical artifact and the noise part is the actual model of the randomness/noise.
There are potentially many candidate densities ρ that satisfy (3). To find one (the least informative one given the data), we set up the following variational problem: find ρ*(z, v) that maximizes the entropy functional S_Q(ρ), defined by:
S_Q(ρ) = −∫ ρ(z, v) ln ρ(z, v) dQ(z, v)  (4)
on the following admissible class of densities, where "ln" stands for the natural logarithm. As usual we extend x ln x as 0 at x = 0. If the maximization problem has a solution, the estimates satisfy the constraints and Equations (1) or (3). The familiar and classical answer to the problem of finding such a ρ* is expressed in the following lemma.
Lemma 2.1. Assume that ρ is any positive density with respect to dQ and that ln ρ is integrable with respect to dP = ρdQ; then S_Q(P) < 0.
Proof. By the concavity of the logarithm and Jensen's inequality it is immediate to verify (6). Before applying this result to our model, we define A = [X I] as the N×(K+N) matrix obtained by juxtaposing X and the N×N identity matrix I. We now work with the matrix A, which allows us to consider the larger data space rather than just the more traditional moment space. This is shown and discussed explicitly in the examples and derivations of Sections 4-6. For practical purposes, when facing a relatively small sample, the researcher may prefer working with A rather than with the sample moments, because for a finite sample the total information captured by using A is larger than when using the sample's moments.
To apply Lemma 2.1 to our model, let ρ be any member of the exponential (parametric) family:
ρ_λ(z, v) = exp(−⟨λ, A(z, v)ᵗ⟩) / Ω(λ),  (7)
where λ ∈ ℝ^N are N free parameters that will play the role of Lagrange multipliers (one multiplier for each observation). The quantity Ω(λ) is the normalization function:
Ω(λ) = ∫ exp(−⟨λ, A(z, v)ᵗ⟩) dQ(z, v),  (8)
which is the Laplace transform of Q evaluated at Aᵗλ. Next, taking logs in (7) and defining the concentrated function Σ(λ) = ln Ω(λ) + ⟨λ, y⟩, one obtains a bound valid for any λ ∈ ℝ^N and for any ρ in the class of probability laws P_C defined in (5). However, it remains to verify that the minimizer yields the solution to (1) that is consistent with the data (3). Formally, the result is contained in the following theorem. (Note that the Kullback measure (Kullback [34]) is a particular case of S_Q(P), with a sign change, when both P and Q have densities.)
∇Σ(λ*) = 0. Note that this is equivalent to minimizing (9), which is the concentrated likelihood-entropy function. Notice as well that β* = E_{P*}[z] and ε* = E_{P*}[v].
Comment. This theorem is practically equivalent to representing the estimator in terms of the estimating equations. Estimating equations (or functions) are the underlying equations from which the roots, or solutions, are derived. The logic for using these equations is that (i) they have a simpler form (e.g., a linear form for the LS estimator) than their roots, and (ii) they preserve the sampling properties of their roots (Durbin [35]). To see the direct relationship between estimating equations and the dual/concentrated model (extremum estimator), note that the estimating equations are the first order conditions of the respective extremum problem. The choice of estimating equations is appropriate whenever the first order conditions characterize the global solution to the (extremum) optimization problem, which is the case in the model discussed here.
Theorem 2.1 can be summarized as follows: in order to determine β and ε from (1), it is easier to transform the algebraic problem into the problem of obtaining the minimum of the convex function Σ(λ).

Closed Form Examples
With the above formulation, we now turn to a number of relatively simple analytical examples. These examples demonstrate the advantages of our method and its simplicity. In Section 6 we provide additional closed-form examples.

Normal Priors
In this example the index d takes the possible values (dimensions) K, N, or K+N, depending on whether it relates to C_s (or z), to C_n (or v), or to both. Assume the reference prior dQ is that of a normal random vector, whose vector of prior means is specified by the researcher. Next, we define the Laplace transform, Z(τ), of the normal prior.
This transform involves the diagonal covariance matrix for the noise and signal models. Replacing τ by either Xᵗλ (for the signal vector) or by λ (for the noise vector) verifies that Ω(λ) turns out to be of quadratic form, and therefore the problem of minimizing Σ(λ) is just a quadratic minimization problem. In this case, no bounds are specified on the parameters; instead, normal priors are used.
From (10) we get the concentrated model (11), with a minimum at λ* satisfying Mλ* = y − Aζ₀, and therefore λ* = M⁻¹(y − Aζ₀). For the general case A = [X I], the generalized entropy solution for the traditional linear model follows; here B = y − Aζ₀. See Appendix 2 for a detailed derivation.
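Because the concentrated model is quadratic under normal priors, the solution has a closed "Wiener filter" form (compare Section 5). The sketch below assumes a signal prior N(z₀, D₁) and a zero-mean noise prior with covariance D₂ (the matrix names follow the comparison section; the numbers are illustrative), and checks numerically that this form agrees with the penalized-GLS form via the Woodbury identity:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 40, 4
X = rng.normal(size=(N, K))
y = X @ rng.normal(size=K) + rng.normal(size=N)

z0 = np.zeros(K)         # prior means for the signal (illustrative)
D1 = 0.5 * np.eye(K)     # prior covariance of the signal
D2 = 0.1 * np.eye(N)     # prior covariance of the noise

# Closed form under normal priors (Wiener-filter form):
# beta* = z0 + D1 X^t (X D1 X^t + D2)^{-1} (y - X z0).
M = X @ D1 @ X.T + D2
beta_ge = z0 + D1 @ X.T @ np.linalg.solve(M, y - X @ z0)

# Equivalent penalized-GLS form, obtained via the Woodbury identity:
# beta = (X^t D2^{-1} X + D1^{-1})^{-1} (X^t D2^{-1} y + D1^{-1} z0).
D1i, D2i = np.linalg.inv(D1), np.linalg.inv(D2)
beta_gls = np.linalg.solve(X.T @ D2i @ X + D1i, X.T @ D2i @ y + D1i @ z0)

assert np.allclose(beta_ge, beta_gls)
```

The first form inverts an N×N matrix, the second a K×K matrix; which is cheaper depends on whether N or K is larger.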

Discrete Uniform Priors - A GME Model
Consider now uniform priors, which is basically the GME method (Golan, Judge and Miller [1]). Jaynes's classical ME estimator (Jaynes [3,4]) is a special case of the GME. Let the components of z take discrete values. Since the spaces are discrete, the information is described by the obvious σ-algebras, and both the prior and post-data measures are discrete. As a prior on the signal space, we may consider the product of uniform distributions over the support points,
where a similar expression may be specified for the priors on C n .Finally, we get: together with a similar expression for the Laplace transform of the noise prior.Notice that since the noise and signal are independent in the priors, this is also true for the post-data, so:
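A small numerical sketch of this GME construction: each β_k and ε_n is an expectation over discrete support points with uniform priors, and the concentrated (dual) function is minimized over one multiplier per observation. The support points, sign convention (matching the exp(−⟨λ, ·⟩) family in (7)), and data are all illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(2)
N, K = 30, 3
X = rng.normal(size=(N, K))
y = X @ np.array([0.8, -0.4, 0.2]) + 0.05 * rng.normal(size=N)

z = np.linspace(-2.0, 2.0, 5)   # signal support points (uniform prior)
v = np.linspace(-1.0, 1.0, 3)   # noise support points (uniform prior)

# Concentrated (dual) function: lam'y + sum_k ln Omega_k + sum_n ln Psi_n.
def dual(lam):
    t = X.T @ lam                # one multiplier per observation
    return (lam @ y
            + logsumexp(-np.outer(t, z), axis=1).sum()
            + logsumexp(-np.outer(lam, v), axis=1).sum())

lam = minimize(dual, np.zeros(N), method="BFGS").x

# Recover the post-data probabilities and the point estimates.
t = X.T @ lam
P = np.exp(-np.outer(t, z)); P /= P.sum(axis=1, keepdims=True)
W = np.exp(-np.outer(lam, v)); W /= W.sum(axis=1, keepdims=True)
beta_gme, eps_gme = P @ z, W @ v

# The gradient of the dual is y - X beta - eps, so at the optimum
# the data constraint holds.
assert np.allclose(X @ beta_gme + eps_gme, y, atol=1e-3)
```

Note how signal and noise probabilities factor, mirroring the independence of the priors noted in the text.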

Signal and Noise Bounded Above and Below
Consider the case in which both β and ε are bounded above and below. This time we place a Bernoulli measure on the constraint space C_s and the noise space C_n: each z_j takes one of the two values a_j or b_j, and each v_l takes one of the values ±e. The concentrated entropy function is then:
These are, respectively, the maximum entropy probabilities that the auxiliary random variables z_j attain the values a_j or b_j, and that the auxiliary random variables v_l describing the error terms attain the values ±e. These can also be obtained as the expected values of v and z with respect to the post-data measure P*(λ, dξ) given by: Note that this model is the continuous version of the discrete GME model described earlier.
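With two-point (Bernoulli) supports, the maximum-entropy probabilities have a logistic closed form. The sketch below assumes the exponential-family sign convention of (7), with illustrative support bounds and dual values:

```python
import numpy as np

# Two-point (Bernoulli) supports: each z_j takes the value a_j or b_j.
# With theta_j = (X^t lambda)_j and weights proportional to exp(-z theta),
# the maximum-entropy probability of the lower point a_j is logistic.
a, b = np.array([-1.0, 0.0]), np.array([1.0, 2.0])
theta = np.array([0.3, -0.7])   # illustrative dual values

p_a = 1.0 / (1.0 + np.exp(-(b - a) * theta))  # P(z_j = a_j)
beta = p_a * a + (1.0 - p_a) * b              # point estimate E[z_j]

assert np.all((p_a > 0) & (p_a < 1))
assert np.all((beta >= a) & (beta <= b))      # estimates stay inside the bounds
```

The same logistic form with a_l = −e, b_l = +e gives the noise probabilities, so the point estimates are automatically confined to the pre-specified bounds.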

Large Sample Properties
In this section we develop the basic statistical results. In order to develop these results for our generic IT estimator, we need to employ tools that differ from the standard tools used for developing asymptotic theories (e.g., Mynbaev [31] or Mittelhammer et al. [36]).

Notations and First Order Approximation
Denote by β*_N the estimator of the true β when the sample size is N. Throughout this section we add a subscript N to all quantities introduced in Section 2, to remind us that the size of the data set is N. We want to show that β*_N → β as N → ∞ in some appropriate way (for some covariance V). We state here the basic notations, assumptions and results and leave the details to the Appendix. The problem is that when N varies, we are dealing with problems of different sizes (recall that λ is of dimension N in our generic model). To turn all problems into problems of the same size, let:
The modified data vector and the modified error terms are K-dimensional (moment) vectors, and the modified design matrix is a K×K matrix. Problem (15), call it the moment, or the stochastic moment, problem, can be solved using the above generic IT approach, which reduces to minimizing the modified concentrated (dual) entropy function. Assume that there exists an invertible K×K symmetric and positive definite limit matrix, and assume as well that for any N-vector v the stated convergence holds as N → ∞. Recall that in finite dimensions all norms are equivalent, so convergence in any norm is equivalent to component-wise convergence. This implies that, under Assumption 4.1, the corresponding vectors converge to 0 in L², and therefore in probability.
as N → ∞, weakly in ℝ^K with respect to the appropriate induced measure.
Proof of Lemma 4.1. The required identity follows by a direct computation, which is equivalent to the assertion of the lemma.
Proof of Lemma 4.3. Recall that: From this, the desired result follows after a simple computation. We write the post-data probability that solves problem (15) (or (1)) as: Comment. The block structure of the covariance matrix results from the independence of the signal and the noise components in both the prior measure dQ and the post-data (maximum entropy) probability measure dP*.
Following the above, we assume: The proof is presented in the Appendix.

Consistency
The following lemma and proposition provide results related to the large sample behavior of our generalized entropy estimator. For simplicity of the proof, and without loss of generality, certain normalizations are imposed. Both proofs are presented in the Appendix.

Forecasting
Once the Generalized Entropy (GE) estimated vector β* has been found, we can use it to predict future, yet "unobserved", values. If the additive noise (ε or v) is distributed according to the same prior Q_n, and if future observations are determined by the design matrix X_f, then the possible future observations are described by a random variable y_f given by y_f = X_f β* + v. In the next section we contrast our estimator with other estimators. Then, in Section 6, we provide more analytic solutions for different priors.
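A minimal predictive sketch under these assumptions: a hypothetical estimated β*, a hypothetical future design X_f, and future noise drawn from an assumed normal noise prior Q_n (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical estimated signal and future design matrix X_f.
beta_star = np.array([1.2, -0.5])
X_f = rng.normal(size=(5, 2))

# Draw future noise v from the (assumed normal) noise prior Q_n and form
# predictive draws y_f = X_f beta* + v.
sigma = 0.3
draws = X_f @ beta_star + sigma * rng.normal(size=(1000, 5))

# The Monte Carlo mean recovers the point forecast X_f beta*.
point_forecast = draws.mean(axis=0)
assert np.allclose(point_forecast, X_f @ beta_star, atol=0.1)
```

The spread of the draws around the point forecast quantifies the predictive uncertainty inherited from the noise prior.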

Method Comparison
In this section we contrast our IT estimator with other estimators that are often used for estimating the location vector β in the noisy, inverse linear problem. We start with the least squares (LS) model, continue with the generalized LS (GLS), and then discuss the regularization method often used for ill-posed problems. We then contrast our estimator with a Bayesian one and with the Bayesian Method of Moments (BMOM). We also show the exact correspondence between our estimator and the other estimators under normal priors.

The General Case
We first consider the purely geometric/algebraic approach for solving the linear model (1). A traditional method consists of solving the variational problem (16). The rationale here is that, because of the noise ε, the data y = Xβ_True + ε may fall outside the range {Xβ : β ∈ C_s}; if the problem is well-posed, the solution coincides with (14). If, on the other hand, the problem is ill-posed (e.g., X is not invertible), then the solution is not unique, and a combination of the above two methods (16 and 18) can be used. This yields the regularization method, consisting of finding the β at which (20) is achieved (see, for example, Donoho et al. [24] for a nice discussion of regularization within the ME formulation). Traditionally, the positive penalization parameter α is specified to favor small-sized reconstructions, meaning that out of all possible reconstructions with a given discrepancy, those with the smallest norms are chosen. The norms in (20) can be chosen to be weighted, so that the model can be generalized to (21). The solution then coincides with our entropy solution independently of y; for this equality to hold, α = 1.
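The penalized problem (20), with squared Euclidean norms, has the familiar Tikhonov/ridge closed form (XᵗX + αI)⁻¹Xᵗy. A sketch with an illustrative nearly collinear design, cross-checked against an equivalent augmented least-squares system:

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 25, 6
X = rng.normal(size=(N, K))
X[:, 5] = X[:, 4] + 1e-8 * rng.normal(size=N)   # nearly collinear: ill-conditioned
y = X @ rng.normal(size=K) + 0.1 * rng.normal(size=N)

alpha = 1.0   # penalization parameter favouring small-norm reconstructions

# Tikhonov/ridge solution: argmin ||y - X b||^2 + alpha ||b||^2.
beta_reg = np.linalg.solve(X.T @ X + alpha * np.eye(K), X.T @ y)

# Same solution via an augmented least-squares system (a standard identity):
# stack sqrt(alpha) * I under X and zeros under y.
X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(K)])
y_aug = np.concatenate([y, np.zeros(K)])
beta_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

assert np.allclose(beta_reg, beta_aug)
```

Even though XᵗX here is numerically singular, the penalized system is well-conditioned, which is exactly the point of regularization for ill-posed data.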
The above result shows that if we weigh the discrepancy between the observed data (y) and its true value (Xβ) by the prior covariance matrix D₂, the penalized GLS and our entropy solutions coincide for α = 1 and for normal priors. Proof of Lemma 5.2.
The equality must hold for all y, which implies the following chain of identities. Clearly there are only two possibilities. First, if the noise components are not constant, D₂ is invertible and therefore Xᵗ must vanish (a trivial but uninteresting case). Second, if the variance of the noise component is zero, (1) becomes a pure linear inverse problem (i.e., we solve y = Xβ).

The Moments' Case
Up to now, the comparison was done where the Generalized Entropy (GE) estimator was optimized over a larger space (A) than the other LS or GLS estimators; in other words, the constraints in the GE estimator are the data points rather than the moments. The comparison is easier if one performs the above comparisons over similar spaces, namely using the sample's moments. This can easily be done if XᵗX is invertible, by re-specifying A to be the generic matrix A = [XᵗX Xᵗ] rather than A = [X I]. Now, let y′ ≡ Xᵗy, X′ ≡ XᵗX, and ε′ ≡ Xᵗε; then the problem is represented as y′ = X′β + ε′, and the conditions for λ* follow as before.
In general, when XᵗX is invertible, it is easy to verify that the solutions to variational problems of the type y′ ≡ Xᵗy = XᵗXβ are of the form (XᵗX)⁻¹Xᵗy. In one case, the problem is to find (23), while in the other case the solution consists of finding (24). Under this "moment" specification, the solutions to the three different methods described above (16, 23 and 24) coincide.
where Var(Z|y) is the total variance of the K random variates z in Z. Finally, it is important to emphasize that the Bayesian approach provides us with a whole range of tools for inference, forecasting, model averaging, posterior intervals, etc. In this paper, however, the focus is on estimation and on the basic comparison of our GE method with other methods under the notations and formulations developed here. Extensions to testing and inference are left for future work.

A Standard Example: Normal Priors
As before, we view β and ε as realizations of random variables Z and V having informative normal "a priori" distributions (priors for signal and noise, respectively).
For notational convenience we assume that both Z and V are centered at zero and independent, and that both covariance matrices D₁ and D₂ are strictly positive definite. For comparison purposes, we use the same notation as in Section 3. The randomness is propagated to the data Y, so that the conditional density of Y (the conditional prior on y) is normal. The conditional distribution of Z given Y is easy to obtain under the normal setup. Thus, the post-data distribution of the signal β given the data y is normal with mean Ly, where L = D₁Xᵗ(XD₁Xᵗ + D₂)⁻¹. That is, the "posterior (post-data)" distribution of Z has been changed (relative to the prior) by the data. Finally, the post-data expected value of Z is given by (28). This is the traditional Bayesian solution for the linear regression using the support spaces for both signal and noise within the framework developed here. As before, one can compare this Bayesian solution with our Generalized Entropy solution. Equation (28) is comparable with our solution (14) for z₀ = 0, which is the Generalized Entropy method with normal priors and centers of support equal to zero. In addition, it is easy to see that the Bayesian solution (28) coincides with the penalized GLS (model (24)) for α = 1.
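The posterior mean Ly can be checked numerically. The sketch below (with illustrative covariance values) also makes visible a standard limiting case: as the signal prior becomes diffuse, the Bayesian posterior mean approaches the LS solution of the unconstrained problem:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 30, 3
X = rng.normal(size=(N, K))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.2 * rng.normal(size=N)

D2 = 0.04 * np.eye(N)   # noise prior covariance (illustrative)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior mean E[Z|y] = L y with the Wiener-filter matrix
# L = D1 X^t (X D1 X^t + D2)^{-1}; as the signal prior becomes diffuse
# (D1 -> infinity), the posterior mean approaches the LS estimate.
for tau in [0.1, 10.0, 1e6]:
    D1 = tau * np.eye(K)
    L = D1 @ X.T @ np.linalg.inv(X @ D1 @ X.T + D2)
    post_mean = L @ y
    gap = np.linalg.norm(post_mean - beta_ls)

assert gap < 1e-3   # diffuse prior: posterior mean ~ LS solution
```

For small prior variance the posterior mean is instead shrunk toward the prior center (here zero), which is the Bayesian counterpart of the support-center behaviour of the GE estimator.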
A few comments on these brief comparisons are in order. First, under both approaches the complete posterior (or post-data) density is estimated, not only the posterior mean, though under the GE estimator the post-data density is related to the pre-specified spaces and priors. (Recall that the Bayesian posterior means are specific to a particular loss function.) Second, the agreement between the Bayesian result and the minimizer of (24) with α = 1 assumes a known value of σ_v², which is contained in D₂. In the Bayesian result σ_v² is marginalized, so the solution is not conditional on that parameter. Therefore, with a known value of σ_v², both estimators are the same.
There are two reasons for the equivalence of the three methods (GE, Bayes and penalized GLS). The first is that there are no binding constraints imposed on the signal and the noise. The second is the choice of normal densities as informative priors for both signal and noise. In fact, this result is standard in inverse problem theory, where L is known as the Wiener filter (see for example Bertero and Boccacci [37]). In that sense, the Bayesian technique and the GE technique have some procedural ingredients in common, but the distinguishing factor is the way the posterior (post-data) is obtained. (Note that "posterior" for the entropy method means the "post-data" distribution, which is based on both the priors and the data, obtained via the optimization process.)
In one case it is obtained by maximizing the entropy functional, while in the Bayesian approach it is obtained by a direct application of Bayes' theorem. For more background and a related derivation of the ME and Bayes rule see Zellner [38,39].

Comparison with the Bayesian Method of Moments (BMOM)
The basic idea behind Zellner's BMOM is to avoid a likelihood function. This is done by maximizing the continuous (Shannon) entropy subject to the empirical moments of the data. This yields the most conservative (closest to uniform) post-data density (Zellner [14,40-43]; Zellner and Tobias [15]). In that way the BMOM uses only assumptions on the realized error terms, which are used to derive the post-data density.
Building on the above references, assume (XᵗX)⁻¹ exists; then the LS solution to (1) is β̂_LS = (XᵗX)⁻¹Xᵗy, which is assumed to be the post-data mean with respect to the (yet) unknown distribution (likelihood). To find g(z|Data), or in Zellner's notation g(β|Data), one applies the classical ME with the following constraints (information): the mean β̂_LS and a second-moment condition in which σ² is a positive parameter. Then, the maximum entropy density satisfying these two constraints (and the requirement that it is a proper density) is normal. This is the BMOM post-data density for the parameter vector, with mean β̂_LS, under the two side conditions used here. If more side conditions are used, the density function g will not be normal.
Information other than moments can also be incorporated within the BMOM. Comparing Zellner's BMOM with our Generalized Entropy method, we note that the BMOM produces the post-data density from which one can compute the vector of point estimates β̂_LS of the unconstrained problem (1). Under the GE model, the solution β*_GE satisfies the data/constraints within the joint support space C. Further, the GE construction does not need to impose exact moment constraints, meaning it provides a more flexible post-data density. Finally, under both methods one can use the post-data densities to calculate the uncertainties around future (unobserved) observations.

More Closed Form Examples
In Section 3, we formulated three relatively simple closed form cases. In the current section, we extend our analytic solutions to a host of potential priors. This section demonstrates the capabilities of our proposed estimator. We do not intend here to formulate our model under all possible priors. We present our examples in such a way that the priors can be assigned to either the signal or the noise components. The different priors we discuss correspond to different prior beliefs: unbounded (unconstrained), bounded below, bounded above, or bounded below and above. The following set of examples, together with those in Section 3, represents different cases of commonly used prior distributions and their corresponding partition functions. Specifically, the different cases are the Laplace (bilateral exponential), which is symmetric but with heavy tails; the Gamma distribution, which is bounded below and non-symmetric; the continuous and discrete uniform distributions; and the Bernoulli distribution, which allows an easy specification of a prior mean that is not at the center of the pre-specified supports. In all the examples below, the index d takes the possible values (dimensions) K, N, or K+N depending on whether it relates to C_s (or z), to C_n (or v), or to both. We note that information theoretic procedures were also used for producing priors (e.g., Jeffreys', Berger and Bernardo's, Zellner's, etc.). In future work we will try to relate them to the procedure developed here. In these approaches, β is not always viewed as the mean, as given in Equation (2). For example, Jeffreys, Zellner (e.g., Zellner [41]) and others have used Cauchy priors and unbounded measure priors, for which the mean does not exist.
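The partition functions of the reference measures named above can be collected in one place. The sketch below defines the Laplace transform Z(t) of each reference measure; all parameter values (tau0, alpha, theta, the bounds a, b, and the weights p, q) are illustrative assumptions.

```python
import numpy as np

# Laplace transforms (moment generating functions) Z(t) of the reference
# measures discussed in the text; parameters are illustrative.
def Z_laplace(t, tau0=2.0):
    # density (tau0/2) exp(-tau0 |z|), defined for |t| < tau0
    return tau0**2 / (tau0**2 - t**2)

def Z_gamma(t, alpha=2.0, theta=1.0, l=0.0):
    # Gamma(shape alpha, rate theta) shifted by the lower bound l, for t < theta
    return np.exp(t * l) * (theta / (theta - t))**alpha

def Z_uniform(t, a=-1.0, b=1.0):
    # continuous uniform on [a, b] (t != 0)
    return (np.exp(t * b) - np.exp(t * a)) / (t * (b - a))

def Z_bernoulli(t, a=-1.0, b=1.0, p=0.5, q=0.5):
    # point masses p and q at a and b
    return p * np.exp(t * a) + q * np.exp(t * b)

for name, Z in [("Laplace", Z_laplace), ("Gamma", Z_gamma),
                ("Uniform", Z_uniform), ("Bernoulli", Z_bernoulli)]:
    print(name, Z(0.3))
```

Each Z satisfies Z(t) → 1 as t → 0, and the derivative of ln Z at zero recovers the prior mean (e.g., alpha/theta = 2 for the Gamma case above).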

The Basic Formulation
Case 1. Bilateral Exponential (Laplace) Distribution. Like the normal distribution, another possible unconstrained model is obtained if we take as reference measure a bilateral exponential, or Laplace, distribution. This is useful for modeling distributions with tails heavier than the normal. The following derivation holds both for our generic model (captured via the generic matrix A) and for the signal or noise parts separately. We only provide here the solution for the signal part.
In this case, the density of dQ is the Laplace density (τ_0/2)exp(−τ_0|z|). Minimizing Σ(λ) by equating its gradient with respect to λ to zero, we obtain that at the minimum: (30) Explicitly: (31) Finally, having solved for the optimal vector λ* that minimizes Σ(λ), and such that the previous identity holds, we can rewrite our model as: (32) where, rather than solving (31) directly, we make use of λ*, which minimizes Σ(λ) and satisfies (30).
As expected, the post-data density is a well-defined Laplace distribution (Kotz et al. [44]), but this distribution is no longer symmetric, and the decay rate is modified by the data. Specifically, the tilt (X^T λ*) shifts the two one-sided exponential decay rates of each component in opposite directions, to τ_0 ∓ (X^T λ*)_k.
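A minimal numerical sketch of this case, assuming a Laplace reference (rate tau0) for the signal and a zero-mean normal reference (variance sigma2) for the noise; the data, tau0, and sigma2 are illustrative, and the dual is minimized with a generic optimizer rather than via (31):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, K = 4, 2
X = rng.normal(size=(N, K))
y = X @ np.array([0.3, -0.2]) + 0.05 * rng.normal(size=N)
tau0, sigma2 = 2.0, 0.05   # assumed Laplace rate (signal) and noise variance

def dual(lam):
    """Concentrated entropy: ln Z(X'lam) for the signal + ln Z for the noise - <lam, y>."""
    theta = X.T @ lam
    if np.any(np.abs(theta) >= tau0):      # outside the domain of the Laplace transform
        return np.inf
    lnZ_signal = np.sum(np.log(tau0**2) - np.log(tau0**2 - theta**2))
    lnZ_noise = 0.5 * sigma2 * lam @ lam   # zero-mean normal reference
    return lnZ_signal + lnZ_noise - lam @ y

res = minimize(dual, np.zeros(N), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 50000})
theta = X.T @ res.x
beta_star = 2 * theta / (tau0**2 - theta**2)   # gradient of ln Z: post-data signal mean
eps_star = sigma2 * res.x                      # post-data noise mean
print(beta_star)
```

At the optimum, the gradient condition reproduces the data, X β* + ε* = y, which serves as a feasibility check on the reconstruction.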
Case 2. Lower Bounds (Gamma Distribution). Suppose that the β's are all bounded below by theory. Then, we can specify a random vector Z with values in the positive orthant translated to the lower bound K-dimensional vector l.
Case 3. Bounds on Signal and Noise. Next, consider the case where each component takes values in some bounded interval [a_j, b_j]. A common choice for the bounds of the errors' supports in that case is the three-sigma rule (Pukelsheim [44]), where "sigma" is the empirical standard deviation of the sample analyzed (see, for example, Golan, Judge and Miller [1] for a detailed discussion). In this situation we provide two simple (and extreme) choices for the reference measure. The first is a uniform measure on [a_j, b_j], and the second is a Bernoulli distribution supported on a_j and b_j.

Uniform Reference Measure
In this case the reference (prior) measure dQ(z) is distributed according to the uniform density on [a_j, b_j], and the Laplace transform Z(τ) is finite for every vector τ.

Bernoulli Reference Measure
In this case the reference measure is singular (with respect to the volume measure) and is given by dQ(z) = Π_j (p_j δ_{a_j}(dz_j) + q_j δ_{b_j}(dz_j)), where δ_c denotes the (Dirac) unit point mass at the point c, and where p_j and q_j do not have to sum to one, yet they determine the weights within the bounded interval [a_j, b_j]. Here again, Z(τ) is finite for all τ.
In this case, there is no common criterion that can be used to decide which a priori reference measure to choose. In many specific cases, we have noticed that a reconstruction with the discrete Bernoulli prior with p = q = 1/2 yields estimates that are very similar to those under the continuous uniform prior.
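The similarity noted above can be checked numerically. The sketch below reconstructs the same data under a symmetric uniform reference on [−b, b] and a Bernoulli reference with p = q = 1/2 on {−b, b}; the data, b, and sigma2 are illustrative assumptions, and the tilted (post-data) means are obtained by numerical differentiation of ln Z:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, K = 5, 2
X = rng.normal(size=(N, K))
y = X @ np.array([0.4, -0.3]) + 0.05 * rng.normal(size=N)
b, sigma2 = 1.0, 0.05      # assumed symmetric support [-b, b] and noise variance

def lnZ_uniform(t):
    """ln Z for the uniform reference on [-b, b]; series expansion near t = 0."""
    x = np.asarray(t, dtype=float) * b
    small = np.abs(x) < 1e-3
    safe = np.where(small, 1.0, x)
    return np.where(small, x**2 / 6.0, np.log(np.sinh(safe) / safe))

def lnZ_bernoulli(t):
    """ln Z for point masses p = q = 1/2 at -b and b."""
    return np.log(np.cosh(np.asarray(t, dtype=float) * b))

def reconstruct(lnZ):
    dual = lambda lam: lnZ(X.T @ lam).sum() + 0.5 * sigma2 * lam @ lam - lam @ y
    lam = minimize(dual, np.zeros(N), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 50000}).x
    t, h = X.T @ lam, 1e-6
    beta = (lnZ(t + h) - lnZ(t - h)) / (2 * h)   # tilted means by central differences
    return beta, sigma2 * lam

beta_u, eps_u = reconstruct(lnZ_uniform)
beta_b, eps_b = reconstruct(lnZ_bernoulli)
print(beta_u, beta_b)
```

Both reconstructions satisfy the data constraint X β* + ε* = y, and for informative data the two sets of point estimates are close, consistent with the observation in the text.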
Case 4. Symmetric Bounds. This is a special case of Case 3 above with a_j = −c_j and b_j = c_j for positive c_j's. The corresponding versions of (35) and (36), the uniform and Bernoulli, are respectively: (37) and: (38)

The Full Model
Having developed the basic formulations and building blocks of our model, we note that the list can be extended considerably and that these building blocks can be assembled into a variety of combinations. We already demonstrated such a case in Section 3.3. We now provide another example.
where λ is found by minimizing the concentrated entropy function Σ(λ), l = (l_1, ..., l_K), and l determines the "shift" for each coordinate.

A Comment on Model Comparison
So far we have described our model and its properties, and specified some closed-form examples. The next question facing the researcher is how to decide on the most appropriate prior/model for a given set of data. In this section, we briefly comment on a few possible model comparison techniques.
A possible criterion for comparing estimations (reconstructions) resulting from different priors should be based on a comparison of the post-data entropies associated with the proposed setup.
Implicit in the choice of priors is the choice of supports (Z and V), which in turn is dictated by the constraints imposed on the β's. Assumption 2.1 means that these constraints are properly specified, namely there is no arbitrariness in the choice of C_s. The choice of a specific model for the noise involves two assumptions. The first concerns the support, which reflects the actual range of the errors. The second is the choice of a prior describing the distribution of the noise within that support. To contrast two possible priors, we want to compare the reconstructions provided by the different models for the signal and noise variables. Within the information theoretic approach taken here, comparing the post-data entropies seems a reasonable choice.
From a practical point of view, the post-data entropies depend on the priors and the data in an explicit but nonlinear way. All we can say for certain is that for all models (or priors) the optimal solution is: (39) where λ* has to be computed by minimizing the concentrated entropy function Σ(λ), and it is clear that the total entropy difference between the post-data and the priors is just the entropy difference for the signal plus the entropy difference for the noise. Note that 2Σ(λ*) is asymptotically χ² distributed with d degrees of freedom, where d is the dimension of λ. This is the entropy ratio statistic, which is similar in nature to the empirical likelihood ratio statistic (e.g., Golan [27]). Rather than discussing this statistic here, we provide in Appendix 3 analytic formulations of Equation (39) for a large number of prior distributions. These formulations are based on the examples of earlier sections. Last, we note that in some cases, where the competing models are of different dimensions, a normalization of both statistics is necessary.

Conclusions
In this paper we developed a generic information theoretic method for solving noisy linear inverse problems. This method uses minimal a priori assumptions and allows us to incorporate constraints and priors in a natural way for a whole class of linear inverse problems across the natural and social sciences. This inversion method is generic in the sense that it provides a framework for analyzing non-normal models, and it performs well also for data that are not of full rank.
We provided detailed analytic solutions for a large class of priors. We developed the first order properties as well as the large sample properties of the estimator. In addition, we compared our model to other methods such as Least Squares, Penalized LS, Bayesian methods and the Bayesian Method of Moments.
The main advantage of the proposed model over LS and ML methods is its better performance (more stable estimates with lower variances) for (possibly small) finite samples. The smaller the sample and/or the more ill-behaved (e.g., collinear) the sample is, the better this method performs. However, if one knows the underlying distribution, and the sample is well behaved and large enough, the traditional ML is the correct model to use. The other advantages of our proposed model (relative to the GME and other IT estimators) are that (i) we can impose different priors (discrete or continuous) on the signal and the noise, (ii) we estimate the full distribution of each of the two sets of unknowns (signal and noise), and (iii) our model is based on minimal assumptions.
In future research, we plan to study the small sample properties, as well as to develop statistics to evaluate the performance of competing priors and models. We conclude by noting that the same framework developed here can be easily extended to nonlinear estimation problems. This is because all the available information enters as stochastic constraints within the constrained optimization problem.
Next, we solve for the optimal β and ε. Recalling that the optimal solutions are β* and ε*, and following the derivations of Section 3, we get: and: , so: Rewriting the exponent in the numerator as: and incorporating it yields: , where the second right-hand side term equals 1. Finally: To check our solution, note that , so: and finally: which is (14), where B = y − Ac_0. Within the basic model, it is clear that , or . Under the natural case where the errors' priors are centered at zero ( ), and . If in addition , then .

Appendix 3: Model Comparisons -Analytic Examples
We provide here detailed analytical formulations for constructing the dual (concentrated) GE model under different priors. These can be used to derive the entropy ratio statistics based on Equation (39).
Example 1. The priors for both the signal and the noise are normal. In that case, the final post-data entropy, computed in Section 3, is: . This seems to be the only case amenable to full analytical computation.
Example 6. Both signal and noise have bounded supports, and we assume uniform priors for both. The post-data entropy is: . Finally, we complete the set of examples with, probably, the most common case.
Example 7. Uniform priors on bounded intervals for the signal components and normal priors for the noise. The post-data entropy is: . We reemphasize that this model comparison can only be used to compare models after each model has been completely worked out, and for a given data set. Finally, we presented here the case of comparing the total entropies of the post-data to the priors but, as Equation (39) shows, one can also compare the post- and pre-data entropies of only the signal, or of only the noise.
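For Example 1 (normal priors for both signal and noise), the concentrated entropy is quadratic, so λ* has a closed form. The following is a sketch under assumed zero prior centers, an assumed signal prior covariance D1, and an assumed noise variance sigma2 (all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 6, 2
X = rng.normal(size=(N, K))
y = X @ np.array([0.7, -0.4]) + 0.1 * rng.normal(size=N)
D1 = 0.5 * np.eye(K)      # assumed prior covariance of the signal (center 0)
sigma2 = 0.01             # assumed noise variance (normal prior, center 0)

# With normal references for both signal and noise, the concentrated entropy
# is quadratic in lambda, so lambda* solves a linear system:
M = X @ D1 @ X.T + sigma2 * np.eye(N)
lam = np.linalg.solve(M, y)
beta_star = D1 @ X.T @ lam     # post-data mean of the signal
eps_star = sigma2 * lam        # post-data mean of the noise
print(beta_star)
```

By construction X β* + ε* = M λ* = y, so the data constraint holds exactly; this is also the case in which the GE solution reduces to a ridge/GLS-type formula.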
© 2012 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
⟨a, b⟩ denotes the Euclidean scalar (inner) product of the vectors a and b.
ρ_λ(z, v) is a member of P_C for some λ. Therefore, we search for the λ* at which Σ(λ) attains its minimum. If such a λ* is found, then we have found a density (the unique one, since S_Q is strictly convex in ρ) that satisfies the constraints (3) or (1) and maximizes the entropy. Proof. Consider the gradient of Σ(λ) at λ*. The equation to be solved to determine λ* coincides with Equation (3) when the gradient is written out explicitly.
the estimates β* and ε*. The above procedure is designed in such a way that β* ∈ C_s is automatically satisfied. Since the actual measurement noise is unknown, it is treated as a quantity to be determined, i.e., mathematically as if both β and ε were unknown. The interpretations of the reconstructed residual ε* and the reconstructed β* are different. The latter is the unknown parameter vector we are after, while the former is the residual (reconstructed error) such that the linear Equation (1) is satisfied. With that background, we now discuss the basic properties of our model. For a detailed comparison of a large number of IT estimation methods, see the text of Mittelhammer, Judge and Miller [36]; for detailed derivations and discussion of the GME, see Golan, Judge and Miller [1].
the signal and noise bounds [a_j, b_j] and e respectively. The Bernoulli a priori measure places point masses at these bounds, where δ_c denotes the (Dirac) unit point mass at the point c. Recalling that A = [X I], we now compute the Laplace transform Z(t) of Q, which in turn yields Ω(λ) = Z(A^T λ):

and Assumption 4.1. Corollary 4.1. Lemma 4.1.
yields the above conclusion. (To keep the notation simple, and without loss of generality, we discuss here the case of σ²I_N.) By ... as N → ∞ (the proof is immediate). Under Assumption 4.1, assume that for real a, exp( )

Corollary 4.2. Lemma 4.4.
the solution for the N-dimensional (data) problem and dP_N^0(ξ) is the solution for the moment problem, we have the following result: With the notation introduced above and by Lemma 4. , with respect to Q. The invertibility of the functions defined above is related to the non-singularity of their Jacobian matrices, which are the covariances of ξ. These functions will be invertible as long as these quantities are positive definite. The relationship among the above quantities is expressed in the following lemma: With the notation introduced above and in (8), and recalling that we suppose that

Assumption 4.3. The eigenvalues of the Hessian matrix of ln Ω_N(μ) are uniformly (with respect to N and μ) bounded away from zero.

Lemma 4.6. Proposition 4.2 (Consistency in squared mean). Under the same assumptions as Lemma 4.5, since E[ε] = 0 and the ε are homoskedastic, β* → β in squared mean as N → ∞. Next, we provide our main result, convergence in distribution. (Convergence in distribution). Under the same assumptions as in Lemma 4.5, ... as N → ∞, where →_D stands for convergence in distribution (or law).

Lemma 5.1. D can be substituted for any weight matrix of interest. Using the first component, we can state the following: with the above notation, β*_GE = β*_P for D = 1.

Lemma 5.2. With the above notation, β*_GE = β*_LS when the constraints are in terms of pure moments (zero moments).
For the rest of this section, the priors g_s(z) and g_n(v) will have their usual Bayesian interpretation. For a given z, we think of y = Xz + v as a realization of the random variable Y = Xz + V. Then, the joint density of Y and Z, where Z is distributed according to the prior Q_s, is: and the transform of this density is:

Example 2. Laplace prior for the state space variables plus a uniform prior (on [−e, e]) for the noise term. The full post-data entropy is: .
Example 3. Normal prior for the state space variables and a uniform prior (on [−e, e]) for the noise. The post-data entropy is: , where z_0 is the center of the normal priors and D_1 is the covariance matrix of the state space variables.

Example 4. A Gamma prior for the state space variables and a uniform prior (on [−e, e]) for the noise term. In this case the post-data entropy is: .
ε is a noise vector. Throughout this paper we assume that the components of the noise vector ε are i.i.d. random variables with zero mean and variance σ².

It is reasonable to assume that C_n is convex and symmetric in ℝ^N. The closures of the convex hulls of the supports of Q_s and Q_n are respectively C_s and C_n, and we set dQ = dQ_s × dQ_n.
To relate the solution of problem (1) to that of problem (15), i.e., the solution of the problem of size K to that of the problem of size N, we have: , where Ω_N and Σ_N are the functions introduced in Section 2 for a problem of size N.
so the objective is to minimize that discrepancy. The minimizer is η*(D). In this case we get the GLS solution β̂ = (X′D⁻¹X)⁻¹X′D⁻¹y for any general (covariance) matrix D with blocks D_1 and D_2. If, on the other hand, our objective is to reconstruct simultaneously both the signal and the noise, we can rewrite (1) as: Case 3. Bounds on Signal and Noise. Consider the case where each component of Z and V takes values in some bounded interval. Bounded Parameters and Normally Distributed Errors. Consider the common case of naturally bounded signal and normally distributed errors. This case combines Case 2 of Section 6.1 together with the normal case discussed in Section 3.1. Let: Example 5. The priors for the state space variables are Laplace and the priors for the noise are normal.