# An Entropic Estimator for Linear Inverse Problems


## Abstract


## 1. Introduction

Consider the problem of recovering the unknown K-dimensional vector **β** given an N-dimensional observed sample (response) vector **y** and an N × K design (transfer) matrix **X** such that **y** = **Xβ** + **ε**, where **ε** is an N-dimensional random vector with E[**ε**] = **0** and some positive definite covariance matrix with a scale parameter $\sigma^2$. The statistical nature of the unobserved noise term is assumed to be known, and we suppose that the second moments of the noise are finite. The researcher’s objective is to estimate the unknown vector **β** with minimal assumptions on **ε**. Recall that under the traditional regularity conditions for the linear model (and for **X** of rank K), the unconstrained least squares (LS) estimator is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{y}$ and $\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^t\mathbf{X})^{-1}$, where “t” stands for transpose.

Consider now the problem of estimating **β** and **ε** simultaneously while imposing minimal assumptions on the likelihood structure and while incorporating certain constraints on the signal and perhaps on the noise. Further, rather than following the tradition of employing point estimators, consider estimating the empirical distribution of the unknown quantities $\beta_k$ and $\varepsilon_n$ with the joint objectives of maximizing the in-sample and out-of-sample prediction.

In the first step, each unknown quantity ($\beta_k$ and $\varepsilon_n$) is constructed as the expected value of a certain random variable. That is, we view the possible values of the unknown parameters as values of random variables whose distributions are to be determined. We assume that the range of each such random variable contains the true unknown value of $\beta_k$ and $\varepsilon_n$ respectively. This step actually involves two specifications. The first one is the pre-specified support space for the two sets of parameters (finite/infinite and/or bounded/unbounded). At the outset of Section 2 we do this as part of the mathematical statement of the problem. Any further information we may have about the parameters is incorporated into the choice of a prior (reference) measure on these supports. Since a model for the noise is usually supposed to be known, the statistical nature of the noise is incorporated at this stage. As far as the signal goes, this is an auxiliary construction. This constitutes our second specification.

In the second step, an entropy functional is specified for each unknown (one for each signal component $\beta_k$ and one for each noise component $\varepsilon_n$). The constraints are just the observed information (data) and the requirement that all probability distributions are proper. Maximizing (simultaneously) the N + K entropies subject to the constraints yields the desired solution. This optimization yields a unique solution in terms of a unique set of proper probability distributions, which in turn yields the desired point estimates $\beta_k$ and $\varepsilon_n$. Once the constrained model is solved, we construct the concentrated (unconstrained) model. In the method proposed here, we also allow the introduction of different priors corresponding to one’s beliefs about the data generating process and the structure of the unknown β’s.

## 2. Problem Statement and Solution

#### 2.1. Notation and Problem Statement

Consider the basic linear model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (1)$$

where the unknown signal **β** is required to lie in $C_s$, and $C_s$ is a closed convex set. For example, with constants . (These constraints may come from constraints on , and may have a natural reason for being imposed).

**X** is an N × K known linear operator (design matrix) that can be either fixed or stochastic, **y** is an N-dimensional vector of noisy observations, and **ε** is an N-dimensional noise vector. Throughout this paper we assume that the components of the noise vector **ε** are i.i.d. random variables with zero mean and variance $\sigma^2$ with respect to a probability law $dQ_n(\mathbf{v})$ on $\mathbb{R}^N$. We denote by $Q_s$ and $Q_n$ the prior probability measures reflecting our knowledge about **β** and **ε** respectively.

Given the observations **y**, our objective is to simultaneously recover **β** and the residuals **ε** so that Equation (1) holds. For that, we convert problem (1) into a generalized moment problem and consider the estimated **β** and **ε** as expected values of random variables **z** and **v** with respect to an unknown probability law P. Note that **z** is an auxiliary random variable whereas **v** is the actual model for the noise perturbing the measurements. Formally:

**Assumption 2.1.** The range of **z** is the constraint set $C_s$ embodying the constraints that the unknown **β** is to satisfy. Similarly, we assume that the range of **v** is a closed convex set $C_n$, where “s” and “n” stand for signal and noise respectively. Unless otherwise specified, and in line with tradition, it is assumed that **v** is symmetric about zero.

**Comment.** The set $C_n$ is convex and symmetric in $\mathbb{R}^N$. Further, in some cases the researcher may know the statistical model of the noise. In that case, this model should be used. As stated earlier, $Q_s$ and $Q_n$ are the prior probability measures for **β** and **ε** respectively. To ensure that the expected values of **z** and **v** fall in $C = C_s \times C_n$ we need the following assumption.

**Assumption 2.2.** The closures of the convex hulls of the supports of $Q_s$ and $Q_n$ are respectively $C_s$ and $C_n$, and we set $dQ = dQ_s \times dQ_n$.

**Comment.**

**z**,

**v**) we have:

#### 2.2. The Solution

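Instead of searching directly for the point estimates $(\boldsymbol{\beta}, \boldsymbol{\varepsilon})^t$, we view them as the expected value of auxiliary random variables $(\mathbf{z}, \mathbf{v})^t$ that take values in the convex set $C_s \times C_n$ and are distributed according to some unknown auxiliary probability law $dP(\mathbf{z}, \mathbf{v})$. Thus:

$$\boldsymbol{\beta} = E_P[\mathbf{z}], \qquad \boldsymbol{\varepsilon} = E_P[\mathbf{v}]$$

where $E_P$ denotes the expected value with respect to P.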

The reference measure is $dQ(\mathbf{z}, \mathbf{v}) = dQ_s(\mathbf{z})\,dQ_n(\mathbf{v})$ on the Borel subsets of the product space $C = C_s \times C_n$. Again, note that while C is binding, $Q_s$ describes one’s own belief/knowledge on the unknown **β**, whereas $Q_n$ describes the actual model for **ε**. With the above specification, problem (1) becomes:

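**Problem (1) restated:** Find a density $\rho(\mathbf{z}, \mathbf{v})$ such that $dP = \rho\,dQ$ is a probability law on C and the linear relations

$$\mathbf{y} = \mathbf{X}E_P[\mathbf{z}] + E_P[\mathbf{v}] \qquad (3)$$

hold, where $E_P[\mathbf{z}]$ is an estimator of **β** and $E_P[\mathbf{v}]$ is an estimator of the noise.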

**Comment.** The product form $dQ(\mathbf{z}, \mathbf{v}) = dQ_s(\mathbf{z})\,dQ_n(\mathbf{v})$ amounts to assuming a priori independence of signal and noise. This is a natural assumption, as the signal part is a mathematical artifact and the noise part is the actual model of the randomness/noise.

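Among the densities that satisfy these requirements, we select the density $\rho^*(\mathbf{z}, \mathbf{v})$ that maximizes the entropy functional $S_Q(\rho)$, defined by:

$$S_Q(\rho) = -\int_C \rho(\mathbf{z}, \mathbf{v}) \ln \rho(\mathbf{z}, \mathbf{v})\, dQ(\mathbf{z}, \mathbf{v})$$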

^{*}is expressed in the following lemma.

**Lemma**

**2.1.**

_{Q}(P) < 0.

Define **A** = [**X I**] as the N × (K + N) matrix obtained by juxtaposing **X** and the N × N identity matrix **I**. Working with the matrix **A** allows us to consider the larger space rather than just the more traditional moment space. This is shown and discussed explicitly in the examples and derivations of Sections 4, 5 and 6. For practical purposes, when facing a relatively small sample, the researcher may prefer working with **A** rather than with the sample moments. This is because, for a finite sample, the total information captured by using **A** is larger than when using the sample’s moments.
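To make the role of **A** concrete, here is a minimal NumPy sketch. The sizes, the seed, the coefficient values and all variable names are illustrative assumptions, not taken from the paper. It stacks signal and noise into one vector ξ = (**β**, **ε**) and checks that **y** = **Xβ** + **ε** is the same statement as **y** = **A**ξ, a system of N equations in K + N unknowns.

```python
import numpy as np

rng = np.random.default_rng(0)          # illustrative seed
N, K = 5, 2                             # illustrative sizes

X = rng.normal(size=(N, K))             # design matrix
beta = np.array([1.0, -0.5])            # hypothetical "true" signal
eps = 0.1 * rng.normal(size=N)          # hypothetical noise draw
y = X @ beta + eps                      # the linear model y = X beta + eps

# Juxtapose X and the N x N identity: A = [X I], of size N x (K + N).
A = np.hstack([X, np.eye(N)])
xi = np.concatenate([beta, eps])        # stacked unknowns xi = (beta, eps)

print(A.shape)                          # (5, 7)
print(np.allclose(A @ xi, y))           # True
print(np.linalg.matrix_rank(A))         # 5
```

Because **A** has rank N but K + N columns, the data alone cannot pin down ξ; an additional criterion (here, entropy) is needed to select a single solution.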

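To describe the solution, consider the family of densities

$$\rho(\boldsymbol{\lambda}, \mathbf{z}, \mathbf{v}) = \frac{e^{-\langle \boldsymbol{\lambda}, \mathbf{A}\boldsymbol{\xi}\rangle}}{\Omega(\boldsymbol{\lambda})}, \qquad \boldsymbol{\xi} = (\mathbf{z}, \mathbf{v})^t$$

where $\langle \mathbf{a}, \mathbf{b}\rangle$ denotes the Euclidean scalar product of the vectors **a** and **b**, and $\boldsymbol{\lambda} \in \mathbb{R}^N$ are N free parameters that will play the role of Lagrange multipliers (one multiplier for each observation). The quantity Ω(**λ**) is the normalization function:

$$\Omega(\boldsymbol{\lambda}) = \int_C e^{-\langle \boldsymbol{\lambda}, \mathbf{A}\boldsymbol{\xi}\rangle}\, dQ(\boldsymbol{\xi})$$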

Writing $\Sigma(\boldsymbol{\lambda}) = \ln \Omega(\boldsymbol{\lambda}) + \langle \boldsymbol{\lambda}, \mathbf{y}\rangle$ for the concentrated entropy function, one has $\Sigma(\boldsymbol{\lambda}) \geq S_Q(\rho)$ for any $\boldsymbol{\lambda} \in \mathbb{R}^N$ and for any ρ in the class of probability laws P(C) defined in (5). However, the problem is that we do not know whether the solution $\rho^*(\boldsymbol{\lambda}, \mathbf{z}, \mathbf{v})$ is a member of P(C) for some **λ**. Therefore, we search for $\boldsymbol{\lambda}^*$ such that $\rho^* = \rho(\boldsymbol{\lambda}^*)$ is in P(C) and Σ attains its minimum at $\boldsymbol{\lambda}^*$. If such a $\boldsymbol{\lambda}^*$ is found, then we have found a density (the unique one, for $S_Q$ is strictly concave in ρ) that maximizes the entropy, and by using the fact that $\boldsymbol{\beta}^* = E_{P^*}[\mathbf{z}]$ and $\boldsymbol{\varepsilon}^* = E_{P^*}[\mathbf{v}]$, the solution to (1), which is consistent with the data (3), is found. Formally, the result is contained in the following theorem. (Note that Kullback’s measure (Kullback [34]) is a particular case of $S_Q(P)$, with a sign change and when both P and Q have densities.)

**Theorem 2.1.** Suppose that the minimum of Σ(**λ**) is achieved at $\boldsymbol{\lambda}^*$. Then $\rho^* = \rho(\boldsymbol{\lambda}^*)$ satisfies the set of constraints (3) or (1) and maximizes the entropy.

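To verify this, consider the gradient of Σ(**λ**) at $\boldsymbol{\lambda}^*$. The equation to be solved to determine $\boldsymbol{\lambda}^*$ is $\nabla_{\boldsymbol{\lambda}}\Sigma(\boldsymbol{\lambda}) = \mathbf{0}$, which coincides with Equation (3) when the gradient is written out explicitly.

Note that $\Sigma(\boldsymbol{\lambda}^*) = S_Q(\rho^*)$.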

**Comment.** Rather than solving directly for **β** and **ε** from (1), it is easier to transform the algebraic problem into the problem of obtaining a minimum of the convex function Σ(**λ**) and then use $\boldsymbol{\beta}^* = E_{P^*}[\mathbf{z}]$ and $\boldsymbol{\varepsilon}^* = E_{P^*}[\mathbf{v}]$ to compute the estimates $\boldsymbol{\beta}^*$ and $\boldsymbol{\varepsilon}^*$. The above procedure is designed in such a way that $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}^*$ is automatically satisfied. Since the actual measurement noise is unknown, it is treated as a quantity to be determined, and treated (mathematically) as if both **β** and **ε** were unknown. The interpretations of the reconstructed residual $\boldsymbol{\varepsilon}^*$ and the reconstructed $\boldsymbol{\beta}^*$ are different. The latter is the unknown parameter vector we are after, while the former is the residual (reconstructed error) such that the linear Equation (1), $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}^*$, is satisfied. With that background, we now discuss the basic properties of our model. For a detailed comparison of a large number of IT estimation methods see Golan [27,28,29,30,31,32,33] and the text of Mittelhammer, Judge and Miller [36].
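The statement that the data constraint holds automatically at the minimizer can be checked in a few lines. The derivation below is a sketch under an assumed sign convention: the exponential family $\rho(\boldsymbol{\lambda}) = e^{-\langle \boldsymbol{\lambda}, \mathbf{A}\boldsymbol{\xi}\rangle}/\Omega(\boldsymbol{\lambda})$ with $\boldsymbol{\xi} = (\mathbf{z}, \mathbf{v})^t$, $\mathbf{A} = [\mathbf{X}\ \mathbf{I}]$, and the concentrated function $\Sigma(\boldsymbol{\lambda}) = \ln\Omega(\boldsymbol{\lambda}) + \langle \boldsymbol{\lambda}, \mathbf{y}\rangle$. The symbol $P(\boldsymbol{\lambda})$ for the law with density $\rho(\boldsymbol{\lambda})$ with respect to Q is introduced here only for this illustration.

```latex
\begin{aligned}
\Omega(\boldsymbol{\lambda})
  &= \int_C e^{-\langle \boldsymbol{\lambda}, \mathbf{A}\boldsymbol{\xi}\rangle}\, dQ(\boldsymbol{\xi}), \\
\nabla_{\boldsymbol{\lambda}} \ln \Omega(\boldsymbol{\lambda})
  &= -\frac{1}{\Omega(\boldsymbol{\lambda})}
     \int_C \mathbf{A}\boldsymbol{\xi}\,
     e^{-\langle \boldsymbol{\lambda}, \mathbf{A}\boldsymbol{\xi}\rangle}\, dQ(\boldsymbol{\xi})
   = -\mathbf{A}\, E_{P(\boldsymbol{\lambda})}[\boldsymbol{\xi}], \\
\nabla_{\boldsymbol{\lambda}} \Sigma(\boldsymbol{\lambda})
  &= \mathbf{y} - \mathbf{A}\, E_{P(\boldsymbol{\lambda})}[\boldsymbol{\xi}]
   = \mathbf{y} - \mathbf{X}\, E_{P(\boldsymbol{\lambda})}[\mathbf{z}]
     - E_{P(\boldsymbol{\lambda})}[\mathbf{v}].
\end{aligned}
```

Setting the gradient to zero at $\boldsymbol{\lambda}^*$ therefore gives $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}^*$, which is the data constraint.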

## 3. Closed Form Examples

#### 3.1. Normal Priors

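The approach below applies to $C_s$ (or **z**), to $C_n$ (or **v**) or to both. Assume the reference prior dQ is the law of a normal random vector with d × d [i.e., K × K, N × N or (N + K) × (N + K)] covariance matrix **D**, which has density

$$\frac{\exp\left(-\tfrac{1}{2}\langle \boldsymbol{\xi} - \mathbf{c}_0, \mathbf{D}^{-1}(\boldsymbol{\xi} - \mathbf{c}_0)\rangle\right)}{(2\pi)^{d/2}(\det \mathbf{D})^{1/2}}$$

where $\mathbf{c}_0$ is the vector of prior means and is specified by the researcher. Next, we define the Laplace transform, ω(**τ**), of the normal prior. This transform involves the diagonal covariance matrix for the noise and signal models:

$$\omega(\boldsymbol{\tau}) = E_Q\left[e^{-\langle \boldsymbol{\tau}, \boldsymbol{\xi}\rangle}\right] = \exp\left(-\langle \boldsymbol{\tau}, \mathbf{c}_0\rangle + \tfrac{1}{2}\langle \boldsymbol{\tau}, \mathbf{D}\boldsymbol{\tau}\rangle\right) \qquad (10)$$

Replacing **τ** by either $\mathbf{X}^t\boldsymbol{\lambda}$ (for the signal) or by **λ** (for the noise vector) verifies that ln Ω(**λ**) turns out to be of a quadratic form, and therefore the problem of minimizing Σ(**λ**) is just a quadratic minimization problem. In this case, no bounds are specified on the parameters. Instead, normal priors are used.

For the full model, $\boldsymbol{\tau} = \mathbf{A}^t\boldsymbol{\lambda}$ and $\Sigma(\boldsymbol{\lambda}) = \ln\Omega(\boldsymbol{\lambda}) + \langle \boldsymbol{\lambda}, \mathbf{y}\rangle = \langle \boldsymbol{\lambda}, \mathbf{y} - \mathbf{A}\mathbf{c}_0\rangle + \tfrac{1}{2}\langle \boldsymbol{\lambda}, \mathbf{A}\mathbf{D}\mathbf{A}^t\boldsymbol{\lambda}\rangle$, with a minimum at $\boldsymbol{\lambda}^*$, satisfying:

$$\mathbf{A}\mathbf{D}\mathbf{A}^t\boldsymbol{\lambda}^* = -(\mathbf{y} - \mathbf{A}\mathbf{c}_0)$$

If $\mathbf{M}^{\#}$ denotes the generalized inverse of $\mathbf{M} = \mathbf{A}\mathbf{D}\mathbf{A}^t$, then $\boldsymbol{\lambda}^* = -\mathbf{M}^{\#}(\mathbf{y} - \mathbf{A}\mathbf{c}_0)$ and therefore:

$$\boldsymbol{\xi}^* = \mathbf{c}_0 - \mathbf{D}\mathbf{A}^t\boldsymbol{\lambda}^* = \mathbf{c}_0 + \mathbf{D}\mathbf{A}^t\mathbf{M}^{\#}(\mathbf{y} - \mathbf{A}\mathbf{c}_0)$$

With **A** = [**X I**], $\mathbf{D} = \mathrm{diag}(\mathbf{D}_1, \mathbf{D}_2)$ and $\mathbf{c}_0 = (\mathbf{z}^0, \mathbf{v}^0)^t$:

$$\mathbf{M} = \mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{D}_2, \qquad \boldsymbol{\beta}^* = \mathbf{z}^0 + \mathbf{D}_1\mathbf{X}^t\mathbf{M}^{\#}(\mathbf{y} - \mathbf{A}\mathbf{c}_0), \qquad \boldsymbol{\varepsilon}^* = \mathbf{v}^0 + \mathbf{D}_2\mathbf{M}^{\#}(\mathbf{y} - \mathbf{A}\mathbf{c}_0)$$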
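The closed form under normal priors is easy to evaluate numerically. The following NumPy sketch assumes block-diagonal prior covariances $\mathbf{D}_1$ (signal) and $\mathbf{D}_2$ (noise) and prior means $\mathbf{z}^0$, $\mathbf{v}^0$, so that $\mathbf{M} = \mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{D}_2$ and the estimates are $\boldsymbol{\beta}^* = \mathbf{z}^0 + \mathbf{D}_1\mathbf{X}^t\mathbf{M}^{-1}(\mathbf{y} - \mathbf{A}\mathbf{c}_0)$ and $\boldsymbol{\varepsilon}^* = \mathbf{v}^0 + \mathbf{D}_2\mathbf{M}^{-1}(\mathbf{y} - \mathbf{A}\mathbf{c}_0)$. All numerical values and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)               # illustrative seed
N, K = 8, 3                                  # illustrative sizes

X = rng.normal(size=(N, K))
y = X @ np.array([0.8, -1.2, 0.4]) + 0.3 * rng.normal(size=N)

D1 = np.diag([2.0, 1.0, 0.5])                # assumed prior covariance of the signal
D2 = 0.09 * np.eye(N)                        # assumed prior covariance of the noise
z0 = np.zeros(K)                             # prior mean of the signal
v0 = np.zeros(N)                             # prior mean of the noise

# With A = [X I] and D = diag(D1, D2), M = A D A^t reduces to X D1 X^t + D2.
M = X @ D1 @ X.T + D2
B = y - (X @ z0 + v0)                        # B = y - A c0

w = np.linalg.solve(M, B)                    # M^{-1} (y - A c0)
beta_star = z0 + D1 @ X.T @ w                # signal estimate
eps_star = v0 + D2 @ w                       # reconstructed residual

# Posterior-mean / penalized-GLS form of the same estimator (zero prior means).
D1_inv, D2_inv = np.linalg.inv(D1), np.linalg.inv(D2)
ridge = np.linalg.solve(X.T @ D2_inv @ X + D1_inv, X.T @ D2_inv @ y)

print(np.allclose(X @ beta_star + eps_star, y))   # data constraint holds
print(np.allclose(beta_star, ridge))              # both forms agree
```

The first check reflects that the data constraint is satisfied exactly by construction; the second illustrates numerically that, for zero prior means, the estimate matches the familiar ridge-type expression.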

#### 3.2. Discrete Uniform Priors — A GME Model

**z**take discrete values, and let for 1 ≤ k ≤ K Note that we allow for the cardinality of each of these sets to vary. Next, define . A similar construction may be proposed for the noise terms, namely we put . Since the spaces are discrete, the information is described by the obvious σ-algebras and both the prior and post-data measures will be discrete. As a prior on the signal space, we may consider:

_{n}. Finally, we get:

#### 3.3. Signal and Noise Bounded Above and Below

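Consider the case where both the signal and the noise are bounded, with signal space $C_s$ and noise space $C_n$. Let $C_s = \prod_{j=1}^{K}[a_j, b_j]$ and $C_n = [-e, e]^N$ for the signal and noise bounds $a_j$, $b_j$ and e respectively. The Bernoulli a priori measure on $C = C_s \times C_n$ is:

$$dQ(\boldsymbol{\xi}) = \prod_{j=1}^{K} \tfrac{1}{2}\left[\delta_{a_j}(dz_j) + \delta_{b_j}(dz_j)\right] \prod_{l=1}^{N} \tfrac{1}{2}\left[\delta_{-e}(dv_l) + \delta_{e}(dv_l)\right]$$

where $\delta_c(dz)$ denotes the (Dirac) unit point mass at some point c. Recalling that **A** = [**X**, **I**], we now compute the Laplace transform ω(**t**) of Q, which in turn yields $\Omega(\boldsymbol{\lambda}) = \omega(\mathbf{A}^T\boldsymbol{\lambda})$:

$$\Omega(\boldsymbol{\lambda}) = \prod_{j=1}^{K} \tfrac{1}{2}\left(e^{-a_j\tau_j} + e^{-b_j\tau_j}\right) \prod_{l=1}^{N} \tfrac{1}{2}\left(e^{-e\lambda_l} + e^{e\lambda_l}\right), \qquad \boldsymbol{\tau} = \mathbf{X}^T\boldsymbol{\lambda}$$

The concentrated entropy function $\Sigma(\boldsymbol{\lambda}) = \ln\Omega(\boldsymbol{\lambda}) + \langle\boldsymbol{\lambda}, \mathbf{y}\rangle$ is then minimized to determine $\boldsymbol{\lambda}^*$. Once it has been found, then $\boldsymbol{\beta}^* = E_{P^*}[\mathbf{z}]$ and $\boldsymbol{\varepsilon}^* = E_{P^*}[\mathbf{v}]$. Explicitly:

$$\beta_j^* = \frac{a_j e^{-a_j\tau_j} + b_j e^{-b_j\tau_j}}{e^{-a_j\tau_j} + e^{-b_j\tau_j}}$$

where $\boldsymbol{\tau} = \mathbf{X}^T\boldsymbol{\lambda}^*$. Similarly:

$$\varepsilon_l^* = e\,\frac{e^{-e\lambda_l^*} - e^{e\lambda_l^*}}{e^{-e\lambda_l^*} + e^{e\lambda_l^*}}$$

These are weighted averages of the values that the auxiliary random variables can attain: the $z_j$ attain the values $a_j$ or $b_j$, and the auxiliary random variables $v_l$ describing the error terms attain the values ±e. These can also be obtained as the expected values of **v** and **z** with respect to the post-data measure $P^*(\boldsymbol{\lambda}, d\boldsymbol{\xi})$ given by:

$$P^*(\boldsymbol{\lambda}, d\boldsymbol{\xi}) = \frac{e^{-\langle\boldsymbol{\lambda}, \mathbf{A}\boldsymbol{\xi}\rangle}}{\Omega(\boldsymbol{\lambda})}\, dQ(\boldsymbol{\xi})$$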
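Unlike the normal case, the bounded case has no closed-form $\boldsymbol{\lambda}^*$, so the concentrated function must be minimized numerically. The sketch below is one possible way to do this, not the authors' implementation. It assumes equal-weight Bernoulli priors on $\{a_j, b_j\}$ for the signal and on $\{-e, e\}$ for the noise, the convention $\Sigma(\boldsymbol{\lambda}) = \ln\Omega(\boldsymbol{\lambda}) + \langle\boldsymbol{\lambda}, \mathbf{y}\rangle$, and a damped Newton iteration chosen here for illustration. The function names, data and tolerances are hypothetical.

```python
import numpy as np

def tilted_means(lam, X, a, b, e):
    """Post-data means of z and v under equal-weight Bernoulli priors."""
    tau = X.T @ lam
    w_b = 0.5 * (1.0 - np.tanh(0.5 * (b - a) * tau))  # post-data weight on b_j
    z = a + (b - a) * w_b
    v = -e * np.tanh(e * lam)
    return z, v, w_b

def dual(lam, X, y, a, b, e):
    """Sigma(lam) = ln Omega(lam) + <lam, y>, up to an additive constant."""
    tau = X.T @ lam
    return (np.logaddexp(-a * tau, -b * tau).sum()
            + np.logaddexp(-e * lam, e * lam).sum()
            + lam @ y)

def solve_bernoulli(X, y, a, b, e, tol=1e-10, max_iter=100):
    """Minimize Sigma by Newton's method with backtracking."""
    lam = np.zeros(X.shape[0])
    for _ in range(max_iter):
        z, v, w_b = tilted_means(lam, X, a, b, e)
        grad = y - X @ z - v                      # gradient of Sigma
        if np.linalg.norm(grad) < tol:
            break
        var_z = (b - a) ** 2 * w_b * (1.0 - w_b)  # post-data variances of z_j
        var_v = e ** 2 - v ** 2                   # post-data variances of v_l
        H = (X * var_z) @ X.T + np.diag(var_v)    # Hessian of Sigma
        step = np.linalg.solve(H, grad)
        t, f0 = 1.0, dual(lam, X, y, a, b, e)
        # Armijo backtracking; the 1e-12 slack absorbs rounding near the optimum.
        while dual(lam - t * step, X, y, a, b, e) > f0 - 1e-4 * t * (grad @ step) + 1e-12:
            t *= 0.5
        lam = lam - t * step
    z, v, _ = tilted_means(lam, X, a, b, e)
    return z, v, lam

rng = np.random.default_rng(2)                    # illustrative data
N, K = 6, 2
X = rng.normal(size=(N, K))
y = X @ np.array([0.5, -0.3]) + rng.uniform(-0.2, 0.2, size=N)
a, b, e = -np.ones(K), np.ones(K), 1.0            # assumed bounds

beta_star, eps_star, lam_star = solve_bernoulli(X, y, a, b, e)
print(np.max(np.abs(X @ beta_star + eps_star - y)))   # residual of the data constraint
```

The gradient of Σ is exactly the violation of the data constraint, so driving it to zero enforces $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}^* + \boldsymbol{\varepsilon}^*$ while keeping each estimate strictly inside its bounds.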

## 4. Main Results

#### 4.1. Large Sample Properties

#### 4.1.1. Notations and First Order Approximation

Denote by $\boldsymbol{\beta}_N^*$ the estimator of the true **β** when the sample size is N. Throughout this section we add a subscript N to all quantities introduced in Section 2 to remind us that the size of the data set is N. We want to show that $\boldsymbol{\beta}_N^* \to \boldsymbol{\beta}$ and $\sqrt{N}(\boldsymbol{\beta}_N^* - \boldsymbol{\beta}) \to N(\mathbf{0}, \mathbf{V})$ as N → ∞ in some appropriate way (for some covariance **V**). We state here the basic notations, assumptions and results and leave the details to the Appendix. The problem is that when N varies, we are dealing with problems of different sizes (recall that **λ** is of dimension N in our generic model). To turn all problems to the same size let:

**Assumption**

**4.1.**

**v**, as .

**0**in L

_{2}, therefore in probability. To see the logic for that statement, recall that the vector

**ε**

_{N}has covariance matrix σ

^{2}I

_{N}. Therefore, and assumption 4.1 yields the above conclusion. (To keep notations simple, and without loss of generality, we discuss here the case of σ

^{2}I

_{N}.)

**Corollary**

**4.1.**

**β**is the true but unknown vector of parameters. Then, as N → ∞ (the proof is immediate).

**Lemma**

**4.1.**

**Proof of**

**lemma 4.1.**

**Lemma**

**4.2.**

**Comment.**

**μ**

^{*}that minimizes satisfies: Or since W is invertible,

**β**admits the representation Note that this last identity can be written as

**Assumption 4.2.** **θ**(**τ**) is invertible and continuously differentiable.

Here $\Omega_N$ and $\Sigma_N$ are the functions introduced in Section 2 for a problem of size N. To relate the solution of the problem of size K to that of the problem of size N, we have:

**Lemma**

**4.3.**

_{N}(

**λ**).

**Proof of**

**Lemma 4.3.**

**Corollary**

**4.2.**

**ξ**. These functions will be invertible as long as these quantities are positive definite. The relationship among the above quantities is expressed in the following lemma:

**Lemma**

**4.4.**

_{2}= σ

^{2}I

_{N}we have: and where , and

**θ**' is the first derivative of

**θ**.

**Comment.**

**Assumption**

**4.3.**

**μ**) bounded below away from zero.

**Proposition 4.1.** Let $\psi_N(\mathbf{y})$, $\psi_\infty(\mathbf{y})$ respectively denote the compositional inverses of $\varphi_N(\boldsymbol{\mu})$, $\varphi_\infty(\boldsymbol{\mu})$. Then, as N → ∞, (i) $\varphi_N(\boldsymbol{\mu}) \to \varphi_\infty(\boldsymbol{\mu})$ and (ii) $\psi_N(\mathbf{y}) \to \psi_\infty(\mathbf{y})$.

#### 4.1.2. First Order Unbiasedness

**Lemma**

**4.5.**

**β**.

#### 4.1.3. Consistency

**Lemma 4.6.** If E[**ε**] = **0** and the **ε** are homoskedastic, then $\boldsymbol{\beta}_N^* \to \boldsymbol{\beta}$ in square mean as N → ∞.

**Proposition**

**4.2.**

- (a)
- as N → ∞,
- (b)
- as N → ∞,

#### 4.2. Forecasting

Once $\boldsymbol{\beta}^*$ has been found, we can use it to predict future (as yet) “unobserved” values. If the additive noise (**ε** or **v**) is distributed according to the same prior $Q_n$, and if future observations are determined by the design matrix $\mathbf{X}_f$, then the possible future observations are described by a random variable $\mathbf{y}_f$ given by $\mathbf{y}_f = \mathbf{X}_f\boldsymbol{\beta}^* + \mathbf{v}_f$. For example, if $\mathbf{v}_f$ is centered (on **0**), then $E[\mathbf{y}_f] = \mathbf{X}_f\boldsymbol{\beta}^*$ and:

## 5. Method Comparison

In this section we contrast our estimator with other methods for estimating **β** in the noisy, inverse linear problem. We start with the least squares (LS) model, continue with the generalized LS (GLS) and then discuss the regularization method often used for ill-posed problems. We then contrast our estimator with a Bayesian one and with the Bayesian Method of Moments (BMOM). We also show the exact correspondence between our estimator and the other estimators under normal priors.

#### 5.1. The Least Squares Methods

#### 5.1.1. The General Case

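The LS method finds the **β** that solves

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 \qquad (16)$$

Because of the noise **ε**, the data may fall outside the range of **X**, so the objective is to minimize that discrepancy. The minimizer $\hat{\boldsymbol{\beta}}$ of (16) provides the LS estimate, which minimizes the sum of squared distances from the data to the range of **X**. When $(\mathbf{X}^t\mathbf{X})^{-1}$ exists, then $\hat{\boldsymbol{\beta}} = (\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{y}$. The reconstruction error $\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ can be thought of as our estimate of the “minimal error in quadratic norm” of the measurement errors, or of the noise present in the measurements.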
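As a point of reference for the comparisons that follow, the LS estimate and its reconstruction error can be computed as in this short NumPy sketch. The data, sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)                 # illustrative seed
N, K = 20, 3                                   # illustrative sizes
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=N)

# LS estimate; lstsq returns a minimum-norm solution if X is rank deficient.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ls                        # reconstruction error

# Normal-equations form, valid here because X^t X is invertible.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_ls, beta_ne))               # both forms agree
print(np.allclose(X.T @ resid, 0.0, atol=1e-10))   # residual is orthogonal to the columns of X
```

The orthogonality of the residual to the columns of **X** is the sample-moment condition that characterizes LS.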

**D**with blocks

**D**

_{1}and

**D**

_{2}.

**A**and

**ξ**are as defined in Section 2. Since , and the matrix

**A**is of dimension N × (N + K), there are infinitely many solutions that satisfy the observed data in (1) (or (17)). To choose a single solution we solve the following model:

**ξ**:

_{s}× C

_{n}) and

**D** can be taken to be the full covariance matrix composed of both $\mathbf{D}_1$ and $\mathbf{D}_2$ defined in Section 3.1. Under the assumption that $\mathbf{M} \equiv \mathbf{A}\mathbf{D}\mathbf{A}^t$ is invertible, the solution to the variational problem (19) is given by $\boldsymbol{\xi}^* = \mathbf{D}\mathbf{A}^t\mathbf{M}^{-1}\mathbf{y}$. This solution coincides with our Generalized Entropy formulation when normal priors are imposed and are centered about zero ($\mathbf{c}_0 = \mathbf{0}$), as is developed explicitly in Equation (14).

**X**is not invertible), then the solution is not unique, and a combination of the above two methods (16 and 18) can be used. This yields the regularization method consisting of finding

**β**such that:

**D**

_{1}and

**D**

_{2}can be substituted for any weight matrix of interest. Using the first component of , we can state the following.

**Lemma**

**5.1.**

**Proof**

**of Lemma 5.1.**

**y**. For this equality to hold, α=1.

In other words, when weighting the discrepancy between the data (**y**) and its true value (**Xβ**) by the prior covariance matrix $\mathbf{D}_2$, the penalized GLS and our entropy solutions coincide for α = 1 and for normal priors.
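The coincidence at α = 1 rests on a standard matrix identity. The sketch below assumes, for illustration only, that the penalized GLS objective has the form $\alpha\,\boldsymbol{\beta}^t\mathbf{D}_1^{-1}\boldsymbol{\beta} + (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^t\mathbf{D}_2^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ with $\mathbf{D}_1$ and $\mathbf{D}_2$ positive definite, and that the prior means are zero.

```latex
\begin{aligned}
\hat{\boldsymbol{\beta}}_{\alpha}
  &= \left(\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{X} + \alpha\,\mathbf{D}_1^{-1}\right)^{-1}
     \mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{y}, \\
\left(\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{X} + \mathbf{D}_1^{-1}\right)\mathbf{D}_1\mathbf{X}^t
  &= \mathbf{X}^t\mathbf{D}_2^{-1}\left(\mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{D}_2\right), \\
\Longrightarrow\quad
\left(\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{X} + \mathbf{D}_1^{-1}\right)^{-1}\mathbf{X}^t\mathbf{D}_2^{-1}
  &= \mathbf{D}_1\mathbf{X}^t\left(\mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{D}_2\right)^{-1}.
\end{aligned}
```

The second line follows by expanding both sides to $\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{X}^t$. Hence at α = 1 the penalized GLS estimate equals $\mathbf{D}_1\mathbf{X}^t(\mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{D}_2)^{-1}\mathbf{y}$, which is the signal component of the normal-prior entropy solution with zero prior means.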

**Lemma**

**5.2.**

**Proof of**

**Lemma 5.2.**

**y**, which implies the following chain of identities:

$\mathbf{D}_2$ is invertible and therefore $\mathbf{X}^t$ must vanish (a trivial and uninteresting case). Second, if the variance of the noise component is zero, (1) becomes a pure linear inverse problem (i.e., we solve **y** = **Xβ**).

#### 5.1.2. The Moments’ Case

In the comparisons above, the GE estimator works in a larger space (**A**) than the other LS or GLS estimators. In other words, the constraints in the GE estimator are the data points rather than the moments. The comparison is easier if one performs it in similar spaces, namely using the sample’s moments. This can easily be done if $\mathbf{X}^t\mathbf{X}$ is invertible, and where we re-specify **A** to be the generic matrix $\mathbf{A} = [\mathbf{X}^t\mathbf{X}\ \ \mathbf{X}^t]$, rather than **A** = [**X I**]. Now, let $\mathbf{y}' \equiv \mathbf{X}^t\mathbf{y}$, $\mathbf{X}' \equiv \mathbf{X}^t\mathbf{X}$, and $\boldsymbol{\varepsilon}' \equiv \mathbf{X}^t\boldsymbol{\varepsilon}$; then the problem is represented as $\mathbf{y}' = \mathbf{X}'\boldsymbol{\beta} + \boldsymbol{\varepsilon}'$. In that case the condition for the GE and LS estimators to coincide is the trivial condition $\mathbf{X}^t\mathbf{D}_2\mathbf{X} = \mathbf{0}$.

When $\mathbf{X}^t\mathbf{X}$ is invertible, it is easy to verify that the solutions to variational problems of the type $\mathbf{y}' \equiv \mathbf{X}^t\mathbf{y} = \mathbf{X}^t\mathbf{X}\boldsymbol{\beta}$ are of the form $(\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{y}$. In one case, the problem is to find:

#### 5.2. The Basic Bayesian Method

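Assume that $C_n$ and $C_s$ are closed convex subsets of $\mathbb{R}^N$ and $\mathbb{R}^K$ respectively and that $dQ_s(\mathbf{z}) = g_s(\mathbf{z})\,d\mathbf{z}$ and $dQ_n(\mathbf{v}) = g_n(\mathbf{v})\,d\mathbf{v}$. For the rest of this section, the priors $g_s(\mathbf{z})$, $g_n(\mathbf{v})$ will have their usual Bayesian interpretation. For a given **z**, we think of **y** = **Xz** + **v** as a realization of the random variable **Y** = **Xz** + **V**. Then, the conditional density of **Y** given **z** is $g_n(\mathbf{y} - \mathbf{X}\mathbf{z})$. The joint density $g_{y,z}(\mathbf{y}, \mathbf{z})$ of **Y** and **Z**, where **Z** is distributed according to the prior $Q_s$, is:

$$g_{y,z}(\mathbf{y}, \mathbf{z}) = g_n(\mathbf{y} - \mathbf{X}\mathbf{z})\,g_s(\mathbf{z})$$

The marginal density of **y** is $g_y(\mathbf{y}) = \int_{C_s} g_n(\mathbf{y} - \mathbf{X}\mathbf{z})\,g_s(\mathbf{z})\,d\mathbf{z}$, and therefore by Bayes’ Theorem the posterior (post-data) conditional is $g(\mathbf{z}|\mathbf{y}) = g_{y,z}(\mathbf{y}, \mathbf{z})/g_y(\mathbf{y})$, from which:

$$E[\mathbf{Z}|\mathbf{y}] = \int_{C_s} \mathbf{z}\,g(\mathbf{z}|\mathbf{y})\,d\mathbf{z}$$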

_{y,z}(

**y,z**). The conditional covariance matrix:

**Z**|

**y**) is the total variance of the K random variates

**z**in Z. Finally, it is important to emphasize here that the Bayesian approach provides us with a whole range of tools for inference, forecasting, model averaging, posterior intervals, etc. In this paper, however, the focus is on estimation and on the basic comparison of our GE method with other methods under the notations and formulations developed here. Extensions to testing and inference are left for future work.

#### 5.2.1. A Standard Example: Normal Priors

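Consider **β** and **ε** as realizations of random variables **Z** and **V** having the informative normal “a priori” (priors for signal and noise) distributions:

$$\mathbf{Z} \sim N(\mathbf{0}, \mathbf{D}_1), \qquad \mathbf{V} \sim N(\mathbf{0}, \mathbf{D}_2)$$

Here **Z** and **V** are centered on zero and independent, and both covariance matrices $\mathbf{D}_1$ and $\mathbf{D}_2$ are strictly positive definite. For comparison purposes, we are using the same notation as in Section 3. The randomness is propagated to the data **Y** such that the conditional density (or the conditional priors on **y**) of **Y** is:

$$\mathbf{Y}\,|\,\mathbf{Z} = \mathbf{z} \;\sim\; N(\mathbf{X}\mathbf{z}, \mathbf{D}_2)$$

The conditional distribution of **Z** given **Y** is easy to obtain under the normal setup. Thus, the post-data distribution of the signal, **β**, given the data **y** is:

$$\mathbf{Z}\,|\,\mathbf{Y} = \mathbf{y} \;\sim\; N\left(E[\mathbf{Z}|\mathbf{y}],\; \left(\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{X} + \mathbf{D}_1^{-1}\right)^{-1}\right)$$

which shows how the distribution of **Z** has changed (relative to the prior) by the data. Finally, the post-data expected value of **Z** is given by:

$$E[\mathbf{Z}|\mathbf{y}] = \left(\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{X} + \mathbf{D}_1^{-1}\right)^{-1}\mathbf{X}^t\mathbf{D}_2^{-1}\mathbf{y} = \mathbf{D}_1\mathbf{X}^t\left(\mathbf{X}\mathbf{D}_1\mathbf{X}^t + \mathbf{D}_2\right)^{-1}\mathbf{y} \qquad (28)$$

This is the solution obtained for $\mathbf{z}^0 = \mathbf{0}$, which is the Generalized Entropy method with normal priors and center of supports equal to zero. In addition, it is easy to see that the Bayesian solution (28) coincides with the penalized GLS (model (24)) for α = 1.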

**D**

_{2}. In the Bayesian result is marginalized, so it is not conditional on that parameter. Therefore, with a known value of , both estimators are the same.

**L** is known as the Wiener filter (see for example Bertero and Boccacci [37]). In that sense, the Bayesian technique and the GE technique have some procedural ingredients in common, but the distinguishing factor is the way the posterior (post-data) is obtained. (Note that “posterior” for the entropy method means the “post-data” distribution, which is based on both the priors and the data and is obtained via the optimization process.) In one case it is obtained by maximizing the entropy functional, while in the Bayesian approach it is obtained by a direct application of Bayes’ theorem. For more background and related derivation of the ME and Bayes rule see Zellner [38,39].

#### 5.3. Comparison with the Bayesian Method of Moments (BMOM)

If $(\mathbf{X}^t\mathbf{X})^{-1}$ exists, then the LS solution to (1) is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^t\mathbf{X})^{-1}\mathbf{X}^t\mathbf{y}$, which is assumed to be the post-data mean with respect to a (yet) unknown distribution (likelihood). This is equivalent to assuming $\mathbf{X}^t E[\mathbf{V}|\text{Data}] = \mathbf{0}$ (the columns of **X** are orthogonal to the N × 1 vector E[**V**|Data]). To find g(**z**|Data), or in Zellner’s notation g(**β**|Data), one applies the classical ME with the following constraints (information):

^{2}is a positive parameter. Then, the maximum entropy density satisfying these two constraints (and the requirement that it is a proper density) is:

## 6. More Closed Form Examples

_{s}(or

**z**), to C

_{n}(or

**v**) or to both.

#### 6.1. The Basic Formulation

**A**) and for just the signal or noise parts separately. We only provide here the solution for the signal part.

_{j}is the variance of each component. The Laplace transform of dQ is:

**τ**=

**X**

^{t}

**λ**. (Note that under the generic formulation, instead of

**X**

^{t}, we can work with

**X**

^{*}which stands for either

**X**

^{t},

**I**or

**A**

^{t}) We compute Ω(

**λ**) via the Laplace transformation (8). It then follows that where ω(

**t**) is always finite and positive. For this relationship to be satisfied, for all j = 1, 2, …, d. Finally, replacing

**τ**by

**X**

^{t}

**λ**yields D(Ω).

**λ**to

**0,**we obtain that at the minimum:

**λ**

^{*}that minimizes ∑(

**λ**), and such that the previous identity holds, we can rewrite our model as:

**λ**

^{*}, that minimizes ∑(

**λ**) and satisfies (30).

**β**’s are all bounded below by theory. Then, we can specify a random vector

**Z**with values in the positive orthant translated to the lower bound K-dimensional vector

**l**, so , where [l

_{j},∞) = [

**Z**

_{j},∞). Like related methods, we assume that each component

**Z**

_{j}of

**Z**is distributed in [l

_{j},∞) according to a translated . With this in mind, a direct calculation yields:

_{j}= 0 corresponds to the standard exponential distribution defined on [

**l**

_{j},∞). Finally, when

**τ**is replaced by

**X**

^{t}

**λ**, we get:

Consider the case where each component of **Z** and **V** takes values in some bounded interval $[a_j, b_j]$. A common choice for the bounds of the error supports in that case is the three-sigma rule (Pukelsheim [44]), where “sigma” is the empirical standard deviation of the sample analyzed (see for example Golan, Judge and Miller [1] for a detailed discussion). In this situation we provide two simple (and extreme) choices for the reference measure. The first is a uniform measure on $[a_j, b_j]$, and the second is a Bernoulli distribution supported on $a_j$ and $b_j$.

#### 6.1.2. Uniform Reference Measure

ω(**τ**) is finite for every vector **τ**.
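For the uniform reference measure on $[a_j, b_j]$, and under the assumed convention $\omega(\boldsymbol{\tau}) = E_Q[e^{-\langle\boldsymbol{\tau}, \mathbf{z}\rangle}]$, each factor of the Laplace transform is $(e^{-a_j\tau_j} - e^{-b_j\tau_j})/((b_j - a_j)\tau_j)$, and the post-data mean of $z_j$ is $-\partial \ln\omega/\partial\tau_j$. The sketch below implements that mean for a single component and checks it against a direct numerical average. The function name and the numerical values are illustrative assumptions.

```python
import numpy as np

def uniform_tilted_mean(tau, a, b):
    """Mean of the density proportional to exp(-tau * z) on [a, b], for tau != 0."""
    ea, eb = np.exp(-a * tau), np.exp(-b * tau)
    return (a * ea - b * eb) / (ea - eb) + 1.0 / tau

# Check against a midpoint-rule approximation of the same expectation.
a, b, tau = -1.0, 2.0, 0.7                      # illustrative values
n = 200_000
z = a + (np.arange(n) + 0.5) * (b - a) / n      # midpoints of n equal cells
w = np.exp(-tau * z)                            # unnormalized tilted density
numeric = (z * w).sum() / w.sum()

closed_form = uniform_tilted_mean(tau, a, b)
print(abs(closed_form - numeric) < 1e-8)
```

As τ moves away from zero the mean shifts from the midpoint toward one of the bounds but never reaches it, which is how the uniform prior keeps the estimate inside $[a_j, b_j]$.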

#### 6.1.3. Bernoulli Reference Measure

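In this case the reference measure is $dQ(\mathbf{z}) = \prod_{j}\left[p_j\,\delta_{a_j}(dz_j) + q_j\,\delta_{b_j}(dz_j)\right]$, where $\delta_c(d\mathbf{z})$ denotes the (Dirac) unit point mass at some point c, and where $p_j$ and $q_j$ do not have to sum up to one, yet they determine the weight within the bounded interval $[a_j, b_j]$. The Laplace transform of dQ is:

$$\omega(\boldsymbol{\tau}) = \prod_{j}\left(p_j e^{-a_j\tau_j} + q_j e^{-b_j\tau_j}\right)$$

and ω(**τ**) is finite for all **τ**.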

_{j}= −c

_{j}and b

_{j}= c

_{j}for positive c

_{j}’s. The corresponding versions of (35) and (36), the uniform and Bernoulli, are respectively:

#### 6.2 The Full Model

#### 6.2.1. Bounded Parameters and Normally Distributed Errors

**ε**. From Section 2, with , and . The signal component is formulated earlier, while . Using

**A**=[

**X I**] we have for the N-dimensional vector

**λ**, and therefore, . The maximal entropy probability measures (post-data) are:

where $\boldsymbol{\lambda}^*$ is found by minimizing the concentrated entropy function:

Once $\boldsymbol{\lambda}^*$ is found, we get:

## 7. A Comment on Model Comparison

**β**’s. Assumption 2.1 means that these constraints are properly specified, namely there is no arbitrariness in the choice of $C_s$. The choice of a specific model for the noise involves two assumptions. The first one is about the support that reflects the actual range of the errors. The second is the choice of a prior describing the distribution of the noise within that support. To contrast two possible priors, we want to compare the reconstructions provided by the different models for the signal and noise variables. Within the information theoretic approach taken here, comparing the post-data entropies seems a reasonable choice.

Here $\boldsymbol{\lambda}^*$ has to be computed by minimizing the concentrated entropy function Σ(**λ**), and it is clear that the total entropy difference between the post-data and the priors is just the entropy difference for the signal plus the entropy difference for the noise. Note that where d is the dimension of **λ**. This is the entropy ratio statistic, which is similar in nature to the empirical likelihood ratio statistic (e.g., Golan [27]). Rather than discussing this statistic here, we provide in Appendix 3 analytic formulations of Equation (39) for a large number of prior distributions. These formulations are based on the examples of earlier sections. Last, we note that in some cases, where the competing models are of different dimensions, a normalization of both statistics is necessary.

## 8. Conclusions

## Acknowledgments

## References

1. Golan, A.; Judge, G.G.; Miller, D. Maximum Entropy Econometrics: Robust Estimation with Limited Data; John Wiley & Sons: New York, NY, USA, 1996.
2. Gzyl, H.; Velásquez, Y. Linear Inverse Problems: The Maximum Entropy Connection; World Scientific Publishers: Singapore, 2011.
3. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. **1957**, 106, 620–630.
4. Jaynes, E.T. Information theory and statistical mechanics II. Phys. Rev. **1957**, 108, 171–190.
5. Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423, 623–656.
6. Owen, A. Empirical likelihood for linear models. Ann. Stat. **1991**, 19, 1725–1747.
7. Owen, A. Empirical Likelihood; Chapman & Hall/CRC: Boca Raton, FL, USA, 2001.
8. Qin, J.; Lawless, J. Empirical likelihood and general estimating equations. Ann. Stat. **1994**, 22, 300–325.
9. Smith, R.J. Alternative semi parametric likelihood approaches to GMM estimations. Econ. J. **1997**, 107, 503–510.
10. Newey, W.K.; Smith, R.J. Higher order properties of GMM and Generalized empirical likelihood estimators. Department of Economics, MIT: Boston, MA, USA, Unpublished work. 2002.
11. Kitamura, Y.; Stutzer, M. An information-theoretic alternative to generalized method of moment estimation. Econometrica **1997**, 66, 861–874.
12. Imbens, G.W.; Johnson, P.; Spady, R.H. Information-theoretic approaches to inference in moment condition models. Econometrica **1998**, 66, 333–357.
13. Zellner, A. Bayesian Method of Moments/Instrumental Variables (BMOM/IV) analysis of mean and regression models. In Prediction and Modeling Honoring Seymour Geisser; Lee, J.C., Zellner, A., Johnson, W.O., Eds.; Springer Verlag: New York, NY, USA, 1996.
14. Zellner, A. The Bayesian Method of Moments (BMOM): Theory and applications. In Advances in Econometrics; Fomby, T., Hill, R., Eds.; JAI Press: Greenwich, CT, USA, 1997; Volume 12, pp. 85–105.
15. Zellner, A.; Tobias, J. Further results on the Bayesian method of moments analysis of multiple regression model. Int. Econ. Rev. **2001**, 107, 1–15.
16. Gamboa, F.; Gassiat, E. Bayesian methods and maximum entropy for ill-posed inverse problems. Ann. Stat. **1997**, 25, 328–350.
17. Gzyl, H. Maxentropic reconstruction in the presence of noise. In Maximum Entropy and Bayesian Studies; Erickson, G., Ryckert, J., Eds.; Kluwer: Dordrecht, The Netherlands, 1998.
18. Golan, A.; Gzyl, H. A generalized maxentropic inversion procedure for noisy data. Appl. Math. Comput. **2002**, 127, 249–260.
19. Hoerl, A.E.; Kennard, R.W. Ridge regression: Biased estimation for non-orthogonal problems. Technometrics **1970**, 1, 55–67.
20. O’Sullivan, F. A statistical perspective on ill-posed inverse problems. Stat. Sci. **1986**, 1, 502–527.
21. Breiman, L. Better subset regression using the nonnegative garrote. Technometrics **1995**, 37, 373–384.
22. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B **1996**, 58, 267–288.
23. Titterington, D.M. Common structures of smoothing techniques in statistics. Int. Stat. Rev. **1985**, 53, 141–170.
24. Donoho, D.L.; Johnstone, I.M.; Hoch, J.C.; Stern, A.S. Maximum entropy and the nearly black object. J. R. Stat. Soc. Ser. B **1992**, 54, 41–81.
25. Besnerais, G.L.; Bercher, J.F.; Demoment, G. A new look at entropy for solving linear inverse problems. IEEE Trans. Inf. Theory **1999**, 45, 1565–1578.
26. Bickel, P.; Li, B. Regularization methods in statistics. Test **2006**, 15, 271–344.
27. Golan, A. Information and entropy econometrics—A review and synthesis. Found. Trends Econometrics **2008**, 2, 1–145.
28. Fomby, T.B.; Hill, R.C. Advances in Econometrics; JAI Press: Greenwich, CT, USA, 1997.
29. Golan, A. (Ed.) Special Issue on Information and Entropy Econometrics (Journal of Econometrics); Elsevier: Amsterdam, The Netherlands, 2002; Volume 107, Issues 1–2, pp. 1–376.
30. Golan, A.; Kitamura, Y. (Eds.) Special Issue on Information and Entropy Econometrics: A Volume in Honor of Arnold Zellner (Journal of Econometrics); Elsevier: Amsterdam, The Netherlands, 2007; Volume 138, Issue 2, pp. 379–586.
31. Mynbayev, K.T. Short-Memory Linear Processes and Econometric Applications; John Wiley & Sons: Hoboken, NJ, USA, 2011.
32. Asher, R.C.; Borchers, B.; Thurber, C.A. Parameter Estimation and Inverse Problems; Elsevier: Amsterdam, Holland, 2003.
33. Golan, A. Information and entropy econometrics—Editor’s view. J. Econom. **2002**, 107, 1–15.
34. Kullback, S. Information Theory and Statistics; John Wiley & Sons: New York, NY, USA, 1959.
35. Durbin, J. Estimation of parameters in time-series regression models. J. R. Stat. Soc. Ser. B **1960**, 22, 139–153.
36. Mittelhammer, R.; Judge, G.; Miller, D. Econometric Foundations; Cambridge Univ. Press: Cambridge, UK, 2000.
37. Bertero, M.; Boccacci, P. Introduction to Inverse Problems in Imaging; CRC Press: Boca Raton, FL, USA, 1998.
38. Zellner, A. Optimal information processing and Bayes theorem. Am. Stat. **1988**, 42, 278–284.
39. Zellner, A. Information processing and Bayesian analysis. J. Econom. **2002**, 107, 41–50.
40. Zellner, A. Bayesian Method of Moments (BMOM) Analysis of Mean and Regression Models. In Modeling and Prediction; Lee, J.C., Johnson, W.D., Zellner, A., Eds.; Springer: New York, NY, USA, 1994; pp. 17–31.
41. Zellner, A. Models, prior information, and Bayesian analysis. J. Econom. **1996**, 75, 51–68.
42. Zellner, A. Bayesian Analysis in Econometrics and Statistics: The Zellner View and Papers; Edward Elgar Publishing Ltd.: Cheltenham Glos, UK, 1997; pp. 291–304, 308–318.
43. Kotz, S.; Kozubowski, T.; Podgórski, K. The Laplace Distribution and Generalizations; Birkhauser: Boston, MA, USA, 2001.
44. Pukelsheim, F. The three sigma rule. Am. Stat. **1994**, 48, 88–91.

## Appendix 1: Proofs

**Proof of**

**Proposition 4.1.**

**ξ**is , then -covariance of

**ξ**is an (N + K) × (N + K)-matrix given by . Here is the -covariance of the signal component of

**ξ**. Again, from Assumptions 4.1–4.3 it follows that which is the covariance of the signal component of

**ξ**with respect to the limit probability Therefore, φ

_{∞}is also invertible and To verify the uniform convergence of ψ

_{N}(

**y**) towards ψ

_{∞}(

**y**) note that:

**Proof of Lemma**

**4.5.**

**Proof of**

**Lemma 4.6.**

**Proof of**

**Proposition 4.2.**

_{n}for any , where Since the components of

**ε**are i.i.d. random variables, the standard approximations yield: , where , and therefore the law of concentrates at

**0**asymptotically. This completes Part (a).

## Appendix 2: Normal Priors — Derivation of the Basic Linear Model

Consider the simple linear model $y_i = a + b x_i + \varepsilon_i$, where **X** = (**1**, **x**) and **β** = $(a, b)^t$. We assume that (i) both $Q_s$ and $Q_n$ are normal and that (ii) $Q_s$ and $Q_n$ are independent, meaning the Laplace transform is just (10). Recall that **t** = $\mathbf{A}^t\boldsymbol{\lambda}$, and that in the generic model **A** = [**X I**] = [**1 x I**], where **A** is an N × (N + 2) matrix (or N × (N + K) for the general model with K > 2). The log of the normalization factor of the post-data, Ω(**λ**), is:

**λ**

^{*}, , yields:

**M**is:

**11**

^{t}=

**1**

_{N}.

**β**and

**ε**. Recalling the optimal solution is and , then following the derivations of Section 3 we get:

^{*}yields:

**B**=

**y − Ac**

_{0.}Within the basic model, it is clear that , or . Under the natural case where the errors’ priors are centered on zero (

**v**

^{0}=

**0**), and . If in addition

**z**

^{0}=

**0**, then .

## Appendix 3: Model Comparisons — Analytic Examples

$\mathbf{z}_0$ is the center of the normal priors and $\mathbf{D}_1$ is the covariance matrix of the state space variables.

© 2012 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Golan, A.; Gzyl, H. An Entropic Estimator for Linear Inverse Problems. *Entropy* **2012**, *14*, 892-923.
https://doi.org/10.3390/e14050892
