# Entropy, Information Theory, Information Geometry and Bayesian Inference in Data, Signal and Image Processing and Inverse Problems †


## Abstract


## 1. Introduction

$\{P(X=x_n)=p_n;\ n=1,\cdots,N\}$ with mutually exclusive values $x_n$ is its probability distribution. When $X$ is a continuous valued variable, $p(x)$ is its probability density function from which we can compute $P(a\le X<b)=\int_a^b p(x)\,\mathrm{d}x$ or any other probabilistic quantity, such as its mode, mean, median, region of high probabilities, etc.

$I_n$ associated with one of the possible values $x_n$ of $X$ with probabilities $P(X=x_n)=p_n$ to be $I_n=\ln\frac{1}{p_n}=-\ln p_n$ and the entropy $H$ as the expected value

$\{x_1,\cdots,x_N\}$ and when we do not know anything else about it, Laplace proposed the “Principe d’indifférence”, where $P(X=x_n)=p_n=\frac{1}{N},\ \forall n=1,\cdots,N$, a uniform distribution. However, what if we know more, but not enough to be able to assign the probability law $\{p_1,\cdots,p_N\}$ completely?

$\{p_1,\cdots,p_N\}$. If we have a sufficient number of constraints (at least $N$), then we may obtain a unique solution. However, very often, this is not the case. The question now is how to assign a probability distribution $\{p_1,\cdots,p_N\}$ that satisfies the available constraints. This is an ill-posed problem in the sense of Hadamard [5], because the solution is not unique: many probability distributions satisfy the constraints imposed by the problem. To answer this question, Jaynes [6–8] introduced the maximum entropy principle (MEP) as a tool for assigning a probability law to a quantity on which we have some incomplete or macroscopic (expected values) information. Some more details about this MEP, the mathematical optimization problem, the expression of the solution and the algorithm to compute it will be given in Sections 3 and 4.

$p_1(x)=p(x|\theta_1)$ and $p_2(x)=p(x|\theta_2)$ in the same exponential family, the Kullback–Leibler divergence $\mathrm{KL}[p_1:p_2]$ induces a Bregman divergence $B[\theta_1:\theta_2]$ between the two parameters [13,14]. More details will be given in Section 8.

## 2. Bayes Rule

**X**, as is the case in many signal and image processing applications, this computation becomes very costly [17]. We may then want to summarize p(x|y) by a few interesting or significant point estimates. For example, compute the maximum a posteriori (MAP) solution:

**Remarks** on the notation used for the expected value in this paper: For a variable X with the probability density function (pdf) p(x) and any regular function h(X), we use interchangeably:

## 3. Quantity of Information and Entropy

#### 3.1. Shannon Entropy

$\{x_1,\cdots,x_N\}$ with probabilities $\{p_1,\cdots,p_N\}$ and defined the quantities of information associated with each of them as $I_n=\ln\frac{1}{p_n}=-\ln p_n$ and its expected value as the entropy:
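The two quantities named above can be illustrated numerically. The following is a minimal sketch, assuming the usual discrete form $H=-\sum_n p_n\ln p_n$ (in nats) with the convention $0\ln 0=0$; the function names and the two example distributions are hypothetical and chosen only for illustration.

```python
import math

def information(p):
    """Quantity of information I = ln(1/p) = -ln p (in nats) of a value with probability p."""
    return -math.log(p)

def shannon_entropy(probs):
    """Entropy as the expected value of I_n: H = -sum_n p_n ln p_n (terms with p_n = 0 skipped)."""
    return sum(p * information(p) for p in probs if p > 0.0)

uniform = [0.25] * 4                 # hypothetical example: four equiprobable values
skewed = [0.7, 0.1, 0.1, 0.1]        # hypothetical example: one dominant value
h_uniform = shannon_entropy(uniform)  # equals ln 4
h_skewed = shannon_entropy(skewed)    # smaller than ln 4
```

Among laws on the same finite set, the uniform one gives the largest value, which is consistent with the role the uniform law plays when nothing else is known.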

#### 3.2. Thermodynamical Entropy

#### 3.3. Statistical Mechanics Entropy

#### 3.4. Boltzmann Entropy

#### 3.5. Gibbs Entropy

If $E_i$ is the energy of microstate $i$ and $p_i$ is the probability that it occurs during the system’s fluctuations, then the entropy of the system is:

## 4. Relative Entropy or Kullback–Leibler Divergence

$p_1$ and $p_2$ on the same variable $X$. Two related notions have been defined:

- Relative entropy of $p_1$ with respect to $p_2$:$$D[p_1:p_2]=-\int p_1(x)\ln\frac{p_1(x)}{p_2(x)}\,\mathrm{d}x$$
- Kullback–Leibler divergence of $p_1$ with respect to $p_2$:$$\mathrm{KL}[p_1:p_2]=-D[p_1:p_2]=\int p_1(x)\ln\frac{p_1(x)}{p_2(x)}\,\mathrm{d}x$$

- $\mathrm{KL}[q:p]\ge 0$,
- $\mathrm{KL}[q:p]=0$, if $q=p$ and
- $\mathrm{KL}[q:p_0]\ge\mathrm{KL}[q:p_1]+\mathrm{KL}[p_1:p_0]$.
- $\mathrm{KL}[q:p]$ is invariant with respect to a scale change, but is not symmetric.
- A symmetric quantity can be defined as:$$\mathrm{J}[q,p]=\frac{1}{2}(\mathrm{KL}[q:p]+\mathrm{KL}[p:q]).$$
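These properties are easy to check on discrete laws. The sketch below assumes the discrete analogue of the integral, $\mathrm{KL}[q:p]=\sum_n q_n\ln(q_n/p_n)$; the function names and the two example laws are hypothetical.

```python
import math

def kl(q, p):
    """KL[q : p] = sum_n q_n ln(q_n / p_n) for two discrete laws on the same set of values."""
    total = 0.0
    for qn, pn in zip(q, p):
        if qn > 0.0:
            if pn == 0.0:
                return math.inf      # q puts mass where p has none
            total += qn * math.log(qn / pn)
    return total

def j_divergence(q, p):
    """Symmetrized quantity J[q, p] = (KL[q : p] + KL[p : q]) / 2."""
    return 0.5 * (kl(q, p) + kl(p, q))

q = [0.5, 0.3, 0.2]   # hypothetical example laws
p = [0.2, 0.5, 0.3]
kl_qp = kl(q, p)
kl_pq = kl(p, q)      # differs from kl_qp: KL is not symmetric
j_qp = j_divergence(q, p)
```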

## 5. Mutual Information

- I[Y, X] is a concave function of p(y) when p(x|y) is fixed and a convex function of p(x|y) when p(y) is fixed.
- I[Y, X] ≥ 0 with equality only if X and Y are independent.
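The second property can be checked numerically. The sketch below assumes the standard discrete definition $I[Y,X]=\sum_{x,y}p(x,y)\ln\frac{p(x,y)}{p(x)\,p(y)}$, i.e., the KL divergence between the joint law and the product of its marginals; the two joint tables are hypothetical examples.

```python
import math

def mutual_information(joint):
    """I[Y, X] for a joint probability table joint[x][y] (assumed standard discrete definition)."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:
                total += pxy * math.log(pxy / (px[i] * py[j]))
    return total

# Product of the marginals (0.4, 0.6) and (0.3, 0.7): X and Y independent
independent = [[0.12, 0.28], [0.18, 0.42]]
# Mass concentrated on the diagonal: X and Y dependent
dependent = [[0.4, 0.1], [0.1, 0.4]]

i_independent = mutual_information(independent)
i_dependent = mutual_information(dependent)
```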

## 6. Maximum Entropy Principle

$\varphi_k$ are any known functions. First, we assume that such probability laws exist by defining:

$\varphi_0=1$ and $d_0=1$ for the normalization purpose. Then, the MEP is written as an optimization problem:

$Z(\boldsymbol{\lambda})$, called the partition function, is given by: $Z(\boldsymbol{\lambda})=\int\exp\left[-\sum_{k=1}^{K}\lambda_k\varphi_k(x)\right]\mathrm{d}x$ and $\boldsymbol{\lambda}=[\lambda_1,\dots,\lambda_K]'$ have to satisfy:

$-\nabla_{\boldsymbol{\lambda}}\ln Z(\boldsymbol{\lambda})=\mathbf{d}$. Different algorithms have been proposed to compute numerically the ME distributions. See, for example, [31–37].
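As an illustration of how such a computation can look, the following is a minimal sketch for a discrete variable with a single constraint $\mathrm{E}\{X\}=d$, so that $\varphi_1(x)=x$ and the integral in $Z(\lambda)$ becomes a sum. It is not one of the algorithms of [31–37]: the bisection search, the function name `maxent_mean` and the dice example are assumptions made for this sketch. It relies on the fact that $-\partial\ln Z/\partial\lambda$, the mean of the ME law, is a decreasing function of $\lambda$.

```python
import math

def maxent_mean(values, d, lo=-50.0, hi=50.0, tol=1e-12):
    """ME law p_n = exp(-lam * x_n) / Z(lam) on a finite set of values with E{X} = d.

    lam is found by bisection on the constraint -d ln Z / d lam = d.
    d must lie strictly between min(values) and max(values).
    """
    def law(lam):
        expo = [-lam * x for x in values]
        m = max(expo)                           # shift exponents for numerical stability
        w = [math.exp(e - m) for e in expo]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(p * x for p, x in zip(law(lam), values))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) > d:
            lo = mid                            # mean too large: lam must increase
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, law(lam)

faces = [1, 2, 3, 4, 5, 6]                       # hypothetical example: a six-sided die
lam_fair, p_fair = maxent_mean(faces, 3.5)       # no bias: uniform law, lam close to 0
lam_biased, p_biased = maxent_mean(faces, 4.5)   # mean above 3.5: lam negative
```

With $d=3.5$ the constraint carries no information beyond normalization and the uniform law of the indifference principle is recovered; with $d=4.5$ the probabilities increase geometrically with the face value.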

## 7. Link between Entropy and Likelihood

$\boldsymbol{\theta}$ of a probability law $p(x|\boldsymbol{\theta})$ from an n-element sample of data $\mathbf{x}=\{x_1,\cdots,x_n\}$.

$\boldsymbol{\theta}$ is defined as:

Maximizing $L(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ gives what is called the maximum likelihood (ML) estimate of $\boldsymbol{\theta}$.

As $L(\boldsymbol{\theta})$ depends on $n$, we may consider $\frac{1}{n}L(\boldsymbol{\theta})$ and define:

$\boldsymbol{\theta}^*$ is the right answer and $p(x|\boldsymbol{\theta}^*)$ its corresponding probability law. We may then remark that:

## 8. Fisher Information, Bregman and Other Divergences

$D[p(x|\boldsymbol{\theta}^*):p(x|\boldsymbol{\theta}^*+\Delta\boldsymbol{\theta})]$ and assume that $\ln p(x|\boldsymbol{\theta})$ can be developed in a Taylor series. Then, keeping the terms up to the second order, we obtain:

$\mathbf{F}$ is the Fisher information:

$p_1(x)=p(x|\theta_1)$ and $p_2(x)=p(x|\theta_2)$ in the same exponential family, the Kullback–Leibler divergence $\mathrm{KL}[p_1:p_2]$ induces a Bregman divergence $B[\theta_1:\theta_2]$ between the two parameters [14,45–48].

- f-divergences: The f-divergences are a general class of divergences, indexed by convex functions $f$, that include the KL divergence as a special case. Let $f:(0,\infty)\mapsto\mathbf{R}$ be a convex function for which $f(1)=0$. The f-divergence between two probability measures $P$ and $Q$ is defined by:$$D_f[P:Q]=\int q\,f\left(\frac{p}{q}\right)$$Every f-divergence can be viewed as a measure of distance between probability measures with different properties. Some important special cases are:
  - $f(x)=x\ln x$ gives KL divergence: $\mathrm{KL}[P:Q]=\int p\ln\left(\frac{p}{q}\right)$.
  - $f(x)=|x-1|/2$ gives total variation distance: $\mathrm{TV}[P,Q]=\int|p-q|/2$.
  - $f(x)=\left(\sqrt{x}-1\right)^2$ gives the square of the Hellinger distance: $H^2[P,Q]=\int\left(\sqrt{p}-\sqrt{q}\right)^2$.
  - $f(x)=(x-1)^2$ gives the chi-squared divergence: $\chi^2[P:Q]=\int\frac{(p-q)^2}{q}$.

- Rényi divergences: These are another generalization of the KL divergence. The Rényi divergence between two probability distributions $P$ and $Q$ is:$$D_\alpha[P:Q]=\frac{1}{\alpha-1}\ln\int p^\alpha q^{1-\alpha}.$$As $\alpha\to 1$, $D_\alpha[P:Q]$ converges to $\mathrm{KL}[P:Q]$. $D_{1/2}[P:Q]=-2\ln\int\sqrt{pq}$ is called Bhattacharyya divergence (closely related to Hellinger distance). Interestingly, this quantity is never larger than KL:$$D_{1/2}[P:Q]\le\mathrm{KL}[P:Q].$$As a result, it is sometimes easier to derive risk bounds with $D_{1/2}$ as the loss function as opposed to KL.
- Bregman divergences: The Bregman divergences provide another class of divergences that are indexed by convex functions and include both the Euclidean distance and the KL divergence as special cases. Let $\varphi$ be a differentiable strictly convex function. The Bregman divergence $B_\varphi$ is defined by:$$B_\varphi[\mathbf{x}:\mathbf{y}]=\varphi(\mathbf{x})-\varphi(\mathbf{y})-\langle\mathbf{x}-\mathbf{y},\nabla\varphi(\mathbf{y})\rangle$$for any $\mathbf{x}$ and $\mathbf{y}$, where the domain of $\varphi$ is a space where convexity and differentiability make sense (e.g., the whole or a subset of $\mathbf{R}^d$ or an $L_p$ space). For example, $\varphi(\mathbf{x})=\Vert\mathbf{x}\Vert^2$ on $\mathbf{R}^d$ gives the squared Euclidean distance:$$B_\varphi[\mathbf{x}:\mathbf{y}]=\Vert\mathbf{x}\Vert^2-\Vert\mathbf{y}\Vert^2-\langle\mathbf{x}-\mathbf{y},2\mathbf{y}\rangle=\Vert\mathbf{x}-\mathbf{y}\Vert^2$$and $\varphi(\mathbf{x})=\sum_j x_j\ln x_j$ on the simplex in $\mathbf{R}^d$ (so that $\sum_j x_j=\sum_j y_j$) gives the KL divergence:$$B_\varphi[\mathbf{x}:\mathbf{y}]=\sum_j x_j\ln x_j-\sum_j y_j\ln y_j-\sum_j(x_j-y_j)(1+\ln y_j)=\sum_j x_j\ln\frac{x_j}{y_j}=\mathrm{KL}[\mathbf{x}:\mathbf{y}]$$
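The three families above can be compared on a small discrete example. The following sketch replaces the integrals by sums over a finite set of values and assumes strictly positive probabilities; the function names and the two example laws are hypothetical.

```python
import math

def f_divergence(f, p, q):
    """D_f[P:Q] = sum_j q_j f(p_j / q_j), discrete form of the integral (q_j > 0 assumed)."""
    return sum(qj * f(pj / qj) for pj, qj in zip(p, q))

def renyi(alpha, p, q):
    """D_alpha[P:Q] = ln(sum_j p_j^alpha q_j^(1 - alpha)) / (alpha - 1), for alpha != 1."""
    s = sum(pj ** alpha * qj ** (1.0 - alpha) for pj, qj in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def bregman(phi, grad_phi, x, y):
    """B_phi[x:y] = phi(x) - phi(y) - <x - y, grad phi(y)>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum((xj - yj) * gj for xj, yj, gj in zip(x, y, g))

p = [0.5, 0.3, 0.2]   # hypothetical example laws on three values
q = [0.2, 0.5, 0.3]

kl_pq = f_divergence(lambda t: t * math.log(t), p, q)
tv_pq = f_divergence(lambda t: abs(t - 1.0) / 2.0, p, q)
hellinger2_pq = f_divergence(lambda t: (math.sqrt(t) - 1.0) ** 2, p, q)
chi2_pq = f_divergence(lambda t: (t - 1.0) ** 2, p, q)

d_half = renyi(0.5, p, q)        # Bhattacharyya-type divergence, bounded by KL
d_near_one = renyi(0.999, p, q)  # close to KL as alpha approaches 1

# phi(x) = ||x||^2 generates the squared Euclidean distance
sq_euclid = bregman(lambda v: sum(t * t for t in v),
                    lambda v: [2.0 * t for t in v], p, q)
# phi(x) = sum_j x_j ln x_j generates KL when x and y are on the simplex
kl_bregman = bregman(lambda v: sum(t * math.log(t) for t in v),
                     lambda v: [1.0 + math.log(t) for t in v], p, q)
```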

$\mathrm{E}_{p(x)}\{B_\varphi(X,m)\}$ is minimized over $m$ in the domain of $\varphi$ at $m=\mathrm{E}\{X\}$:

$\mathbf{y}$, a set of parameters $\mathbf{x}$, a likelihood $p(\mathbf{y}|\mathbf{x})$ and a prior $\pi(\mathbf{x})$, which gives the posterior $p(\mathbf{x}|\mathbf{y})\propto p(\mathbf{y}|\mathbf{x})\pi(\mathbf{x})$. Let us also consider a cost function $C[\mathbf{x},\tilde{\mathbf{x}}]$ in the parameter space $\mathbf{x}\in\mathbf{X}$. The classical Bayesian point estimation of $\mathbf{x}$ is expressed as the minimizer of an expected risk:

$\pi_1(\mathbf{x})$ and $\pi_2(\mathbf{x})$, which give rise to two posterior probability laws $p_1(\mathbf{x}|\mathbf{y})$ and $p_2(\mathbf{x}|\mathbf{y})$. If the prior laws and the likelihood are in the exponential families, then the posterior laws are also in the exponential family. Let us note them as $p_1(\mathbf{x}|\mathbf{y};\boldsymbol{\theta}_1)$ and $p_2(\mathbf{x}|\mathbf{y};\boldsymbol{\theta}_2)$, where $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ are the parameters of those posterior laws. We then have the following properties:

- $\mathrm{KL}[p_1:p_2]$ is expressed as a Bregman divergence $B[\boldsymbol{\theta}_1:\boldsymbol{\theta}_2]$.
- A Bregman divergence $B[\mathbf{x}_1:\mathbf{x}_2]$ is induced when $\mathrm{KL}[p_1:p_2]$ is used to compare the two posteriors.

## 9. Vectorial Variables and Time Indexed Process

$\mathrm{E}\{\mathbf{X}\}$, and the variances are replaced by a covariance matrix: $\mathbf{R}=\mathrm{E}\left\{(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})'\right\}$; and almost all of the quantities can be defined immediately. For example, for a Gaussian vector $p(\mathbf{x})=\mathcal{N}(\mathbf{x}|0,\mathbf{R})$, the entropy is given by [49]:

$\boldsymbol{\mu}=\mathrm{E}\{\mathbf{X}\}$ and the covariance matrix $\boldsymbol{\Sigma}=\mathrm{E}\{(\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})'\}$.

$S_1(\omega)$ and $S_2(\omega)$, we have:

## 10. Entropy in Independent Component Analysis and Source Separation

$\mathbf{x}(t)$, the independent component analysis (ICA) consists of finding a separating matrix $\mathbf{B}$, such that the components $\mathbf{y}(t)=\mathbf{B}\mathbf{x}(t)$ are as independent as possible. The notion of entropy is used here as a measure of independence. For example, to find $\mathbf{B}$, we may choose $D\left[p(\mathbf{y}):\prod_j p_j(y_j)\right]$ as a criterion of independence of the components $y_j$. The next step is to choose a probability law $p(\mathbf{x})$ from which we can find an expression for $p(\mathbf{y})$ and, from it, an expression for $D\left[p(\mathbf{y}):\prod_j p_j(y_j)\right]$ as a function of the matrix $\mathbf{B}$, which can be optimized to obtain it.

$\mathbf{x}(t)$ is a linear combination of the sources $\mathbf{s}(t)$, i.e., $\mathbf{x}(t)=\mathbf{A}\mathbf{s}(t)$, with $\mathbf{A}$ being the mixing matrix. The objective of source separation is then to find the separating matrix $\mathbf{B}=\mathbf{A}^{-1}$.

$\mathbf{y}=\mathbf{B}\mathbf{x}$. Then,

$H(\mathbf{y})$ is used as a criterion for ICA or source separation. As the objective in ICA is to obtain $\mathbf{y}$ in such a way that its components become as independent as possible, the separating matrix $\mathbf{B}$ has to maximize $H(\mathbf{y})$. Many ICA algorithms are based on this optimization [54–65].

## 11. Entropy in Parametric Modeling and Model Selection

$\boldsymbol{\theta}$ in a probabilistic model $p(\mathbf{x}|\boldsymbol{\theta})$, is an important subject in many data and signal processing problems. As an example, in autoregressive (AR) modeling:

$\boldsymbol{\theta}=\{\theta_1,\cdots,\theta_K\}$, we may want to compare two models with two different values of $K$.

$\boldsymbol{\theta}$ is a very well-known problem, and there are likelihood based [66] or Bayesian approaches for that [67]. The determination of the order is however more difficult [68]. Among the tools, we may mention here the Bayesian methods [69–74], but also the use of relative entropy $D[p(\mathbf{x}|\boldsymbol{\theta}^*):p(\mathbf{x}|\boldsymbol{\theta})]$, where $\boldsymbol{\theta}^*$ represents the vector of the parameters of dimension $K^*$ and $\boldsymbol{\theta}$ the vector of the parameters of dimension $K\le K^*$. In such cases, even if the two probability laws to be compared have parameters with different dimensions, we can always use $\mathrm{KL}[p(\mathbf{x}|\boldsymbol{\theta}^*):p(\mathbf{x}|\boldsymbol{\theta})]$ to compare them. The famous criterion of Akaike [75–78] uses this quantity to determine the optimal order. For a linear parameter model with Gaussian probability laws and likelihood-based methods, there are analytic solutions for it [68].

## 12. Entropy in Spectral Analysis

#### 12.1. Burg’s Entropy-Based Method

$p(\mathbf{x})$ to the vector $\underline{X}=[X(0),\dots,X(N-1)]'$. For this, we can use the principle of maximum entropy (PME) with the data as constraints (54). As these constraints are the second order moments, the PME solution is a Gaussian probability law: $\mathcal{N}(\mathbf{x}|0,\mathbf{R})$. For a stationary Gaussian process, when the number of samples $N\to\infty$, the expression of the entropy becomes:

$\boldsymbol{\lambda}$, which provides the possibility to give an analytical expression for $S(\omega)$ as a function of the data $\{r(k),\ k=0,\cdots,K\}$:

$\boldsymbol{\Gamma}=\mathrm{Toeplitz}(r(0),\cdots,r(K))$ is the correlation matrix and $\boldsymbol{\delta}$ and $\mathbf{e}$ are two vectors defined by $\boldsymbol{\delta}=[1,0,\cdots,0]'$ and $\mathbf{e}=[1,e^{-j\omega},e^{-j2\omega},\cdots,e^{-jK\omega}]'$.
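A numerical illustration with these three ingredients is sketched below. Since the closed-form expression is not reproduced here, the sketch assumes the standard maximum entropy (autoregressive) form $S(\omega)=\dfrac{\boldsymbol{\delta}'\boldsymbol{\Gamma}^{-1}\boldsymbol{\delta}}{|\boldsymbol{\delta}'\boldsymbol{\Gamma}^{-1}\mathbf{e}|^2}$ for real-valued correlations, up to the normalization convention of the spectrum; the helper `solve`, the function name `me_spectrum` and the AR(1) test data are hypothetical.

```python
import cmath
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            k = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= k * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x

def me_spectrum(r, omega):
    """Assumed ME spectrum (delta' G^-1 delta) / |delta' G^-1 e(omega)|^2, G = Toeplitz(r(0..K))."""
    k1 = len(r)
    gamma = [[r[abs(i - j)] for j in range(k1)] for i in range(k1)]
    delta = [1.0] + [0.0] * (k1 - 1)
    a = solve(gamma, delta)                       # a = G^-1 delta (G is symmetric)
    e = [cmath.exp(-1j * k * omega) for k in range(k1)]
    num = a[0]                                    # delta' G^-1 delta
    den = abs(sum(ak * ek for ak, ek in zip(a, e))) ** 2
    return num / den

# Hypothetical data: exact correlations of an AR(1) process with unit innovation variance
rho = 0.6
r_ar1 = [rho ** k / (1.0 - rho ** 2) for k in range(3)]
omega_grid = [math.pi * i / 4 for i in range(5)]
spectrum = [me_spectrum(r_ar1, w) for w in omega_grid]
```

For these AR(1) correlations the result coincides with the known spectrum $1/|1-\rho e^{-j\omega}|^2$, which gives a simple consistency check of the sketch.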

#### 12.2. Extensions to Burg’s Method

$D[p(\mathbf{x}):p_0(\mathbf{x})]$ or minimizing $\mathrm{KL}[p(\mathbf{x}):p_0(\mathbf{x})]$ where $p_0(\mathbf{x})$ is an a priori law. The choice of the prior is important. Choosing a uniform $p_0(\mathbf{x})$, we retrieve the previous case [77].

$p_0(\mathbf{x})$, the expression to maximize becomes:

$S_0(\omega)$ corresponds to the power spectral density of the reference process $p_0(\mathbf{x})$. Now, the problem becomes: minimize $\mathrm{KL}[p(\mathbf{x}):p_0(\mathbf{x})]$ (equivalently, maximize $D[p(\mathbf{x}):p_0(\mathbf{x})]$) subject to the constraints (54).

#### 12.3. Shore and Johnson Approach

#### 12.4. ME in the Mean Approach

## 13. Entropy-Based Methods for Linear Inverse Problems

#### 13.1. Linear Inverse Problems

$f(\mathbf{r})$ through an observed signal $g(t)$, image $g(x,y)$ or any multi-variable observable function $g(\mathbf{s})$, which are related through an operator $\mathscr{H}:f\mapsto g$. This operator can be linear or nonlinear. Here, we consider only linear operators $g=Hf$:

$h(\mathbf{r},\mathbf{s})$ is the response of the measurement system. Such linear operators are very common in many applications of signal and image processing. We may mention a few examples of them:

- Convolution operations $g=h*f$ in 1D (signal):$$g(t)=\int h(t-t')f(t')\,\mathrm{d}t'$$or in 2D (image):$$g(x,y)=\iint h(x-x',y-y')f(x',y')\,\mathrm{d}x'\,\mathrm{d}y'$$
- Radon transform (RT) in computed tomography (CT) in the 2D case [86]:$$g(r,\phi)=\iint\delta(r-x\cos\phi-y\sin\phi)f(x,y)\,\mathrm{d}x\,\mathrm{d}y$$
- Fourier transform (FT) in the 2D case:$$g(u,v)=\iint\exp[-j(ux+vy)]f(x,y)\,\mathrm{d}x\,\mathrm{d}y$$
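Once discretized, each of these operators becomes a matrix acting on the vector of unknowns. The sketch below illustrates this for the 1D convolution only, assuming a finite impulse response and the "full" convolution convention; the function names, the impulse response and the signal are hypothetical.

```python
def convolution_matrix(h, n):
    """Matrix H of size (n + len(h) - 1) x n such that H f is the full convolution h * f."""
    m = n + len(h) - 1
    return [[h[i - j] if 0 <= i - j < len(h) else 0.0 for j in range(n)]
            for i in range(m)]

def matvec(mat, vec):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

def convolve(h, f):
    """Direct evaluation of g(t) = sum over t' of h(t - t') f(t')."""
    g = [0.0] * (len(h) + len(f) - 1)
    for i, hi in enumerate(h):
        for j, fj in enumerate(f):
            g[i + j] += hi * fj
    return g

h = [0.25, 0.5, 0.25]            # hypothetical smoothing impulse response
f = [0.0, 1.0, 0.0, 0.0, 2.0]    # hypothetical unknown signal
H = convolution_matrix(h, len(f))
g = matvec(H, f)                 # same values as convolve(h, f)
```

Writing the operator as a matrix is what allows the same inversion methods to be applied to convolution, Radon transform and Fourier transform problems alike.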

$\mathbf{f}=[f_1,\cdots,f_n]'$ represents the unknowns, $\mathbf{g}=[g_1,\cdots,g_m]'$ the observed data, $\boldsymbol{\epsilon}=[\epsilon_1,\cdots,\epsilon_m]'$ the errors of modeling and measurement and $\mathbf{H}$ the matrix of the system response.

#### 13.2. Entropy-Based Methods

$\mathbf{H}$ is a matrix of dimensions $(M\times N)$, which is in general singular or very ill conditioned. Even if the cases $M>N$ or $M=N$ may appear easier, they have the same difficulties as those of the underdetermined case $M<N$ that we consider here. In this case, evidently the problem has an infinite number of solutions, and we need to choose one.

$\Omega(\mathbf{f})$ and satisfy the uniqueness of the solution. For example:

$f_j>0$ and $\sum_j f_j=1$, thus considering $f_j$ as a probability distribution $f_j=P(U=u_j)$. The variable $U$ can correspond (or not) to a physical quantity. $\Omega(\mathbf{f})$ is the entropy associated with this variable.

$f_j>0$ to represent the power spectral density of a physical quantity, then the entropy becomes:

$\Omega(\mathbf{f})$ can be used. Here, we mention four of them with different interpretations.

- $L_2$ or quadratic:$$\Omega(\mathbf{f})=\sum_j f_j^2$$
- $L_\beta$:$$\Omega(\mathbf{f})=\sum_j|f_j|^\beta$$
- Shannon entropy:$$\Omega(\mathbf{f})=-\sum_j f_j\ln f_j$$for $f_j<1$,
- The Burg entropy:$$\Omega(\mathbf{f})=\sum_j\ln f_j$$for $f_j>0$.
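To see how these four criteria rank candidate solutions, consider the smallest underdetermined problem, one datum $g=f_1+f_2=1$ with two unknowns. The sketch below evaluates each $\Omega(\mathbf{f})$ on two candidates that both fit the datum exactly; the function names and the candidates are hypothetical.

```python
import math

def omega_l2(f):
    """Quadratic criterion sum_j f_j^2 (to be minimized)."""
    return sum(x * x for x in f)

def omega_lbeta(f, beta):
    """L_beta criterion sum_j |f_j|^beta (to be minimized)."""
    return sum(abs(x) ** beta for x in f)

def omega_shannon(f):
    """Shannon entropy -sum_j f_j ln f_j (to be maximized)."""
    return -sum(x * math.log(x) for x in f if x > 0.0)

def omega_burg(f):
    """Burg entropy sum_j ln f_j, defined for f_j > 0 (to be maximized)."""
    return sum(math.log(x) for x in f)

flat = [0.5, 0.5]      # both candidates satisfy f_1 + f_2 = 1
peaked = [0.9, 0.1]
```

On this toy problem the quadratic criterion is smaller, and both entropies are larger, for the flat candidate, so all of them select the same solution here; they differ on less symmetric problems.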

#### 13.3. Maximum Entropy in the Mean Approach

$f_j=\mathrm{E}\{U_j\}$ or $\mathbf{f}=\mathrm{E}\{\mathbf{U}\}$ [41,42]. Again, here, $U_j$ or $\mathbf{U}$ can, but need not, correspond to some physical quantities. In any case, we now want to assign a probability law $\widehat{p}(\mathbf{u})$ to it. Noting that the data $\mathbf{g}=\mathbf{H}\mathbf{f}=\mathbf{H}\mathrm{E}\{\mathbf{U}\}=\mathrm{E}\{\mathbf{H}\mathbf{U}\}$ can be considered as the constraints on it, we may need again a criterion to determine $\widehat{p}(\mathbf{u})$. Assuming then having some prior $\mu(\mathbf{u})$, we may maximize the relative entropy as that criterion. The mathematical problem then becomes:

$\mu(\mathbf{u})$. When $\mu(\mathbf{u})$ is separable: $\mu(\mathbf{u})=\prod_j\mu_j(u_j)$, the expression of $\widehat{p}(\mathbf{u})$ will also be separable.

**λ**) is called the dual criterion and F (

**f**) primal. However, it is not always easy to obtain an analytical expression for G(

**s**) and its gradient G′(

**s**). The functions F (

**f**) and G(

**s**) are conjugate convex.

**λ**) or G(

**s**) = ln Z or F (

**f**) are very limited. However, when there are analytical expressions for them, the computations can be done very easily. In Table 1, we summarize some of those solutions:

## 14. Bayesian Approach for Inverse Problems

#### 14.1. Simple Bayesian Approach

- Assign a prior probability law $p(\boldsymbol{\epsilon})$ to the modeling and observation errors, here $\boldsymbol{\epsilon}$. From this, find the expression of the likelihood $p(\mathbf{g}|\mathbf{f},\boldsymbol{\theta}_1)$. As an example, consider the Gaussian case:$$p(\boldsymbol{\epsilon})=\mathcal{N}(\boldsymbol{\epsilon}|0,v_\epsilon\mathbf{I})\to p(\mathbf{g}|\mathbf{f})=\mathcal{N}(\mathbf{g}|\mathbf{H}\mathbf{f},v_\epsilon\mathbf{I}).$$$\boldsymbol{\theta}_1$ in this case is the noise variance $v_\epsilon$.
- Assign a prior probability law $p(\mathbf{f}|\boldsymbol{\theta}_2)$ to the unknown $\mathbf{f}$ to translate your prior knowledge on it. Again, as an example, consider the Gaussian case:$$p(\mathbf{f})=\mathcal{N}(\mathbf{f}|0,v_f\mathbf{I})$$$\boldsymbol{\theta}_2$ in this case is the variance $v_f$.
- Apply the Bayes rule to obtain the expression of the posterior law:$$p(\mathbf{f}|\mathbf{g},\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)=\frac{p(\mathbf{g}|\mathbf{f},\boldsymbol{\theta}_1)\,p(\mathbf{f}|\boldsymbol{\theta}_2)}{p(\mathbf{g}|\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)}\propto p(\mathbf{g}|\mathbf{f},\boldsymbol{\theta}_1)\,p(\mathbf{f}|\boldsymbol{\theta}_2),$$where $p(\mathbf{g}|\mathbf{f},\boldsymbol{\theta}_1)$ is the likelihood, $p(\mathbf{f}|\boldsymbol{\theta}_2)$ the prior model, $\boldsymbol{\theta}=[\boldsymbol{\theta}_1,\boldsymbol{\theta}_2]'$ their corresponding parameters (often called the hyper-parameters of the problem) and $p(\mathbf{g}|\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)$ is called the evidence of the model.
- Use $p(\mathbf{f}|\mathbf{g},\boldsymbol{\theta}_1,\boldsymbol{\theta}_2)$ to infer any quantity dependent on $\mathbf{f}$.

$\boldsymbol{\theta}$ can be fixed a priori, the problem is easy. In practice, we may use some summaries, such as:

- MAP:$${\widehat{\mathit{f}}}_{\mathit{MAP}}=\underset{\mathit{f}}{\mathrm{arg}\mathrm{max}}\{p(\mathit{f}|\mathit{g},\mathit{\theta})\}$$
- EAP or posterior mean (PM):$${\widehat{\mathit{f}}}_{EAP}={\displaystyle \int \mathit{f}p(\mathit{f}|\mathit{g},\mathit{\theta})}\mathrm{d}\mathit{f}$$

- For MAP, we need optimization algorithms, which can handle the huge dimensional criterion $J(\mathbf{f})=-\ln p(\mathbf{f}|\mathbf{g},\boldsymbol{\theta})$. Very often, we may be limited to using gradient-based algorithms.
- For EAP, we need integration algorithms, which can handle huge dimensional integrals. The most common tool here is the MCMC methods [24]. However, for real applications, very often, the computational costs are huge. Recently, different methods, called approximate Bayesian computation (ABC) [96–100] or VBA, have been proposed [74,96,98,101–107].
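As an illustration of the MAP route, the following sketch applies plain gradient descent to the Gaussian case, where, up to an additive constant, $J(\mathbf{f})=\frac{1}{2v_\epsilon}\Vert\mathbf{g}-\mathbf{H}\mathbf{f}\Vert^2+\frac{1}{2v_f}\Vert\mathbf{f}\Vert^2$. The fixed step size, the number of iterations, the function name and the small test problem are assumptions of this sketch, not a recommendation for large problems.

```python
def map_gradient_descent(H, g, v_eps, v_f, step=0.1, n_iter=2000):
    """Minimize J(f) = ||g - H f||^2 / (2 v_eps) + ||f||^2 / (2 v_f) by gradient descent.

    The step must be smaller than 2 divided by the largest eigenvalue of
    H'H / v_eps + I / v_f for the iterations to converge.
    """
    n = len(H[0])
    f = [0.0] * n
    for _ in range(n_iter):
        # residual g - H f
        resid = [gi - sum(hij * fj for hij, fj in zip(row, f)) for row, gi in zip(H, g)]
        # gradient of J: -H'(g - H f) / v_eps + f / v_f
        grad = [-sum(H[i][j] * resid[i] for i in range(len(H))) / v_eps + f[j] / v_f
                for j in range(n)]
        f = [fj - step * gj for fj, gj in zip(f, grad)]
    return f

# Hypothetical small problem: 3 data, 2 unknowns
H = [[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]]
g = [1.0, 2.0, 3.0]
f_map = map_gradient_descent(H, g, v_eps=0.5, v_f=2.0)
```

In this Gaussian case the minimizer is also available in closed form as the solution of $(\mathbf{H}'\mathbf{H}+\lambda\mathbf{I})\mathbf{f}=\mathbf{H}'\mathbf{g}$ with $\lambda=v_\epsilon/v_f$, which the iterations reproduce without forming any matrix inverse.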

#### 14.2. Full Bayesian: Hyperparameter Estimation

$\boldsymbol{\theta}$ have also to be estimated, a prior $p(\boldsymbol{\theta})$ is assigned to them, and the expression of the joint posterior:

$p(\mathbf{f},\boldsymbol{\theta}|\mathbf{g})$ by a simpler distribution, which can be handled more easily. Two particular and extreme cases are:

- Block separable, such as $q(\mathbf{f},\boldsymbol{\theta})=q_1(\mathbf{f})\,q_2(\boldsymbol{\theta})$, or
- Completely separable, such as $q(\mathbf{f},\boldsymbol{\theta})=\prod_j q_{1j}(f_j)\prod_k q_{2k}(\theta_k)$.

## 15. Basic Algorithms of the Variational Bayesian Approximation

$\mathbf{X}$ and its probability density function $p(\mathbf{x})$, which we want to approximate by $q(\mathbf{x})=\prod_j q_j(x_j)$. Using the KL criterion:

$\langle\ln p(\mathbf{x})\rangle_q=\int q(\mathbf{x})\ln p(\mathbf{x})\,\mathrm{d}\mathbf{x}$ and $q_{-j}(\mathbf{x})=\prod_{i\ne j}q_i(x_i)$

_{i}, the basic method is an alternate optimization algorithm:

$q_j(x_j)$ depends on $q_i(x_i)$, $i\ne j$. It is not always possible to obtain analytical expressions for $q_j(x_j)$. It is however possible to show that, if $p(\mathbf{x})$ is a member of exponential families, then $q_j(x_j)$ are also members of exponential families. These iterations then become much simpler, because at each iteration, we need to update the parameters of the exponential families. To go a little more into the details, let us consider some particular simple cases.

#### 15.1. Case of Two Gaussian Variables

$\mathbf{x}=[x_1,x_2]'$, we have:

$p(x_1,x_2)$ by $q(x_1,x_2)=q_1(x_1)\,q_2(x_2)$ to be able to compute the expected values:

$p(x_1,x_2)$ is not separable in its two variables. If we can do that separable approximation, then, we can compute:

_{1}, m

_{2}). To illustrate this, let us consider the very simple case of the Gaussian:

$m_1$ and $m_2$. However, we must be careful about the convergence of the variances.
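This behaviour can be reproduced with a few lines of code. The sketch below uses the standard mean-field updates for a bivariate Gaussian $\mathcal{N}(\mathbf{m},\mathbf{V})$ with precision matrix $\boldsymbol{\Lambda}=\mathbf{V}^{-1}$, namely $q_j=\mathcal{N}(\tilde m_j,1/\Lambda_{jj})$ with $\tilde m_1=m_1-\Lambda_{11}^{-1}\Lambda_{12}(\tilde m_2-m_2)$ and symmetrically for $\tilde m_2$. These update formulas are the textbook result and are stated here as an assumption, since they are not reproduced in the text above; the function name and the numerical values are hypothetical.

```python
def mean_field_gaussian(m, v, n_iter=100):
    """Separable approximation q1(x1) q2(x2) of a bivariate Gaussian N(m, v).

    Returns the approximate means and variances after n_iter alternate updates.
    """
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    # entries of the precision matrix (inverse of the 2 x 2 covariance)
    l11, l12, l22 = v[1][1] / det, -v[0][1] / det, v[0][0] / det
    mt1, mt2 = 0.0, 0.0                     # arbitrary initial means
    for _ in range(n_iter):
        mt1 = m[0] - (l12 / l11) * (mt2 - m[1])
        mt2 = m[1] - (l12 / l22) * (mt1 - m[0])
    return (mt1, mt2), (1.0 / l11, 1.0 / l22)

m = (1.0, -2.0)                             # hypothetical true means
v = [[1.0, 0.8], [0.8, 1.0]]                # hypothetical covariance, correlation 0.8
means, variances = mean_field_gaussian(m, v)
```

In this example the approximate means converge to the true ones, while the approximate variances equal $1-0.8^2=0.36$ instead of the true marginal variance 1: the separable approximation underestimates the spread when the two variables are correlated.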

#### 15.2. Case of Exponential Families

$\langle\ln p(\mathbf{x})\rangle_{q_2(x_2)}$ and $\langle\ln p(\mathbf{x})\rangle_{q_1(x_1)}$. Only for a few cases can we do this analytically. Different algorithms can be obtained depending on the choice of a particular family for $q_j(x_j)$ [103,112–120].

$\boldsymbol{\theta}$ is a vector of parameters and $g(\boldsymbol{\theta})$ and $\mathbf{u}(\mathbf{x})$ are known functions.

**θ**) in the family:

**θ**:

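$p(\mathbf{x},\boldsymbol{\theta}|\eta,\boldsymbol{\nu})=p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|\eta,\boldsymbol{\nu})$, which is not separable in $\mathbf{x}$ and $\boldsymbol{\theta}$, and we want to approximate it with the separable $q(\mathbf{x},\boldsymbol{\theta})=q_1(\mathbf{x})\,q_2(\boldsymbol{\theta})$, then we will have: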

## 16. VBA for the Unsupervised Bayesian Approach to Inverse Problems

$p(\mathbf{f},\boldsymbol{\theta})=p(\mathbf{f}|\boldsymbol{\theta})\,p(\boldsymbol{\theta})$ by a separable $q(\mathbf{f},\boldsymbol{\theta})=q_1(\mathbf{f})\,q_2(\boldsymbol{\theta})$. Interestingly, depending on the choice of the family laws for $q_1$ and $q_2$, we obtain different algorithms:

- $q_1(\mathbf{f})=\delta(\mathbf{f}-\tilde{\mathbf{f}})$ and $q_2(\boldsymbol{\theta})=\delta(\boldsymbol{\theta}-\tilde{\boldsymbol{\theta}})$. In this case, we have:$$\begin{cases}q_1(\mathbf{f})\propto\exp\left[\langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_2}\right]\propto\exp\left[\ln p(\mathbf{f},\tilde{\boldsymbol{\theta}})\right]\propto p(\mathbf{f},\boldsymbol{\theta}=\tilde{\boldsymbol{\theta}})\propto p(\mathbf{f}|\boldsymbol{\theta}=\tilde{\boldsymbol{\theta}})\\ q_2(\boldsymbol{\theta})\propto\exp\left[\langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_1}\right]\propto\exp\left[\ln p(\tilde{\mathbf{f}},\boldsymbol{\theta})\right]\propto p(\mathbf{f}=\tilde{\mathbf{f}},\boldsymbol{\theta})\propto p(\boldsymbol{\theta}|\mathbf{f}=\tilde{\mathbf{f}})\end{cases}$$$$\begin{cases}\tilde{\mathbf{f}}=\arg\max_{\mathbf{f}}\left\{p(\mathbf{f},\boldsymbol{\theta}=\tilde{\boldsymbol{\theta}})\right\}\\ \tilde{\boldsymbol{\theta}}=\arg\max_{\boldsymbol{\theta}}\left\{p(\mathbf{f}=\tilde{\mathbf{f}},\boldsymbol{\theta})\right\}\end{cases}$$$$(\tilde{\mathbf{f}},\tilde{\boldsymbol{\theta}})=\underset{(\mathbf{f},\boldsymbol{\theta})}{\arg\max}\{p(\mathbf{f},\boldsymbol{\theta})\}.$$Here, the uncertainties of $\mathbf{f}$ are not used for the estimation of $\boldsymbol{\theta}$ and the uncertainties of $\boldsymbol{\theta}$ are not used for the estimation of $\mathbf{f}$.
- $q_1(\mathbf{f})$ is free form and $q_2(\boldsymbol{\theta})=\delta(\boldsymbol{\theta}-\tilde{\boldsymbol{\theta}})$. In the same way, this time we obtain:$$\begin{cases}\langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_2(\boldsymbol{\theta})}=\ln p(\mathbf{f},\tilde{\boldsymbol{\theta}})\\ \langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_1(\mathbf{f})}=\langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_1(\mathbf{f}|\tilde{\boldsymbol{\theta}})}=Q(\boldsymbol{\theta},\tilde{\boldsymbol{\theta}})\end{cases}$$$$\begin{cases}q_1(\mathbf{f})\propto\exp\left[\ln p(\mathbf{f},\boldsymbol{\theta}=\tilde{\boldsymbol{\theta}})\right]\propto p(\mathbf{f},\tilde{\boldsymbol{\theta}})\\ q_2(\boldsymbol{\theta})\propto\exp\left[Q(\boldsymbol{\theta},\tilde{\boldsymbol{\theta}})\right]\to\tilde{\boldsymbol{\theta}}=\arg\max_{\boldsymbol{\theta}}\left\{Q(\boldsymbol{\theta},\tilde{\boldsymbol{\theta}})\right\}\end{cases}$$Here, the uncertainties of $\mathbf{f}$ are used for the estimation of $\boldsymbol{\theta}$, but the uncertainties of $\boldsymbol{\theta}$ are not used for the estimation of $\mathbf{f}$.
- $q_1(\mathbf{f})=\delta(\mathbf{f}-\tilde{\mathbf{f}})$ and $q_2(\boldsymbol{\theta})$ is free form. In the same way, this time we obtain:$$\begin{cases}\langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_1(\mathbf{f})}=\ln p(\mathbf{f}=\tilde{\mathbf{f}},\boldsymbol{\theta})\\ \langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{q_2(\boldsymbol{\theta})}=\langle\ln p(\mathbf{f},\boldsymbol{\theta})\rangle_{p(\boldsymbol{\theta}|\mathbf{f}=\tilde{\mathbf{f}})}=Q(\tilde{\mathbf{f}},\mathbf{f})\end{cases}$$$$\begin{cases}q_2(\boldsymbol{\theta})\propto\exp\left[\ln p(\mathbf{f}=\tilde{\mathbf{f}},\boldsymbol{\theta})\right]\propto p(\boldsymbol{\theta}|\mathbf{f}=\tilde{\mathbf{f}})\\ q_1(\mathbf{f})\propto\exp\left[Q(\tilde{\mathbf{f}},\mathbf{f})\right]\to\tilde{\mathbf{f}}=\arg\max_{\mathbf{f}}\left\{Q(\tilde{\mathbf{f}},\mathbf{f})\right\}\end{cases}$$Here, the uncertainties of $\boldsymbol{\theta}$ are used for the estimation of $\mathbf{f}$, but the uncertainties of $\mathbf{f}$ are not used for the estimation of $\boldsymbol{\theta}$.
- Both $q_1(\mathbf{f})$ and $q_2(\boldsymbol{\theta})$ have free form. The main difficulty here is that, at each iteration, the expression of $q_1$ and $q_2$ may change. However, if $p(\mathbf{f},\boldsymbol{\theta})$ is in the generalized exponential family, the expressions of $q_1(\mathbf{f})$ and $q_2(\boldsymbol{\theta})$ will also be in the same family, and we have only to update the parameters at each iteration.

## 17. VBA for a Linear Inverse Problem with Simple Gaussian Priors

$J(\mathbf{f},\theta_1,\theta_2)=\ln p(\mathbf{f},\theta_1,\theta_2|\mathbf{g})$, it is easy to obtain the equations of an alternate JMAP algorithm by computing the derivatives of it with respect to its arguments and equating them to zero:

$p(\mathbf{f},\theta_1,\theta_2|\mathbf{g})$, we can also obtain the expressions of the conditionals:

$p(\mathbf{f}|\mathbf{g})$, $p(\theta_1|\mathbf{g})$ and $p(\theta_2|\mathbf{g})$ is not easy. We can then obtain approximate expressions $q_1(\mathbf{f}|\mathbf{g})$, $q_2(\theta_1|\mathbf{g})$ and $q_3(\theta_2|\mathbf{g})$ using the VBA method. For this case, thanks to the conjugacy property, we have:

$\mathbf{f}$ can be done via the optimization of the criterion $J(\mathbf{f},\theta_1,\theta_2)=\ln p(\mathbf{f},\theta_1,\theta_2|\mathbf{g})$, which does not need explicitly the matrix inversion of $\tilde{\mathit{V}}={({\mathit{H}}^{\prime}\mathit{H}+{\lambda}^{\prime}\mathit{I})}^{-1}$. However, in BEM and VBA, we need to compute it due to the following requirements:

## 18. Bayesian Variational Approximation with Hierarchical Prior Models

$p(\mathbf{g}|\mathbf{f},\boldsymbol{\theta}_1)$, when a hierarchical prior model $p(\mathbf{f}|\mathbf{z},\boldsymbol{\theta}_2)\,p(\mathbf{z}|\boldsymbol{\theta}_3)$ is used and when the estimation of the hyper-parameters $\boldsymbol{\theta}=[\boldsymbol{\theta}_1,\boldsymbol{\theta}_2,\boldsymbol{\theta}_3]'$ has to be considered, the joint posterior law of all the unknowns becomes:

$q(\mathbf{f},\mathbf{z},\boldsymbol{\theta}|\mathbf{g})=q_1(\mathbf{f})\,q_2(\mathbf{z})\,q_3(\boldsymbol{\theta})$ and where the expressions of $q(\mathbf{f},\mathbf{z},\boldsymbol{\theta}|\mathbf{g})$ are obtained by minimizing the Kullback–Leibler divergence (99), as explained in the previous section. This approach can also be used for model selection based on the evidence of the model $\ln p(\mathbf{g})$ [121] where:

**g**). Indeed, the name variational approximation is due to the fact that $\ln p(\mathbf{g})\ge\mathcal{F}(q)$, and so, $\mathcal{F}(q)$ is a lower bound to the evidence $\ln p(\mathbf{g})$.

$q_1$, $q_2$ and $q_3$ results in:

$q_1(\mathbf{f})$, $q_2(\mathbf{z})$ and $q_3(\boldsymbol{\theta})$, which need, at each iteration, the expression of the expectations appearing in the exponentials on the right-hand side. If $p(\mathbf{g}|\mathbf{f},\mathbf{z},\boldsymbol{\theta}_1)$ is a member of an exponential family and if all of the priors $p(\mathbf{f}|\mathbf{z},\boldsymbol{\theta}_2)$, $p(\mathbf{z}|\boldsymbol{\theta}_3)$, $p(\boldsymbol{\theta}_1)$, $p(\boldsymbol{\theta}_2)$ and $p(\boldsymbol{\theta}_3)$ are conjugate priors, then it is easy to see that these expressions lead to standard distributions for which the required expectations are easily evaluated. In that case, we may note:

$q_1$, the parameters ($\tilde{\mathbf{f}}$, $\tilde{\boldsymbol{\theta}}$) of $q_2$ and the parameters ($\tilde{\mathbf{f}}$, $\tilde{\mathbf{z}}$) of $q_3$. Finally, we may note that, to monitor the convergence of the algorithm, we may evaluate the free energy:

$q(\mathbf{f},\mathbf{z},\boldsymbol{\theta})$ are also possible. For example: $q(\mathbf{f},\mathbf{z},\boldsymbol{\theta})=q_1(\mathbf{f}|\mathbf{z})\,q_2(\mathbf{z})\,q_3(\boldsymbol{\theta})$ or even: $q(\mathbf{f},\mathbf{z},\boldsymbol{\theta})=\prod_j q_{1j}(f_j)\,\prod_j q_{2j}(z_{fj})\,\prod_l q_{3l}(\theta_l)$. Here, we consider the first case and give some more details on it.

## 19. Bayesian Variational Approximation with Student t Priors

_{fj}:

$\mathbf{g}=\mathbf{H}\mathbf{f}+\boldsymbol{\epsilon}$ and assign a Gaussian law with unknown variance $\upsilon_{\epsilon_i}$ to the noise $\epsilon_i$, which results in $p(\boldsymbol{\epsilon})=\mathcal{N}(\boldsymbol{\epsilon}|0,\mathbf{V}_\epsilon)$ with $\mathbf{V}_\epsilon=\mathrm{diag}[\boldsymbol{\upsilon}_\epsilon]$ and $\boldsymbol{\upsilon}_\epsilon=[\upsilon_{\epsilon_1},\cdots,\upsilon_{\epsilon_M}]$, and so:

## 20. Conclusions

- A probability law is a tool for representing our state of knowledge about a quantity.
- The Bayes or Laplace rule is an inference tool for updating our state of knowledge about an inaccessible quantity when another accessible, related quantity is observed.
- Entropy is a measure of information content in a variable with a given probability law.
- The maximum entropy principle can be used to assign a probability law to a quantity when the available information about it is in the form of a limited number of constraints on that probability law.
- Relative entropy and Kullback–Leibler divergence are tools for updating probability laws in the same context.
- When a parametric probability law is assigned to a quantity and we want to measure the information gained about the parameters when some direct observations of that quantity are available, we can use the Fisher information. The structure of the Fisher information geometry in the space of parameters is derived from the relative entropy by a second-order Taylor series approximation.
- All of these rules and tools are currently used in different ways in data and signal processing. In this paper, a few examples of the ways these tools are used in data and signal processing problems are presented. One main conclusion is that each of these tools has to be used in its appropriate context. The example in spectral estimation shows that it is very important to define the problem clearly at the beginning, to use the appropriate tools and to interpret the results appropriately.
- The Laplacian or Bayesian inference is the appropriate tool for proposing satisfactory solutions to inverse problems. Indeed, the expression of the posterior probability law combines the state of knowledge contained in the forward model and the data with the state of knowledge available before using the data.
- The Bayesian approach can also easily be used to propose unsupervised methods, which makes these methods easier to apply in practice.
- One of the main limitations of these sophisticated methods is their computational cost. To address this, we proposed using VBA as an alternative to MCMC methods to obtain realistic algorithms for very high-dimensional inverse problems, in which we want to estimate an unknown signal (1D), image (2D), volume (3D) or even more (3D + time or 3D + wavelength).
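The second-order relation between relative entropy and Fisher information mentioned in the list above can be checked numerically. The following is a minimal sketch for a Bernoulli law with parameter p, for which the Fisher information is 1/(p(1 − p)); the function names are hypothetical and the example is only meant as an illustration.

```python
import numpy as np


def kl_bernoulli(p, q):
    """Relative entropy KL(Bernoulli(p) : Bernoulli(q))."""
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))


def fisher_bernoulli(p):
    """Fisher information of a Bernoulli law with respect to p."""
    return 1.0 / (p * (1.0 - p))


if __name__ == "__main__":
    p = 0.3
    for delta in (1e-1, 1e-2, 1e-3):
        exact = kl_bernoulli(p, p + delta)
        # Second-order Taylor approximation: KL ~ 0.5 * F(p) * delta^2
        approx = 0.5 * fisher_bernoulli(p) * delta ** 2
        print(delta, exact, approx, exact / approx)
```

As the perturbation shrinks, the ratio of the exact divergence to the quadratic approximation should approach one, which is the local statement behind using the Fisher information as a metric on the parameter space.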

## Acknowledgments

^{†}This paper is an extended version of the paper published in Proceedings of the 34th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), Amboise, France, 21–26 September 2014.


$\mu (u)\propto \mathrm{exp}\left[-\frac{1}{2}{\sum}_{j}{u}_{j}^{2}\right]$ | $\widehat{\mathit{f}}={\mathit{H}}^{\prime}\mathbf{\lambda}$ | $\widehat{\mathit{f}}={\mathit{H}}^{\prime}{(\mathit{H}{\mathit{H}}^{\prime})}^{-1}\mathit{g}$ |

$\mu (u)\propto \mathrm{exp}\left[-{\sum}_{j}\vert {u}_{j}\vert \right]$ | $\widehat{\mathit{f}}=1./({\mathit{H}}^{\prime}\mathbf{\lambda}\pm 1)$ | $\mathit{H}\widehat{\mathit{f}}=\mathit{g}$ |

$\mu (u)\propto \mathrm{exp}\left[-{\sum}_{j}{u}_{j}^{\alpha -1}\mathrm{exp}\left[-\beta {u}_{j}\right]\right],\phantom{\rule{0.2em}{0ex}}{u}_{j}>0$ | $\widehat{\mathit{f}}=\alpha 1./({\mathit{H}}^{\prime}\mathbf{\lambda}+\beta 1)$ | $\mathit{H}\widehat{\mathit{f}}=\mathit{g}$ |

JMAP | BEM | VBA |
---|---|---|
$q(\mathit{f})=\delta (\mathit{f}-\tilde{\mathit{f}})$ | $q(\mathit{f})=\mathcal{N}(\mathit{f}\mid \tilde{\mathit{f}},\tilde{\mathit{V}})$ | $q(\mathit{f})=\mathcal{N}(\mathit{f}\mid \tilde{\mathit{f}},\tilde{\mathit{V}})$ |
$\tilde{\mathit{V}}={({\mathit{H}}^{\prime}\mathit{H}+\tilde{\lambda}\mathit{I})}^{-1}$ | $\tilde{\mathit{V}}={({\mathit{H}}^{\prime}\mathit{H}+\tilde{\lambda}\mathit{I})}^{-1}$ | $\tilde{\mathit{V}}={({\mathit{H}}^{\prime}\mathit{H}+\tilde{\lambda}\mathit{I})}^{-1}$ |
$\tilde{\mathit{f}}=\tilde{\mathit{V}}{\mathit{H}}^{\prime}\mathit{g}$ | $\tilde{\mathit{f}}=\tilde{\mathit{V}}{\mathit{H}}^{\prime}\mathit{g}$ | $\tilde{\mathit{f}}=\tilde{\mathit{V}}{\mathit{H}}^{\prime}\mathit{g}$ |
$q({\theta}_{1})=\delta ({\theta}_{1}-{\tilde{\theta}}_{1})$ | $q({\theta}_{1})=\delta ({\theta}_{1}-{\tilde{\theta}}_{1})$ | $q({\theta}_{1})=\mathcal{G}({\theta}_{1}\mid {\tilde{\alpha}}_{1},{\tilde{\beta}}_{1})$ |
${\tilde{\alpha}}_{1}=({\alpha}_{10}-1)+\frac{\mathrm{M}}{2}$ | ${\tilde{\alpha}}_{1}=({\alpha}_{10}-1)+\frac{\mathrm{M}}{2}$ | ${\tilde{\alpha}}_{1}=({\alpha}_{10}-1)+\frac{\mathrm{M}}{2}$ |
${\tilde{\beta}}_{1}={\beta}_{10}+\frac{1}{2}{\Vert \mathit{g}-\mathit{H}\mathit{f}\Vert}_{2}^{2}$ | ${\tilde{\beta}}_{1}={\beta}_{10}+\frac{1}{2}\langle {\Vert \mathit{g}-\mathit{H}\mathit{f}\Vert}_{2}^{2}\rangle$ | ${\tilde{\beta}}_{1}={\beta}_{10}+\frac{1}{2}\langle {\Vert \mathit{g}-\mathit{H}\mathit{f}\Vert}_{2}^{2}\rangle$ |
${\tilde{\theta}}_{1}=\frac{{\tilde{\alpha}}_{1}}{{\tilde{\beta}}_{1}}$ | ${\tilde{\theta}}_{1}=\frac{{\tilde{\alpha}}_{1}}{{\tilde{\beta}}_{1}}$ | ${\tilde{\theta}}_{1}=\frac{{\tilde{\alpha}}_{1}}{{\tilde{\beta}}_{1}}$ |
$q({\theta}_{2})=\delta ({\theta}_{2}-{\tilde{\theta}}_{2})$ | $q({\theta}_{2})=\delta ({\theta}_{2}-{\tilde{\theta}}_{2})$ | $q({\theta}_{2})=\mathcal{G}({\theta}_{2}\mid {\tilde{\alpha}}_{2},{\tilde{\beta}}_{2})$ |
${\tilde{\alpha}}_{2}=({\alpha}_{20}-1)+\frac{\mathrm{N}}{2}$ | ${\tilde{\alpha}}_{2}=({\alpha}_{20}-1)+\frac{\mathrm{N}}{2}$ | ${\tilde{\alpha}}_{2}=({\alpha}_{20}-1)+\frac{\mathrm{N}}{2}$ |
${\tilde{\beta}}_{2}={\beta}_{20}+\frac{1}{2}{\Vert \mathit{f}\Vert}_{2}^{2}$ | ${\tilde{\beta}}_{2}={\beta}_{20}+\frac{1}{2}\langle {\Vert \mathit{f}\Vert}_{2}^{2}\rangle$ | ${\tilde{\beta}}_{2}={\beta}_{20}+\frac{1}{2}\langle {\Vert \mathit{f}\Vert}_{2}^{2}\rangle$ |
${\tilde{\theta}}_{2}=\frac{{\tilde{\alpha}}_{2}}{{\tilde{\beta}}_{2}}$ | ${\tilde{\theta}}_{2}=\frac{{\tilde{\alpha}}_{2}}{{\tilde{\beta}}_{2}}$ | ${\tilde{\theta}}_{2}=\frac{{\tilde{\alpha}}_{2}}{{\tilde{\beta}}_{2}}$ |
$\tilde{\lambda}=\frac{{\tilde{\theta}}_{2}}{{\tilde{\theta}}_{1}}$ | $\tilde{\lambda}=\frac{{\tilde{\theta}}_{2}}{{\tilde{\theta}}_{1}}$ | $\tilde{\lambda}=\frac{{\tilde{\theta}}_{2}}{{\tilde{\theta}}_{1}}$ |
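The VBA column of the table above can be turned into a short iterative scheme. The sketch below is a minimal illustration, assuming the linear model g = Hf + ϵ with a noise precision θ_1 and a prior precision θ_2 on f, both with Gamma priors. Two points are assumptions of this sketch rather than statements from the table: the covariance of q(f) is written as (θ_1 H′H + θ_2 I)^{-1}, that is, the table's matrix scaled by 1/θ_1, and the shape updates use the textbook conjugate form α_0 + M/2 and α_0 + N/2. The function name `vba_linear_gaussian` and the default hyperparameter values are hypothetical.

```python
import numpy as np


def vba_linear_gaussian(H, g, a10=1e-3, b10=1e-3, a20=1e-3, b20=1e-3,
                        n_iter=50):
    """Illustrative VBA iterations for g = H f + eps (not the paper's code).

    theta1: noise precision, theta2: prior precision of f.
    Returns the mean and covariance of q(f) and the means of q(theta1),
    q(theta2).
    """
    M, N = H.shape
    HtH = H.T @ H
    Htg = H.T @ g
    theta1, theta2 = 1.0, 1.0  # initial expected precisions
    for _ in range(n_iter):
        # Update q(f) = N(f_mean, Sigma) given the current precisions.
        Sigma = np.linalg.inv(theta1 * HtH + theta2 * np.eye(N))
        f_mean = theta1 * (Sigma @ Htg)
        # Expectations under q(f) needed by the Gamma updates.
        resid = g - H @ f_mean
        e_resid = resid @ resid + np.trace(HtH @ Sigma)  # <||g - Hf||^2>
        e_norm = f_mean @ f_mean + np.trace(Sigma)       # <||f||^2>
        # Update q(theta1), q(theta2) (Gamma laws) and take their means.
        theta1 = (a10 + M / 2.0) / (b10 + 0.5 * e_resid)
        theta2 = (a20 + N / 2.0) / (b20 + 0.5 * e_norm)
    return f_mean, Sigma, theta1, theta2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.standard_normal((200, 20))
    f_true = rng.standard_normal(20)
    g = H @ f_true + 0.1 * rng.standard_normal(200)
    f_mean, Sigma, theta1, theta2 = vba_linear_gaussian(H, g)
    print(np.linalg.norm(f_mean - f_true), theta1, theta2)
```

At a fixed point, the mean of q(f) coincides with the regularized least-squares solution with λ̃ = θ̃_2/θ̃_1, as in the table. The explicit matrix inverse is used only for readability; for large N it would normally be replaced by an approximate or structured solver.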

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mohammad-Djafari, A.
Entropy, Information Theory, Information Geometry and Bayesian Inference in Data, Signal and Image Processing and Inverse Problems. *Entropy* **2015**, *17*, 3989-4027.
https://doi.org/10.3390/e17063989
