Entropy, Information Theory, Information Geometry and Bayesian Inference in Data, Signal and Image Processing and Inverse Problems

: The main content of this review article is ﬁrst to review the main inference tools using Bayes rule, the maximum entropy principle (MEP), information theory, relative entropy and the Kullback–Leibler (KL) divergence, Fisher information and its corresponding geometries. For each of these tools, the precise context of their use is described. The second part of the paper is focused on the ways these tools have been used in data, signal and image processing and in the inverse problems, which arise in different physical sciences and engineering applications. A few examples of the applications are described: entropy in independent components analysis (ICA) and in blind source separation, Fisher information in data model selection, different maximum entropy-based methods in time series spectral estimation and in linear inverse problems and, ﬁnally, the Bayesian inference for general inverse problems. Some original materials concerning the approximate Bayesian computation (ABC) and, in particular, the variational Bayesian approximation (VBA) methods are also presented. VBA is used for proposing an alternative Bayesian computational tool to the classical Markov chain Monte Carlo (MCMC) methods. We will also see that VBA englobes


Introduction
As this paper is an overview and an extension of my tutorial paper in MaxEnt 2014 workshop [1], this Introduction gives a summary of the content of this paper.
The qualification Bayesian refers to the influence of Thomas Bayes [2], who introduced what is now known as Bayes' rule, even if the idea had been developed independently by Pierre-Simon de Laplace [3].For this reason, I am asking a question of the community if we shall change Bayes to Laplace and Bayesian to Laplacian or at least mention them both.Whatever the answer, we assume that the reader knows what probability means in a Bayesian or Laplacian framework.The main idea is that a probability law P (X) assigned to a quantity X represents our state of knowledge that we have about it.If X is a discrete valued variable, {P (X = x n ) = p n , n = 1, • • • , N } with mutually exclusive values x n is its probability distribution.When X is a continuous valued variable, p(x) is its probability density function from which we can compute P (a ≤ X < b) = b a p(x) dx or any other probabilistic quantity, such as its mode, mean, median, region of high probabilities, etc.
In science, it happens very often that a quantity cannot be directly observed or, even when it can be observed, the observations are uncertain (commonly said to be noisy), by uncertain or noisy, here, I mean that, if we repeat the experiences with the same practical conditions, we obtain different data.However, in the Bayesian approach, for a given experiment, we have to use the data as they are, and we want to infer it from those observations.Before starting the observation and gathering new data, we have very incomplete knowledge about it.However, this incomplete knowledge can be translated in probability theory via an a priori probability law.We will discuss this point later on regarding how to do this.For now, we assume that this can be done.When a new observation (data D) on X becomes available (direct or indirect), we gain some knowledge via the likelihood P (D|X).Then, our state of knowledge is updated combining P (D|X) and P (X) to obtain an a posteriori law P (X|D), which represents the new state of knowledge on X.This is the main esprit of the Bayes rule, which can be summarized as: P (X|D) = P (D|X)P (X)/P (D). ( As P (X|D) has to be a probability law, we have: This relation can be extended to the continuous case.Some more details will be given in Section 2. Associated with a probability law is the quantity of information it contains.Shannon [4] introduced the notion of quantity of information I n associated with one of the possible values of x n of X with probabilities P (X = x n ) = p n to be I n = ln 1 pn = − ln p n and the entropy H as the expected value of I n : The word entropy has also its roots in thermodynamics and physics.However, this notion of entropy has no direct link with entropy in physics, even if for a particular physical system, we may attribute a probability law to a quantity of interest of that system and then define its entropy.This information definition of Shannon entropy became the main basis of information theory in many data analyses and the science of communication.More details and extensions about this subject will be given in Section 3.
As we can see up to now, we did not yet discuss how to assign a probability law to a quantity.For the discrete value variable, when X can take one of the N values {x 1 , • • • , x N } and when we do not know anything else about it, Laplace proposed the "Principe d'indifférence", where P (X = x n ) = p n = 1 N , ∀n = 1, • • • , N , a uniform distribution.However, what if we know more, but not enough to be able to assign the probability law {p 1 , • • • , p N } completely?
For example, if we know that the expected value is n x n p n = d, this problem can be handled by considering this equation as a constraint on the probability distribution {p 1 , • • • , p N }.If we have a sufficient number of constraints (at least N ), then we may obtain a unique solution.However, very often, this is not the case.The question now is how to assign a probability distribution {p 1 , • • • , p N } that satisfies the available constraints.This question is an ill-posed problem in the mathematical sense of Hadamard [5] in the sense that the solution is not unique.We can propose many probability distributions that satisfy the constraint imposed by this problem.To answer this question, Jaynes [6][7][8] introduced the maximum entropy principle (MEP) as a tool for assigning a probability law to a quantity on which we have some incomplete or macroscopic (expected values) information.Some more details about this MEP, the mathematical optimization problem, the expression of the solution and the algorithm to compute it will be given in Sections 3 and 4.
Kullback [9] was interested in comparing two probability laws and introduced a tool to measure the increase of information content of a new probability law with respect to a reference one.This tool is called either the Kullback-Leibler (KL) divergence, cross entropy or relative entropy.It has also been used to update a prior law when new pieces of information in the form of expected values are given.As we will see later, this tool can also be used as an extension to the MEP of Jaynes.Furthermore, as we will see later, this criterion of comparison of two probability laws is not symmetric: one of the probability laws has to be chosen to be the reference, and then, the second is compared to this reference.Some more details and extensions will be given in Section 5.
Fisher [10] wanted to measure the amount of information that a random variable X carries about an unknown parameter θ upon which its probability law p(x|θ) depends.The partial derivative with respect to θ of the logarithm of this probability law, called the log-likelihood function for θ, is called the score.He showed that the first order moment of the score is zero, but its second order moment is positive and is also equivalent to the expected values of the second derivative of log-likelihood function with respect to θ.This quantity is called the Fisher information.It is also been shown that for the small variations of θ, the Fisher information induces locally a distance in the space of parameters Θ, if we had to compare two very close values of θ.In this way, the notion of the geometry of information is introduced [11,12].
We must be careful here that this geometrical property is related to the space of the parameters Θ for small changes of the parameter for a given family of parametric probability law p(x|θ) and not in the space of probabilities.However, for two probability laws p 1 (x) = p(x|θ 1 ) and p 2 (x) = p(x|θ 2 ) in the same exponential family, the Kullback-Leibler divergence KL [p 1 : p 2 ] induces a Bregman divergence B[θ 1 : θ 2 ] between the two parameters [13,14].More details will be given in Section 8.
At this stage, we have almost introduced all of the necessary tools that we can use for different levels of data, signal and image processing.In the following, we give some more details for each of these tools and their inter-relations.Then, we review a few examples of their use in different applications.As examples, we demonstrate how these tools can be used in independent components analysis (ICA) and source separation, data model selection, in spectral analysis of the signals and in inverse problems, which arise in many sciences and engineering applications.At the end, we focus more on the Bayesian approach for inverse problems.We present some details concerning unsupervised methods, where the hyper parameters of the problem have to be estimated jointly with the unknown quantities (hidden variables).Here, we will see how the Kullback-Leibler divergence can help approximate Bayesian computation (ABC).In particular, some original materials concerning variational Bayesian approximation (VBA) methods are presented.

Bayes Rule
Let us introduce things very simply.If we have two discrete valued related variables X and Y , for which we have assigned probability laws P (X) and P (Y ), respectively, and their joint probability law P (X, Y ), then from the sum and product rule, we have: where P (X, Y ) is the joint probability law, P (X) = Y P (X, Y ) and P (Y ) = X P (X, Y ) are the marginals and P (X|Y ) = P (X, Y )/P (Y ) and P (Y |X) = P (X, Y )/P (X) are the conditionals.Now, consider the situation where Y can be observed, but not X.Because these two quantities are related, we may want to infer X from the observations on Y .Then, we can use: which is called the Bayes rule.This relation is extended to the continuous valued variables using the measure theory [15,16]: with: More simply, the Bayes rule is often written as: This writing can be used when we want to use p(x|y) to compute quantities that are only dependent on the shape of p(x|y), such as the mode, the median or quantiles.However, we must be careful that the denominator is of importance in many other cases, for example when we want to compute expected values.There is no need for more sophisticated mathematics here if we want to use this approach.As we mentioned, the main use of this rule is in particular when X can not be observed (unknown quantity), but Y is observed and we want to infer X.In this case, the term p(y|x) is called the likelihood (of unknown quantity X in the observed data y), p(x) is called a priori and p(x|y) a posteriori.The likelihood is assigned using the link between the observed Y and the unknown X, and p(x) is assigned using the prior knowledge about it.The Bayes rule then is a way to do state of knowledge fusion.Before taking into account any observation, our state of knowledge is represented by p(x), and after the observation of Y , it becomes p(x|y).
However, in this approach, two steps are very important.The first step is the assigning of p(x) and p(y|x) before being able to use the Bayes rule.As noted in the Introduction and as we will see later, we need other tools for this step.The second important step is after: how to use p(x|y) to summarize it.When X is just a scalar value variable, we can do this computation easily.For example, we can compute the probability that X is in the interval [a, b] via: However, when the unknown becomes a high dimensional vectorial variable X, as is the case in many signal and image processing applications, this computation becomes very costly [17].We may then want to summarize p(x|y) by a few interesting or significant point estimates.For example, compute the maximum a posteriori (MAP) solution: the expected a posteriori (EAP) solution: x EAP = x p(x|y) dx, (11) the domains of X which include an integrated probability mass of more than some desired value (0.95 for example): or any other questions, such as median or any α-quantiles: As we see, computation of MAP needs an optimization algorithm, while these last three cases need integration, which may become very complicated for high dimensional cases [17].We can also just explore numerically the whole space of the distribution using the Markov chain Monte Carlo (MCMC) [18][19][20][21][22][23][24][25][26] or any other sampling techniques [17].In the scalar case (one dimension), all of these computations can be done numerically very easily.For the vectorial case, when the dimensions become large, we need to develop specialized approximation methods, such as VBA and algorithms to do these computations.We give some more details about these when using this approach for inverse problems in real applications.
Remarks on notation used for the expected value in this paper: For a variable X with the probability density function (pdf) p(x) and any regular function h(X), we use indifferently: As an example, as we will say later, the entropy of p(x) is noted indifferently: For any conditional probability density function (pdf) p(x|y) and any regular function h(X), we use indifferently: As another example, as we will see later, the relative entropy of p(x) over q(x) is noted indifferently: and when there is not any ambiguity in the integration variable, we may omit it.For example, we may note: Finally, when we have two variables X and Y with their joint pdf p(x, y), their marginals p(x) and p(y) and their conditionals p(x|y) and p(y|x), we may use the following notations:

Shannon Entropy
To introduce the quantity of information and the entropy, Shannon first considered a discrete valued variable X taking values {x 1 , • • • , x N } with probabilities {p 1 , • • • , p N } and defined the quantities of information associated with each of them as I n = ln 1 pn = − ln p n and its expected value as the entropy: Later, this definition is extended to the continuous case by: H By extension, if we consider two related variables (X, Y ) with the probability laws, joint p(x, y), marginals, p(x), p(y), and conditionals, p(y|x), p(x|y), we can define, respectively, the joint entropy:

Thermodynamical Entropy
Entropy is also a property of thermodynamical systems introduced by Clausius [27].For a closed homogeneous system with reversible transformation, the differential entropy δS is related to δQ the incremental reversible transfer of heat energy into that system by δS = δQ/T with T being the uniform temperature of the closed system.
It is very hard to establish a direct link between these two entropies.However, in statistical mechanics, thanks to Boltzmann, Gibbs and many others, we can establish some link if we consider the microstates (for example, the number, positions and speeds of the particles) and the macrostates (for example, the temperature T , pressure P , volume V and energy E) of the system and if we assign a probability law to microstates and consider the macrostates as the average (expected values) of some functions of those microstates.Let us give a very brief summary of some of those interpretations.

Statistical Mechanics Entropy
The interpretation of entropy in statistical mechanics is the measure of uncertainty that remains about the state of a system after its observable macroscopic properties, such as temperature (T ), pressure (P ) and volume (V ), have been taken into account.For a given set of macroscopic variables T , P and V , the entropy measures the degree to which the probability of the system is spread out over different possible microstates.In contrast to the macrostate, which characterizes plainly observable average quantities, a microstate specifies all atomic details about the system, including the position and velocity of every atom.Entropy in statistical mechanics is a measure of the number of ways in which the microstates of the system may be arranged, often taken to be a measure of "disorder" (the higher the entropy, the higher the disorder).This definition describes the entropy as being proportional to the natural logarithm of the number of possible microscopic configurations of the system (microstates), which could give rise to the observed macroscopic state (macrostate) of the system.The proportionality constant is the Boltzmann constant.

Boltzmann Entropy
Boltzmann described the entropy as a measure of the number of possible microscopic configurations Ω of the individual atoms and molecules of the system (microstates) that comply with the macroscopic state (macrostate) of the system.Boltzmann then went on to show that k ln Ω was equal to the thermodynamic entropy.The factor k has since been known as Boltzmann's constant.
In particular, Boltzmann showed that the entropy S of an ideal gas is related to the number of states of the molecules (microstates Ω) with a given temperature (macrostate):

Gibbs Entropy
The macroscopic state of the system is defined by a distribution on the microstates that are accessible to a system in the course of its thermal fluctuations.Therefore, the entropy is defined over two different levels of description of the given system.The entropy is given by the Gibbs entropy formula, named after J. Willard Gibbs.For a classical system (i.e., a collection of classical particles) with a discrete set of microstates, if E i is the energy of microstate i and p i is its probability that it occurs during the system's fluctuations, then the entropy of the system is: where k is again the physical constant of Boltzmann, which, like the entropy, has units of heat capacity.The logarithm is dimensionless.It is interesting to note that Relation (17) can be obtained from Relation (18) when the probability distribution is uniform over the volume Ω [28][29][30].

Relative Entropy or Kullback-Leibler Divergence
Kullback wanted to compare the relative quantity of information between two probability laws p 1 and p 2 on the same variable X.Two related notions have been defined: • Kullback-Leibler divergence of p 1 with respect to p 2 : We may note that: • KL [q : p] ≥ 0, • KL [q : p] = 0, if q = p and • KL [q : p] is invariant with respect to a scale change, but is not symmetric.
• A symmetric quantity can be defined as:

Mutual Information
The purpose of mutual information is to compare two related variables Y and X.It can be defined as the expected amount of information that one gains about Xif we observe the value of Y, and vice versa.Mathematically, the mutual information between X and Y is defined as: or equivalently as: With this definition, we have the following properties: and: We may also remark on the following property: • I [Y, X] is a concave function of p(y) when p(x|y) is fixed and a convex function of p(x|y) when p(y) is fixed.
• I [Y, X] ≥ 0 with equality only if X and Y are independent.

Maximum Entropy Principle
The first step before applying any probability rules for inference is to assign a probability law to a quantity.Very often, the available knowledge on that quantity can be described mathematically as the constraints on the desired probability law.However, in general, those constraints are not enough to determine in a unique way that probability law.There may exist many solutions that satisfy those constraints.We need then a tool to select one.
Jaynes introduced the MEP [8], which can be summarized as follows: When we do not have enough constraints to determine a probability law that satisfies those constraints, we may select between them the one with maximum entropy.
Let us be now more precise.Let us assume that the available information on that quantity X is the form of: where φ k are any known functions.First, we assume that such probability laws exist by defining: with φ 0 = 1 and d 0 = 1 for the normalization purpose.Then, the MEP is written as an optimization problem: whose solution is given by: where Z(λ), called the partition function, is given by: which can also be written as −∇ λ ln Z(λ) = d.Different algorithms have been proposed to compute numerically the ME distributions.See, for example, [31][32][33][34][35][36][37] The maximum value of entropy reached is given by: This optimization can easily be extended to the use of relative entropy by replacing H(p) by D[p : q] where q(x) is a given reference of a priori law.See [9,38,39] and [34,[40][41][42] for more details.

Link between Entropy and Likelihood
Consider the problem of the parameter estimation θ of a probability law p(x|θ) from an n-element The log-likelihood of θ is defined as: Maximizing L(θ) with respect to θ gives what is called the maximum likelihood (ML) estimate of θ.
Noting that L(θ) depends on n, we may consider 1 n L(θ) and define: where θ * is the right answer and p(x|θ * ) its corresponding probability law.We may then remark that: The first term in the right-hand side being a constant, we derive that: In this way, there is a link between the maximum likelihood and maximum relative entropy solutions [24].
There is also a link between the maximum relative entropy and the Bayes rule.See, for example, [43,44] and their corresponding references.

Fisher Information, Bregman and Other Divergences
Fisher [10] was interested in measuring the amount of information that samples of a variable X carries about an unknown parameter θ upon which its probability law p(x|θ) depends.For a given sample of observation x and its probability law p(x|θ), the function L(θ) = p(x|θ) is called the likelihood of θ in the sample x.He called the score of x over θ the partial derivative with respect to θ of the logarithm of this function: He also showed that the first order moment of the score is zero: but its second order moment is positive and is also equivalent to the expected values of the second derivative of the log-likelihood function with respect to θ.
This quantity is called the Fisher information [14].
It is also shown that for the small variations of θ, the Fisher information induces locally a distance in the space of parameters Θ, if we had to compare two very close values of θ.In this way, the notion of the geometry of information is introduced.The main steps for introducing this notion are the following: Consider D [p(x|θ * ) : p(x|θ * + ∆θ)] and assume that ln p(x|θ) can be developed in a Taylor series.Then, keeping the terms up to the second order, we obtain: where F is the Fisher information: We must be careful here that this geometry property is related to the space of the parameters Θ for a given family of parametric probability law p(x|θ) and not in the space of probabilities.However, for two probability laws p 1 (x) = p(x|θ 1 ) and p 2 (x) = p(x|θ 2 ) in the same exponential family, the Kullback-Leibler divergence KL [p 1 : p 2 ] induces a Bregman divergence B[θ 1 |θ 2 ] between the two parameters [14,[45][46][47][48].
To go further into detail, let us extend the discussion about the link between Fisher information and KL divergence, as well as other divergences, such as f -divergences, Rényi's divergences and Bregman divergences.
• f -divergences: The f -divergences, which are a general class of divergences, indexed by convex functions f , that include the KL divergence as a special case.Let f : (0, ∞) → I R be a convex function for which f (1) = 0.The f -divergence between two probability measures P and Q is defined by: Every f -divergence can be viewed as a measure of distance between probability measures with different properties.Some important special cases are: 2 gives the square of the Hellinger distance: • Rényi divergences: These are another generalization of the KL divergence.The Rényi divergence between two probability distributions P and Q is: When α = 1, by a continuity argument, D α [P : Q] converges to KL √ p q is called Bhattacharyya divergence (closely related to Hellinger distance).Interestingly, this quantity is always smaller than KL: As a result, it is sometimes easier to derive risk bounds with D 1/2 as the loss function as opposed to KL.
• Bregman divergences: The Bregman divergences provide another class of divergences that are indexed by convex functions and include both the Euclidean distance and the KL divergence as special cases.Let φ be a differentiable strictly convex function.The Bregman divergence B φ is defined by: where < x, y >= j x i y j here means the scalar product of x and y and where the domain of φ is a space where convexity and differentiability make sense (e.g., whole or a subset of I R d or an L p space).For example, φ(x) = x 2 on I R d gives the Euclidean distance: and φ(x) = j x j ln x j on the simplex in I R d gives the KL divergence: where it is assumed j x j = j y j = 1.
Let X be a quantity taking values in the domain of φ with a probability distribution function p(x).Then, E p(x) {B φ (X, m)} is minimized over m in the domain of φ at m = E {X}: Moreover, this property characterizes Bregman divergence.When applied to the Bayesian approach, this means that, using the Bregman divergence as the loss function, the Bayes estimator is the posterior mean.This point is detailed in the following.
Links between all of these through an example: Let us consider the Bayesian parameter estimation where we have some data y, a set of parameters x, a likelihood p(y|x) and a prior π(x), which gives the posterior p(x|y) ∝ p(y|x) π(x).Let us also consider a cost function C[x, x] in the parameter space x ∈ X .The classical Bayesian point estimation of x is expressed as the minimizer of an expected risk: where: It is very well known that the mean squared error estimator, which corresponds to C[x, x] = x − x 2 , is the posterior mean.It is now interesting to know that choosing C[x, x] to be any Bregman divergence B φ [x, x], we obtain also the posterior mean: x p(x|y) dx (46) where: Consider now that we have two prior probability laws π 1 (x) and π 2 (x), which give rise to two posterior probability laws p 1 (x|y) and p 2 (x|y).If the prior laws and the likelihood are in the exponential families, then the posterior laws are also in the exponential family.Let us note them as p 1 (x|y; θ 1 ) and p 2 (x|y; θ 2 ), where θ 1 and θ 2 are the parameters of those posterior laws.We then have the following properties: is used to compare the two posteriors.

Vectorial Variables and Time Indexed Process
The extension of the scalar variable to the finite dimensional vectorial case is almost immediate.In particular, for the Gaussian case p(x) = N (x|µ, R), the mean becomes a vector µ = E {X}, and the variances are replaced by a covariance matrix: R = E {(X − µ)(X − µ) }; and almost all of the quantities can be defined immediately.For example, for a Gaussian vector p(x) = N (x|0, R), the entropy is given by [49]: and the relative entropy of N (x|0, R) with respect to N (x|0, S) is given by: The notion of time series or processes needs extra definitions.For example, for a random time series X(t), we can define p(X(t)), ∀t, the expected value time series x(t) = E {X(t)} and what is called the autocorrelation function Γ(t, τ ) = E {X(t) X(t + τ )}.A time series is called stationary when these quantities does not depend on t, i.e., x(t) = m and Γ(t, τ ) = Γ(τ ) [50].Another quantity of interest for a stationary time series is its power spectral density (PSD) function: When X(t) is observed on times t = n∆T with ∆T = 1, we have X(n), and for a sample {X(1), • • • , X(N )}, we may define the mean µ = E {X} and the covariance matrix With these definitions, it can easily been shown that the covariance matrix of a stationary Gaussian process is Toeplitz [49].It is also possible to show that the entropy of such a process can be expressed as a function of its PSD function: For two stationary Gaussian processes with two spectral density functions S 1 (ω) and S 2 (ω), we have: where we find the Itakura-Saito distance in the spectral analysis literature [50][51][52][53].These definitions and expressions have often been used in time series analysis.In what follows, we give a few examples of the different ways these notions and quantities have been used in different applications of data, signal and image processing.

Entropy in Independent Component Analysis and Source Separation
Given a vector of time series x(t), the independent component analysis (ICA) consists of finding a separating matrix B, such that the components y(t) = Bx(t) are as independent as possible.The notion of entropy is used here as a measure of independence.For example, to find B, we may choose D p(y) : j p j (y j ) as a criterion of independence of the components y j .The next step is to choose a probability law p(x) from which we can find an expression for p(y) from which we can find an expression for D p(y) : j p j (y j ) as a function of the matrix B, which can be optimized to obtain it.
The ICA problem has a tight link with the source separation problem, where it is assumed that the measured time series x(t) is a linear combination of the sources s(t), i.e., x(t) = As(t), with A being the mixing matrix.The objective of source separation is then to find the separating matrix B = A −1 .
To see how the entropy is used here, let us note y = Bx.Then, H(y) is used as a criterion for ICA or source separation.As the objective in ICA is to obtain y in such a way that its components become as independent as possible, the separating matrix B has to maximize H(y).Many ICA algorithms are based on this optimization [54][55][56][57][58][59][60][61][62][63][64][65]

Entropy in Parametric Modeling and Model Selection
Determining the order of a model, i.e., the dimension of the vector parameter θ in a probabilistic model p(x|θ), is an important subject in many data and signal processing problems.As an example, in autoregressive (AR) modeling: where θ = {θ 1 , • • • , θ K }, we may want to compare two models with two different values of K.
When the order K is fixed, the estimation of the parameters θ is a very well-known problem, and there are likelihood based [66] or Bayesian approaches for that [67].The determination of the order is however more difficult [68].Between the tools, we may mention here the Bayesian methods [69][70][71][72][73][74], but also the use of relative entropy D [p(x|θ * ) : p(x|θ)], where θ * represents the vector of the parameters of dimension K * and θ and the vector θ with dimension K ≤ K * .In such cases, even if the two probability laws to be compared have parameters with different dimensions, we can always use the KL [p(x|θ * ) : p(x|θ)] to compare them.The famous criterion of Akaike [75][76][77][78] uses this quantity to determine the optimal order.For a linear parameter model with Gaussian probability laws and likelihood-based methods, there are analytic solutions for it [68].

Entropy in Spectral Analysis
Entropy and MEP have been used in different ways in the spectral analysis problem.It has been an important subject of signal processing for the decades.Here, we are presenting, in a brief way, these different approaches.

Burg's Entropy-Based Method
A classical one is Burg's entropy method [79], which can be summarized as follows: Let X(n) be a stationary, centered process, and assume we have as data a finite number of samples (lags) of its autocorrelation function: The task is then to estimate its power spectral density function: As we can see, due to the fact that we have only the elements of the right-hand for k = −K, • • • , +K, the problem is ill posed.To obtain a probabilistic solution, we may start by assigning a probability law p(x) to the vector X = [X(0), . . ., X(N − 1)] .For this, we can use the principle of maximum entropy (PME) with the data as constraints (54).As these constraints are the second order moments, the PME solution is a Gaussian probability law: N (x|0, R).For a stationary Gaussian process, when the number of samples N −→ ∞, the expression of the entropy becomes: This expression is called Burg's entropy [79].Thus, Burg's method consists of maximizing H subject to the constraints (54).The solution is: where λ = [λ 0 , • • • , λ K ] , the Lagrange multipliers associated with the constraints (54), are here equivalent to the AR modeling of the Gaussian process X(n).
We may note that, in this particular case, we have an analytical expression for λ, which provides the possibility to give an analytical expression for S(ω) as a function of the data {r(k), k = 0, • • • , K}: where We may note that we first used MEP to choose a probability law for X(n).With the prior knowledge that we have second order moments, the MEP results in a Gaussian probability density function.Then, as for a stationary Gaussian process, the expression of the entropy is related to the power spectral density S(ω), and as this is related to the correlation data by a Fourier transform, an ME solution could be computed easily.

Extensions to Burg's Method
The second approach consists of maximizing the relative entropy D [p(x) : p 0 (x)] or minimizing KL [p(x) : p 0 (x)] where p 0 (x) is an a priori law.The choice of the prior is important.Choosing a uniform p 0 (x), we retrieve the previous case [77].
However, choosing a Gaussian law for p 0 (x), the expression to maximize becomes: when N → ∞ and where S 0 (ω) corresponds to the power spectral density of the reference process p 0 (x).Now, the problem becomes: minimize D [p(x) : p 0 (x)] subject to the constraints (54).

Shore and Johnson Approach
Another approach is to decompose first the process X(n) on the Fourier basis {cos kωt, sin kωt}, to consider ω to be the variable of interest and S(ω), normalized properly, to be considered as its probability distribution function.Then, the problem can be reformulated as the determination of the S(ω), which maximizes the entropy: subject to the linear constraints (54).The solution is in the form of: which can be considered as the most uniform power spectral density that satisfies those constraints.

ME in the Mean Approach
In this approach, we consider S(ω) as the expected value Z(ω) for which we have a prior law µ(z), and we are looking to assign p(z), which maximizes the relative entropy D [p(z) : µ(z)] subject to the constraints (54).
When p(z) is determined, the solution is given by: The expression of S(ω) depends on µ(z).When µ(z) is Gaussian, we obtain the Rényi entropy: If we choose a Poisson measure for µ(z), we obtain the Shannon entropy: and if we choose a Lebesgue measure over [0, ∞], we obtain Burg's entropy: When this step is done, the next step becomes maximizing these entropies subject to the constraints of the correlations.The obtained solutions are very different.For more details, see [39,[79][80][81][82][83][84][85].
13. Entropy-Based Methods for Linear Inverse Problems

Linear Inverse Problems
A general way to introduce inverse problems is the following: Infer an unknown signal f (t), image f (x, y) or any multi-variable function f (r) through an observed signal g(t), image g(x, y) or any multi-variable observable function g(s), which are related through an operator H : f → g.This operator can be linear or nonlinear.Here, we consider only linear operators g = Hf : where h(r, s) is the response of the measurement system.Such linear operators are very common in many applications of signal and image processing.We may mention a few examples of them: • Convolution operations g = h * f in 1D (signal): or in 2D (image): • Radon transform (RT) in computed tomography (CT) in the 2D case [86]: • Fourier transform (FT) in the 2D case: which arise in magnetic resonance imaging (MRI), in synthetic aperture radar (SAR) imaging or in microwave and diffraction optical tomography (DOT) [86][87][88][89][90].
No matter the category of the linear transforms, when the problem is discretized, we arrive at the relation: where the errors of modeling and measurement and H the matrix of the system response.

Entropy-Based Methods
Let us consider first the simple no noise case: where H is a matrix of dimensions (M × N ), which is in general singular or very ill conditioned.Even if the cases M > N or M = N may appear easier, they have the same difficulties as those of the underdetermined case M < N that we consider here.In this case, evidently the problem has an infinite number of solutions, and we need to choose one.
Between the numerous methods, we may mention the minimum norm solution, which consists of choosing between all of the possible solutions: the one that has the minimum norm: This optimization problem can be solved easily in this case, and we obtain: In fact, we may choose any other convex criterion Ω(f ) and satisfy the uniqueness of the solution.For example: which can be interpreted as the entropy when f j > 0 and f j = 1, thus considering f j as a probability distribution f j = P (U = u j ).The variable U can correspond (or not) to a physical quantity.Ω(f ) is the entropy associated with this variable.If we consider f j > 0 to represent the power spectral density of a physical quantity, then the entropy becomes: and we can use it as criterion to select a solution to the problem (71).
As we can see, any convex criterion Ω(f ) can be used.Here, we mentioned four of them with different interpretations.
• L 2 or quadratic: which can be interpreted as the Rényi's entropy with q = 2.
• L β : When β < 1 the criterion is not bounded at zero.When β ≥ 1 the criterion is convex.
• Shannon entropy: which has a valid interpretation if 0 < f j < 1, • The Burg entropy: which needs f j > 0.
Unfortunately, only for the first case, there is an analytical solution for the problem, which is f = H (HH )g.For all of the other cases, we may need an optimization algorithm to obtain a numerical solution [91][92][93][94][95].

Maximum Entropy in the Mean Approach
A second approach consists of considering f j = E {U j } or f = E {U } [41,41,42].Again, here, U j or U can, but need not, correspond to some physical quantities.In any case, we now want to assign a probability law p(u) to it.Noting that the data g = Hf = HE {U } = E {HU } can be considered as the constraints on it, we may need again a criterion to determine p(u).Assuming then having some prior µ(u), we may maximize the relative entropy as that criterion.The mathematical problem then becomes: The solution is: where: When p(u) is obtained, we may be interested in computing: which is the required solution.
Interestingly, if we focus on f = E {U }, we will see that its expression depends on the choice of the prior µ(u).When µ(u) is separable: µ(u) = j µ j (u j ), the expression of p(u) will also be separable.
To go a little more into the details, let us introduce s = H λ and define: and its conjugate convex: It can be shown easily that f = E {U } can be obtained either via the dual λ variables: where λ is obtained by: λ = arg min λ or directly: D(λ) is called the dual criterion and F (f ) primal.However, it is not always easy to obtain an analytical expression for G(s) and its gradient G (s).The functions F (f ) and G(s) are conjugate convex.
For the computational aspect, unfortunately, the cases where we may have analytical expressions for Z(λ) or G(s) = ln Z or F (f ) are very limited.However, when there is analytical expressions for them, the computations can be done very easily.In Table 1, we summarizes some of those solutions:

Bayesian Approach for Inverse Problems
In this section, we present in a brief way the Bayesian approach for the inverse problems in signal and image processing.

Simple Bayesian Approach
The different steps to find a solution to an inverse problem using the Bayesian approach can be summarized as follows: • Assign a prior probability law p( ) to the modeling and observation errors, here .From this, find the expression of the likelihood p(g|f , θ 1 ).As an example, consider the Gaussian case: θ 1 in this case is the noise variance v .
• Assign a prior probability law p(f |θ 2 ) to the unknown f to translate your prior knowledge on it.Again, as an example, consider the Gaussian case: θ 2 in this case is the variance v f .
• Apply the Bayes rule to obtain the expression of the posterior law: where the sign ∝ stands for "proportionality to", p(g|f , θ 1 ) is the likelihood, p(f |θ 2 ) the prior model, θ = [θ 1 , θ 2 ] their corresponding parameters (often called the hyper-parameters of the problem) and p(g|θ 1 , θ 2 ) is called the evidence of the model.
For the expressions of likelihood in (90) and the prior in (91), we obtain very easily the expression of the posterior: When the hyper-parameters θ can be fixed a priori, the problem is easy.In practice, we may use some summaries, such as: • EAP or posterior mean (PM): For the Gaussian case of ( 91), the MAP and EAP are the same and can be obtained by noting that: However, in real applications, the computation of even these simple point estimators may need efficient algorithm: • For MAP, we need optimization algorithms, which can handle the huge dimensional criterion J(f ) = − ln p(f |g, θ).Very often, we may be limited to using gradient-based algorithms.

Full Bayesian: Hyperparameter Estimation
When the hyperparameters θ have also to be estimated, a prior p(θ) is assigned to them, and the expression of the joint posterior: is obtained, which can then be used to infer them jointly.Very often, the expression of this joint posterior law is complex, and any computation may become very costly.The VBA methods try to approximate p(f , θ|g) by a simpler distribution, which can be handled more easily.Two particular and extreme cases are: • Bloc separable, such as q(f , θ) = q 1 (f ) q 2 (θ) or • Completely separable, such as q(f , θ) = j q 1j (f j ) k q 2k (θ k ).
Any mixed solution is also valid.For example, the one we have chosen is: Obtaining the expressions of these approximated separable probability laws has to be done via a criterion.The natural criterion with some geometrical interpretation for the probability law manifolds is the Kullback-Leibler (KL) criterion: For hierarchical prior models with hidden variables z, the problem becomes more complex, because we have to give the expression of the joint posterior law: and then approximate it by separable ones: and then use them for estimation.See more discussions in [9,31,38,[108][109][110] In the following, first the general VBA method is detailed for the inference problems with hierarchical prior models.Then, a particular class of prior model (Student t) is considered, and the details of VBA algorithms for that are given.

Basic Algorithms of the Variational Bayesian Approximation
To illustrate the basic ideas and tools, let us consider a vector X and its probability density function p(x), which we want to approximate by q(x) = j q j (x j ).Using the KL criterion: KL [q : p] = q(x) ln q(x) p(x) dx = q(x) ln q(x) dx − q(x) ln p(x) dx = j q j (x j ) ln q j (x j ) dx j − ln p(x) q = j q j (x j ) ln q j (x j ) dx j − q j (x j ) < ln p(x) > q −j dx j where we used the notation: ln p(x) q = q(x) ln p(x) dx and q −j (x) = i =j q i (x i ).
From here, trying to find the solution q i , the basic method is an alternate optimization algorithm: As we can see, the expression of q j (x j ) depends on q i (x i ), i = j.It is not always possible to obtain analytical expressions for q j (x j ).It is however possible to show that, if p(x) is a member of exponential families, then q j (x j ) are also members of exponential families.These iterations then become much simpler, because at each iteration, we need to update the parameters of the exponential families.To go a little more into the details, let us consider some particular simple cases.

Case of Two Gaussian Variables
In the case of two variables x = [x 1 , x 2 ] , we have: As an illustrative example, consider the case where we want to approximate p(x 1 , x 2 ) by q(x 1 , x 2 ) = q 1 (x 1 ) q 2 (x 2 ) to be able to compute the expected values: which need double integrations when p(x 1 , x 2 ) is not separable in its two variables.If we can do that separable approximation, then, we can compute: which needs only 1D integrals.Let us see if ( µ 1 , µ 2 ) will converge to (m 1 , m 2 ).To illustrate this, let us consider the very simple case of the Gaussian: It is then easy to see that q 1 (x 1 ) = N (x 1 | µ 1 , v 1 ) and q 2 (x 2 ) = N (x 2 | µ 2 , v 2 ) and that: with: See [111] for details and where we showed that, initializing the algorithm with µ (0) 1 = 0 and µ (0) 2 = 0, the means converges to the right values m 1 and m 2 , However, we may be careful about the convergence of the variances.

Case of Exponential Families
As we could see, to be able to use such an algorithm in practical cases, we need to be able to compute < ln p(x) > q 2 (x 2 ) and < ln p(x) > q 1 (x 1 ) .Only for a few cases can we can do this analytically.Different algorithms can be obtained depending on the choice of a particular family for q j (x j ) [103,[112][113][114][115][116][117][118][119][120].
To show this, let us consider the exponential family: where θ is a vector of parameter and g(θ) and u(x) are known functions.This parametric exponential family has the following conjugacy property: For a given prior p(θ) in the family: the corresponding posterior: is in the same family.For this family, we have: ln p(x|θ) q = ln g(θ) + θ u(x) q .
It is then easy to show that: q j (x j ) ∝ g(θ) exp θ u(x) q −j (114) which are in the same exponential family.This simplifies greatly the computations, thanks to the fact that, in each iteration, we only need to compute u(x) = u(x) q −j and update the parameters.Now, if we consider: with a prior on θ: and the joint p(x, θ|η, ν) = p(x|θ) p(θ|η, ν), which is not separable in x and θ, and we want to approximate it with the separable q(x, θ) = q 1 (x) q 2 (θ), then we will have: where u = u(x) q 1 (x) .

VBA for the Unsupervised Bayesian Approach to Inverse Problems
Before going into the details and for similarity with the notations in the next sections, we replace x by f , such that now we are trying to approximate p(f , θ) = p(f |θ) p(θ) by a separable q(f , θ) = q 1 (f ) q 2 (θ).Interestingly, depending on the choice of the family laws for q 1 and q 2 , we obtain different algorithms: • q 1 (f ) = δ(f − f ) and q 2 (θ) = δ(θ − θ).In this case, we have: and so: which can be interpreted as an alternate optimization algorithm for obtaining the JMAPestimates: The main drawback here is that the uncertainties of the f are not used for the estimation of θ and the uncertainties of θ are not used for the estimation of f .
Here, the uncertainties of the f are used for the estimation of θ, but the uncertainties of θ are not used for the estimation of f .
• q 1 (f ) = δ(f − f ) and q 2 (θ) is free form.In the same way, this time we obtain: which can be compared with the classical EM algorithm.Here, the uncertainties of the f are used for the estimation of θ, but the uncertainties of θ are not used for the estimation of f .
• Both q 1 (f ) and q 2 (θ) have free form.The main difficulty here is that, at each iteration, the expression of q 1 and q 2 may change.However, if p(f , θ) is in the generalized exponential family, the expressions of q 1 (f ) and q 2 (θ) will also be in the same family, and we have only to update the parameters at each iteration.
17. VBA for a Linear Inverse Problem with Simple Gaussian Priors As a simple example, consider the Gaussian case where p(g|f , θ 1 ) = N (g|Hf , (1/θ 1 )I), p(f |θ 2 ) = N (f |0, (1/θ 2 )I) and p(θ 1 ) = G(θ 1 |α 10 , β 10 ) p(θ 2 ) = G(θ 2 |α 20 , β 20 ), and so, we have: From this expression J(f , θ 1 , θ 2 ) = ln p(f , θ 1 , θ 2 |g), it is easy to obtain the equations of an alternate JMAP algorithm by computing the derivatives of it with respective to its arguments and equating them to zero: From the expression of the joint probability law p(f , θ 1 , θ 2 |g), we can also obtain the expressions of the conditionals: However, obtaining analytical expressions of the marginals p(f |g), p(θ 1 |g) and p(θ 2 |g) is not easy.We can then obtain approximate expressions q 1 (f |g), q 2 (θ 1 |g) and q 3 (θ 2 |g) using the VBA method.For this case, thanks to the conjugacy property, we have: We can then compare the three algorithms in Table 2: It is important to remark that, in JMAP, the computation of f can be done via the optimization of the criterion J(f , θ 1 , θ 2 ) = ln p(f , θ 1 , θ 2 |g), which does not need explicitly the matrix inversion of V = (H H + λI) −1 .However, in BEM and VBA, we need to compute it due to the following requirements: For some extensions and more details, see [111].

Bayesian Variational Approximation with Hierarchical Prior Models
For a linear inverse problem: with an assigned likelihood p(g|f , θ 1 ), when a hierarchical prior model p(f |z, θ 2 ) p(z|θ 3 ) is used and when the estimation of the hyper-parameters θ = [θ 1 , θ 2 , θ 3 ] has to be considered, the joint posterior law of all the unknowns becomes: The main idea behind the VBA is to approximate this joint posterior by a separable one, for example: q(f , z, θ|g) = q 1 (f ) q 2 (z) q 3 (θ) and where the expressions of q(f , z, θ|g) are obtained by minimizing the Kullback-Leibler divergence (99), as explained in previous section.This approach can also be used for model selection based on the evidence of the model ln p(g) [121] where: Interestingly, it is easy to show that: where F(q) is the free energy associated with q defined as: Therefore, for a given model M, minimizing KL [q : p] is equivalent to maximizing F(q) and when optimized, F(q * ) gives a lower bound for ln p(g).Indeed, the name variational approximation is due to the fact that ln p(g) ≥ F(q), and so, F(q) is a lower bound to the evidence ln p(g).
(135) Note that these relations represent an implicit solution for q 1 (f ), q 2 (z) and q 3 (θ), which need, at each iteration, the expression of the expectations in the right hand of exponentials.If p(g|f , z, θ 1 ) is a member of an exponential family and if all of the priors p(f |z, θ 2 ), p(z|θ 3 ), p(θ 1 ), p(θ 2 ) and p(θ 3 ) are conjugate priors, then it is easy to see that these expressions lead to standard distributions for which the required expectations are easily evaluated.In that case, we may note: q(f , z, θ) = q 1 (f | z, θ) q 2 (z| f , θ) q 3 (θ| f , z) where the tilded quantities z, f and θ are, respectively, functions of ( f , θ), ( z, θ) and ( f , z).This means that the expression of q 1 (f | z, θ) depends on ( f , θ), the expression of q 2 (z| f , θ) depends on ( z, θ) and the expression of q 3 (θ| f , z) depends on ( f , z).With this notation, the alternate optimization results in alternate updating of the parameters ( z, θ) of q 1 , the parameters ( f , θ) of q 2 and the parameters ( f , z) of q 3 .Finally, we may note that, to monitor the convergence of the algorithm, we may evaluate the free energy: F(q)= ln p(f , z, θ, g) q − ln q(f , z, θ) q = ln p(g|f , z, θ) q + ln p(f |z, θ) q + ln p(z|θ) q + ln p(θ) q − ln q(f ) q − ln q(z) q − ln q(θ) q . (137) Other decompositions for q(f , z, θ) are also possible.For example: q(f , z, θ) = q 1 (f |z) q 2 (z) q 3 (θ) or even: q(f , z, θ) = j q 1j (f j ) j q 2j (z f j ) l q 3l (θ l ).Here, we consider the first case and give some more details on it.

Bayesian Variational Approximation with Student t Priors
The Student t model is: The Cauchy model is obtained when ν = 1.Knowing that: we can write this model via the positive hidden variables z f j : Now, let us consider the forward model g = Hf + and assign a Gaussian law with unknown variance v i to the noise i , which results in p( ) = N (g|0, V ) with Let us also note and finally, Then, we obtain the following expressions for the VBA: q 2j (z f j ) = G(z f j | α j , β j ) with α j = α 00 + 1/2, β j = β 00 + < f 2 j > /2; q 3 (z i ) = G(z i | α i , β i ) with α i = α 0 + (N + 1)/2, where: We have implemented these algorithms for many linear inverse problems [102], such as periodic components estimation in time series [122] or computed tomography [123], blind deconvolution [124], blind image separation [125,126] and blind image restoration [89].

Conclusions
The main conclusions of this paper can be summarized as follows: • A probability law is a tool for representing our state of knowledge about a quantity.
• The Bayes or Laplace rule is an inference tool for updating our state of knowledge about an inaccessible quantity when another accessible, related quantity is observed.
• Entropy is a measure of information content in a variable with a given probability law.
• The maximum entropy principle can be used to assign a probability law to a quantity when the available information about it is in the form of a limited number of constraints on that probability law.
• Relative entropy and Kullback-Leibler divergence are tools for updating probability laws in the same context.
• When a parametric probability law is assigned to a quantity and we want to measure the amount of information gain about the parameters when some direct observations of that quantity is available, we can use the Fisher information.The structure of the Fisher information geometry in the space of parameters is derived from the relative entropy by a second order Taylor series approximation.
• All of these rules and tools are used currently in different ways in data and signal processing.In this paper, a few examples of the ways these tools are used in data and signal processing problems are presented.One main conclusion is that each of these tools has to be used in appropriate contexts.The example in spectral estimation shows that it is very important to define the problems very clearly at the beginning and to use appropriate tools and interpret the results appropriately.
• The Laplacian or Bayesian inference is the appropriate tool for proposing satisfactory solutions to inverse problems.Indeed, the expression of the posterior probability law represents the combination of the state of the knowledge in the forward model and the data and the state of the knowledge before using the data.
• The Bayesian approach can also easily be used to propose unsupervised methods for the practical application of these methods.
• One of the main limitation of those sophisticated methods is the computational cost.For this, we proposed to use VBA as an alternative to MCMC methods to propose realistic algorithms in huge dimensional inverse problems where we want to estimate an unknown signal (1D), image (2D), volume (3D) or even more (3D + time or 3D + wavelength), etc.