*Entropy* **2014**, *16*(2), 1047-1069; doi:10.3390/e16021047

Published: 19 February 2014

## Abstract

In 1960, Rudolf E. Kalman created what is known as the Kalman filter, which is a way to estimate unknown variables from noisy measurements. The algorithm follows the logic that if the previous state of the system is known, it could be used as the best guess for the current state. This information is first applied a priori to any measurement by using it in the underlying dynamics of the system. Second, measurements of the unknown variables are taken. These two pieces of information are taken into account to determine the current state of the system. Bayesian inference is specifically designed to accommodate the problem of updating what we think of the world based on partial or uncertain information. In this paper, we present a derivation of the general Bayesian filter, then adapt it for Markov systems. A simple example is shown for pedagogical purposes. We also show that by using the Kalman assumptions or “constraints”, we can arrive at the Kalman filter using the method of maximum (relative) entropy (MrE), which goes beyond Bayesian methods. Finally, we derive a generalized, nonlinear filter using MrE, where the original Kalman filter is a special case. We further show that the variable relationship can be any function, and thus, approximations, such as the extended Kalman filter, the unscented Kalman filter and other Kalman variants are special cases as well.

## 1. Introduction

In 1960, Rudolf E. Kalman demonstrated an ingenious way to estimate unknown variables from noisy measurements [1]. He did this by including information about the underlying dynamical system that governed the variables under consideration. With this, the optimal state of the system was inferred. To do this, he had two main assumptions: (1) all noise was Gaussian or normal and linearly additive; (2) the dynamical system was linear. The result was what is known as the Kalman filter.

Essentially, the algorithm follows the logic that if the previous state of the system is known, it could be used as the best guess for the current state. This information is used in two ways. First, prior to any measurement, the underlying dynamics of the system may be known; given this knowledge and the previous state, the new state can be predicted. Second, measurements of the unknown variables are taken. These two sources of information may conflict. Which one should we believe? The answer is that we should believe them both, with some uncertainty. They should both be taken into account to determine what our new belief is for the state or what the values of the variables are after the measurements.

Bayesian inference is specifically designed to accommodate the problem of updating what we think of the world based on partial or uncertain information. It is well known that the Kalman filter is a special case of Bayesian inference [2]. We present our own derivation of the general Bayesian filter, then adapt it for Markov systems. A simple example is shown for pedagogical purposes with emphasis on the construction of the Kalman gain. Besides offering a greater pedagogical understanding of the algorithm, this also offers an insight into what to do when the Kalman assumptions are not valid or no longer apply. This allows the enhancement of sophisticated solutions, such as the extended Kalman filter, the unscented Kalman filter, etc. [3].

However, Bayes rule does not assign probabilities; it is only a rule to manipulate them. The MaxEnt method [4,5] was designed to assign probabilities. This method has evolved to a more general method, the method of maximum (relative) entropy (MrE) [6,7], which has the advantage of not only assigning probabilities but updating them when new information is given in the form of constraints on the family of allowed posteriors. The main purpose of this paper is to show both general and specific examples of how the MrE can be applied using data and moment constraints. It should also be noted that Bayes’ rule is a special case of MrE. This implies that MrE is capable of producing every aspect of orthodox Bayesian inference and proves the complete compatibility of the Bayesian and entropy methods. Further, it opens the door to tackling problems that could not be addressed by either the MaxEnt or orthodox Bayesian methods individually; problems in which one has data and moment constraints. Thus, Kalman filters can be seen as a special case of filters developed using the MrE methods with Kalman assumptions.

In this paper, we will show several things. First, we derive the general Bayesian filter for the kind of problems that the Kalman filter is intended for, i.e., Markov systems. Second, we show a simple example illustrating, for pedagogical purposes, that the Kalman filter is a special case of the general Bayesian filter. Third, we show that using the Kalman assumptions or “constraints”, we can arrive at the Kalman filter from MrE directly. Finally, we show how the same Kalman logic can be applied to non-linear dynamical systems using Bayes’ rule, avoiding the approximations that are usually applied in the extended Kalman filter and the unscented Kalman filter.

## 2. Bayesian Filter

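Here, we will build the Bayesian filter. We start with Bayes’ rule,

$$p(x_k|Y_k) = \frac{p(x_k)\,p(Y_k|x_k)}{p(Y_k)},$$

where p(x_{k}) is our prior, p(Y_{k}|x_{k}) is our likelihood, p(x_{k}|Y_{k}) is the posterior, Y_{k} = {y_{1}, …, y_{k}} are our measurements and x_{k} is some unknown variable that we would like to infer. We can split Y_{k} into two sets, y_{k} and Y_{k−1}, where Y_{k−1} = {y_{k−1}, …, y_{1}}, which would give us,

$$p(x_k|y_k,Y_{k-1}) = \frac{p(x_k)\,p(y_k,Y_{k-1}|x_k)}{p(y_k,Y_{k-1})}.$$

Applying the product rule to both the likelihood and the normalization, this can be written as,

$$p(x_k|y_k,Y_{k-1}) = \frac{p(x_k)\,p(Y_{k-1}|x_k)}{p(Y_{k-1})}\,\frac{p(y_k|x_k,Y_{k-1})}{p(y_k|Y_{k-1})},$$

where the first factor, p(x_{k}|Y_{k−1}) = p(x_{k}) p(Y_{k−1}|x_{k})/p(Y_{k−1}), is the prior for x_{k}, given all the other measurements; in other words, all the Bayesian updating on x_{k} prior to measuring y_{k}. This yields,

$$p(x_k|Y_k) = \frac{p(x_k|Y_{k-1})\,p(y_k|x_k,Y_{k-1})}{p(y_k|Y_{k-1})}.$$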

At this point, we come to our first key assumption for Kalman; if x_{k} is “complete” [9], then y_{k} is not conditionally dependent on Y_{k−1} or p(y_{k}|x_{k}, Y_{k−1}) = p(y_{k}|x_{k}). In other words, if we have, x_{k}, then we do not need Y_{k−1} to determine the probability of y_{k}. This then yields,

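$$p(x_k|Y_k) = \frac{p(x_k|Y_{k-1})\,p(y_k|x_k)}{p(y_k|Y_{k-1})},$$

where the prior, p(x_{k}|Y_{k−1}), is not known. However, it can be seen as a marginal,

$$p(x_k|Y_{k-1}) = \int p(x_k|x_{k-1})\,p(x_{k-1}|Y_{k-1})\,dx_{k-1},$$

where x_{k−1} is the previous state and p(x_{k−1}|Y_{k−1}) is the previous state posterior. This completes the typical recursive Bayesian filter with the “complete” or Markov assumption.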
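The recursion above (marginalize over the previous state, then apply Bayes’ rule with the new measurement) can be illustrated numerically. The following is a minimal sketch on a one-dimensional grid, not an implementation from this paper; the function names, the grid, the Gaussian transition and likelihood densities and the three measurement values are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian(z, mean, sigma):
    """Normal density evaluated elementwise."""
    return np.exp(-0.5 * ((z - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bayes_filter_step(grid, posterior_prev, y, transition_pdf, likelihood_pdf):
    """One recursion: marginalize over x_{k-1}, then apply Bayes' rule with y_k."""
    dx = grid[1] - grid[0]
    # transition[i, j] approximates p(x_k = grid[i] | x_{k-1} = grid[j])
    transition = transition_pdf(grid[:, None], grid[None, :])
    prior = transition @ posterior_prev * dx           # p(x_k | Y_{k-1})
    unnormalized = prior * likelihood_pdf(y, grid)     # prior times p(y_k | x_k)
    evidence = unnormalized.sum() * dx                 # p(y_k | Y_{k-1})
    return unnormalized / evidence

# Illustrative (assumed) model: x_k = x_{k-1} + 1 + noise, y_k = x_k + noise
grid = np.linspace(-10.0, 10.0, 801)
posterior = gaussian(grid, 0.0, 1.0)                   # initial belief
transition_pdf = lambda x_new, x_old: gaussian(x_new, x_old + 1.0, 0.5)
likelihood_pdf = lambda y, x: gaussian(y, x, 0.8)

for y in [1.2, 1.9, 3.1]:
    posterior = bayes_filter_step(grid, posterior, y, transition_pdf, likelihood_pdf)

posterior_mean = float(np.sum(grid * posterior) * (grid[1] - grid[0]))
```

Because both densities in this sketch are Gaussian and the model is linear, the grid posterior should agree with the Kalman recursion to within discretization error; with other choices of `transition_pdf` or `likelihood_pdf` the same function still applies, at the cost of a grid that grows quickly with dimension.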

## 3. Kalman Filter

The second key assumption of the Kalman filter is that we assume that we do not have the past measurements, Y_{k−1}, when trying to determine our belief for x_{k}. This means that we need a form for our prior that allows us not to use past measurements. However, the previous value for the state, x_{k−1}, is known.

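Now, we will include the main Kalman assumptions above, first that all noise is Gaussian and linearly additive. Therefore, we will use Gaussians for our density distributions,

$$p(x_k|Y_{k-1}) \propto \exp\left[-\frac{(x_k-\langle x_k\rangle)^2}{2\sigma_x^2}\right], \qquad p(y_k|x_k) \propto \exp\left[-\frac{(y_k-\langle y_k\rangle)^2}{2\sigma_y^2}\right],$$

where 〈y_{k}〉 and 〈x_{k}〉 are the means of y_{k} and x_{k} and ${\sigma}_{y}^{2}$ and ${\sigma}_{x}^{2}$ are the variances of each variable, respectively. Note, for illustration purposes, we limit ourselves to one variable. In later sections, we will include multiple variables. Thus, the posterior that we are looking for is,

$$p(x_k|Y_k) \propto \exp\left[-\frac{(x_k-\langle x_k\rangle)^2}{2\sigma_x^2}-\frac{(y_k-\langle y_k\rangle)^2}{2\sigma_y^2}\right],$$

where the constant of proportionality is fixed by the normalization, p(y_{k}|Y_{k−1}).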

The next question is deciding on the value of the means, since we are not inferring those, as can be seen from the posterior. For this, we have to look at the “forward” problems for each density function. For the prior, the forward problem is x_{k} = F_{k,k−1}x_{k−1} + η_{k−1}, where F_{k,k−1} is called the “transition matrix”, and for the likelihood, it is y_{k} = G_{k}x_{k} + ε_{k}, where G_{k} is called the “measurement matrix” [10]. Each has Gaussian noise, η_{k−1} and ε_{k}, respectively, with zero mean. Therefore, for the prior, we have 〈x_{k}〉 = F_{k,k−1}x_{k−1}, so that,

$$p(x_k|Y_{k-1}) \propto \exp\left[-\frac{(x_k-F_{k,k-1}x_{k-1})^2}{2\sigma_x^2}\right].$$

For the likelihood, we have 〈y_{k}〉 = G_{k}x_{k}, so that,

$$p(y_k|x_k) \propto \exp\left[-\frac{(y_k-G_k x_k)^2}{2\sigma_y^2}\right].$$

Note, this is similar to least squares (which itself is a special case of Bayes). There is one more obvious question to be answered: while this may be a solution for the density function in regards to x_{k}, how do we get x_{k−1}? We need a single number. The answer depends on what is considered the “best guess” or point estimate for x_{k−1}. There are many choices, such as the mean, median or mode. However, since we are dealing with a symmetric solution, they are one and the same. Therefore, the easiest point estimate to get is the mode, x̂, where,

$$\hat{x}_k = \arg\max_{x_k}\,p(x_k|Y_k),$$

so that for the previous state we use x̂_{k−1}, which is the mode from the previous step.

#### 3.1. A Simple Example

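To show its processing workflow, we show a very simple example. We wish to know our 1D location given the known dynamical system and a measurement at each time step. First, we let G_{k} = 1. Then, we apply the dynamical equation,

$$x_k = x_{k-1} + v_0\,\Delta t + \eta_{k-1},$$

where v_{0} is a known velocity constant and Δt is a known time step. It should be noted, as well, that we can write this in terms of the noise,

$$\eta_{k-1} = x_k - x_{k-1} - v_0\,\Delta t.$$

With G_{k} = 1, the measurement is y_{k} = x_{k} + ε_{k}, so we will write for our case,

$$p(x_k|Y_k) \propto \exp\left[-\frac{(x_k-x_{k-1}-v_0\Delta t)^2}{2\sigma_x^2}-\frac{(y_k-x_k)^2}{2\sigma_y^2}\right].$$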

One last question that needs to be addressed is the value of x_{k−1}. The last assumption of the Kalman filter is that the MAP estimate from the previous step is the best estimate for this value. This is a key assumption and not trivial, as it means that, in general, information would be lost with each iterative step. This is not the case for the Kalman filter, since the Gaussian is assumed, and therefore, the mode and the variance uniquely identify the distribution.

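These solutions can be manipulated and written in the following form, as well,

$$\hat{x}_k = \langle x_k\rangle + K_k\left(y_k-\langle x_k\rangle\right), \qquad \langle x_k\rangle = \hat{x}_{k-1}+v_0\,\Delta t,$$

where K_{k} is what is known as the Kalman “gain”, which for this example is,

$$K_k = \frac{\sigma_x^2}{\sigma_x^2+\sigma_y^2}.$$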
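For the scalar example with G_{k} = 1, maximizing the product of the Gaussian prior and the Gaussian likelihood gives a precision-weighted average of the predicted position and the measurement, which is the same thing as the gain form. The sketch below is a hedged illustration of one such step; the function name `kalman_1d_step` and the numerical values are hypothetical, and `var_x` and `var_y` stand for ${\sigma}_{x}^{2}$ and ${\sigma}_{y}^{2}$.

```python
def kalman_1d_step(x_prev, y, v0, dt, var_x, var_y):
    """One step of the scalar example with G_k = 1.

    var_x: variance of the prior around its predicted mean (assumed known)
    var_y: variance of the measurement noise (assumed known)
    """
    x_pred = x_prev + v0 * dt                  # prior mean <x_k>
    gain = var_x / (var_x + var_y)             # Kalman gain for this example
    x_new = x_pred + gain * (y - x_pred)       # posterior mode (= mean)
    var_new = (1.0 - gain) * var_x             # posterior variance
    return x_new, var_new, gain

# Hypothetical numbers: previous estimate 0, unit velocity and time step
x_new, var_new, gain = kalman_1d_step(x_prev=0.0, y=1.3, v0=1.0, dt=1.0,
                                      var_x=0.5, var_y=0.5)

# The same mode written as a precision-weighted average
x_weighted = (0.5 * 1.0 + 0.5 * 1.3) / (0.5 + 0.5)
```

With equal variances the gain is one half, so the estimate lands midway between the prediction and the measurement; as `var_y` grows, the gain shrinks and the estimate stays closer to the prediction.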

## 4. Maximum Relative Entropy

First, we present a review of maximum relative entropy. For a more detailed discussion and several examples, please see [6,11]. Our first concern when using the MrE method to update from a prior to a posterior distribution is to define the space in which the search for the posterior will be conducted. We wish to infer something about the values of one or several quantities, θ ∈ Θ, on the basis of three pieces of information: prior information about θ (the prior), the known relationship between x and θ (the model) and the observed values of the data x ∈ 𝒳. Since we are concerned with both x and θ, the relevant space is neither 𝒳 nor Θ, but the product 𝒳 × Θ, and our attention must be focused on the joint distribution, P (x, θ). The selected joint posterior, P_{new}(x, θ), is that which maximizes the entropy (in MrE terminology, we “maximize” the negative relative entropy, S, so that S ≤ 0. This is the same as minimizing the relative entropy),

$$S[P,P_{\mathrm{old}}] = -\int dx\,d\theta\;P(x,\theta)\,\log\frac{P(x,\theta)}{P_{\mathrm{old}}(x,\theta)}, \tag{26}$$

where P_{old}(x, θ) contains our prior information, which we call the joint prior. To be explicit,

$$P_{\mathrm{old}}(x,\theta) = P_{\mathrm{old}}(\theta)\,P_{\mathrm{old}}(x|\theta),$$

where P_{old}(θ) is the traditional Bayesian prior and P_{old}(x|θ) is the likelihood. It is important to note that they both contain prior information. The Bayesian prior is defined as containing prior information. However, the likelihood is not traditionally thought of in terms of prior information. Of course, it is reasonable to see it as such, because the likelihood represents the model (the relationship between θ and x) that has already been established. Thus, we consider both pieces, the Bayesian prior and the likelihood, to be prior information.

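The new information is the observed data, x′, which in the MrE framework must be expressed in the form of a constraint on the allowed posteriors. The family of posteriors that reflects the fact that x is now known to be x′ is such that,

$$P(x) = \int d\theta\;P(x,\theta) = \delta(x-x'). \tag{28}$$

This amounts to an infinite number of constraints, one for each value of x, each with its own Lagrange multiplier, λ(x). In addition, we impose normalization,

$$\int dx\,d\theta\;P(x,\theta) = 1, \tag{29}$$

and include additional information in the form of a moment constraint on some function, f(θ),

$$\int dx\,d\theta\;P(x,\theta)\,f(\theta) = \langle f(\theta)\rangle = F. \tag{30}$$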

We proceed by maximizing Equation (26) subject to the above constraints. The purpose of maximizing the entropy is to determine the P that is closest to P_{old} given the constraints (in the absence of constraints, the maximum, S = 0, is attained at P = P_{old}). The calculus of variations is used to do this by varying P → P + δP and setting the variation of the constrained entropy to zero. The Lagrange multipliers, α, β and λ(x), are used so that the P that is chosen satisfies the constraint equations. Their actual values are determined by the value of the constraints themselves. We now provide the detailed steps in this maximization process.

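First, we set up the variational form with the Lagrange multipliers,

$$\delta\left\{S[P,P_{\mathrm{old}}] + \alpha\left[\int dx\,d\theta\,P(x,\theta)-1\right] + \beta\left[\int dx\,d\theta\,P(x,\theta)f(\theta)-F\right] + \int dx\,\lambda(x)\left[\int d\theta\,P(x,\theta)-\delta(x-x')\right]\right\} = 0,$$

where f(θ) is the function being constrained, F is its expected value and x′ is the observed datum. Carrying out the variation with respect to P(x, θ) gives −log[P(x, θ)/P_{old}(x, θ)] − 1 + α + βf(θ) + λ(x) = 0, so that the selected posterior is,

$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta)\,e^{(-1+\alpha+\beta f(\theta)+\lambda(x))}. \tag{35}$$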

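In order to determine the Lagrange multipliers, we substitute our solution Equation (35) into the various constraint equations. The constant α is eliminated by substituting Equation (35) into Equation (29),

$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta)\,\frac{e^{\lambda(x)+\beta f(\theta)}}{Z}, \qquad Z = \int dx\,d\theta\;e^{\beta f(\theta)+\lambda(x)}\,P_{\mathrm{old}}(x,\theta) = e^{-(-1+\alpha)},$$

The multipliers λ(x) are then determined by substituting this result into the data constraint, which gives,

$$P_{\mathrm{new}}(x,\theta) = \frac{P_{\mathrm{old}}(x,\theta)\,\delta(x-x')\,e^{\beta f(\theta)}}{\zeta(x,\beta)}, \tag{42}$$

where ζ(x, β) = ∫ e^{βf(θ)} P_{old}(x, θ) dθ.

The Lagrange multiplier β is determined by first substituting Equation (42) into Equation (30),

$$\int dx\left[\int d\theta\,\frac{e^{\beta f(\theta)}P_{\mathrm{old}}(x,\theta)\,f(\theta)}{\zeta(x,\beta)}\right]\delta(x-x') = F,$$

which can be rewritten as,

$$\int d\theta\,\frac{e^{\beta f(\theta)}P_{\mathrm{old}}(x',\theta)\,f(\theta)}{\zeta(x',\beta)} = F, \tag{44}$$

where ζ(x′, β) = ∫ e^{βf(θ)} P_{old}(x′, θ) dθ. Now β can be determined rewriting Equation (44) as

$$\frac{\partial \ln\zeta(x',\beta)}{\partial\beta} = F.$$

The final step is to marginalize the posterior, P_{new}(x, θ), over x to get our updated probability,

$$P_{\mathrm{new}}(\theta) = \frac{P_{\mathrm{old}}(x',\theta)\,e^{\beta f(\theta)}}{\zeta(x',\beta)}.$$

Additionally, this result can be rewritten using the product rule as,

$$P_{\mathrm{new}}(\theta) = \frac{P_{\mathrm{old}}(\theta)\,P_{\mathrm{old}}(x'|\theta)\,e^{\beta f(\theta)}}{\zeta(x',\beta)},$$

where ζ(x′, β) = ∫ e^{βf(θ)} P_{old}(θ) P_{old}(x′|θ) dθ. The right side resembles Bayes theorem, where the term P_{old}(x′|θ) is the standard Bayesian likelihood and P_{old}(θ) is the prior. The exponential term is a modification to these two terms. In an effort to put some names to these pieces, we will call the standard Bayesian likelihood the likelihood and the exponential part the likelihood modifier, so that the product of the two gives the modified likelihood. The denominator is the normalization or marginal modified likelihood. Notice when β = 0 (no moment constraint) we recover Bayes’ rule. For β ≠ 0 Bayes’ rule is modified by a “canonical” exponential factor.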
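The structure of this result (prior times likelihood times an exponential modifier, with β fixed by the moment constraint) can be checked numerically. The sketch below is an illustration under stated assumptions rather than code from this paper: θ is discretized on a grid so that integrals become sums, the data are a hypothetical 7 successes in 10 Bernoulli trials, the constrained function is f(θ) = θ with an assumed target value of 0.5, and β is found by simple bisection, which works because the posterior mean of f increases monotonically with β.

```python
import numpy as np

def mre_posterior(prior, likelihood, f, beta):
    """P_new(theta) proportional to P_old(theta) P_old(x'|theta) exp(beta f(theta))."""
    log_w = np.log(prior) + np.log(likelihood) + beta * f
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()

def solve_beta(prior, likelihood, f, target, lo=-200.0, hi=200.0, iters=200):
    """Bisection on beta so that the posterior expectation of f equals the target."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_f = np.sum(f * mre_posterior(prior, likelihood, f, mid))
        if mean_f < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta = np.linspace(0.01, 0.99, 99)               # grid over the parameter
prior = np.full(theta.size, 1.0 / theta.size)     # flat prior (assumed)
likelihood = theta**7 * (1.0 - theta)**3          # hypothetical data: 7 of 10
f = theta                                         # constrained function f(theta)

beta = solve_beta(prior, likelihood, f, target=0.5)
posterior_mre = mre_posterior(prior, likelihood, f, beta)
posterior_bayes = mre_posterior(prior, likelihood, f, 0.0)   # beta = 0
```

Setting `beta` to zero reproduces the ordinary Bayesian posterior, while the solved value of `beta` tilts that posterior just enough for its mean to satisfy the moment constraint.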

## 5. Maximum Relative Entropy and Kalman

There are works where entropy maximization is being used in Kalman filtering [12–14]. For example, [14] uses entropy maximization as one of the approximation tools to reduce uncertainty. Here, we show that if the same assumptions are taken into account, the explicit closed form solution is derived. So, a numerical comparison, as in [14], becomes unnecessary and even limited. To the best of our knowledge, there is no work that shows a direct link between the original Kalman filter and maximization of the relative entropy and produces the closed form solution. We will now present a more complicated example illustrating the maximum relative entropy (MrE) solution and discuss the assumptions and constraints that lead to the same closed form Kalman filter solution.

This example consists of analyzing a linear system composed of two equations that represent linear motion with constant acceleration c_{a,k}. The dynamics of the velocity, v_{k}, and the position, x_{k}, are encoded in the following two relationships, which we assume are relevant for predicting the linear motion of the next state,

$$x_k = x_{k-1} + v_{k-1}\,\Delta t + \frac{c_{a,k}\,\Delta t^2}{2}, \tag{48}$$

$$v_k = v_{k-1} + c_{a,k}\,\Delta t. \tag{49}$$

Here, we will derive the so-called “prediction step”, which will be the posterior that maximizes the following optimization criterion, or entropy, which has the form,

$$S = -\int\!\!\int \bar{P}_k(x_k,v_k)\,\log\frac{\bar{P}_k(x_k,v_k)}{P_{\mathrm{prior},k}(x_k,v_k)}\,dx_k\,dv_k, \tag{50}$$

where x_{k} is the position, v_{k} is the velocity, P_{prior,k}(x_{k}, v_{k}) is the prior probability distribution function (which is sometimes a uniform distribution for the first sample) and P̄_{k}(x_{k}, v_{k}) is the posterior distribution to be found as a result of the first Kalman filter step, which is also called a “prediction” step [9]. In fact, it is the object that we are deriving, P̄_{k}(x_{k}, v_{k}), that will be the prior for our Bayesian filter and, in the special case, the Kalman filter, such as Equation (8).

All constraints come from the same Kalman filter assumptions. We derive the first constraint using Equation (48), which is the variance,

$$\int_{-\infty}^{\infty}\!\cdots\!\int_{-\infty}^{\infty}\left(x_k-\langle x_k\rangle\right)^2\,\bar{P}_k(x_k,v_k,\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_k,$$

where Ω_{k} = dx_{k} dv_{k} dη_{x,k−1} dη_{v,k−1} dη_{a,k−1}, and where,

$$x_k-\langle x_k\rangle = \eta_{x,k-1}+\eta_{v,k-1}\,\Delta t+\frac{\eta_{a,k-1}\,\Delta t^2}{2}$$

is the noise term of the model. Note that this is effectively following the note mentioned in conjunction with Equation (19). Here, x̂_{k−1} and v̂_{k−1} are estimates of our variables from the previous discretization interval, and η_{x,k−1}, η_{v,k−1}, η_{a,k−1} are additive noise variables with a multivariate normal distribution, which have means of zero. Frequently, the joint prior distribution of noise variables is defined by four main assumptions:

- (1) The means of all noise variables are zero;

- (2) The joint distribution function is a multivariate normal distribution;

- (3) The covariance matrix is not only valid for the previous posterior distribution discretization interval, but also for the current posterior distribution discretization interval, which in our case is P̄(x_{k}, v_{k}, η_{x,k−1}, η_{v,k−1}, η_{a,k−1}). In other words, it is implied that in our specific case, we have the following equalities,

  $$\begin{array}{c}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{x,k-1}^{2}\,\overline{P}_{k}(x_{k},v_{k},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k}=\\ \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{x,k-1}^{2}\,P_{k-1}(x_{k-1},v_{k-1},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k-1}=\sigma_{x,k-1}^{2},\end{array}$$

  where Ω_{k−1} = dx_{k−1} dv_{k−1} dη_{x,k−1} dη_{v,k−1} dη_{a,k−1},

  $$\begin{array}{c}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{v,k-1}^{2}\,\overline{P}_{k}(x_{k},v_{k},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k}=\\ \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{v,k-1}^{2}\,P_{k-1}(x_{k-1},v_{k-1},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k-1}=\sigma_{v,k-1}^{2},\end{array}$$

  $$\begin{array}{c}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{x,k-1}\eta_{v,k-1}\,\overline{P}_{k}(x_{k},v_{k},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k}=\\ \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{x,k-1}\eta_{v,k-1}\,P_{k-1}(x_{k-1},v_{k-1},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k-1}=\mathit{cov}_{x,v,k-1},\end{array}$$

  $$\begin{array}{c}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{a,k-1}^{2}\,\overline{P}_{k}(x_{k},v_{k},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k}=\\ \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\eta_{a,k-1}^{2}\,P_{k-1}(x_{k-1},v_{k-1},\eta_{x,k-1},\eta_{v,k-1},\eta_{a,k-1})\,\Omega_{k-1}=\sigma_{a,k-1}^{2},\end{array}$$

  where the variances and covariances are usually taken from the inference result of the previous discretization interval or are set with initial guesses.

- (4) The last, but not the least, assumption in Kalman filtering is that our noise variables are independent from our main state variables, i.e., x_{k} and v_{k}. The main benefit of this assumption is that we can manipulate Equation (53) by applying the constraints in Equations (55)–(58). Keep in mind that the means of the noise variables are zero, so many additive terms will zero out after integration. Therefore, we finally get Equation (53) in the form of,

  $$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(x_{k}-\langle x_{k}\rangle\right)^{2}\,\overline{P}_{k}(x_{k},v_{k})\,dx_{k}\,dv_{k}=\sigma_{x,k-1}^{2}+2\,\mathit{cov}_{x,v,k-1}\,\Delta t+\sigma_{v,k-1}^{2}\,\Delta t^{2}+\frac{\sigma_{a,k-1}^{2}\,\Delta t^{4}}{4},$$

  where,

  $$\langle x_{k}\rangle=\widehat{x}_{k-1}+\Delta t\,\widehat{v}_{k-1}+\frac{c_{a,k}\,\Delta t^{2}}{2}.$$

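Similarly, we can construct two other MrE constraints based on Kalman filter assumptions as,

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(v_k-\langle v_k\rangle\right)^2\,\bar{P}_k(x_k,v_k)\,dx_k\,dv_k = \sigma_{v,k-1}^2+\sigma_{a,k-1}^2\,\Delta t^2,$$

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\left(x_k-\langle x_k\rangle\right)\left(v_k-\langle v_k\rangle\right)\,\bar{P}_k(x_k,v_k)\,dx_k\,dv_k = \mathit{cov}_{x,v,k-1}+\sigma_{v,k-1}^2\,\Delta t+\frac{\sigma_{a,k-1}^2\,\Delta t^3}{2},$$

where P_{prior,k}(x_{k}, v_{k}) is a uniform distribution. In other words, the distribution that maximizes Equation (50) with constraint Equations (59), (62) and (64) is the prediction step function. This solution yields the same answer as the prediction step in the Kalman filter, where P̄_{k}(x_{k}, v_{k}) is the unknown posterior distribution function being constrained. This distribution is simply a multivariate normal distribution with a covariance matrix containing elements defined by the scalar values on the right side of Equations (59), (62) and (64), and the means defined by our mathematical model, which are,

$$\langle x_k\rangle = \hat{x}_{k-1}+\Delta t\,\hat{v}_{k-1}+\frac{c_{a,k}\,\Delta t^2}{2}, \qquad \langle v_k\rangle = \hat{v}_{k-1}+c_{a,k}\,\Delta t.$$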

To be clear, this “posterior” Equation (70) is effectively the “prior” Equation (8) that would be in the Kalman filter.
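The right-hand sides of the moment constraints can be compared with the familiar matrix form of the Kalman prediction step. The sketch below assumes the constant-acceleration noise model in which the acceleration noise enters the position through Δt²/2 and the velocity through Δt and is uncorrelated with the other two noise terms; under that assumption the variance of v_{k} and the covariance of x_{k} and v_{k} take the forms written in the code. All numerical values and variable names are illustrative.

```python
import numpy as np

dt = 0.1
var_x, var_v, cov_xv, var_a = 0.04, 0.09, 0.01, 0.25   # illustrative values

# Right-hand sides of the moment constraints, written out term by term
pred_var_x = var_x + 2.0 * cov_xv * dt + var_v * dt**2 + var_a * dt**4 / 4.0
pred_var_v = var_v + var_a * dt**2
pred_cov_xv = cov_xv + var_v * dt + var_a * dt**3 / 2.0

# Matrix form of the same prediction step
F = np.array([[1.0, dt],
              [0.0, 1.0]])                    # transition matrix
g = np.array([[dt**2 / 2.0],
              [dt]])                          # how acceleration noise enters
P_prev = np.array([[var_x, cov_xv],
                   [cov_xv, var_v]])          # previous covariance
P_pred = F @ P_prev @ F.T + var_a * (g @ g.T)

# Predicted means from the model, for assumed previous estimates
x_hat_prev, v_hat_prev, c_a = 1.0, 0.5, 2.0
mean_pred = F @ np.array([x_hat_prev, v_hat_prev]) + c_a * g.ravel()
```

The three scalars match the entries of `P_pred`, which is one way to see that the maximum entropy distribution with these constraints is the same Gaussian that the Kalman prediction step produces.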

#### 5.1. Kalman Filter’s Updating Step

Our focus here is the traditional updating step of the Kalman filter and its reproduction by MrE. The measurement distribution needed would be obtained in a similar manner as in the predictive step, using constraints on the measurement together with a data constraint in which y′_{k} is the observation value, analogous to Equation (28). Using these pieces, we get a likelihood that is a function of x_{k} and v_{k} only; the reason being that, following the previous section, the y_{k} in P_{likelihood,k}(y_{k}, v_{k}) was replaced with the observed value, y′_{k}, through the data constraint (Dirac delta function) and the mean of y_{k} is a function of x_{k}.

Therefore, the joint posterior that is analogous to Equation (16) would be proportional to the product of the prediction, P̄_{k}(x_{k}, v_{k}), and this likelihood, where c_{a,k} = c_{a,k−1}, i.e., acceleration is constant in this problem. The iterative nature of the Kalman filter comes into effect by assuming that P_{prior,k+1}(x_{k+1}, v_{k+1}) = P_{prior,k+1}(x_{k}, v_{k}) = P_{k}(x_{k}, v_{k}). In other words, the previous discretization interval’s posterior function is the prior for the current discretization interval.

#### 5.2. Kalman Filter Revisited

We will now present the solution of the Kalman filter that is the same closed form solution as in the previous subsection. First, we need to construct our problem in matrix form.

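The mathematical model or our state space system, as in Equation (48) and Equation (49), has the following matrix representation:

$$\bar{\mathbf{x}}_k = \begin{pmatrix}x_k\\ v_k\end{pmatrix} = \begin{pmatrix}1 & \Delta t\\ 0 & 1\end{pmatrix}\begin{pmatrix}x_{k-1}\\ v_{k-1}\end{pmatrix} + \begin{pmatrix}\Delta t^2/2\\ \Delta t\end{pmatrix}c_{a,k},$$

where **x̄**_{k} is a vector that has coordinates representing our position and velocity variables. The transition matrix and additive term matrices are, respectively, the 2 × 2 matrix and the column vector multiplying c_{a,k} above. The remaining ingredients are the noise vector, its covariance matrix, R_{k}, and the measurement vector, y_{k}.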

Sometimes, there are discussions [15] about the numerical stability of the Kalman filter, and a different, simplified version of the covariance matrix, Equation (97), is used [9].

Summarizing, if our state space system and its corresponding transition matrix is of such a size and/or sparsity that we can get its inverse matrix analytically in its reduced solution without numerical iterations, then selecting which version (Joseph’s stabilized or original) does not matter, because the closed form and the numeric answer would be the same. From a practical point of view, the maximum relative entropy method might be particularly useful for loosely coupled systems (not necessarily small), because its complexity (the total number of Lagrange multipliers) is equal to the total number of transition equations, and it has no explicit difficulty in calculating inverse matrices, because of the variational techniques used.

The fact is that both Equation (97) and Equation (99) return exactly the same closed form solution as MrE does in Equations (80)–(82), and by definition, these expressions are always numerically stable. However, if one applies the Kalman filter matrix operations and does not continue with further reductions of the final estimates’ expressions into their irreducible forms, then Equation (97) might be better to use compared to Equation (99). This shows one more benefit of MrE: the closed form solutions allow one to avoid numeric instability in certain situations.
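The difference between the two covariance updates discussed above can be seen in a small numerical experiment. The sketch below writes Joseph’s stabilized form and the simplified form in their standard textbook notation, which is an assumption about how they map onto the numbered equations of this paper; the matrices, the position-only measurement and the perturbation of the gain are all hypothetical.

```python
import numpy as np

def covariance_update_simple(P_pred, K, G):
    """Simplified update: (I - K G) P_pred. Exact only for the optimal gain."""
    I = np.eye(P_pred.shape[0])
    return (I - K @ G) @ P_pred

def covariance_update_joseph(P_pred, K, G, R):
    """Joseph's stabilized update: symmetric by construction for any gain."""
    I = np.eye(P_pred.shape[0])
    A = I - K @ G
    return A @ P_pred @ A.T + K @ R @ K.T

P_pred = np.array([[0.05, 0.02],
                   [0.02, 0.10]])     # predicted covariance (illustrative)
G = np.array([[1.0, 0.0]])            # position-only measurement (assumed)
R = np.array([[0.04]])                # measurement noise covariance (assumed)

# Optimal gain, and a slightly perturbed gain mimicking accumulated error
K = P_pred @ G.T @ np.linalg.inv(G @ P_pred @ G.T + R)
K_off = K + np.array([[0.0], [0.1]])

P_simple = covariance_update_simple(P_pred, K_off, G)
P_joseph = covariance_update_joseph(P_pred, K_off, G, R)
```

At the optimal gain the two functions return the same matrix, which mirrors the statement that both forms reduce to one closed form solution; with the perturbed gain only the Joseph form remains symmetric.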

## 6. Nonlinear Filter

The original Kalman filter has an assumption that the relationships between the state space system’s variables are linear. This assumption allows it to be expressed in a matrix form. Therefore, by definition, the Kalman filter is a linear filter, and nonlinear relationships have no explicit representations in transition matrix Equation (90). For that reason, variants of the Kalman filter were invented. One is called the extended Kalman filter, where any function is approximated locally by calculating a Jacobian at the approximate estimated location. Another variant is the unscented Kalman filter, which locally restricts the data sampling to a set of 2n+1 sigma points, where n is the dimension of the state space. It allows the avoidance of calculating the Jacobian and has the benefit that the nonlinear transition can be locally approximated by a cubic characteristic, if we looked from a single variable’s point of view. In the next section, we will derive the Kalman filter expression by a probabilistic approach by applying a one-to-one transition between the state variables, where the transition can be monotonic increasing or decreasing and not necessarily linear, as in original Kalman filter assumptions. More complex and generalized formulae might be found in [17].
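The set of 2n+1 sigma points used by the unscented Kalman filter can be sketched as follows. This uses the common symmetric construction with a single scaling parameter κ, which is one of several conventions in the literature and is an assumption here; the function name and the example mean and covariance are hypothetical.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Return the 2n+1 sigma points (as rows) and their weights."""
    n = mean.size
    # Lower-triangular L with L @ L.T = (n + kappa) * cov
    L = np.linalg.cholesky((n + kappa) * cov)
    # Rows of L.T are the columns of L; add and subtract each from the mean
    points = np.vstack([mean, mean + L.T, mean - L.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    return points, weights

mean = np.array([1.0, 0.5])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
points, weights = sigma_points(mean, cov)

# The weighted points reproduce the first two moments of the distribution
recovered_mean = weights @ points
centered = points - recovered_mean
recovered_cov = (weights[:, None] * centered).T @ centered
```

In an unscented filter each row of `points` would be passed through the nonlinear transition, and the same weighted sums would then give the predicted mean and covariance without computing a Jacobian.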

#### 6.1. Generalized Univariate Nonlinear Filter for Monotonic Transitions

In this section, we construct a general transformation of variables. While this can be found in undergraduate texts and advanced literature on Kalman filtering [18,19], we feel the need to include it, as it allows us to further point directly to why Kalman assumptions are necessary. Assume we have a single random variable, X (thus, a univariate approach). A random variable is a function whose value is subject to variations, due to some randomness. A value of this function (which we will call a random variable from now on) is associated with some probability (discrete case) or with probability density (continuous case). In this section, we deal with continuous real-valued data values only, but the approach itself is not restricted to them only.

The cumulative distribution function (cdf) of X is the function given by,

$$F_X(x) = P(X\le x).$$

Assume we are measuring the traveled distance by a robot in meters (random variable X) and we can learn how many kilometers the robot has traveled (random variable Y); then, there is a physical relationship in the measurable space, because we know the ratio between kilometers and meters (A = 1000; note that this parameter is known, i.e., 1000 is a constant), as,

$$X = A\,Y.$$

This is an invertible, one-to-one relationship, Y = g(X) (equivalently, X = g^{−1}(Y)), with which we can transform the random variable, X, to Y. In other words, the variable, Y, has complete probabilistic information for this transformation, and we can therefore get all the probabilistic characteristics of X. We could assume some nonlinear assumptions here, where there are one-to-many links between variables [17], but for simplicity’s sake, we avoid this here.

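Current one-to-one assumptions are still more general than the original Kalman filter assumptions, because we allow not only the linear equation system of variables, but also the system of any continuously increasing or decreasing functions. Then, the definition or the meaning of Equation (103) can be represented by the following expressions:

$$P(Y\le y) = P\left(g_{\text{increasing}}(X)\le y\right) = P\left(X\le g_{\text{increasing}}^{-1}(y)\right) = F_X\left(g_{\text{increasing}}^{-1}(y)\right), \tag{104}$$

$$P(Y\le y) = P\left(g_{\text{decreasing}}(X)\le y\right) = P\left(X\ge g_{\text{decreasing}}^{-1}(y)\right) = 1-F_X\left(g_{\text{decreasing}}^{-1}(y)\right), \tag{105}$$

where Equation (104) is the case when g(X) is a continuous increasing function g_{increasing}(X) and Equation (105) is the case when g(X) is a continuous decreasing function g_{decreasing}(X). The main parts of these equalities are P(Y ≤ y) = P(g_{increasing}(X) ≤ y) and P(Y ≤ y) = P(g_{decreasing}(X) ≤ y), i.e., by definition, we can apply the transformation of random variables when constructing the probabilities. In other words, if we know the cdf of the random variable, X (the right side of Equation (104) and Equation (105)), then we can construct the cdf of Y (the left side of Equation (104) and Equation (105)). Extending Equation (101) by reusing the expression of the cumulative distribution from Equation (104) and applying the Fundamental Theorem of Calculus and the chain rule, we obtain,

$$f_Y(y) = \frac{d}{dy}F_X\left(g_{\text{increasing}}^{-1}(y)\right) = f_X\left(g_{\text{increasing}}^{-1}(y)\right)\frac{d\left(g_{\text{increasing}}^{-1}(y)\right)}{dy},$$

and, in the same way from Equation (105),

$$f_Y(y) = -f_X\left(g_{\text{decreasing}}^{-1}(y)\right)\frac{d\left(g_{\text{decreasing}}^{-1}(y)\right)}{dy}.$$

The sign in front of the density in the expression for f_{Y}(y) depends on the sign of the derivative of the inverse function, i.e., when g is increasing (the derivative is positive), the sign is positive, and when g is decreasing (as for $\frac{d({g}_{\text{decreasing}}^{-1}(y))}{\mathit{dy}}$, which is negative), the sign is negative. Therefore, in the general case, when function g(X) is invertible and monotonic, we can write the final transformation expression as,

$$f_Y(y) = f_X\left(g^{-1}(y)\right)\left|\frac{d\left(g^{-1}(y)\right)}{dy}\right|.$$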
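The standard change-of-variables result for an invertible monotonic g, namely that the density of Y equals the density of X evaluated at g^{−1}(y) times the absolute value of the derivative of g^{−1}, can be checked numerically. The following sketch uses a decreasing transformation, Y = exp(−X) with X standard normal; the choice of g, the integration grid and the sample size are assumptions made only for illustration.

```python
import numpy as np

def normal_pdf(x, mean=0.0, sigma=1.0):
    """Density of a normal random variable."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def transformed_pdf(y, f_x, g_inv, dg_inv_dy):
    """f_Y(y) = f_X(g^{-1}(y)) * |d g^{-1}(y) / dy| for invertible monotonic g."""
    return f_x(g_inv(y)) * np.abs(dg_inv_dy(y))

# Decreasing example: Y = g(X) = exp(-X), so g^{-1}(y) = -log(y)
y = np.linspace(1e-4, 60.0, 600001)
f_y = transformed_pdf(y, normal_pdf,
                      g_inv=lambda t: -np.log(t),
                      dg_inv_dy=lambda t: -1.0 / t)

# Trapezoid rule: total probability and P(Y <= 1)
cells = 0.5 * (f_y[1:] + f_y[:-1]) * np.diff(y)
total = cells.sum()
cdf_at_one = cells[y[1:] <= 1.0].sum()

# Monte Carlo cross-check: P(Y <= 1) = P(X >= 0) = 0.5
rng = np.random.default_rng(0)
samples = np.exp(-rng.normal(size=200_000))
empirical = np.mean(samples <= 1.0)
```

Without the absolute value the decreasing case would produce a negative density; with it, the same function covers both the increasing and the decreasing case.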

#### 6.2. Generalized Multivariate Nonlinear Filter for Monotonic Transitions

We begin the construction of a multivariate, nonlinear filter. A system of transformations in the measurement space is,

$$\mathbf{y} = g(\mathbf{x}),$$

where **x** = (x_{1}, x_{2}, … x_{n}) and **y** = (y_{1}, y_{2}, … y_{n}). Again, it is not necessary to assume that g(**x**) is a system of linear functions, like in the Kalman filter. Therefore, the Kalman filter is a special case of the approach that we outline here. The investigation of situations when we have many-to-one (or other combinations) of relationships between vectors **x** and **y** is out of the scope of this work. Therefore, an invertible one-to-one transformation, **Y** ≡ g(**X**), between random vector variables **Y** and **X** assumes that,

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}\left(g^{-1}(\mathbf{y})\right)\left|\det\frac{\partial g^{-1}(\mathbf{y})}{\partial\mathbf{y}}\right|,$$

where **x** represents the value at the time t_{k}, and **y** represents the subsequent measurement value happening at the time t_{k+1} = t_{k} + Δt, where index k ranges from one to n and labels the discretization interval of our measurement signal, x(t). Therefore, in the Kalman filter context, Equation (102) should be rewritten in terms of the states at consecutive time steps.

#### 6.3. Kalman Filter Revisited Using a Jacobian

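In this subsection, we will revisit the Kalman filter example from the previous section. Our state space system with transformation function g(x) stays the same,

$$x_k = x_{k-1}+v_{k-1}\,\Delta t+\frac{c_{a,k}\,\Delta t^2}{2}, \qquad v_k = v_{k-1}+c_{a,k}\,\Delta t.$$

The inverse, g^{−1}(⋯), of the transformation is,

$$x_{k-1} = x_k-v_k\,\Delta t+\frac{c_{a,k}\,\Delta t^2}{2}, \qquad v_{k-1} = v_k-c_{a,k}\,\Delta t.$$

The distribution of the previous state, P_{k−1}(⋯), is constructed exactly the same as in the previous subsection by eliminating the random noise variables. In other words, it is a multivariate normal distribution.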
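For the constant-acceleration map, the Jacobian route can be verified against the matrix form directly. The sketch below assumes a noise-free transition g, a Gaussian density for the previous state with hypothetical mean and covariance, and the standard multivariate change-of-variables rule (density at g^{−1} times the absolute Jacobian determinant); the function names are illustrative.

```python
import numpy as np

dt, c_a = 0.1, 2.0     # illustrative time step and constant acceleration

def g(state):
    """Forward transition for (position, velocity)."""
    x, v = state
    return np.array([x + v * dt + 0.5 * c_a * dt**2, v + c_a * dt])

def g_inv(state):
    """Inverse transition: recover the previous (position, velocity)."""
    x, v = state
    return np.array([x - v * dt + 0.5 * c_a * dt**2, v - c_a * dt])

# Jacobian of g_inv with respect to (x_k, v_k); its determinant is one
J_inv = np.array([[1.0, -dt],
                  [0.0, 1.0]])
abs_det = abs(np.linalg.det(J_inv))

def mvn_pdf(z, mean, cov):
    """Density of a two-dimensional normal distribution."""
    d = z - mean
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

mean_prev = np.array([1.0, 0.5])
cov_prev = np.array([[0.04, 0.01],
                     [0.01, 0.09]])

def density_new(z):
    """Density of the new state from the change-of-variables rule."""
    return mvn_pdf(g_inv(z), mean_prev, cov_prev) * abs_det

# Matrix form for comparison: mean g(mean_prev), covariance F cov F^T
F = np.array([[1.0, dt],
              [0.0, 1.0]])
mean_new, cov_new = g(mean_prev), F @ cov_prev @ F.T
```

Evaluating `density_new` at any point gives the same value as a normal density with `mean_new` and `cov_new`, which is the noise-free part of the Kalman prediction step; for a nonlinear monotonic g the same `density_new` pattern applies with the appropriate `g_inv` and Jacobian, although the result is then no longer Gaussian.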

## 7. Summary and Final Remarks

Kalman demonstrated an ingenious way to estimate unknown variables from noisy measurements, in part by making various assumptions. In this paper, we derive the Bayesian filter and, then, show that by applying the Kalman assumptions, we arrive at a solution that is consistent with the original Kalman filter for pedagogical purposes; explicitly showing that the “transition” or “predictive” step is the prior information and the “measurement” or “updating” step is the likelihood of Bayes’ rule. Further, we showed that the well-known Kalman gain is the new uncertainty associated with the posterior distribution.

Recently, a paper [22] used maximum relative entropy (MrE) with the Kalman assumptions, but did not explicitly state that there is a direct link between these two approaches. Here, we showed that the method of maximum relative entropy (MrE) explicitly produces the same mathematical solutions as the Kalman filter, and thus, Kalman is a special case of MrE. We also showed that the closed form solutions after the application of MrE are immune to real-life numeric problems, which might occur when manipulating the Kalman filter matrix operations.

By applying and manipulating pure probabilistic definitions and techniques used in signal analysis theory, we derived a general, nonlinear filter, in which the variables of interest are constrained in the form of continuous monotonic increasing or decreasing functions, and not necessarily a linear set of functions, like in the original Kalman filter. Thus, we can include more information and extend approximation approaches, such as the extended Kalman filter and unscented Kalman filter techniques and other hybrid variants.

In the end, we derived general distributions using MrE for use in Bayes’ Theorem for the same purposes as the original Kalman filter and all of its offshoots. However, MrE can do even more. An important future work will be to include simultaneous constraints on the posterior that Bayes cannot do easily alone, such as including macroscopic or coarse-grained relationships between the various variables of interest. This has been demonstrated in [11].

We would like to acknowledge valuable discussions with Julian Center and Peter D. Joseph.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

1. Kalman, R.E. A new approach to linear filtering and prediction problems. J. Basic Eng. **1960**, 82, 35–45.
2. Meinhold, R.J.; Singpurwalla, N.D. Understanding the Kalman Filter. Am. Stat. **1983**, 37, 123–127.
3. Gibbs, B.P. Advanced Kalman Filtering, Least-Squares and Modeling: A Practical Handbook; Wiley: New York, NY, USA, 2011.
4. E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics; Rosenkrantz, R.D., Ed.; Reidel: Dordrecht, Holland, 1983.
5. Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003.
6. Giffin, A.; Caticha, A. Updating Probabilities with Data and Moments. In AIP Conference Proceedings of Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Knuth, K., Caticha, A., Center, J.L., Giffin, A., Rodríguez, C.C., Eds.; American Institute of Physics: Melville, NY, USA, 2007; Volume 954, p. 74.
7. Caticha, A.; Giffin, A. Updating Probabilities. In AIP Conference Proceedings of Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Mohammad-Djafari, A., Ed.; American Institute of Physics: Melville, NY, USA, 2006; Volume 872, p. 31.
8. Cox, R.T. Probability, frequency, and reasonable expectation. Am. J. Phys. **1946**, 14, 1–13.
9. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; MIT Press: Cambridge, MA, USA, 2006.
10. Chen, Z. Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics **2003**, 182, 1–69.
11. Giffin, A. From physics to economics: An econometric example using maximum relative entropy. Physica A **2009**, 388, 1610–1620.
12. Mitter, S.K.; Newton, N.J. A Variational Approach to Nonlinear Estimation. SIAM J. Contr. **2004**, 42, 1813–1833.
13. Mitter, S.K.; Newton, N.J. Information and Entropy Flow in the Kalman-Bucy Filter. J. Stat. Phys. **2005**, 118, 145–176.
14. Eyink, G.; Kim, S. A maximum entropy method for particle filtering. J. Stat. Phys. **2006**, 123, 1071–1128.
15. Joseph, P.D. Kalman Filter Lessons. Available online: http://home.earthlink.net/~pdjoseph/id11.html (accessed on 18 February 2014).
16. Crassidis, J.L.; Junkins, J.L. Optimal Estimation of Dynamical Systems; Chapman & Hall/CRC: Boca Raton, FL, USA, 2004.
17. Poularikas, A.D. The Handbook of Formulas and Tables for Signal Processing; CRC Press: Boca Raton, FL, USA, 1998.
18. Jazwinski, A.H. Stochastic Processes and Filtering Theory; Academic Press: New York, NY, USA, 1970.
19. Crisan, D.; Rozovskii, B.L. The Oxford Handbook of Nonlinear Filtering; Oxford University Press: Oxford, UK, 2011.
20. Katok, A.; Hasselblatt, B. Introduction to the Modern Theory of Dynamical Systems; Cambridge University Press: Cambridge, UK, 1996.
21. Beck, C.; Schlögl, F. Thermodynamics of Chaotic Systems: An Introduction; Cambridge University Press: Cambridge, UK, 1995.
22. Urniezius, R. Online robot dead reckoning localization using maximum relative entropy optimization with model constraints. In AIP Conference Proceedings of Bayesian Inference and Maximum Entropy Methods in Science and Engineering; Mohammad-Djafari, A., Ed.; American Institute of Physics: Melville, NY, USA, 2011; Volume 1305, p. 274.

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).