Optimal Nonlinear Estimation in Statistical Manifolds with Application to Sensor Network Localization

Information geometry enables a deeper understanding of the methods of statistical inference. In this paper, the problem of nonlinear parameter estimation is considered from a geometric viewpoint using a natural gradient descent on statistical manifolds. It is demonstrated that the nonlinear estimation for curved exponential families can be simply viewed as a deterministic optimization problem with respect to the structure of a statistical manifold. In this way, information geometry offers an elegant geometric interpretation for the solution to the estimator, as well as the convergence of the gradient-based methods. The theory is illustrated via the analysis of a distributed mote network localization problem where the Radio Interferometric Positioning System (RIPS) measurements are used for free mote location estimation. The analysis results demonstrate the advanced computational philosophy of the presented methodology.


Introduction
Information geometry, pioneered by Rao in the 1940s [1] and further developed by Chentsov [2], Efron [3,4] and Amari [5,6], considers the statistical relationships between families of probability densities in terms of the geometric properties of Riemann manifolds.It is the study of intrinsic properties of manifolds of probability distributions [7], where the ability of the data to discriminate those distributions is translated into a Riemannian metric.Specifically, the Fisher information gives a local measure of discrimination of the distributions which immediately provides a Riemannian metric on the parameter manifold of the distributions [1].In particular, the collection of probability density functions called curved exponential families, which encapsulate important distributions in many real world problems, have been treated using this framework [5].
The main tenet of information geometry is that many important notions in probability theory, information theory and statistics can be treated as structures in differential geometry by regarding a space of probabilities as a differentiable manifold endowed with a Riemannian metric and a family of affine connections, including but not exclusively, the canonical Levi-Civita affine connection [6].By providing a means to analyse the Riemannian geometric properties of various families of probability density functions, information geometry offers comprehensive results about statistical problems simply by considering them as geometrical objects.Information geometry opens new prospect to study the intrinsic geometrical nature of information theory and provides a new way to deal with statistical problems on manifolds.For example, Smith [8] studied the intrinsic Cramér-Rao bounds on estimation accuracy for the estimation problems on arbitrary manifolds where the set of intrinsic coordinates is not apparent, and derived the intrinsic bounds in the examples of covariance matrix and subspace estimation.Srivastava et al. [9,10] addressed the geometric subspace estimation and target tracking problems under a Bayesian framework.Bhattacharya and Patrangenaru [11] treated the general problem of estimation on Riemannian manifolds.
As this new general theory reveals the capability of defining a new perspective on existing questions, many researchers are extending their work on this geometric theory of information to new areas of application and interpretation.For example, a most important milestone in the area of signal processing is the work of Richardson where the geometric perspective clearly indicates the relationship between turbo-decoding and maximum-likelihood decoding [12].The results of Amari et al. on the information geometry of turbo and low-density parity-check codes extend the geometrical framework initiated by Richardson to the information geometrical framework of dual affine connections [13].Other investigations include the geometrical interpretation of fading in wireless networks [14]; the geometrical interpretation of the solution to the multiple hypothesis testing problem in the asymptotic limit developed by Westover [15]; and a geometric characterization of multiuser detection for synchronous DS/CDMA channels [16].Recently, the framework of information geometry has been applied to address issues in the application of sensor networks such as target resolvability [17], radar information resolution [18] and passive sensor scheduling [19,20].
In this paper, we mainly focus on the nonlinear estimation problem and illustrate how it can benefit from the powerful methodologies of information geometry.The geometric interpretation for the solution to the maximum likelihood estimation for curved exponential families and the convergence of the gradient-based methods (such as Newton's method and the Fisher scoring algorithm) are demonstrated via the framework developed by Efron and Amari et al.Our essential motivation of this work is to provide some insights into the nonlinear parameter estimation problems in signal processing using the theory of information geometry.By gaining a better understanding of the existing algorithms through the use of information geometric method, we are, hopefully, enabled to derive better algorithms for solving non-linear problems.
The work described in this paper consists of the following aspects.Firstly, an iterative maximum likelihood estimator (MLE) for estimating non-random parameters with measurement of the curved exponential distributions is presented.The estimator belongs to the gradient-based methods that operate on statistical manifolds and can be seen as a generalization of Newton's method to families of probability density functions and their relevant statistics.Its interpretation in terms of differential and information geometry provides insight into its optimality and convergence.Then, by utilizing the properties of exponential families and thus identifying the parameters on statistical manifolds, the implementation of the presented MLE algorithm is simplified via reparametrization.Furthermore, it is shown that the associated stochastic estimation problem reduces to a deterministic optimization problem with respect to the measures (statistics) defined over the distributions.Finally, an example of a one-dimensional curved Gaussian is presented to demonstrate the method in the manifold.A practical example of distributed mote network localization using the Radio Interferometric Positioning System (RIPS) measurements is given to demonstrate the issues addressed in this paper.The performance of the estimator is discussed.
In the next section, classical nonlinear estimation via natural gradient MLE for curved exponential families is derived.The reparametrization from local parameters to natural parameters and expectation parameters is highlighted.In Section 3, the principles of information geometry are introduced.Further, the geometric interpretation for the presented iterative maximum likelihood estimator and the convergence of the developed algorithm is demonstrated via the properties of statistical manifolds.In Section 4, a one parameter estimation example is presented to illustrate the geometric operation of the algorithm.The performance and efficiency of the algorithm are further demonstrated via a mote localization example using RIPS measurements.Finally, conclusions are made in Section 5.

Nonlinear Estimation via Natural Gradient MLE
In probability and statistics, exponential families (including the normal, exponential, Gamma, Chi-squared, Beta, Poisson, Bernoulli, multinomial and many others) are an important class of distributions naturally considered.There is a framework for selecting a possible alternative parameterization of these distributions, in terms of the natural parameters, and for defining useful statistics, called the natural statistics of the families.When the natural parameters in exponential families are nonlinear functions of some "local" parameters, the distributions involved are in the curved exponential families.While curved exponential families which encapsulate important distributions are more suitable to describe many real world problems, the estimation of local parameters is often non-trivial because of the nonlinearity in parameters.
In this section, a natural gradient based maximum likelihood estimator is derived to address a nonlinear estimation problem for curved exponential families.Although the estimator has been well-known as the Fisher scoring method in the literature, the interpretation of its iterative operations via the theory of information geometry is interesting and will be presented in the following section.
The general form of a curved exponential family [5] can be expressed as where x ∈ Ω is a vector valued measurement, θ = {θ 1 , . . ., θ n } are the natural coordinates or canonical parameters, u denote local parameters and F(x) = {F 1 (x), . . ., F n (x)} are sufficient statistics for θ = {θ 1 , . . ., θ n }, and functions on the measurement space with elements denoted by x.The function ϕ is called the potential function of the exponential family and it is found from the normalisation condition Ω p θ (x)dx = 1, i.e., The term "curved" comes from the fact that the distribution in Equation ( 1) is defined by a smooth embedding u −→ θ(u) from the manifold parameterized by u into the canonical exponential family p(x|θ).
As an example, a curved Gaussian distribution with local parameter u ∈ R m , mean µ(u) ∈ R n and covariance Σ(u) ∈ R n×n is expressed as By reparameterization, the standardized natural parameters (θ, Θ) of a curved Gaussian distribution are found to be The sufficient statistics of the Gaussian distribution in Equation (3) is and the potential function expressed in terms of local parameters (µ, Σ) is given by where the superscript T signifies the transpose operation and n is the cardinality of µ ∈ R n .
Let l(θ, x) = log p(x|θ) be the log likelihood of a general family of distributions p(x|θ) as in Equation ( 1), and ∇θ the Jacobian matrix of the natural parameter θ as a function of the local parameters u.We may write the following equations by using Equation ( 1) where is called the expectation parameter which connects to θ(u) by the well known Legendre transformation [5], and Both the expectation parameter η and Fisher information matrix G(θ) can be obtained by differentiating the potential function ϕ(θ) with respect to natural parameters [21] The maximum likelihood estimator û of the curved exponential family satisfies the following likelihood equation Here l( û) is an objective function to be maximized with parameter u.It was pointed out by Amari in [22] that the geometry of the Riemannian manifold must be taken into account when calculating the steepest learning directions on a manifold.He suggested the use of natural gradient (NAT) updates for the optimization on a Riemannian manifold, i.e., where λ is a positive learning rate that determines the step size and G(u) denotes the Riemannian metric matrix of the manifold.For a parameterized family of probability distributions on a statistical manifold, the Riemannian metric is defined as the Fisher information matrix (FIM) [1].For the curved exponential family in Equation ( 1), the FIM with respect to the local parameter u is where G(θ) is the FIM with respect to the natural parameter θ.A recursive MLE of curved exponential families can then be implemented as follows where θ(u) and η(u) denote the natural parameter and expectation parameter of the distribution, respectively.F(x) is the sufficient statistics for the measurement x.
The covariance (CRLB) of the recursive MLE estimator u (k+1) is the inverse of Fisher information matrix G −1 (u (k+1) ), where The proposed algorithm is summarized in Algorithm 1.

Algorithm 1:
The natural gradient based iterative MLE algorithm.

Distribution reparameterization
Identify the natural parameter θ(u), sufficient statistics F(x) and potential function ϕ θ(u) .2. Find expectation parameter η and Fisher information metric G to construct manifold of the curved exponential family p x|θ(u) , The above natural gradient approach has a similar structure as the common gradient-based algorithms, such as the well-known steepest descent method, Newton's method and Fisher scoring algorithm.However, it does distinguish itself from the others in the following points:

•
The natural gradient estimator updates the underlying manifold metric G (i.e., FIM) at each iteration as well, which evaluates the estimate accuracy.

•
Updates in the classical steepest descent types are performed via the standard gradient ∇l(u) and are well-matched to the Euclidean distance measure as well as the gradient adaptation.For the cases where the underlying parameter spaces are not Euclidean but are curved, i.e., Riemannian, −∇l(u) does not represent the steepest descent direction in the parameter space, and thus the standard gradient adaptation is no longer appropriate.The natural gradient updates in Equation ( 14) improve the steepest descent update rule by taking the geometry of the Riemannian manifold into account to calculate the learning directions.In other words, it modifies the standard gradient direction according to the local curvature of the parameter space in terms of the Riemannian metric tensor G(u), thus offers faster convergence than the steepest descent method.

•
The Newton method improves the steepest descent method by using the second-order derivatives of the cost function, i.e., the inverse of the Hessian of l(u) to adjust the gradient search direction.When l(u) is a quadratic function of u, the inverse of the Hessian is equal to G(u), and thus Newton's method and the natural gradient approach are identical [22].However, in more general contexts, the two techniques are different.Generally, the natural gradient approach increases the stability of the iteration with respect to Newton's method through replacing the Hessian by its expected value, i.e., the Riemannian metric tensor G(u).

•
The natural gradient approach is identical to the Fisher scoring method in cases where the Fisher information matrix coincides with the Riemannian metric tensor of the underlying parameter space.In such cases, the natural gradient approach is a Riemannian-based version of the Fisher scoring method performed on manifolds, and it is very appropriate when the cost function is related to the Riemannian geometry of the underlying parameter space [23].Once these methods are entered into the manifold, additional insights into their geometric meaning may be deduced in the framework of differential and information geometry.
It is worth mentioning that a strategic choice of parameterizations of the cost function may result in a faster convergence or a more meaningful implementation of an optimization algorithm, though it is quite-often non-trivial.In the proposed iterative MLE algorithm in Equation ( 16), an alternative parameterization of the curved exponential family, in terms of the natural and expectation parameters, are employed.Through such a reparameterization, the implementation of the natural gradient updates is facilitated by the relevant statistics of a curved exponential family.

Principles of Information Geometry
(1) Definition of a statistical manifold: Information geometry originates from the study of manifolds of probability distributions.Consider the parameterized family of probability distributions S = { p(x|θ )}, where x is a random variable and θ = (θ 1 , . . ., θ n ) is a parameter vector specifying the distribution.The family S is regarded as a statistical manifold with θ as its (possibly local) coordinate system [24].
Figure 1 illustrates the definition of a statistical manifold.For a given state of interest θ in the parameter space Θ ∈ R n , the measurement x in the sample space X ∈ R m is an instantiation of a probability distribution p(x|θ ).Each probability distribution p(x|θ ) is labelled by a point s(θ) in the manifold S. The parameterized family of probability distributions S = { p(x|θ )} forms an n-dimensional statistical manifold where θ plays the role of a coordinate system of S.
(2) Fisher information metric: The metric is the object specifying the scalar product in a particular point on a manifold in differential geometry.It encodes how to measure distances, angles and area at a particular point on the manifold by specifying the scalar product between tangent vectors at that point [25].For a parameterized family of probability distributions on a statistical manifold, the FIM plays the role of a Riemannian metric tensor [1].Denoted by G(θ) = [g ij (θ)], where the FIM measures the ability of the random variable x to discriminate the values of the parameter θ from θ for θ close to θ.
(3) Affine connection and Flatness: The affine connection ∇ (The notation ∇ is also used to denote the Jacobian in this paper.However, there should be no confusion from the context.) on a manifold S defines a linear one-to-one mapping between two neighboring tangent spaces of the manifold.When the connection coefficients of ∇ with respect to a coordinate system of S are all identically 0, then ∇ is said to be flat, or alternatively, S is flat with respect to ∇.The curvature of a flat manifold is zero everywhere.Correspondingly, flatness will result in considerable simplification of the geometry.Intuitively, a flat manifold is one that "locally looks like" a Euclidean space in terms of distances and angles.Consequently, many operations on the flat manifold such as projection and orthogonality become more closely analogous to the case of an Euclidean space.
It is worth mentioning that the flatness of a manifold is closely related to the definition of affine connections as well as the choice of coordinate systems of the manifold.In 1972, Chentsov [2] introduced a one-parameter family of affine connections called α-connections which were later popularized by Amari [5]: where In Equation (20), α = 0 corresponds to the Levi-Civita connection (information connection).The case α = −1 defines the e-connection (exponential connection) while α = −1 defines the m-connection (mixture connection).An exponential family with natural parameter θ as the coordinate system (parameterization) is a flat manifold under the e-connection while a mixture family with expectation parameter η as the coordinate system is a flat manifold under the m-connection [24].The e-connection and m-connection play an important role in statistical inference and geometrize the estimation on the flat manifolds.

Information Geometric Interpretation for Natural Gradient MLE
Based on the principles of information geometry introduced above, the algorithm described in Algorithm 1 can be explained via Figure 2, where the upper figure illustrates the estimation operation in Euclidean space while an alternative view of the estimation operation on statistical manifolds is analogously illustrated in the lower figure.In the upper figure, the local parameter u is to be estimated from measurements (samples) x via the likelihood function p(x|u).When the measurement model is nonlinear in the parameter, the underlying estimation of the local parameter is a nonlinear estimation or filtering problem.Usually, linear signal processing problems can be routinely solved systematically by the astute application of results from linear algebra.However, the nonlinear cases are not easy to solve.The methodology of differential and information geometry are more adaptable and capable of dealing with nonlinear problems.Given that M = {p(x|θ)} signifies a general set of conditional distributions, the natural parameter space {θ} ∈ A n contains the distributions of all exponential families; to be regarded as an enveloping space.Then the curved exponential family p(x|u) in the upper figure is smoothly embedded in the enveloping space A n , m ≤ n by distribution reparameterization u −→ θ(u), i.e., the curved exponential family p(x|u) can be represented by a curve {θ = θ(u)} embedded in A n .Consequently the nonlinearity in the underlying estimation problem is completely characterized by the red curve inside the natural parameter space.
The circle on the lower right side of Figure 2 is the expectation parameter space {η} ∈ B n or sampling space, which is dual to the natural parameter space A. The dots in the space B signify the "realizations" of the sufficient statistics F(x) of the distribution p(x|θ) and they are obtained from measurements (samples) x.Connected by the Legendre transformation the dual enveloping sub-manifolds A and B (i.e., natural and expectation parameter spaces) are in one-to-one correspondence [5].By viewing the expression of the curved exponential families in Equation ( 1), we observe that the newly formulated likelihood p (F(x)|θ) is linear, which indicates the possibility

•
The iterative estimator is optimal in the MLE sense as the filtering itself involves no information loss.The stochastic filtering problem becomes an optimization problem defined over a statistical manifold.

•
As seen from Algorithm 1, the algorithm implementation is relatively simple and straightforward by distribution reparameterization and operating in the dual flat manifolds.Though a Newton method-based MLE estimator can be derived directly via the likelihood.However, in most cases the operation is not trivial.

•
The initial guess is important to facilitate convergence of the estimator to the true value.This can be varied and such initial value sampling may provide more certainty about reaching a global minimum.This has not been examined here.
In the next section, two examples are given to demonstrate the implementation of the developed estimator as well as its geometric interpretation.

An One Parameter Estimation Example of Curved Gaussian Distribution
Consider a curved Gaussian distribution where a is a constant and u is an unknown parameter to be estimated.The collection of distributions specified by the parameter u constitute a one-dimensional curved exponential family, which can be embedded in the natural parameter space A in terms of natural coordinates which is a parabola (denoted by F A ) in A. The underlying distribution in Equation ( 22) can be alternatively embedded in the expectation parameter space B in terms of expectation coordinates which is also a parabola (denoted by F B ) in B.
The tangent vector ∇θ(u) of the curve The metric G(u) has only one component g in this case, and is The sufficient statistics F(x) of the underlying distribution are and they are obtained from measurements (samples) x.
Figure 4 shows the two dual flat spaces (A and B) and illustrates the estimation process implemented in them, where the blue parabolas in two figures denote the embeddings of the curved Gaussian distribution specified by parameter u.Without loss of generality, a = 1, u = 2 are assumed.The red arrows in two figures show the tangent vector ∇θ(u) of the curve F A while the two red dots on the parabolas denote the "realizations" of the distribution in Equation ( 22) specified by the given parameter u = 2.One hundred observed data (measurements) are shown by the blue dots in the expectation parameter space specified by coordinates (x, x 2 ).The red asterisk denotes the sufficient statistics F(x) obtained from the statistical mean of the measurements (samples).By projecting the data F(x) on to the sub-manifold represented by F B orthogonally to ∇θ(u), i.e., (F(x) − η( ûML )) ⊥ ∇θ( ûML ), the MLE estimation of parameter u is obtained.By viewing Equation (25) and Figure 4b, the estimation ûML = η1 is with high accuracy to the true value of u in this example. -

A Mote Localization Example via RIPS Measurements
The Radio Interferometric Positioning System (RIPS) is an efficient sensing technique for sensor networks of small, low cost sensors with various applications.It utilizes radio frequency interference to obtain a sum of distance differences between the locations of a quartet of the motes which is initially reported in [26] and further discussed in [27].In this paper, we take this mote localization problem via RIPS measurements as an application of the presented natural gradient MLE.
(1) Problem description: The Radio Interferometric Positioning System (RIPS) measurement model is described in [28].A single RIPS measurement, as described in [26], involves four motes.A mote is a node in a sensor network that is capable of performing some processing, gathering sensory information and communicating with other connected nodes in the network.The main components of a sensor mote are a microcontroller, transceiver, external memory, power source and sensing hardware device.Figure 5 illustrates a collection of four motes A, B, C, D, where A, B and C are anchor motes (i.e., their location states are already known) and D is a free mote with unknown location.Two motes act as transmitters, sending a pure sine wave at slightly different frequencies.This results in an interference signal at a low beat frequency that is received by the other two motes (acting as receivers).A sum of range differences between the four motes can be obtained from the phase difference of the received interference signals at the two receiver locations.If motes A and B serve as transmitters and motes C and D form the receiver pair, then the corresponding RIPS measurement, denoted k A,B,C,D , measures the distance differences which may be simply written as In the absence of noise, two independent RIPS measurements can be found.Therefore, the other independent measurement is given by The two independent RIPS measurements can be written as where u = [u x , u y ] is the unknown location of the free node D, and are both known constants in which x i , y i , i = a, b, c, are the location coordinates of the three anchor motes.Accordingly, a generic RIPS measurement model can be written as The underlying localization problem is to estimate the location of the free mote D based on RIPS measurement in Equation (34), where we assume that the knowledge of anchor node locations are known.The problem of locating the node D from RIPS measurements corrupted with Gaussian noise in Equation (34) reduces to a nonlinear parameter estimation problem.We adopt the natural gradient based MLE estimator described above to address it.
(2) Mote localization via RIPS measurements: Based on the RIPS measurement model in Equation (34), we can write the likelihood function in the form Rearranging Equation (35) to describe in terms of the canonical curved exponential family, we obtain which yields (The required quantities can be obtained directly from the standard relations between Equations ( 4)- (7) for curved Gaussian distributions.) The expectation parameter and FIM on natural parameter are given by The Jacobian matrix of natural parameter θ with respect to local parameter u is given by ∇θ where The FIM with respect to the local parameter u, i.e., Equation (15) becomes Therefore, the iterative MLE estimator for estimating the location of a free mote using RIPS measurements is implemented as The covariance of the estimator u (k+1) is the inverse of the Fisher information matrix given in Equation (49).
As with other gradient optimization algorithms, a reasonable guess of the initial state value u (0) is required to facilitate the optimization converges to the correct local minimum.In this application, the initial state u (0) may be obtained from the RIPS measurements via the RIPS trilateration algorithm described in [29], which is then used to calculate G(u (0) ).
The performance of the proposed localization algorithm is illustrated by analyzing a scenario illustrated in Figure 6.In this example, three RIPS nodes are located at (40, 50), (70, 50), and (60, 70) m.The noise of a RIPS measurement is assumed to be zero-mean Gaussian with a standard deviation σ = 1 m. Figure 6a shows 5 cases of localization results of the algorithm.The initial values are randomly generated in the simulations.We use the label CRLB to signify the ellipse which corresponds to the inverse of the Fisher information matrix of the network G(u), centered at the true location u. Figure 6b shows the covariances of the iterative MLE estimator at different iterations, where the kth error ellipse is calculated using the inverse of G(u (k) ) in Equation (49) and is centered at u (k) .Figure 7a,b demonstrate the localization results of different initial state values with the same measurements and results based on a set of measurements with the same initial state values, respectively.The estimator performance under this scenario is summarized in Table 1.Table 1.Estimator performance summary for the mote localization scenario shown in Figure 7b.

Conclusions
In this paper, an iterative maximum likelihood estimator based on the natural gradient method is described to address a class of nonlinear estimation problems for distributions of curved exponential families.We show that the underlying nonlinear stochastic filtering problem is solved by a natural gradient optimization technique which operates over statistical manifolds under dual affine connections.In this way, information geometry offers an interesting insight into the natural gradient algorithm and connects the stochastic estimation problem to a deterministic optimization problem.In this respect, the underlying philosophy is far more significant than the algorithm itself.Furthermore, based on an information geometric analysis it is promising that better algorithms for solving non-linear estimation problems can be derived.For instance, a "whitened gradient" which whitens the tangent space of a manifold has been presented in [30].The whitened gradient replaces the Riemannian metric G(u) in the natural gradient updates by its square root G − 1 2 (u) and results in a faster and more robust convergence.
The work in this paper indicates that the methods of differential/information geometry provide useful tools for systematically solving certain non-linear problems commonly encountered in signal processing.Future work involves extrapolation of these techniques to handle the filtering problem for nonlinear stochastic dynamics.

Figure 2 .
Figure 2. Illustrates the nonlinear estimation concept in both Euclidean space and statistical manifolds the dual spaces of natural parameter θ and expectation parameter η.Top figure shows relations between parameter spaces (left) and samples (right) and associated estimating mappings.Bottom figure shows equivalent model for manifolds of parameter (left) and sample distributions (right).In both cases, the aims are to estimate the most likely parameter values that predict the data, and vice-versa.

Figure 4 .
Figure 4. Estimation example of one-dimensional curved Gaussian distribution.(a) Shows the natural parameter space and embedding of the curved Gaussian distribution in it; (b) Shows the dual expectation parameter space and the estimation process in it.

Figure 5 .
Figure 5. Radio Interferometric Positioning System (RIPS) measurement involving four sensors with three known anchors and one unknown sensor.

Figure 6 .
Figure 6.(a) Shows how the algorithm correctly localizes 5 unknown sensors ("+") from 3 motes ("∆") with the iteration beginning at initial (guessed) locations; (b) Shows an example of convergence of the iterative MLE estimator covariance to CRLB in the localization process.

Figure 7 .
Figure 7. (a) An example of localization with the same three anchor motes as in Figure 6 for iteration beginning at 10 different initial locations; (b) Localization results for a set of measurements with the same initial state values.