The Bayesian Inference of Pareto Models Based on Information Geometry

Bayesian methods have developed rapidly owing to the important role of interpretable causality in practical problems. We develop geometric approaches to Bayesian inference for Pareto models and give an application to the analysis of sea clutter. For the Pareto two-parameter model, we show the non-existence of the α-parallel prior in general, and hence adopt the Jeffreys prior for Bayesian inference. Taking the geodesic distance as the loss function, we obtain an estimation in the sense of minimal mean geodesic distance. Meanwhile, by introducing Al-Bayyati's loss function we obtain a new class of Bayesian estimations. In the simulation, we adopt the Pareto model for sea clutter to acquire various types of parameter estimations and posterior prediction results. Simulation results show the advantages of the proposed Bayesian estimations and the posterior prediction.


Introduction
Geometric methods play an important role in Bayesian statistics. At present, there are two main ways to study Bayesian inference through geometric methods. One idea is to regard the prior distribution, the probability distribution of the statistical model and the posterior distribution as vectors in the Hilbert space L²(Θ), and to carry out the analysis through the geometric properties of the Hilbert space. M. de Carvalho [1] used the cosine of the angle between vectors to study their relationships with each other: the cosine between priors represents the coherency of the opinions of experts, the cosine between prior and likelihood represents prior-data agreement, and the cosine between prior and posterior represents the sensitivity of the posterior to the prior specification. Furthermore, M. de Carvalho used Markov Chain Monte Carlo to estimate the cosine values for further analysis. R. Kulhavy [2] viewed statistical inference as an approximation of the empirical density rather than an estimation of a true density, and built a model by analyzing the trace of the orthogonal projection of conditional empirical distributions onto the model manifold. He also used the Kerridge inaccuracy as a generalized empirical error. The Kerridge inaccuracy is a generalization of the Shannon entropy; it measures the difference between an observed distribution Q = (q_1, …, q_n) and a true distribution P = (p_1, …, p_n), and is defined by I(P, Q) = −∑_i p_i log(q_i). The advantage of this idea is that it provides a unified treatment of all pieces of the Bayesian theorem. However, the parameter space is required to have finite measure.
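As a small illustration (with a hypothetical pair of discrete distributions), the Kerridge inaccuracy defined above decomposes as I(P, Q) = H(P) + KL(P ‖ Q), so it always dominates the Shannon entropy of P, with equality exactly when Q = P:

```python
import math

def kerridge_inaccuracy(p, q):
    """Kerridge inaccuracy I(P, Q) = -sum_i p_i * log(q_i).

    Measures how badly the observed distribution Q describes data actually
    drawn from P; it reduces to the Shannon entropy of P when Q = P.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical distributions for illustration only.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

inaccuracy = kerridge_inaccuracy(P, Q)
entropy = shannon_entropy(P)
# I(P, Q) = H(P) + KL(P || Q) >= H(P), with equality iff Q = P.
```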
Another idea is to endow statistical manifolds with Riemannian metrics. J.A. Hartigan [3][4][5] proposed a reparametrization-invariant prior, the α-parallel prior, and later J. Takeuchi and S. Amari [6,7] clarified an interesting connection between the information-geometric properties of the statistical model and the existence of the α-parallel prior. The α-parallel prior, as an uninformative prior, is invariant under coordinate transformations and reflects the intrinsic properties of the model well. It is worth noting, however, that the general α-parallel prior does not always exist.

Definition 1 ([7]). The α-connection ∇^(α) is defined by its coefficients

Γ^(α)_jk,s = Γ_s:jk + ((1 − α)/2) T_sjk,

where α ∈ R is an arbitrary real number, g_ij = E[∂_i l ∂_j l] is the Fisher metric, (g^ij) is the inverse matrix of (g_ij), Γ_s:jk = E[∂_s l ∂_j ∂_k l], T_sjk = E[∂_s l ∂_j l ∂_k l], l := log p(x; θ) denotes the log-likelihood function and E[·] denotes the expectation with respect to the observation x.
Definition 2 ([7]). An affine connection ∇ is called locally equiaffine if around each point x of M, there is a parallel volume element, that is, a nonvanishing d-form w such that ∇w = 0.
An equiaffine connection ∇ on M is a torsion-free affine connection with a parallel volume element w on M.
If w is a volume element on M such that ∇w = 0, then we say that (∇, w) is an affine structure on M.
For a statistical manifold M, we may represent the α-parallel volume element w as w = π(θ) dθ^1 ∧ · · · ∧ dθ^d for a coordinate system θ = (θ^1, …, θ^d) ∈ Θ ⊂ R^d, where π is the volume form on the whole manifold. We take π(θ) as a prior distribution on the parameter space Θ.

Definition 3 ([7]). In a statistically equiaffine manifold, for a fixed α ∈ R, we call the above form of π an α-parallel prior.
When α = 1, the 1-parallel prior is called the maximum likelihood estimation (MLE) prior, proposed by J.A. Hartigan [5]. Note that there always exists a ∇^(0)-parallel volume element w ∝ √(det g(θ)) dθ^1 ∧ · · · ∧ dθ^d, the invariant volume element of the Riemannian manifold (M, g_ij), where det g is the determinant of the Fisher metric. The prior distribution π(θ) ∝ √(det g(θ)) is called the Jeffreys prior.
J. Takeuchi and S. Amari gave a necessary and sufficient condition for the existence of the α-parallel prior.

Proposition 1 ([7]). For α ≠ 0, a statistical manifold M admits an α-parallel prior if and only if ∂_i T_j = ∂_j T_i for all i, j, where T_i := T_ikl g^kl.

Bayesian Inference
Let the random variable x be subject to the distribution p(x; θ), and let π(θ) be the prior distribution of θ. The posterior distribution π(θ|x) is given by the formula

π(θ|x) = p(x; θ)π(θ) / ∫_Θ p(x; θ)π(θ) dθ.

Now, we introduce some notation for later use. Let θ̂_MD be the maximum posterior estimation, which is the mode of the posterior distribution. Let θ̂_Me be the posterior median estimation, which is the median of the posterior distribution. Let θ̂_E be the posterior expectation estimation, which is the expectation of the posterior distribution.
These three estimations are also known as Bayesian estimations of θ. When θ̂ = θ̂_E, the posterior mean squared error attains its minimum. Hence θ̂_E = E[θ|x] is often taken as the Bayesian estimation.
Let the random variable X ∼ p(x; θ). If one has no observation data, the marginal distribution m(x) = ∫_Θ p(x; θ)π(θ) dθ is also known as the prior predictive distribution. If one obtains the observation data x = (x_1, …, x_n), the distribution of an unknown observation can be obtained from the posterior distribution π(θ|x):

1. Predict a future observation x̃ of the same population p(x; θ):

   m(x̃|x) = ∫_Θ p(x̃; θ)π(θ|x) dθ.

2. Predict an observation z of another population g(z; θ):

   m(z|x) = ∫_Θ g(z; θ)π(θ|x) dθ.

Here m(x̃|x) or m(z|x) is called the posterior predictive distribution.

The Geometric Approaches for Bayesian Inference
In this section, we introduce the basic methods of Bayesian inference with geometric means. The idea of geometry is embodied in the selection of priors and loss functions.

The Geometric Prior
The idea of the geometric method is to extend the uniform distribution naturally and to construct geometric priors suitable for multidimensional parameter spaces, possibly of infinite measure, according to the principle that the probability measure is proportional to the volume element. The probability distribution family under study can then be regarded as a statistical manifold with a Riemannian metric.
Fisher information matrix is the most widely used Riemannian metric on statistical manifolds, and the prior generated by its corresponding volume element is Jeffreys prior. α-connection is a natural extension of the Levi-Civita connection corresponding to Fisher information matrix. Its corresponding volume element is α-parallel volume element, and the generated prior is called α-parallel prior. In particular, the 0-parallel prior is the Jeffreys prior.
The α-parallel prior reflects the intrinsic properties of the model and does not depend on the choice of parameters. Although the Jeffreys prior always exists, the general α-parallel prior does not necessarily exist. Proposition 1 gives the necessary and sufficient condition for the existence of a general α-parallel prior.
Therefore, when one deals with Bayesian inference by geometric methods, the first step is to select the appropriate geometric priors, that is, to verify the existence of α-parallel prior in a specific statistical manifold.
With Riemannian metric, we can acquire geometric information of the statistical manifold, such as connection, curvature, geodesic and geodesic distance [14,15]. Through geometric priors, the joint posterior density of the parameters can be obtained, and then the corresponding Bayesian estimation and Bayesian posterior prediction are carried out [16].

The Geometric Loss Functions
In this subsection, we show the geometric meaning of the common Bayesian estimations and propose a new geometric approach of choosing loss functions.
For the loss function l_1(θ, θ̂) = ‖θ − θ̂‖_1, we define the risk function

R_1(θ, θ̂) = ∫_Θ ‖θ − θ̂‖_1 π(θ|x) dθ.

Then we get that R_1 is minimized by the posterior median θ̂_Me (Proposition 2). For the loss function l_2(θ, θ̂) = ‖θ − θ̂‖_2², we define the risk function

R_2(θ, θ̂) = ∫_Θ ‖θ − θ̂‖_2² π(θ|x) dθ.

Then we obtain that R_2 is minimized by the posterior expectation θ̂_E. If the loss function is the distance induced by ‖·‖_1, then by Proposition 2 the corresponding risk function represents the average distance between the estimated value and the true value; the posterior median estimation minimizes this risk function, which means that it has the minimum mean distance under the posterior density. If the loss function is the square of the distance induced by ‖·‖_2, then the corresponding risk function represents the average squared distance between the estimated value and the true value; the obtained estimation is the posterior expectation of the parameters, which has the minimum mean squared error under the posterior density.
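These two facts can be checked numerically. The sketch below uses a stand-in Gamma posterior sample (not the Pareto posterior of later sections), minimizes the empirical l_1 and squared l_2 risks over a grid, and recovers the posterior median and mean respectively:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in posterior sample; any absolutely continuous posterior works here.
theta = rng.gamma(shape=5.0, scale=0.4, size=50_000)

grid = np.linspace(theta.min(), theta.max(), 801)
risk_l1 = np.array([np.mean(np.abs(theta - a)) for a in grid])   # E|theta - a|
risk_l2 = np.array([np.mean((theta - a) ** 2) for a in grid])    # E(theta - a)^2

argmin_l1 = grid[risk_l1.argmin()]   # approximates the posterior median
argmin_l2 = grid[risk_l2.argmin()]   # approximates the posterior mean
```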
These two kinds of loss functions above are distances or increasing functions of distances in R n . However, in the parameter space endowed with corresponding Riemannian metric, the distance between two points is geodesic distance instead of Euclidean distance.
Hence, in order to make the estimation more intrinsic, we take the geodesic distance, or an increasing function of it, as the loss function; the corresponding risk function then represents the mean (transformed) geodesic distance between the estimated value and the true value. Before that, we need the following definition.

Definition 4. Let π(θ|x) be the joint posterior distribution and d(θ, θ̂) the geodesic distance between θ and θ̂, where θ̂ is the estimation of θ. Let F: R → R be an increasing function and denote D(θ, θ̂) = F ∘ d(θ, θ̂). The risk function with the loss function D(θ, θ̂) is

R(θ, θ̂) = ∫_Θ D(θ, θ̂) π(θ|x) dθ.

The estimation minimizing R(θ, θ̂) is called the mean geodesic estimation (MGE) and is denoted by θ̂_MGE.
The geometric priors, the corresponding geodesic distance and the corresponding Bayesian inference depend on the choice of the Riemannian metric. Hence choosing a proper Riemannian metric is of great importance to Bayesian inference.

The Geometric Structure of Pareto Two-Parameter Model
The probability density function of the Pareto two-parameter distribution satisfies

p(x; α, β) = β α^β / x^(β+1), x ≥ α,

where α > 0 is called the scale parameter and β > 0 is called the shape parameter. Its log-likelihood function is

l(α, β) = log β + β log α − (β + 1) log x.

Note that the Pareto distribution family does not satisfy the common regularity conditions, since its support depends on α; hence the Fisher-Rao metric on the Pareto distribution family is not equal to the negative expected Hessian matrix.
Furthermore, from References [17,18] we can obtain the geometric structure of the Pareto model. On the Pareto two-parameter distribution family, the tensor form of the Fisher-Rao metric satisfies

g = (β²/α²) dα ⊗ dα + (1/β²) dβ ⊗ dβ,

which, under the coordinate change (u, y) = (log α, 1/β), becomes (du² + dy²)/y²; that is, (P, g) is isometric to the upper half Poincaré plane, a space of constant sectional curvature −1. Hence the Pareto two-parameter model (P, g) is a Riemannian manifold endowed with the Riemannian metric g. In particular, the volume form is √(det g) dα ∧ dβ = (1/α) dα ∧ dβ, and the geodesic distance between (α_1, β_1) and (α_2, β_2) satisfies

cosh d = 1 + (β_1 β_2 / 2) [ (log α_1 − log α_2)² + (1/β_1 − 1/β_2)² ].     (22)
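Under the half-plane isometry (with coordinates u = log α, y = 1/β, our labeling), the geodesic distance can be computed directly. A minimal sketch:

```python
import math

def pareto_geodesic_distance(a1, b1, a2, b2):
    """Geodesic (Fisher-Rao) distance between Pareto(alpha, beta) laws.

    Uses the isometry (u, y) = (log alpha, 1/beta) onto the Poincare upper
    half-plane, where cosh d = 1 + ((u1-u2)^2 + (y1-y2)^2) / (2*y1*y2).
    """
    u1, y1 = math.log(a1), 1.0 / b1
    u2, y2 = math.log(a2), 1.0 / b2
    c = 1.0 + ((u1 - u2) ** 2 + (y1 - y2) ** 2) / (2.0 * y1 * y2)
    return math.acosh(c)

# Basic metric properties on a few hypothetical parameter points.
d_self = pareto_geodesic_distance(1.5, 1.5, 1.5, 1.5)       # identical laws -> 0
d_sym_ab = pareto_geodesic_distance(0.5, 1.0, 1.5, 1.5)
d_sym_ba = pareto_geodesic_distance(1.5, 1.5, 0.5, 1.0)     # symmetry
d12 = pareto_geodesic_distance(0.5, 1.0, 1.5, 1.5)
d23 = pareto_geodesic_distance(1.5, 1.5, 2.0, 0.8)
d13 = pareto_geodesic_distance(0.5, 1.0, 2.0, 0.8)          # triangle inequality
```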

The Existence of α-Parallel Prior on Pareto Two-Parameter Model
Theorem 1. When α ≠ 0, the Pareto two-parameter model does not admit any α-parallel prior.

Proof. By direct calculation on the Pareto model, the components of the cubic tensor are T_ααα = β³/α³, T_ααβ = 0, T_αββ = 1/(αβ) and T_βββ = −2/β³. Hence, we get T_α = T_αkl g^kl = 2β/α and T_β = T_βkl g^kl = −2/β, so that ∂_β T_α − ∂_α T_β = 2/α ≠ 0. Therefore, according to Proposition 1, the Pareto two-parameter model does not have any α-parallel prior when α ≠ 0. When α = 0, the 0-parallel prior is the Jeffreys prior, which always exists.

Bayesian Estimations of Pareto Model
Before we proceed, we state necessary results from Reference [17]. Let x = (x_1, …, x_n) be a simple random sample from the Pareto model, and write x_(1) = min_i x_i, q_1(x) = ∑_i log x_i − n log x_(1) and q_2(x) = ∑_i log x_i. The joint probability density of the sample is

p(x; α, β) = β^n α^(nβ) ∏_i x_i^(−(β+1)), 0 < α ≤ x_(1).

The posterior distribution of the Pareto model under the Jeffreys prior π(α, β) ∝ 1/α is obtained by the Bayesian formula

π(α, β|x) ∝ β^n α^(nβ−1) exp(−β q_2(x)), 0 < α ≤ x_(1), β > 0,     (13)

where the normalizing constant is determined by integration over 0 < α ≤ x_(1) and β > 0. Furthermore, by calculation we can see that the maximum likelihood estimations and the maximum posterior estimations of α, β are given as

α̂_MLE = α̂_MD = x_(1), β̂_MLE = β̂_MD = n/q_1(x).

The marginal posterior density of α determined by the joint posterior density π(α, β|x) is

π(α|x) ∝ (1/α)(q_2(x) − n log α)^(−(n+1)), 0 < α ≤ x_(1),

and its cumulative distribution function is

F(α|x) = (q_1(x)/(q_2(x) − n log α))^n, 0 < α ≤ x_(1).

The marginal posterior density of β determined by the joint posterior density π(α, β|x) is the Gamma density with shape n and rate q_1(x):

π(β|x) = q_1(x)^n β^(n−1) e^(−β q_1(x)) / Γ(n), β > 0.

Under the posterior distribution (13), when β is known, the conditional posterior density of α is

π(α|x, β) = nβ α^(nβ−1) / x_(1)^(nβ), 0 < α ≤ x_(1),

and its cumulative distribution function is F(α|x, β) = (α/x_(1))^(nβ). When α is known, the conditional posterior density of β is the Gamma density with shape n + 1 and rate q_2(x) − n log α:

π(β|x, α) = (q_2(x) − n log α)^(n+1) β^n e^(−β(q_2(x) − n log α)) / Γ(n+1), β > 0.
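The maximum likelihood formulas above can be sketched in code. The sample below is a hypothetical toy sample constructed so that the estimates are known exactly (under Pareto(α, β), log(x/α) is exponential with rate β, so β̂ is the usual exponential-rate MLE):

```python
import math

def pareto_mle(xs):
    """MLEs for the Pareto two-parameter model.

    alpha_hat = min_i x_i,  beta_hat = n / sum_i log(x_i / alpha_hat).
    """
    n = len(xs)
    alpha_hat = min(xs)
    beta_hat = n / sum(math.log(x / alpha_hat) for x in xs)
    return alpha_hat, beta_hat

# Toy sample with log(x_i / 2) = 0, 1, 2, so beta_hat = 3 / 3 = 1 exactly.
xs = [2.0, 2.0 * math.e, 2.0 * math.e ** 2]
alpha_hat, beta_hat = pareto_mle(xs)
```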

Theorem 3. Take F(t) = 4 sinh²(t/2) in Definition 4, so that D(θ, θ̂) = 2(cosh d(θ, θ̂) − 1). When β is known, we have α̂_MGE(x|β) = α̂_MLE exp(−1/(nβ)). And when α is known, we have β̂_MGE(x|α) = √(n(n+1)) / (q_2(x) − n log α).

Proof. When β is known, both θ = (α, β) and θ̂ = (α̂, β) share the same second coordinate, so by (22) the loss is D(α, α̂) = β²(log α − log α̂)². The risk function is R = β² E[(log α − log α̂)² | x, β], which is minimized by log α̂ = E[log α | x, β]. Noting that, under the conditional posterior, log(x_(1)/α) is exponentially distributed with rate nβ, we get α̂_MGE(x|β) = x_(1) e^(−1/(nβ)). When α is known, the curve {α = const} is a geodesic with d(β, β̂) = |log β − log β̂|, so the loss is D(β, β̂) = β/β̂ + β̂/β − 2 and the risk function is R = E[β|x, α]/β̂ + β̂ E[1/β|x, α] − 2. Minimizing over β̂ gives β̂_MGE(x|α) = √(E[β|x, α]/E[1/β|x, α]). Noting that β|x, α follows the Gamma distribution with shape n + 1 and rate q_2(x) − n log α, we have E[β|x, α] = (n+1)/(q_2(x) − n log α) and E[1/β|x, α] = (q_2(x) − n log α)/n, whence the result.
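A Monte Carlo sanity check of the β-known case (with hypothetical summary values for n, β and x_(1)): sampling the conditional posterior of α by inverting its CDF (α/x_(1))^(nβ) and minimizing the empirical squared log-distance risk over a grid should recover α̂_MLE exp(−1/(nβ)):

```python
import numpy as np

rng = np.random.default_rng(7)

n, beta, x_min = 50, 1.5, 2.0        # hypothetical sample summary
alpha_mle = x_min

# Conditional posterior pi(alpha | x, beta) = n*beta*alpha^(n*beta-1)/x_min^(n*beta)
# on (0, x_min]; inverse-CDF sampling from F(alpha) = (alpha/x_min)^(n*beta).
u = rng.random(100_000)
alpha_post = x_min * u ** (1.0 / (n * beta))

# Empirical risk for the loss beta^2 * (log a - log alpha)^2; the constant
# beta^2 factor does not affect the minimizer.
a_grid = np.linspace(np.log(x_min) - 0.1, np.log(x_min), 1001)
risk = np.array([np.mean((np.log(alpha_post) - a) ** 2) for a in a_grid])
alpha_mge_numeric = np.exp(a_grid[risk.argmin()])

alpha_mge_closed = alpha_mle * np.exp(-1.0 / (n * beta))
```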

Bayesian Estimations under Al-Bayyati's Loss Function
Al-Bayyati's loss function was stated in Reference [19] as

L_c(θ, θ̂) = θ^c (θ̂ − θ)²,

where c is a real number. Next, we use Al-Bayyati's loss function to derive Bayesian estimations for the Pareto model.

Proposition 3. Assume that θ ∈ Θ ⊂ R. Under Al-Bayyati's loss function, the Bayesian estimation of the parameter θ is given by

θ̂_B_c = E[θ^(c+1)|x] / E[θ^c|x].

Proof. Since the risk function R(θ, θ̂) = ∫_Θ θ^c (θ̂ − θ)² π(θ|x) dθ is quadratic in θ̂, setting ∂R/∂θ̂ = 2 ∫_Θ θ^c (θ̂ − θ) π(θ|x) dθ = 0, we have θ̂_B_c = E[θ^(c+1)|x] / E[θ^c|x].

Under Al-Bayyati's loss function, α̂_B_c lacks a simple closed-form expression when β is unknown. Thus we give upper and lower bound estimations.
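A quick numerical check of the moment-ratio form in Proposition 3, using the Gamma-shaped conditional posterior of β with a hypothetical rate value standing in for q_2(x) − n log α: the Monte Carlo ratio E[β^(c+1)|x]/E[β^c|x] should reproduce the closed form (n + 1 + c)/rate:

```python
import numpy as np

rng = np.random.default_rng(1)

n, c = 20, 2.0
lam = 25.0                              # hypothetical value of q2(x) - n*log(alpha)
# Conditional posterior of beta when alpha is known: Gamma(shape n+1, rate lam).
beta_post = rng.gamma(shape=n + 1, scale=1.0 / lam, size=500_000)

# Generic Al-Bayyati estimator from Proposition 3: E[theta^(c+1)|x] / E[theta^c|x]
beta_mc = np.mean(beta_post ** (c + 1)) / np.mean(beta_post ** c)

# Closed form for the Gamma(n+1, lam) posterior: (n + 1 + c) / lam
beta_closed = (n + 1 + c) / lam
```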

Theorem 4. Using Al-Bayyati's loss function and assuming c ≥ 0, we find that when β is unknown, α̂_B_c satisfies

α̂_E ≤ α̂_B_c ≤ α̂_MLE,

and when α is unknown,

β̂_B_c = ((n + c)/n) β̂_MLE.

Furthermore, there exists c_0 = n(β_0/β̂_MLE − 1) such that β̂_B_{c_0} = β_0, where β_0 is the real value of the shape parameter β.

Proof. When β is unknown, we work with the marginal posterior of α. Noting that the moment ratio E[α^(c+1)|x]/E[α^c|x] is nondecreasing in c (by the Cauchy-Schwarz inequality, E[α^c|x] is log-convex in c) and that α ≤ x_(1) almost surely under the posterior, we get α̂_E = α̂_B_0 ≤ α̂_B_c ≤ x_(1) = α̂_MLE. When α is unknown, the marginal posterior of β is the Gamma distribution with shape n and rate q_1(x), hence by Proposition 3, β̂_B_c = E[β^(c+1)|x]/E[β^c|x] = (n + c)/q_1(x) = ((n + c)/n) β̂_MLE. Setting β̂_B_{c_0} = β_0 yields c_0 = β_0 q_1(x) − n = n(β_0/β̂_MLE − 1).

Theorem 5. Using Al-Bayyati's loss function, when β is known we have

α̂_B_c(x|β) = ((nβ + c)/(nβ + c + 1)) α̂_MLE,

and when α is known we have

β̂_B_c(x|α) = (n + 1 + c)/(q_2(x) − n log α).

Proof. When β is known, the conditional posterior density of α is π(α|x, β) = nβ α^(nβ−1)/x_(1)^(nβ) on (0, x_(1)], so E[α^c|x, β] = nβ x_(1)^c/(nβ + c), and Proposition 3 gives α̂_B_c(x|β) = ((nβ + c)/(nβ + c + 1)) x_(1). When α is known, β|x, α follows the Gamma distribution with shape n + 1 and rate q_2(x) − n log α, so β̂_B_c(x|α) = (n + 1 + c)/(q_2(x) − n log α). Hence we can take c_0 = β_0(q_2(x) − n log α_0) − (n + 1) such that β̂_B_{c_0}(x|α_0) is the true value of β.

Bayesian Posterior Prediction
Let X̃ ∼ π(x̃; α, β) be a value yet to be observed from the Pareto distribution. In the sense of the posterior distribution (13), if the sample x is given, we can make relevant posterior predictions of X̃. The discussion is divided into the following three cases.

1. When both α and β are unknown, we have

   m(x̃|x) = ∫∫ π(x̃; α, β) π(α, β|x) dα dβ.

2. When α is known and β is unknown, we have

   m(x̃|x, α) = ∫_0^∞ π(x̃; α, β) π(β|x, α) dβ.

3. When β is known and α is unknown, we have

   m(x̃|x, β) = ∫_0^{x_(1)} π(x̃; α, β) π(α|x, β) dα.

For the above posterior predictive distributions, given the prediction credibility k, we can make Bayesian prediction inference in practical applications. The specific process is as follows. From ∫_{x̃_L}^{x̃_U} m(x̃|x) dx̃ = k, multiple pairs (x̃_L, x̃_U) can be obtained. By choosing the upper and lower bounds for X̃ so that x̃_U − x̃_L is smallest, we obtain the highest prediction accuracy.
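In the degenerate situation where both parameters happen to be known, the predictive density is the Pareto density itself, which is decreasing on [α, ∞); the shortest interval with coverage k therefore starts at x̃_L = α, and x̃_U follows in closed form from the CDF. A minimal sketch of the interval-selection step described above:

```python
import math

def pareto_prediction_interval(alpha, beta, k):
    """Shortest interval with coverage k when both parameters are known.

    The Pareto density is decreasing on [alpha, inf), so the shortest
    interval with P(x_L <= X <= x_U) = k starts at x_L = alpha; solving
    F(x_U) - F(alpha) = k with F(x) = 1 - (alpha/x)**beta gives x_U.
    """
    x_l = alpha
    x_u = alpha * (1.0 - k) ** (-1.0 / beta)
    return x_l, x_u

def pareto_cdf(x, alpha, beta):
    return 0.0 if x < alpha else 1.0 - (alpha / x) ** beta

# Hypothetical parameter values matching the simulation setting (1.5, 1.5).
x_l, x_u = pareto_prediction_interval(1.5, 1.5, 0.9)
coverage = pareto_cdf(x_u, 1.5, 1.5) - pareto_cdf(x_l, 1.5, 1.5)
```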

Simulation
In practice, proposed algorithms for maritime radar target detection need to be tested and verified on sea clutter data. In order to characterize the sea clutter better, it is often necessary to estimate the parameters of the sea clutter model. Therefore, in this section, we use the conclusions of Section 4 to estimate the parameters of the Pareto model of sea clutter and show the simulation results.

The Influence of Parameters on Sea Clutter
In this subsection, we show the effect of the scale parameter α and the shape parameter β on sea clutter. Figures 1 and 2 show the probability density curves of the Pareto distribution with respect to the two parameters. It can be seen from the figures that when the scale parameter α is larger, the density curve is flatter: the proportion of small clutter amplitudes increases, and the whole curve declines gently. As the shape parameter β becomes larger, the proportion of small clutter amplitudes increases significantly and becomes more concentrated, and the tail descends faster. On the whole, for the Pareto model, the energy is concentrated at small clutter amplitudes and the trailing phenomenon is apparent. The essential reason is that when the radar is at grazing incidence, the overall backscattered echo is relatively weak.

Various Types of Bayesian Estimation on Sea Clutter Models
In this subsection, we show the aforementioned Bayesian estimations for sea clutter. In order to generate random samples of the Pareto distribution with parameters α_0 and β_0, we use the inverse distribution function and apply the inverse transformation method to extract Pareto samples: X = α_0 U^(−1/β_0), where U is a uniformly distributed random variable on (0, 1]. We carry out numerical simulations where α_0 = 0.5, 1, 1.5 and β_0 = 0.5, 1, 1.5, respectively. Using the inverse transformation method, we generate 1000 random samples subject to the Pareto distribution. To show the geometry of the Pareto model of sea clutter, we take (α_0, β_0) = (0.5, 1 ± 0.5) as the center and draw the unit geodesic circle with a dotted line, together with 64 uniformly distributed geodesics in the directions θ_0 = kπ/32, k = 0, 1, …, 63, drawn with solid lines. See Figures 3 and 4. To describe the proximity between the estimated values of each group and the predetermined parameter value (α_0, β_0), we calculate the geodesic distance d{(α_0, β_0), (α̂(x), β̂(x))}. If the distance between the estimated value and the predetermined parameter value is small, we regard the estimation as accurate. By (22),

cosh d{(α_0, β_0), (α̂(x), β̂(x))} = 1 + (β_0 β̂(x)/2) [ (log α_0 − log α̂(x))² + (1/β_0 − 1/β̂(x))² ].

Hence the smaller |log α_0 − log α̂| and |β_0 − β̂| are, the more accurate the estimation is.
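The inverse-transform sampler described above can be sketched as follows. Here β_0 = 3 is used instead of the paper's grid so that the mean and median converge quickly; the tolerance values are loose numerical bounds, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_pareto(alpha0, beta0, size, rng):
    """Inverse-transform sampling: X = alpha0 * U**(-1/beta0), U ~ Unif(0, 1].

    Follows from inverting the CDF F(x) = 1 - (alpha0/x)**beta0, x >= alpha0.
    """
    u = 1.0 - rng.random(size)          # in (0, 1], avoids a zero divisor
    return alpha0 * u ** (-1.0 / beta0)

alpha0, beta0 = 1.5, 3.0
samples = sample_pareto(alpha0, beta0, 100_000, rng)

empirical_min = samples.min()                       # support check: >= alpha0
empirical_median = np.median(samples)               # theory: alpha0 * 2**(1/beta0)
theoretical_median = alpha0 * 2.0 ** (1.0 / beta0)
```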
Next we will make a comparative analysis of various types of Bayesian estimations.

Mean Geodesic Estimation and the Common Bayesian Estimations
Case 1. Both the scale parameter α and the shape parameter β are unknown. From Table 1 we know that |α̂_E − α̂_MGE| and |β̂_E − β̂_MGE| are less than 10^(−4), hence α̂_E and α̂_MGE are almost equal. Since α̂_E does not have an explicit expression, α̂_MGE can take the place of α̂_E, and it also has a more precise geometric explanation. In most simulation tests, (α̂_MGE, β̂_MGE) is more accurate than (α̂_MLE, β̂_MLE) and (α̂_Me, β̂_Me). Hence, in general, the proposed MGE is better than the common Bayesian estimations.
Case 2. Either the shape parameter β or the scale parameter α is known. When one parameter is known, the statistical manifold degenerates to a one-dimensional submanifold. Hence taking the Euclidean distance or the geodesic distance does not make much difference. This can be seen in Tables 2 and 3.
Table 2. Mean geodesic estimations and the common Bayesian estimations (β is known).
Table 3. Mean geodesic estimations and the common Bayesian estimations (α is known).
Comparing Tables 2 and 3, when β is known the error of the estimations is on the order of 10^(−3), and when α is known the error of all kinds of estimations is on the order of 10^(−2). Therefore, the accuracy of the various estimations improves when β is known. This indicates that the scale parameter α is more easily obtained from samples in the sea clutter model and has strong robustness, while the shape parameter β is more sensitive than the scale parameter α.
Case 1. Both the scale parameter α and the shape parameter β are unknown. When α and β are unknown, the variation trends of the Bayesian estimations of the two parameters under Al-Bayyati's loss function with respect to the parameter c are shown in Figures 5 and 6, respectively. When β is unknown, by Theorem 4 we have α̂_E = α̂_B_0 ≤ α̂_B_c ≤ α̂_MLE for c ≥ 0. Figure 5 shows the case when α̂_E ≥ α_0; hence, from the discussion in Remark 1, α̂_E is the closest estimate among all α̂_B_c as positive c increases. When α is unknown, by Theorem 4, β̂_B_c = ((n + c)/n) β̂_MLE. When c = 0, β̂_B_0 = β̂_MLE = β̂_E, and when c_0 = n(β_0/β̂_MLE − 1), β̂_B_{c_0} = β_0. As shown in Figure 6, there always exist infinitely many c such that β̂_B_c is closer to the true value than the common Bayesian estimations, and the choice c_0 = n(β_0/β̂_MLE − 1) recovers exactly the real value of the parameter β.
These two figures also show that the MGE are better than the common Bayesian estimations when both parameters are unknown. Therefore, when α and β are unknown, to obtain closer estimations we can adjust c_1 and c_2 to make |log α_0 − log α̂_{c_1}| and |β_0 − β̂_{c_2}| smaller, and even attain the minimum value. Through the previous discussions, the choice of the best c_1 depends on the order relation among the real parameter α_0, α̂_MLE and α̂_E. The best c_2 is c_2 = n(β_0/β̂_MLE − 1).

Case 2. Either the shape parameter β or the scale parameter α is known. When β or α is known, the variation trend of the Bayesian estimation of the parameter α or β under Al-Bayyati's loss function with respect to c is shown in Figure 7 or Figure 8, respectively. When β = β_0, by Theorem 5, we get

α̂_B_c(x|β_0) = ((nβ_0 + c)/(nβ_0 + c + 1)) α̂_MLE.

Hence either α̂_MLE is the true value of α, or we can take c_0 = α_0/(α̂_MLE − α_0) − nβ_0 such that α̂_B_{c_0}(x|β_0) is the true value of α. This is shown in Figure 7. When α = α_0, by Theorem 5, we get

β̂_B_c(x|α_0) = (n + 1 + c)/(q_2(x) − n log α_0).

Simulation of Posterior Predictive Distribution
In order to observe the simulation effect of the posterior predictive distribution, using the samples generated in Section 5.2, we draw the posterior predictive distribution of sea clutter together with the real Pareto distribution π(x|α_0, β_0) of sea clutter, where (α_0, β_0) = (1.5, 1.5), for comparative analysis. See Figures 9-11.

Case 1. Both α and β are unknown. The image is shown in Figure 9. The blue curve represents the probability distribution of sea clutter π(x|α_0, β_0), which takes positive values to the right of the boundary point x = α_0. The orange curve represents the posterior predictive distribution of sea clutter m(x̃|x), which changes continuously for x̃ > 0 but forms a cusp at x̃ = α̂_MLE. Comparing the two curves, the predictive distribution of sea clutter is a connected continuous curve and shifts slightly to the left. It is worth noting that although m(x̃|x) tends to infinity as x̃ → 0⁺, this is not reflected in the image and can be ignored in the actual calculation of probabilities.

Case 2. α is known and β is unknown. The posterior predictive distribution of sea clutter is

m(x̃|x, α) = (n + 1)(q_2(x) − n log α)^(n+1) / { x̃ [q_2(x) + log x̃ − (n + 1) log α]^(n+2) } · I_[x̃ > α].
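The case-2 density above can be sanity-checked numerically with a hypothetical toy sample x: substituting t = log(x̃/α) shows that it integrates to exactly 1, and a trapezoidal integration on a log-spaced grid should reproduce this:

```python
import math

def predictive_density(x_tilde, xs, alpha):
    """Posterior predictive m(x~ | x, alpha) when alpha is known.

    Derived from the Gamma(n+1, lam) conditional posterior of beta,
    with lam = q2(x) - n*log(alpha) and q2(x) = sum_i log(x_i).
    """
    n = len(xs)
    lam = sum(math.log(x) for x in xs) - n * math.log(alpha)
    if x_tilde < alpha:
        return 0.0
    return (n + 1) * lam ** (n + 1) / (
        x_tilde * (lam + math.log(x_tilde / alpha)) ** (n + 2))

alpha0 = 1.5
xs = [2.0, 3.0, 5.0, 2.5, 4.0]          # hypothetical toy sample

# Trapezoidal rule on a log-spaced grid (the density has a heavy tail).
steps, upper = 100_000, 1e12
grid = [alpha0 * (upper / alpha0) ** (i / steps) for i in range(steps + 1)]
total = sum(0.5 * (predictive_density(a, xs, alpha0)
                   + predictive_density(b, xs, alpha0)) * (b - a)
            for a, b in zip(grid[:-1], grid[1:]))
```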
It can be seen from Figure 10 that the probability distribution π(x|α_0, β_0) (the blue curve) and the posterior predictive distribution m(x̃|x, α) (the orange curve) both take positive values only to the right of the boundary point x = α_0. There is a very high degree of overlap, which means that when α is known the prediction is very accurate. We conclude that more effective information can be obtained for the parameter α than for β.
Case 3. β is known and α is unknown. The probability distribution π(x|α_0, β_0) (the blue curve) takes positive values to the right of the boundary point x = α_0. The posterior predictive distribution m(x̃|x, β) (the orange curve) changes continuously for x̃ > 0 and forms a cusp at x̃ = α̂_MLE. Comparing these two curves, the posterior predictive distribution shows a significant right shift, and the simulation effect is not ideal near x̃ = α̂_MLE. However, as x̃ increases, the two curves gradually coincide and the prediction accuracy becomes higher. Therefore, when β is known and α is unknown, the larger the clutter amplitude to be observed, the higher the prediction accuracy.
To sum up, for the sea clutter model, the Bayesian posterior prediction results under the above three conditions are satisfactory, and the prediction model can well reflect the trailing characteristics of sea clutter.

Conclusions and Future Work
In this paper, we presented systematic methods for Bayesian inference from geometric viewpoints and applied them to the Pareto model. We carried out simulations on sea clutter to show their effectiveness.

For the Pareto model, no general α-parallel prior exists. Using the Jeffreys prior together with the geodesic distance and Al-Bayyati's loss function, we obtained two new classes of Bayesian estimations. We call the estimation in the sense of mean geodesic distance the MGE, and it is proved that the MGE has the following advantages: it has an explicit expression, and it is more accurate than the common Bayesian estimations, as shown in our simulations. We also proved that the estimations under Al-Bayyati's loss function can be more accurate than the common Bayesian estimations; in fact, there are infinitely many c such that the new estimations are better. These results are important for parameter estimation when studying sea clutter models. Finally, we showed that the Bayesian posterior prediction results reflect the trailing characteristics of sea clutter well in every case.

In the future, more in-depth research is worth pursuing. From the statistical viewpoint, we can apply Bayesian inference for the Pareto model to non-linear regression models [20]. From the geometric viewpoint, we expect to generalize our framework and combine more tools from information geometry. We also want to carry out more experiments and applications in different fields.