1. Introduction
A relevant task in statistics and machine learning is the comparison of probability distributions [
1]. This problem has been tackled from a wide range of paths, including statistical tests and, closer to this contribution, an Information Geometry perspective [
2,
3]. In Information Geometry, a probability function is represented by a vector of parameters, and these parameters define a statistical manifold. In this manifold, the application of concepts from differential geometry helps unveil relevant properties of the distribution of points, representing probability distributions, such as geodesics [
4], clusters [
5], and anomalies [
6].
Two random variables, representing a system in two different conditions or two processes compared at the same condition, can be compared from several perspectives. In
Figure 1A left, a collection of 1000 samples or measurements was collected from a Gaussian distribution with parameters
. In the same figure, on the right, is a collection of 1000 samples drawn from a uniform distribution in the range
. From each of these two collections, histograms can be computed, as shown in
Figure 1B. There, the empirical parameters, obtained from the samples, are displayed:
for the Gaussian distributions and
for the uniform distributions. These two histograms can be compared in a number of ways, such as a Kolmogorov–Smirnov test [
7,
8], by computing the 1-D Wasserstein distance or by calculating the Aitchison distance. In all three cases, the comparison is made directly over the histograms.
There are alternative ways to compare histograms or probability functions. Some probability functions can be represented by their mean $\mu$ and standard deviation $\sigma$. These two parameters define a statistical manifold [
2,
4], and its geometry can be captured by the Fisher–Rao distance [
9]. Probability distributions can be compared in this manifold using relevant distance functions or relying on relevant divergence metrics, as depicted in
Figure 1C, where the parameters derived from the distributions are the coordinates in the manifold.
In this contribution, we compare probability distributions along a path that differs from applying statistical tests and is closely related to Information Geometry, where one of the main objectives is computing distances in a statistical manifold. In
Figure 1D, both left and right, the 1000 samples shown in
Figure 1A are shown on the x-axis, and the error associated with each measurement is shown on the y-axis. This error is the mean square error (MSE) associated with an observation $x$, considering $x$ as the best descriptor of the elements in the sample or collection. For clarity, two specific points, a red square and a blue triangle, are marked for both distributions in
Figure 1A,D. For the Gaussian distribution, it is observed that the red square has a large value (A), and thus the error associated with it is rather high, as shown in D. For the blue triangle, located near the mean of the distribution, the error is close to the minimum, which corresponds to the empirical mean. The opposite is observed for the uniform distribution, where the measurement for the red square is located near the mean, and thus its error is low, whereas the blue triangle presents a low measurement and thus a high error (D).
The core idea of this contribution is based on discrete probability distributions and can be described as follows. From a list of observations $L$ depicting measurements of a random variable, its probability distribution, or its characterization as a histogram, is obtained. We characterize $L$ via its mean square error, considering each value in turn as the true descriptor of the distribution, that is, as the value with the minimum square error. We will refer to this characterization as $\mathrm{MSE}_L$ or, if the context allows disregarding $L$, as MSE. The MSE is a function that assigns an error to each observation when that observation is considered the true value. The MSE presents some interesting properties. One such property is its characterization as a second-degree polynomial, whose coefficients are to be identified for the next steps in the analysis. We show that the linear and constant coefficients of the MSE polynomials induce a manifold that can itself be represented by a second-degree polynomial. In this manifold, we can compare probability distributions from a different perspective and gain insight into the closeness or resemblance among the studied distributions.
Figure 2 shows the main argument of this contribution. In
Figure 2A, three probability functions are depicted, namely
.
comes from a Gaussian distribution with parameters
depicted as a solid blue line, from which one hundred samples were drawn and shown as gray dots. Similarly,
shows a Gaussian distribution with parameters
displayed as a solid red line. Again, one hundred samples were drawn from that distribution and shown as gray dots.
is a uniform distribution in the range
or as
. The three histograms can be compared using the 1D Wasserstein distance (
) or the Aitchison distance [
1,
10]. The probability functions can be compared in the manifold defined by their parameters $\mu$ and $\sigma$ (
Figure 2B). In this statistical manifold, the Fisher–Rao distance (
) can be applied to compare the functions. However, we follow a different path. For each probability function, its MSE can be computed. In
Figure 2C, for the distributions
, their
are shown as a function of the measurements
in each distribution.
The MSE can be represented by a second-degree polynomial whose quadratic coefficient is always 1, as also shown in
Figure 1D. In
Figure 2C, the polynomial that fits the
of each of the three distributions
is shown, with their corresponding equations:
,
and (
). The parameters
and
are specific for each distribution but can be derived, as will be detailed in
Section 2, from the mean and standard deviation of the distributions. The linear and constant parameters of the polynomial representing
define a manifold in which probability functions can be compared (
Figure 2D). Here, each point comes from the coefficients that define the second-degree polynomial of the MSE. Orange dots represent Gaussian distributions, while green dots represent uniform distributions. This new manifold is also described by a second-degree polynomial. The comparison we suggest is conducted in the manifold defined by the linear and constant coefficients of the MSE polynomial.
In
Figure 2D, each dot represents the MSE of a probability function. The green circle represents the uniform distribution, whereas the red and blue triangles represent the Gaussian distributions from
Figure 2A. The rest of the points in
Figure 2D are linked to Gaussian (orange triangles) or uniform (pale green circles) distributions.
The MSE takes the mean and the standard deviation into account in its definition. In that sense, our proposal is closely related to Information Geometry, where a statistical manifold, defined precisely by the mean and standard deviation of distributions, is the structure in which the analysis is conducted.
The motivation of the approach presented in this contribution is to compare distributions, or collections of limited data, from a different perspective. Transforming a distribution or a collection of observations into a representation based on the mean square error may be of interest, since it can capture, as we show here, interesting properties and can compare distributions under a different set of assumptions. In that sense, having such an approach expands the set of available paths for comparing distributions.
This contribution continues as follows. In
Section 2, the mean square error is described, focusing on the second-degree polynomial that defines it. In
Section 3, we describe the relevant attributes of the manifold induced by the coefficients of the second-degree polynomial of the MSE. We continue to
Section 4, where we offer computational evidence for the relevance of comparing probability functions in the manifold induced by the parameters of the polynomial fitting of the MSE. Finally, in
Section 5, we offer some conclusions, limitations, and future work related to our contribution.
2. Mean Square Error and Its Approximation by Polynomials
The mean square error, or MSE, of a random variable $X$ is defined as $\mathrm{MSE}(\theta) = E[(X - \theta)^2]$, where $\theta$ is an estimator of the distribution [11, 12]. Here, $\theta$ is any value in the range of $X$, and the MSE achieves its global minimum at the mean of $X$ [13]. In this sense, $X$ can be represented by its MSE along the values $\theta$.
Let $P$ be the discrete probability distribution obtained from a collection of observations $L$. Let $x \in L$ be an observation or measurement, and let $n$ be the cardinality of $L$. The MSE associated to each measurement $x$ is defined as
$$\mathrm{MSE}(x) = \frac{1}{n} \sum_{y \in L} (y - x)^2. \qquad (1)$$
The range of the MSE is the same as the range of the probability distribution; that is, both have the same support. Although the minimum of the MSE corresponds to the mean of $L$, we do not intend to replace $L$ by a single scalar, for example, its mean. The rationale for computing the MSE for each value is to use it to represent $L$ and its probability distribution from a different perspective and to compare that representation with similar representations from other collections of observations.
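As a minimal sketch of the per-observation error described above, the following Python snippet evaluates the MSE at every observation of a sample. The function name `mse_profile`, the seed, and the Gaussian parameters are illustrative assumptions and are not the values used in the figures.

```python
import numpy as np

def mse_profile(observations):
    """Return, for every x in L, the mean of (y - x)^2 over all y in L."""
    L = np.asarray(observations, dtype=float)
    # Pairwise differences: rows index the candidate x, columns index y.
    diffs = L[None, :] - L[:, None]
    return (diffs ** 2).mean(axis=1)

rng = np.random.default_rng(0)            # fixed seed, illustrative only
sample = rng.normal(loc=0.0, scale=1.0, size=200)
errors = mse_profile(sample)

# The observation closest to the sample mean attains the smallest error.
closest_to_mean = np.argmin(np.abs(sample - sample.mean()))
print(np.argmin(errors) == closest_to_mean)
```

Because the error grows with the squared distance to the sample mean, the whole profile, and not only its minimum, carries information about the spread of the observations.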
The evaluation of the MSE on its support can be described by a second-degree polynomial. It is important to identify the coefficients of this polynomial in order to generate a new space, defined by those coefficients, in which random variables can be compared following the proposal described in this contribution. The assertion that the MSE can be fitted by a second-degree polynomial constitutes the first lemma of this contribution, and we proceed to prove it.
Lemma 1. The MSE of a probability distribution is represented by a second-degree polynomial with a quadratic coefficient equal to 1.
Proof. To prove Lemma 1, we have to find a representation of the MSE as a second-degree polynomial. A second-degree polynomial $p$ evaluated in $x$ is of the form $p(x) = ax^2 + bx + c$. The representation of the second-degree polynomial for the MSE is as follows:
$$\mathrm{MSE}(x) = E[(X - x)^2] = E[X^2 - 2xX + x^2] = x^2 - 2E[X]\,x + E[X^2]. \qquad (2)$$
By inspecting Equation (2), it is observed that the polynomial that fits the MSE is of the form:
$$\mathrm{MSE}(x) = x^2 + \beta x + \gamma. \qquad (3)$$
Here, the parameters $\beta$ and $\gamma$ are defined as
$$\beta = -2E[X], \qquad \gamma = E[X^2]. \qquad (4)$$
Thus, the MSE can be represented by a second-degree polynomial with quadratic coefficient 1, linear coefficient equal to $-2E[X]$, and constant coefficient equal to $E[X^2]$. □
Given that the definition of the MSE involves the square of a difference of two quantities, $(X - x)^2$, a direct expansion of this term leads to the canonical form of a second-degree polynomial. This is possible because of the linearity of expectation in the definition of the MSE.
It is important to note that the fitting of the MSE of a random variable by a second-degree polynomial is not a mere approximation. The MSE is in fact defined in terms of a quadratic relation between elements in a random variable or collection of observations. Lemma 1 shows how to find the coefficients of such a polynomial, based on the definition of the MSE. Since the quadratic coefficient is always 1, we will not pay attention to it, and our focus will be on the linear and constant coefficients.
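The exactness claimed above can be checked numerically. The sketch below compares the MSE computed directly from its definition with the polynomial whose quadratic coefficient is 1, whose linear coefficient is minus twice the sample mean, and whose constant coefficient is the mean of the squared observations. The names `mse_coefficients`, `beta`, and `gamma`, as well as the uniform range, are assumptions made for illustration.

```python
import numpy as np

def mse_coefficients(observations):
    """Linear and constant coefficients of the MSE polynomial of a sample."""
    L = np.asarray(observations, dtype=float)
    beta = -2.0 * L.mean()        # linear coefficient
    gamma = (L ** 2).mean()       # constant coefficient
    return beta, gamma

rng = np.random.default_rng(1)    # fixed seed, illustrative only
sample = rng.uniform(-1.0, 3.0, size=300)
beta, gamma = mse_coefficients(sample)

# MSE evaluated directly at every observation.
direct = ((sample[None, :] - sample[:, None]) ** 2).mean(axis=1)
# MSE evaluated through the polynomial x^2 + beta * x + gamma.
poly = sample ** 2 + beta * sample + gamma
print(np.allclose(direct, poly))
```

No curve fitting is involved here: the two evaluations agree up to floating-point rounding, which is the sense in which the polynomial is an identity rather than an approximation.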
The linear coefficient of the polynomial fitting the MSE of a random variable or a distribution is equal to $-2E[X]$. We need to express the constant coefficient in terms that allow us to relate both coefficients, the linear one, $\beta$, and the constant one, $\gamma$. These two terms, $\beta$ and $\gamma$, will define the foundations of a manifold that will allow the comparison of distributions following a path different from that of comparing means and standard deviations, as in state-of-the-art Information Geometry. In order to express $\gamma$ consistently with $\beta$, we continue to Lemma 2.
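Lemma 2. For any random variable, the constant coefficient of the polynomial fitting its MSE is equal to $\sigma^2 + \mu^2$.
Proof. We have to prove that $E[X^2] = \sigma^2 + \mu^2$, since from Lemma 1, we already know that the constant coefficient is $\gamma = E[X^2]$. The variance of $X$ is defined as
$$\sigma^2 = E[(X - E[X])^2]. \qquad (5)$$
Since $E[X] = \mu$, and $E[(X - \mu)^2] = E[X^2] - 2\mu E[X] + \mu^2$, and $-2\mu E[X] + \mu^2 = -\mu^2$, we can rearrange it as
$$E[X^2] = \sigma^2 + \mu^2 \qquad (6)$$
(König–Huygens equation [14]). □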
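The identity used in Lemma 2 holds for any sample, not only for Gaussian ones. The following sketch checks it on Poisson-distributed observations; the rate and sample size are arbitrary, illustrative choices, and the variance is the population-style one (`ddof=0`), which is the version for which the identity is exact.

```python
import numpy as np

rng = np.random.default_rng(2)            # fixed seed, illustrative only
sample = rng.poisson(lam=4.0, size=500).astype(float)

mean = sample.mean()
variance = sample.var()                   # ddof=0: divides by n
second_moment = (sample ** 2).mean()      # constant coefficient of the MSE polynomial

# König–Huygens: E[X^2] = sigma^2 + mu^2.
print(np.isclose(second_moment, variance + mean ** 2))
```

With the unbiased estimator (`ddof=1`) the two sides differ by a factor that vanishes as the sample grows, so the choice of estimator matters only for small collections.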
The linear and constant coefficients of the MSE-fitting polynomial ($\beta = -2\mu$ and $\gamma = \sigma^2 + \mu^2$), expressed in terms of the expected value (mean) and of the standard deviation, are important to define our approach. These two coefficients will allow the creation of a new space in which distributions or random variables can be compared using simple differential geometry tools, as detailed in
Section 3.
For simplicity, we refer to the MSE of the distribution obtained from $L$ as $\mathrm{MSE}_L$, where, again, $L$ refers to the observations from which the distribution was computed. The second-degree polynomial of the MSE of a distribution, namely $\mathrm{MSE}_L$, is determined by its linear and constant coefficients, $\beta$ and $\gamma$. Thus, $\mathrm{MSE}_L$ refers to both the second-degree polynomial and its linear and constant coefficients (since the quadratic coefficient is always 1).
Let $P_1$ and $P_2$ be two Gaussian distributions, with means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, respectively. Their corresponding characterizations in terms of the MSE are $\mathrm{MSE}_{L_1}$ and $\mathrm{MSE}_{L_2}$, evaluated over the same range for both distributions. We observe, following Lemma 2, that $\mathrm{MSE}_{L_1} = \mathrm{MSE}_{L_2}$ only when their corresponding means and standard deviations are the same.
For distributions other than Gaussians, following Lemma 1, things are not as easy. Indeed, since the linear and constant coefficients are derived from the expected values $E[X]$ and $E[X^2]$, it may occur that two distributions with the same mean and variance, but otherwise different, are represented by the same parameters. This is an open problem, but for the sake of our argument, we will leave it aside for the moment and go back to it in
Section 4. However, if a distribution can be approximated by its mean and standard deviation, as is the case for uniform and Poisson distributions, the approach depicted so far holds.
Our goal is to explore the properties of the space defined by the linear and constant coefficients of the second-degree polynomial fitting the MSE of probability distributions. Our approach does not compare two probability distributions directly, for example, by computing the Aitchison or the Wasserstein distance, nor by computing the Fisher–Rao distance in the statistical manifold defined by the parameters of the family of distributions. Instead, we obtain for each of the compared distributions its representation as the MSE and, from it, the linear and constant coefficients of the second-degree polynomial fitting the MSE. These coefficients define a new structure that we will call $\psi$, which is a manifold in which the comparison of probability distributions can be carried out. The next section describes the properties of this manifold $\psi$.
The MSE of a distribution is defined by the mean and standard deviation. In that sense, the parameters $(\beta, \gamma)$, that is, the linear and constant coefficients of the polynomial linked to the MSE, can be seen as an alternative but equivalent way to define spaces or manifolds in which to compare distributions.
In this section, we have reviewed how any probability function can be represented by its MSE. In the next section, we will review the manifold induced by the linear and constant coefficients of the second-degree polynomial describing the MSE.
3. The Geometry of the ψ Manifold Induced by the Linear and Constant Parameters of MSE
The linear and constant coefficients of the second-degree polynomial describing the MSE of a distribution do not scatter randomly. In fact, as seen in
Figure 2D, the constant coefficient, $\gamma$, is nonlinearly related to the linear coefficient, $\beta$. The distribution of $\beta$ and $\gamma$ defines a manifold that will be called $\psi$. In this section, some of its properties are explored. For that, we recall that $\beta = -2\mu$ and $\gamma = \sigma^2 + \mu^2$. From this, it is already clear that there is a nonlinear relationship between these two coefficients. We will expand on this in the next paragraphs.
From
Figure 2D and Figure 5A–E in the first column, it can be observed that the points $(\beta, \gamma)$ from the MSE of several probability distributions follow a clear pattern. This pattern describes a nonlinear relation between $\beta$ and $\gamma$. The hypothesis is that the relationship between these two parameters can be expressed as a second-degree polynomial, $\gamma = a\beta^2 + b\beta + c$. We want to obtain the values of $a$, $b$, and $c$. For that, we present Lemma 3.
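Lemma 3. The constant coefficient of the second-degree polynomial fitting the MSE, $\gamma$, can be expressed as a second-degree polynomial of $\beta$, that is, the linear coefficient of the MSE polynomial.
Proof. To prove Lemma 3, we explicitly define the relation between $\beta$ and $\gamma$:
$$\gamma = a\beta^2 + b\beta + c. \qquad (7)$$
From Lemmas 1 and 2, we have that $\beta = -2\mu$, and $\gamma = \sigma^2 + \mu^2$. From here, we have that $\mu = -\beta/2$, and since $\gamma = \sigma^2 + \mu^2$, we can replace $\mu$ to reach $\gamma = \sigma^2 + \beta^2/4$. From (7), it follows that $a = 1/4$, $b = 0$, and $c = \sigma^2$. □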
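The relation between the linear coefficient $\beta$ and the constant coefficient $\gamma$ is defined by a polynomial of second degree, with quadratic coefficient $1/4$, linear coefficient 0, and constant term $\sigma^2$. This relation is based on the definitions of $\beta$ and $\gamma$. In that sense, it is not a mere approximation but an equivalence. Equation (8) explicitly relates $\gamma$ to $\beta$:
$$\gamma = \frac{\beta^2}{4} + \sigma^2. \qquad (8)$$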
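The relation between the two coefficients can be verified on data. In the sketch below, `psi_coordinates` is a hypothetical helper that maps a sample to its linear and constant MSE coefficients; the check confirms that the constant coefficient equals one quarter of the squared linear coefficient plus the sample variance. The Gaussian parameters are arbitrary.

```python
import numpy as np

def psi_coordinates(observations):
    """Map a sample to (beta, gamma), its linear and constant MSE coefficients."""
    L = np.asarray(observations, dtype=float)
    return -2.0 * L.mean(), (L ** 2).mean()

rng = np.random.default_rng(3)            # fixed seed, illustrative only
sample = rng.normal(loc=1.5, scale=0.7, size=1000)
beta, gamma = psi_coordinates(sample)

# Right-hand side of the relation, with the population-style sample variance.
predicted_gamma = beta ** 2 / 4.0 + sample.var()
print(np.isclose(gamma, predicted_gamma))
```

The identity is exact for the empirical moments of each sample, so any scatter around a single parabola can only come from differences in the sample variances of the distributions being compared.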
We can interrogate $\psi$ from an Information Geometry approach. The first question that arises is that of distances. Given two Gaussian distributions $P_1$ and $P_2$, represented by their corresponding parameters in the statistical manifold $(\mu, \sigma)$, are their positions in this manifold and in $\psi$ correlated? In other words, we want to investigate whether the Fisher–Rao distance between $P_1$ and $P_2$ is correlated with the distance between their assigned positions in $\psi$. A related question is whether the 1-D Wasserstein distance between $P_1$ and $P_2$ is correlated with their distance in $\psi$.
In order to answer these questions, we have to explicitly define a distance in the $\psi$ space. Since our hypothesis holds, that is, there is a second-degree polynomial relating the constant coefficient $\gamma$ to the linear coefficient $\beta$, a natural way to measure the distance between two points, representing two probability distributions along the curve, is to compute the arc length, which defines the geodesic between these two points.
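Geodesics in $\psi$ space are the arcs linking the points. The length of the arc from $\beta_1$ to $\beta_2$ (with $\beta_1 \le \beta_2$) along the curve is
$$s = \int_{\beta_1}^{\beta_2} \sqrt{1 + \left(\frac{d\gamma}{d\beta}\right)^2}\, d\beta.$$
For $\gamma = \beta^2/4 + \sigma^2$:
$$\frac{d\gamma}{d\beta} = \frac{\beta}{2}.$$
It follows that
$$s = \int_{\beta_1}^{\beta_2} \sqrt{1 + \frac{\beta^2}{4}}\, d\beta.$$
Let $\beta = 2\sinh t$; so, $d\beta = 2\cosh t\, dt$, and $\sqrt{1 + \beta^2/4} = \cosh t$.
When $\beta = \beta_1$, $t_1 = \operatorname{arcsinh}(\beta_1/2)$.
When $\beta = \beta_2$, $t_2 = \operatorname{arcsinh}(\beta_2/2)$, and
$$s = \int_{t_1}^{t_2} 2\cosh^2 t\, dt = \Big[\, t + \sinh t \cosh t \,\Big]_{t_1}^{t_2}.$$
We proceed to substitute $\sinh t = \beta/2$ and $\cosh t = \sqrt{1 + \beta^2/4}$, and we reach
$$s = \left[\operatorname{arcsinh}\frac{\beta}{2} + \frac{\beta}{4}\sqrt{\beta^2 + 4}\right]_{\beta_1}^{\beta_2}. \qquad (9)$$
The arc length between two points $\beta_1$ and $\beta_2$ located along a second-degree polynomial with equation $\gamma = \beta^2/4 + \sigma^2$ is defined by Equation (9).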
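The arc length along a parabola with quadratic coefficient 1/4 and no linear term has a closed form, which the following sketch implements and cross-checks against a simple numerical integration. The function names and the evaluation points are assumptions chosen for illustration; the closed form used is the standard antiderivative of $\sqrt{1 + \beta^2/4}$.

```python
import math

def arc_length(beta1, beta2):
    """Closed-form arc length along gamma = beta**2 / 4 + c between two abscissas."""
    def antiderivative(b):
        # Antiderivative of sqrt(1 + b**2 / 4).
        return 0.25 * b * math.sqrt(b * b + 4.0) + math.asinh(0.5 * b)
    return abs(antiderivative(beta2) - antiderivative(beta1))

def arc_length_numeric(beta1, beta2, steps=100_000):
    """Midpoint-rule approximation of the same integral, used as a cross-check."""
    lo, hi = sorted((beta1, beta2))
    h = (hi - lo) / steps
    total = sum(math.sqrt(1.0 + (lo + (k + 0.5) * h) ** 2 / 4.0) for k in range(steps))
    return total * h

closed = arc_length(-2.0, 4.0)
numeric = arc_length_numeric(-2.0, 4.0)
print(abs(closed - numeric) < 1e-6)
```

The constant term of the parabola does not enter the computation, so the same function applies to every curve that shares the quadratic coefficient 1/4; it is also never shorter than the straight-line distance between the two points.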
The $\psi$ manifold, defined by the linear coefficient $\beta$ and the constant coefficient $\gamma$, is a space that allows the comparison of distributions. In order to reveal more of its structure,
Figure 3 shows several examples of distributions represented by their corresponding coefficients $\beta$ and $\gamma$ as the coordinates in $\psi$. In
Figure 3A, the different curves represent the second-degree polynomial for distributions with the same $\sigma$. Distributions on the same curve, and thus with equal $\sigma$, may have a different $\mu$, which is coded in the coordinate $\beta$. In
Figure 3B, we show the geodesic between points A and B, which corresponds to the arc length of the polynomial shown in Equation (8). For contrast, the Euclidean distance between A and B is also shown.
Figure 3C shows several distributions, obtained from Gaussians with $\mu$ and $\sigma$ in the indicated ranges. In contrast with
Figure 3A, here, the distributions, shown as colored points, present small variations in their $\sigma$. Each point, representing a distribution, has as its coordinates $\beta$ and $\gamma$, where, as discussed in the previous section, a probability distribution is represented by its MSE. Distributions with similar $\sigma$ fit the same curve. Each point represents a distribution of 1000 samples. In
Figure 3D, the case for uniform distributions is shown, for samples drawn from the specified intervals. Again, each point represents a distribution consisting of 1000 observations.
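The curvature $\kappa$ of a structure is of high relevance in Differential Geometry. The curvature of a function $f$ at point $x$ is defined as [15, 16]
$$\kappa(x) = \frac{|f''(x)|}{\left(1 + f'(x)^2\right)^{3/2}}. \qquad (10)$$
The curvature of a second-degree polynomial such as the one defined by Equation (8), with quadratic coefficient $1/4$ and abscissa $\beta$, is
$$\kappa(\beta) = \frac{1/2}{\left(1 + \beta^2/4\right)^{3/2}} = \frac{4}{\left(4 + \beta^2\right)^{3/2}}. \qquad (11)$$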
The curvature $\kappa$ of $\psi$ is defined in terms of $\beta$ and $\gamma$, the linear and constant coefficients of the second-degree polynomial fitting the MSE of a distribution. A natural question to ask is whether the notion of distance or geodesics is affected by the curvature. First,
Figure 4 shows $\psi$ (red) and the curvature associated with each point along it (blue), as defined in Equation (11). As is clear from the definition of the curvature of a parabola, the maximum curvature corresponds to the point where the derivative vanishes, that is, the vertex.
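The statement about the maximum curvature can be illustrated with the general curvature formula for a graph, applied to a parabola with quadratic coefficient 1/4. The function name `curvature` and the evaluation points below are illustrative assumptions.

```python
import math

def curvature(beta):
    """Curvature of gamma = beta**2 / 4 + c at abscissa beta."""
    first = beta / 2.0      # first derivative of gamma with respect to beta
    second = 0.5            # second derivative, constant for a parabola
    return abs(second) / (1.0 + first ** 2) ** 1.5

for beta in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(beta, curvature(beta))
```

The values are symmetric around the vertex and decay as the abscissa moves away from it, so distributions with means far from zero sit on nearly straight stretches of the curve.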
Once the curvature of $\psi$ is explicitly presented (
Figure 4A), we can address the question of how geodesics in $\psi$ are affected by curvature. In
Figure 4B, we show the expected Euclidean distance (red) from each point in $\psi$ (only the $\beta$ coordinate is displayed) to the rest of the points in $\psi$. Along it, in blue, we show the expected Fisher–Rao distance for the associated point in the $(\mu, \sigma)$ manifold. Each point in $\psi$ can be easily mapped to the $(\mu, \sigma)$ manifold since $\beta$ and $\gamma$ are defined in terms of $\mu$ and $\sigma$ (see Equations (2) and (7)). The expected Fisher–Rao distance has its minimum at the lowest point of the parabola, whereas the minimum expected Euclidean distance of points in $\psi$ does not correspond to the lowest point of $\psi$. This comes from the definition of Euclidean distance and will not be discussed here. For both the Euclidean and Fisher–Rao distances, it is observed that points closer to the vertex of $\psi$ tend to present a lower expected distance, with the exception of the already mentioned increase in the expected Euclidean distance at the vertex of $\psi$.
Figure 4C presents the expected arc length in $\psi$ as a function of $\beta$. Here, as for the Euclidean and Fisher–Rao distances, it is observed that points located at the extremes of $\psi$ present a larger expected arc length. Interestingly, close to the vertex, the expected arc length decreases more slowly than the Fisher–Rao and Euclidean distances. This suggests that the arc length is less affected by the curvature. To give more evidence for this assertion,
Figure 4D shows the relation between the expected Euclidean (red) and Fisher–Rao (blue) distances and the expected arc length for points in $\psi$. For the Euclidean case, the relation is almost linear, except for points in $\psi$ that are close to the vertex ($\beta \approx 0$), where the expected Euclidean distance increases. The relation between the Fisher–Rao distance and the arc length is increasing, although nonlinear.
The effect of the curvature on the distances is shown in
Figure 4E,F. In
Figure 4E, the expected Euclidean (red) and Fisher–Rao (blue) distances are shown as a function of the curvature $\kappa$ of $\psi$. It is observed that the expected Fisher–Rao distance decreases monotonically with the curvature of $\psi$. On the other hand, the expected Euclidean distance behaves differently, as it first decreases and then increases as a function of the curvature. Finally, in
Figure 4F, the expected arc length is presented as a function of the curvature $\kappa$. The expected arc length of points in $\psi$ decreases with the curvature, although, as observed, the effect of curvature is asymptotic. This means that the arc length of points in $\psi$ can be used to compare distributions even when the curvature of the associated parameters is relatively large. So far, this is only a conjecture, but if it holds, then our proposal would be useful for comparing distributions whose coordinates in $\psi$ present a large curvature.
A comparison of the distributions in $\psi$ is straightforward, as long as the distributions are located along the same curve. From Equation (8) and
Figure 3A, it is observed that distributions with the same $\sigma$ will be located on the same $\psi$ structure. If this is the case, the arc length that separates the points along the curve offers a natural distance function. If distributions do not have the same $\sigma$, our approach offers no better solution than the existing ones from Information Geometry. We will expand on these limitations and other properties of the approach defined here in
Section 4.
In this section, we reviewed the space defined by the linear and constant coefficients of the second-degree polynomial of the mean square error of a probability distribution. We observed that the distribution of these two parameters for several probability distributions is not random and, in some cases, can be fitted by a second-degree polynomial.
4. Computational Explorations
Most of the applications of Information Geometry deal with large collections of data. There, the parameters have to be inferred from data, and theoretical results, although valuable, are not directly applied. For instance, first and second moments of the distributions are to be computed from data. In that sense, it is relevant to compare the behavior of our proposed $\psi$ manifold with that of existing theoretical tools, such as the Fisher–Rao distance.
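The Fisher–Rao distance between two Gaussian distributions $P_1$ and $P_2$, with means $\mu_1$, $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$, respectively, defined in the statistical manifold induced by $(\mu, \sigma)$, is given by [8, 17, 18]
$$d_{FR}(P_1, P_2) = \sqrt{2}\, \ln \frac{\left\| \left(\tfrac{\mu_1}{\sqrt{2}}, \sigma_1\right) - \left(\tfrac{\mu_2}{\sqrt{2}}, -\sigma_2\right) \right\| + \left\| \left(\tfrac{\mu_1}{\sqrt{2}}, \sigma_1\right) - \left(\tfrac{\mu_2}{\sqrt{2}}, \sigma_2\right) \right\|}{\left\| \left(\tfrac{\mu_1}{\sqrt{2}}, \sigma_1\right) - \left(\tfrac{\mu_2}{\sqrt{2}}, -\sigma_2\right) \right\| - \left\| \left(\tfrac{\mu_1}{\sqrt{2}}, \sigma_1\right) - \left(\tfrac{\mu_2}{\sqrt{2}}, \sigma_2\right) \right\|}. \qquad (12)$$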
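For reference, the Fisher–Rao distance between two univariate Gaussians admits a well-known closed form under the $(\mu, \sigma)$ parametrization, written here in its inverse hyperbolic tangent version. The sketch below is an assumption about how such a distance could be coded and is not taken from the experiments of this contribution; the function name `fisher_rao_gaussian` is hypothetical.

```python
import math

def fisher_rao_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    if sigma1 <= 0.0 or sigma2 <= 0.0:
        raise ValueError("standard deviations must be positive")
    num = (mu1 - mu2) ** 2 + 2.0 * (sigma1 - sigma2) ** 2
    den = (mu1 - mu2) ** 2 + 2.0 * (sigma1 + sigma2) ** 2
    return 2.0 * math.sqrt(2.0) * math.atanh(math.sqrt(num / den))

# Equal means: the distance reduces to sqrt(2) * |ln(sigma2 / sigma1)|.
print(fisher_rao_gaussian(0.0, 1.0, 0.0, math.e))
# Equal standard deviations: only the difference between means contributes.
print(fisher_rao_gaussian(0.0, 1.0, 2.0, 1.0))
```

The equal-means case gives a convenient sanity check, since the distance then depends only on the ratio of the two standard deviations.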
Since the goal of this contribution is to verify whether the $\psi$ space offers an interesting perspective on the comparison of probability distributions, a natural question to ask is whether the Fisher–Rao distance in the $(\mu, \sigma)$ parameter space manifold is correlated with the arc length in $\psi$. A second relevant question is whether the 1-D Wasserstein distance between the histograms is correlated with the arc length (geodesic) in the $\psi$ manifold.
To gather empirical evidence to answer the two previous questions, we conducted several experiments.
Figure 5 shows several Gaussian distributions and their characterization. In
Figure 5A, we show the coordinates $(\beta, \gamma)$ of hundreds of Gaussian distributions with mean in the specified range and a very low $\sigma$. The points, that is, the distributions represented by their MSE coefficients, closely fit the theoretical polynomial $\gamma = \beta^2/4 + \sigma^2$.
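An experiment of this kind can be sketched as follows: many Gaussian samples with a small, shared standard deviation are mapped to their linear and constant MSE coefficients, and a second-degree polynomial is fitted to the resulting points. The range of means, the standard deviation, and the number of distributions are hypothetical values chosen for illustration, not those used for Figure 5.

```python
import numpy as np

rng = np.random.default_rng(4)                # fixed seed, illustrative only
sigma = 0.05                                  # hypothetical "very low" spread
means = rng.uniform(-3.0, 3.0, size=300)      # hypothetical range of means

betas, gammas = [], []
for mu in means:
    sample = rng.normal(loc=mu, scale=sigma, size=1000)
    betas.append(-2.0 * sample.mean())        # linear MSE coefficient
    gammas.append((sample ** 2).mean())       # constant MSE coefficient

# Least-squares fit of the constant coefficient as a quadratic in the linear one.
# np.polyfit returns the coefficients from the highest degree to the lowest.
a, b, c = np.polyfit(betas, gammas, deg=2)
print(a, b, c)
```

With a shared standard deviation, the fitted quadratic coefficient should stay close to 1/4 and the fitted linear coefficient close to 0, while the constant term tracks the common variance.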
In
Figure 5B–E, the $\psi$ space (first column) is shown again for Gaussian distributions with a mean in the indicated range and an increasing range for $\sigma$. The distributions or, more specifically, the coefficients $\beta$ and $\gamma$ that define the MSE of the distributions are shown as gray dots. Each dot is a Gaussian with mean and standard deviation in the indicated ranges. The solid blue line is the second-degree polynomial that fits $\beta$ and $\gamma$. From Equation (7) and the subsequent text, we have $a = 1/4$, and $b = 0$.
For the first case shown in
Figure 5B, first column, and from the generated data when
, we have
. When the range of variation for
increases (
), as shown in
Figure 5C, the equation of the polynomial fitting the points is
. When
varies in a wider range (
), the equation is
(
Figure 5D). Finally, when
is allowed to vary even more (
), the polynomial is
(
Figure 5E).
The constant term $c$ of the fitted polynomial increases with $\sigma$, as expected from Equation (7). It is also observed that, as $\sigma$ increases, the points are more scattered around the fitting polynomial, which is expected because the distributions can present a wider range of variances. Finally, even though the linear term $b$ was supposed to be null, from the derivation of Equation (7), when working with samples, it presents a value greater than 0 that increases with $\sigma$.
In the second column of
Figure 5B–E, the arc length, computed by Equation (9), is shown as a function of the 1-D Wasserstein distance. It is clear that the correlation is not perfect, although a peculiar pattern appears. The relation between the Fisher–Rao distance (FR, Equation (12)) in the statistical manifold induced by $(\mu, \sigma)$ and the arc length, shown in the third column, is far from perfect. This reflects the different nature of comparing probability distributions in the two manifolds. The last column shows the scatter plot of the 1-D Wasserstein distance and the Fisher–Rao distance. It is observed that for distributions with a wider range for $\sigma$, the correlation between Wasserstein and FR decreases.
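A comparison of this type can be reproduced in a few lines. The sketch below draws pairs of Gaussian samples, computes their 1-D Wasserstein distance (for two samples of equal size, the mean absolute difference between sorted observations) and the arc length between their linear MSE coefficients, and reports the Pearson correlation. All settings are hypothetical, and the arc length assumes both distributions lie on the same parabola, which holds only approximately for samples.

```python
import math
import numpy as np

def wasserstein_1d(x, y):
    """1-D Wasserstein distance between two samples of equal size."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def arc_length(beta1, beta2):
    """Arc length along gamma = beta**2 / 4 + c between two abscissas."""
    def antiderivative(b):
        return 0.25 * b * math.sqrt(b * b + 4.0) + math.asinh(0.5 * b)
    return abs(antiderivative(beta2) - antiderivative(beta1))

rng = np.random.default_rng(5)               # fixed seed, illustrative only
n_pairs, n_obs, sigma = 200, 1000, 0.5       # hypothetical settings
w_dist, arcs = [], []
for _ in range(n_pairs):
    mu1, mu2 = rng.uniform(-3.0, 3.0, size=2)
    s1 = rng.normal(mu1, sigma, n_obs)
    s2 = rng.normal(mu2, sigma, n_obs)
    w_dist.append(wasserstein_1d(s1, s2))
    arcs.append(arc_length(-2.0 * s1.mean(), -2.0 * s2.mean()))

pearson_r = np.corrcoef(w_dist, arcs)[0, 1]
print(round(pearson_r, 3))
```

Letting the standard deviation vary between the two samples of each pair, instead of keeping it fixed, is the natural way to probe how the relation degrades as the distributions leave a common curve.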
In this section, we offered computational evidence that the comparison in the $\psi$ space, that is, the space induced by the linear and constant coefficients of the second-degree polynomial of the MSE, may offer interesting alternatives to compare distributions.
5. Discussion and Conclusions
The comparison of probability distributions is of high relevance in many fields. The comparison can be directly applied over the distributions as with the Aitchison, Wasserstein, or other relevant distances. Alternatively, a comparison can be conducted in terms of the parameters that describe the distributions. Those parameters may define a statistical manifold, as is the case for Gaussian distributions when their mean and standard deviations are considered. There, the Fisher–Rao distance can be applied, since it captures the inherent geometry of the data points. There are other possibilities that attempt a parametrization of probability distributions, with a different list of relevant parameters.
In this contribution, we have described how a discrete probability function can be described by its mean square error, or MSE, along its range. The MSE can be represented by a second-degree polynomial $ax^2 + \beta x + \gamma$, where $a = 1$, $\beta = -2\mu$, and $\gamma = \sigma^2 + \mu^2$, where $x \in L$, and where L is the set of observations or measurements from which the MSE is computed.
The linear and constant parameters of the MSE-fitting polynomial, $\beta$ and $\gamma$, define an interesting structure. The distribution of $\beta$ and $\gamma$ defines a nonlinear pattern. In particular, $\gamma$ can be approximated by a second-degree polynomial of $\beta$ with parameters $1/4$, 0, and $\sigma^2$ for the quadratic, linear and constant terms in certain cases, for example, for distributions with similar $\sigma$.
Every probability distribution is mapped to a point along the second-degree polynomial that relates the linear and constant parameters of the MSE-fitting polynomial. The comparison between probability distributions in this space can be conducted by computing the arc length of the curve defined by the polynomial. We presented empirical evidence of the possible relevance of such a comparison. Having a clear picture of how probability distributions are related to each other, and thus of how best to compare them, is still an open question. Our aim is to offer more details on this comparison through the proposed method.
The approach introduced in this contribution is closely related to the existing methods in Information Geometry. There, distributions are mapped to a space defined by the first and second statistical moments, expressed as the mean and standard deviation. In that space, distributions can be compared by an appropriate function, such as the Fisher–Rao distance. Our approach is similar to other methods in the sense that it works with the same statistical moments. However, our approach diverges from others in the sense that distributions are transformed to the mean square error representation. The mean square error representation of a distribution is a second-degree polynomial, relating the mean and the standard deviation of the distribution. The linear and constant parameters of this representation define a manifold in which distributions can be compared by computing the arc length joining them.
Our approach, as shown by the numerical and analytical evidence, does not arrive at the same results as the existing schemes. The manifold introduced in this contribution, where distributions can be compared, is different from the usual manifold considered in Information Geometry, although the parameters are the same. In that sense, it is an alternative that may be of interest in cases in which distributions can be approximated by their first and second statistical moments.
Comparing probability functions that cannot be adequately represented by $\mu$ and $\sigma$ is still an open problem [
9]. In this contribution, we explored alternative paths that allow a comparison between probability distributions of different families, while focusing on the mean and standard deviation. We do so by obtaining the mean square error of the studied probability functions, and then, we fit a second-degree polynomial to the mean square error of the probability functions. The parameters of such a polynomial define a manifold, which can also be approximated by a second degree polynomial. The comparison of probability functions can then be conducted in the latter polynomial. The position of every probability function is a point along the polynomial-defined manifold; so, the length of the arc that separates the points that represent the probability distributions is a natural way to compare such probability functions.
There are several possible extensions to the ideas presented here. More theoretical results are in order, for example, considering third or higher moments in the error measure, as opposed to the mean square error, which takes into account only the first and second moments.
An analytical derivation of the expected arc length as a function of the curvature in the manifold is a natural next step. Having such a function and comparing it with the corresponding ones for the Euclidean and Fisher–Rao distances would allow a fairer comparison between the approach introduced here and the existing ones. The empirical and numerical evidence we have presented opens the door for more research.