2.1. A Bayesian Approach to the Inference Problem
In this subsection, we define the estimator to be used and formulate the problem within the Bayesian framework. The search for the prior is established as the main target of the paper.
Our goal is to estimate a property $F(\boldsymbol{q})$ of a stochastic system composed of a large number $k$ of states, when we only have access to $n$ samples, with $n \lesssim k$. The probabilities $\boldsymbol{q} = (q_1, \dots, q_k)$ of the $k$ states constitute a complete description of the system (see Figure 1).
We choose to work with the estimator $\hat{F}$, defined as the expected value of the property $F$ conditioned on the measured data $\boldsymbol{n}$. The average is calculated with the posterior distribution $p(\boldsymbol{q} \mid \boldsymbol{n})$, and the $k$-dimensional vector $\boldsymbol{n}$ has components $(n_1, \dots, n_k)$, with $n_i$ equal to the number of samples in which the system was observed in the $i$th state (so that $\sum_i n_i = n$). Therefore,

$$\hat{F}(\boldsymbol{n}) = \int F(\boldsymbol{q})\; p(\boldsymbol{q} \mid \boldsymbol{n})\; \mathrm{d}\boldsymbol{q}. \qquad (1)$$
Throughout the paper, all integrals in $\mathrm{d}\boldsymbol{q}$ should be restricted to the space where $\boldsymbol{q}$ is defined: the simplex embedded in $\mathbb{R}^k$ or, in the multivariate case, the Cartesian product of several simplexes. The estimator of Equation (1) minimizes the mean square error of the estimation [10]. Using Bayes' rule, the posterior can be written in terms of the likelihood $p(\boldsymbol{n} \mid \boldsymbol{q})$ and the prior $p(\boldsymbol{q})$, so that

$$\hat{F}(\boldsymbol{n}) = \frac{\int F(\boldsymbol{q})\; p(\boldsymbol{n} \mid \boldsymbol{q})\; p(\boldsymbol{q})\; \mathrm{d}\boldsymbol{q}}{\int p(\boldsymbol{n} \mid \boldsymbol{q})\; p(\boldsymbol{q})\; \mathrm{d}\boldsymbol{q}}. \qquad (2)$$
The likelihood $p(\boldsymbol{n} \mid \boldsymbol{q})$ is the only factor that depends on the sampled data $\boldsymbol{n}$, and for discrete states, it is a multinomial distribution, namely

$$p(\boldsymbol{n} \mid \boldsymbol{q}) = \frac{n!}{n_1! \cdots n_k!} \prod_{i=1}^{k} q_i^{\,n_i}. \qquad (3)$$
The only factor that still needs to be defined is the prior $p(\boldsymbol{q})$, and the strategy underlying this choice is the topic of this paper. We want the prior to produce an inductive bias that acknowledges our ignorance about $\boldsymbol{q}$. Such a bias will be useful to overcome the scarcity of samples, but it will only be suitable for estimating the specific property $F$ of the problem at hand and not others. In other words, the prior proposed here is individually tailored for $F$.
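To make Equations (1)–(3) concrete, here is a minimal numerical sketch in Python. Both the property (Shannon entropy) and the prior (uniform on the simplex) are illustrative assumptions made only for this example; the purpose of the paper is precisely to replace the uniform prior with a tailored one:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(q):
    """Shannon entropy, the illustrative choice for the property F(q)."""
    q = np.clip(q, 1e-300, 1.0)          # treats 0 log 0 as 0
    return -np.sum(q * np.log(q), axis=-1)

def posterior_mean_estimate(counts, n_mc=100_000):
    """Monte Carlo version of Eqs. (1)-(2) under a uniform prior on the
    simplex.  With a multinomial likelihood, the uniform (Dirichlet(1))
    prior is conjugate, so the posterior is Dirichlet(counts + 1) and
    can be sampled directly."""
    samples = rng.dirichlet(counts + 1.0, size=n_mc)
    return entropy(samples).mean()

# Undersampled example: k = 100 states, n = 30 samples.
k, n = 100, 30
q_true = rng.dirichlet(np.ones(k))
counts = rng.multinomial(n, q_true)

print("true F(q)      :", entropy(q_true))
print("plug-in F(n/n) :", entropy(counts / n))
print("posterior mean :", posterior_mean_estimate(counts))
```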
2.2. Defining the Prior on the Level Surfaces of the Property
In this subsection, the apparent high dimensionality of the problem is shown to be illusory. The prior we are searching for, although defined on a high-dimensional simplex, is shown to affect the result of the inference process only through its integrated value inside each level surface of the quantity $F(\boldsymbol{q})$. In other words, the variations of the prior inside the level surfaces of $F$ are irrelevant, since they have no influence on the value of the estimator $\hat{F}$. These level surfaces can be indexed by a one-dimensional parameter, which will later allow us to drastically simplify the search for the prior.
To show this point, we start by noting that $\boldsymbol{q}$ is defined on the simplex embedded in $\mathbb{R}^k$. We introduce a change of variables $\boldsymbol{q} \to \boldsymbol{y} = (y_1, y_2, y_3, \dots)$ to perform the integral of Equation (2). Specifically, the first two coordinates of the new variables are chosen to be

$$y_1 = f = F(\boldsymbol{q}), \qquad y_2 = \ell = p(\boldsymbol{n} \mid \boldsymbol{q}),$$

and the remaining coordinates $y_3, y_4, \dots$ may be chosen arbitrarily, as long as the transformation is invertible. In the new variables,

$$\hat{F}(\boldsymbol{n}) = \frac{\int f\,\ell\;\rho(f,\ell)\;\mathrm{d}f\,\mathrm{d}\ell}{\int \ell\;\rho(f,\ell)\;\mathrm{d}f\,\mathrm{d}\ell}, \qquad \rho(f,\ell) = \int p(\boldsymbol{q})\;\delta\big(F(\boldsymbol{q})-f\big)\;\delta\big(p(\boldsymbol{n} \mid \boldsymbol{q})-\ell\big)\;\mathrm{d}\boldsymbol{q}. \qquad (4)$$
Equation (4) shows that it does not matter how the prior $p(\boldsymbol{q})$ distributes its density inside the manifold $\mathcal{M}_{f\ell}$ obtained by intersecting the level surface $f$ of the property $F$ with the level surface $\ell$ of the likelihood $L$. Only the integral of $p(\boldsymbol{q})$ inside $\mathcal{M}_{f\ell}$ has an effect on $\hat{F}$. Therefore, we opt to consider only priors that are constant inside $\mathcal{M}_{f\ell}$, since for any prior not constant in $\mathcal{M}_{f\ell}$, a prior that is constant in $\mathcal{M}_{f\ell}$ exists that produces the same estimation $\hat{F}$. Our task is therefore to design how the prior changes with $f$ and $\ell$. A crucial point is that the level surfaces of the likelihood vary with the data $\boldsymbol{n}$. Yet, by definition, a prior cannot depend on the measured data. Therefore, we only work with priors that, when written in the $\boldsymbol{y}$-coordinates, depend on $\boldsymbol{q}$ only through $F(\boldsymbol{q})$. In other words, we assume that a function $g$ exists such that $p(\boldsymbol{q}) = g\big(F(\boldsymbol{q})\big)$. By limiting the search to priors of this type, for each value of $f$, we impose maximal uncertainty about $\boldsymbol{q}$, as dictated by a maxentropy principle.
As a side remark, we point out that the assumption $p(\boldsymbol{q}) = g\big(F(\boldsymbol{q})\big)$ does not imply that, when written in the $\boldsymbol{y}$-coordinates, the prior must be independent of $\ell$, since the Jacobian of the transformation of variables may well depend on $\ell$. Moreover, the marginal prior $\rho(f,\ell)$ may also depend on $\ell$, since the level surface on which Equation (4) is calculated depends on the data $\boldsymbol{n}$. Therefore, $\rho(f,\ell)$ is allowed to depend on $\ell$, but only through the geometrical structure by which the delta functions in Equation (4) restrict the integration region.
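The following sketch illustrates numerically what the restriction $p(\boldsymbol{q}) = g\big(F(\boldsymbol{q})\big)$ means in practice. It evaluates Equation (2) by self-normalized importance sampling with a uniform proposal on the simplex, so that the prior enters only through the weight $g(F(\boldsymbol{q}))$; the two $g$-functions and the choice of Shannon entropy for $F$ are illustrative assumptions, not choices made by the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(q):
    q = np.clip(q, 1e-300, 1.0)
    return -np.sum(q * np.log(q), axis=-1)

def estimate(counts, g, n_mc=200_000):
    """Eq. (2) by self-normalized importance sampling, with a uniform
    proposal on the simplex and a prior of the form p(q) = g(F(q))."""
    k = len(counts)
    qs = np.clip(rng.dirichlet(np.ones(k), size=n_mc), 1e-300, 1.0)
    F = entropy(qs)
    loglik = np.log(qs) @ counts              # log of prod_i q_i^{n_i}
    w = np.exp(loglik - loglik.max()) * g(F)  # likelihood x prior weight
    return np.sum(w * F) / np.sum(w)

k, n = 20, 15
counts = rng.multinomial(n, rng.dirichlet(np.ones(k)))

# The estimator is sensitive only to how the prior varies across the
# level surfaces of F, i.e., to the function g:
print(estimate(counts, lambda f: np.ones_like(f)))   # g constant in f
print(estimate(counts, lambda f: np.exp(2.0 * f)))   # g increasing in f
```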
2.3. Decomposing the Prior as a Linear Combination of a Family of Base Priors
In this subsection, we exploit the fact that $p(\boldsymbol{q}) = g\big(F(\boldsymbol{q})\big)$ to decompose the prior as a linear combination of a collection of base priors indexed by a single real number $f$. The decomposition allows us to reduce the search for a prior defined on a high-dimensional simplex to the search for a density defined as a function of $f$. This step brings about a dramatic simplification, but more importantly, it reduces the relevance of the subjective aspect involved in the selection of priors. We show here that the range of $f$-values that are relevant to the inference process (and thereby, the base priors that actually matter) is only partially defined by $g$, which is the only element still containing subjective choices. In many applications, the sampled data $\boldsymbol{n}$ will turn out to be the truly defining factor in selecting the base priors that participate in the inference process.
We start to demonstrate these statements by noting that the set of priors $p(\boldsymbol{q})$ for which a function $g$ exists such that $p(\boldsymbol{q}) = g\big(F(\boldsymbol{q})\big)$ is equal to the set of all distributions of the form

$$p(\boldsymbol{q}) = \int g(f)\;\delta\big(F(\boldsymbol{q}) - f\big)\;\mathrm{d}f \qquad (5)$$

that can be obtained by varying $g$. Equation (5) implies that the problem of specifying $p(\boldsymbol{q})$ can be reduced to the problem of specifying $g$, which is a drastically simpler object, given that $\boldsymbol{q}$ has $k - 1$ dimensions on the simplex, whereas $f$ has only a single dimension.
We now show that, when the number of samples is sufficiently large, the determination of $g$ becomes unnecessary, since for large $k$, the measured data often enter the likelihood in such a way that only a small range of $f$-values remains with significantly non-zero probability. Therefore, all the $g$-functions that do not vary drastically within the permitted range give rise to the same estimation. In such cases, the data alone dictate the value of the estimator through the likelihood, making the discussion about priors essentially inconsequential. We now prove these statements. Replacing Equation (5) in (2), the estimator becomes

$$\hat{F}(\boldsymbol{n}) = \frac{\int f\; g(f)\; \rho(f \mid \boldsymbol{n})\;\mathrm{d}f}{\int g(f)\; \rho(f \mid \boldsymbol{n})\;\mathrm{d}f}, \qquad (6)$$

with

$$\rho(f \mid \boldsymbol{n}) = \int \delta\big(F(\boldsymbol{q}) - f\big)\; p(\boldsymbol{n} \mid \boldsymbol{q})\;\mathrm{d}\boldsymbol{q} \;\propto\; \int \delta\big(F(\boldsymbol{q}) - f\big)\; \exp\big[-n\, D_{\mathrm{KL}}(\bar{\boldsymbol{n}} \,\|\, \boldsymbol{q})\big]\;\mathrm{d}\boldsymbol{q}, \qquad (7)$$

where $D_{\mathrm{KL}}(\bar{\boldsymbol{n}} \,\|\, \boldsymbol{q})$ is the Kullback–Leibler divergence [11] between the distributions $\bar{\boldsymbol{n}}$ and $\boldsymbol{q}$, and the $i$th component of the vector $\bar{\boldsymbol{n}} = \boldsymbol{n}/n$ is $n_i/n$. The divergence is always non-negative, and it only vanishes when the two distributions coincide [12]. Therefore, if the number of samples $n$ is sufficiently large, the result of the integral is significantly different from zero only for level surfaces that pass close enough to the sampled frequencies $\bar{\boldsymbol{n}}$, and it is maximal for the surface that contains the sampled frequencies. The data thereby select the range of $f$-values that are compatible with the observations. The factor $n$ in the exponent of Equation (7) implies that the allowed range of $f$-values becomes increasingly narrow as the number of samples grows. The crucial point in this reasoning is that, for sufficiently narrow ranges, the shape of the prior $g$ becomes irrelevant. If the range is much narrower than the typical scale on which $g$ varies, then for all practical purposes, $g$ is constant within this range and has no bearing on the estimation. This is the situation in which we no longer need to worry about the prior.
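The narrowing effect of the factor $n$ in Equation (7) can be checked numerically. In this sketch (same illustrative property, $F$ = Shannon entropy; the uniform proposal is a crude but simple way to probe the likelihood-weighted distribution of $f$), the spread of $f$-values surviving the weighting shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(q):
    q = np.clip(q, 1e-300, 1.0)
    return -np.sum(q * np.log(q), axis=-1)

def f_spread_under_likelihood(counts, n_mc=200_000):
    """Weighted mean and spread of f = F(q) when q is weighted by the
    multinomial likelihood, i.e., by exp(-n D_KL(nbar || q)) up to
    factors that do not depend on q (cf. Equation (7))."""
    k = len(counts)
    qs = np.clip(rng.dirichlet(np.ones(k), size=n_mc), 1e-300, 1.0)
    loglik = np.log(qs) @ counts
    w = np.exp(loglik - loglik.max())
    F = entropy(qs)
    mean = np.sum(w * F) / np.sum(w)
    spread = np.sqrt(np.sum(w * (F - mean) ** 2) / np.sum(w))
    return mean, spread

k = 10
q_true = rng.dirichlet(np.ones(k))
for n in (10, 50, 250):   # the uniform proposal gets crude for larger n
    counts = rng.multinomial(n, q_true)
    mean, spread = f_spread_under_likelihood(counts)
    print(f"n = {n:4d}: <f> = {mean:.3f}, spread = {spread:.4f}")
```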
When that limit is reached, the likelihood becomes a delta distribution, and the only $\boldsymbol{q}$-vector that contributes to a given $f$-value is $\bar{\boldsymbol{n}}$. In this extreme case, when $g$ is uniform, the estimator $\hat{F}$ converges to the plug-in estimator $F(\bar{\boldsymbol{n}})$. Yet, the plug-in estimator is only justified when the likelihood indeed reaches the delta-like behavior, which only happens for $n \to \infty$. Before this limit, the plug-in estimator completely neglects the width of the peak of the likelihood around the sampled frequencies $\bar{\boldsymbol{n}}$, and thereby the possibility that a whole collection of $\boldsymbol{q}$-vectors in the vicinity of $\bar{\boldsymbol{n}}$ contributes to each $f$. Allowing for this possibility is important since, in the undersampled regime, there is a large degree of uncertainty about the true $\boldsymbol{q}$ that generated the data $\boldsymbol{n}$. By neglecting this uncertainty, the plug-in estimator produces the undesired biases mentioned before. The obvious solution is not to neglect the width of the likelihood. However, if no approximations are made, it is difficult to calculate or even to estimate $\rho(f \mid \boldsymbol{n})$ analytically, given that Equation (7) involves an integral over an arbitrarily shaped manifold.
To overcome this problem, we would like to design another estimator that retains the insensitivity to the prior but still gives a chance for a collection of $\boldsymbol{q}$-vectors to contribute to each $f$. The size of this collection is determined by the width of the likelihood, that is, by the total number of samples $n$ and the obtained frequencies $\bar{\boldsymbol{n}}$. The averaged contributions of all the $\boldsymbol{q}$-vectors that have non-vanishing likelihoods produce an estimate that may be either smaller or larger than the plug-in value. In other words, all the $\boldsymbol{q}$-vectors different from $\bar{\boldsymbol{n}}$ may tend to either increase or decrease $F$ as compared with $F(\bar{\boldsymbol{n}})$, and the direction and size of this shift depend on the behavior of $F$ around $\bar{\boldsymbol{n}}$. In what follows, we propose an alternative expansion of the prior, different from Equation (5), which is formulated in terms of a parameter that controls the relative weight of the level surfaces that tend to increase $F$ vs. those that tend to decrease it.
Using the previous analysis as an inspiration, from now on, we consider a narrower set of priors that is a proper subset of the one defined by Equation (5). A restriction in the set of possible priors does not invalidate the analysis of how the width of the likelihood depends on the sampled data; it only reduces the set from which priors are selected. We begin by proposing the decomposition

$$p(\boldsymbol{q}) = \int g(f)\;\pi_f(\boldsymbol{q})\;\mathrm{d}f \qquad (8)$$

for a conveniently chosen family of priors $\pi_f(\boldsymbol{q})$ that now need not coincide with $\delta\big(F(\boldsymbol{q}) - f\big)$ but are still assumed to depend on $\boldsymbol{q}$ only through the property $F(\boldsymbol{q})$. In other words, we assume that a family of normalizable functions $h_f$ exists such that $\pi_f(\boldsymbol{q}) = h_f\big(F(\boldsymbol{q})\big)$. Before continuing, it is important to note that in the new decomposition of Equation (8), the parameter $f$ can no longer be identified as exactly the value of the property $F$ on a given level surface, since the delta functions are no longer present to restrict $\pi_f$ to a single level surface of $F$. Therefore, from now on, $f$ should be regarded as a formal label that parametrizes the set of base functions used in the expansion. Yet, later on, we choose a family of base functions for which there is still a connection between $f$ and $F$, but the connection turns out to be probabilistic (see next subsection).
By inserting Equation (8) in Equation (2), the estimator becomes

$$\hat{F}(\boldsymbol{n}) = \frac{\int \mathrm{d}f\; g(f) \int \mathrm{d}\boldsymbol{q}\; F(\boldsymbol{q})\; p(\boldsymbol{n} \mid \boldsymbol{q})\;\pi_f(\boldsymbol{q})}{\int \mathrm{d}f\; g(f)\; p(\boldsymbol{n} \mid f)},$$

where we have introduced the likelihood of the data for each value of the parameter,

$$p(\boldsymbol{n} \mid f) = \int p(\boldsymbol{n} \mid \boldsymbol{q})\;\pi_f(\boldsymbol{q})\;\mathrm{d}\boldsymbol{q}. \qquad (9)$$

With this definition, the estimator reads

$$\hat{F}(\boldsymbol{n}) = \int \hat{F}(\boldsymbol{n}, f)\; p(f \mid \boldsymbol{n})\;\mathrm{d}f, \qquad (10)$$

where

$$p(f \mid \boldsymbol{n}) = \frac{g(f)\; p(\boldsymbol{n} \mid f)}{\int g(f')\; p(\boldsymbol{n} \mid f')\;\mathrm{d}f'} \qquad (11)$$

represents the amount of evidence in favor of each $f$-value provided by the measured data $\boldsymbol{n}$, and

$$\hat{F}(\boldsymbol{n}, f) = \int F(\boldsymbol{q})\; p(\boldsymbol{q} \mid \boldsymbol{n}, f)\;\mathrm{d}\boldsymbol{q} \qquad (12)$$

is the estimation of the property $F$ conditional on the measured data $\boldsymbol{n}$ and the parameter $f$.
Equation (10) is homologous to Equation (6) for base priors that are not delta distributions. Using Bayes' rule (Equation (11)), the evidence $p(f \mid \boldsymbol{n})$ can be written in terms of the marginal likelihood $p(\boldsymbol{n} \mid f)$ of Equation (9). This evidence is defined by an integral in $\boldsymbol{q}$-space that contains the multinomial likelihood embodying the Kullback–Leibler divergence. Therefore, just as before, the data $\boldsymbol{n}$ still select a range of $f$-values; only now, we cannot identify $f$ as an instantiation of $F$ on one specific region. Still, the value of $f$ that maximizes the posterior $p(f \mid \boldsymbol{n})$ is the one that makes the largest contribution to the integral in Equation (10). Before, keeping only the optimal $f$-value yielded almost the plug-in estimator (except for the effect of $g$). In the present case, keeping only the optimal $f$-value means replacing the integral in Equation (10) by the evaluation of $\hat{F}(\boldsymbol{n}, f)$ at the value of $f$ that maximizes $p(f \mid \boldsymbol{n})$. This procedure is henceforth denominated the MAP estimator, for Maximum A Posteriori. With the present decomposition, this procedure does not yield the plug-in estimator, because $\hat{F}(\boldsymbol{n}, f)$ contains an integral that sweeps through a whole range of level surfaces. For the family of base priors selected in the following subsection, the MAP estimator performs substantially better than the plug-in estimator and often also better than custom-made estimators designed for specific properties. In those cases in which the shape of $g$ can be argued to play no relevant role, the MAP estimator can be replaced by an empirical Bayes estimator, in which the selected $f$-value maximizes the marginal likelihood $p(\boldsymbol{n} \mid f)$ instead of the marginal posterior $p(f \mid \boldsymbol{n})$.
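As a numerical illustration of the empirical Bayes variant, the sketch below anticipates the exponential base priors $\pi_\beta(\boldsymbol{q}) \propto \exp[\beta F(\boldsymbol{q})]$ derived in the next subsection. Here $F$ = Shannon entropy, the $\beta$-grid, and the reweighting of uniform simplex samples are all illustrative assumptions; the point is the procedure of this subsection: scan the parameter, score each value by its evidence, and report the conditional estimate at the maximum:

```python
import numpy as np

rng = np.random.default_rng(3)

def entropy(q):
    q = np.clip(q, 1e-300, 1.0)
    return -np.sum(q * np.log(q), axis=-1)

def empirical_bayes_estimate(counts, betas, n_mc=200_000):
    """For each beta, estimate the marginal likelihood p(n|beta) and the
    conditional estimate F_hat(n, beta) by reweighting uniform samples
    on the simplex with pi_beta(q) ∝ exp(beta F(q)); return the estimate
    at the beta that maximizes the evidence."""
    k = len(counts)
    qs = np.clip(rng.dirichlet(np.ones(k), size=n_mc), 1e-300, 1.0)
    F = entropy(qs)
    loglik = np.log(qs) @ counts
    lik = np.exp(loglik - loglik.max())   # beta-independent factors cancel
    best = None
    for beta in betas:
        w = np.exp(beta * F - (beta * F).max())          # ∝ pi_beta(q)
        evidence = np.sum(w * lik) / np.sum(w)           # p(n|beta), rescaled
        f_hat = np.sum(w * lik * F) / np.sum(w * lik)    # F_hat(n, beta)
        if best is None or evidence > best[0]:
            best = (evidence, beta, f_hat)
    return best[1], best[2]

k, n = 50, 20
counts = rng.multinomial(n, rng.dirichlet(np.ones(k)))
beta_star, f_hat = empirical_bayes_estimate(counts, np.linspace(-20, 20, 41))
print(f"beta* = {beta_star:.1f}, F_hat(n, beta*) = {f_hat:.3f}, "
      f"plug-in = {entropy(counts / n):.3f}")
```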
2.4. A Maxentropy Strategy to Select the Base of Functions to Expand the Prior
In this subsection, we select the family of distributions $\pi_f(\boldsymbol{q})$ that serves as a base to expand the prior $p(\boldsymbol{q})$, as dictated by Equation (8). To do so, we employ a Maxentropy principle, maximizing the uncertainty about $\boldsymbol{q}$ under the constraint that the expected value of $F(\boldsymbol{q})$ be $f$ (Equation (13)).
Before, when the expansion in delta distributions was used, each element of the base was associated with a single $f$-value, equal to the property $F$ evaluated on the corresponding level surface. A natural relaxation of this condition, while still fulfilling the requirement that the elements $\pi_f(\boldsymbol{q})$ have the same level surfaces as $F$, is to demand that $f$ be the expected value of $F(\boldsymbol{q})$, when the average is weighted by $\pi_f(\boldsymbol{q})$. If, after imposing this requirement, we insist that the base functions have no additional structure, then the shape of $\pi_f(\boldsymbol{q})$ can be derived from the Maxentropy principle [13,14], in which the differential entropy

$$H[\pi_f] = -\int \pi_f(\boldsymbol{q})\;\ln \pi_f(\boldsymbol{q})\;\mathrm{d}\boldsymbol{q}$$

is maximized, conditioned to the restriction

$$\int F(\boldsymbol{q})\;\pi_f(\boldsymbol{q})\;\mathrm{d}\boldsymbol{q} = f. \qquad (13)$$

The solution of this maximization problem is

$$\pi_\beta(\boldsymbol{q}) = \frac{\exp\big[\beta\, F(\boldsymbol{q})\big]}{Z(\beta)}, \qquad Z(\beta) = \int \exp\big[\beta\, F(\boldsymbol{q})\big]\;\mathrm{d}\boldsymbol{q}, \qquad (14)$$
where the hyperparameter $\beta$ is a function of $f$; that is, of the expected value of the property. The correspondence between each $f$-value and each $\beta$-value implies that both parameters constitute valid tags to designate a member of the base. For this reason, in what follows, we use the two parametrizations interchangeably, depending on what we want to stress. We warn the reader that we pass from one to the other, tagging the elements of the base equivalently as $\pi_f$ or as $\pi_\beta$, understanding that there is a one-to-one mapping between $f$ and $\beta$. In the nomenclature of Amari [15,16], the parameter appearing linearly in the exponent of the distribution (for us, $\beta$) is referred to as an exponential system of coordinates of the space of parameters. When $f$ is used, the coordinates are called mixed. In equilibrium statistical mechanics, for example, $\beta$ is often proportional to the negative of the inverse temperature, whereas $f$ is proportional to the mean energy of a state, and, of course, the two are related.
In the exponential family of Equation (14), all level surfaces of $F$ contribute to each $\pi_\beta$, but some are more relevant than others. If $\beta$ is large and positive, $\pi_\beta$ is dominated by the level surface with maximal $F$ and rapidly dies out as we depart from it. The mean value of $F$ weighted by this $\pi_\beta$ approaches the maximum of $F$ on the simplex. If $\beta = 0$, all level surfaces contribute uniformly. If $\beta$ is large and negative, the level surface with minimal $F$ has maximal relevance, and the mean value of $F$ approaches the minimum of $F$ on the simplex. Therefore, $\beta$ operates as a tuning knob that raises or lowers the relevance of different level surfaces, ranking them by the value of $F$. As happened when expanding the prior in a base of delta functions, the mean value of $F$ shifts from its minimum to its maximum as the parameter is adjusted. However, in contrast with the delta case, the expansion in exponential functions allows each element of the base to spread out over different level surfaces, allowing a whole diversity of surfaces to contribute to each $\pi_\beta$.
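The tuning-knob behavior can be visualized by numerically estimating the mixed coordinate $f(\beta) = \langle F \rangle_{\pi_\beta}$. This is a sketch under the same illustrative assumptions ($F$ = Shannon entropy on a $k = 10$ simplex, reweighted uniform samples); the crude proposal estimates the extreme $\beta$ regimes poorly but displays the monotonic mapping between $\beta$ and $f$:

```python
import numpy as np

rng = np.random.default_rng(4)

def entropy(q):
    q = np.clip(q, 1e-300, 1.0)
    return -np.sum(q * np.log(q), axis=-1)

# Mixed coordinate f(beta) = <F> under pi_beta(q) ∝ exp(beta F(q)),
# Eq. (14), estimated by reweighting uniform samples on the simplex.
k, n_mc = 10, 200_000
F = entropy(rng.dirichlet(np.ones(k), size=n_mc))
print(f"F ranges over [0, ln k] = [0, {np.log(k):.3f}]")
for beta in (-30.0, -5.0, 0.0, 5.0, 30.0):
    bF = beta * F
    w = np.exp(bF - bF.max())        # ∝ pi_beta / uniform, stabilized
    print(f"beta = {beta:6.1f}  ->  f(beta) = {np.sum(w * F) / np.sum(w):.3f}")
```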