A Noninformative Prior on a Space of Distribution Functions

Terenin, Alexander; Draper, David

doi:10.3390/e19080391

Open AccessArticle

A Noninformative Prior on a Space of Distribution Functions

by

Alexander Terenin

and

David Draper

^*

Applied Mathematics and Statistics, University of California, Santa Cruz, CA 95064, USA

^*

Author to whom correspondence should be addressed.

Entropy 2017, 19(8), 391; https://doi.org/10.3390/e19080391

Submission received: 20 June 2017 / Revised: 24 July 2017 / Accepted: 28 July 2017 / Published: 29 July 2017

(This article belongs to the Section Information Theory, Probability and Statistics)

Download Versions Notes

Abstract

:

In a given problem, the Bayesian statistical paradigm requires the specification of a prior distribution that quantifies relevant information about the unknowns of main interest external to the data. In cases where little such information is available, the problem under study may possess an invariance under a transformation group that encodes a lack of information, leading to a unique prior—this idea was explored at length by E.T. Jaynes. Previous successful examples have included location-scale invariance under linear transformation, multiplicative invariance of the rate at which events in a counting process are observed, and the derivation of the Haldane prior for a Bernoulli success probability. In this paper we show that this method can be extended, by generalizing Jaynes, in two ways: (1) to yield families of approximately invariant priors; and (2) to the infinite-dimensional setting, yielding families of priors on spaces of distribution functions. Our results can be used to describe conditions under which a particular Dirichlet Process posterior arises from an optimal Bayesian analysis, in the sense that invariances in the prior and likelihood lead to one and only one posterior distribution.

Keywords:

Bayesian nonparametrics; Dirichlet process; functional equations; Hyers–Ulam–Rassias stability; improper prior; invariance; optimal Bayesian analysis; transformation group

1. Introduction

Consider a statistician working on a problem P in which a vector

y = (y_{1}, . ., y_{n})

of real-valued outcomes is to be observed, and—prior to, i.e., without observing

y

—the statistician’s uncertainty is exchangeable, in the usual sense of being invariant under permutation of the order in which the outcomes are listed in

y

. This situation has extremely broad real-world applicability, including (but not limited to) the analysis of a completely randomized controlled trial, in which participants—ideally, similar to elements of a population to which it is desired to generalize inferentially—are randomized. Each participant is assigned either to a control group that receives the current best treatment, or to an experimental group that receives a new treatment whose causal effect on y is of interest. This design, while extremely simple, has proven to be highly useful over the past 90 years, in fields as disparate as agriculture [1], medicine [2], and (in contemporary usage)

A / B

testing in data science on a massive scale [3]. We use randomized controlled trials as a motivating example below, but we emphasize that they constitute only one of many settings to which the results of this paper apply.

Focusing just on the experimental group in the randomized controlled trial, the exchangeability inherent in

y

implies via de Finetti’s Theorem [4] that the statistician’s state of information may be represented by the hierarchical model

\begin{matrix} y_{i} ∣ F \overset{iid}{\sim} F & F \sim π (F) \end{matrix}

(1)

for

i = 1, . ., n

, where F is a cumulative distribution function (CDF) on

R

and

π (F)

is a prior on the space of all such CDFs, i.e., the infinite-dimensional probability simplex

S_{\infty}

. Note that (1) has uniquely specified the likelihood in a Bayesian nonparametric model for

y

, and all that remains is specification of

π (F)

.

Speaking now more generally (not just in the context of a randomized controlled trial), suppose that the nature of the problem P enables the analyst to identify an alternative statistical problem

\tilde{P}

in which

\begin{matrix} \tilde{P} & = g (P) & such that & g & \in G, \end{matrix}

(2)

where G is a collection of transformations g from one problem to another having the property that, without having seen any data,

\tilde{P}

and P are the exact same problem. Then, the prior

\tilde{π}

under

\tilde{P}

must be the same as the prior

π

under P! Furthermore, since this holds for any

g \in G

, the result will be, as long as G is endowed with enough structure, that there is one and only one prior

π

, for use in P, that respects the inherent invariance of the problem under study. Bayes’ Rule then implies that there is one and only one posterior distribution under P. When this occurs—when both the likelihood function and the prior are uniquely specified, as in the example above—we say that the problem P admits an optimal Bayesian analysis.

The logic underlying the above argument has been used to motivate and formalize the notion of noninformative priors for decades. Indeed, in the special case where F is parametric and G is a group of transformations encoding invariance with respect to monotonically-transformed units of measurement, Jeffreys [5] derived the resulting prior distribution. As another example, Jaynes [6] derived the prior distribution for the mean number of arrivals of a Poisson process by using its characterization as a Lévy counting process to specify an appropriate transformation group. Notably, the resulting prior distribution is not the Jeffreys prior, because the problem’s invariance and corresponding transformation group are different. See Eaton [7] for additional work on this subject.

Having studied this line of reasoning, it is natural to ponder its generality. In this paper we show that the argument can be made quite general—we prove that the argument’s formal notions

(a): can be generalized to include approximately invariant priors in an $ε$ – $δ$ sense; and
(b): can be extended to infinite-dimensional priors on spaces of CDFs.

We focus on the setting described in (1) and defer more general situations to future work. In this setting we derive a number of results, ultimately showing that the Dirichlet Process [8] prior

DP (ε, F_{0})

is an approximately invariant stochastic process for any CDF

F_{0}

on

R

and sufficiently small

ε > 0

. Together with de Finetti’s Theorem, this demonstrates that the posterior distribution

F ∣ y \sim DP (n, {\hat{F}}_{n}),

(3)

where

{\hat{F}}_{n}

is the empirical CDF, corresponds in a certain sense to an optimal Bayesian analysis—see Section 3 for more on this point.

Not all approaches to noninformative priors are based on group invariance. Perhaps the earliest approach can be traced back to Laplace [9], who proposed a Principle of Indifference under which, if all that is known about a quantity

θ

is that

θ \in Θ

(for some set

Θ

of possible values), then the prior should be uniform on

Θ

. For example, consider

Θ = (0, 1)

: the fact that

θ \sim U (0, 1)

is not consistent with

f (θ) \sim U (0, 1)

for any monotonic nonlinear f requires that the problem P under study must uniquely identify the scale on which uniformity should hold for the principle to be valid—this was a major reason for the rise of non-Bayesian theories of inference in the 19th century [10]. Bernardo [11] has proposed a notion of noninformative priors that is defined by studying their effect on posterior distributions, and choosing priors that ensure that prior impact is minimized. Jaynes [12] has proposed the Maximum Entropy Principle, which defines noninformative prior distributions via information-theoretic arguments, for use in settings in which invariance considerations do not lead to a unique prior. All of these notions are different, and applicable to problems where the corresponding notions of noninformativeness arise most naturally.

Most of the work on noninformative priors has focused on the parametric setting, in which the number of unknown quantities is finite. In contrast, Bush et al. [13] and Lee et al. [14] have derived results on noninformative priors in Dirichlet Process Mixture models. Their notion of noninformativeness is completely different from our own, as it is a posteriori, i.e., it involves examining the behavior of the posterior distribution under the priors studied. This makes their approach largely complementary to ours: in specifying priors, it is helpful to understand both the prior’s effect on the posterior and the prior’s behavior a priori without considering any data.

Here we study noninformative prior specification from a strictly a priori perspective. We do not consider the prior’s effect on the posterior distribution. There is no data or discussion of computation.

Our motivation is a generalization of the following argument by Jaynes [12]. Suppose that in the randomized controlled trial described above, the outcome y of interest is binary. By de Finetti’s Theorem, we know that

y_{i} ∣ θ_{1} \overset{iid}{\sim} Ber (θ_{1})

(4)

is the unique likelihood for (e.g., the treatment group in) this problem. Suppose further that the statistician’s state of information about

θ_{1}

external to the data set

y

is what Jaynes calls “complete initial ignorance” except for the fact that

θ = (θ_{1}, θ_{2})

is such that

{0 \leq θ_{1} \leq 1, 0 \leq θ_{2} \leq 1, θ_{1} + θ_{2} = 1} .

(5)

Jaynes argues that this state of information is equivalent to the statistician possessing complete initial ignorance about all possible rescaled and renormalized versions of

θ

, namely

θ^{'} = (\frac{c_{1} θ_{1}}{c_{1} θ_{1} + c_{2} θ_{2}}, \frac{c_{2} θ_{2}}{c_{1} θ_{1} + c_{2} θ_{2}})

(6)

for all positive

c_{1}, c_{2}

. Jaynes shows that this leads uniquely to the Haldane prior

\begin{matrix} π (θ_{1}) & \propto \frac{1}{θ_{1} (1 - θ_{1})} & or equivalently & π (θ_{1}, θ_{2}) & \propto \frac{1}{θ_{1} θ_{2}}, \end{matrix}

(7)

where

θ_{2} = 1 - θ_{1}

. Combining this result with the unique Bernoulli likelihood under exchangeability, in our language Jaynes has therefore identified an instance of optimal Bayesian analysis. In what follows we (a) extend Jaynes’s argument to the multinomial setting with p outcome categories for arbitrary finite p and (b) show how this generalization leads to a unique noninformative prior on

S_{\infty}

.

The

DP (n, {\hat{F}}_{n})

posterior and implied

DP (0)

prior—see Table 1 for the notational conventions used in this work—have not been subject to the same level of formal study as Dirichlet Process Mixture priors and other priors over CDFs, in part due to the simplicity and discrete nature of

DP (n, {\hat{F}}_{n})

. On the other hand, Dirichlet and Dirichlet Process priors with small concentration parameters have been used as low-information priors in a variety of settings (e.g., [15]), without much formal justification. In this paper we offer a mathematical foundation showing that the use of

DP (0)

is statistically sound.

2. Results

2.1. Preliminaries

To begin our discussion, we first introduce the notion of an invariant distribution, which describes what we mean by the term noninformative.

Definition 1.

[Invariant Distribution] A density

π (θ)

is invariant with respect to a transformation group G if for all

\tilde{π} (θ) = π [g (θ)]

with

g \in G

, and all measurable sets A,

\int_{A} π (θ) d θ = \int_{A} π [g (θ)] d g (θ) = \int_{A} \tilde{π} (θ) | \frac{\partial [g (θ)]}{\partial (θ)} | d θ,

(8)

where

\frac{\partial [g (θ)]}{\partial (θ)}

is the Jacobian of the transformation.

Note that in Equation (8), if we were to instead take A in the middle and right integrals to be

g^{- 1} (A)

, we would exactly get the classical integration by substitution formula, which under appropriate conditions is always true. We are interested in the inverse problem: given a set of transformations in G, does there exist a unique

π

satisfying (8)?

In a number of practically-relevant cases, G is uniquely specified by the context of the problem being studied. If this leads to a unique prior distribution

π

, and when additionally a unique likelihood also arises, for example via exchangeability, an optimal Bayesian analysis is possible, as defined in Section 1. It is often the case that the prior distributions that result from this line of reasoning are limits of conjugate families, making them easy to work with—this occurs in our results below, in which the corresponding posterior distributions are Dirichlet.

The above definition is intuitive, but not sufficiently general to be applicable to spaces of functions. There are multiple technical issues:

(a): in many cases, $π$ cannot be taken to integrate to 1;
(b): probability distributions on spaces of functions may not admit Riemann-integrable densities;
(c): G may be defined via equivalence classes of transformations, leading to singular Jacobians; and
(d): infinite-dimensional measures that are non-normalizable are not well-behaved mathematically.

As a result, the above definition needs to be extended to a measure-theoretic setting. We call a transformation group G acting on a measure space nonsingular if for

g \in G

with

\tilde{π} (θ) = π [g (θ)]

, we have

π ≪ \tilde{π} ≪ π

, where ≪ denotes absolute continuity of measures.

Definition 2.

[Invariant Measure] Let G be a nonsingular transformation group acting on a measure space. We say that a measure π is invariant with respect to G if for any

g \in G

with

\tilde{π} (θ) = π [g (θ)]

and for any measurable subset A we have

\int_{Ω} I_{A} d π = \int_{Ω} I_{A} \frac{d \tilde{π}}{d π} d \tilde{π},

(9)

where Ω is the domain of π,

I_{A}

is the indicator function of the set A, and

\frac{d \tilde{π}}{d π}

is the Radon–Nikodym derivative of

\tilde{π}

with respect to π.

It can be seen by taking

π

to be absolutely continuous with respect to the Lebesgue measure that Equation (9) is a direct extension of Equation (8).

We would ultimately like to extend the above definition to the infinite-dimensional setting. Doing so directly is challenging, because

π

may be non-normalizable, in which case Kolmogorov’s Consistency Theorem and other analytic tools for infinite-dimensional probability measures do not apply. Here we sidestep this problem by instead extending the definition of invariance to allow us to define a sequence of approximately invariant measures, which in our setting can be taken to be probability measures. To do so, two additional definitions are needed.

Definition 3 (ε-invariant Measure).

Let G be a nonsingular transformation group acting on a measure space with invariant measure

\hat{π}

. We say that a sequence of measures

{π^{(ε)} : ε > 0}

is

ε

-invariant with respect to G if for any

g \in G

with

{\tilde{π}}^{(ε)} (θ) = π^{(ε)} [g (θ)]

and each measurable subset A, the inequality

| \int_{Ω} I_{A} d π^{(ε)} - \int_{Ω} I_{A} \frac{d {\tilde{π}}^{(ε)}}{d π^{(ε)}} d {\tilde{π}}^{(ε)} | < ε

(10)

implies that

| π^{(ε)} (A) - \hat{π} (A) | \leq δ μ (A),

(11)

where

μ (A)

is a function,

ε \to 0

implies that

δ \to 0

, and Ω is the domain of

π^{(ε)}

for all ε.

Definition 4 (ε-invariant Process).

Let

{Π^{(ε)} : ε > 0}

be a sequence of stochastic processes, and let G be a nonsingular transformation group. Let I be an arbitrary finite subset of the index set of the process, let

π_{I}^{(ε)}

be the finite-dimensional measure of

Π^{(ε)}

under I, and let

G_{I}

be a finite-dimensional homomorphism of G with invariant measure

{\hat{π}}_{I}

. We say that the sequence of processes

Π^{(ε)}

is

ε

-invariant if, for each I, each

g_{I} \in G_{I}

with

{\tilde{π}}_{I}^{(ε)} (θ) = π_{I}^{(ε)} [g_{I} (θ)]

and each measurable subset A, the inequality

| \int_{Ω_{I}} I_{A} d π_{I}^{(ε)} - \int_{Ω_{I}} I_{A} \frac{d {\tilde{π}}_{I}^{(ε)}}{d π_{I}^{(ε)}} d {\tilde{π}}_{I}^{(ε)} | < ε

(12)

implies that

| π_{I}^{(ε)} (A) - {\hat{π}}_{I} (A) | \leq δ μ_{I} (A),

(13)

where

μ_{I} (A)

is a function,

ε \to 0

implies that

δ \to 0

,

Ω_{I}

is the domain of

π_{I}^{(ε)}

for all ε, and

(ε, δ)

can be taken to be identical for all I.

Definition 4 has been explicitly chosen to formalize the notion of noninformativeness on a space of functions without constructing a non-normalizable infinite-dimensional measure.

To complete our assumptions, we need to specify G. Our definitions constitute a direct generalization of the transformation group used by Jaynes to derive the Haldane prior for

p = 2

—see Section 1.

Definition 5 (Probability Function Transformation Group).

Let

G_{\infty} = {g : S_{\infty} \to S_{\infty}}

(14)

be a nonsingular group of measurable functions under composition acting on the infinite-dimensional simplex

S_{\infty}

.

Definition 6 (Probability Vector Transformation Group).

For non-negative integer p and any vector

(c_{1}, . ., c_{p})

of non-negative constants, let

G_{p} = {g : (θ_{1}, . ., θ_{p}) \to (\frac{c_{1} θ_{1}}{\sum_{i = 1}^{p} c_{i} θ_{i}}, . ., \frac{c_{p} θ_{p}}{\sum_{i = 1}^{p} c_{i} θ_{i}})}

(15)

be a nonsingular group under composition acting on the p-dimensional simplex

S_{p}

, where each element

g \in G

represents an equivalence class of the transformations (15).

Note that

G_{p}

is a p-dimensional homomorphism of

G_{\infty}

—we use this property in our proofs below. It can also readily be seen that for any g, the constants

c_{i}

are only determined up to proportionality.

Proposition 1 (Radon–Nikodym Derivative).

For each

g \in G_{p}

and

\tilde{π} (θ) = π [g (θ)]

, the Radon–Nikodym derivative of

\tilde{π}

with respect to π is

\frac{d \tilde{π}}{d π} (θ) = \frac{\prod_{i = 1}^{p} c_{i}}{{(\sum_{i = 1}^{p} c_{i} θ_{i})}^{p}} .

(16)

Proof.

Let

λ

be the Lebesgue measure on the p-dimensional probability simplex, and define

\tilde{λ} (θ) = λ [g (θ)]

. Note first that

λ ≪ \tilde{λ} ≪ π ≪ \tilde{π} ≪ λ

. Note also that

\frac{d π}{d λ} = \frac{d \tilde{π}}{d \tilde{λ}},

(17)

because the same transformation g is used in defining

\tilde{π}

and

\tilde{λ}

. Then, note that

\frac{d \tilde{π}}{d π} = \frac{d \tilde{π}}{d π} \frac{d \tilde{λ}}{d λ} \frac{d λ}{d \tilde{λ}} = \frac{d λ}{d π} \frac{d \tilde{λ}}{d λ} \frac{d \tilde{π}}{d \tilde{λ}} = \frac{d \tilde{λ}}{d λ},

(18)

and hence it suffices to consider the transformation g applied to the Lebesgue measure. Consider an arbitrary hypercube B. We have

λ (B) = λ_{1} (B_{1}) . . λ_{p} (B_{p}),

(19)

where

λ_{i}

are 1-dimensional Lebesgue measures, for which we have that

λ_{i} (B_{i}) = b_{i} - a_{i},

(20)

where

[a_{i}, b_{i}]

is the one-dimensional projection of the hypercube B in dimension i. Consider now the transformation g. We may decompose g into d and n where

\begin{matrix} d & : (θ_{1}, . ., θ_{p}) \to (c_{1} θ_{1}, . ., c_{p} θ_{p}) & n & : (θ_{1}, . ., θ_{p}) \to (\frac{θ_{1}}{\sum_{i = 1}^{p} θ_{i}}, . ., \frac{θ_{p}}{\sum_{i = 1}^{p} θ_{p}}) . \end{matrix}

(21)

Now consider the effect of d and n on

λ_{i}

. We have

\begin{matrix} λ_{i} [d (B_{i})] & = c_{i} (b_{i} - a_{i}) & and & λ_{i} [n (B_{i})] & = \frac{b_{i} - a_{i}}{\sum_{i = 1}^{p} (b_{i} - a_{i})}, \end{matrix}

(22)

hence

λ_{i} [g (B_{i})] = \frac{c_{i} (b_{i} - a_{i})}{\sum_{j = 1}^{p} c_{j} (b_{j} - a_{j})} .

(23)

Therefore

λ [g (B)] = \prod_{i = 1}^{p} \frac{c_{i} (b_{i} - a_{i})}{\sum_{j = 1}^{p} c_{j} (b_{j} - a_{j})}

(24)

and we can compute the ratio

\frac{\tilde{λ} (B)}{λ (B)} = \frac{λ [g (B)]}{λ (B)} = \prod_{i = 1}^{p} \frac{c_{i} (b_{i} - a_{i})}{\sum_{j = 1}^{p} c_{j} (b_{j} - a_{j})} {[\prod_{i = 1}^{p} (b_{i} - a_{i})]}^{- 1} = \frac{\prod_{i = 1}^{p} c_{i}}{{[\sum_{i = 1}^{p} c_{i} (b_{i} - a_{i})]}^{p}} .

(25)

This holds for all B, hence the Radon–Nikodym derivative is just

\frac{d \tilde{λ}}{d λ} (θ) = \frac{\prod_{i = 1}^{p} c_{i}}{{(\sum_{i = 1}^{p} c_{i} θ_{i})}^{p}},

(26)

which is the desired result. ☐

Since we are working with non-normalizable measures as improper priors, we cannot rigorously talk about their probability densities. In many cases, such improper priors can be shown to be limits of families of conjugate priors for which the limiting posterior distribution is well-defined, making them usable in practice. To make our discussion of improper priors rigorous, we need the following definition.

Definition 7 (Generalized Density).

Let π be a measure on

R^{p}

(for p a positive integer) such that

π ≪ λ ≪ π

, where λ is Lebesgue measure on

R^{p}

. Suppose that the Radon–Nikodym derivative of π with respect to λ is Riemann-integrable, and define a family of functions equal to the Radon–Nikodym derivative up to a proportionality constant. We call any function in this family a generalized density of π.

2.2. Main Results

Remark 1 (Notation).

In the following results, we will assume that

(θ_{1}, . ., θ_{p})

is a probability vector of dimension

p \geq 2

.

G_{\infty}

and

G_{p}

will be the transformation groups identified in Definitions 5 and 6, respectively. As noted previously in Table 1,

Dir (α, F_{0})

will denote the Dirichlet distribution under the alternative parametrization based on concentration parameter α and mean probability vector

F_{0}

. This is equivalent to the usual parameterization in terms of concentration vector α by the identity

α = α F_{0}

—we refer to this as the

Dir (α)

distribution. Similarly,

DP (α, F_{0})

will refer to the Dirichlet Process with concentration parameter α and mean function

F_{0}

. We will refer to the improper priors defined via the conjugate limits as

α \to 0

of

Dir (α, F_{0})

and

DP (α, F_{0})

for arbitrary

F_{0}

as

Dir (0)

and

DP (0)

, respectively.

We are now ready to introduce our first result. The argument below is a direct generalization of the line of reasoning in Jaynes [12]: the Haldane prior obtained is a special case of our result for

p = 2

.

Theorem 1.

Among the class of measures that admit generalized densities, the measure π with generalized density

π (θ_{1}, . ., θ_{p}) \propto \frac{1}{\prod_{i = 1}^{p} θ_{i}},

(27)

which we call

Dir (0)

, is the unique invariant measure under

G_{p}

.

Proof.

An invariant measure

π

under

G_{p}

needs to satisfy the equation

\int_{S_{p}} I_{A} d π = \int_{S_{p}} I_{A} \frac{d \tilde{π}}{d π} d \tilde{π},

(28)

where

S_{p}

is the p-dimensional simplex and

\tilde{π} (θ) = π [g (θ)]

for some

g \in G_{p}

. Since

π

is assumed to admit a generalized density, we can rewrite (28) as a Riemann integral. In addition, we substitute in the transformation and Radon–Nikodym derivative, and get

\int_{A} π (θ_{1}, . ., θ_{p}) d θ_{1} . . d θ_{p} = \int_{A} π (\frac{c_{1} θ_{1}}{\sum_{i = 1}^{p} c_{i} θ_{i}}, . ., \frac{c_{p} θ_{p}}{\sum_{i = 1}^{p} c_{i} θ_{i}}) \frac{\prod_{i = 1}^{p} c_{i}}{{(\sum_{i = 1}^{p} c_{i} θ_{i})}^{p}} d θ_{1} . . d θ_{p} .

(29)

This formula needs to hold for all measurable sets A, and hence the functions inside the integrals need to be equal pointwise. This yields the functional equation

π (θ_{1}, . ., θ_{p}) = π (\frac{c_{1} θ_{1}}{\sum_{i = 1}^{p} c_{i} θ_{i}}, . ., \frac{c_{p} θ_{p}}{\sum_{i = 1}^{p} c_{i} θ_{i}}) \frac{\prod_{i = 1}^{p} c_{i}}{{(\sum_{i = 1}^{p} c_{i} θ_{i})}^{p}},

(30)

which will be the main subject of further study. This is a multivariate functional equation that at first may appear fearsome, but is in fact solvable via elementary methods. To solve it, recognizing that (30) must hold for all probability vectors

(θ_{1}, . ., θ_{p})

and all vectors

(c_{1}, . ., c_{p})

of positive constants

c_{i}

, we set

\begin{matrix} (θ_{1}, . ., θ_{p}) & = (p^{- 1}, . ., p^{- 1}) & and & \sum_{i = 1}^{p} c_{i} = 1, \end{matrix}

(31)

which yields

π (p^{- 1}, . ., p^{- 1}) = π (\frac{c_{1} p^{- 1}}{p^{- 1} \sum_{i = 1}^{p} c_{i}}, . ., \frac{c_{p} p^{- 1}}{p^{- 1} \sum_{i = 1}^{p} c_{i}}) \frac{\prod_{i = 1}^{p} c_{i}}{{(p^{- 1} \sum_{i = 1}^{p} c_{i})}^{p}} .

(32)

Then, by swapping

c_{i}

for

θ_{i}

, (32) rearranges into

π (θ_{1}, . ., θ_{p}) = \frac{π (p^{- 1}, . ., p^{- 1}) p^{- p}}{\prod_{i = 1}^{p} θ_{i}} \propto \frac{1}{\prod_{i = 1}^{p} θ_{i}},

(33)

since the numerator is not a function of any

θ_{i}

, and it can easily be checked that all such generalized densities are valid solutions to the original equation. Thus (33) is the functional equation’s unique solution and therefore the unique invariant measure under

G_{p}

. ☐

The same technique used to solve the functional equation in Theorem 1 can be used to prove a much stronger result: if the functional equation is true approximately, its solutions will approximate those of the exact equation. In the next result we make use of the definition of stability of a functional equation due to Hyers, Ulam and Rassias—see Jung [16] for details.

Corollary 1 (Hyers–Ulam–Rassias Stability).

Suppose we have

| π (θ_{1}, . ., θ_{p}) - π (\frac{c_{1} θ_{1}}{\sum_{i = 1}^{p} c_{i} θ_{i}}, . ., \frac{c_{p} θ_{p}}{\sum_{i = 1}^{p} c_{i} θ_{i}}) \frac{\prod_{i = 1}^{p} c_{i}}{{(\sum_{i = 1}^{p} c_{i} θ_{i})}^{p}} | < δ .

(34)

Then

\begin{matrix} | π (θ_{1}, . ., θ_{p}) - \hat{π} (θ_{1}, . ., θ_{p}) | & < δ \frac{e^{e^{- 1}}}{\prod_{i = 1}^{p} θ_{i}}, & where & \hat{π} (θ_{1}, . ., θ_{p}) & \propto \frac{1}{\prod_{i = 1}^{p} θ_{i}} . \end{matrix}

(35)

Proof.

By repeating the technique from the previous proof, we have

| π (p^{- 1}, . ., p^{- 1}) - π (c_{1}, . ., c_{p}) \frac{\prod_{i = 1}^{p} c_{i}}{p^{- p}} | < δ,

(36)

which can be rewritten

| π (θ_{1}, . ., θ_{p}) - \frac{π (p^{- 1}, . ., p^{- 1}) p^{- p}}{\prod_{i = 1}^{p} θ_{i}} | < δ \frac{p^{- p}}{\prod_{i = 1}^{p} θ_{i}} < δ \frac{e^{e^{- 1}}}{\prod_{i = 1}^{p} θ_{i}},

(37)

where the last inequality is strict because p is a positive integer. Letting

\frac{π (p^{- 1}, . ., p^{- 1}) p^{- p}}{\prod_{i = 1}^{p} θ_{i}} \propto \frac{1}{\prod_{i = 1}^{p} θ_{i}} \propto \hat{π} (θ_{1}, . ., θ_{p}),

(38)

we get

| π (θ_{1}, . ., θ_{p}) - \hat{π} (θ_{1}, . ., θ_{p}) | < δ \frac{e^{e^{- 1}}}{\prod_{i = 1}^{p} θ_{i}},

(39)

which is the stability result desired. ☐

This suffices to prove our result for the Dirichlet distribution.

Theorem 2.

Dir (ε, F_{0})

is an ε-invariant measure under

G_{p}

for all

F_{0}

.

Proof.

By repeating the steps of Theorem 1 and combining them with Corollary 1, we obtain that

Dir (ε, F_{0})

is

ε

-invariant under

G_{p}

if and only if it satisfies

\begin{matrix} | π^{(ε)} (θ_{1}, . ., θ_{p}) - \hat{π} (θ_{1}, . ., θ_{p}) | & < δ \frac{e^{e^{- 1}}}{\prod_{i = 1}^{p} θ_{i}} & for some & \hat{π} (θ_{1}, . ., θ_{p}) & \propto \frac{1}{\prod_{i = 1}^{p} θ_{i}} . \end{matrix}

(40)

Substituting in

Dir (ε, F_{0})

, and choosing the constant

C_{ε}

of the generalized density

\hat{π}

to be the same as for the Dirichlet, we get

| C_{ε} \prod_{i = 1}^{p} θ_{i}^{ε F_{0 i} - 1} - \frac{C_{ε}}{\prod_{i = 1}^{p} θ_{i}} | < δ \frac{e^{e^{- 1}}}{\prod_{i = 1}^{p} θ_{i}},

(41)

where

F_{0 i}

are the components of the probability vector

F_{0}

, and this expression simplifies to

C_{ε} e^{e} | \prod_{i = 1}^{p} θ_{i}^{ε F_{0 i}} - 1 | < δ .

(42)

Since

0 \leq θ_{i} \leq 1

for all i, the product is upper bounded by 1 and lower bounded by 0. Thus the inequality holds near zero if

C_{ε} < δ

(43)

for all

(θ_{1}, . ., θ_{p})

, and since

C_{ε} \to 0

we get that, as

ε \to 0

, we can choose

δ

such that

δ \to 0

. Thus,

Dir (ε, F_{0})

is

ε

-invariant for all

F_{0}

. ☐

We now extend Theorem 2 to get an analogous result for the Dirichlet Process.

Theorem 3.

DP (ε, F_{0})

is an ε-invariant process under

G_{\infty}

for all

F_{0}

.

Proof.

Consider an arbitrary finite-dimensional index I with corresponding homomorphism

G_{I}

and finite-dimensional measure

π_{I}^{(ε)}

. It follows from Theorem 2 that

π_{I}^{(ε)}

is

ε

-invariant with

C_{ε} < δ .

(44)

This inequality depends only on

C_{ε}

, so it suffices to show that this constant can be bounded by another constant that is not a function of p and approaches 0.

C_{ε}

is an instance of the inverse multivariate beta function, which is a ratio of gamma functions. It is well known that

lim_{x \to 0} [\frac{1}{x} - Γ (x)] = γ,

(45)

where

γ

is the Euler-Mascheroni constant. Therefore, we have

C_{ε} = \frac{Γ (ε)}{\prod_{i = 1}^{p} Γ (ε F_{0 i})} = \frac{O (1 / ε)}{\prod_{i = 1}^{p} O (1 / ε)} \leq \frac{O (1 / ε)}{\prod_{i = 1}^{2} O (1 / ε)} = O (ε) \to 0

(46)

as

ε \to 0

. Thus, for each

ε

, we can choose a

δ

to satisfy the required expressions under all finite-dimensional index sets, and

DP (ε, F_{0})

is therefore an

ε

-invariant process. ☐

We conclude our theoretical investigation with a conjecture: the

ε

-invariance of all finite-dimensional distributions with a uniform

δ

should suffice for invariance with respect to the original group acting on the infinite-dimensional space.

Conjecture 4.

A stochastic process is an ε-invariant process if and only if the measure of its sample paths is an ε-invariant measure.

One approach to attempting a proof of this conjecture would involve appropriately extending Kolmogorov’s Consistency Theorem to

σ

-finite infinite-dimensional measures. This can be done, but the notions involved are quite technical—see Yamasaki [17] for more details.

3. Discussion

To see how our results may be applied, consider again the randomized controlled trial of Section 1, and suppose now that the outcome

y_{i}

for participant i in the experimental group is categorical with p levels. Under exchangeability, a minor extension of de Finetti’s Theorem for dichotomous outcomes then yields that the likelihood can be expressed as

y_{i} ∣ θ \overset{iid}{\sim} MN (1, θ),

(47)

in which MN

(k, θ)

is the multinomial distribution with parameters k and

θ

. Theorem 1 implies that, modulo inherent abuse of notation under improper priors,

(θ_{1}, . ., θ_{p}) \sim Dir (0)

(48)

is the unique prior that obeys the fundamental invariance possessed by the problem—namely, invariance with respect to all transformations of probability vectors that preserve normalization. Thus we have extended Jaynes’s result for binomial outcomes to the multinomial setting, yielding another instance of optimal Bayesian analysis.

Generalizing to the setting where

y

is an exchangeable sequence of real-valued outcomes, de Finetti’s most general representation theorem implies that

y_{i} ∣ F \overset{iid}{\sim} F

(49)

is the unique likelihood. If little is known about F, and it is therefore approximately invariant under all measurable functions—i.e., under

G_{\infty}

, see Definition 5—the prior given by Theorem 3 is

F \sim DP (ε, F_{0}) .

(50)

By the usual conjugate updating in the Dirichlet Process setting, the posterior on F given

y

with the prior in (50) is

F ∣ y \sim DP (ε + n, \frac{ε}{ε + n} F_{0} + \frac{n}{ε + n} {\hat{F}}_{n}),

(51)

in which

{\hat{F}}_{n}

is the empirical CDF based on

y

. Since

ε

may be taken as close to zero as one wishes, it is natural to regard

F ∣ y \sim DP (n, {\hat{F}}_{n})

(52)

as an instance of approximately optimal Bayesian analysis for all

ε

. Conjecture 4 would strengthen this assertion—provided

DP (0)

can be rigorously constructed as an infinite-dimensional

σ

-finite measure, which is beyond the scope of this work.

Though the simplicity of this analysis may at first make it seem limited, its appeal comes from its extremely general ability to characterize uncertainty. See, e.g., Terenin and Draper [18] for an example of a

DP (n, {\hat{F}}_{n})

analysis in two randomized controlled trials in e-commerce, one with sample sizes in the tens of millions. Furthermore, sampling from

DP (n, {\hat{F}}_{n})

on a discrete domain has recently been shown in a completely different setting—see Appendix B of Terenin et al. [19]—to be asymptotically equivalent to the widely-used frequentist bootstrap of Efron [20]. This also applies to the Bayesian bootstrap of Rubin [21], since it is asymptotically equivalent to the frequentist version. Our analysis provides a Bayesian nonparametric justification for this class of methods.

Bayesian analysis cannot proceed without the specification of a stochastic model—prior and sampling distribution—relating known quantities to unknown quantities: data to parameters. One of the great challenges of applied statistics is that the model is not necessarily uniquely determined by the context of the problem under study, giving rise to model uncertainty, which if not assessed and correctly propagated can cause badly calibrated and unreliable inference, prediction and decision—see, e.g., Draper [22]. Perhaps the simplest way to avoid model uncertainty is to recognize settings in which it does not exist—situations where broad and simple mathematical assumptions, rendered true by problem context, lead to unique posterior distributions. Our term for this is optimal Bayesian analysis. It seems worthwhile (a) to catalog situations in which optimal analysis is possible and (b) to work to extend the list of such situations—Theorems 1 and 3 are two contributions to this effort.

Acknowledgments

We are grateful to Daniele Venturi, Yuanran Zhu, and Catherine Brennan for their thoughts on differential equations, which we originally used in a much longer and more complicated proof of the solution of our functional equation. We are grateful to Dan Simpson for his thoughts on infinite-dimensional measures. We are additionally grateful to Juhee Lee for her comments on prior specification, and to Thanasis Kottas for his thoughts on Dirichlet Processes. Membership on this list does not imply agreement with the ideas expressed here, nor are any of these people responsible for any errors that may be present.

Author Contributions

Alexander Terenin and David Draper contributed to the conceptual and theoretical development of the methods in this work and co-wrote the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

Fisher, R.A. Statistical Methods for Research Workers; Oliver and Boyd: Edinburgh, UK, 1925. [Google Scholar]
Medical Research Council. Streptomycin treatment of pulmonary tuberculosis. Br. Med. J. 1948, 2, 769–782. [Google Scholar]
Kohavi, R.; Longbotham, R. Online controlled experiments and AB tests. In Encyclopedia of Machine Learning and Data Mining; Springer: Berlin, Germany, 2015. [Google Scholar]
De Finetti, B. La prévision: Ses lois logiques, ses sources subjectives. Annales de l’institut Henri Poincaré 1937, 7, 1–68. (In French) [Google Scholar]
Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A Math. Phys. Eng. Sci. 1946, 186, 453–461. [Google Scholar] [CrossRef]
Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
Eaton, M.L. Group Invariance Applications in Statistics; Regional Conference Series in Probability and Statistics; Institute of Mathematical Statistics: Shaker Heights, OH, USA, 1989. [Google Scholar]
Ferguson, T.S. A Bayesian analysis of some nonparametric problems. Ann. Stat. 1973, 1, 209–230. [Google Scholar] [CrossRef]
Laplace, P.S. Mémoire sur la probabilité des causes par les événements. Mémoires de l’Académie Royale des Sciences de Paris 1774, 6, 621. (In French) [Google Scholar]
Hald, A. A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935; Springer: Berlin, Germany, 2007. [Google Scholar]
Bernardo, J.M. Reference posterior distributions for Bayesian inference. J. R. Stat. Soc. Ser. B (Methodol.) 1979, 41, 113–147. [Google Scholar]
Jaynes, E.T. Prior probabilities. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 227–241. [Google Scholar] [CrossRef]
Bush, C.A.; Lee, J.; MacEachern, S.N. Minimally informative prior distributions for non-parametric Bayesian analysis. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2010, 72, 253–268. [Google Scholar] [CrossRef]
Lee, J.; MacEachern, S.N.; Lu, Y.; Mills, G.B. Local-mass preserving prior distributions for nonparametric Bayesian models. Bayesian Anal. 2014, 9, 307–330. [Google Scholar] [CrossRef]
Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
Jung, S.M. Hyers-Ulam-Rassias Stability of Functional Equations in Nonlinear Analysis; Springer: Berlin, Germany, 2011. [Google Scholar]
Yamasaki, Y. Measures on Infinite-Dimensional Spaces; World Scientific: Singapore, 1985. [Google Scholar]
Terenin, A.; Draper, D. Cox’s Theorem and the Jaynesian Interpretation of Probability. arXiv, 2015; arXiv:1507.06597. [Google Scholar]
Terenin, A.; Magnusson, M.; Jonsson, L.; Draper, D. Pólya Urn Latent Dirichlet Allocation: A doubly sparse massively parallel sampler. arXiv, 2017; arXiv:1704.03581. [Google Scholar]
Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26. [Google Scholar] [CrossRef]
Rubin, D.B. The Bayesian Bootstrap. Ann. Stat. 1981, 9, 130–134. [Google Scholar] [CrossRef]
Draper, D. Assessment and propagation of model uncertainty. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 1995, 57, 45–97. [Google Scholar]

Table 1. Notation. Bold symbols refer to vectors. Improper distributions are considered only as limits of conjugate families—we do not attempt to define

DP (0)

directly as a non-normalizable measure. CDF: cumulative distribution function.

Table 1. Notation. Bold symbols refer to vectors. Improper distributions are considered only as limits of conjugate families—we do not attempt to define

DP (0)

directly as a non-normalizable measure. CDF: cumulative distribution function.

Expression	Description
$Dir (α)$	The Dirichlet distribution with concentration vector $α$ .
$Dir (α, F_{0})$	The Dirichlet distribution with concentration parameter $α$ and mean probability vector $F_{0}$ .
$Dir (0)$	The improper Dirichlet distribution corresponding to ${lim}_{α \to 0} Dir (α, F_{0})$ for $F_{0}$ arbitrary.
$DP (α, F_{0})$	The Dirichlet Process with concentration parameter $α$ and mean CDF $F_{0}$ .
$DP (n, {\hat{F}}_{n})$	The Dirichlet Process whose mean CDF ${\hat{F}}_{n}$ is the empirical CDF of the data set $y$ of size n.
$DP (0)$	The improper Dirichlet Process corresponding to ${lim}_{α \to 0} (α, F_{0})$ for arbitrary $F_{0}$ .

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Terenin, A.; Draper, D. A Noninformative Prior on a Space of Distribution Functions. Entropy 2017, 19, 391. https://doi.org/10.3390/e19080391

AMA Style

Terenin A, Draper D. A Noninformative Prior on a Space of Distribution Functions. Entropy. 2017; 19(8):391. https://doi.org/10.3390/e19080391

Chicago/Turabian Style

Terenin, Alexander, and David Draper. 2017. "A Noninformative Prior on a Space of Distribution Functions" Entropy 19, no. 8: 391. https://doi.org/10.3390/e19080391

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Noninformative Prior on a Space of Distribution Functions

Abstract

1. Introduction

2. Results

2.1. Preliminaries

2.2. Main Results

3. Discussion

Acknowledgments

Author Contributions

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI