Abstract
Under mild conditions, strong consistency of the Bayes estimator of the density is proved. Moreover, the Bayes risk (for some common loss functions) of the Bayes estimator of the density (i.e., the posterior predictive density) goes to zero as the sample size goes to ∞. In passing, a similar result is obtained for the estimation of the sampling distribution.
Keywords:
Bayesian density estimation; Bayesian estimation of the sampling distribution; posterior predictive distribution; consistency of the Bayes estimator
MSC:
primary 62G07; 62G20; secondary 62G07
1. Introduction
In a statistical context, the expression "the probability of an event A" (usually denoted $P_\theta(A)$) is really a misuse of language, since it depends on the unknown parameter $\theta$. Before performing the experiment, this expression can be assigned a natural meaning from a Bayesian perspective as the prior predictive probability of A, since this is the prior mean of the probabilities $P_\theta(A)$. However, in accordance with Bayesian philosophy, once the experiment has been carried out and the value $\omega$ has been observed, a more appropriate estimate of $P_\theta(A)$ is the posterior predictive probability of A given $\omega$. The author has recently proved ([1]) that not only is this the Bayes estimator of $P_\theta(A)$, but also that the posterior predictive distribution (resp. the posterior predictive density) is the Bayes estimator of the sampling distribution $P_\theta$ (resp. the density $p_\theta$) for the squared total variation (resp. the squared $L^1$) loss function in the Bayesian experiment corresponding to an n-sized sample of the unknown distribution. It should be noted that the loss functions considered derive in a natural way from the commonly used squared error loss function when estimating a real function of the parameter.
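For concreteness, for an estimator $M$ of the sampling distribution $P_\theta$, with density $m$ (generic symbols used only for this display), the two loss functions just mentioned take the form
$$L_{TV}(\theta, M) = \Big(\sup_{A} \big|M(A) - P_\theta(A)\big|\Big)^2, \qquad L_{1}(\theta, m) = \Big(\int \big|m - p_\theta\big|\, d\mu\Big)^2,$$
where the supremum is taken over all events, $\mu$ is a dominating measure, and the normalization convention chosen for the total variation distance (a possible factor 1/2) is immaterial for what follows.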
The posterior predictive distribution is the cornerstone of Predictive Inference, which seeks to make inferences about a new unknown observation from a preceding random sample (see [2,3]). With that idea in mind, it has also been used in other areas such as model selection, testing for discordancy, goodness of fit, perturbation analysis, and classification (see additional fields of application in [1,2,3,4,5]). Furthermore, in [1] it has been presented as a solution to the Bayesian density estimation problem, with several examples illustrating the results and, in particular, showing how to calculate a posterior predictive density. Reference [3] provides many other examples of determining the posterior predictive distribution. In practice, however, the explicit evaluation of the posterior predictive distribution may be cumbersome, and its simulation may become preferable. The aforementioned work [3] also constitutes a good reference for such simulation methods, and hence for the computation of the Bayes estimators of the density and the sampling distribution.
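As a simple illustration of such a calculation, consider the standard conjugate Bernoulli–Beta model (a textbook example, not necessarily one of those treated in [1]): for a Bernoulli($\theta$) sample $x_1, \dots, x_n$ and a Beta$(a, b)$ prior on $\theta$, the posterior is Beta$(a + s,\, b + n - s)$ with $s = \sum_{i=1}^n x_i$, and the posterior predictive density of a new observation $x \in \{0, 1\}$ (with respect to the counting measure) is
$$q(x \mid x_1, \dots, x_n) = \int_0^1 \theta^{x} (1 - \theta)^{1 - x} \, d\mathrm{Beta}(a + s,\, b + n - s)(\theta) = \Big(\frac{a + s}{a + b + n}\Big)^{x} \Big(\frac{b + n - s}{a + b + n}\Big)^{1 - x}.$$
Since $s/n \to \theta$ almost surely, this estimate converges to the true density $\theta^{x}(1 - \theta)^{1 - x}$, in line with the consistency results obtained below.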
We refer the reader to the references cited in [1] for other statistical uses of the posterior predictive distribution and for some useful ways of calculating it.
In this communication, we shall explore the asymptotic behaviour of the posterior predictive density as the Bayes estimator of the density, showing its strong consistency and that the Bayes risk goes to 0 as n goes to ∞.
2. The Framework
Let
$$(\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\})$$
be a Bayesian experiment (where Q denotes the prior distribution on the parameter space $(\Theta, \mathcal{T})$), and consider the infinite product Bayesian experiment
$$(\Omega^{\mathbb{N}}, \mathcal{A}^{\mathbb{N}}, \{P_\theta^{\mathbb{N}} : \theta \in (\Theta, \mathcal{T}, Q)\})$$
corresponding to an infinite sample of the unknown distribution $P_\theta$. Let us write
$$(\Omega^{n}, \mathcal{A}^{n}, \{P_\theta^{n} : \theta \in (\Theta, \mathcal{T}, Q)\})$$
for integer n.
We suppose that $\theta \in (\Theta, \mathcal{T}) \mapsto P_\theta$ is a Markov kernel, i.e., that the map $\theta \mapsto P_\theta(A)$ is $\mathcal{T}$-measurable for every $A \in \mathcal{A}$. Let
be the joint distribution of the parameter and the observations, i.e.,
As (i.e., the probability distribution of J with respect to ), is a version of the conditional distribution (regular conditional probability) . Analogously, is a version of the conditional distribution .
Let , the prior predictive distribution in (so that is the prior mean of the probabilities ). Similarly, write for the prior predictive distribution in . So, the posterior distribution given satisfies
Denote by for the posterior distribution given .
Write for the posterior predictive distribution given defined for as
So is nothing but the posterior mean given of the probabilities .
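For concreteness, writing $Q(\cdot \mid \omega_1, \dots, \omega_n)$ for the posterior distribution given the observations $\omega_1, \dots, \omega_n$ and $P^{*}_{n}(\cdot \mid \omega_1, \dots, \omega_n)$ for the posterior predictive distribution (generic symbols introduced only for this display), this means that, for every event $A \in \mathcal{A}$,
$$P^{*}_{n}(A \mid \omega_1, \dots, \omega_n) = \int_\Theta P_\theta(A)\, dQ(\theta \mid \omega_1, \dots, \omega_n).$$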
In the dominated case, we can assume without loss of generality that the dominating measure $\mu$ is a probability measure (because of (1) below). We write $p_\theta := dP_\theta/d\mu$. The likelihood function $(\omega, \theta) \mapsto p_\theta(\omega)$ is assumed to be jointly measurable.
We have that, for all n and every event ,
which proves that
is a -density of that we recognize as the posterior predictive density on given .
In the same way,
is a -density of , the posterior predictive density on given .
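In the same generic notation, these computations amount to saying that the posterior predictive density is the posterior mean of the densities $p_\theta$: for instance, given the observations $\omega_1, \dots, \omega_n$,
$$q_n(x \mid \omega_1, \dots, \omega_n) = \int_\Theta p_\theta(x)\, dQ(\theta \mid \omega_1, \dots, \omega_n), \qquad x \in \Omega,$$
is a $\mu$-density of the posterior predictive distribution $P^{*}_{n}(\cdot \mid \omega_1, \dots, \omega_n)$; the symbol $q_n$ is again a placeholder chosen only for this sketch.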
In what follows, we will assume the following additional regularity conditions:
- (i) $(\Omega, \mathcal{A})$ is a standard Borel space;
- (ii) Θ is a Borel subset of a Polish space and $\mathcal{T}$ is its Borel σ-field;
- (iii) the family $\{P_\theta : \theta \in \Theta\}$ is identifiable.
According to [1], the posterior predictive distribution (resp. the posterior predictive density) is the Bayes estimator of the sampling distribution (resp. the density) for the squared total variation (resp. the squared $L^1$) loss function in the product experiment . Analogously, the posterior predictive distribution (resp. the posterior predictive density) is the Bayes estimator of the sampling distribution (resp. the density) for the squared total variation (resp. the squared $L^1$) loss function in the product experiment .
As a particular case of a well-known result relating the total variation distance between two probability measures and the $L^1$-distance between their densities, we have that, for probability measures $P_1$ and $P_2$ on $(\Omega, \mathcal{A})$ with respective $\mu$-densities $p_1$ and $p_2$,
$$\sup_{A \in \mathcal{A}} \big|P_1(A) - P_2(A)\big| = \frac{1}{2} \int \big|p_1 - p_2\big|\, d\mu. \qquad (1)$$
3. The Main Result
We ask whether the Bayes risk of the Bayes estimator of the sampling distribution goes to zero as $n \to \infty$, i.e., whether
In terms of densities, the question is whether the Bayes risk of the Bayes estimator of the density goes to zero as $n \to \infty$, i.e., whether
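Writing, only for concreteness, $P^{*}_{n}(\cdot \mid \omega_1, \dots, \omega_n)$ and $q_n(\cdot \mid \omega_1, \dots, \omega_n)$ for the posterior predictive distribution and density given the first n coordinates of $\omega \in \Omega^{\mathbb{N}}$ (placeholder symbols, as above), the two questions are whether
$$\int_\Theta \int_{\Omega^{\mathbb{N}}} \Big(\sup_{A \in \mathcal{A}} \big|P^{*}_{n}(A \mid \omega_1, \dots, \omega_n) - P_\theta(A)\big|\Big)^2\, dP_\theta^{\mathbb{N}}(\omega)\, dQ(\theta) \xrightarrow[n \to \infty]{} 0$$
and
$$\int_\Theta \int_{\Omega^{\mathbb{N}}} \Big(\int_\Omega \big|q_n(\cdot \mid \omega_1, \dots, \omega_n) - p_\theta\big|\, d\mu\Big)^2\, dP_\theta^{\mathbb{N}}(\omega)\, dQ(\theta) \xrightarrow[n \to \infty]{} 0;$$
the non-squared versions of these losses are treated along the way, and the squared versions are deduced from them at the end of the section.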
Let us consider the auxiliary Bayesian experiment
For , and , we will continue to write and , and now we write .
The new prior predictive distribution is since
To compute the new posterior distributions, notice that
On the other hand,
So,
It follows that if then
when , we have that is an increasing sequence of sub-σ-fields of such that . According to the martingale convergence theorem of Lévy, if Y is -measurable and -integrable, then
converges -a.e. and in $L^1$ to .
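In symbols (with $\mathcal{F}_n$, $\mathcal{F}$ generic symbols for the σ-fields involved): if $(\mathcal{F}_n)_n$ is an increasing sequence of sub-σ-fields of $\mathcal{F}$, $\mathcal{F}_\infty := \sigma\big(\bigcup_n \mathcal{F}_n\big)$, and Y is $\mathcal{F}$-measurable and integrable, then
$$E(Y \mid \mathcal{F}_n) \xrightarrow[n \to \infty]{} E(Y \mid \mathcal{F}_\infty) \qquad \text{a.e. and in } L^1;$$
in particular, the limit is Y itself when Y is $\mathcal{F}_\infty$-measurable.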
Let us consider the -integrable function
We shall see that
Indeed, given and , we have that
which proves (2).
Analogously, it can be shown that
Hence, it follows from the aforementioned theorem of Lévy that
and
i.e.,
On the other hand, as a consequence of a known theorem of Doob (see Theorem 6.9 and Proposition 6.10 of [4], pp. 129–130), we have that, for every ,
for Q-almost every . Hence
for Q-almost every , i.e., given there exists such that and, ,
So, for , there exists such that and
In particular,
From (4) and (6), it follows that ,
From this and (5), it follows that
i.e., the risk of the Bayes estimator of the density for the $L^1$ loss function goes to 0 as $n \to \infty$.
It follows from this and (1) that
i.e., the risk of the Bayes estimator of the sampling distribution for the total variation loss function goes to 0 as $n \to \infty$.
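For ease of reference, the consequence of Doob's theorem used above (Theorem 6.9 and Proposition 6.10 of [4]) can be summarized as follows, again in generic notation: under the measurability and identifiability assumptions of Section 2, for every bounded and measurable function $f$ on $\Theta$,
$$E\big(f(\theta) \mid \omega_1, \dots, \omega_n\big) \xrightarrow[n \to \infty]{} f(\theta) \qquad P_\theta^{\mathbb{N}}\text{-a.e., for } Q\text{-almost every } \theta,$$
where the left-hand side denotes the posterior expectation of $f$ given the first $n$ observations; in particular, the posterior distribution concentrates around the true value of the parameter for Q-almost every $\theta$.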
We ask whether these results remain true for the squared versions of the loss functions. The answer is affirmative because of the following general result: let $(X_n)_n$ be a sequence of real random variables on a probability space such that $\lim_n E(|X_n|) = 0$. If there exists $k > 0$ such that $|X_n| \le k$ for all n, then $\lim_n E(X_n^2) = 0$, because
$$E(X_n^2) \le k \cdot E(|X_n|) \xrightarrow[n \to \infty]{} 0.$$
In our case, $X_n$ is the $L^1$-distance between the posterior predictive density and the true density (resp. the total variation distance between the posterior predictive distribution and the sampling distribution), and these quantities are bounded by 2 (resp. by 1).
So, we have proved the following result.
Theorem 1.
Let $(\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\})$ be a Bayesian experiment dominated by a σ-finite measure μ. Let us assume that $(\Omega, \mathcal{A})$ is a standard Borel space, and that Θ is a Borel subset of a Polish space and $\mathcal{T}$ is its Borel σ-field. Assume also that the likelihood function $(\omega, \theta) \mapsto p_\theta(\omega)$ is jointly measurable and that the family $\{P_\theta : \theta \in \Theta\}$ is identifiable. Then:
- (a) The posterior predictive density is the Bayes estimator of the density in the product experiment for the squared $L^1$ loss function. Moreover, the Bayes risk converges to 0 both for the $L^1$ loss function and for the squared $L^1$ loss function.
- (b) The posterior predictive distribution is the Bayes estimator of the sampling distribution in the product experiment for the squared total variation loss function. Moreover, the Bayes risk converges to 0 both for the total variation loss function and for the squared total variation loss function.
- (c) The posterior predictive density is a strongly consistent estimator of the density $p_\theta$, i.e., the $L^1$-distance between them converges to 0 almost surely, for Q-almost every θ (see the display following the theorem).
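In the generic notation used earlier (with $q_n(\cdot \mid \omega_1, \dots, \omega_n)$ a placeholder symbol for the posterior predictive density given the first n observations), part (c) reads:
$$\lim_{n \to \infty} \int_\Omega \big| q_n(x \mid \omega_1, \dots, \omega_n) - p_\theta(x) \big|\, d\mu(x) = 0 \qquad P_\theta^{\mathbb{N}}\text{-a.e., for } Q\text{-almost every } \theta.$$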
Funding
This research was funded by the Junta de Extremadura (Spain), grant number GR21044.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The author declares no conflict of interest.
References
- Nogales, A.G. On Bayesian estimation of densities and sampling distributions: The posterior predictive distribution as the Bayes estimator. Stat. Neerl. 2021, accepted.
- Geisser, S. Predictive Inference: An Introduction; Chapman & Hall: New York, NY, USA, 1993.
- Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; CRC Press (Taylor & Francis Group): Boca Raton, FL, USA, 2014.
- Ghosal, S.; van der Vaart, A. Fundamentals of Nonparametric Bayesian Inference; Cambridge University Press: Cambridge, UK, 2017.
- Rubin, D.B. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat. 1984, 12, 1151–1172.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).