Article

On Consistency of the Bayes Estimator of the Density

by
Agustín G. Nogales
Departamento de Matemáticas, IMUEx, Universidad de Extremadura, 06006 Badajoz, Spain
Mathematics 2022, 10(4), 636; https://doi.org/10.3390/math10040636
Submission received: 2 February 2022 / Revised: 16 February 2022 / Accepted: 17 February 2022 / Published: 18 February 2022
(This article belongs to the Special Issue Probability Theory and Stochastic Modeling with Applications)

Abstract

Under mild conditions, strong consistency of the Bayes estimator of the density is proved. Moreover, the Bayes risk (for some common loss functions) of the Bayes estimator of the density (i.e., the posterior predictive density) goes to zero as the sample size goes to ∞. In passing, a similar result is obtained for the estimation of the sampling distribution.

1. Introduction

In a statistical context, the expression "the probability of an event $A$" (usually denoted $P_\theta(A)$) depends on the unknown parameter, so it is really a misuse of language. Before the experiment is performed, this expression can be given a natural meaning from a Bayesian perspective as the prior predictive probability of $A$, since this is the prior mean of the probabilities $P_\theta(A)$. However, in accordance with Bayesian philosophy, once the experiment has been carried out and the value $\omega$ has been observed, a more appropriate estimate of $P_\theta(A)$ is the posterior predictive probability of $A$ given $\omega$. The author has recently proved ([1]) not only that this is the Bayes estimator of $P_\theta(A)$, but also that the posterior predictive distribution (resp. the posterior predictive density) is the Bayes estimator of the sampling distribution $P_\theta$ (resp. the density $p_\theta$) for the squared total variation (resp. the squared $L_1$) loss function in the Bayesian experiment corresponding to an $n$-sized sample of the unknown distribution. It should be noted that the loss functions considered derive in a natural way from the squared error loss function commonly used when estimating a real function of the parameter.
The posterior predictive distribution is the cornerstone of predictive inference, which seeks to make inferences about a new unknown observation from a preceding random sample (see [2,3]). With that idea in mind, it has also been used in other areas such as model selection, testing for discordancy, goodness of fit, perturbation analysis, and classification (see additional fields of application in [1,2,3,4,5]). Furthermore, in [1] it has been presented as a solution to the Bayesian density estimation problem, with several examples illustrating the results and, in particular, the calculation of a posterior predictive density. Reference [3] provides many other examples of determining the posterior predictive distribution. In practice, however, explicit evaluation of the posterior predictive distribution may be cumbersome, and simulating it may be preferable. The aforementioned work [3] is also a good reference for such simulation methods, and hence for the computation of the Bayes estimators of the density and the sampling distribution.
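As a concrete, purely illustrative instance of the simulation route, the following minimal sketch uses a hypothetical conjugate Bernoulli model with a Beta prior (the prior parameters, sample size, and true parameter value are arbitrary choices, not taken from [1]); here the posterior predictive probability of an event has a closed form against which the simulation can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical conjugate set-up: Bernoulli(theta) sampling model, Beta(a, b) prior.
a, b = 2.0, 3.0
theta_true = 0.7
n = 50
sample = rng.binomial(1, theta_true, size=n)

# The posterior given the n observations is Beta(a + s, b + n - s), s = number of successes.
a_post, b_post = a + sample.sum(), b + n - sample.sum()

# Closed-form posterior predictive probability of {X = 1}: the posterior mean of theta.
p1_exact = a_post / (a_post + b_post)

# Simulation of the posterior predictive distribution: draw theta from the posterior and
# then a new observation from P_theta; the empirical frequency approximates p1_exact.
thetas = rng.beta(a_post, b_post, size=100_000)
new_obs = rng.binomial(1, thetas)
p1_sim = new_obs.mean()

print(f"exact posterior predictive P(X=1): {p1_exact:.4f}")
print(f"simulated                        : {p1_sim:.4f}")
```

The same two-stage recipe (draw the parameter from the posterior, then a new observation from the corresponding sampling distribution) applies whenever posterior draws are available, for instance from an MCMC sampler.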
We refer the reader to the references cited in [1] for other statistical uses of the posterior predictive distribution and some useful ways to calculate it.
In this communication, we explore the asymptotic behaviour of the posterior predictive density as the Bayes estimator of the density, showing its strong consistency and that its Bayes risk goes to 0 as $n$ goes to $\infty$.

2. The Framework

Let
$$(\Omega,\mathcal{A},\{P_\theta:\theta\in(\Theta,\mathcal{T},Q)\})$$
be a Bayesian experiment (where $Q$ denotes the prior distribution on the parameter space $(\Theta,\mathcal{T})$), and consider the infinite product Bayesian experiment
$$(\Omega^{\mathbb{N}},\mathcal{A}^{\mathbb{N}},\{P_\theta^{\mathbb{N}}:\theta\in(\Theta,\mathcal{T},Q)\})$$
corresponding to an infinite sample of the unknown distribution $P_\theta$. Let us write
$$I(\omega,\theta):=\omega,\qquad J(\omega,\theta):=\theta,\qquad I_n(\omega,\theta):=\omega_n\qquad\text{and}\qquad I^{(n)}(\omega,\theta):=\omega^{(n)}:=(\omega_1,\dots,\omega_n)$$
for $\omega=(\omega_k)_k\in\Omega^{\mathbb{N}}$, $\theta\in\Theta$ and $n\in\mathbb{N}$.
We suppose that $P^{\mathbb{N}}(\theta,A):=P_\theta^{\mathbb{N}}(A)$ defines a Markov kernel. Let
$$\Pi_{\mathbb{N}}:=P^{\mathbb{N}}\otimes Q$$
be the joint distribution of the parameter and the observations, i.e.,
$$\Pi_{\mathbb{N}}(A\times T)=\int_T P_\theta^{\mathbb{N}}(A)\,dQ(\theta),\qquad A\in\mathcal{A}^{\mathbb{N}},\ T\in\mathcal{T}.$$
As $Q=\Pi_{\mathbb{N}}^{J}$ (i.e., the probability distribution of $J$ with respect to $\Pi_{\mathbb{N}}$), $P_\theta^{\mathbb{N}}$ is a version of the conditional distribution (regular conditional probability) $\Pi_{\mathbb{N}}^{I\mid J=\theta}$. Analogously, $P_\theta^{n}$ is a version of the conditional distribution $\Pi_{\mathbb{N}}^{I^{(n)}\mid J=\theta}$.
Let $\beta_{Q,\mathbb{N}}:=\Pi_{\mathbb{N}}^{I}$ be the prior predictive distribution on $\Omega^{\mathbb{N}}$ (so that $\beta_{Q,\mathbb{N}}(A)$ is the prior mean of the probabilities $P_\theta^{\mathbb{N}}(A)$). Similarly, write $\beta_{Q,n}:=\Pi_{\mathbb{N}}^{I^{(n)}}$ for the prior predictive distribution on $\Omega^{n}$. The posterior distribution $P_{\omega,\mathbb{N}}:=\Pi_{\mathbb{N}}^{J\mid I=\omega}$ given $\omega\in\Omega^{\mathbb{N}}$ then satisfies
$$\Pi_{\mathbb{N}}(A\times T)=\int_T P_\theta^{\mathbb{N}}(A)\,dQ(\theta)=\int_A P_{\omega,\mathbb{N}}(T)\,d\beta_{Q,\mathbb{N}}(\omega),\qquad A\in\mathcal{A}^{\mathbb{N}},\ T\in\mathcal{T}.$$
Denote by $P_{\omega^{(n)},n}:=\Pi_{\mathbb{N}}^{J\mid I^{(n)}=\omega^{(n)}}$ the posterior distribution given $\omega^{(n)}\in\Omega^{n}$.
Write $P^{P}_{\omega^{(n)},n}$ for the posterior predictive distribution given $\omega^{(n)}\in\Omega^{n}$, defined for $A\in\mathcal{A}$ as
$$P^{P}_{\omega^{(n)},n}(A)=\int_\Theta P_\theta(A)\,dP_{\omega^{(n)},n}(\theta).$$
Thus $P^{P}_{\omega^{(n)},n}(A)$ is nothing but the posterior mean, given $\omega^{(n)}\in\Omega^{n}$, of the probabilities $P_\theta(A)$.
In the dominated case, we can assume without loss of generality that the dominating measure $\mu$ is a probability measure (because of (1) below). We write $p_\theta=dP_\theta/d\mu$. The likelihood function $L(\omega,\theta):=p_\theta(\omega)$ is assumed to be $\mathcal{A}\times\mathcal{T}$-measurable.
We have that, for all $n$ and every event $A\in\mathcal{A}$,
$$P^{P}_{\omega^{(n)},n}(A)=\int_\Theta P_\theta(A)\,dP_{\omega^{(n)},n}(\theta)=\int_\Theta\int_A p_\theta(\omega)\,d\mu(\omega)\,dP_{\omega^{(n)},n}(\theta)=\int_A\int_\Theta p_\theta(\omega)\,dP_{\omega^{(n)},n}(\theta)\,d\mu(\omega),$$
which proves that
$$p^{P}_{\omega^{(n)},n}(\omega):=\int_\Theta p_\theta(\omega)\,dP_{\omega^{(n)},n}(\theta)$$
is a $\mu$-density of $P^{P}_{\omega^{(n)},n}$; we recognize it as the posterior predictive density on $\Omega$ given $\omega^{(n)}$.
In the same way,
$$p^{P}_{\omega,\mathbb{N}}(\omega'):=\int_\Theta p_\theta(\omega')\,dP_{\omega,\mathbb{N}}(\theta)$$
is a $\mu$-density of $P^{P}_{\omega,\mathbb{N}}$, the posterior predictive density on $\Omega$ given $\omega\in\Omega^{\mathbb{N}}$ (here and below, $\omega'$ denotes a generic point of $\Omega$, to distinguish it from $\omega\in\Omega^{\mathbb{N}}$).
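To make these formulas concrete, here is a minimal numerical sketch under an assumed conjugate normal model (an $N(\theta,\sigma^2)$ sampling distribution with known $\sigma$ and an $N(m,\tau^2)$ prior; all parameter values are arbitrary choices made only for illustration): the posterior $P_{\omega^{(n)},n}$ is then Gaussian, and the integral defining $p^{P}_{\omega^{(n)},n}$ can be evaluated both by Monte Carlo and in closed form.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical conjugate set-up: N(theta, sigma^2) model with known sigma, N(m, tau^2) prior.
sigma, m, tau = 1.0, 0.0, 2.0
theta_true = 1.5
n = 20
x = rng.normal(theta_true, sigma, size=n)

# Posterior of theta given the sample: N(mu_n, tau_n^2).
tau_n2 = 1.0 / (n / sigma**2 + 1.0 / tau**2)
mu_n = tau_n2 * (x.sum() / sigma**2 + m / tau**2)

# Posterior predictive density at a point w: the posterior mean of p_theta(w),
# approximated by Monte Carlo; the closed form here is the N(mu_n, sigma^2 + tau_n^2) density.
w = 1.0
thetas = rng.normal(mu_n, np.sqrt(tau_n2), size=200_000)
dens_mc = norm.pdf(w, loc=thetas, scale=sigma).mean()
dens_exact = norm.pdf(w, loc=mu_n, scale=np.sqrt(sigma**2 + tau_n2))

print(f"posterior predictive density at w = {w}: MC {dens_mc:.4f}, exact {dens_exact:.4f}")
```

In this conjugate case the posterior predictive density has the closed form of an $N(\mu_n,\sigma^2+\tau_n^2)$ density; in non-conjugate models only the Monte Carlo route is typically available.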
In what follows, we will assume the following additional regularity conditions:
(i) $(\Omega,\mathcal{A})$ is a standard Borel space;
(ii) $\Theta$ is a Borel subset of a Polish space and $\mathcal{T}$ is its Borel $\sigma$-field;
(iii) the family $\{P_\theta:\theta\in\Theta\}$ is identifiable.
According to [1], the posterior predictive distribution $P^{P}_{\omega^{(n)},n}$ (resp. the posterior predictive density $p^{P}_{\omega^{(n)},n}$) is the Bayes estimator of the sampling distribution $P_\theta$ (resp. the density $p_\theta$) for the squared total variation (resp. the squared $L_1$) loss function in the product experiment $(\Omega^{n},\mathcal{A}^{n},\{P_\theta^{n}:\theta\in(\Theta,\mathcal{T},Q)\})$. Analogously, the posterior predictive distribution $P^{P}_{\omega,\mathbb{N}}$ (resp. the posterior predictive density $p^{P}_{\omega,\mathbb{N}}$) is the Bayes estimator of the sampling distribution $P_\theta$ (resp. the density $p_\theta$) for the squared total variation (resp. the squared $L_1$) loss function in the product experiment $(\Omega^{\mathbb{N}},\mathcal{A}^{\mathbb{N}},\{P_\theta^{\mathbb{N}}:\theta\in(\Theta,\mathcal{T},Q)\})$.
As a particular case of a well-known result relating the total variation distance between two probability measures to the $L_1$-distance between their densities, we have that
$$\sup_{A\in\mathcal{A}}\big|P^{P}_{\omega^{(n)},n}(A)-P_\theta(A)\big|=\frac12\int_\Omega\big|p^{P}_{\omega^{(n)},n}-p_\theta\big|\,d\mu.\tag{1}$$
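Identity (1) can be checked numerically; the sketch below does so for two arbitrary normal densities standing in for $p^{P}_{\omega^{(n)},n}$ and $p_\theta$ (the particular densities and the grid are assumptions made only for this check), using the fact that the supremum over $A$ is attained at the set where one density exceeds the other.

```python
import numpy as np
from scipy.stats import norm

# Numerical check of identity (1) for two arbitrary normal densities on a fine grid.
# The supremum over A is attained at A* = {p > q}, so it equals the integral of (p - q)^+.
grid = np.linspace(-10.0, 10.0, 200_001)
dx = grid[1] - grid[0]
p = norm.pdf(grid, loc=0.0, scale=1.0)
q = norm.pdf(grid, loc=0.5, scale=1.2)

tv_via_sup = np.sum(np.maximum(p - q, 0.0)) * dx   # P(A*) - Q(A*)
half_l1 = 0.5 * np.sum(np.abs(p - q)) * dx         # (1/2) * integral |p - q| dmu

print(f"sup_A |P(A) - Q(A)| = {tv_via_sup:.6f}")
print(f"(1/2)||p - q||_1    = {half_l1:.6f}")
```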

3. The Main Result

We ask whether the Bayes risk of the Bayes estimator $P^{P}_{\omega^{(n)},n}$ of the sampling distribution $P_\theta$ goes to zero when $n\to\infty$, i.e., whether
$$\lim_n\int_{\Omega^{\mathbb{N}}\times\Theta}\Big(\sup_{A\in\mathcal{A}}\big|P^{P}_{\omega^{(n)},n}(A)-P_\theta(A)\big|\Big)^{2}\,d\Pi_{\mathbb{N}}(\omega,\theta)=0.$$
In terms of densities, the question is whether the Bayes risk of the Bayes estimator $p^{P}_{\omega^{(n)},n}$ of the density $p_\theta$ goes to zero when $n\to\infty$, i.e., whether
$$\lim_n\int_{\Omega^{\mathbb{N}}\times\Theta}\Big(\int_\Omega\big|p^{P}_{\omega^{(n)},n}(\omega')-p_\theta(\omega')\big|\,d\mu(\omega')\Big)^{2}\,d\Pi_{\mathbb{N}}(\omega,\theta)=0.$$
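Before turning to the proof, the question can be probed numerically. In the illustrative Bernoulli model with a Beta prior used earlier (an assumption made only for this sketch), both $p^{P}_{\omega^{(n)},n}$ and $p_\theta$ are Bernoulli densities with respect to counting measure, so their $L_1$ distance is $2\,|\text{posterior mean}-\theta|$, and the Bayes risk can be estimated by drawing $\theta$ from the prior and then the sample from $P_\theta^{n}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Monte Carlo estimate of the Bayes risk for the squared L1 loss in a Bernoulli model
# with a Beta(a, b) prior (hypothetical choices).  For two Bernoulli densities the L1
# distance is 2|p - q|, so the loss is (2|posterior mean - theta|)^2, averaged over the
# joint distribution of (observations, theta).
a, b = 2.0, 3.0
reps = 20_000

for n in (1, 10, 100, 1000):
    thetas = rng.beta(a, b, size=reps)            # theta ~ Q (prior)
    successes = rng.binomial(n, thetas)           # data given theta, summarized by the count
    post_mean = (a + successes) / (a + b + n)     # posterior predictive P(X = 1)
    risk = np.mean((2.0 * np.abs(post_mean - thetas)) ** 2)
    print(f"n = {n:5d}   estimated Bayes risk (squared L1 loss) = {risk:.5f}")
```

The estimated risk decreases towards 0 as $n$ grows, in line with the result proved below.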
Let us consider the auxiliary Bayesian experiment
$$(\Omega\times\Omega^{\mathbb{N}},\mathcal{A}\times\mathcal{A}^{\mathbb{N}},\{\mu\times P_\theta^{\mathbb{N}}:\theta\in(\Theta,\mathcal{T},Q)\}).$$
For $\omega'\in\Omega$, $\omega\in\Omega^{\mathbb{N}}$ and $\theta\in\Theta$, we will continue to write $I(\omega',\omega,\theta)=\omega$ and $J(\omega',\omega,\theta)=\theta$, and we now write $I'(\omega',\omega,\theta)=\omega'$.
The new prior predictive distribution (i.e., the distribution of $(I',I^{(n)})$ with respect to $\mu\times\Pi_{\mathbb{N}}$) is $\mu\times\beta_{Q,n}$, since
$$(\mu\times\Pi_{\mathbb{N}})^{(I',I^{(n)})}(A\times A^{(n)})=\mu(A)\cdot\beta_{Q,n}(A^{(n)})=(\mu\times\beta_{Q,n})(A\times A^{(n)}),\qquad A\in\mathcal{A},\ A^{(n)}\in\mathcal{A}^{n}.$$
To compute the new posterior distributions, notice that
$$(\mu\times\Pi_{\mathbb{N}})\big(A\times (I^{(n)})^{-1}(A^{(n)})\times T\big)=\int_{A\times A^{(n)}}(\mu\times\Pi_{\mathbb{N}})^{J\mid (I',I^{(n)})=(\omega',\omega^{(n)})}(T)\,d(\mu\times\Pi_{\mathbb{N}})^{(I',I^{(n)})}(\omega',\omega^{(n)}).$$
On the other hand,
$$(\mu\times\Pi_{\mathbb{N}})\big(A\times (I^{(n)})^{-1}(A^{(n)})\times T\big)=\mu(A)\cdot\Pi_{\mathbb{N}}\big((I^{(n)})^{-1}(A^{(n)})\times T\big)=\mu(A)\cdot\int_{A^{(n)}}P_{\omega^{(n)},n}(T)\,d\beta_{Q,n}(\omega^{(n)})=\int_{A\times A^{(n)}}P_{\omega^{(n)},n}(T)\,d(\mu\times\beta_{Q,n})(\omega',\omega^{(n)}).$$
So,
$$P_{\omega^{(n)},n}=(\mu\times\Pi_{\mathbb{N}})^{J\mid (I',I^{(n)})=(\omega',\omega^{(n)})}.$$
It follows that, if $f\in L^{1}(Q)$, then
$$E_{P_{\omega^{(n)},n}}(f)=E_{\mu\times\Pi_{\mathbb{N}}}\big(f\circ J\mid (I',I^{(n)})=(\omega',\omega^{(n)})\big).$$
Writing $\mathcal{A}_{(n)}:=(I',I^{(n)})^{-1}(\mathcal{A}\times\mathcal{A}^{n})$, we have that $(\mathcal{A}_{(n)})_n$ is an increasing sequence of sub-$\sigma$-fields of $\mathcal{A}\times\mathcal{A}^{\mathbb{N}}$ such that $\mathcal{A}\times\mathcal{A}^{\mathbb{N}}=\sigma\big(\bigcup_n\mathcal{A}_{(n)}\big)$. According to Lévy's martingale convergence theorem, if $Y$ is $(\mathcal{A}\times\mathcal{A}^{\mathbb{N}}\times\mathcal{T})$-measurable and $\mu\times\Pi_{\mathbb{N}}$-integrable, then
$$E_{\mu\times\Pi_{\mathbb{N}}}(Y\mid\mathcal{A}_{(n)})$$
converges $(\mu\times\Pi_{\mathbb{N}})$-a.e. and in $L^{1}(\mu\times\Pi_{\mathbb{N}})$ to $E_{\mu\times\Pi_{\mathbb{N}}}(Y\mid\mathcal{A}\times\mathcal{A}^{\mathbb{N}})$.
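The content of Lévy's theorem used here is simply that conditioning on more and more coordinates recovers, in the limit, the conditional expectation given all of them. The toy sketch below (with a hypothetical choice of $Y$, unrelated to the densities of this paper) makes the effect visible: the conditional expectations given the first $n$ coordinates stabilize at the value of $Y$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy illustration of Levy's upward martingale convergence theorem.
# Hypothetical choice: Y = sum_k 2^{-k} X_k with X_k i.i.d. Uniform(0, 1), so that
# E[Y | X_1, ..., X_n] = sum_{k <= n} 2^{-k} X_k + 2^{-n} * 1/2 has an explicit form.
K = 60                                    # enough terms for the tail to be negligible
x = rng.uniform(size=K)
weights = 0.5 ** np.arange(1, K + 1)
y = float(np.sum(weights * x))            # (numerically exact) value of Y

for n in (1, 2, 5, 10, 20):
    cond_exp = float(np.sum(weights[:n] * x[:n]) + 0.5 ** n * 0.5)
    print(f"n = {n:2d}   E[Y | first n coordinates] = {cond_exp:.8f}   Y = {y:.8f}")
```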
Let us consider the $\mu\times\Pi_{\mathbb{N}}$-integrable function
$$Y(\omega',\omega,\theta):=p_\theta(\omega').$$
We shall see that
$$p^{P}_{\omega,\mathbb{N}}(\omega')=E_{\mu\times\Pi_{\mathbb{N}}}\big(Y\mid (I',I)=(\omega',\omega)\big).\tag{2}$$
Indeed, given $A\in\mathcal{A}$ and $A'\in\mathcal{A}^{\mathbb{N}}$, we have that
$$\int_{(I',I)^{-1}(A\times A')}p_\theta(\omega')\,d(\mu\times\Pi_{\mathbb{N}})(\omega',\omega,\theta)=\int_{A'}\int_\Theta\int_A p_\theta(\omega')\,d\mu(\omega')\,dP_{\omega,\mathbb{N}}(\theta)\,d\beta_{Q,\mathbb{N}}(\omega)=\int_{A'}\int_\Theta P_\theta(A)\,dP_{\omega,\mathbb{N}}(\theta)\,d\beta_{Q,\mathbb{N}}(\omega)=\int_{A'}P^{P}_{\omega,\mathbb{N}}(A)\,d\beta_{Q,\mathbb{N}}(\omega)=\int_{A'}\int_A p^{P}_{\omega,\mathbb{N}}(\omega')\,d\mu(\omega')\,d\beta_{Q,\mathbb{N}}(\omega)=\int_{A\times A'}p^{P}_{\omega,\mathbb{N}}(\omega')\,d(\mu\times\Pi_{\mathbb{N}})^{(I',I)}(\omega',\omega),$$
which proves (2).
Analogously, it can be shown that
$$p^{P}_{\omega^{(n)},n}(\omega')=E_{\mu\times\Pi_{\mathbb{N}}}\big(Y\mid (I',I^{(n)})=(\omega',\omega^{(n)})\big).\tag{3}$$
Hence, it follows from the aforementioned theorem of Lévy that
$$\lim_n p^{P}_{\omega^{(n)},n}(\omega')=p^{P}_{\omega,\mathbb{N}}(\omega'),\qquad(\mu\times\Pi_{\mathbb{N}})\text{-a.e.},\tag{4}$$
and
$$\lim_n\int_{\Omega\times\Omega^{\mathbb{N}}\times\Theta}\big|p^{P}_{\omega^{(n)},n}(\omega')-p^{P}_{\omega,\mathbb{N}}(\omega')\big|\,d(\mu\times\Pi_{\mathbb{N}})(\omega',\omega,\theta)=0,$$
i.e.,
$$\lim_n\int_{\Omega^{\mathbb{N}}\times\Theta}\int_\Omega\big|p^{P}_{\omega^{(n)},n}(\omega')-p^{P}_{\omega,\mathbb{N}}(\omega')\big|\,d\mu(\omega')\,d\Pi_{\mathbb{N}}(\omega,\theta)=0.\tag{5}$$
On the other hand, as a consequence of a well-known theorem of Doob (see Theorem 6.9 and Proposition 6.10 of [4], pp. 129–130), we have that, for every $\omega'\in\Omega$,
$$\lim_n\int_\Theta p_t(\omega')\,dP_{\omega^{(n)},n}(t)=p_\theta(\omega'),\qquad P_\theta^{\mathbb{N}}\text{-a.e.},$$
for $Q$-almost every $\theta$. Hence
$$\lim_n p^{P}_{\omega^{(n)},n}(\omega')=p_\theta(\omega'),\qquad P_\theta^{\mathbb{N}}\text{-a.e.},$$
for $Q$-almost every $\theta$; i.e., given $\omega'\in\Omega$, there exists $T_{\omega'}\in\mathcal{T}$ with $Q(T_{\omega'})=0$ such that, for every $\theta\notin T_{\omega'}$,
$$\lim_n p^{P}_{\omega^{(n)},n}(\omega')=p_\theta(\omega'),\qquad P_\theta^{\mathbb{N}}\text{-a.e.}$$
So, for $\theta\notin T_{\omega'}$, there exists $N_{\theta,\omega'}\in\mathcal{A}^{\mathbb{N}}$ such that $P_\theta^{\mathbb{N}}(N_{\theta,\omega'})=0$ and
$$\lim_n p^{P}_{\omega^{(n)},n}(\omega')=p_\theta(\omega')\qquad\text{for all }\omega\notin N_{\theta,\omega'},\ \theta\notin T_{\omega'},\ \omega'\in\Omega.$$
In particular, for $Q$-almost every $\theta$,
$$\lim_n p^{P}_{\omega^{(n)},n}(\omega')=p_\theta(\omega'),\qquad\mu\times P_\theta^{\mathbb{N}}\text{-a.e.}\tag{6}$$
From (4) and (6), it follows that $p_\theta(\omega')=p^{P}_{\omega,\mathbb{N}}(\omega')$, $\mu\times P_\theta^{\mathbb{N}}$-a.e., for $Q$-almost every $\theta$.
From this and (5), it follows that
$$\lim_n\int_{\Omega^{\mathbb{N}}\times\Theta}\int_\Omega\big|p^{P}_{\omega^{(n)},n}(\omega')-p_\theta(\omega')\big|\,d\mu(\omega')\,d\Pi_{\mathbb{N}}(\omega,\theta)=0,$$
i.e., the Bayes risk of the Bayes estimator of the density for the $L_1$ loss function goes to 0 when $n\to\infty$.
It follows from this and (1) that
$$\lim_n\int_{\Omega^{\mathbb{N}}\times\Theta}\sup_{A\in\mathcal{A}}\big|P^{P}_{\omega^{(n)},n}(A)-P_\theta(A)\big|\,d\Pi_{\mathbb{N}}(\omega,\theta)=0,$$
i.e., the Bayes risk of the Bayes estimator of the sampling distribution $P_\theta$ for the total variation loss function goes to 0 when $n\to\infty$.
We ask whether these results remain true for the squared versions of the loss functions. The answer is affirmative because of the following general result: let $(X_n)$ be a sequence of real random variables on a probability space $(\Omega,\mathcal{A},P)$ such that $\lim_n\int|X_n|\,dP=0$. If there exists $a>0$ such that $|X_n|\le a$ for all $n$, then $\lim_n\int|X_n|^{2}\,dP=0$, because
$$0\le\int|X_n|^{2}\,dP\le a\int|X_n|\,dP\xrightarrow[n]{}0.$$
In our case $a=2$, $P:=\Pi_{\mathbb{N}}$ and
$$X_n:=\int_\Omega\big|p^{P}_{\omega^{(n)},n}(\omega')-p_\theta(\omega')\big|\,d\mu(\omega')\qquad\text{or}\qquad X_n:=\sup_{A\in\mathcal{A}}\big|P^{P}_{\omega^{(n)},n}(A)-P_\theta(A)\big|.$$
So, we have proved the following result.
Theorem 1.
Let $(\Omega,\mathcal{A},\{P_\theta:\theta\in(\Theta,\mathcal{T},Q)\})$ be a Bayesian experiment dominated by a $\sigma$-finite measure $\mu$. Assume that $(\Omega,\mathcal{A})$ is a standard Borel space, that $\Theta$ is a Borel subset of a Polish space, and that $\mathcal{T}$ is its Borel $\sigma$-field. Assume also that the likelihood function $L(\omega,\theta):=p_\theta(\omega)=\frac{dP_\theta}{d\mu}(\omega)$ is $\mathcal{A}\times\mathcal{T}$-measurable and that the family $\{P_\theta:\theta\in\Theta\}$ is identifiable. Then:
(a) 
The posterior predictive density $p^{P}_{\omega^{(n)},n}$ is the Bayes estimator of the density $p_\theta$ in the product experiment $(\Omega^{n},\mathcal{A}^{n},\{P_\theta^{n}:\theta\in(\Theta,\mathcal{T},Q)\})$ for the squared $L_1$ loss function. Moreover, the Bayes risk converges to 0 for both the $L_1$ loss function and the squared $L_1$ loss function.
(b) 
The posterior predictive distribution $P^{P}_{\omega^{(n)},n}$ is the Bayes estimator of the sampling distribution $P_\theta$ in the product experiment $(\Omega^{n},\mathcal{A}^{n},\{P_\theta^{n}:\theta\in(\Theta,\mathcal{T},Q)\})$ for the squared total variation loss function. Moreover, the Bayes risk converges to 0 for both the total variation loss function and the squared total variation loss function.
(c) 
The posterior predictive density is a strongly consistent estimator of the density $p_\theta$, i.e.,
$$\lim_n p^{P}_{\omega^{(n)},n}(\omega')=p_\theta(\omega'),\qquad\mu\times P_\theta^{\mathbb{N}}\text{-a.e.},$$
for $Q$-almost every $\theta\in\Theta$.
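A numerical illustration of part (c), under an assumed $N(\theta,1)$ sampling model with an $N(0,2^2)$ prior (all values chosen only for illustration): along a single simulated sample path, the posterior predictive density at a fixed point approaches the true density at that point as $n$ grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Illustration of part (c) of the theorem in a hypothetical N(theta, 1) model with an
# N(0, 2^2) prior: the posterior predictive density at a fixed point w approaches the
# true density p_theta(w) along a single sample path as n grows.
sigma, m, tau = 1.0, 0.0, 2.0
theta_true = 1.5
w = 0.8
true_density = norm.pdf(w, loc=theta_true, scale=sigma)

data = rng.normal(theta_true, sigma, size=10_000)
for n in (10, 100, 1000, 10_000):
    x = data[:n]
    tau_n2 = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    mu_n = tau_n2 * (x.sum() / sigma**2 + m / tau**2)
    # Closed-form posterior predictive density: N(mu_n, sigma^2 + tau_n2) evaluated at w.
    pred = norm.pdf(w, loc=mu_n, scale=np.sqrt(sigma**2 + tau_n2))
    print(f"n = {n:6d}   predictive density at w: {pred:.5f}   p_theta(w): {true_density:.5f}")
```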

Funding

This research was funded by the Junta de Extremadura (Spain), grant number GR21044.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Nogales, A.G. On Bayesian estimation of densities and sampling distributions: The posterior predictive distribution as the Bayes estimator. Stat. Neerl. 2021, accepted.
  2. Geisser, S. Predictive Inference: An Introduction; Chapman & Hall: New York, NY, USA, 1993.
  3. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2014.
  4. Ghosal, S.; van der Vaart, A. Fundamentals of Nonparametric Bayesian Inference; Cambridge University Press: Cambridge, UK, 2017.
  5. Rubin, D.B. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Stat. 1984, 12, 1151–1172.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
