Article

Asymptotic Behavior of the Bayes Estimator of a Regression Curve

by
Agustín G. Nogales
Departamento de Matemáticas, Instituto de Matemáticas, Universidad de Extremadura, 06006 Badajoz, Spain
Mathematics 2025, 13(14), 2319; https://doi.org/10.3390/math13142319
Submission received: 30 May 2025 / Revised: 17 July 2025 / Accepted: 18 July 2025 / Published: 21 July 2025
(This article belongs to the Section D1: Probability and Statistics)

Abstract

In this work, we prove the convergence to 0, in both the $L^1$ and the $L^2$ sense, of the Bayes estimator of a regression curve (i.e., of the conditional expectation of the response variable given the regressor). The strong consistency of the estimator is also derived. The Bayes estimator of a regression curve is the regression curve with respect to the posterior predictive distribution. The result is general enough to cover discrete and continuous cases, parametric or nonparametric, and no specific supposition is made about the prior distribution. Some examples, two of them of a nonparametric nature, are given to illustrate the main result; one of the nonparametric examples exhibits a situation where the estimation of the regression curve has an optimal solution even though the problem of estimating the density is meaningless. An important role in the proof of these results is played by the construction of a probability space that provides an adequate framework for addressing the problem of estimating regression curves from the Bayesian point of view, putting powerful probabilistic tools at our disposal in that endeavor.

1. Introduction

Given a random variable $X_1$ (the independent variable, regressor, or predictor) and a real random variable $X_2$ (the dependent variable or response), the so-called regression curve of $X_2$ given $X_1$ is the map $r(x_1) := E(X_2 \mid X_1 = x_1)$, the function of $X_1$ that best approximates $X_2$ in the least squares sense; it is therefore an essential tool in the study of the relationship between these two variables. Many statistical problems in practice, especially those related to prediction, require the estimation of the regression function from data, i.e., from a sample $(x_{1i}, x_{2i})$, $i = 1, \dots, n$, of the joint distribution of $X_1$ and $X_2$. This estimation problem has been addressed in a good number of papers, in both parametric and nonparametric contexts and from both the frequentist and the Bayesian points of view. In fact, regression techniques are among the most widely used methods in applied statistics.
In a nonparametric frequentist framework, the problem of estimation of the regression curve was first considered in [1,2]. We refer to [3] for this problem in a Bayesian context; it includes some historical notes about Bayesian nonparametric regression and some results about the consistency of the estimates for some specific priors.
Talking about the probability of an event A (written $P_\theta(A)$) in a statistical context is ambiguous, as it depends on the unknown parameter. In a Bayesian context, once the data $\omega$ have been observed, a natural estimate of $P_\theta(A)$ is the posterior predictive probability of A given $\omega$, since it is the posterior mean of the probabilities of A given $\omega$, which, as is well known, is the Bayes estimator of $P_\theta(A)$ for the squared error loss function. This simple fact already justifies the use of the posterior predictive distribution as an estimator of the sampling distribution, but, in reality, much more is true: as shown in [4], the posterior predictive distribution is the Bayes estimator of the sampling probability distribution $P_\theta$ for the squared total variation loss function. The situation is similar to that of the strong law of large numbers and the Glivenko–Cantelli theorem: the first guarantees almost sure pointwise convergence of the empirical distribution function to the unknown population distribution function, while the second yields almost sure uniform convergence, becoming the fundamental theorem of Mathematical Statistics. The problem of estimation of the density in a Bayesian nonparametric framework is considered in a number of references, such as [5], [6], [7], or [8]. In [4], the problem of estimation of the density from a Bayesian point of view is also addressed and, under mild conditions, it is shown that the posterior predictive density is the Bayes estimator for the $L^1$-squared loss function; the convergence to 0 of the Bayes risk (and the strong consistency) of this estimator is shown in [9].
As regards the estimation of the regression curve, or even of the conditional density, references [10,11] contain sufficient arguments on the usefulness of these problems in practice from a frequentist point of view; the problems go back to [12], although they have not produced much literature since then. The paper [13] deals, among other things, with the problem of the Bayesian estimation of a regression curve and proves that the regression curve with respect to the posterior predictive distribution is the Bayes estimator (for the squared error loss function). Here, we ask about the convergence to 0 of its Bayes risk (and its strong consistency). This is the main goal of the paper, and Theorem 1 below answers the question in the affirmative.
So, the posterior predictive distribution is the key to the estimation problems raised above. It has been presented in the literature as the basis of Predictive Inference, which seeks to make inferences about a new unknown observation from the previous random sample instead of estimating an unknown parameter. It should be noted that, in practice, the explicit evaluation of the posterior predictive distribution can be cumbersome, and its simulation may become preferable. The interested reader can find in the papers mentioned above, and the references therein, more information on the problems of estimating the density or the regression curve, from both the frequentist and Bayesian perspectives, as well as on the usefulness of the posterior predictive distribution in Bayesian Inference and its calculation. We place special emphasis on the monographs [3,14,15].
Section 2 contains an important and useful contribution of the paper: it establishes a probability space as the theoretical framework (i.e., Bayesian experiment) appropriate for addressing the problem (in the same way that [16] considers a Bayesian experiment as a probability space). In fact, starting from the Bayesian experiment (1) corresponding to a sample of size m (possibly infinite) from the joint distribution of the two variables of interest, $X_1$ (predictor) and $X_2$ (response), the probability space (3) is presented as the appropriate model for the estimation of the regression curve of $X_2$ given $X_1 = x_1$ from an m-sized sample of the joint distribution of the two variables. This allows us to obtain an explicit expression for the Bayes risk of an estimator of the regression curve and to take advantage of powerful probabilistic tools when solving the problem of its asymptotic behavior.
Section 3 includes the aforementioned Theorem 1, whose proof relies on Jensen’s inequality, Lévy’s martingale convergence theorem, and a result by Doob on the consistency of the posterior distribution. The result is general enough to cover discrete and continuous cases, parametric or nonparametric, as the examples provided show, and, unlike what we have been able to find in the literature, no specific supposition is made about the prior distribution.
Section 4 contains the proof of the main result and some auxiliary results. In particular, Lemma 1, the key to the proof of the theorem, yields a representation of the Bayes estimator of the regression curve as its conditional mean in the Bayesian experiment (3).
Section 5 includes some examples to illustrate the main result of the paper, two of them of a nonparametric nature. The last of these two nonparametric examples exhibits a situation where the problem of estimating the regression curve has an optimal solution even though the problem of estimating the density is meaningless; in fact, the estimation of the regression function is carried out through the conditional distribution itself.
For ease of reading, we encourage the reader who is not familiar with the terminology or the notation used in the paper to start by reading the Appendix of [17].

2. The Framework

We recall from [13] the appropriate framework to address the problem and update it to incorporate the required asymptotic flavor.
Let $(\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\})$ be a Bayesian statistical experiment and
$$X_i : (\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\}) \longrightarrow (\Omega_i, \mathcal{A}_i), \quad i = 1, 2,$$
two statistics. Consider the Bayesian experiment image of $(X_1, X_2)$:
$$\big(\Omega_1 \times \Omega_2,\ \mathcal{A}_1 \times \mathcal{A}_2,\ \{P_\theta^{(X_1,X_2)} : \theta \in (\Theta, \mathcal{T}, Q)\}\big).$$
In the following, we will assume that $P^{(X_1,X_2)}(\theta, A_{12}) := P_\theta^{(X_1,X_2)}(A_{12})$, $\theta \in \Theta$, $A_{12} \in \mathcal{A}_1 \times \mathcal{A}_2$ (the joint distribution of $X_1$ and $X_2$), is a Markov kernel. Let us write $R_\theta = P_\theta^{(X_1,X_2)}$ and $p_j(x) := x_j$ for $j = 1, 2$ and $x := (x_1, x_2) \in \Omega_1 \times \Omega_2$. Hence
$$P_\theta^{X_1} = R_\theta^{p_1} \quad \text{and} \quad P_\theta^{X_2 \mid X_1 = x_1} = R_\theta^{p_2 \mid p_1 = x_1},$$
and, when $X_2$ is a real random variable, $E_{P_\theta}(X_2 \mid X_1 = x_1) = E_{R_\theta}(p_2 \mid p_1 = x_1)$. In order to lighten and shorten the notation, we write $\Omega_{12} = \Omega_1 \times \Omega_2$ and $\mathcal{A}_{12} = \mathcal{A}_1 \times \mathcal{A}_2$.
Given an integer n, for $m = n$ (respectively, $m = \mathbb{N}$), the Bayesian experiment corresponding to an n-sized sample (respectively, an infinite sample) of the joint distribution of $(X_1, X_2)$ is
$$\big(\Omega_{12}^m,\ \mathcal{A}_{12}^m,\ \{R_\theta^m : \theta \in (\Theta, \mathcal{T}, Q)\}\big). \qquad (1)$$
We define the Markov kernel $R^m$ by $R^m(\theta, A_{12,m}) := R_\theta^m(A_{12,m})$, for $A_{12,m} \in \mathcal{A}_{12}^m$, and write
$$\Pi_{12,m} := Q \otimes R^m$$
for the joint distribution of the parameter and the sample, i.e.,
$$\Pi_{12,m}(A_{12,m} \times T) = \int_T R_\theta^m(A_{12,m})\, dQ(\theta), \quad A_{12,m} \in \mathcal{A}_{12}^m,\ T \in \mathcal{T}.$$
The corresponding prior predictive distribution $\beta_m^*$ on $\Omega_{12}^m$ is
$$\beta_m^*(A_{12,m}) = \int_\Theta R_\theta^m(A_{12,m})\, dQ(\theta), \quad A_{12,m} \in \mathcal{A}_{12}^m.$$
The posterior distribution is a Markov kernel
$$R_m^* : (\Omega_{12}^m, \mathcal{A}_{12}^m) \longrightarrow (\Theta, \mathcal{T})$$
such that, for all $A_{12,m} \in \mathcal{A}_{12}^m$ and $T \in \mathcal{T}$,
$$\Pi_{12,m}(A_{12,m} \times T) = \int_T R_\theta^m(A_{12,m})\, dQ(\theta) = \int_{A_{12,m}} R_m^*(\mathbf{x}, T)\, d\beta_m^*(\mathbf{x}).$$
Let us write $R_{m,\mathbf{x}}^*(T) := R_m^*(\mathbf{x}, T)$.
The posterior predictive distribution on $\mathcal{A}_{12}$ is the Markov kernel
$$R_m^*R : (\Omega_{12}^m, \mathcal{A}_{12}^m) \longrightarrow (\Omega_{12}, \mathcal{A}_{12})$$
defined, for $\mathbf{x} \in \Omega_{12}^m$, by
$$R_m^*R(\mathbf{x}, A_{12}) := \int_\Theta R_\theta(A_{12})\, dR_{m,\mathbf{x}}^*(\theta).$$
This way, given $\mathbf{x}$, the posterior predictive probability of an event $A_{12}$ is nothing but the posterior mean of $R_\theta(A_{12})$. It follows that, with obvious notations,
$$\int_{\Omega_{12}} f(x)\, d(R_{m,\mathbf{x}}^*R)(x) = \int_\Theta \int_{\Omega_{12}} f(x)\, dR_\theta(x)\, dR_{m,\mathbf{x}}^*(\theta)$$
for any non-negative or integrable real random variable f.
We can also consider the posterior predictive distribution on $\mathcal{A}_{12}^m$, defined as the Markov kernel
$$R_m^*R^m : (\Omega_{12}^m, \mathcal{A}_{12}^m) \longrightarrow (\Omega_{12}^m, \mathcal{A}_{12}^m)$$
such that
$$R_m^*R^m(\mathbf{x}, A_{12,m}) := \int_\Theta R_\theta^m(A_{12,m})\, dR_{m,\mathbf{x}}^*(\theta).$$
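As a simple numerical aside (not taken from the paper), the defining property of the posterior predictive distribution, namely that $R_m^*R(\mathbf{x}, A_{12})$ is the posterior mean of $R_\theta(A_{12})$, can be checked in a conjugate toy model. The following sketch assumes a Beta prior and Bernoulli sampling, both chosen only for illustration.

```python
# Minimal sketch (illustrative assumption: Beta prior, Bernoulli sampling) of
# the posterior predictive probability of an event as the posterior mean of
# its sampling probability.
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 3.0                            # Beta prior hyperparameters
theta_true = 0.7
x = rng.binomial(1, theta_true, size=50)   # observed sample of size n
s, n = int(x.sum()), x.size

# Posterior R*_{n,x} = Beta(a + s, b + n - s); the posterior predictive
# probability of A = {1} is the posterior mean of R_theta({1}) = theta.
closed_form = (a + s) / (a + b + n)

# Monte Carlo version of the defining integral over the posterior.
theta_draws = rng.beta(a + s, b + n - s, size=100_000)
monte_carlo = theta_draws.mean()

print(closed_form, monte_carlo)            # agree up to Monte Carlo error
```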
Looking for the appropriate framework to address the problem of estimating the regression curve of the real random variable $X_2$ given $X_1 = x_1$, we start from the Bayesian experiment (1) corresponding to a sample $\mathbf{x} \in \Omega_{12}^m$ from the joint distribution $R_\theta$ of the predictor $X_1$ and the response $X_2$, and we choose $x = (x_1, x_2) \in \Omega_{12}$ (in fact, we only need the first coordinate $x_1$ of x as the argument of the regression curve) from $R_\theta$ independently of $\mathbf{x}$, which brings us to the product Bayesian experiment
$$\big(\Omega_{12}^m \times \Omega_{12},\ \mathcal{A}_{12}^m \times \mathcal{A}_{12},\ \{R_\theta^m \times R_\theta : \theta \in (\Theta, \mathcal{T}, Q)\}\big). \qquad (2)$$
The Bayesian experiment (2) can be identified in a standard way with the probability space
$$\big(\Omega_{12}^m \times \Omega_{12} \times \Theta,\ \mathcal{A}_{12}^m \times \mathcal{A}_{12} \times \mathcal{T},\ \Pi_m\big), \qquad (3)$$
where $\Pi_m := (R^m \times R) \otimes Q$, i.e.,
$$\Pi_m(A_{12,m} \times A_{12} \times T) = \int_T R_\theta(A_{12})\, R_\theta^m(A_{12,m})\, dQ(\theta),$$
when $A_{12,m} \in \mathcal{A}_{12}^m$, $A_{12} \in \mathcal{A}_{12}$, and $T \in \mathcal{T}$.
So, for a real random variable f on $(\Omega_{12}^m \times \Omega_{12} \times \Theta, \mathcal{A}_{12}^m \times \mathcal{A}_{12} \times \mathcal{T})$,
$$\int f\, d\Pi_m = \int_\Theta \int_{\Omega_{12}^m} \int_{\Omega_{12}} f(\mathbf{x}, x, \theta)\, dR_\theta(x)\, dR_\theta^m(\mathbf{x})\, dQ(\theta),$$
provided that the integral exists. Moreover, for a real random variable h on $(\Omega_{12} \times \Theta, \mathcal{A}_{12} \times \mathcal{T})$, by definition of the posterior distributions,
$$\int h\, d\Pi_m = \int_\Theta \int_{\Omega_{12}} h(x, \theta)\, dR_\theta(x)\, dQ(\theta) = \int_{\Omega_{12}} \int_\Theta h(x, \theta)\, dR_{1,x}^*(\theta)\, d\beta_1^*(x).$$

3. The Bayes Estimator of the Regression Curve: Asymptotic Behavior

Now suppose that $(\Omega_2, \mathcal{A}_2) = (\mathbb{R}, \mathcal{R})$. Let $X_2$ be a square-integrable real random variable such that $E_\theta(X_2^2)$ has a finite prior mean; in particular, $E_\theta(X_2)$ also has a finite prior mean.
The regression curve of $X_2$ given $X_1$ is the map $x_1 \in \Omega_1 \mapsto r_\theta(x_1) := E_\theta(X_2 \mid X_1 = x_1)$. An estimator of the regression curve $r_\theta$ from a sample of size n of the joint distribution of $(X_1, X_2)$ is a statistic
$$m : (\mathbf{x}, x_1) \in (\Omega_1 \times \mathbb{R})^n \times \Omega_1 \longmapsto m(\mathbf{x}, x_1) \in \mathbb{R},$$
so that, once the sample $\mathbf{x} \in (\Omega_1 \times \mathbb{R})^n$ has been observed, $m(\mathbf{x}, \cdot)$ is the estimate of $r_\theta$.
From a classical point of view, the simplest way to evaluate the error in estimating an unknown regression curve is to use the expectation of the quadratic deviation (see [18], p. 120):
$$E_\theta\!\left[\int_{\Omega_1} \big(m(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\, dP_\theta^{X_1}(x_1)\right] = \int_{(\Omega_1 \times \mathbb{R})^n} \int_{\Omega_1} \big(m(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\, dR_\theta^{p_1}(x_1)\, dR_\theta^n(\mathbf{x}).$$
From a Bayesian point of view, the Bayes estimator (the optimal estimator) of the regression curve $r_\theta$ should minimize the Bayes risk, i.e., the prior mean of the expectation of the quadratic deviation:
$$\int_\Theta \int_{(\Omega_1 \times \mathbb{R})^n} \int_{\Omega_1} \big(m(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\, dR_\theta^{p_1}(x_1)\, dR_\theta^n(\mathbf{x})\, dQ(\theta) = E_{\Pi_n}\big[\big(m(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\big].$$
So, the Bayesian experiment (3) is the appropriate framework to address these questions (see also Remark 1).
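As a computational aside (not part of the paper's argument), the Bayes risk above can be approximated by Monte Carlo directly from its definition as an integral with respect to $\Pi_n$: draw $\theta$ from the prior, an n-sized sample from $R_\theta^n$, an additional $x_1$ from the marginal of $X_1$, and average the squared deviation. The samplers and the estimator passed as arguments in the sketch below are placeholders to be supplied for a concrete model.

```python
# Hypothetical Monte Carlo sketch of the Bayes risk
# E_{Pi_n}[(m(x, x1) - r_theta(x1))^2]; all function arguments are
# placeholders (assumptions), not objects defined in the paper.
import numpy as np

def bayes_risk(m, draw_prior, draw_sample, draw_x1, r_theta, n,
               reps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    errs = np.empty(reps)
    for r in range(reps):
        theta = draw_prior(rng)               # theta ~ Q
        sample = draw_sample(theta, n, rng)   # x ~ R_theta^n
        x1 = draw_x1(theta, rng)              # x1 ~ R_theta^{p_1}
        errs[r] = (m(sample, x1) - r_theta(theta, x1)) ** 2
    return float(errs.mean())
```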
Recall from [13] that the regression curve of $p_2$ on $p_1$ with respect to the posterior predictive distribution $R_{n,\mathbf{x}}^*R$, given the data $\mathbf{x}$, is the Bayes estimator of the regression curve $r_\theta(x_1) := E_\theta(X_2 \mid X_1 = x_1)$ for the squared error loss function; for the sake of completeness, this proposition is also included as part (i) of Theorem 1 below.
We wonder about the convergence to 0 of the Bayes risk
$$E_{\Pi_n}\big[\big(m_n^*(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\big].$$
Another question of interest is the consistency of this Bayes estimator.
Lemma 1 below is key to solving the problem, since it shows that the Bayes estimator of the regression curve becomes its conditional mean in the Bayesian experiment (3). What the following theorem really provides is the asymptotic behavior of this estimator: the convergence to zero of its Bayes risks and the strong consistency of the Bayes estimator of the regression curve.
Theorem 1.
Let $(\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\})$ be a Bayesian statistical experiment and let $X_1 : (\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\}) \to (\Omega_1, \mathcal{A}_1)$ and $X_2 : (\Omega, \mathcal{A}, \{P_\theta : \theta \in (\Theta, \mathcal{T}, Q)\}) \to (\mathbb{R}, \mathcal{R})$ be two statistics such that $E_\theta(X_2^2)$ has a finite prior mean. Let us suppose that: (a) $(\Omega_1, \mathcal{A}_1)$ is a standard Borel space; (b) $\Theta$ is a Borel subset of a Polish space and $\mathcal{T}$ is its Borel σ-field; and (c) $\{R_\theta : \theta \in \Theta\}$ is identifiable.
Then,
(i) The regression curve of $p_2$ on $p_1$ with respect to the posterior predictive distribution $R_{n,\mathbf{x}}^*R$,
$$m_n^*(\mathbf{x}, x_1) := E_{R_{n,\mathbf{x}}^*R}(p_2 \mid p_1 = x_1),$$
is the Bayes estimator of the regression curve $r_\theta(x_1) := E_\theta(X_2 \mid X_1 = x_1)$ for the squared error loss function, i.e.,
$$E_{\Pi_n}\big[\big(m_n^*(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\big] \le E_{\Pi_n}\big[\big(m_n(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\big]$$
for any other estimator $m_n$ of the regression curve $r_\theta$.
(ii) Moreover, $m_n^*$ is a strongly consistent estimator of the regression curve, in the sense that
$$\lim_n E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) = E_\theta(X_2 \mid X_1 = x_1), \quad \Pi_{\mathbb{N}}\text{-a.e.},$$
where $\mathbf{x}(n) := (\mathbf{x}_1, \dots, \mathbf{x}_n)$ if $\mathbf{x} \in \Omega_{12}^{\mathbb{N}}$.
(iii) Finally, the Bayes risk of $m_n^*$ converges to 0 for both the $L^1$ and the $L^2$ loss functions, i.e.,
$$\lim_n E_{\Pi_{\mathbb{N}}}\big[\,|m_n^*(\mathbf{x}, x_1) - r_\theta(x_1)|^k\,\big] = 0, \quad k = 1, 2.$$

4. Proofs and Auxiliary Results

Let us introduce some notations for different projections in the Bayesian model (3). Given $(\mathbf{x}, x, \theta) \in \Omega_{12}^m \times \Omega_{12} \times \Theta$, we write
$$\pi_m(\mathbf{x}, x, \theta) := \mathbf{x}, \quad \pi^m(\mathbf{x}, x, \theta) := x, \quad \pi_{j,m}(\mathbf{x}, x, \theta) := x_j,\ j = 1, 2, \quad q_m(\mathbf{x}, x, \theta) := \theta,$$
$$\pi^{i,m}(\mathbf{x}, x, \theta) := \mathbf{x}_i := (x_{i1}, x_{i2}), \quad \pi_{(i),m}(\mathbf{x}, x, \theta) := (\mathbf{x}_1, \dots, \mathbf{x}_i),$$
for $1 \le i \le m$ (read $i \in \mathbb{N}$ if $m = \mathbb{N}$).
The following result is taken from [13].
Proposition 1.
For $n \in \mathbb{N}$,
$$\Pi_{\mathbb{N}}^{(\pi_{(n),\mathbb{N}},\, \pi^{\mathbb{N}},\, q_{\mathbb{N}})} = \Pi_n, \qquad \Pi_{\mathbb{N}}^{(\pi_{(n),\mathbb{N}},\, \pi_{1,\mathbb{N}})} = \Pi_n^{(\pi_{(n),n},\, \pi_{1,n})},$$
$$\Pi_m^{q_m} = Q, \qquad \Pi_m^{(\pi_m,\, q_m)} = \Pi_{12,m}, \qquad \Pi_m^{\pi_m} = \beta_m^*, \qquad \Pi_m^{(\pi^m,\, q_m)} = \Pi_{12,1}, \qquad \Pi_m^{\pi^m} = \beta_1^*,$$
$$\Pi_m^{\pi_m \mid q_m = \theta} = R_\theta^m, \qquad \Pi_m^{\pi^m \mid q_m = \theta} = R_\theta, \qquad \Pi_m^{q_m \mid \pi_m = \mathbf{x}} = R_{m,\mathbf{x}}^*, \qquad \Pi_m^{q_m \mid \pi^m = x} = R_{1,x}^*.$$
Remark 1.
It follows from this proposition that the probability space (3) contains all the basic ingredients of the Bayesian experiment (1), i.e., the prior distribution, the sampling probabilities, the posterior distributions, and the prior predictive distribution. When $m = \mathbb{N}$, (3) becomes the natural framework to address the asymptotic problem considered in this paper, since it integrates the sample $\mathbf{x}$, the parameter θ, and the argument $x_1$ of the regression function to be estimated into a single joint probability distribution.
Lemma 1.
Let $Y(\mathbf{x}, x, \theta) := E_\theta(p_2 \mid p_1 = x_1)$.
(i) For $n \in \mathbb{N}$ and an infinite sample $\mathbf{x}$ of the joint distribution $R_\theta$ of $X_1$ and $X_2$,
$$E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) = E_{\Pi_{\mathbb{N}}}\big(Y \mid (\pi_{(n),\mathbb{N}}, \pi_{1,\mathbb{N}}) = (\mathbf{x}(n), x_1)\big),$$
where $\mathbf{x}(n) := (\mathbf{x}_1, \dots, \mathbf{x}_n)$.
(ii) For an infinite sample $\mathbf{x}$ of the distribution $R_\theta$,
$$E_{R_{\mathbb{N},\mathbf{x}}^*R}(p_2 \mid p_1 = x_1) = E_{\Pi_{\mathbb{N}}}\big(Y \mid (\pi_{\mathbb{N}}, \pi_{1,\mathbb{N}}) = (\mathbf{x}, x_1)\big).$$
Proof of Lemma 1.
(i) According to Lemma 1 of [13], for all $A_{12,n} \in (\mathcal{A}_1 \times \mathcal{A}_2)^n$ and all $A_i \in \mathcal{A}_i$, $i = 1, 2$,
$$\int_{A_{12,n} \times A_1 \times \Omega_2 \times \Theta} R_\theta^{p_2 \mid p_1 = x_1}(A_2)\, d\Pi_n(\mathbf{x}, x, \theta) = \int_{A_{12,n} \times A_1} \big(R_{n,\mathbf{x}}^*R\big)^{p_2 \mid p_1 = x_1}(A_2)\, d\Pi_n^{(\pi_n,\, \pi_{1,n})}(\mathbf{x}, x_1).$$
The proof of (i) follows in a standard way from this and Proposition 1, as
$$E_\theta(p_2 \mid p_1 = x_1) = \int_{\mathbb{R}} x_2\, dR_\theta^{p_2 \mid p_1 = x_1}(x_2) \quad \text{and} \quad E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) = \int_{\mathbb{R}} x_2\, d\big(R_{n,\mathbf{x}(n)}^*R\big)^{p_2 \mid p_1 = x_1}(x_2).$$
(ii) The proof of (ii) is analogous. □
Proof of Theorem 1.
Writing $\mathcal{A}^{(n)} := (\pi_{(n),\mathbb{N}}, \pi_{1,\mathbb{N}})^{-1}(\mathcal{A}_{12}^n \times \mathcal{A}_1)$, we have that $(\mathcal{A}^{(n)})_n$ is an increasing sequence of sub-σ-fields of $\mathcal{A}_{12}^{\mathbb{N}} \times \mathcal{A}_1$ such that $\mathcal{A}_{12}^{\mathbb{N}} \times \mathcal{A}_1 = \sigma\big(\bigcup_n \mathcal{A}^{(n)}\big)$. According to Lévy's martingale convergence theorem, if Y is $\mathcal{A}_{12}^{\mathbb{N}} \times \mathcal{A}_1 \times \mathcal{T}$-measurable and $\Pi_{\mathbb{N}}$-integrable, then
$$E_{\Pi_{\mathbb{N}}}\big(Y \mid \mathcal{A}^{(n)}\big)$$
converges $\Pi_{\mathbb{N}}$-a.e. and in $L^1(\Pi_{\mathbb{N}})$ to $E_{\Pi_{\mathbb{N}}}\big(Y \mid \mathcal{A}_{12}^{\mathbb{N}} \times \mathcal{A}_1\big)$.
Let us consider the measurable function
$$Y(\mathbf{x}, x, \theta) := E_\theta(X_2 \mid X_1 = x_1).$$
Notice that $E_{\Pi_{\mathbb{N}}}(Y) = \int_\Theta E_\theta(X_2)\, dQ(\theta)$, so Y is $\Pi_{\mathbb{N}}$-integrable. Hence, it follows from the aforementioned theorem of Lévy that
$$\lim_n E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) = E_{R_{\mathbb{N},\mathbf{x}}^*R}(p_2 \mid p_1 = x_1), \quad \Pi_{\mathbb{N}}\text{-a.e.} \qquad (5)$$
and
$$\lim_n \int_{\Omega_{12}^{\mathbb{N}} \times \Omega_{12} \times \Theta} \Big| E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) - E_{R_{\mathbb{N},\mathbf{x}}^*R}(p_2 \mid p_1 = x_1) \Big|\, d\Pi_{\mathbb{N}}(\mathbf{x}, x, \theta) = 0. \qquad (6)$$
As a consequence of the well-known theorem of Doob (see Theorem 6.9 and Proposition 6.10 of [3], pp. 129–130), for every $x_1 \in \Omega_1$,
$$\lim_n \int_\Theta E_\theta(X_2 \mid X_1 = x_1)\, d\Pi_{\mathbb{N}}^{q_{\mathbb{N}} \mid (\pi_{(n),\mathbb{N}},\, \pi_{1,\mathbb{N}}) = (\mathbf{x}(n), x_1)}(\theta) = E_\theta(X_2 \mid X_1 = x_1), \quad R_\theta^{\mathbb{N}}\text{-a.e.},$$
for Q-almost every θ. Hence, according to Lemma 1 (i),
$$\lim_n E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) = E_\theta(X_2 \mid X_1 = x_1), \quad R_\theta^{\mathbb{N}}\text{-a.e.},$$
for Q-almost every θ. In particular,
$$\lim_n E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) = E_\theta(X_2 \mid X_1 = x_1), \quad \Pi_{\mathbb{N}}\text{-a.e.}$$
In this sense, we can say that the posterior predictive regression curve $E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1)$ of $X_2$, given $X_1 = x_1$, is a strongly consistent estimator of the sampling regression curve $E_\theta(X_2 \mid X_1 = x_1)$ of $X_2$, given $X_1 = x_1$.
From this and (5), we obtain the following:
$$E_{R_{\mathbb{N},\mathbf{x}}^*R}(p_2 \mid p_1 = x_1) = E_\theta(X_2 \mid X_1 = x_1), \quad \Pi_{\mathbb{N}}\text{-a.e.}$$
According to (6), we obtain the following:
$$\lim_n \int_{\Omega_{12}^{\mathbb{N}} \times \Omega_{12} \times \Theta} \Big| E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1) - E_\theta(X_2 \mid X_1 = x_1) \Big|\, d\Pi_{\mathbb{N}}(\mathbf{x}, x, \theta) = 0,$$
which proves that the Bayes risk of the optimal estimator $E_{R_{n,\mathbf{x}(n)}^*R}(p_2 \mid p_1 = x_1)$ of the regression curve $E_\theta(X_2 \mid X_1 = x_1)$ converges to 0 for the $L^1$ loss function.
We wonder whether the same happens for the squared ($L^2$) loss function, i.e., whether the Bayes risk
$$E_{\Pi_n}\big[\big(m_n^*(\mathbf{x}, x_1) - r_\theta(x_1)\big)^2\big]$$
converges to 0 as n goes to ∞. Theorem 6.6.9 of [19] shows that the answer is affirmative because
$$m_n^*(\mathbf{x}, x_1) = E_{\Pi_{\mathbb{N}}}\big(Y \mid \mathcal{A}^{(n)}\big)$$
and, by Jensen's inequality,
$$E_{\Pi_{\mathbb{N}}}\Big(E_{\Pi_{\mathbb{N}}}\big(Y \mid \mathcal{A}^{(n)}\big)^2\Big) \le E_{\Pi_{\mathbb{N}}}\Big(E_{\Pi_{\mathbb{N}}}\big(Y^2 \mid \mathcal{A}^{(n)}\big)\Big) = E_{\Pi_{\mathbb{N}}}(Y^2) \le \int_\Theta E_\theta(X_2^2)\, dQ(\theta) < \infty.$$
This completes the proof. □

5. Examples

Example 1.
Let us suppose that, for $\theta, \lambda, x_1 > 0$, $P_\theta^{X_1} = G(1, \theta^{-1})$, $P_\theta^{X_2 \mid X_1 = x_1} = G(1, (\theta x_1)^{-1})$, and $Q = G(1, \lambda^{-1})$, where $G(\alpha, \beta)$ denotes the gamma distribution with parameters $\alpha, \beta > 0$. Hence, the joint density of $X_1$ and $X_2$ is
$$f_\theta(x_1, x_2) = \theta^2 x_1 \exp\{-\theta x_1 (1 + x_2)\}\, I_{]0,\infty[^2}(x_1, x_2).$$
It is shown in [13], Example 1, that the Bayes estimator of the conditional density function $f_\theta^{X_2 \mid X_1 = x_1}(t) = \theta x_1 \exp\{-\theta x_1 t\}\, I_{]0,\infty[}(t)$ (for $x_1 > 0$) is the conditional density of $X_2$, given $X_1 = x_1$, with respect to the posterior predictive distribution, i.e.,
$$f_{n,\mathbf{x}}^{*X_2 \mid X_1 = x_1}(t) = \frac{(2n+2)\, x_1\, a_n(\mathbf{x}, x_1)^{2n+2}}{(x_1 t + a_n(\mathbf{x}, x_1))^{2n+3}},$$
where $a_n(\mathbf{x}, x_1) = \lambda + x_1 + \sum_{i=1}^n x_{i1}(1 + x_{i2})$.
Let $\varphi : \mathbb{R}^+ \to \mathbb{R}$ be a bounded measurable real function. Then $\varphi \circ X_2$ satisfies the conditions of Theorem 1. The Bayes estimator of the regression curve
$$r_\theta(x_1) := E_\theta(\varphi \circ X_2 \mid X_1 = x_1) = \int_0^\infty \varphi(t)\, f_\theta^{X_2 \mid X_1 = x_1}(t)\, dt$$
is
$$m_n^*(\mathbf{x}, x_1) = \int_0^\infty \varphi(t)\, f_{n,\mathbf{x}}^{*X_2 \mid X_1 = x_1}(t)\, dt = \int_0^\infty \varphi(t)\, \frac{(2n+2)\, x_1\, a_n(\mathbf{x}, x_1)^{2n+2}}{(x_1 t + a_n(\mathbf{x}, x_1))^{2n+3}}\, dt.$$
For instance, it is readily shown that, if $\varphi = I_{]0,1[}$, then the Bayes estimator is
$$m_n^*(\mathbf{x}, x_1) = 1 - \left( \frac{\lambda + x_1 + \sum_{i=1}^n x_{i1}(1 + x_{i2})}{\lambda + 2x_1 + \sum_{i=1}^n x_{i1}(1 + x_{i2})} \right)^{2n+2}.$$
Theorem 1 shows that this is a strongly consistent estimator of the regression curve $r_\theta(x_1)$ and that its Bayes risk converges to 0 for both the $L^1$ and $L^2$ loss functions.
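The closed form above lends itself to a quick numerical check. The following sketch (illustrative only; the values of λ, θ, and $x_1$ are arbitrary choices) simulates samples from the model of this example and compares $m_n^*(\mathbf{x}, x_1)$ with the sampling regression curve $r_\theta(x_1) = P_\theta(X_2 < 1 \mid X_1 = x_1) = 1 - e^{-\theta x_1}$ as n grows.

```python
# Illustrative consistency check for Example 1 with phi = I_(0,1); the
# parameter values are arbitrary assumptions made for the demonstration.
import numpy as np

rng = np.random.default_rng(1)
lam, theta, x1 = 1.0, 2.0, 0.5            # prior rate, "true" theta, argument x1

for n in (10, 100, 1_000, 10_000):
    xi1 = rng.exponential(scale=1/theta, size=n)           # X1 ~ Exp(theta)
    xi2 = rng.exponential(scale=1/(theta*xi1))             # X2 | X1 ~ Exp(theta*X1)
    a_n = lam + x1 + np.sum(xi1*(1 + xi2))
    m_star = 1 - (a_n/(a_n + x1))**(2*n + 2)               # Bayes estimate
    print(n, round(m_star, 4), round(1 - np.exp(-theta*x1), 4))
```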
Example 2.
Let us suppose that $X_1$ has a Bernoulli distribution with unknown parameter $\theta \in\, ]0,1[$ (i.e., $P_\theta^{X_1} = Bi(1, \theta)$) and that, given $X_1 = k_1 \in \{0, 1\}$, $X_2$ has distribution $Bi(1, 1-\theta)$ when $k_1 = 0$ and $Bi(1, \theta)$ when $k_1 = 1$, i.e., $P_\theta^{X_2 \mid X_1 = k_1} = Bi(1, k_1 + (1 - 2k_1)(1 - \theta))$. We can think of tossing a coin with probability θ of getting heads ($=1$) and making a second toss of this coin if it comes up heads on the first toss, or tossing a second coin with probability $1-\theta$ of getting heads if the first toss is tails ($=0$). Consider the uniform distribution on $]0,1[$ as the prior distribution Q.
So, the joint probability function of $X_1$ and $X_2$ is
$$f_\theta(k_1, k_2) = \theta^{k_1}(1-\theta)^{1-k_1}\,\big[k_1 + (1-2k_1)(1-\theta)\big]^{k_2}\,\big[1 - k_1 - (1-2k_1)(1-\theta)\big]^{1-k_2} = \begin{cases} \theta(1-\theta) & \text{if } k_2 = 0, \\ (1-\theta)^2 & \text{if } k_1 = 0,\ k_2 = 1, \\ \theta^2 & \text{if } k_1 = 1,\ k_2 = 1. \end{cases}$$
It is shown in [13], Example 2, that the Bayes estimator of the conditional mean $r_\theta(k_1) := E_\theta(X_2 \mid X_1 = k_1) = \theta^{k_1}(1-\theta)^{1-k_1}$ is, for $k_1 = 0, 1$,
$$m_n^*(\mathbf{k}, k_1) = f_{n,\mathbf{k}}^{*X_2 \mid X_1 = k_1}(1) = \begin{cases} \dfrac{n_{+0}(\mathbf{k}) + 2\,n_{01}(\mathbf{k}) + 2}{2n + 3} & \text{if } k_1 = 0, \\[2mm] \dfrac{n_{+0}(\mathbf{k}) + 2\,n_{11}(\mathbf{k}) + 2}{2n + 3} & \text{if } k_1 = 1, \end{cases}$$
$n_{j_1 j_2}(\mathbf{k})$ being the number of indices $i \in \{1, \dots, n\}$ such that $(k_{i1}, k_{i2}) = (j_1, j_2)$, and $n_{+j} = n_{0j} + n_{1j}$ for $j = 0, 1$.
Theorem 1 proves that it is a strongly consistent estimator of the conditional mean $r_\theta(k_1)$ and that its Bayes risk converges to 0 for both the $L^1$ and $L^2$ loss functions.
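As an illustrative check (not part of the paper), the estimator of this example can also be computed without any closed form: discretize the uniform prior, form the posterior weights from the likelihood, and take the ratio that defines the posterior predictive conditional probability. The sketch below does this and compares the result with $r_\theta(k_1)$ for a fixed "true" θ; the value of θ and the grid size are arbitrary assumptions.

```python
# Illustrative numerical check for Example 2: compute the posterior predictive
# conditional probability m_n^*(k, k1) by discretizing the uniform prior.
import numpy as np

rng = np.random.default_rng(3)
theta = 0.3
grid = np.linspace(0.0, 1.0, 2_001)[1:-1]      # quadrature grid on ]0,1[

def joint_pmf(t, k1, k2):                      # f_theta(k1, k2) of Example 2
    p2 = k1 + (1 - 2*k1)*(1 - t)               # P(X2 = 1 | X1 = k1)
    return t**k1 * (1 - t)**(1 - k1) * p2**k2 * (1 - p2)**(1 - k2)

for n in (10, 100, 1_000):
    k1s = rng.binomial(1, theta, size=n)
    k2s = rng.binomial(1, k1s + (1 - 2*k1s)*(1 - theta))
    loglik = sum(np.log(joint_pmf(grid, a, b)) for a, b in zip(k1s, k2s))
    post = np.exp(loglik - loglik.max())
    post /= post.sum()                         # discretized posterior on the grid
    for k1 in (0, 1):
        num = np.sum(post * joint_pmf(grid, k1, 1))
        den = np.sum(post * (joint_pmf(grid, k1, 0) + joint_pmf(grid, k1, 1)))
        print(n, k1, round(num/den, 4), theta if k1 == 1 else 1 - theta)
```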
Example 3.
Let $(X_1, X_2)$ have a bivariate normal distribution
$$N_2\!\left( \begin{pmatrix} \theta \\ \theta \end{pmatrix},\ \sigma^2 \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right),$$
and consider the prior distribution $Q = N(\mu, \tau^2)$, where μ, σ, τ, and ρ are supposed to be known and θ is the unknown parameter. It is shown in [13], Example 3, that the conditional mean
$$E_{R_{n,\mathbf{x}}^*R}(p_2 \mid p_1 = x_1) = (1 - \rho_1)\, m_1(\mathbf{x}) + \rho_1\, x_1$$
is the Bayes estimator of the regression curve
$$E_\theta(X_2 \mid X_1 = x_1) = (1 - \rho)\,\theta + \rho\, x_1$$
for the squared error loss function, where
$$\rho_1 = \frac{\rho\, a_n(\rho, \sigma, \tau) + \frac{1-\rho}{1+\rho}}{a_n(\rho, \sigma, \tau) - \frac{1-\rho}{1+\rho}}, \qquad m_1(\mathbf{x}) = \frac{s_1(\mathbf{x}) + (1+\rho)\,\frac{\sigma^2}{\tau^2}\,\mu}{(1+\rho)\, a_n(\rho, \sigma, \tau) - 2},$$
being
$$s_1(\mathbf{x}) := \sum_{i=1}^n (x_{i1} + x_{i2}), \qquad a_n(\rho, \sigma, \tau) := \frac{2(n+1)}{1+\rho} + \frac{\sigma^2}{\tau^2}.$$
Theorem 1 proves that it is a strongly consistent estimator of the regression curve and that its Bayes risk converges to 0 for both the $L^1$ and $L^2$ loss functions.
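A direct numerical check is again possible. Rather than relying on the closed-form coefficients above, the sketch below (parameter values are arbitrary choices for illustration) computes the posterior predictive regression from the conjugate normal posterior of θ and compares it with the sampling regression curve $(1-\rho)\theta + \rho x_1$ as n grows.

```python
# Illustrative check for Example 3: posterior predictive regression computed
# from the conjugate normal posterior; all parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(2)
mu, tau, sigma, rho = 0.0, 2.0, 1.0, 0.6
theta, x1 = 1.5, 0.3                              # "true" theta and argument x1
cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

for n in (10, 100, 1_000, 10_000):
    sample = rng.multivariate_normal([theta, theta], cov, size=n)
    s1 = sample.sum()                             # sum of all 2n coordinates
    prec = 2*n/(sigma**2*(1 + rho)) + 1/tau**2    # posterior precision of theta
    post_var = 1/prec
    post_mean = (s1/(sigma**2*(1 + rho)) + mu/tau**2)/prec
    # Posterior predictive of a new pair: N2((m, m), cov + post_var * ones).
    m_star = post_mean + (rho*sigma**2 + post_var)/(sigma**2 + post_var)*(x1 - post_mean)
    print(n, round(m_star, 4), round((1 - rho)*theta + rho*x1, 4))
```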
Example 4.
In this (discrete and nonparametric) example, we assume that $(X_1, X_2)$ is an $\mathbb{N}_0^2$-valued random variable with an unknown arbitrary probability distribution P in $M(\mathbb{N}_0^2)$, the set of all probability distributions on $(\mathbb{N}_0^2, \mathcal{P}(\mathbb{N}_0^2))$; $\mathcal{P}(\mathbb{N}_0^2)$ stands for the discrete σ-field. The Bayesian experiment (1) considered in this example for an n-sized sample of the joint distribution P of $(X_1, X_2)$ is
$$\big( (\mathbb{N}_0^2)^n,\ \mathcal{P}(\mathbb{N}_0^2)^n,\ \{P^n : P \in (M(\mathbb{N}_0^2), \mathcal{B}_{M(\mathbb{N}_0^2)}, D_\alpha)\} \big),$$
where $\mathcal{B}_{M(\mathbb{N}_0^2)}$ is the Borel σ-field on $M(\mathbb{N}_0^2)$ for the weak topology (for which it becomes a Polish space), α is a finite measure on $\mathbb{N}_0^2$, and $D_\alpha$ is the Dirichlet process with base measure α, which plays the role of the prior distribution. We refer to [3] or [20] for everything related to Dirichlet processes.
Let $\varphi : \mathbb{N}_0 \to \mathbb{R}$ be a bounded real function, so that $\varphi \circ X_2$ satisfies the conditions of Theorem 1. To estimate the regression function
$$E_P(\varphi \circ p_2 \mid p_1 = k) = \frac{\sum_{j \in \mathbb{N}_0} \varphi(j)\, P(k, j)}{\sum_{j \in \mathbb{N}_0} P(k, j)}$$
from a sample $\mathbf{x} := (\mathbf{x}_1, \dots, \mathbf{x}_n) \in (\mathbb{N}_0^2)^n$, we need the posterior predictive distribution given $\mathbf{x}$, which, as is known (see [20], for instance), is
$$R_{n,\mathbf{x}}^*R = \frac{\alpha + \sum_{i=1}^n \delta_{\mathbf{x}_i}}{\alpha(\mathbb{N}_0^2) + n},$$
where $\delta_{\mathbf{x}_i}(k, l) = 1$ when $\mathbf{x}_i = (k, l)$, and $= 0$ otherwise. Then, the Bayes estimator of the regression function $E_P(\varphi \circ p_2 \mid p_1 = k) = \sum_{j \in \mathbb{N}_0} \varphi(j)\, P(p_2 = j \mid p_1 = k)$ is the regression function of $\varphi \circ p_2$, given $p_1 = k$, with respect to the posterior predictive distribution given $\mathbf{x}$, i.e.,
$$E_{R_{n,\mathbf{x}}^*R}(\varphi \circ p_2 \mid p_1 = k) = \frac{\sum_{j \in \mathbb{N}_0} \varphi(j)\, \big[\alpha(k, j) + \sum_{i=1}^n \delta_{\mathbf{x}_i}(k, j)\big]}{\sum_{j \in \mathbb{N}_0} \big[\alpha(k, j) + \sum_{i=1}^n \delta_{\mathbf{x}_i}(k, j)\big]}.$$
Theorem 1 proves that it is a strongly consistent estimator of this regression curve and that its Bayes risk converges to 0 for both the $L^1$ and $L^2$ loss functions.
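The estimator of this example is a simple ratio of "base-measure mass plus count" sums, so it is straightforward to implement. The following sketch is an illustration, not code from the paper; the base measure and the data are arbitrary assumptions.

```python
# Illustrative implementation of the Dirichlet-process regression estimate of
# Example 4; `alpha` (a dict of masses on N0^2) and the data are assumptions.
from collections import Counter

def dp_regression(sample, alpha, k, phi=lambda j: j):
    """Bayes estimate of E_P(phi(X2) | X1 = k): ratio of alpha-plus-count sums."""
    counts = Counter(sample)
    cols = {j for (i, j) in set(alpha) | set(counts) if i == k}
    num = sum(phi(j) * (alpha.get((k, j), 0.0) + counts.get((k, j), 0)) for j in cols)
    den = sum(alpha.get((k, j), 0.0) + counts.get((k, j), 0) for j in cols)
    return num / den

# Toy usage: base measure with mass 0.5 on each point of a 3x3 grid.
alpha = {(i, j): 0.5 for i in range(3) for j in range(3)}
data = [(0, 1), (0, 2), (1, 1), (0, 1), (2, 0)]
print(dp_regression(data, alpha, k=0))   # estimate of E_P(X2 | X1 = 0)
```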
The next example is a continuation of Example 5.5 of [17].
Example 5.
In this example (nonparametric, with a continuous base probability measure for the prior Dirichlet process), we assume that $(X_1, X_2)$ is an $\mathbb{R}^2$-valued random variable with an unknown arbitrary probability distribution P in $M(\mathbb{R}^2)$, the set of all probability distributions on $(\mathbb{R}^2, \mathcal{R}^2)$; $\mathcal{R}^2$ stands for the Borel σ-field on $\mathbb{R}^2$. The Bayesian experiment (1) considered in this example, for an n-sized sample of the joint distribution P of $(X_1, X_2)$, is
$$\big( (\mathbb{R}^2)^n,\ (\mathcal{R}^2)^n,\ \{P^n : P \in (M(\mathbb{R}^2), \mathcal{B}_{M(\mathbb{R}^2)}, D_\alpha)\} \big),$$
where $\mathcal{B}_{M(\mathbb{R}^2)}$ is the Borel σ-field on $M(\mathbb{R}^2)$ for the weak topology (for which it becomes a Polish space), α is a probability measure on $\mathbb{R}^2$, and $D_\alpha$ is the Dirichlet process with base measure α, which plays the role of the prior distribution. A reference to [3] or [20] for everything related to Dirichlet processes is still appropriate. The posterior predictive distribution, given the sample $\mathbf{x} := (\mathbf{x}_1, \dots, \mathbf{x}_n) \in (\mathbb{R}^2)^n$, is known to be
$$R_{n,\mathbf{x}}^*R = \frac{\alpha + \sum_{i=1}^n \delta_{\mathbf{x}_i}}{n+1} = \frac{1}{n+1}\,\alpha + \frac{n}{n+1}\cdot\frac{1}{n}\sum_{i=1}^n \delta_{\mathbf{x}_i},$$
where $\delta_{\mathbf{x}_i}(x_1, x_2) = 1$ when $\mathbf{x}_i = (x_1, x_2)$, and $= 0$ otherwise. Note that this distribution is a convex combination of the probability measure α, the base measure of the prior distribution, and the empirical measure, a mixture in which the weight of the data increases with the sample size (in fact, it tends to 1). The Bayes estimator of the conditional distribution $P^{X_2 \mid X_1 = x_1}$ is the conditional distribution of $p_2$ given $p_1 = x_1$ with respect to the posterior predictive distribution, given $\mathbf{x}$:
$$\big(R_{n,\mathbf{x}}^*R\big)^{p_2 \mid p_1 = x_1}.$$
It is shown in [17] that this conditional distribution can be calculated, for a Borel set $B \subset \mathbb{R}$, as follows:
$$\big(R_{n,\mathbf{x}}^*R\big)^{p_2 \mid p_1 = x_1}(B) = I_{\{x_{11}, \dots, x_{n1}\}^c}(x_1) \cdot \alpha^{p_2 \mid p_1 = x_1}(B) + I_{\{x_{11}, \dots, x_{n1}\}}(x_1) \cdot \frac{\sum_{i=1}^n \delta_{\mathbf{x}_i}(\{x_1\} \times B)}{\sum_{i=1}^n \delta_{x_{i1}}(x_1)}.$$
Let us now consider a bounded measurable real function φ on $\mathbb{R}$ (then $\varphi \circ p_2$ satisfies the conditions of Theorem 1). The Bayes estimator of the regression function $E_P(\varphi \circ p_2 \mid p_1 = x_1)$ is the regression function of $\varphi \circ p_2$, given $p_1 = x_1$, with respect to the posterior predictive distribution, given $\mathbf{x}$, i.e.,
$$m_n^*(\mathbf{x}, x_1) := E_{R_{n,\mathbf{x}}^*R}(\varphi \circ p_2 \mid p_1 = x_1).$$
So
$$m_n^*(\mathbf{x}, x_1) = I_{\{x_{11}, \dots, x_{n1}\}^c}(x_1) \cdot \int_{\mathbb{R}} \varphi(x_2)\, d\alpha^{p_2 \mid p_1 = x_1}(x_2) + I_{\{x_{11}, \dots, x_{n1}\}}(x_1) \cdot \frac{1}{\sum_{i=1}^n \delta_{x_{i1}}(x_1)} \sum_{i=1}^n \int_{\mathbb{R}} \varphi(x_2)\, \delta_{\mathbf{x}_i}(\{x_1\} \times dx_2).$$
This way, if $x_1 \in \{x_{11}, \dots, x_{n1}\}^c$, $m_n^*(\mathbf{x}, x_1)$ is the conditional mean of $\varphi \circ p_2$, given $p_1 = x_1$, for the base probability measure α (for instance, if α is the product $\alpha_1 \times \alpha_2$ of two probability distributions on $\mathbb{R}$, then $m_n^*(\mathbf{x}, x_1)$ is the mean $E_{\alpha_2}(\varphi)$).
If $x_1 \in \{x_{11}, \dots, x_{n1}\}$, denote by $S_{x_1}$ the set of indices $1 \le i \le n$ such that $x_{i1} = x_1$ and by $s_{x_1}$ the number of such indices. In this case,
$$m_n^*(\mathbf{x}, x_1) = \frac{1}{s_{x_1}} \sum_{i \in S_{x_1}} \varphi(x_{i2}).$$
Therefore, if $x_1$ is (respectively, is not) in the sample support, only the empirical measure (respectively, the base measure α of the prior distribution) is taken into account when estimating the regression curve at the point $x_1$.
Theorem 1 shows that the Bayes risk of this estimator of the regression curve $E_P(\varphi \circ p_2 \mid p_1 = x_1)$ converges to 0 for both the $L^1$ and $L^2$ loss functions and that it is a strongly consistent estimator of this regression curve.
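The case distinction above translates directly into code. The sketch below is an illustration under the product-base-measure assumption mentioned in the example ($\alpha = \alpha_1 \times \alpha_2$), so that off the sample support the estimate reduces to $E_{\alpha_2}(\varphi)$, a number supplied by the user.

```python
# Illustrative implementation of the estimator of Example 5, assuming a
# product base measure alpha = alpha1 x alpha2; `phi_mean_alpha2` stands for
# E_{alpha2}(phi) and must be supplied for the chosen base measure.
import numpy as np

def dp_cont_regression(sample, x1, phi, phi_mean_alpha2):
    """sample: array of shape (n, 2) with rows (x_i1, x_i2); returns m_n^*."""
    mask = (sample[:, 0] == x1)
    if mask.any():                       # x1 belongs to {x_11, ..., x_n1}
        return float(np.mean(phi(sample[mask, 1])))
    return phi_mean_alpha2               # otherwise, fall back to the base measure

# Toy usage: phi = indicator of ]0, 1[ and a standard normal alpha2, for which
# E_{alpha2}(phi) = Phi(1) - Phi(0) ~ 0.3413.
data = np.array([[0.2, 0.7], [0.2, 1.4], [1.1, 0.3]])
phi = lambda t: ((t > 0) & (t < 1)).astype(float)
print(dp_cont_regression(data, 0.2, phi, 0.3413))   # -> 0.5 (empirical part)
print(dp_cont_regression(data, 9.9, phi, 0.3413))   # -> 0.3413 (base measure)
```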
Remark 2.
In Example 5, the problem of estimating the density is meaningless because $M(\mathbb{R}^2)$ contains all the probability measures on the plane. In fact, it is known (see [3]) that, even if α is absolutely continuous, as is the case here, $D_\alpha$ is concentrated on the set of discrete probability measures on $\mathbb{R}^2$, which discourages its use as a prior distribution on the set of all density functions. Fortunately, Theorem 3.1 of [17] allows us to address the problem in terms of the conditional distribution and, finally, to obtain the Bayes estimator of the regression function without needing to appeal to densities.

6. Conclusions

The optimality, in a decisional sense, of the regression curve with respect to the posterior predictive distribution as an estimator of the regression curve for the squared error loss function, together with its asymptotic behavior (consistency and convergence to 0 of the Bayes risk), is an important point in favor of this method of estimating the regression curve compared with other estimation methods in a Bayesian context.
A remarkable fact about the results of this paper (and [13]) is that the predictor $X_1$ is an arbitrary random variable (not necessarily a real or n-dimensional random variable). Furthermore, no special assumptions are made about the prior distribution.
An important contribution of the paper is the establishment of a certain probability space as the appropriate theoretical framework for the study of the asymptotic behavior of the estimator of the regression curve, which allows us to obtain an explicit expression for the Bayes risk of this estimator and to take advantage of powerful probabilistic tools when solving the problem of its asymptotic behavior.

Funding

This research was funded by the Junta de Extremadura (SPAIN) grant number GR24055.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

I would like to thank a reviewer for their comments, which have resulted in a clearer and more precise version of some of the examples presented.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Nadaraya, E.A. On estimating regression. Theory Probab. Its Appl. 1964, 9, 141–142. [Google Scholar] [CrossRef]
  2. Watson, G.S. Smooth regression analysis. Sankhya Ser. A 1964, 26, 359–372. [Google Scholar]
  3. Ghosal, S.; van der Vaart, A. Fundamentals of Nonparametric Bayesian Inference; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
  4. Nogales, A.G. On Bayesian estimation of densities and sampling distributions: The posterior predictive distribution as the Bayes estimator. Stat. Neerl. 2022, 76, 236–250. [Google Scholar] [CrossRef]
  5. Bean, A.; Xu, X.; MacEachern, S. Transformations and Bayesian density estimation. Electron. J. Stat. 2016, 10, 3355–3373. [Google Scholar] [CrossRef]
  6. Lijoi, A.; Prünster, I. Models beyond the Dirichlet Process. In Bayesian Nonparametrics; Hjort, N.L., Holmes, C., Müller, P., Walker, S.G., Eds.; Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 2010. [Google Scholar]
  7. Lo, A.Y. On a class of Bayesian nonparametric estimates. I. Density estimates. Ann. Statist. 1984, 12, 351–357. [Google Scholar] [CrossRef]
  8. Marchand, É.; Sadeghkhani, A. Predictive density estimation with additional information. Electron. J. Stat. 2018, 12, 4209–4238. [Google Scholar] [CrossRef]
  9. Nogales, A.G. On consistency of the Bayes Estimator of the Density. Mathematics 2022, 10, 636. [Google Scholar] [CrossRef]
  10. Efromovich, S. Conditional Density Estimation in a Regression Setting. Ann. Stat. 2007, 35, 2504–2535. [Google Scholar] [CrossRef]
 11. Izbicki, R.; Lee, A.B. Nonparametric conditional density estimation in a high-dimensional regression setting. J. Comput. Graph. Stat. 2016, 25, 1297–1316. [Google Scholar] [CrossRef]
  12. Rosenblatt, M. Conditional probability density and regression estimators. In Multivariate Analysis II; Krishnaiah, P.R., Ed.; Academic Press: New York, NY, USA, 1969; pp. 25–31. [Google Scholar]
  13. Nogales, A.G. Optimal Bayesian Estimation of a Regression Curve, a Conditional Density, and a Conditional Distribution. Mathematics 2022, 10, 1213. [Google Scholar] [CrossRef]
  14. Geisser, S. Predictive Inference: An Introduction; Chapman & Hall: New York, NY, USA, 1993. [Google Scholar]
  15. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; CRC Press (Taylor & Francis Group): Boca Raton, FL, USA, 2014. [Google Scholar]
  16. Florens, J.P.; Mouchart, M.; Rolin, J.M. Elements of Bayesian Statistics; Marcel Dekker: New York, NY, USA, 1990. [Google Scholar]
  17. Nogales, A.G. The Bayes Estimator of a Conditional Density: Asymptotic Behavior. Braz. J. Probab. Stat. 2024, 38, 531–548. [Google Scholar] [CrossRef]
  18. Nadaraya, E.A. Nonparametric Estimation of Probability Densities and Regression Curves; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1989. [Google Scholar]
 19. Ash, R.B.; Doléans-Dade, C. Probability and Measure Theory, 2nd ed.; Academic Press: San Diego, CA, USA, 2000. [Google Scholar]
  20. Ghosh, J.K.; Delampady, M.; Samanta, T. An Introduction to Bayesian Analysis, Theory and Methods; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]