1. Introduction
Given a random variable $X$—the independent variable, regressor, or predictor—and a real random variable $Y$—the dependent variable or response—the so-called regression curve of $Y$ given $X$ is the map $x\mapsto E(Y\mid X=x)$, the function of $X$ that best approximates $Y$ in the least squares sense; it is therefore an essential tool in the study of the relationship between these two variables. Many statistical problems in practice, especially those related to prediction, require the estimation of the regression function from data, i.e., from a sample $(X_i,Y_i)$, $1\le i\le n$, of the joint distribution of $X$ and $Y$. This estimation problem has been addressed in a good number of papers in both parametric and nonparametric contexts, from both the frequentist and the Bayesian points of view. In fact, regression techniques are among the most widely used methods in applied statistics.
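To recall the least squares property in symbols—a classical fact, stated here with $g$ denoting an arbitrary measurable function such that $g(X)$ is square-integrable (our notation, used only for orientation)—the regression curve satisfies
$$E\bigl[(Y-E(Y\mid X))^{2}\bigr]\ \le\ E\bigl[(Y-g(X))^{2}\bigr]\qquad\text{for every such }g,$$
i.e., among all functions of $X$, the conditional mean minimizes the mean squared deviation from $Y$.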
In a nonparametric frequentist framework, the problem of estimation of the regression curve was first considered in [1,2]. We refer to [3] for this problem in a Bayesian context; it includes some historical notes about Bayesian nonparametric regression and some results about the consistency of the estimates for some specific priors.
Talking about the probability of an event $A$ (written $P_\theta(A)$) in a statistical context is ambiguous, as it depends on the unknown parameter $\theta$. In a Bayesian context, once the data has been observed, a natural estimate of $P_\theta(A)$ is the posterior predictive probability of $A$ given the data, since it is the posterior mean of the probabilities of $A$ given $\theta$, which, as is well known, is the Bayes estimator of $P_\theta(A)$ for the squared error loss function. This simple fact already justifies the use of the posterior predictive distribution as an estimator of the sampling distribution, but, in reality, much more is true because, as shown in [4], the posterior predictive distribution is the Bayes estimator of the sampling probability distribution $P_\theta$ for the squared total variation loss function. It is similar to what happens with the strong law of large numbers and the Glivenko–Cantelli theorem: the first guarantees almost sure pointwise convergence of the empirical distribution function to the unknown population distribution function, but the second yields almost sure uniform convergence, becoming the fundamental theorem of Mathematical Statistics. The problem of estimation of the density in a Bayesian nonparametric framework is considered in a number of references, such as [5], [6], [7], or [8]. In [4], the problem of estimation of the density from a Bayesian point of view is also addressed, and, under mild conditions, it is shown that the posterior predictive density is the Bayes estimator for the $L^1$-squared loss function; the convergence to 0 of the Bayes risk (and the strong consistency) of this estimator is shown in [9].
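As a simple illustration (our own, not taken from the paper), consider a Bernoulli model with a conjugate Beta prior: for data $x_1,\dots,x_n\in\{0,1\}$, a prior $\theta\sim\mathrm{Beta}(a,b)$, and the event $A$ that the next observation is a success, so that $P_\theta(A)=\theta$, the posterior predictive probability of $A$ is exactly the posterior mean of the success probabilities:
$$P(A\mid x_1,\dots,x_n)=\int_0^1\theta\,dQ(\theta\mid x_1,\dots,x_n)=\frac{a+\sum_{i=1}^{n}x_i}{a+b+n}.$$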
As regards the estimation of the regression curve, or even the conditional density, references [10,11] contain sufficient arguments on the usefulness of these problems in practice from a frequentist point of view, problems that go back to [12], although they have not produced much literature since then either. The paper [13] deals, among other topics, with the problem of the Bayesian estimation of a regression curve and proves that the regression curve with respect to the posterior predictive distribution is the Bayes estimator (for the squared error loss function). Here, we wonder about the convergence to 0 of its Bayes risk (and its strong consistency). This is the main goal of the paper, and Theorem 1 below answers the question in the affirmative.
So, the posterior predictive distribution is the key to the estimation problems raised above. It has been presented in the literature as the basis of Predictive Inference, which seeks to make inferences about a new unknown observation from the previous random sample instead of estimating an unknown parameter. It should be noted that, in practice, the explicit evaluation of the posterior predictive distribution can be cumbersome, and its simulation may become preferable. The interested reader can find in the papers mentioned above, and the references therein, more information on the problems of estimating the density or the regression curve, from both the frequentist and Bayesian perspectives, or about the usefulness of the posterior predictive distribution in Bayesian Inference and its calculation. We place special emphasis on the monographs [3,14,15].
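The following short sketch illustrates, outside the scope of the paper, how the posterior predictive distribution can be approximated by simulation when no closed form is at hand: one draws parameters from the posterior and then new observations from the corresponding sampling distributions. The model (a normal location model with conjugate normal prior and known variance) and all numerical values are chosen only for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: x_i | theta ~ N(theta, sigma^2), theta ~ N(mu0, tau0^2).
sigma, mu0, tau0 = 1.0, 0.0, 2.0
data = rng.normal(1.5, sigma, size=50)          # simulated observations

# Conjugate posterior of theta given the data (known-variance normal model).
prec_post = 1.0 / tau0**2 + len(data) / sigma**2
mu_post = (mu0 / tau0**2 + data.sum() / sigma**2) / prec_post
sd_post = prec_post**-0.5

# Posterior predictive simulation: draw theta from the posterior,
# then a new observation from the sampling distribution P_theta.
thetas = rng.normal(mu_post, sd_post, size=10_000)
x_new = rng.normal(thetas, sigma)

# Monte Carlo approximation of the predictive mean and variance.
print(x_new.mean(), x_new.var())
```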
In Section 2, an important and useful achievement of the paper is obtained, as it establishes a probability space as the theoretical framework (i.e., Bayesian experiment) appropriate to address the problem (in the same way that [16] considers the Bayesian experiment as a probability space). In fact, starting from the Bayesian experiment (1) corresponding to a sample of size $m$ (possibly infinite) from the joint distribution of the two variables $X$ (predictor) and $Y$ (response) of interest, the probability space (3) is presented as the appropriate model for the estimation of the regression curve of $Y$ given $X$ from an $m$-sized sample of the joint distribution of the two variables. This has allowed us to obtain an explicit expression of the Bayes risk of an estimator of the regression curve and to take advantage of powerful probabilistic tools when solving the problem of its asymptotic behavior.
Section 3 includes the aforementioned Theorem 1, whose proof relies on Jensen's inequality, Lévy's martingale convergence theorem, and a result by Doob on the consistency of the posterior distribution. The result is general enough to cover discrete and continuous cases, parametric or nonparametric, as the examples provided show, and, unlike what we have been able to find in the literature, no specific assumption is made about the prior distribution.
Section 4 contains the proof of the main result and some auxiliary results. In particular, Lemma 1, the key to the proof of the Theorem, yields a representation of the Bayes estimator of the regression curve as its conditional mean in the Bayesian experiment (3).
Section 5 includes some examples to illustrate the main result of the paper, two of them of a nonparametric nature. The last of these two nonparametric examples shows a situation where the regression curve admits a Bayes estimator although the problem of estimating the density is meaningless; in fact, the estimation of the regression function is performed through the conditional distribution itself.
For ease of reading, we encourage the reader who is not familiar with the terminology or the notation used in the paper to start by reading the Appendix of [17].
2. The Framework
We recall from [13] the appropriate framework to address the problem and update it to incorporate the required asymptotic flavor.
Let $(\Omega,\mathcal A,\{P_\theta\colon\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian statistical experiment, where $Q$ denotes the prior distribution on the parameter space $(\Theta,\mathcal T)$, and let $X\colon(\Omega,\mathcal A)\to(\mathcal X,\mathcal F)$ and $Y\colon(\Omega,\mathcal A)\to(\mathcal Y,\mathcal G)$ be two statistics. Consider the Bayesian experiment image of $(X,Y)$,
$$\bigl(\mathcal X\times\mathcal Y,\ \mathcal F\otimes\mathcal G,\ \{P_\theta^{(X,Y)}\colon\theta\in(\Theta,\mathcal T,Q)\}\bigr).$$
In the following, we will assume that the joint distribution of $X$ and $Y$, $R_\theta:=P_\theta^{(X,Y)}$, $\theta\in\Theta$, is a Markov kernel. Let us write $R_{\theta,1}$ and $R_\theta^{x}$ for the marginal distribution of $X$ and the conditional distribution of $Y$ given $X=x$ under $P_\theta$, respectively. Hence $R_\theta(F\times G)=\int_F R_\theta^{x}(G)\,dR_{\theta,1}(x)$ for $F\in\mathcal F$ and $G\in\mathcal G$, and, when $Y$ is a real random variable, $r_\theta(x):=\int_{\mathbb R}y\,dR_\theta^{x}(y)=E_\theta(Y\mid X=x)$ is the regression curve of $Y$ given $X$ under $P_\theta$. In order to alleviate and shorten the notation, we write $E_\theta$ and $E_Q$ for the expectations with respect to $P_\theta$ and to the prior $Q$, respectively.
Given an integer $n$, the Bayesian experiment corresponding to an $n$-sized sample (respectively, an infinite sample) of the joint distribution of $(X,Y)$ is
$$\bigl((\mathcal X\times\mathcal Y)^{n},\ (\mathcal F\otimes\mathcal G)^{n},\ \{R_\theta^{n}\colon\theta\in(\Theta,\mathcal T,Q)\}\bigr),\qquad(1)$$
where $R_\theta^{n}$ denotes the $n$-fold product of $R_\theta$ with itself (respectively, the same expression with $n$ replaced by $\mathbb N$ and $R_\theta^{\mathbb N}$ the countable product).
We define the Markov kernel $R^{n}\colon(\Theta,\mathcal T)\rightarrowtail\bigl((\mathcal X\times\mathcal Y)^{n},(\mathcal F\otimes\mathcal G)^{n}\bigr)$ by $R^{n}(\theta,B):=R_\theta^{n}(B)$, for $\theta\in\Theta$ and $B\in(\mathcal F\otimes\mathcal G)^{n}$, and write $\Pi_{n,Q}$ for the joint distribution of the parameter and the sample, i.e.,
$$\Pi_{n,Q}(T\times B)=\int_T R_\theta^{n}(B)\,dQ(\theta),\qquad T\in\mathcal T,\ B\in(\mathcal F\otimes\mathcal G)^{n}.$$
The corresponding prior predictive distribution $\beta_{n,Q}$ on $(\mathcal F\otimes\mathcal G)^{n}$ is
$$\beta_{n,Q}(B)=\int_\Theta R_\theta^{n}(B)\,dQ(\theta),\qquad B\in(\mathcal F\otimes\mathcal G)^{n}.$$
The posterior distribution is a Markov kernel $Q^{(\cdot)}\colon\bigl((\mathcal X\times\mathcal Y)^{n},(\mathcal F\otimes\mathcal G)^{n}\bigr)\rightarrowtail(\Theta,\mathcal T)$ such that, for all $T\in\mathcal T$ and $B\in(\mathcal F\otimes\mathcal G)^{n}$,
$$\Pi_{n,Q}(T\times B)=\int_B Q^{z}(T)\,d\beta_{n,Q}(z).$$
Let us write $z:=((x_1,y_1),\dots,(x_n,y_n))$ for a generic point of the sample space, so that $Q^{z}$ denotes the posterior distribution given the data $z$.
The posterior predictive distribution on $\mathcal F\otimes\mathcal G$ is the Markov kernel $R_{Q,n}\colon\bigl((\mathcal X\times\mathcal Y)^{n},(\mathcal F\otimes\mathcal G)^{n}\bigr)\rightarrowtail(\mathcal X\times\mathcal Y,\mathcal F\otimes\mathcal G)$ defined, for $B\in\mathcal F\otimes\mathcal G$, by
$$R_{Q,n}^{z}(B):=\int_\Theta R_\theta(B)\,dQ^{z}(\theta).$$
This way, given the data $z$, the posterior predictive probability of an event $B\in\mathcal F\otimes\mathcal G$ is nothing but the posterior mean of the probabilities $R_\theta(B)$. It follows that, with obvious notations,
$$\int_{\mathcal X\times\mathcal Y} f\,dR_{Q,n}^{z}=\int_\Theta\Bigl(\int_{\mathcal X\times\mathcal Y} f\,dR_\theta\Bigr)dQ^{z}(\theta)$$
for any non-negative or integrable real random variable $f$.
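In particular, when $Y$ is real and the joint distributions $R_\theta$ admit densities $f_\theta(x,y)$ with respect to the product of a σ-finite measure on $(\mathcal X,\mathcal F)$ and Lebesgue measure on $\mathbb R$—an assumption introduced here only to display a concrete formula, not one required in the sequel—the regression curve with respect to the posterior predictive distribution takes the familiar ratio form
$$E_{R_{Q,n}^{z}}(Y\mid X=x)=\frac{\displaystyle\int_\Theta\int_{\mathbb R} y\,f_\theta(x,y)\,dy\,dQ^{z}(\theta)}{\displaystyle\int_\Theta\int_{\mathbb R} f_\theta(x,y)\,dy\,dQ^{z}(\theta)}$$
for those $x$ where the denominator is positive: a posterior mixture of sampling densities in the numerator, normalized by the corresponding mixture of marginal densities of $X$.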
We can also consider the posterior predictive distribution on $(\mathcal F\otimes\mathcal G)^{n}$, defined as the Markov kernel that maps the data $z$ to the posterior mixture of the product distributions $R_\theta^{n}$, i.e., $B\mapsto\int_\Theta R_\theta^{n}(B)\,dQ^{z}(\theta)$.
Looking for the appropriate framework to address the problem of estimating the regression curve of the real random variable $Y$ given $X$, we start from the Bayesian experiment (1) corresponding to a sample $((X_i,Y_i))_i$ from the joint distribution $R_\theta$ of the predictor $X$ and the response $Y$, and we choose $x$ (in fact, we only need the first coordinate $x_1$ of $x$ as the argument of the regression curve) from the distribution of the predictor, independently of the sample, which brings us to the product Bayesian experiment (2).
The Bayesian experiment (2) can be identified in a standard way with the probability space (3), whose probability measure is the joint distribution of the parameter, the sample, and the additional observation of the predictor: the parameter is drawn from the prior $Q$ and, given the parameter, the sample and the additional observation are drawn independently from the corresponding sampling distributions.
So, for a real random variable $f$ on the space of (3), its expectation can be computed by integrating first with respect to the sampling distributions and then with respect to the prior $Q$, provided that the integral exists. Moreover, for a real random variable $h$ on the same space, by definition of the posterior distributions, its expectation can equivalently be computed by integrating first with respect to the posterior distribution and then with respect to the prior predictive distribution.
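To fix ideas, if the additional observation of the predictor is drawn from its marginal distribution $R_{\theta,1}$—a reading of the construction above adopted here only for illustration, with notation ($\Pi_{n,Q}^{*}$) that is ours and need not coincide with that of (2) and (3)—the probability measure of (3) acts on product sets as
$$\Pi_{n,Q}^{*}(T\times B\times F)=\int_T\Bigl(\int_B R_{\theta,1}(F)\,dR_\theta^{n}(z)\Bigr)dQ(\theta),\qquad T\in\mathcal T,\ B\in(\mathcal F\otimes\mathcal G)^{n},\ F\in\mathcal F,$$
so that, for an integrable real random variable $f(\theta,z,x)$,
$$\int f\,d\Pi_{n,Q}^{*}=\int_\Theta\int_{(\mathcal X\times\mathcal Y)^{n}}\int_{\mathcal X} f(\theta,z,x)\,dR_{\theta,1}(x)\,dR_\theta^{n}(z)\,dQ(\theta)
=\int_{(\mathcal X\times\mathcal Y)^{n}}\int_\Theta\int_{\mathcal X} f(\theta,z,x)\,dR_{\theta,1}(x)\,dQ^{z}(\theta)\,d\beta_{n,Q}(z),$$
the second equality being the disintegration with respect to the prior predictive and posterior distributions.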
3. The Bayes Estimator of the Regression Curve: Asymptotic Behavior
Now suppose that $(\mathcal Y,\mathcal G)=(\mathbb R,\mathcal B(\mathbb R))$. Let $Y$ be a square-integrable real random variable such that $E_\theta(Y^2)$ has a finite prior mean; in particular, $E_\theta(|Y|)$ also has a finite prior mean.
In this setting, the regression curve of $Y$ given $X$ is the map
$$x\in\mathcal X\ \longmapsto\ r_\theta(x)=E_\theta(Y\mid X=x).$$
An estimator of the regression curve from a sample of size $n$ of the joint distribution of $(X,Y)$ is a statistic $m_n\colon(\mathcal X\times\mathbb R)^{n}\times\mathcal X\to\mathbb R$ such that, once the sample $z=((x_1,y_1),\dots,(x_n,y_n))$ has been observed, the map $x\mapsto m_n(z,x)$ is the estimate of the regression curve $r_\theta$.
From a classical point of view, the simplest way to evaluate the error in estimating an unknown regression curve is to use the expectation of the quadratic deviation (see [18], p. 120):
$$\int_{(\mathcal X\times\mathbb R)^{n}}\int_{\mathcal X}\bigl(m_n(z,x)-r_\theta(x)\bigr)^{2}\,dR_{\theta,1}(x)\,dR_\theta^{n}(z).$$
From a Bayesian point of view, the Bayes estimator—the optimal estimator—of the regression curve should minimize the Bayes risk, i.e., the prior mean of the expectation of the quadratic deviation,
$$\int_\Theta\int_{(\mathcal X\times\mathbb R)^{n}}\int_{\mathcal X}\bigl(m_n(z,x)-r_\theta(x)\bigr)^{2}\,dR_{\theta,1}(x)\,dR_\theta^{n}(z)\,dQ(\theta).$$
So, the Bayesian experiment (3) is the appropriate framework to address these questions (see also Remark 1).
Recall from [13] that the regression curve of $Y$ on $X$ with respect to the posterior predictive distribution $R_{Q,n}^{z}$, given the data $z$, is the Bayes estimator of the regression curve for the squared error loss function; for the sake of completeness, this proposition is also included as part (i) of Theorem 1 below.
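For a concrete, if elementary, illustration of this estimator (our own example, not one of those considered later in the paper), suppose that, given $\theta\in\mathbb R$, the pairs $(X_i,Y_i)$ are i.i.d., $X_i$ has a fixed known distribution not depending on $\theta$, $Y_i\mid X_i=x\sim N(\theta x,\sigma^{2})$ with $\sigma^{2}$ known, and the prior is $\theta\sim N(\mu_0,\tau_0^{2})$. Then $r_\theta(x)=\theta x$, and the regression curve with respect to the posterior predictive distribution is the posterior mean of $\theta$ times $x$:
$$E_{R_{Q,n}^{z}}(Y\mid X=x)=E(\theta\mid z)\,x=\frac{\sigma^{-2}\sum_{i=1}^{n}x_iy_i+\tau_0^{-2}\mu_0}{\sigma^{-2}\sum_{i=1}^{n}x_i^{2}+\tau_0^{-2}}\;x,$$
i.e., in this particular model the Bayes estimator of the regression curve is a linear function of $x$ whose slope is the usual conjugate posterior mean.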
We wonder about the convergence to 0 of this Bayes risk as the sample size grows. Another question of interest is the consistency of this Bayes estimator.
Lemma 1 below is key to solving the problem, since it shows that the Bayes estimator of the regression curve becomes its conditional mean in the Bayesian experiment (3). What the following theorem really provides is the asymptotic behavior of this estimator: the convergence to zero of its Bayes risk and the strong consistency of the Bayes estimator of the regression curve.
Theorem 1. Let $(\Omega,\mathcal A,\{P_\theta\colon\theta\in(\Theta,\mathcal T,Q)\})$ be a Bayesian statistical experiment and $X\colon(\Omega,\mathcal A)\to(\mathcal X,\mathcal F)$ and $Y\colon(\Omega,\mathcal A)\to(\mathbb R,\mathcal B(\mathbb R))$ be two statistics, such that $E_\theta(Y^2)$ has a finite prior mean. Let us suppose that: (a) $(\mathcal X,\mathcal F)$ is a standard Borel space; (b) $\Theta$ is a Borel subset of a Polish space, and $\mathcal T$ is its Borel σ-field; and (c) the family $\{R_\theta\colon\theta\in\Theta\}$ is identifiable.
Then,
(i) The regression curve of $Y$ on $X$ with respect to the posterior predictive distribution $R_{Q,n}^{z}$,
$$\hat r_n(z,x):=E_{R_{Q,n}^{z}}(Y\mid X=x),$$
is the Bayes estimator of the regression curve for the squared error loss function, i.e.,
$$\int_\Theta\!\int_{(\mathcal X\times\mathbb R)^{n}}\!\int_{\mathcal X}\bigl(\hat r_n(z,x)-r_\theta(x)\bigr)^{2}\,dR_{\theta,1}(x)\,dR_\theta^{n}(z)\,dQ(\theta)\ \le\ \int_\Theta\!\int_{(\mathcal X\times\mathbb R)^{n}}\!\int_{\mathcal X}\bigl(m_n(z,x)-r_\theta(x)\bigr)^{2}\,dR_{\theta,1}(x)\,dR_\theta^{n}(z)\,dQ(\theta)$$
for any other estimator $m_n$ of the regression curve. (ii) Moreover, $(\hat r_n)_n$ is a strongly consistent estimator of the regression curve, in the sense that
$$\lim_{n\to\infty}\hat r_n\bigl((z_1,\dots,z_n),x\bigr)=r_\theta(x)\qquad\text{almost surely},$$
where $\hat r_n(z,x):=\hat r_n\bigl((z_1,\dots,z_n),x\bigr)$ if $z=(z_k)_{k\in\mathbb N}$. (iii) Finally, the Bayes risk of $\hat r_n$ converges to 0 for both the absolute error and the squared error loss functions, i.e.,
$$\lim_{n}\int_\Theta\!\int_{(\mathcal X\times\mathbb R)^{n}}\!\int_{\mathcal X}\bigl|\hat r_n(z,x)-r_\theta(x)\bigr|\,dR_{\theta,1}(x)\,dR_\theta^{n}(z)\,dQ(\theta)=0
\quad\text{and}\quad
\lim_{n}\int_\Theta\!\int_{(\mathcal X\times\mathbb R)^{n}}\!\int_{\mathcal X}\bigl(\hat r_n(z,x)-r_\theta(x)\bigr)^{2}\,dR_{\theta,1}(x)\,dR_\theta^{n}(z)\,dQ(\theta)=0.$$
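The following small simulation, again in the illustrative conjugate normal model used above (all names and numerical values chosen ad hoc, and not part of the paper), shows the behavior described in part (ii): the Bayes estimate of the slope, and hence of the regression curve, approaches the true one as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model: theta ~ N(mu0, tau0^2); given theta, X_i ~ N(0, 1) and
# Y_i | X_i = x ~ N(theta * x, sigma^2).  True regression curve: r(x) = theta * x.
mu0, tau0, sigma = 0.0, 2.0, 1.0
theta = rng.normal(mu0, tau0)            # "true" parameter drawn from the prior

def bayes_regression_slope(x, y):
    """Posterior mean of theta in the conjugate normal model (known variance)."""
    precision = (x**2).sum() / sigma**2 + 1.0 / tau0**2
    return ((x * y).sum() / sigma**2 + mu0 / tau0**2) / precision

for n in [10, 100, 1000, 10000]:
    x = rng.normal(size=n)
    y = rng.normal(theta * x, sigma)
    slope = bayes_regression_slope(x, y)
    # The Bayes estimate of the regression curve at a point x0 is slope * x0.
    print(n, theta, slope)
```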