Abstract
Statistical predictive analysis is a useful tool for predicting future observations. Previous literature has addressed both Bayesian and non-Bayesian predictive distributions of future statistics based on past sufficient statistics. This study focused on evaluating the Bayesian and Wald predictive-density functions of a future statistic V based on a past sufficient statistic W obtained from a normal distribution. Several divergence measures were used to assess the closeness of the predictive densities to the density of the future statistic, and the differences between these divergence measures were investigated using a simulation study. The two predictive densities were compared based on the power of a test, and a real data set was used to illustrate the results.
Keywords:
Bayesian predictive-density function; Wald predictive-density function; divergence measures; power of a test; normal distribution
MSC: 62F03; 62F15; 62C10
1. Introduction
One of the most fundamental and significant fields in statistics is predictive analysis, which involves extracting information from historical and current data to predict future trends and behaviors. One of the many possible forms of prediction is the Bayesian predictive approach, which was first introduced by Aitchison [], who demonstrated its advantage over plug-in predictive densities under the Kullback–Leibler (KL) divergence. Bayesian predictive-density estimation has been applied to different statistical models, including but not limited to the following: Aitchison and Dunsmore [], who obtained Bayesian predictive distributions based on random samples from binomial, Poisson, gamma, two-parameter exponential, and normal distributions; Escobar and West [], who discussed the application of Bayesian inference to density-estimation models using Dirichlet-process mixtures; and Hamura and Kubokawa [], who used a number of Bayesian predictive densities for prediction in the exponential distribution. Hamura and Kubokawa [] also studied Bayesian predictive density estimation for a chi-squared model, using information from a normal observation, under the Kullback–Leibler divergence.
Another form of prediction is the Wald predictive approach [], which follows a non-Bayesian framework. This approach was considered via the concept of predictive likelihood []. Awad and Saad [] reviewed several prediction procedures and compared them; the Wald predictive density was among these procedures.
One way to check the validity of the prediction procedure is by testing the closeness between the classical distribution of future statistics and the predictive distribution by using divergence measures between the two probability distributions. Several divergence measures have been defined in the literature and have been used to measure the distance between pairs of probability-density functions.
Considering two probability-density functions g and h, Kullback and Leibler [] introduced the Kullback–Leibler divergence measure, which quantifies the information gain between the two distributions. The Kullback–Leibler divergence measure belongs to the family of Shannon-entropy distance measures and is defined as
K(g, h) = ∫ g(x) ln[g(x)/h(x)] dx. (1)
Lin [] introduced the Jensen–Shannon divergence, which is a symmetric extension of the Kullback–Leibler divergence that has a finite value. In a similar approach, Johnson and Sinanovic [] provided a new divergence measure known as the resistor-average measure. This measure is closely related to the Kullback–Leibler divergence mentioned in (1), but it is symmetric.
Do et al. [] combined the similarity measurement of feature extraction into a joint modeling and classification scheme. They computed the Kullback–Leibler distance between the estimated models. Amin et al. [] used the Kullback–Leibler divergence to develop a data-based Bayesian-network learning strategy; the proposed approach captures the nonlinear dependence of high-dimensional process data.
Jeffreys [] introduced and studied a divergence measure called the Jeffreys distance or J-divergence measure, which is a symmetrization of the Kullback–Leibler divergence. The Jeffreys measure is a member of the family of Shannon-entropy distance measures and is defined as
J(g, h) = K(g, h) + K(h, g) = ∫ [g(x) − h(x)] ln[g(x)/h(x)] dx. (2)
Taneja et al. [] gave two different parametric generalizations of the Jeffreys measure. Cichocki [] discussed the basic characteristics of the extensive families of alpha, beta, and gamma divergences including the Jeffreys divergence. They linked these divergences and showed their connections to the Tsallis and Rényi entropies. Sharma et al. [] found the closeness between two probability distributions, using three similarity measures derived from the concepts of Jeffreys divergence and Kullback–Leibler divergence.
Rényi [] introduced the Rényi divergence measure, which is related to the Rényi entropy and depends on a parameter r called the order. The Rényi measure of order r is defined as
R_r(g, h) = (1/(r − 1)) ln ∫ g(x)^r h(x)^(1−r) dx,  r > 0, r ≠ 1. (3)
Note that lim_{r→1} R_r(g, h) = K(g, h); in other words, the Rényi measure of order 1 is essentially the Kullback–Leibler measure.
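As a purely numerical illustration (not part of the original derivations), the following Python sketch approximates the Kullback–Leibler and Rényi divergences between two hypothetical normal densities by quadrature and checks that the Rényi measure approaches the Kullback–Leibler measure as r → 1. The densities N(0, 1) and N(0.5, 1.5²), the grid, and the function names are illustrative assumptions, not quantities taken from this paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Two illustrative normal densities g and h (hypothetical parameter values).
x = np.linspace(-15.0, 15.0, 200_001)          # fine grid; tail mass beyond is negligible
g = stats.norm.pdf(x, loc=0.0, scale=1.0)
h = stats.norm.pdf(x, loc=0.5, scale=1.5)

def kl(g, h, x):
    """Equation (1): K(g, h) = integral of g * ln(g / h)."""
    return trapezoid(g * np.log(g / h), x)

def renyi(g, h, x, r):
    """Equation (3): R_r(g, h) = ln( integral of g**r * h**(1 - r) ) / (r - 1)."""
    return np.log(trapezoid(g**r * h**(1 - r), x)) / (r - 1)

print("KL(g, h)         :", kl(g, h, x))
print("Renyi, r = 0.999 :", renyi(g, h, x, 0.999))   # approaches the KL value
print("Renyi, r = 2     :", renyi(g, h, x, 2.0))
```

For r close to 1, the Rényi value agrees with the Kullback–Leibler value up to quadrature error.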
The Rényi measure is closely related to the Tsallis entropy used in physics []. Krishnamurthy et al. [] constructed a nonparametric estimator of the Rényi divergence between continuous distributions; their method consists of constructing estimators of certain integral functionals of two densities and transforming them into divergence estimators. Sason and Verdú [] derived integral formulas for the Rényi measure; using these formulas, one can obtain bounds on the Rényi divergence as a function of the variational distance, assuming bounded relative information.
Sharma and Autar [] suggested another generalization of the Kullback–Leibler information, called the relative information of type r, which is given by
K_r(g, h) = (1/(r − 1)) [ ∫ g(x)^r h(x)^(1−r) dx − 1 ],  r ≠ 1. (4)
Note that lim_{r→1} K_r(g, h) = K(g, h). Taneja and Kumar [] proposed a modified version of the relative-information-of-type-r measure as a parametric generalization of the Kullback–Leibler measure and then studied it in terms of Csiszár’s f-divergence.
Sharma and Mittal [] generalized the Kullback–Leibler divergence measure to the r-order-and-s-degree divergence measure, defined as
D_r^s(g, h) = (1/(s − 1)) { [ ∫ g(x)^r h(x)^(1−r) dx ]^((s−1)/(r−1)) − 1 },  r ≠ 1, s ≠ 1. (5)
From Equation (5), we can see that the earlier measures arise as special or limiting cases: letting s → 1 yields the Rényi measure of order r in (3), setting s = r yields the relative information of type r in (4), and letting both r → 1 and s → 1 yields the Kullback–Leibler measure in (1).
Taneja and Kumar [] conducted a detailed study of the r-order-and-s-degree divergence measure.
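The special cases noted above can also be checked numerically. The short sketch below assumes the standard Sharma–Mittal form written in Equation (5); the two normal densities and all names are again illustrative.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Hypothetical densities used only to illustrate the special cases of Equation (5).
x = np.linspace(-15.0, 15.0, 200_001)
g = stats.norm.pdf(x, 0.0, 1.0)
h = stats.norm.pdf(x, 0.5, 1.5)
A = lambda r: trapezoid(g**r * h**(1 - r), x)     # common integral in (3)-(5)

def sharma_mittal(r, s):
    """Equation (5): ((A(r))**((s - 1)/(r - 1)) - 1) / (s - 1)."""
    return (A(r)**((s - 1) / (r - 1)) - 1) / (s - 1)

r = 0.7
renyi_r = np.log(A(r)) / (r - 1)                  # Equation (3)
rel_info_r = (A(r) - 1) / (r - 1)                 # Equation (4)

print(sharma_mittal(r, s=1.0001), renyi_r)        # s -> 1 recovers the Renyi measure
print(sharma_mittal(r, s=r), rel_info_r)          # s = r recovers relative information of type r
```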
The chi-square divergence measure, or Pearson measure [], belongs to the family of squared-χ² distance measures and is defined as
χ²(g, h) = ∫ [g(x) − h(x)]² / h(x) dx = ∫ g(x)²/h(x) dx − 1. (6)
Note that χ²(g, h) = K_2(g, h); that is, the chi-square divergence coincides with the relative information of type r at r = 2.
The symmetric version of chi-square divergence is given in [,].
Hellinger [] defined the Hellinger divergence measure. This measure belongs to the family of squared-chord distance measures and is taken here as
H(g, h) = (1/2) ∫ [√g(x) − √h(x)]² dx = 1 − ∫ √(g(x) h(x)) dx. (7)
Note that 0 ≤ H(g, h) ≤ 1, with H(g, h) = 0 if and only if g = h, and that H(g, h) = (1/2) K_{1/2}(g, h).
González-Castro et al. [] estimated the class prior probabilities that minimize the divergence, using the Hellinger measure to quantify the disparity between the test-data distribution and validation distributions generated in a fully controlled manner. Recently, Dhumras and Bajaj [] proposed a new kind of Hellinger measure for single-valued neutrosophic hypersoft sets and applied it to the symptomatic detection of COVID-19 data.
Bhattacharyya [] defined the Bhattacharyya measure. This measure also belongs to the family of squared-chord distance measures and is defined as
B(g, h) = ∫ √(g(x) h(x)) dx. (8)
Note that 0 ≤ B(g, h) ≤ 1, with B(g, h) = 1 if and only if g = h.
The Bhattacharyya measure is closely related to the Hellinger measure defined in (7) through the relationship
B(g, h) = 1 − H(g, h). (9)
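Under the conventions adopted above, the Hellinger and Bhattacharyya measures of two densities sum to one, which the following short sketch verifies numerically for two hypothetical normal densities (a sanity check of Equations (7)–(9), not a result from the paper).

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

x = np.linspace(-15.0, 15.0, 200_001)
g = stats.norm.pdf(x, 0.0, 1.0)                                 # hypothetical density g
h = stats.norm.pdf(x, 0.5, 1.5)                                 # hypothetical density h

bhattacharyya = trapezoid(np.sqrt(g * h), x)                    # Equation (8)
hellinger = 0.5 * trapezoid((np.sqrt(g) - np.sqrt(h))**2, x)    # Equation (7)

print(bhattacharyya, hellinger, bhattacharyya + hellinger)      # the sum equals 1 up to quadrature error
```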
Aherne et al. [] presented the original geometric interpretation of the Bhattacharyya measure and explained the use of this metric in the Bhattacharyya bound. Patra et al. [] proposed a method for determining the similarity of a pair of users in sparse data; the method determines how relevant two rated items are to each other. Similar to both the Bhattacharyya and Hellinger measures, Pianka's measure [] between two probability distributions takes values between 0 and 1. Recently, Alhihi and Almheidat [] estimated Pianka's overlap coefficient for two exponential distributions. A complete study of several divergence measures can be found in [,,].
For this paper, we evaluated the Bayesian and the Wald predictive distributions of a future statistic based on a past sufficient statistic from samples taken from the normal density. Several divergence measures were used to test the closeness between the classical distribution of the future statistic and each of the two predictive distributions, Bayesian and Wald. We used the hypothesis-testing technique to compare the behavior of these divergence measures with respect to the closeness test between the two distributions. The main contribution of this work is a comprehensive study of eight divergence measures in the normal-distribution case using both the Bayesian and the non-Bayesian (Wald) predictive approaches. To the best of our knowledge, such a study has not previously appeared in the literature.
The rest of this paper is organized as follows: In Section 2, we evaluate the Bayesian and the Wald predictive distributions of the future statistic V based on the past sufficient statistic W from the normal density. In Section 3, we find divergence measures between the classical distribution of the future statistic and the predictive distributions found in Section 2. In Section 4, we obtain percentiles of the divergence measures and test the closeness of the two predictive distributions (Bayesian and Wald) to the classical distribution. A real-life application is presented in Section 5. Finally, we provide a brief conclusion in Section 6.
2. Predictive Distributions
Let the past observations be independent and identically distributed (i.i.d.) random variables and let the future observations be i.i.d. random variables, where the past and future samples are independent of each other. Consider a past sufficient statistic W and a future statistic V. In this section, we construct Bayesian and Wald predictive-density functions of the future statistic V based on the past sufficient statistic W from the normal density.
2.1. Bayesian Predictive Distribution
Let the past random variables have a common probability-density function (pdf) depending on a parameter θ, and let the future random variables have a pdf depending on the same θ. Assume that θ is a random variable with prior density π(θ). Consider a past sufficient statistic W with pdf f_W(w | θ) and a future statistic V with pdf f_V(v | θ). The posterior-density function [] of θ, given W = w, is defined as
π(θ | w) = f_W(w | θ) π(θ) / ∫ f_W(w | θ) π(θ) dθ. (10)
The following theorem provides a Bayesian method to construct predictive distributions.
Theorem 1
([]). Let W be a past sufficient statistic and let θ be a random variable with a prior density. Assume that V is a future statistic with pdf f_V(v | θ). The Bayesian predictive-density function of V, given W = w, is given by
h(v | w) = ∫ f_V(v | θ) π(θ | w) dθ, (11)
where π(θ | w) is the posterior-density function of θ, given W = w, defined in (10).
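To make the construction in Equation (11) concrete, the following sketch evaluates a Bayesian predictive density by numerically integrating the density of the future statistic against a posterior. The normal forms, the sum structure of V, and every numerical value below are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def f_v_given_theta(v, theta, m=10, delta=1.0):
    # Hypothetical future-statistic density: if V is the sum of m future
    # N(theta, delta^2) observations, then V | theta ~ N(m*theta, m*delta^2).
    return stats.norm.pdf(v, loc=m * theta, scale=np.sqrt(m) * delta)

def posterior(theta, post_mean=0.3, post_sd=0.2):
    # Hypothetical posterior density of theta given the observed past statistic W = w.
    return stats.norm.pdf(theta, loc=post_mean, scale=post_sd)

def bayes_predictive(v):
    """Equation (11): integrate the future-statistic density against the posterior."""
    integrand = lambda theta: f_v_given_theta(v, theta) * posterior(theta)
    return quad(integrand, -np.inf, np.inf)[0]

print(bayes_predictive(3.0))   # predictive density evaluated at v = 3
```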
The next two theorems provide the Bayesian predictive distributions for some statistics. Theorem 2 below presents the Bayesian predictive distribution of the future statistic V based on the past sufficient statistic W for random samples taken from the normal density; these statistics are chosen because they best describe the characteristics of the data set and are closely related to its mean. Note that W is a complete sufficient statistic for the parameter under consideration. A similar result is obtained in Theorem 4 for the same statistics V and W, to obtain the Wald predictive distribution.
We use the following notation to represent some of the distributions that appear in the results of Theorems 2 and 3 below: the normal distribution with a given mean and variance is denoted by N(·, ·); the gamma distribution with a given shape parameter and scale parameter is denoted by Gamma(·, ·); and the generalized inverted beta distribution with three shape parameters, the third of which is denoted by p, is denoted by InBe(·, ·, p).
Theorem 2.
Let the past and future random samples be drawn from the normal density, where δ is known. Assume that the past and future samples are independent, and let W and V be the past sufficient statistic and the future statistic, respectively. If δ is known and the mean parameter θ is unknown, with a normal prior distribution whose parameters a and b are assumed to be known, then the Bayesian predictive distribution of V, given W = w, is
Proof.
It is easy to see that the past sufficient statistic W, given θ, follows a normal distribution and that V, given θ, also follows a normal distribution. In other words, the classical distribution of the future statistic V, given θ, is
Now, using (10), the posterior-density function of θ, given W = w, is equal to
Therefore, the posterior-density function of θ, given W = w, is a normal distribution.
Applying Equation (11), the Bayesian predictive-density function of V, given W = w, can be derived as
By evaluating this integral, the Bayesian predictive-density function of V, given W = w, is
where
which represents the stated normal distribution. □
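A result of this type can be checked by Monte Carlo, using the two-stage (posterior-then-data) view of the Bayesian predictive. The sketch below assumes the parameterization X_i ~ N(theta, delta^2), W = sum of the X_i, V = sum of the future Y_j, and a N(a, b^2) prior on theta; this may differ from the paper's exact parameterization, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (hypothetical): past X_i ~ N(theta, delta^2), W = sum(X_i),
# future V = sum of m observations Y_j ~ N(theta, delta^2), prior theta ~ N(a, b^2).
n, m, delta, a, b = 12, 10, 1.0, 0.0, 2.0
theta_true = 0.5
w = rng.normal(theta_true, delta, size=n).sum()

# Standard normal-normal conjugate update for theta given W = w,
# using W | theta ~ N(n*theta, n*delta^2).
post_var = 1.0 / (n / delta**2 + 1.0 / b**2)
post_mean = post_var * (w / delta**2 + a / b**2)

# Two-stage sampling from the Bayesian predictive of V given w.
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=200_000)
v_draws = rng.normal(m * theta_draws, np.sqrt(m) * delta)

# The draws should be approximately normal with mean m*post_mean and
# variance m*delta^2 + m^2*post_var, matching a normal predictive density.
print(v_draws.mean(), m * post_mean)
print(v_draws.var(), m * delta**2 + m**2 * post_var)
```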
The following theorem presents the Bayesian predictive distribution of the future statistic V based on the past sufficient statistic W for random samples taken from the normal density. These statistics are again chosen because they describe the characteristics of the data set well; in this case, V and W are closely related to the variance of the data set, which is also commonly used to summarize data. Note that a similar result is obtained in Theorem 5 for the same statistics V and W, to obtain the Wald predictive distribution.
Theorem 3.
Let the past and future random samples be drawn from the normal density, where δ is known. Assume that the past and future samples are independent, and let W and V be the past sufficient statistic and the future statistic, respectively. If δ is known and the remaining parameter is unknown, with a Gamma prior distribution whose parameters a and b are assumed to be known, then the Bayesian predictive distribution of V, given W = w, is
Proof.
It is easy to see that the past sufficient statistic W, given the parameter, follows a gamma distribution and that V, given the parameter, also follows a gamma distribution. Using Equation (10), the posterior-density function of the unknown parameter, given W = w, is a gamma distribution. Applying Equation (11), the Bayesian predictive-density function of V, given W = w, can be derived as
As a result, the Bayesian predictive-density function of V, given , follows the generalized inverted beta distribution InBe. □
2.2. Wald Predictive Distribution
In this subsection, the Wald predictive-density function of the future statistic V based on the past sufficient statistic W from the normal density is derived.
Let the past random variables have a pdf depending on an unknown parameter θ and let the future random variables have a pdf depending on the same θ, where the past and future samples are independent. Consider a past sufficient statistic W with pdf f_W(w | θ) and a future statistic V with pdf f_V(v | θ). The Wald predictive-density function [] of V, given W = w, is defined as
q(v | w) = f_V(v | θ̂(w)), (16)
where θ̂(w) is the maximum likelihood estimator (MLE) of θ based on the distribution of the past sufficient statistic W; that is, θ̂(w) maximizes f_W(w | θ) over θ.
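A minimal sketch of the plug-in construction in Equation (16), under the same hypothetical normal setup as above: the MLE of the unknown mean is computed from W and substituted into the density of the future statistic. All names and values are assumptions used only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical setup: past X_i ~ N(theta, delta^2) with delta known, W = sum(X_i),
# future V = sum of m observations Y_j ~ N(theta, delta^2).
n, m, delta, theta_true = 12, 10, 1.0, 0.5
w = rng.normal(theta_true, delta, size=n).sum()

theta_hat = w / n            # MLE of theta based on W, since W ~ N(n*theta, n*delta^2)

def wald_predictive(v):
    """Equation (16): the density of V with theta replaced by its MLE."""
    return stats.norm.pdf(v, loc=m * theta_hat, scale=np.sqrt(m) * delta)

print(wald_predictive(m * theta_hat))   # predictive density at its mode
```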
The next two theorems present the Wald predictive distribution for some statistics. Theorem 4 considers the future statistic and the past sufficient statistic while Theorem 5 considers the future statistic and the past sufficient statistic
Theorem 4.
Let the past and future random samples be drawn from the normal density, where δ is known. Assume that the past and future samples are independent, and let W and V be the past sufficient statistic and the future statistic, respectively. If the mean parameter θ is unknown and δ is known, the Wald predictive distribution of V, given W = w, is
Proof.
Using the fact that the past sufficient statistic W, given θ, follows a normal distribution, we obtain
As a result, the MLE of θ is obtained. The distribution of the future statistic V, given θ, is normal. By applying Equation (16), the Wald predictive distribution of V, given W = w, is equal to
which represents the normal distribution as required. □
Theorem 5.
Let the past and future random samples be drawn from the normal density, where δ is known. Assume that the past and future samples are independent, and let W and V be the past sufficient statistic and the future statistic, respectively. If δ is known and the remaining parameter is unknown, the Wald predictive distribution of V, given W = w, is
Proof.
Using the fact that the past sufficient statistic W follows a gamma distribution, we obtain
As a result, the MLE of the unknown parameter is obtained. The distribution of the future statistic V follows a gamma distribution. By applying (16), we obtain the result that the Wald predictive distribution of V, given W = w, follows the stated gamma distribution, as required. □
3. Divergence Measures between the Classical Distribution of Future Statistic and Predictive Distributions
In this section, several divergence measures between the classical distribution of the future statistic V and the predictive distribution of V, given the past sufficient statistic W, are found for both prediction cases—Bayesian and Wald—for the future statistic and past sufficient statistic whose Bayesian and Wald predictive distributions were found in Theorems 2 and 4, respectively. The other case, corresponding to the statistics considered in Theorems 3 and 5, is left as a possible direction for future research.
The next theorem gives formulas for the following divergence measures between the classical distribution of the future statistic in Equation (13) and the Bayesian predictive distribution in Equation (14): the Kullback–Leibler measure (K), the Jeffreys measure (J), the Rényi measure (R_r), the relative-information-of-type-r measure (K_r), the r-order-and-s-degree measure (D_r^s), the chi-square measure (χ²), the Hellinger measure (H), and the Bhattacharyya measure (B).
For each of the aforementioned measures, we also find its average under the prior distribution of θ. For instance, in addition to the Kullback–Leibler measure K(g, h) between the two densities g and h, we also consider its average under the prior distribution of θ, E_θ[K(g, h)] = ∫ K(g, h) π(θ) dθ.
Similarly, the averages of the remaining divergence measures between the two probability-density functions g and h under the prior distribution of θ are denoted analogously.
Theorem 6.
Under the assumptions of Theorem 2, let g be the classical distribution of the future statistic V and let h be the Bayesian predictive distribution of V, given W = w, from the normal density. The values of the following divergence measures and their averages under the prior distribution of θ between g and h—(1) the Kullback–Leibler measure, (2) the Jeffreys measure, (3) the Rényi measure, (4) the relative-information-of-type-r measure, (5) the r-order-and-s-degree measure, (6) the chi-square measure, (7) the Hellinger measure, and (8) the Bhattacharyya measure—are equal to, respectively:
- (1)
- and
- (2)
- and
- (3)
- where andwhere
- (4)
- where andwhere
- (5)
- where andwhere
- (6)
- andwhere
- (7)
- andwhere
- (8)
- andwhere
where
Proof.
By using (13) and the result of Theorem 2, the following expectations can be calculated easily and will be used in the proof:
and
In addition, using the density g in (13) and the density h in (14), the following ratios (after simplification) will be used in the proof:
- (1)
- By using Equation (1), the Kullback–Leibler divergence between g and h is obtained from the expectations computed above. Now, to find the average of the Kullback–Leibler measure under the prior distribution of θ between g and h, we use the distributional assumptions of Theorem 2; taking this expectation yields the stated average.
- (2)
- From (2) and the Kullback–Leibler divergence calculated in part (1) above, the Jeffreys divergence between g and h follows. To find its average under the prior distribution of θ, we again use the distributional assumptions of Theorem 2, which yields the stated result.
- (3)
- First, we evaluate the integral ∫ g(v)^r h(v)^(1−r) dv and its average under the prior distribution of θ. Note that this integral converges only under the condition on r stated in part (3) of the theorem. From (3), we then obtain the Rényi divergence measure and its average under the prior distribution of θ between g and h, as required.
- (4)
- From the results of part (3) above and using Equation (4), we obtain the relative-information-of-type-r measure and its average under the prior distribution of θ between g and h, as required.
- (5)
- From the results of part (3) above and using Equation (5), we obtain the r-order-and-s-degree divergence measure and its average under the prior distribution of θ between g and h, as required.
- (6)
- Substituting r = 2 in the relative-information-of-type-r measure found in part (4) above, we obtain the chi-square divergence and its average under the prior distribution of θ between g and h, as required. Note that the chi-square divergence converges only under the condition stated in part (6).
- (7)
- Substituting r = 1/2 in the relative-information-of-type-r measure found in part (4) above and multiplying by 1/2, we obtain the Hellinger measure and its average under the prior distribution of θ between g and h, as required. Note that the integral involved converges only under the corresponding condition stated in part (7).
- (8)
- Using the relationship between the Bhattacharyya and Hellinger measures described in Equation (9), we obtain the Bhattacharyya measure and its average under the prior distribution of θ between g and h, as required.
□
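Because both g and h in Theorem 6 are normal densities, the Kullback–Leibler part of the theorem is an instance of the well-known closed form for two normal densities. The paper's expressions in terms of n, m, a, b, and δ are not reproduced here; the sketch below only states the generic normal-versus-normal formula and cross-checks it by quadrature with hypothetical parameter values.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def kl_normal(mu1, sd1, mu2, sd2):
    """Closed-form K( N(mu1, sd1^2), N(mu2, sd2^2) )."""
    return np.log(sd2 / sd1) + (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2) - 0.5

# Cross-check against direct numerical integration (hypothetical parameters).
mu1, sd1, mu2, sd2 = 4.8, 3.2, 5.0, 3.5
x = np.linspace(mu1 - 30 * sd1, mu1 + 30 * sd1, 400_001)
g = stats.norm.pdf(x, mu1, sd1)
h = stats.norm.pdf(x, mu2, sd2)

print(kl_normal(mu1, sd1, mu2, sd2), trapezoid(g * np.log(g / h), x))
```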
In the same way, Theorem 7 below provides similar results for the same divergence measures considered in Theorem 6, but this time between the classical distribution of the future statistic in Equation (13) and the Wald predictive distribution in Equation (18).
Theorem 7.
Under the assumptions of Theorem 4, let g be the classical distribution of the future statistic V and let q be the Wald predictive distribution of V, given W = w, from the normal density. The values of the following divergence measures and their averages under the distribution of W between g and q—(1) the Kullback–Leibler measure, (2) the Jeffreys measure, (3) the Rényi measure, (4) the relative-information-of-type-r measure, (5) the r-order-and-s-degree measure, (6) the chi-square measure, (7) the Hellinger measure, and (8) the Bhattacharyya measure—are equal to, respectively:
- (1)
- (2)
- (3)
- (4)
- (5)
- (6)
- (7)
- (8)
Proof.
As the distribution of V, given θ, is normal and, by Theorem 4, the Wald predictive distribution of V, given W = w, is also normal, the following ratios (after simplification) will be used in the proof:
- (1)
- The Kullback–Leibler measure between g and q, defined in Equation (1), is evaluated directly from the two normal densities. Now, the average of the Kullback–Leibler measure under the distribution of W between g and q is obtained by taking the expectation of this quantity with respect to W.
- (2)
- First, we need the Kullback–Leibler divergence in the reverse direction, which is obtained by applying definition (1). From the definition in (2) and the Kullback–Leibler divergence calculated in part (1) above, the Jeffreys divergence between g and q follows. Now, to find the average of the Jeffreys measure under the distribution of W between g and q, we use the average found in part (1) above together with the corresponding expectation of the reverse divergence.
- (3)
- First, we evaluate the integral ∫ g(v)^r q(v)^(1−r) dv and its average under the distribution of W. Note that this integral converges only under the condition on r stated in part (3). From the definition in Equation (3), we obtain the Rényi divergence measure and its average under the distribution of W between g and q, as required.
- (4)
- From the results of part (3) above, and using the definition in Equation (4), we obtain the relative-information-of-type-r measure and its average under the distribution of W between g and q, as required.
- (5)
- From the results of part (3) above, and using Equation (5), we obtain the r-order-and-s-degree divergence measure and its average under the distribution of W between g and q, as required.
- (6)
- Substituting r = 2 in the relative-information-of-type-r measure found in part (4) above, we obtain the chi-square divergence and its average under the distribution of W between g and q, as required.
- (7)
- Substituting r = 1/2 in the relative-information-of-type-r measure found in part (4) above and then multiplying by 1/2, we obtain the Hellinger measure and its average under the distribution of W between g and q, as required.
- (8)
- Using the relationship between the Bhattacharyya and Hellinger measures described in Equation (9), we obtain the Bhattacharyya measure and its average under the distribution of W between g and q, as required.
□
In the next section, we test the closeness between the classical distribution and the predictive distributions (Bayesian and Wald), using several divergence measures. In addition, we present a simulation study to compare the two prediction approaches based on the power of a test.
4. Simulation Study
In the previous section, we used several divergence measures to determine the distance between the classical distribution g of a future statistic and each of the predictive distributions, h (Bayesian approach) and q (Wald approach), for random samples taken from the normal density. For this section, we used a simulation study to achieve two goals. Firstly, we tested the closeness between g and h, based on the first seven divergence measures found in Theorem 6, and the closeness between g and q, based on the first seven divergence measures found in Theorem 7, and we determined whether these divergence measures behaved differently with respect to the closeness test under the two prediction approaches. Secondly, we determined which of the two prediction approaches was more appropriate in the current setting.
To achieve the first goal, we employed hypothesis testing to assess the closeness between the classical distribution and the predictive distribution under each of the two prediction approaches. This technique consisted of two steps: first, percentiles of the divergence measures were simulated for use as critical values in the decision rule; next, the hypothesis test of closeness between the classical and the predictive distributions was carried out, with closeness measured by the divergence measures.
For the second goal, the power of the test was determined numerically, based on the results of the simulation procedure, to decide which of the two prediction approaches was more appropriate in the current setting.
Simulation of Percentiles
To find the percentiles of the divergence measures, we applied the following steps (a schematic code sketch of this procedure is given after the list):
- A random sample of size n was generated from the past underlying distribution and was used to calculate W, as defined in Theorem 2.
- Based on the obtained value of W, the divergence measures and their averages under the prior distribution of θ between g and h, derived in (1)–(7) of Theorem 6, were calculated for fixed n, m, a, b, and δ.
- Steps (1) and (2) above were repeated times.
- The values of each of the simulated divergence measures obtained in step (3) were used to produce the simulated percentiles.
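The following sketch mirrors steps (1)–(4) for the Kullback–Leibler measure under the hypothetical normal parameterization assumed in the earlier sketches; the paper's exact formulas for the predictive parameters are not reproduced, and all numerical settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Schematic percentile simulation: X_i ~ N(theta0, delta^2), W = sum(X_i),
# classical g: V | theta0 ~ N(m*theta0, m*delta^2),
# Bayesian predictive h: V | w ~ N(m*post_mean, m*delta^2 + m^2*post_var).
n, m, delta, a, b, theta0 = 12, 10, 1.0, 0.0, 2.0, 0.5
n_rep = 10_000

def kl_normal(mu1, sd1, mu2, sd2):
    return np.log(sd2 / sd1) + (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2) - 0.5

divergences = np.empty(n_rep)
for i in range(n_rep):
    w = rng.normal(theta0, delta, size=n).sum()                    # step (1)
    post_var = 1.0 / (n / delta**2 + 1.0 / b**2)
    post_mean = post_var * (w / delta**2 + a / b**2)
    divergences[i] = kl_normal(m * theta0, np.sqrt(m) * delta,     # step (2)
                               m * post_mean,
                               np.sqrt(m * delta**2 + m**2 * post_var))

# Step (4): simulated upper percentiles, used later as critical values.
print(np.percentile(divergences, [90, 95, 99]))
```

The resulting upper percentiles play the role of the critical values tabulated in Tables S1 and S2.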
Table S1 in the Supplementary Materials presents the percentiles of the divergence measures (1)–(7) found in Theorem 6 between the classical distribution of future statistic and the Bayesian predictive distribution based on the past sample from and the future sample from for . The number of simulated samples was
Table S2 in the Supplementary Materials presents the percentiles of the average divergence measures (1)–(7) found in Theorem 6 between the classical distribution of the future statistic and the Bayesian predictive distribution based on the past sample from and the future sample from . The number of simulated samples was
The above percentile-simulation procedure, steps (1) through (4), was also performed for the Wald predictive-distribution approach. In this case, the divergence measures derived in (1)–(7) of Theorem 7 were calculated for fixed n, m, and δ. Table S3 in the Supplementary Materials shows the percentiles of the divergence measures (1)–(7) found in Theorem 7 between g and q based on the past sample from and the future sample from . The number of simulated samples was
Testing Closeness
At this stage, we tested the closeness of the future density g and the Bayesian predictive density h, using the following hypothesis:
at significance level α, where d was the distance between g and h, measured using the divergence measures (1)–(7) found in Theorem 6. In order to apply the closeness criteria, a simulation was carried out with the following steps:
- All parameters n, m, a, b, and δ were fixed.
- One sample of size n was generated from the underlying distribution, and was used to calculate W.
- For the value w obtained in step (2), a test statistic was calculated based on each of the divergence measures and their averages under the prior distribution of θ between g and h.
- The test statistic obtained in step (3) was compared to the corresponding critical value from the simulated percentiles in Tables S1 and S2 in the Supplementary Materials at significance level α.
- On the basis of each divergence measure, the decision was to reject the null hypothesis (R) if the test statistic exceeded the critical value and to fail to reject it (FTR) otherwise; equivalently, using the p-value criterion, the null hypothesis was rejected if the p-value was less than α and not rejected otherwise.
Table 1 gives the test statistics for the divergence measures (1)–(7) of Theorem 6 between g and h for the selected parameter settings, together with the decisions (FTR, R) for testing the hypothesis in (24) at α = 0.01, 0.05, 0.1.
Table 1.
Hypothesis-testing decisions for closeness based on divergence measures between the classical distribution of future statistics and the Bayesian predictive distribution.
Table 2 gives the test statistics for the average divergence measures given in (1)–(7) of Theorem 6 for the selected parameter settings, and the decisions (FTR, R) for testing the hypothesis in (24) at α = 0.01, 0.05, 0.1.
Table 2.
Hypothesis-testing decisions for closeness, based on the average divergence measures between the classical distribution of future statistics and the Bayesian predictive distribution.
The above closeness-testing procedure, steps (1) through (5), was also performed for the Wald predictive-distribution approach. In this case, the divergence measures derived in (1)–(7) of Theorem 7 were used to measure the distance between the future density g and the Wald predictive density q for fixed n, m, and δ, and the percentiles in Table S3 in the Supplementary Materials were used as the comparison criterion to make a decision. Table 3 shows the test statistics for the divergence measures given in (1)–(7) of Theorem 7 for the selected parameter settings, and the decisions (FTR, R) for testing the hypothesis in (24) at α = 0.01, 0.05, 0.1.
Table 3.
Hypothesis-testing decisions for closeness based on divergence measures between the classical distribution of future statistics and the Wald predictive distribution.
From Table 1, Table 2 and Table 3, we can see that all the divergence measures gave the same decision for each case in hypothesis (24). As a result, we could choose any of these divergence measures to test the closeness between the classical distribution and the predictive distribution in a normal case.
The Power of a Test
To simulate the power of the test, as described previously, we applied the following steps (a schematic sketch in code follows the list):
- The percentile points were taken from Table S2 in the Supplementary Materials for the Bayesian predictive-distribution approach (for fixed parameters n, m, a, b, and δ) and from Table S3 in the Supplementary Materials for the Wald predictive-distribution approach (for fixed parameters n, m, and δ), at significance level α.
- A random sample of size n was generated from the underlying distribution at a given parameter value, and was used to calculate W.
- For the value w obtained in step (2), we calculated the test statistic based on each of the divergence measures (1)–(7) of Theorem 6 for the Bayesian predictive-distribution approach and on the divergence measures (1)–(7) of Theorem 7 for the Wald predictive-distribution approach.
- The values obtained in step (3) were compared to the percentile points from step (1).
- Steps (2)–(4) above were repeated 10,000 times.
- The power of the test was calculated as the percentage of repetitions in which the test statistic exceeded the corresponding percentile point.
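The power computation can be sketched in the same schematic way: simulate the critical value under the reference parameter value and then estimate the rejection rate at alternative values. The parameterization, the symbols theta0 and theta1, the grid of alternatives, and the 10,000 replications are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, delta, a, b = 12, 10, 1.0, 0.0, 2.0
theta0, alpha, n_rep = 0.5, 0.05, 10_000

def kl_normal(mu1, sd1, mu2, sd2):
    return np.log(sd2 / sd1) + (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2) - 0.5

def divergence_stat(w, theta_ref):
    # KL between the classical density of V at theta_ref and the Bayesian
    # predictive density given w (hypothetical parameterization, as before).
    post_var = 1.0 / (n / delta**2 + 1.0 / b**2)
    post_mean = post_var * (w / delta**2 + a / b**2)
    return kl_normal(m * theta_ref, np.sqrt(m) * delta,
                     m * post_mean, np.sqrt(m * delta**2 + m**2 * post_var))

# Critical value from the simulation under the reference value theta0.
null_stats = [divergence_stat(rng.normal(theta0, delta, n).sum(), theta0)
              for _ in range(n_rep)]
crit = np.percentile(null_stats, 100 * (1 - alpha))

# Empirical power at a few alternative values theta1 (power should grow
# toward 1 as theta1 moves away from theta0, and stay near alpha at theta0).
for theta1 in [0.5, 0.7, 1.0, 1.5]:
    alt_stats = [divergence_stat(rng.normal(theta1, delta, n).sum(), theta0)
                 for _ in range(n_rep)]
    print(theta1, np.mean(np.array(alt_stats) > crit))
```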
Table 4 and Table 5 give the values of the power of the test for the hypothesis in (24), where d was measured using the divergence measures (1)–(7) of Theorem 6 and Theorem 7, respectively, at significance level α. The number of simulated samples was 10,000.
Table 4.
Values of the power function around the hypothesized parameter value for the Bayesian prediction approach.
Table 5.
Values of the power function around the hypothesized parameter value for the Wald prediction approach.
From Table 4, we can see that the power depended on the value of the parameter: as the parameter moved farther from its hypothesized value, the power approached 1. On the other hand, Table 5 shows that the power oscillated around a constant value and did not depend on the parameter. Thus, the Bayesian predictive distribution was better at predicting future data than the Wald predictive distribution.
To check the validity of the power results, plots of the power function versus the parameter value are shown in Figure 1 and Figure 2 for the Bayesian and the Wald predictive-distribution approaches, respectively.
Figure 1.
The power function for the Bayesian predictive-distribution approach.
Figure 2.
The power function for the Wald predictive-distribution approach.
5. Application
In this section, we present an application in which the Bayesian predictive density is used to make a prediction about a future statistic based on a past sufficient statistic calculated from a real data set. In this application, we consider data set number 139 from [], which consists of measurements of forearm length (in inches) taken from 140 adult males. The distribution of this data set is normal, with a given mean and variance.
By using the following transformation,
we transform this data set to
The following data comprise a random sample of size 18 taken from the transformed data set:
From this data set, the value of the past sufficient statistic W was calculated. At significance level α, we want to test the hypothesis in (24).
As all divergence measures give the same decision regarding the hypothesis testing in (24), we use the Kullback–Leibler divergence measure in (1) to calculate the distance d between the classical distribution g and the Bayesian predictive distribution h.
The value of the Kullback–Leibler test statistic was compared with the corresponding critical value obtained from the percentile Table S1, and the test statistic did not exceed it. Furthermore, the value of the average Kullback–Leibler test statistic was compared with the corresponding critical value from the percentile Table S2, and again it did not exceed it. Thus, the decision is to fail to reject the null hypothesis. In other words, it is appropriate to use the Bayesian predictive distribution to predict the sum of the forearm lengths of males; as a result, we can predict the average forearm length of males in general.
6. Conclusions
In this article, the Bayesian and Wald predictive distributions of a future statistic V based on a past sufficient statistic W from the normal density were derived. Several divergence measures were used as criteria to measure the distance between the classical distribution of the future statistic and each of the Bayesian and Wald predictive distributions. Hypothesis testing was used to test the closeness between the classical distribution and each of the Bayesian and Wald predictive distributions. The simulation results showed that all divergence measures used in this paper led to the same decisions in all cases; therefore, it is recommended to use the divergence measure with the simplest computations. Based on the power of the test, we conclude that the Bayesian predictive distribution is better than the Wald predictive distribution in the normal-distribution case.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/sym16020212/s1.
Author Contributions
Conceptualization, S.A. (Suad Alhihi), M.A. and A.A.; methodology, S.A. (Suad Alhihi), M.A. and R.A.A.; software, S.A. (Suad Alhihi), M.A., G.A. and S.A. (Samer Alokaily); validation, S.A. (Suad Alhihi), M.A., R.A.A. and S.A. (Samer Alokaily); formal analysis, G.A. and A.A.; investigation, S.A. (Suad Alhihi), R.A.A. and A.A.; resources, G.A. and R.A.A.; data curation, A.A.; writing—original draft preparation, S.A. (Suad Alhihi) and A.A.; writing—review and editing, M.A., G.A. and S.A. (Samer Alokaily); visualization, S.A. (Samer Alokaily); supervision, M.A. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Data are contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Aitchison, J. Goodness of prediction fit. Biometrika 1975, 62, 547–554.
- Aitchison, J.; Dunsmore, I. Statistical Prediction Analysis; Cambridge University Press: Cambridge, NY, USA, 1975.
- Escobar, M.; West, M. Bayesian density estimation and inference using mixtures. J. Am. Stat. Assoc. 1995, 90, 577–588.
- Hamura, Y.; Kubokawa, T. Bayesian predictive density estimation with parametric constraints for the exponential distribution with unknown location. Metrika 2021, 85, 515–536.
- Hamura, Y.; Kubokawa, T. Bayesian predictive density estimation for a Chi-squared model using information from a normal observation with unknown mean and variance. J. Stat. Plan. Inference 2022, 217, 33–51.
- Wald, A. Setting of tolerance limits when the sample is large. Ann. Math. Stat. 1942, 13, 389–399.
- Bjornstad, J. Predictive likelihood: A review. Stat. Sci. 1990, 242–254.
- Awad, A.; Saad, T. Predictive Density Functions: A Comparative Study. Pak. J. Stat. 1987, 3, 91–118.
- Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
- Johnson, D.; Sinanovic, S. Symmetrizing the Kullback-Leibler Distance. IEEE Trans. Inf. Theory 2001. Available online: https://www.ece.rice.edu/~dhj/resistor.pdf (accessed on 1 December 2023).
- Do, M.; Vetterli, M. Wavelet-based texture retrieval using generalized Gaussian density and Kullback-Leibler distance. IEEE Trans. Image Process. 2002, 11, 146–158.
- Amin, M.; Khan, F.; Ahmed, S.; Imtiaz, S. A data-driven Bayesian network learning method for process fault diagnosis. Process Saf. Environ. Prot. 2021, 150, 110–122.
- Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. 1946, 186, 453–461.
- Taneja, I.; Pardo, L.; Morales, D.; Menéndez, M. On generalized information and divergence measures and their applications: A brief review. Qüestiió Quad. d’Estadística Investig. Oper. 1989, 13, 47–73.
- Cichocki, A.; Amari, S. Families of alpha-, beta-, and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568.
- Sharma, K.; Seal, A.; Yazidi, A.; Selamat, A.; Krejcar, O. Clustering uncertain data objects using Jeffreys-divergence and maximum bipartite matching based similarity measure. IEEE Access 2021, 9, 79505–79519.
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; Volume 4, pp. 547–562.
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
- Krishnamurthy, A.; Kandasamy, K.; Poczos, B.; Wasserman, L. Nonparametric estimation of Rényi divergence and friends. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 919–927.
- Sason, I.; Verdú, S. f-divergence Inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
- Sharma, B.; Autar, R. Relative Information Functions and Their Type (α, β) Generalizations. Metrika 1974, 21, 41–50.
- Taneja, I.; Kumar, P. Relative information of type s, Csiszár’s f-divergence, and information inequalities. Inf. Sci. 2004, 166, 105–125.
- Sharma, B.; Mittal, D. New non-additive measures of relative information. J. Comb. Inf. Syst. Sci. 1977, 2, 122–132.
- Pearson, K. On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is such that it Can be Reasonably Supposed to have Arisen from Random Sampling. Philos. Mag. 1900, 50, 157–172.
- Taneja, I. On symmetric and nonsymmetric divergence measures and their generalizations. Adv. Imaging Electron Phys. 2005, 138, 177–250.
- Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 1909, 136, 210–271.
- González-Castro, V.; Alaiz-Rodríguez, R.; Alegre, E. Class distribution estimation based on the Hellinger distance. Inf. Sci. 2013, 218, 146–164.
- Dhumras, H.; Bajaj, R. On Novel Hellinger Divergence Measure of Neutrosophic Hypersoft Sets in Symptomatic Detection of COVID-19. Neutrosophic Sets Syst. 2023, 55, 16.
- Bhattacharyya, A. Several Analogues to the Amount of Information and Their Uses in Statistical Estimation. Sankhya 1946, 8, 315–328.
- Aherne, F.; Thacker, N.; Rockett, P. The Bhattacharyya metric as an absolute similarity measure for frequency coded data. Kybernetika 1998, 34, 363–368.
- Patra, B.; Launonen, R.; Ollikainen, V.; Nandi, S. A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data. Knowl.-Based Syst. 2015, 82, 163–177.
- Pianka, E. Niche Overlap and Diffuse Competition. Proc. Natl. Acad. Sci. USA 1974, 71, 2141–2145.
- Alhihi, S.; Almheidat, M. Estimation of Pianka Overlapping Coefficient for Two Exponential Distributions. Mathematics 2023, 11, 4152.
- Abu Alfeilat, H.; Hassanat, A.; Lasassmeh, O.; Tarawneh, A.; Alhasanat, M.; Eyal Salman, H.; Prasath, V. Effects of distance measure choice on k-nearest neighbor classifier performance: A review. Big Data 2019, 7, 221–248.
- Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Boca Raton, FL, USA, 2018.
- Fisher, R. The Logic of Inductive Inference. J. R. Stat. Soc. 1935, 98, 39–82.
- Hand, D.; Fergus, D.; McConway, D.; Ostrowski, E. A Handbook of Small Data Sets; CRC Press: Boca Raton, FL, USA, 1993.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).