1. Introduction
When the bootstrap sample size B is small or moderate, bootstrap confidence regions, including bootstrap confidence intervals, tend to have undercoverage: the probability that the confidence region contains the  parameter vector  is less than the nominal large-sample coverage probability . Then, coverage can be increased by increasing the nominal coverage of the large-sample bootstrap confidence region. For example, if the undercoverage of the nominal large-sample 95% bootstrap confidence region with  is 2%, the coverage is increased to 97%. This procedure is known as calibrating the confidence region. Calibration tends to be difficult since the amount of undercoverage is usually unknown. This paper provides a simple method for improving the coverage and provides a method for visualizing some bootstrap confidence regions.
Using correction factors for large-sample confidence intervals, tests, prediction intervals, prediction regions, and confidence regions improves the coverage performance for a moderate sample size n. If confidence regions are used for hypothesis testing, then this calibration reduces the type I error. For a random variable X, let . Note that correction factors  as  are used in large-sample confidence intervals and large-sample tests if the limiting distribution is  or , but a  or  cutoff is used:  with  and  with  if  as . For moderate n, the test or confidence interval with the correction factor  has better level or coverage than the test or confidence interval that does not use the correction factor, in that the simulated level or coverage is closer to the nominal level or coverage.
Sometimes, the test statistic has a  or  distribution under normality, but the test statistic (possibly scaled by multiplying by k) is asymptotically normal or asymptotically  for a large class of distributions. The t test and t confidence interval for the sample mean are examples where the asymptotic normality holds by the central limit theorem. Many F tests for linear models, experimental design models, and multivariate analyses also satisfy  as , where  is the test statistic. See, for example, Olive (2017) [1].
Section 1.1 reviews prediction intervals, prediction regions, confidence intervals, and confidence regions. Several of these methods use correction factors to improve the coverage, and several bootstrap confidence intervals and regions are obtained by applying prediction intervals and regions to the bootstrap sample. Section 1.2 reviews a bootstrap theorem and shows that some bootstrap confidence regions are asymptotically equivalent. Section 2.1 gives a new bootstrap confidence region with a simple correction factor, while Section 2.2 shows how to visualize some bootstrap confidence regions. Section 3 presents some simulation results.
1.1. Prediction Regions and Confidence Regions
Consider predicting a future test value  given past training data , where  are independent and identically distributed (iid). A large-sample  prediction interval (PI) for  is , where the coverage  is eventually bounded below by  as . We often want  as . A large-sample  PI is asymptotically optimal if it has the shortest asymptotic length: the length of  converges to  as , where  is the population shorth, the shortest interval covering at least  of the mass.
Let the data  have joint probability density function or probability mass function  with parameter space  and support . Let  and  be statistics such that . Then, the interval  is a large-sample  confidence interval (CI) for  if
        
        is eventually bounded below by  for all  as the sample size .
Consider predicting a  future test value , given past training data , where  are iid. A large-sample  prediction region is a set  such that  is eventually bounded below by  as . A prediction region is asymptotically optimal if its volume converges in probability to the volume of the minimum volume covering region or the highest density region of the distribution of 
A large-sample  confidence region for a  vector of parameters  is a set  such that  is eventually bounded below by  as  For testing  versus , we fail to reject  if  is in the confidence region and reject  if  is not in the confidence region.
For prediction intervals, let  be the order statistics of the training data. Open intervals need more regularity conditions than closed intervals. For the following prediction interval, if the open interval  were used, we would need to add the regularity condition that the population percentiles  and  are continuity points of the cumulative distribution function . See Frey (2013) [2] for references.
Let  and , where . A large-sample  percentile prediction interval for  is
        
The bootstrap percentile confidence interval given by Equation (2) is obtained by applying the percentile prediction interval (1) to the bootstrap sample , where  is a test statistic. See Efron (1982) [3].
A large-sample  bootstrap percentile confidence interval for  is an interval  containing  of the . Let  and . A common choice is
        
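As a concrete illustration of the percentile method, the following Python sketch (the paper's software is in R, so this is only an illustration; the index choice below is one common convention and may differ slightly from the exact indices in Equation (2)) sorts the bootstrap statistics and clips roughly half the nominal error from each tail:

```python
import numpy as np

def percentile_ci(boot_stats, delta=0.05):
    """Percentile interval from a bootstrap sample T*_1, ..., T*_B:
    sort the bootstrap statistics and cut roughly delta/2 of the
    mass from each tail.  The index choice is one common convention,
    not necessarily the exact Equation (2) indices."""
    t = np.sort(np.asarray(boot_stats, dtype=float))
    B = len(t)
    lo = max(int(np.ceil(B * delta / 2)) - 1, 0)        # 0-based lower index
    hi = min(int(np.ceil(B * (1 - delta / 2))) - 1, B - 1)
    return t[lo], t[hi]

# toy example: nonparametric bootstrap of the sample mean of iid N(0,1) data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
boots = np.array([rng.choice(x, size=200, replace=True).mean()
                  for _ in range(1000)])
low, high = percentile_ci(boots, delta=0.05)
```

By construction, the interval contains about  of the bootstrap statistics, which is the defining property of the percentile interval above.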
The shorth(c) estimator of the population shorth is useful for making asymptotically optimal prediction intervals. For a large-sample  PI, the nominal coverage is . Undercoverage occurs if the actual coverage is below the nominal coverage. For example, if the actual coverage is 0.93 for a large-sample 95% PI, then the undercoverage is 0.02. Consider intervals that contain c cases . Compute . Then, the shorth(c) estimator  is the interval with the shortest length. The shorth(c) interval is a large-sample  PI if  as  and often has the asymptotically shortest length. Let . Frey (2013) [2] showed that for large  and iid data, the large-sample  shorth() prediction interval has maximum undercoverage ≈, and then used the large-sample  PI shorth(c) =
        
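A minimal Python sketch of the shorth(c) computation follows (the paper's software is in R). The correction term 1.12√(δ/n) used below is a hedged reading of Frey's undercoverage bound; the exact c used in PI (3) may differ:

```python
import numpy as np

def shorth_pi(y, delta=0.05, correct=True):
    """Shorth(c) prediction interval: among the intervals
    [y_(i), y_(i+c-1)] containing c order statistics, return the
    shortest one.  With correct=True, c is inflated using the
    1.12*sqrt(delta/n) term (a hedged reading of Frey's bound);
    otherwise c = ceil(n(1 - delta))."""
    y = np.sort(np.asarray(y, dtype=float))
    n = len(y)
    q = 1 - delta + (1.12 * np.sqrt(delta / n) if correct else 0.0)
    c = min(int(np.ceil(n * q)), n)
    widths = y[c - 1:] - y[:n - c + 1]      # lengths of all candidate intervals
    i = int(np.argmin(widths))
    return y[i], y[i + c - 1]

rng = np.random.default_rng(1)
y = rng.normal(size=500)
lo, hi = shorth_pi(y, delta=0.05)
```

Applying the same routine to a bootstrap sample  gives the shorth confidence interval of the next paragraph.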
The shorth confidence interval is a practical implementation of Hall’s (1988) [4] shortest bootstrap percentile interval based on all possible bootstrap samples, and is obtained by applying the shorth PI (3) to the bootstrap sample . See Pelawa Watagoda and Olive (2021) [5]. The large-sample  shorth(c) CI =
        
To describe Olive’s (2013) [6] nonparametric prediction region, Mahalanobis distances will be useful. Let the  column vector  be a multivariate location estimator, and let the  symmetric positive definite matrix  be a dispersion estimator. Then, the ith squared sample Mahalanobis distance is the scalar
        
        for each observation , where . Notice that the Euclidean distance of  from the estimate of center T is , where  is the  identity matrix. The classical Mahalanobis distance  uses , the sample mean and sample covariance matrix, where
        
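The squared sample distances can be computed directly. A short Python sketch with the classical choices (sample mean and sample covariance) is:

```python
import numpy as np

def mahalanobis_sq(X, T=None, C=None):
    """Squared sample Mahalanobis distances
    D_i^2 = (x_i - T)' C^{-1} (x_i - T) for each row x_i of X.
    Defaults give the classical distances: T = sample mean,
    C = sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    if T is None:
        T = X.mean(axis=0)
    if C is None:
        C = np.cov(X, rowvar=False)
    diff = X - T
    # row-wise quadratic forms diff_i' C^{-1} diff_i
    return np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
d2 = mahalanobis_sq(X)
```

A useful check on the classical distances: they sum to exactly (n − 1)p, since the sum of the quadratic forms equals the trace of the inverse covariance times (n − 1) times the covariance.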
Let the  location vector be , which is often the population mean, and let the  dispersion matrix be , which is often the population covariance matrix. If x is a random vector, then the population squared Mahalanobis distance is
        
Like prediction intervals, prediction regions often need correction factors. For iid data from a distribution with a  nonsingular covariance matrix, it was found that the simulated maximum undercoverage of prediction region (9) without the correction factor was about 0.05 when . Hence, correction factor (8) is used to obtain better coverage for small n. Let  for  and
        
If  and , set . Let  be the th sample quantile of the , where . Olive (2013) [6] suggests that  may be needed for the following prediction region to have a good volume, and  for good coverage. Of course, for any n, there are distributions that will have severe undercoverage.
The large-sample  nonparametric prediction region for a future value  given iid data  is
        
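A Python sketch of the region's membership test follows. Since correction factor (8) is not reproduced here, the uncorrected quantile level 1 − δ is used in its place, so this is only an illustration of the mechanics:

```python
import numpy as np

def np_prediction_region(X, delta=0.05):
    """Nonparametric prediction region sketch: a future z is in the
    region iff D_z^2(xbar, S) <= the empirical quantile of the
    training D_i^2.  The paper's small-sample correction factor (8)
    raises the quantile level; this sketch uses 1 - delta uncorrected."""
    X = np.asarray(X, dtype=float)
    T = X.mean(axis=0)
    Cinv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - T
    d2 = np.einsum('ij,jk,ik->i', diff, Cinv, diff)
    cutoff = np.quantile(d2, 1 - delta)

    def contains(z):
        dz = np.asarray(z, dtype=float) - T
        return float(dz @ Cinv @ dz) <= cutoff

    return contains, cutoff

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
contains, cutoff = np_prediction_region(X, delta=0.05)
```

For iid N(0, I) data in two dimensions, the cutoff is near the 0.95 quantile of a chi-squared distribution with 2 degrees of freedom, about 5.99.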
Olive’s (2017, 2018) [1,7] prediction region method confidence region applies prediction region (9) to the bootstrap sample. Let the bootstrap sample be . Let  and  be the sample mean and sample covariance matrix of the bootstrap sample.
The large-sample  prediction region method confidence region for  is
        
        where the cutoff  is the th sample quantile of the  for . Note that the corresponding test for  rejects  if .
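The corresponding test can be sketched in Python as follows (again with the uncorrected quantile level 1 − δ, so this only illustrates the mechanics of region (10)):

```python
import numpy as np

def pr_method_reject(boot, theta0, delta=0.05):
    """Prediction region method test sketch: reject H0: theta = theta0
    iff theta0 lies outside confidence region (10), i.e. iff the
    squared Mahalanobis distance of theta0 from the bootstrap sample
    mean exceeds the sample quantile of the bootstrap distances.
    Correction factors are omitted in this sketch."""
    boot = np.asarray(boot, dtype=float)
    Tbar = boot.mean(axis=0)
    Sinv = np.linalg.inv(np.cov(boot, rowvar=False))
    diff = boot - Tbar
    d2 = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
    cutoff = np.quantile(d2, 1 - delta)
    d0 = float((theta0 - Tbar) @ Sinv @ (theta0 - Tbar))
    return d0 > cutoff          # True means reject H0

rng = np.random.default_rng(4)
boot = rng.normal(size=(1000, 2))   # stand-in for a bootstrap sample
```

A point far from the bootstrap data cloud is rejected, while the cloud's own center is not.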
Olive’s (2017, 2018) [1,7] large-sample  modification of Bickel and Ren’s (2001) [8] confidence region is
        
        where the cutoff  is the th sample quantile of the . Note that the corresponding test for  rejects  if .
Shift region (9) to have center , or equivalently, change the cutoff of region (11) to , to obtain Pelawa Watagoda and Olive’s (2021) [5] large-sample  hybrid confidence region,
        
Note that the corresponding test for  rejects  if .
Rajapaksha and Olive (2024) [9] gave the following two confidence regions. The names of these confidence regions were chosen since they are similar to Bickel and Ren’s and the prediction region method’s confidence regions.
The large-sample  BR confidence region is
        
        where the cutoff  is the th sample quantile of the . Note that the corresponding test for  rejects  if .
The large-sample  PR confidence region for  is
        
        where  is the th sample quantile of the  for . Note that the corresponding test for  rejects  if .
Assume that  are iid . Then, Chew’s (1966) [10] large-sample  classical prediction region for multivariate normal data is
        
The next bootstrap confidence region is similar to what would be obtained if the classical prediction region (15) for multivariate normal data were applied to the bootstrap sample. The large-sample  standard bootstrap confidence region for  is
        
        where  or , where  as .
If , then a hyperellipsoid is an interval, and confidence intervals are special cases of confidence regions. Suppose the parameter of interest is , and there is a bootstrap sample , where the statistic  is an estimator of  based on a sample of size n. Let  and let . Let  and  be the sample mean and variance of . Then, the squared Mahalanobis distance  is equivalent to , which is an interval centered at  just long enough to cover  of the . Efron (2014) [11] used a similar large-sample  confidence interval assuming that  is asymptotically normal. Then, the large-sample  PR CI is . The large-sample  BR CI is , which is an interval centered at  just long enough to cover  of the . The large-sample  hybrid CI is .
The following prediction region will be used to develop a new correction factor for bootstrap confidence regions. See Section 2.1. Data splitting divides the training data  into two sets: H and the validation set V, where H has  of the cases and V has the remaining  cases .
The estimator  is computed using data set H. Then, the squared validation distances  are computed for the  cases in the validation set V. Let  be the th order statistic of the , where
        
Haile, Zhang, and Olive’s (2024) [12] large-sample  data splitting prediction region for  is
        
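The data splitting computation can be sketched in Python. The exact order-statistic index in Equation (18) is not reproduced in this excerpt, so k = ⌈n_V(1 − δ)⌉ is a hedged stand-in:

```python
import numpy as np

def split_cutoff(X, delta=0.05, frac_H=0.5, seed=0):
    """Data splitting sketch: estimate (T, C) on the set H, compute
    the squared validation distances on V, and take the order
    statistic D^2_(k) as the cutoff.  k = ceil(n_V (1 - delta)) is a
    hedged stand-in for the exact index in Equation (18)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    idx = rng.permutation(len(X))
    nH = int(len(X) * frac_H)
    H, V = X[idx[:nH]], X[idx[nH:]]
    T = H.mean(axis=0)
    Cinv = np.linalg.inv(np.cov(H, rowvar=False))
    diff = V - T
    d2 = np.sort(np.einsum('ij,jk,ik->i', diff, Cinv, diff))
    k = min(int(np.ceil(len(V) * (1 - delta))), len(V))
    return d2[k - 1]

rng = np.random.default_rng(5)
cut = split_cutoff(rng.normal(size=(1000, 2)), delta=0.05)
```

Because the validation distances are computed from an estimator fitted on a disjoint set, they behave like iid distances of new cases, which is what makes the order statistic a valid cutoff.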
1.2. Some Confidence Region Theory
Some large-sample theory for bootstrap confidence regions is given in the references in Section 1.1. The following theorem of Pelawa Watagoda and Olive (2021) [5] and its proof are useful.
Theorem 1.  (a) Suppose as , (i) , and (ii)  with  and . Then, (iii) , (iv) , and (v) .
(b) Then, the prediction region method gives a large-sample confidence region for  provided that  and the sample percentile  of the  is a consistent estimator of the percentile  of the random variable  in that 
Proof.  With respect to the bootstrap sample,  is a constant, and the  are iid for . Fix B. Then,
          
          where the  are iid with the same distribution as u. For fixed B, the average of the  is
          
          by the Continuous Mapping Theorem, where  is an asymptotic multivariate normal approximation. Note that if , then
          
Hence, as ,  and (iii), (iv), and (v) hold. Hence, (b) follows.    □
Under regularity conditions, Bickel and Ren (2001), Olive (2017, 2018), and Pelawa Watagoda and Olive (2021) [1,5,7,8] proved that (10), (11), and (12) are large-sample confidence regions. For Theorem 1, usually (i) and (ii) are proven using large-sample theory. Then,
        
        are well behaved. If , then , and (13) and (14) are large-sample confidence regions. If  is “not too ill conditioned,” then  for large n, and confidence regions (13) and (14) will have coverage near . See Rajapaksha and Olive (2024) [9].
If  and , where U has a unimodal probability density function symmetric about zero, then the confidence intervals from Section 1.1, including (2) and (3), are asymptotically equivalent (use the central proportion of the bootstrap sample, asymptotically). See Pelawa Watagoda and Olive (2021) [5].
  3. Results
Example 1.  We generated  for . The coordinate-wise median was the statistic . The nonparametric bootstrap was used with  for the 90% confidence region (10). Then, the th sample quantile of the  is the 90.4% quantile. The DD plot of the bootstrap sample is shown in Figure 1. This bootstrap sample was a rather poor sample: the plotted points cluster about the identity line, but for most bootstrap samples, the clustering is tighter (as in Figure 2). The vertical line MD = 2.9098 is the cutoff for the prediction region method 90% confidence region (10). Hence, the points to the left of the vertical line correspond to , which are inside confidence region (10), while the points to the right of the vertical line correspond to , which are outside of confidence region (10). The long horizontal line RD = 3.0995 is the cutoff using the robust estimator. When , under mild regularity conditions, . The short horizontal line is RD = 2.8074, and MD = 2.8074 =  is approximately the cutoff  that would be used by the standard bootstrap confidence region (mentally drop a vertical line from where the short horizontal line ends at the identity line). Variability in DD plots increases as MD increases.
Inference after variable selection is an example where the undercoverage of confidence regions can be quite high. See, for example, Kabaila (2009) [23]. Variable selection methods often use the Schwarz (1978) [24] BIC criterion, the Mallows (1973) [25]  criterion, or lasso due to Tibshirani (1996) [26]. To describe a variable selection model, we will follow Rathnayake and Olive (2023) [27] closely. Consider regression models where the response variable Y depends on the  vector of predictors x only through . Multiple linear regression models, generalized linear models, and proportional hazards regression models are examples of such regression models. Then, a model for variable selection can be described by
      
      where  is a  vector of predictors,  is an  vector, and  is a  vector. Given that  is in the model, , and E denotes the subset of terms that can be eliminated given that the subset S is in the model. Since S is unknown, candidate subsets will be examined. Let  be the vector of a terms from a candidate subset indexed by I, and let  be the vector of the remaining predictors (out of the candidate submodel). Then,
      
Suppose that S is a subset of I and that model (20) holds. Then,
      
      where  denotes the predictors in I that are not in . Underfitting occurs if submodel I does not contain S.
To clarify the notation, suppose that , a constant  corresponding to , is always in the model, and . Then, there are  possible subsets of  that contain 1, including  and . There are  subsets such that . Let  and  The full model uses 
Let  correspond to the set of predictors selected by a variable selection method such as forward selection or lasso variable selection. If  is , use zero padding to form the  vector  from  by adding 0s corresponding to the omitted variables. For example, if  and , then the observed variable selection estimator  As a statistic,  with probabilities  for , where there are J subsets, e.g., . Then, the variable selection estimator , and  with probabilities  for , where there are J subsets.
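Zero padding is mechanical; the following short Python sketch illustrates it (the index set and coefficient values are hypothetical, and the paper's software is in R):

```python
import numpy as np

def zero_pad(beta_I, idx_I, p):
    """Zero padding: embed the a x 1 submodel estimate beta_hat_I
    (coefficients for the predictors indexed by idx_I) into a p x 1
    vector, with 0s for the omitted variables."""
    beta = np.zeros(p)
    beta[np.asarray(idx_I)] = np.asarray(beta_I, dtype=float)
    return beta

# hypothetical submodel keeping predictors 0 and 2 of a p = 4 model
padded = zero_pad([1.5, -0.7], [0, 2], 4)
```

The padded vector has the submodel coefficients in their original positions and zeros elsewhere, so all bootstrap replicates live in the same p-dimensional space.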
Assume p is fixed. Suppose model (20) holds, and that if , where the dimension of  is , then , where  is the covariance matrix of the asymptotic multivariate normal distribution. Then,
      
      where  adds columns and rows of zeros corresponding to the  not in , and  is singular unless  corresponds to the full model. This large-sample theory holds for many models.
If  are pairwise disjoint and if , then the collection of sets  is a partition of . Then, the Law of Total Probability states that if  form a partition of S such that  for , then
      
Let sets  satisfy  for . Define  if . Then, a Generalized Law of Total Probability is
      
Pötscher (1991) [28] used the conditional distribution of  to find the distribution of . Let  be a random vector from the conditional distribution . Let . Denote  by . Then, Pötscher (1991) [28] used the Generalized Law of Total Probability to prove that the cumulative distribution function (cdf) of  is
      
Hence,  has a mixture distribution of the  with probabilities , and  has a mixture distribution of the  with probabilities 
For the following Rathnayake and Olive (2023) [27] theorem, the first assumption is  as . Then, the variable selection estimator corresponding to  underfits with probability going to zero, and the assumption holds under regularity conditions if BIC or AIC is used for many parametric regression models such as GLMs. See Charkhi and Claeskens (2018) [29] and Claeskens and Hjort (2008, pp. 70, 101, 102, 114, 232) [30]. This assumption is a necessary condition for a variable selection estimator to be a consistent estimator. See Zhao and Yu (2006) [31]. Thus, if a sparse estimator that performs variable selection is a consistent estimator of , then  as . Hence, Theorem 2 proves that the lasso variable selection estimator is a  consistent estimator of  if lasso is consistent. Charkhi and Claeskens (2018) [29] showed that  if  for the maximum likelihood estimator with AIC, and gave a forward selection example. For a multiple linear regression model where S is the model with exactly one predictor that can be deleted, only  and  are positive. If the  criterion is used, then it can be shown that  and . Theorem 2 proves that w is a mixture distribution of the  with probabilities .
Theorem 2.  Assume  as , and let  with probabilities , where  as . Denote the positive  by . Assume . Then,
      
      where the cdf of w is .
Rathnayake and Olive (2023) [27] suggested the following bootstrap procedure. Use a bootstrap method for the full model, such as the nonparametric bootstrap or the residual bootstrap, and then compute the full model estimator and the variable selection estimator from the bootstrap data set. Repeat this B times to obtain the bootstrap samples for the full model and for the variable selection model. They could only prove that the bootstrap procedure works under very strong regularity conditions such as a  in Theorem 2, where  is known as the oracle property. See Claeskens and Hjort (2008, pp. 101–114) [30] for references for the oracle property. For many statistics, a bootstrap data cloud  and a data cloud from B iid statistics  tend to have similar variability. Rathnayake and Olive (2023) [27] suggested that when T is the variable selection estimator , the bootstrap data cloud often has more variability than the iid data cloud, and that this result tends to increase the bootstrap confidence region coverage.
For variable selection with the  vector , consider testing  versus  with , where oftentimes, . Then, let  and let  for . The shorth estimator can be applied to a bootstrap sample  to obtain a confidence interval for . Here,  and . The simulations used , , and . Let the multiple linear regression model  for . Hence,  with  ones and  zeros.
The regression models used the residual bootstrap with the forward selection estimator . Table 1 gives results for when the iid errors  with , , and . Table 1 shows two rows for each model, giving the observed confidence interval coverages and average lengths of the confidence intervals. The nominal coverage was 95%. The term “reg” is for the full model regression, and the term “vs” is for forward selection. The last six columns give results for the tests. The terms pr, hyb, and br are for prediction region method (10), hybrid region (12), and Bickel and Ren region (11). The 0 indicates that the test was  versus , while the 1 indicates that the test was  versus . The length and coverage = P(fail to reject ) for the interval  or , where  or  is the cutoff for the confidence region. The cutoff will often be near  if the statistic T is asymptotically normal. Note that  is close to 2.45 for the full model regression bootstrap tests. For the full model,  len  as  for the simulated data, and the shorth 95% confidence intervals have simulated length . The variable selection estimator and the full model estimator were similar for  and . The two estimators differed for  and  because  often occurred for  and 4. In particular, the confidence interval coverages for the variable selection estimator were very high, but the average lengths were shorter than those for the full model. If  was never selected, then  for all runs, and the confidence interval would be [0, 0] with 100% coverage and zero length.
Note that for the variable selection estimator with , the average cutoff values were near 2.7 and 3.0, which are larger than the  cutoff 2.448. Hence, using the standard bootstrap confidence region (16) would result in undercoverage. For , the bootstrap estimator often appeared to be approximately multivariate normal. Example 2 illustrates this result with a DD plot.
Example 2.  We generated  and  for  with the  iid  and . Then, we examined several bootstrap methods for multiple linear regression variable selection. The nonparametric bootstrap draws n cases  with replacement from the n original cases and then selects variables on the resulting data set, resulting in . If  is , use zero padding to form the  vector  from  by adding 0s corresponding to the omitted variables. Repeat  times to obtain the bootstrap sample . Typically, the full model  or the submodel  that omitted  was selected. The residual bootstrap using the full model residuals was also used, where  for , where the  are sampled with replacement from the full model residuals . Forward selection and backward elimination could be used with the  or BIC criterion, or lasso could be used to perform the variable selection. Let  be obtained from  by leaving out the fifth value. Hence, if , then . Figure 2 shows the DD plot for the confidence region corresponding to the  using forward selection with the  criterion. This confidence region corresponds to the test , e.g., . Plots created with backward elimination and lasso were similar. Rathnayake and Olive (2023) [27] obtained the large-sample theory for the variable selection estimators  for multiple linear regression and many other regression methods. The limiting distribution is a complicated non-normal mixture distribution by Theorem 2, but in simulations where S is known, the  often appeared to have an approximate multivariate normal distribution.
A small simulation study was conducted on large-sample 95% confidence regions. The coordinate-wise median was used since this statistic is moderately difficult to bootstrap. We used 5000 runs. Then, coverage within [0.94, 0.96] suggests that the true coverage is near the nominal coverage 0.95. The simulation used 10 distributions, where xtype = 1 for ; xtype = 2, 3, 4, and 5 for ; xtype = 6, 7, 8, and 9 for a multivariate  with d = 3, 5, 19, or d given by the user; and xtype = 10 for a log-normal distribution shifted to have the coordinate-wise median = 0. If w corresponds to one of the above distributions, then  with . Then, the population coordinate-wise median is 0 for each distribution. Table 2 shows the coverages and average cutoffs for four large-sample confidence regions: (10); (19) with ; (19) with ; and (19) with . The coverage is the proportion of times that the confidence region contained , where  is a  vector. Each confidence region has a cutoff  that depends on the bootstrap sample, and the average of the 5000 cutoffs is given. Here,  for confidence region (10), while  for confidence region (19), where the cutoff also depends on . The coverages were usually between 0.94 and 0.96. The average cutoffs for the prediction region method’s large-sample 95% confidence region tended to be very close to the average cutoffs for confidence region (19) with . Note that  and  are the cutoffs for the standard bootstrap confidence region (16). The ratio of volumes of the two confidence regions is volume (10)/volume (19) .
  4. Discussion
The bootstrap is due to Efron (1979) [32]. Also, see Efron (1982) [3] and Bickel and Freedman (1981) [33]. Ghosh and Polansky (2014) and Politis and Romano (1994) [34,35] are useful references for bootstrap confidence regions. For a small dimension p, nonparametric density estimation can be used to construct confidence regions and prediction regions. See, for example, Hall (1987) and Hyndman (1986) [36,37]. Visualizing a bootstrap confidence region is useful for checking whether the asymptotic normal approximation for the statistic is good, since the plotted points will then tend to cluster tightly about the identity line. Making five plots corresponding to five bootstrap samples can be used to check the variability of the plots and the probability of obtaining a bad sample. For Example 1, most of the bootstrap samples produced plots that had tighter clustering about the identity line than the clustering in Figure 1.
The new bootstrap confidence region (19) used the fact that bootstrap confidence region (10) is simultaneously a prediction region for a future bootstrap statistic  and a confidence region for  with the same asymptotic coverage . Hence, increasing the coverage as a prediction region also increases the coverage as a confidence region. The data splitting technique used to increase the coverage only depends on the  being iid with respect to the bootstrap distribution. Correction factor (8) increases the coverage, but this calibration technique needed intensive simulation.
Calibrating a bootstrap confidence region is useful for several reasons. For simulations, computation time can be reduced if B can be reduced. Using correction factor (8) is faster than using the two-sample bootstrap of Section 2.1, but the two-sample bootstrap can be used to check the accuracy of (8), as in Table 2 with . For a nominal 95% prediction region, correction factor (8) increases the coverage on the training data to at most 97.5%. Coverage for test data  tends to be worse than coverage for training data. Using the cutoff  of (8) gives better coverage than using cutoff  with . The two calibration methods in this paper were first applied to prediction regions, and they work for bootstrap confidence regions (10) and (11) since those two regions are also prediction regions for .
Plots and simulations were conducted in R. See R Core Team (2020) [38]. Welagedara (2023) [39] lists some R functions for bootstrapping several statistics. The programs used are in the collection of functions slpack.txt. See http://parker.ad.siu.edu/Olive/slpack.txt, accessed on 1 August 2024. The function ddplot4 applied to the bootstrap sample can be used to visualize the bootstrap prediction region method’s confidence region. The function medbootsim was used for Table 2. Some functions for bootstrapping multiple linear regression variable selection with the residual bootstrap are belimboot for backward elimination using , bicboot for forward selection using BIC, fselboot for forward selection using , lassoboot for lasso variable selection, and vselboot for all-subsets variable selection with .