A Comparison of Existing Bootstrap Algorithms for Multi-Stage Sampling Designs
Abstract
1. Introduction
2. The Setup
- (i)
- A sample $S_I$ of psus, of size $n$, is selected according to a given sampling design with first-order inclusion probabilities $\pi_i$ and second-order inclusion probabilities $\pi_{ij}$. Finally, let $w_i = \pi_i^{-1}$ denote the design weight attached to the $i$th sampled psu.
- (ii)
- In the $i$th psu sampled at the first stage, $i \in S_I$, a subsample $S_i$ of the elements of $U_i$, of size $m_i$, is selected according to a given sampling design with first-order inclusion probabilities $\pi_{k \mid i}$ and second-order inclusion probabilities $\pi_{k\ell \mid i}$. Subsampling in a given psu is carried out independently of subsampling in any other psu; the corresponding expansion estimator is shown below.
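Under this setup, a population total $Y = \sum_{i=1}^{N} \sum_{k \in U_i} y_{ik}$ is typically estimated by an expansion estimator of the following form, where $w_{ik}$ denotes the usual double-expansion weight:

```latex
\hat{Y} \;=\; \sum_{i \in S_I} \sum_{k \in S_i} w_{ik}\, y_{ik},
\qquad
w_{ik} \;=\; \frac{1}{\pi_i \, \pi_{k \mid i}} .
```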
3. Bootstrap Procedures for Simple Random Sampling without Replacement at Both Stages
3.1. The Rescaling Bootstrap Algorithm
- Step 1.
- Draw a sample of $n$ psus from $S_I$, according to simple random sampling with replacement.
- Step 2.
- From each psu selected in Step 1, select a sample of $m_i$ elements from $S_i$ according to simple random sampling with replacement. For a psu selected more than once in Step 1, perform independent subsampling.
- Step 3.
- Let $y_{ik}^{*}$ be the $y$-value of the $k$th bootstrap element in the $i$th bootstrap psu, let $M_i^{*}$ be the $M_i$-value of the $i$th bootstrap psu, and let $m_i^{*}$ be defined similarly. The rescaled values $\tilde{y}_{ik}$ are then obtained by applying the Rao–Wu rescaling [2] to the $y_{ik}^{*}$, using the first- and second-stage sampling fractions.
- Step 4.
- Compute $\hat{\theta}^{*}$ using the same formulae that were used to obtain the original point estimator, applied to the rescaled values $\tilde{y}_{ik}$.
- Step 5.
- Repeat Steps 1–4 a large number of times, B, to obtain $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
- Step 6.
- The bootstrap variance estimator is $v_{\mathrm{B}}(\hat{\theta}) = \mathbb{V}_{*}(\hat{\theta}^{*})$, where $\mathbb{V}_{*}$ denotes the variance with respect to the resampling mechanism, conditional on the original sample. In practice, the Monte Carlo approximation of $\mathbb{V}_{*}(\hat{\theta}^{*})$ based on $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ is applied, as illustrated below.
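The Monte Carlo approximation referred to in Step 6 (and in the closing step of each algorithm that follows) can be taken to be of the following standard form, where the divisor $B-1$ could equally be replaced by $B$:

```latex
v_{\mathrm{MC}}(\hat{\theta})
  \;=\; \frac{1}{B-1} \sum_{b=1}^{B}
        \left( \hat{\theta}^{*(b)} - \bar{\theta}^{*} \right)^{2},
\qquad
\bar{\theta}^{*} \;=\; \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*(b)} .
```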
3.2. The Mirror-Match Bootstrap Algorithm
- Step 1.
- Choose $n'$ ($1 \le n' < n$) and draw a sample of $n'$ psus from $S_I$, according to simple random sampling without replacement.
- Step 2.
- Repeat Step 1 $k$ times independently to obtain a bootstrap sample of psus of size $n^{*} = k n'$, where $k$ is chosen so that the bootstrap mimics the first-stage sampling fraction; for simple random sampling without replacement this gives $k = \{n(1 - f')\}/\{n'(1 - f)\}$ with $f = n/N$ and $f' = n'/n$. A sketch of this resampling structure is given after the algorithm.
- Step 3.
- Choose $m_i'$ ($1 \le m_i' < m_i$) and draw, according to simple random sampling without replacement, $m_i'$ units within the $i$th psu obtained in Steps 1 and 2.
- Step 4.
- Repeat Step 3 $k_i$ times independently to obtain a bootstrap sample of size $m_i^{*} = k_i m_i'$ from the $i$th psu drawn in Steps 1 and 2, where $k_i$ is defined analogously to $k$ in Step 2, with $m_i$, $m_i'$ and $f_{2i} = m_i/M_i$ in place of $n$, $n'$ and $f$.
- Step 5.
- Compute $\hat{\theta}^{*}$ using the same formulae that were used to obtain the original point estimator.
- Step 6.
- Repeat Steps 1–5 a large number of times, B, to obtain $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
- Step 7.
- The bootstrap variance estimator is $v_{\mathrm{B}}(\hat{\theta}) = \mathbb{V}_{*}(\hat{\theta}^{*})$. In practice, the Monte Carlo approximation of $\mathbb{V}_{*}(\hat{\theta}^{*})$ based on $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ is applied.
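The psu-level resampling of Steps 1 and 2 can be sketched as follows; the function name is illustrative, and the inputs `n_prime` and `k` must be chosen as described above (Sitter [3] ties them to the first-stage sampling fraction).

```python
import numpy as np

def mirror_match_psu_resample(psu_ids, n_prime, k, rng=None):
    """Draw k independent SRSWOR subsamples of size n_prime from the sampled
    psus and concatenate them (Steps 1-2 of the mirror-match bootstrap)."""
    rng = np.random.default_rng(rng)
    draws = [rng.choice(psu_ids, size=n_prime, replace=False) for _ in range(k)]
    return np.concatenate(draws)          # bootstrap psu sample of size k * n_prime

# Example: n = 10 sampled psus, n' = 4, repeated k = 3 times.
boot_psus = mirror_match_psu_resample(np.arange(10), n_prime=4, k=3, rng=1)
```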
3.3. The Without-Replacement Bootstrap Algorithm
- Step 1:
- Create a pseudo-population by replicating each psu in $S_I$ $N/n$ times and each unit within the $i$th psu $M_i/m_i$ times. Let $U^{*}$ be the resulting pseudo-population consisting of $N^{*}$ psus, $U_j^{*}$, of size $M_j^{*}$, where for each $j$ there exists $i \in S_I$ such that $U_j^{*}$ is a copy of $U_i$. Let $M^{*} = \sum_{j=1}^{N^{*}} M_j^{*}$ be the total number of elements in the pseudo-population.
- Step 2:
- From the pseudo-population $U^{*}$, select a sample of psus, $S_I^{*}$, of size $n$ according to simple random sampling without replacement. In each selected psu, select a sample, $S_i^{*}$, of size $m_i$ according to simple random sampling without replacement. A sketch of Steps 1 and 2 is given after the algorithm.
- Step 3:
- Compute $\hat{\theta}^{*}$ using the formulae that were used to obtain the original point estimator.
- Step 4:
- Repeat Steps 2 and 3 a large number of times, B, to obtain $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
- Step 5:
- The bootstrap variance estimator is $v_{\mathrm{B}}(\hat{\theta}) = \mathbb{V}_{*}(\hat{\theta}^{*})$. In practice, the Monte Carlo approximation of $\mathbb{V}_{*}(\hat{\theta}^{*})$ based on $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ is applied.
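A minimal sketch of Steps 1 and 2 for the simplest case in which $N/n$ and $M_i/m_i$ are integers; the function name, data layout and this integer assumption are illustrative rather than part of the original algorithm.

```python
import numpy as np

def bwo_resample(sample, N, M, rng=None):
    """One without-replacement bootstrap replicate for two-stage SRSWOR.

    sample : dict {psu_id: 1-d array of the m_i sampled y-values in psu i}
    N      : number of psus in the population
    M      : dict {psu_id: M_i, the number of elements in psu i}
    Assumes N/n and M_i/m_i are integers (the replication factors of Step 1).
    """
    rng = np.random.default_rng(rng)
    n = len(sample)
    # Step 1: pseudo-population of psus; each pseudo-psu keeps its own m_i.
    pseudo = []
    for i, y in sample.items():
        m_i = len(y)
        psu_copy = np.repeat(y, M[i] // m_i)        # replicate elements M_i/m_i times
        pseudo += [(psu_copy, m_i)] * (N // n)      # replicate the psu N/n times
    # Step 2: mimic the original design on the pseudo-population.
    idx = rng.choice(len(pseudo), size=n, replace=False)
    return [rng.choice(pseudo[j][0], size=pseudo[j][1], replace=False) for j in idx]

# Toy example: n = 2 sampled psus out of N = 4; M_i = 6 and m_i = 3 in each psu.
boot = bwo_resample({0: np.array([1., 2., 3.]), 1: np.array([4., 5., 6.])},
                    N=4, M={0: 6, 1: 6}, rng=0)
```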
3.4. The Bernoulli Bootstrap Algorithm
- Step 1.
- Draw a sample, $S_I^{*}$, of size $n$ from the original sample of clusters, $S_I$, according to simple random sampling with replacement. Generate $n$ Bernoulli random variables, $\delta_1, \ldots, \delta_n$, with the success probability given in Funaoka et al. [17]. For each $i$, keep the $i$th cluster in the bootstrap sample and go to Step 2 if $\delta_i = 1$, and replace the $i$th cluster with one randomly selected cluster from $S_I$ if $\delta_i = 0$. The keep-or-replace mechanism is sketched after the algorithm.
- Step 2.
- For each cluster $i$ kept in Step 1, draw a sample, $S_i^{*}$, of size $m_i$ from the original sample $S_i$ according to simple random sampling with replacement. Generate $m_i$ Bernoulli random variables, $\delta_{i1}, \ldots, \delta_{i m_i}$, with the success probability given in Funaoka et al. [17]. For each $k$, keep the $k$th element in the bootstrap sample, $S_i^{*}$, if $\delta_{ik} = 1$, and replace it with one randomly selected element from $S_i$ if $\delta_{ik} = 0$.
- Step 3.
- Compute $\hat{\theta}^{*}$ using the formulae that were used to obtain the original point estimator.
- Step 4.
- Repeat Steps 1–3 a large number of times, B, to obtain $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
- Step 5.
- The bootstrap variance estimator is $v_{\mathrm{B}}(\hat{\theta}) = \mathbb{V}_{*}(\hat{\theta}^{*})$. In practice, the Monte Carlo approximation of $\mathbb{V}_{*}(\hat{\theta}^{*})$ based on $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ is applied.
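The keep-or-replace mechanism of Step 1 can be sketched as follows; the keep probability `p1` is left as an input (its value is specified in Funaoka et al. [17]), and all names are illustrative.

```python
import numpy as np

def bernoulli_psu_resample(psu_ids, p1, rng=None):
    """Step 1 of the Bernoulli bootstrap: draw n clusters with replacement,
    keep each drawn cluster with probability p1, and otherwise replace it
    by a cluster drawn at random from the original sample."""
    rng = np.random.default_rng(rng)
    n = len(psu_ids)
    draw = rng.choice(psu_ids, size=n, replace=True)       # SRSWR draw
    keep = rng.random(n) < p1                               # Bernoulli(p1) indicators
    replacement = rng.choice(psu_ids, size=n, replace=True) # random replacements
    return np.where(keep, draw, replacement)

boot_psus = bernoulli_psu_resample(np.arange(8), p1=0.9, rng=2)
```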
3.5. The Preston Bootstrap Weights Algorithm
- Step 1.
- Draw a sample of $n'$ psus from $S_I$, according to simple random sampling without replacement. Let $\delta_i = 1$ if the $i$th psu is selected and $\delta_i = 0$ otherwise.
- Step 2.
- Define the psu bootstrap weights by rescaling the original psu weights $w_i$ using the selection indicators $\delta_i$ and the first-stage sampling fraction, as in Preston [18]; a sketch of this weight construction follows the algorithm.
- Step 3.
- Within each of the $n'$ psus selected in Step 1, draw a simple random sample without replacement of size $m_i'$. Let $\delta_{k \mid i} = 1$ if the $k$th element in the $i$th psu is selected and $\delta_{k \mid i} = 0$ otherwise. The conditional element bootstrap weights are then defined by rescaling the original conditional weights $w_{k \mid i}$ using these indicators and the sampling fractions at both stages, as in Preston [18].
- Step 4.
- Compute $\hat{\theta}^{*}$ using the formulae that were used to obtain the original point estimator with the original weights $w_{ik}$ replaced by the bootstrap weights $w_{ik}^{*}$.
- Step 5.
- Repeat Steps 1–4 a large number of times, B, to obtain $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
- Step 6.
- The bootstrap variance estimator is $v_{\mathrm{B}}(\hat{\theta}) = \mathbb{V}_{*}(\hat{\theta}^{*})$. In practice, the Monte Carlo approximation of $\mathbb{V}_{*}(\hat{\theta}^{*})$ based on $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ is applied.
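A sketch of the first-stage weight construction in Steps 1 and 2, using the generic rescaling factor for a without-replacement bootstrap and the half-sample choice $n' = n/2$; the exact factors prescribed by Preston [18] at each stage may differ, and all names in the code are illustrative.

```python
import numpy as np

def rescaled_psu_bootstrap_weights(w, f1, rng=None):
    """First-stage bootstrap weights for a rescaled without-replacement
    bootstrap: select n' = n/2 psus by SRSWOR and rescale the weights.
    lam is the generic rescaling factor for this resampling scheme."""
    rng = np.random.default_rng(rng)
    n = len(w)
    n_prime = n // 2
    delta = np.zeros(n)
    delta[rng.choice(n, size=n_prime, replace=False)] = 1.0   # selected psus
    lam = np.sqrt(n_prime * (1.0 - f1) / (n - n_prime))
    return w * (1.0 - lam + lam * (n / n_prime) * delta)

# Step 4 then recomputes the point estimator with these weights, e.g. a total
# sum(w_star * psu_totals) instead of sum(w * psu_totals).
w = np.full(10, 20.0)                  # original psu weights (N/n = 20, say)
w_star = rescaled_psu_bootstrap_weights(w, f1=0.05, rng=3)
```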
4. Bootstrap Procedures for Unequal Probability Sampling Designs
4.1. The Rao-Wu-Yue Bootstrap Weights Algorithm
- Step 1.
- Select $n - 1$ psus according to simple random sampling with replacement from $S_I$.
- Step 2.
- Define the bootstrap weight as $w_{ik}^{*} = w_{ik} \, \dfrac{n}{n-1} \, m_i^{*}$, where $m_i^{*}$ denotes the number of times the $i$th psu is selected in Step 1 (a sketch follows the algorithm).
- Step 3.
- Compute $\hat{\theta}^{*}$ using the formulae that were used to obtain the original point estimator with the original weights $w_{ik}$ replaced by the bootstrap weights $w_{ik}^{*}$.
- Step 4.
- Repeat Steps 1–3 B times to obtain $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$.
- Step 5.
- The bootstrap variance estimator is $v_{\mathrm{B}}(\hat{\theta}) = \mathbb{V}_{*}(\hat{\theta}^{*})$. In practice, the Monte Carlo approximation of $\mathbb{V}_{*}(\hat{\theta}^{*})$ based on $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$ is applied.
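A short sketch of Steps 1–3 for a weighted total; the $n-1$ with-replacement draws and the $n/(n-1)$ adjustment follow the version of the Rao–Wu–Yue weights described above, and the function and variable names are illustrative.

```python
import numpy as np

def rwy_bootstrap_totals(y, w, psu, B=1000, rng=None):
    """Rao-Wu-Yue bootstrap replicates of an estimated total sum(w * y).
    y, w : element-level values and weights; psu : psu label of each element.
    Each replicate draws n-1 psus with replacement and rescales the weights
    by (n / (n-1)) times the number of times each psu was drawn."""
    rng = np.random.default_rng(rng)
    labels = np.unique(psu)
    n = len(labels)
    totals = np.empty(B)
    for b in range(B):
        draw = rng.choice(n, size=n - 1, replace=True)           # psu indices drawn
        mult = np.bincount(draw, minlength=n)                     # multiplicities
        adj = (n / (n - 1)) * mult[np.searchsorted(labels, psu)]  # per-element factor
        totals[b] = np.sum(w * adj * y)
    return totals

# Monte Carlo bootstrap variance (Step 5):
# np.var(rwy_bootstrap_totals(y, w, psu), ddof=1)
```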
4.2. The Pseudo-Population Bootstrap Algorithm
- Step 1.
- Each unit $k \in S_i$ is duplicated $[\pi_{k \mid i}^{-1}]$ times to create a second-stage pseudo-population denoted by $U_i^{*}$, where $[\cdot]$ denotes the closest integer.
- Step 2.
- Each pair $(i, U_i^{*})$, $i \in S_I$, is duplicated $\lfloor \pi_i^{-1} \rfloor$ times. The population of pairs is completed by selecting a sample in the set $\{(i, U_i^{*}) : i \in S_I\}$ by means of a sampling design with first-order inclusion probabilities $\pi_i^{-1} - \lfloor \pi_i^{-1} \rfloor$. This leads to the pseudo-population $U^{*}$.
- Step 3.
- Select a first-stage bootstrap sample $S_I^{*}$ from $U^{*}$ using the original first-stage sampling design, with first-order inclusion probabilities computed on the pseudo-population.
- Step 4.
- Select a second-stage bootstrap sample $S_i^{*}$ from $U_i^{*}$ using the original second-stage sampling design, where a randomization between two values with the appropriate probabilities handles the rounding induced by Step 1. This procedure is applied to each pair $(i, U_i^{*})$ with $i \in S_I^{*}$. The union of the $S_i^{*}$'s leads to the bootstrap sample $S^{*}$.
- Step 5.
- Compute $\hat{\theta}^{*}$ using the formulae that were used to obtain the original point estimator.
- Step 6.
- Steps 3–5 are repeated $B$ times to obtain the bootstrap statistics $\hat{\theta}^{*(1)}, \ldots, \hat{\theta}^{*(B)}$. Let $v^{(c)}$ denote the Monte Carlo variance of these $B$ bootstrap statistics.
- Step 7.
- Steps 2–6 are repeated $C$ times to obtain $v^{(1)}, \ldots, v^{(C)}$. The variance of $\hat{\theta}$ is estimated by the average $C^{-1} \sum_{c=1}^{C} v^{(c)}$; the structure of this double Monte Carlo loop is sketched below.
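The nested loop of Steps 6 and 7 has the structure sketched below; `build_pseudo_population` and `bootstrap_statistic` are placeholder functions standing for Steps 1–2 and Steps 3–5, respectively, and are assumptions of this sketch rather than part of the algorithm.

```python
import numpy as np

def pseudo_population_variance(sample, C, B, build_pseudo_population,
                               bootstrap_statistic, rng=None):
    """Nested Monte Carlo estimate of the bootstrap variance:
    C pseudo-populations (Steps 1-2), B resamples within each (Steps 3-5),
    and the C within-pseudo-population variances are averaged (Step 7)."""
    rng = np.random.default_rng(rng)
    variances = np.empty(C)
    for c in range(C):
        pseudo = build_pseudo_population(sample, rng)             # Steps 1-2
        stats = np.array([bootstrap_statistic(pseudo, rng) for _ in range(B)])
        variances[c] = stats.var(ddof=1)                          # Step 6
    return variances.mean()                                       # Step 7
```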
5. Simulation Study
- (i)
- At the first stage, we selected $n$ psus according to two sampling designs: simple random sampling without replacement and inclusion probability-proportional-to-size randomized systematic sampling (a sketch of the latter design is given after this list). The value of $n$ was set to two different values, corresponding to two first-stage sampling fractions.
- (ii)
- At the second stage, elements within each psu selected at the first stage were selected according to simple random sampling without replacement.
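One standard way to implement the randomized systematic probability-proportional-to-size design used at the first stage is sketched below, assuming all inclusion probabilities $n x_i / \sum_j x_j$ are below one; the size measure and all names are illustrative.

```python
import numpy as np

def randomized_systematic_pps(size_measure, n, rng=None):
    """Randomized systematic PPS sampling: randomly order the psus, then
    apply systematic sampling to the cumulated inclusion probabilities
    pi_i = n * x_i / sum(x)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(size_measure, dtype=float)
    pi = n * x / x.sum()                             # assumes all pi_i < 1
    order = rng.permutation(len(x))                  # random ordering of the psus
    cum = np.cumsum(pi[order])                       # cumulated probabilities (sum = n)
    points = rng.uniform(0.0, 1.0) + np.arange(n)    # systematic selection points
    idx = np.minimum(np.searchsorted(cum, points), len(x) - 1)  # round-off guard
    return np.sort(order[idx])

# Example: 200 psus with size measures 1, ..., 200; select n = 10 of them.
selected = randomized_systematic_pps(np.arange(1, 201), n=10, rng=4)
```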
6. Final Remarks
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
References
- Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26.
- Rao, J.N.K.; Wu, C.F.J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231–241.
- Sitter, R.R. A resampling procedure for complex survey data. J. Am. Stat. Assoc. 1992, 87, 755–765.
- Rao, J.N.K.; Wu, C.F.J.; Yue, K. Some recent work on resampling methods for complex surveys. Surv. Methodol. 1992, 18, 209–217.
- Gross, S. Median estimation in sample surveys. Proc. Sect. Surv. Res. Methods 1980, 181–184.
- Bickel, P.J.; Freedman, D.A. Asymptotic normality and the bootstrap in stratified sampling. Ann. Stat. 1984, 12, 470–482.
- Booth, J.G.; Butler, R.W.; Hall, P. Bootstrap methods for finite populations. J. Am. Stat. Assoc. 1994, 89, 1282–1289.
- Chauvet, G. Méthodes de Bootstrap en Population Finie. Ph.D. Thesis, École Nationale de Statistique et Analyse de l’Information, Bruz, France, 2007.
- Antal, E.; Tillé, Y. A direct bootstrap method for complex sampling designs from a finite population. J. Am. Stat. Assoc. 2011, 106, 534–543.
- Beaumont, J.F.; Patak, Z. On the generalized bootstrap for sample surveys with special attention to Poisson sampling. Int. Stat. Rev. 2012, 80, 127–148.
- Mashreghi, Z.; Haziza, D.; Léger, C. A survey of bootstrap methods in finite population sampling. Stat. Surv. 2016, 10, 1–52.
- Särndal, C.E.; Swensson, B.; Wretman, J. Model-Assisted Survey Sampling; Springer: New York, NY, USA, 1992.
- Beaumont, J.F.; Béliveau, A.; Haziza, D. Clarifying some aspects of variance estimation in two-phase sampling. J. Surv. Stat. Methodol. 2015, 3, 524–542.
- Wolter, K.M. Introduction to Variance Estimation; Springer Series in Statistics: New York, NY, USA, 2007.
- Sitter, R.R. Resampling Procedures for Complex Survey Data. Ph.D. Thesis, University of Waterloo, Waterloo, ON, Canada, 1989.
- Sitter, R.R. Comparing three bootstrap methods for survey data. Can. J. Stat. 1992, 20, 135–154.
- Funaoka, F.; Saigo, H.; Sitter, R.R.; Toida, T. Bernoulli bootstrap for stratified multistage sampling. Surv. Methodol. 2006, 32, 151–156.
- Preston, J. Rescaled bootstrap for stratified multistage sampling. Surv. Methodol. 2009, 35, 227–234.
- Hájek, J. Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann. Math. Stat. 1964, 35, 1491–1523.
- Berger, Y.G. Rate of convergence for asymptotic variance of the Horvitz–Thompson estimator. J. Stat. Plan. Inference 1998, 74, 149–168.
- Matei, A.; Tillé, Y. Evaluation of variance approximations and estimators in maximum entropy sampling with unequal probability and fixed sample size. J. Off. Stat. 2005, 21, 543–570.
- Haziza, D.; Mecatti, F.; Rao, J.N.K. Evaluation of some approximate variance estimators under the Rao–Sampford unequal probability sampling design. Metron 2008, 66, 91–108.
- Saigo, H. Comparing four bootstrap methods for stratified three-stage sampling. J. Off. Stat. 2010, 26, 193–207.
- Shao, J.; Sitter, R.R. Bootstrap for imputed survey data. J. Am. Stat. Assoc. 1996, 91, 1278–1288.
- Beaumont, J.F.; Émond, N. A bootstrap variance estimation method for multistage sampling and two-phase sampling when Poisson sampling is used at the second phase. Stats 2022, 5, 339–357.
Sampling Design | Population Total | Population Median |
---|---|---|
SRSWOR/SRSWOR | Textbook variance estimator; Rao and Wu [2]; Rao et al. [4]; Modified Sitter; Funaoka et al. [17]; Chauvet [8]; Preston [18] | Linearization variance estimator; Rao and Wu [2]; Rao et al. [4]; Modified Sitter; Funaoka et al. [17]; Chauvet [8]; Preston [18] |
IPPSWOR/SRSWOR | Textbook variance estimator; Rao et al. [4]; Chauvet [8] | Linearization variance estimator; Rao et al. [4]; Chauvet [8] |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | 1.99 | 5.10 | 1.28 | 5.05 | |
Chauvet [8] | −7.79 | 2.49 | −8.39 | 2.54 | |
Rao et al. [4] | 6.99 | 29.07 | 6.27 | 29.38 | |
Rao and Wu [2] | 8.66 | 10.00 | 6.96 | 9.20 | |
Preston [18] | 1.88 | 5.20 | 1.13 | 5.11 | |
Modified Sitter | 6.97 | 9.93 | 5.84 | 9.49 | |
Funaoka et al. [17] | 1.80 | 4.00 | 1.08 | 4.15 | |
Textbook | 11.71 | 7.42 | 19.31 | 10.96 | |
Chauvet [8] | −0.02 | 7.56 | 0.03 | 8.67 | |
Rao et al. [4] | 11.19 | 20.39 | 13.05 | 27.92 | |
Rao and Wu [2] | 930.28 | 729.81 | 502.50 | 367.14 | |
Preston [18] | 10.97 | 7.58 | 16.58 | 11.00 | |
Modified Sitter | 11.81 | 12.98 | 10.74 | 13.20 | |
Funaoka et al. [17] | 7.38 | 2.60 | 9.21 | 6.76 |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | 43.4 | 18.9 | 45.3 | 19.8 | |
Chauvet [8] | 43.9 | 19.9 | 45.9 | 20.9 | |
Rao et al. [4] | 44.0 | 20.1 | 45.8 | 21.0 | |
Rao and Wu [2] | 41.2 | 19.3 | 43.3 | 20.2 | |
Preston [18] | 43.8 | 20.2 | 45.6 | 21.0 | |
Modified Sitter | 43.6 | 19.9 | 45.0 | 20.6 | |
Funaoka et al. [17] | 44.1 | 20.2 | 46.0 | 21.0 | |
Textbook | 62.0 | 37.2 | 62.4 | 36.4 | |
Chauvet [8] | 61.0 | 41.1 | 60.5 | 38.9 | |
Rao et al. [4] | 59.9 | 41.1 | 59.3 | 38.0 | |
Rao and Wu [2] | 64.3 | 37.4 | 69.1 | 39.7 | |
Preston [18] | 63.8 | 38.0 | 64.0 | 37.3 | |
Modified Sitter | 59.4 | 40.4 | 59.0 | 38.3 | |
Funaoka et al. [17] | 59.8 | 42.3 | 59.3 | 39.5 |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | 95.33 | 95.07 | 95.40 | 94.73 | |
Chauvet [8] | 94.50 | 94.87 | 94.70 | 94.67 | |
Rao et al. [4] | 95.50 | 96.63 | 95.77 | 96.73 | |
Rao and Wu [2] | 96.10 | 95.33 | 96.00 | 95.47 | |
Preston [18] | 95.37 | 95.07 | 95.30 | 94.77 | |
Modified Sitter | 95.80 | 95.47 | 95.73 | 95.30 | |
Funaoka et al. [17] | 95.27 | 94.97 | 95.33 | 94.73 | |
Textbook | 94.63 | 95.23 | 94.27 | 95.20 | |
Chauvet [8] | 93.90 | 94.67 | 93.17 | 94.57 | |
Rao et al. [4] | 95.10 | 95.97 | 94.70 | 96.03 | |
Rao and Wu [2] | 100.00 | 99.97 | 99.97 | 99.93 | |
Preston [18] | 94.50 | 95.23 | 93.70 | 94.80 | |
Modified Sitter | 95.07 | 95.13 | 94.67 | 94.93 | |
Funaoka et al. [17] | 94.73 | 94.27 | 94.20 | 94.13 |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | 21,470.7 | 8725.5 | 23,157.5 | 9721.9 | |
Chauvet [8] | 20,404.4 | 8612.6 | 22,012.1 | 9599.7 | |
Rao et al. [4] | 21,978.5 | 9663.6 | 23,707.2 | 10,782.5 | |
Rao and Wu [2] | 22,218.5 | 8925.4 | 23,857.5 | 9910.3 | |
Preston [18] | 21,450.9 | 8724.4 | 23,134.4 | 9718.9 | |
Modified Sitter | 21,988.4 | 8919.4 | 23,684.7 | 9921.4 | |
Funaoka et al. [17] | 21,435.4 | 8674.5 | 23,119.5 | 9674.2 | |
Textbook | 0.9796 | 0.4081 | 1.3847 | 0.5705 | |
Chauvet [8] | 0.9284 | 0.4071 | 1.2717 | 0.5634 | |
Rao et al. [4] | 0.9798 | 0.4306 | 1.3537 | 0.6117 | |
Rao and Wu [2] | 2.9770 | 1.1343 | 3.0965 | 1.1681 | |
Preston [18] | 0.9736 | 0.4081 | 1.3654 | 0.5701 | |
Modified Sitter | 0.9795 | 0.4175 | 1.3407 | 0.5753 | |
Funaoka et al. [17] | 0.9634 | 0.3972 | 1.3306 | 0.5582 |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | −1.26 | −0.36 | −1.98 | −0.05 | |
Chauvet [8] | −13.65 | −5.74 | −11.86 | −3.83 | |
Rao et al. [4] | 1.24 | 9.93 | 2.06 | 18.03 | |
Textbook | 18.37 | 3.92 | 22.58 | 5.14 | |
Chauvet [8] | 0.16 | −3.24 | 2.65 | 2.09 | |
Rao et al. [4] | 16.45 | 15.71 | 17.72 | 21.56 |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | 44.9 | 18.9 | 44.6 | 19.0 | |
Chauvet [8] | 46.2 | 21.0 | 46.7 | 21.3 | |
Rao et al. [4] | 46.9 | 22.4 | 46.1 | 21.3 | |
Textbook | 62.0 | 37.7 | 59.3 | 36.1 | |
Chauvet [8] | 61.3 | 40.5 | 61.0 | 40.1 | |
Rao et al. [4] | 60.3 | 41.2 | 58.4 | 37.5 |
Bootstrap Method | | | | |
---|---|---|---|---|
Textbook | 95.27 | 95.50 | 94.73 | 94.83 | |
Chauvet [8] | 93.23 | 94.83 | 93.63 | 94.70 | |
Rao et al. [4] | 95.40 | 96.07 | 95.03 | 96.43 | |
Textbook | 94.53 | 94.63 | 95.70 | 93.63 | |
Chauvet [8] | 93.77 | 93.80 | 93.27 | 93.93 | |
Rao et al. [4] | 95.17 | 94.93 | 95.27 | 95.43 |