Doubly Robust Estimation of the Finite Population Distribution Function Using Nonprobability Samples

Soonpil Kwon; Dongmin Jang; Kyu-Seong Kim

doi:10.3390/math13193227

¹ Department of Statistics and Data Science, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea

² Social Statistics Bureau, Statistics Korea, 189 Cheongsa-ro, Seo-gu, Daejeon 35208, Republic of Korea

³ Department of Statistics, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea

* Author to whom correspondence should be addressed.

Mathematics 2025, 13(19), 3227; https://doi.org/10.3390/math13193227

This article belongs to the Special Issue Uncertainty Quantification Techniques in Statistics, Machine Learning and FinTech: 2nd Edition

Abstract

The growing use of nonprobability samples in survey statistics has motivated research on methodological adjustments that address the selection bias inherent in such samples. Most studies, however, have concentrated on the estimation of the population mean. In this paper, we extend our focus to the finite population distribution function and quantiles, which are fundamental to distributional analysis and inequality measurement. Within a data integration framework that combines probability and nonprobability samples, we propose two estimators, a regression estimator and a doubly robust estimator, and discuss their asymptotic properties. Furthermore, we derive quantile estimators and construct Woodruff confidence intervals using a bootstrap method. Simulation results based on both a synthetic population and the 2023 Korean Survey of Household Finances and Living Conditions demonstrate that the proposed estimators perform stably across scenarios, supporting their applicability to the production of policy-relevant indicators.

Keywords: data integration; inverse probability weighting; quantiles; regression

MSC: 62D05

1. Introduction

Probability sampling, introduced in Neyman’s seminal work [1], has long been regarded as the gold standard for finite population inference in survey statistics [2]. In contrast, nonprobability samples were historically used mainly in observational studies or as supplementary sources for probability sampling. In recent years, however, probability sampling has faced increasing challenges, including rising costs, incomplete sampling frames, declining response rates, and increasingly strict data privacy regulations [3,4]. These difficulties have renewed interest in nonprobability samples as a practical alternative for population inference [5,6].

A key characteristic of nonprobability samples is their unknown selection mechanism, which, if ignored, can lead to substantial selection bias [7,8,9]. Moreover, the theoretical foundations for inference with nonprobability samples remain insufficiently developed, and standardized criteria for evaluating the reliability of resulting estimates are largely lacking.

These challenges have motivated the development of alternative strategies for improving population inference. Among them, an increasingly prominent approach is data integration, which improves population inference by combining information from different sources. Earlier applications focused on merging two probability samples [10,11]. More recently, the framework has been extended to integrate nonprobability with probability samples to correct for the selection bias inherent in such data.

Building on this framework, subsequent research has proposed several approaches to mitigate selection bias. For example, Elliott and Valliant [12] proposed an approach that models the selection mechanism of the nonprobability sample and applies the inverse of estimated propensity scores as weights. Alternatively, Kim et al. [13] introduced a mass imputation method, which fits an outcome regression model using study variables observed in the nonprobability sample and predicts corresponding values for units in the probability sample. A central limitation of both approaches, however, is that their validity depends entirely on correct model specification.

To overcome these limitations, doubly robust estimation methods have attracted considerable attention because they maintain key asymptotic properties when at least one of the two underlying models is correctly specified [6,14,15,16]. Existing applications of doubly robust estimation, however, have largely concentrated on measures of central tendency, such as the population mean. The mean is intuitive and easily interpretable, but it is highly sensitive to extreme values and may be inadequate when information about specific regions of the distribution is crucial, for instance, in studies of income inequality or health outcomes. By contrast, the finite population distribution function and quantiles provide information across the entire distribution, enabling more comprehensive analyses [17,18].

Within probability sampling theory, a simple design-based estimator of the finite distribution function is the Horvitz–Thompson estimator [19], but it does not incorporate auxiliary information. To exploit such information, Chambers and Dunstan [20] proposed a model-based estimator that is asymptotically model-unbiased, though not design-unbiased. Rao et al. [21] subsequently developed an estimator that is asymptotically unbiased under both the sampling design and the model. Wu and Sitter [22] derived corresponding variance estimators, and Rueda et al. [23] introduced a calibration estimator of the finite distribution function. Extending beyond this probability sample foundation to settings with uncontrolled sample selection, only a few studies have attempted to estimate the finite distribution function and quantiles using nonprobability samples. For example, Castro-Martín et al. [24] developed estimators based on either a selection model or an outcome regression model, whereas Cobo et al. [25] proposed several estimators grounded in inverse probability weighting and calibration techniques.

In this study, we extend the doubly robust estimator for the population mean proposed by Chen et al. [15] to the finite distribution function under a data integration framework. For comparison, we first consider the conventional inverse probability weighting approach, and then propose two estimators: a regression estimator and a doubly robust estimator. The proposed doubly robust estimator is asymptotically unbiased for the finite distribution function if either the propensity score model or the outcome regression model is correctly specified.

The remainder of the paper is structured as follows. Section 2 presents the basic setup and assumptions. Section 3 defines the three estimators of the finite distribution function and analyzes their asymptotic properties. Section 4 applies these estimators to the estimation of population quantiles and describes the construction of Woodruff confidence intervals. Section 5 evaluates the performance of the proposed methods through simulation studies. Finally, Section 6 concludes with the implications of the study and directions for future research.

2. Basic Setup

Consider a finite population $U = \{1, 2, \dots, N\}$ of size N. Each unit i ∈ U is associated with a vector of auxiliary variables x_i and a study variable y_i. The parameter of interest in this study is the finite population distribution function, defined by

F_{y} (t) = \frac{1}{N} \sum_{i \in U} I (y_{i} \leq t), - \infty < t < \infty,

(1)

where I(A) is the indicator function that equals 1 if the event A is true and 0 otherwise.
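As a minimal illustration of Equation (1), the sketch below evaluates F_y(t) at a few points for a small made-up population. The function name `population_cdf` and the toy values are assumptions introduced here for illustration only.

```python
import numpy as np

def population_cdf(y, t):
    """Finite population distribution function: F_y(t) = N^{-1} sum_i I(y_i <= t)."""
    y = np.asarray(y, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Compare every y_i with every evaluation point and average the indicators over units.
    return (y[:, None] <= t[None, :]).mean(axis=0)

# Toy population of N = 4 units (hypothetical values).
y_pop = np.array([1.0, 2.0, 2.0, 5.0])
print(population_cdf(y_pop, [0.0, 2.0, 5.0]))  # shares of units with y_i <= 0, 2, 5
```

Because F_y(t) is a step function that jumps only at observed values of y, evaluating it on the sorted distinct values of y is enough to recover it everywhere.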

Suppose two samples are drawn from the population. The first is a nonprobability sample S_A of size n_A, in which both the auxiliary variables x_i and the study variable y_i are observed for each unit i ∈ S_A. The second is a probability sample S_B of size n_B, in which the auxiliary variables x_i are observed along with their inclusion probabilities $π_{i}^{B} = P (i \in S_{B})$ under a given sampling design p.

In this setup, the nonprobability sample is not representative of the population, whereas the probability sample does not contain observations on the study variable. Therefore, an integrated approach is required for population inference. This approach relies on common auxiliary variables observed in both samples, which serve as a bridge linking the study variable with the design information. Such a data integration framework is well-established and has been widely applied in prior studies [6,12,15,16,26,27,28]. Following Chen et al. [15], we adopt this framework under the assumption that the two samples are drawn independently, which simplifies the analysis and enhances theoretical validity.

To utilize nonprobability samples for population inference, we assume an implicit probabilistic selection mechanism, with unknown inclusion probabilities. This corresponds to the concept of an unknown probability sample discussed by Wu [29]. Under this assumption, the issue of undercoverage is excluded from the scope of this study.

To formalize the selection mechanism, let the indicator variable R_i be defined as

R_{i} = \begin{cases} 1, & \text{if } i \in S_{A}, \\ 0, & \text{if } i \notin S_{A} . \end{cases}

Its conditional expectation is given by

π_{i}^{A} = E_{δ} [R_{i} ∣ x_{i}, y_{i}] = P (R_{i} = 1 ∣ x_{i}, y_{i}),

(2)

where δ denotes the selection mechanism model. This probability $π_{i}^{A}$ is commonly referred to as the propensity score in observational studies [30] and as the participation probability in survey sampling [6,31].

We adopt the concept of an unknown probability sample and impose the following assumptions on the propensity score, as introduced in Chen et al. [15].

A1: Given the auxiliary variables x_i, the study variable y_i and the selection indicator R_i are conditionally independent.
A2: For all units i ∈ U, $π_{i}^{A} > 0$ .
A3: Given $x_{1}, x_{2}, \dots, x_{N}$ , the selection indicators $R_{1}, R_{2}, \dots, R_{N}$ are independent.

Assumptions A1 and A2 jointly constitute the strong ignorability condition [30]. Under this condition, $π_{i}^{A}$ in Equation (2) depends only on the auxiliary variables x_i.

3. Estimators of the Finite Distribution Function

In this section, we present three estimators of the finite distribution function defined in Equation (1) within the data integration framework: the inverse probability weighted (IPW) estimator, the regression (REG) estimator, and the doubly robust (DR) estimator. Their statistical properties are investigated under a joint randomization framework that consists of three components [16]:

(i): δ: The selection mechanism for the nonprobability sample;
(ii): p: The probability sampling design;
(iii): M: The outcome regression model for the study variable y.

Specifically, the IPW estimator is analyzed under the δp-framework, the REG estimator under the Mp-framework, and the DR estimator under either the δp- or the Mp-framework, without specification of which one. Notably, all frameworks incorporate the probability sampling design p.

To establish the asymptotic properties of these estimators, we next introduce the regularity conditions required for our theoretical results. Following Chen et al. [15], we adopt conditions C1, C4, C5, and C6, which we reformulate as regularity conditions B1–B4 in this paper by substituting I(y_i ≤ t) for y_i and $G (t - x_{i}^{⊤} β)$ for m(x_i, β). To ensure the validity of the Taylor expansion, we further impose the following additional condition:

B5: The error distribution function G(t) is twice continuously differentiable for all t, and its first derivative is denoted by g(t).

In addition, we assume an increasing population framework for asymptotic analysis. That is, we consider a sequence of finite populations indexed by ν, denoted as $\{U_{ν} : ν = 1, 2, \dots\}$, with corresponding samples $S_{A,ν}$ and $S_{B,ν}$. As ν → ∞, the population size $N_{ν}$ and the sample sizes $n_{A,ν}$ and $n_{B,ν}$ all diverge to infinity. For simplicity, the subscript ν is omitted hereafter, and asymptotics are expressed in terms of N → ∞.

3.1. Inverse Probability Weighted Estimator

The IPW method is widely used to correct for selection bias in nonprobability samples, where the inclusion probability $π_{i}^{A}$ is assumed to be a function of auxiliary variables x_i and unknown parameters θ of a participation model,

π_{i}^{A} = π (x_{i}, θ) .

Estimating $π_{i}^{A}$ directly, however, requires auxiliary information for the entire population, which is rarely available in practice. In response, Chen et al. [15] proposed a pseudo-likelihood approach that incorporates the design weights from a probability sample. The participation model parameters are first estimated as $\hat{θ}$, which are then used to compute the estimated inclusion probabilities ${\hat{π}}_{i}^{A} = π (x_{i}, \hat{θ})$. The corresponding pseudo-weight is ${\hat{d}}_{i}^{A} = 1 / {\hat{π}}_{i}^{A}$, which is used to construct a Hájek-type estimator of the finite distribution function. This quasi-randomization approach enables valid inference from nonprobability samples [12,32]. Using the pseudo-weights ${\hat{d}}_{i}^{A}$, the IPW estimator of the finite distribution function at a fixed point t is given by

{\hat{F}}_{IPW} (t) = \frac{1}{{\hat{N}}_{A}} \sum_{i \in S_{A}} {\hat{d}}_{i}^{A} I (y_{i} \leq t),

(3)

where ${\hat{N}}_{A} = \sum_{i \in S_{A}} {\hat{d}}_{i}^{A}$.
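The two steps behind Equation (3), estimating θ and then forming the Hájek-type ratio, can be sketched as follows. The text does not spell out the pseudo-likelihood, so this sketch assumes the logistic form usually associated with Chen et al. [15], in which $\hat{θ}$ maximizes $\sum_{i \in S_{A}} x_{i}^{⊤} θ - \sum_{i \in S_{B}} d_{i}^{B} \log \{1 + \exp (x_{i}^{⊤} θ)\}$ with design weights $d_{i}^{B} = 1 / π_{i}^{B}$. The function names, the Newton-Raphson solver, and the simulated data are all hypothetical.

```python
import numpy as np

def fit_propensity(x_a, x_b, d_b, n_iter=50, tol=1e-10):
    """Newton-Raphson for the assumed logistic pseudo log-likelihood.
    x_a, x_b: design matrices (with intercept column) for S_A and S_B; d_b: design weights."""
    theta = np.zeros(x_a.shape[1])
    for _ in range(n_iter):
        pi_b = 1.0 / (1.0 + np.exp(-(x_b @ theta)))
        # Score: sum over S_A of x_i minus the design-weighted sum over S_B of pi_i * x_i.
        score = x_a.sum(axis=0) - x_b.T @ (d_b * pi_b)
        hess = -(x_b * (d_b * pi_b * (1.0 - pi_b))[:, None]).T @ x_b
        step = np.linalg.solve(hess, score)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def ipw_cdf(y_a, pi_a_hat, t):
    """Hajek-type IPW estimator of F_y(t), as in Equation (3)."""
    d_a = 1.0 / np.asarray(pi_a_hat, dtype=float)           # pseudo-weights
    indicator = (np.asarray(y_a, dtype=float) <= t).astype(float)
    return np.sum(d_a * indicator) / np.sum(d_a)

# Simulated illustration (all values hypothetical).
rng = np.random.default_rng(1)
N, n_b = 20000, 1000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
y = 1.0 + x[:, 1] + rng.normal(size=N)
pi_true = 1.0 / (1.0 + np.exp(-(x @ np.array([-3.5, 0.5]))))
in_a = rng.random(N) < pi_true                               # Poisson-type selection into S_A
idx_b = rng.choice(N, size=n_b, replace=False)               # simple random sample S_B
d_b = np.full(n_b, N / n_b)
theta_hat = fit_propensity(x[in_a], x[idx_b], d_b)
pi_a_hat = 1.0 / (1.0 + np.exp(-(x[in_a] @ theta_hat)))
print(ipw_cdf(y[in_a], pi_a_hat, t=1.0), np.mean(y <= 1.0))  # estimate vs. population value
```

With an intercept in the model, the first score equation forces the design-weighted sum of fitted probabilities over S_B to equal n_A, which is a convenient convergence check.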

Theorem 1.

Under A1–A3 and B1–B5, and if the propensity score model is correctly specified as a logistic regression model, ${\hat{F}}_{IPW} (t)$ is an asymptotically δp-unbiased estimator of F_y(t) for a fixed point t.

Proof.

The asymptotic property of the IPW estimator for the population mean μ_y, denoted as ${\hat{μ}}_{y} = {\hat{N}}_{A}^{- 1} \sum_{i \in S_{A}} y_{i} / {\hat{π}}_{i}^{A}$, is given by Chen et al. [15] as ${\hat{μ}}_{y} - μ_{y} = O_{p} (n_{A}^{- 1 / 2})$. The finite distribution function estimator ${\hat{F}}_{IPW} (t)$ can be viewed as a plug-in estimator with y_i replaced by I(y_i ≤ t). Therefore,

{\hat{F}}_{IPW} (t) - F_{y} (t) = O_{p} (n_{A}^{- 1 / 2}), - \infty < t < \infty .

Moreover, since both ${\hat{F}}_{IPW} (t)$ and F_y(t) lie in [0, 1], we have $| {\hat{F}}_{IPW} (t) - F_{y} (t) | \leq 1$. Hence,

E_{δ p} [{\hat{F}}_{IPW} (t) - F_{y} (t)] = O (n_{A}^{- 1 / 2}) .

Thus, ${\hat{F}}_{IPW} (t)$ is an asymptotically δp-unbiased estimator of the finite distribution function F_y(t). □

The IPW estimator is asymptotically unbiased under the assumption of strong ignorability, that is, when the selection mechanism is fully explained by the auxiliary variables. However, if the propensity score model is misspecified, asymptotic unbiasedness may fail. Moreover, even under correct specification, extreme propensity score values (close to 0 or 1) may yield highly unstable inverse probability weights, inflating the variance of the estimator [33].

3.2. Regression Estimator

The REG estimator is obtained by fitting a regression model of y_i on x_i using the nonprobability sample S_A and then applying the fitted model to the probability sample S_B to estimate the finite distribution function. Since this procedure imputes the study variable for all units in S_B, it is commonly referred to as mass imputation.

This approach relies on a superpopulation framework, where the finite population $U = \{1, 2, \dots, N\}$ is regarded as a random sample drawn from the following outcome regression model:

y_{i} = x_{i}^{⊤} β + ϵ_{i}, i = 1, 2, \dots, N,

(4)

where the error terms ϵ_i are independent, with E_M(ϵ_i) = 0 and Var_M(ϵ_i) = σ².

The subscript M indicates that the corresponding expectation or variance is taken under the model (4). By Assumption A1, we have $E_{M} [y_{i} ∣ x_{i}, R_{i}] = E_{M} [y_{i} ∣ x_{i}]$ and ${Var}_{M} [y_{i} ∣ x_{i}, R_{i}] = {Var}_{M} [y_{i} ∣ x_{i}]$, ensuring that the regression model remains valid for the nonprobability sample.

A straightforward model-based approach to estimating the finite distribution function is to substitute I(y_i ≤ t) with the plug-in indicator $I (x_{i}^{⊤} \hat{β} \leq t)$. Since, in general, $E_{M} [I (x_{i}^{⊤} \hat{β} \leq t)] \neq P (y_{i} \leq t)$, this naive substitution may result in bias [17]. To address this issue, Chambers and Dunstan [20] introduced a model-based estimator constructed from regression residuals.

Let G(·) denote the distribution function of ϵ_i. For unit i, the residual-based estimator of $G (t - x_{i}^{⊤} β)$ is defined as

{\hat{G}}_{i} {(t)}_{REG} = \frac{1}{n_{A}} \sum_{j \in S_{A}} I (y_{j} - x_{j}^{⊤} \hat{β} \leq t - x_{i}^{⊤} \hat{β}),

(5)

where $\hat{β}$ is the ordinary least-squares (OLS) estimator of β.

Based on Equation (5), the regression estimator of F_y(t) at a fixed point t is given by

{\hat{F}}_{REG} (t) = \frac{1}{{\hat{N}}_{B}} \sum_{i \in S_{B}} d_{i}^{B} {\hat{G}}_{i} {(t)}_{REG},

(6)

where $d_{i}^{B} = 1 / π_{i}^{B}$ is the design weight of unit i in S_B and ${\hat{N}}_{B} = \sum_{i \in S_{B}} d_{i}^{B}$.
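Equations (5) and (6) translate into a short computation: fit OLS on S_A, form the residuals, and for each unit in S_B count the share of residuals that do not exceed $t - x_{i}^{⊤} \hat{β}$. The sketch below is one possible implementation under the assumption that the design matrices already contain an intercept column; the function name `reg_cdf` and the toy inputs are hypothetical.

```python
import numpy as np

def reg_cdf(y_a, x_a, x_b, d_b, t):
    """Regression (mass imputation) estimator of F_y(t), following Equations (5) and (6).
    y_a, x_a: study variable and design matrix in S_A; x_b, d_b: design matrix and weights in S_B."""
    beta_hat, *_ = np.linalg.lstsq(x_a, y_a, rcond=None)   # OLS fit on the nonprobability sample
    resid = np.sort(y_a - x_a @ beta_hat)
    cut = t - x_b @ beta_hat                               # one threshold per unit in S_B
    # side="right" counts residuals <= cut, i.e. the empirical residual distribution at cut.
    g_hat = np.searchsorted(resid, cut, side="right") / resid.size
    return np.sum(d_b * g_hat) / np.sum(d_b)

# Toy check with an intercept-only model: the estimator reduces to the
# unweighted share of S_A values not exceeding t.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
print(reg_cdf(y_toy, np.ones((4, 1)), np.ones((3, 1)), np.array([2.0, 2.0, 2.0]), t=2.5))
```

Sorting the residuals once and using a binary search keeps the cost near (n_A + n_B) log n_A, rather than the n_A × n_B comparisons implied by a literal reading of the double sum.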

Theorem 2.

Under A1–A3 and B1–B5, ${\hat{F}}_{REG} (t)$ is an asymptotically Mp-unbiased estimator of F_y(t) for a fixed point t.

Proof.

Following Chambers et al. [34], we can show that under conditions B1–B5,

\begin{aligned}
E_{M} [{\hat{G}}_{i} {(t)}_{REG}] &= \frac{1}{n_{A}} \sum_{j \in S_{A}} E_{M} [I \{ϵ_{j} \leq t_{i} + {(x_{j} - x_{i})}^{⊤} (\hat{β} - β)\}] \\
&= \frac{1}{n_{A}} \sum_{j \in S_{A}} E_{Δ_{(j)}} [G \{t_{i} + {(x_{j} - x_{i})}^{⊤} Δ_{(j)}\}] + O (n_{A}^{- 1}) \\
&= \frac{1}{n_{A}} \sum_{j \in S_{A}} E_{Δ_{(j)}} [G (t_{i}) + g (t_{i}) {(x_{j} - x_{i})}^{⊤} Δ_{(j)} + \frac{1}{2} g^{'} (t_{i}) {\{{(x_{j} - x_{i})}^{⊤} Δ_{(j)}\}}^{2}] + O (n_{A}^{- 1}) \\
&= G (t_{i}) + O (n_{A}^{- 1}),
\end{aligned}

where $t_{i} = t - x_{i}^{⊤} β$, $Δ_{(j)} = {\hat{β}}_{(j)} - β$, and ${\hat{β}}_{(j)}$ denotes the leave-one-out OLS estimator based on $S_{A} \setminus \{j\}$.

Using this result, the expectation of the bias can be evaluated as

\begin{aligned}
E [{\hat{F}}_{REG} (t) - F_{y} (t)] &= E_{p} [E_{M} [{\hat{F}}_{REG} (t) - F_{y} (t)]] \\
&= E_{p} [\frac{1}{{\hat{N}}_{B}} \sum_{i \in S_{B}} d_{i}^{B} G (t_{i}) - \frac{1}{N} \sum_{i = 1}^{N} G (t_{i})] + O (n_{A}^{- 1}) .
\end{aligned}

The Hájek estimator ${\hat{N}}_{B}^{- 1} \sum_{i \in S_{B}} d_{i}^{B} G (t_{i})$ is asymptotically p-unbiased for $N^{- 1} \sum_{i = 1}^{N} G (t_{i})$, so the first term on the right-hand side of the second equality vanishes asymptotically. Therefore, ${\hat{F}}_{REG} (t)$ is an asymptotically Mp-unbiased estimator for the finite population distribution function F_y(t). □

When the outcome regression model is correctly specified, the REG estimator is highly efficient and supports broader use of nonprobability samples. However, if the regression model fails to capture the true distribution, bias may arise, and the method becomes sensitive to misspecification. The next section introduces the DR estimator, which remains asymptotically unbiased, provided that either the propensity score model or the outcome regression model is correctly specified.

3.3. Doubly Robust Estimator

The asymptotic unbiasedness of the IPW estimator in Equation (3) and the REG estimator in Equation (6) relies on the correct specification of their respective working models. In practice, however, ensuring such correctness is often difficult, which motivates the development of procedures that remain valid under model misspecification. The DR estimator was introduced to address this issue and has been regarded as a successful approach since Robins et al. [35].

To construct a DR estimator for the finite distribution function, we require an analog of ${\hat{G}}_{i} {(t)}_{REG}$ in Equation (5) that estimates the error distribution G(·) and remains valid under joint randomization. Because ${\hat{G}}_{i} {(t)}_{REG}$ is derived under the Mp-framework, it cannot be directly applied under the δp-framework. Accordingly, we extend the method of Rao et al. [21] and propose a new estimator of the error distribution that is valid under such joint randomization.

{\hat{G}}_{i} {(t)}_{DR} = \frac{1}{{\hat{N}}_{A}} \sum_{j \in S_{A}} {\hat{d}}_{j}^{A} I (y_{j} - x_{j}^{⊤} \hat{β} \leq t - x_{i}^{⊤} \hat{β})

(7)

Based on ${\hat{G}}_{i} {(t)}_{DR}$, defined in Equation (7), the DR estimator of the finite distribution function F_y(t) at a fixed point t is then given by

{\hat{F}}_{DR} (t) = \frac{1}{{\hat{N}}_{A}} \sum_{i \in S_{A}} {\hat{d}}_{i}^{A} \{I (y_{i} \leq t) - {\hat{G}}_{i} {(t)}_{DR}\} + \frac{1}{{\hat{N}}_{B}} \sum_{i \in S_{B}} d_{i}^{B} {\hat{G}}_{i} {(t)}_{DR}

(8)
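Equations (7) and (8) differ from the regression estimator only in that the residual distribution is weighted by the pseudo-weights and an IPW-weighted correction term is added. A minimal sketch follows, assuming that $\hat{β}$ is the unweighted OLS fit on S_A (as in Equation (5)) and that the pseudo-weights ${\hat{d}}_{i}^{A}$ have already been estimated; the function name `dr_cdf` and the simulated inputs are hypothetical.

```python
import numpy as np

def dr_cdf(y_a, x_a, d_a, x_b, d_b, t):
    """Doubly robust estimator of F_y(t), following Equations (7) and (8).
    d_a: estimated pseudo-weights in S_A; d_b: design weights in S_B."""
    beta_hat, *_ = np.linalg.lstsq(x_a, y_a, rcond=None)    # assumption: unweighted OLS on S_A
    resid = y_a - x_a @ beta_hat
    order = np.argsort(resid)
    resid_sorted = resid[order]
    # Pseudo-weighted distribution of residuals; leading 0 covers thresholds below all residuals.
    cum_w = np.concatenate(([0.0], np.cumsum(d_a[order]) / d_a.sum()))

    def g_hat(cut):
        return cum_w[np.searchsorted(resid_sorted, cut, side="right")]

    g_a = g_hat(t - x_a @ beta_hat)                          # Equation (7) for units in S_A
    g_b = g_hat(t - x_b @ beta_hat)                          # Equation (7) for units in S_B
    ipw_part = np.sum(d_a * ((y_a <= t).astype(float) - g_a)) / d_a.sum()
    reg_part = np.sum(d_b * g_b) / d_b.sum()
    return ipw_part + reg_part

# Sanity check on simulated data: if S_B coincides with S_A and carries the same weights,
# the two G-hat terms cancel exactly and the DR estimator equals the IPW estimator.
rng = np.random.default_rng(0)
n = 50
xa = np.column_stack([np.ones(n), rng.normal(size=n)])
ya = xa @ np.array([1.0, 2.0]) + rng.normal(size=n)
da = rng.uniform(1.0, 5.0, size=n)
ipw_value = np.sum(da * (ya <= 1.0)) / da.sum()
print(dr_cdf(ya, xa, da, xa, da, 1.0), ipw_value)
```

Note that the correction term can push the estimate slightly outside [0, 1] or make it non-monotone in t in small samples, so a user may want to check monotonicity before inverting it for quantiles.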

Theorem 3.

Under regularity conditions A1–A3 and B1–B5, if at least one of the models, the propensity score model or the outcome regression model, is correctly specified, ${\hat{F}}_{DR} (t)$ is an asymptotically unbiased estimator of F_y(t) at a fixed point t under the δp- or Mp-framework.

Proof.

(i) When the propensity score model is correctly specified

The doubly robust estimator can be rewritten as

{\hat{F}}_{DR} (t) = {\hat{F}}_{IPW} (t) - \frac{1}{{\hat{N}}_{A}} \sum_{i \in S_{A}} {\hat{d}}_{i}^{A} {\hat{G}}_{i} {(t)}_{DR} + \frac{1}{{\hat{N}}_{B}} \sum_{i \in S_{B}} d_{i}^{B} {\hat{G}}_{i} {(t)}_{DR} .

The second and third terms on the right-hand side are Hájek estimators of $N^{- 1} \sum_{i = 1}^{N} {\hat{G}}_{i} {(t)}_{DR}$ based on the nonprobability sample S_A and the probability sample S_B, respectively, and therefore cancel out asymptotically. Given the asymptotic δp-unbiasedness of ${\hat{F}}_{IPW} (t)$, ${\hat{F}}_{DR} (t)$ is also asymptotically δp-unbiased.

(ii) When the outcome regression model is correctly specified

Similarly to the proof for the REG estimator, we can show

\begin{aligned}
E_{M} [{\hat{G}}_{i} {(t)}_{DR}] &= \frac{1}{{\hat{N}}_{A}} \sum_{j \in S_{A}} {\hat{d}}_{j}^{A} E_{M} [I \{ϵ_{j} \leq t_{i} + {(x_{j} - x_{i})}^{⊤} (\hat{β} - β)\}] \\
&= \frac{1}{{\hat{N}}_{A}} \sum_{j \in S_{A}} {\hat{d}}_{j}^{A} E_{Δ_{(j)}} [G \{t_{i} + {(x_{j} - x_{i})}^{⊤} Δ_{(j)}\}] + O (n_{A}^{- 1}) \\
&= \frac{1}{{\hat{N}}_{A}} \sum_{j \in S_{A}} {\hat{d}}_{j}^{A} E_{Δ_{(j)}} [G (t_{i}) + g (t_{i}) {(x_{j} - x_{i})}^{⊤} Δ_{(j)} + \frac{1}{2} g^{'} (t_{i}) {\{{(x_{j} - x_{i})}^{⊤} Δ_{(j)}\}}^{2}] + O (n_{A}^{- 1}) \\
&= G (t_{i}) + O (n_{A}^{- 1}) .
\end{aligned}

It follows that the expected bias is

\begin{aligned}
E [{\hat{F}}_{DR} (t) - F_{y} (t)] &= E_{p} [\frac{1}{{\hat{N}}_{A}} \sum_{i \in S_{A}} {\hat{d}}_{i}^{A} E_{M} [I (y_{i} \leq t) - {\hat{G}}_{i} {(t)}_{DR}] + \frac{1}{{\hat{N}}_{B}} \sum_{i \in S_{B}} d_{i}^{B} E_{M} [{\hat{G}}_{i} {(t)}_{DR}] - \frac{1}{N} \sum_{i \in U} E_{M} [I (y_{i} \leq t)]] \\
&= E_{p} [\frac{1}{{\hat{N}}_{B}} \sum_{i \in S_{B}} d_{i}^{B} G (t_{i}) - \frac{1}{N} \sum_{i = 1}^{N} G (t_{i})] + O (n_{A}^{- 1}) .
\end{aligned}

The Hájek estimator ${\hat{N}}_{B}^{- 1} \sum_{i \in S_{B}} d_{i}^{B} G (t_{i})$ is asymptotically p-unbiased for $N^{- 1} \sum_{i = 1}^{N} G (t_{i})$, so the first term on the right-hand side of the second equality vanishes asymptotically. Therefore, under the Mp-framework, ${\hat{F}}_{DR} (t)$ is an asymptotically unbiased estimator of the finite population distribution function F_y(t). □

The asymptotic unbiasedness of the DR estimator requires the estimated coefficients to satisfy probability-limit conditions; specifically, for the propensity score model parameters $\hat{θ}$ and the outcome regression model parameters $\hat{β}$, there exist fixed vectors θ^∗ and β^∗ such that $\operatorname{plim} \hat{θ} = θ^{*}$ and $\operatorname{plim} \hat{β} = β^{*}$. If the propensity score model is correctly specified, then θ^∗ = θ, and if the outcome regression model is correctly specified, then β^∗ = β. Under misspecification, by contrast, these probability limits need not coincide with the true parameters, and the limiting value may not have a meaningful interpretation.

4. Quantile Estimation

An important application of the finite distribution function estimators is the estimation of population quantiles, defined as

ξ_{q} = \inf \{t; F_{y} (t) \geq q\} .

Quantiles provide informative summaries of distributional features, including central tendency, spread, and asymmetry, and are also useful for assessing outliers. Because estimators of the finite distribution function are typically step functions, linear interpolation is employed to obtain a unique estimate of the qth quantile [21,36,37]. The quantile estimator ${\hat{ξ}}_{q}$ is expressed as

{\hat{ξ}}_{q} = a + \frac{q - \hat{F} (a)}{\hat{F} (b) - \hat{F} (a)} (b - a),

where $a = \max \{t; \hat{F} (t) \leq q\}$ and $b = \min \{t; \hat{F} (t) \geq q\}$.
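The interpolation rule can be sketched as follows. The sketch assumes that $\hat{F}$ has been evaluated on a sorted grid of candidate points (for example, the distinct observed values of y) and that these values are nondecreasing, so that a and b are taken over the grid; the function name `quantile_from_cdf` and the toy grid are hypothetical.

```python
import numpy as np

def quantile_from_cdf(grid, f_hat, q):
    """Linearly interpolated q-th quantile from values f_hat of an estimated
    distribution function on a sorted grid (assumed nondecreasing)."""
    grid = np.asarray(grid, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    below = np.flatnonzero(f_hat <= q)     # candidates for a
    above = np.flatnonzero(f_hat >= q)     # candidates for b
    if above.size == 0:                    # q lies above the estimated range
        return grid[-1]
    if below.size == 0:                    # q lies below the estimated range
        return grid[0]
    a, b = grid[below[-1]], grid[above[0]]
    fa, fb = f_hat[below[-1]], f_hat[above[0]]
    if fb == fa:                           # q is attained exactly on the grid
        return a
    return a + (q - fa) / (fb - fa) * (b - a)

# Toy grid: F_hat jumps by 0.25 at t = 1, 2, 3, 4 (hypothetical values).
print(quantile_from_cdf([1, 2, 3, 4], [0.25, 0.5, 0.75, 1.0], q=0.6))
```

In the toy case, q = 0.6 falls between $\hat{F}(2) = 0.5$ and $\hat{F}(3) = 0.75$, so the estimate is 2 + (0.1 / 0.25) × 1 = 2.4.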

A widely used method for constructing a confidence interval (CI) for a quantile estimator was proposed by Woodruff [36]. The key idea is to first construct a CI for the estimated finite distribution function and then invert this interval to derive a CI for the quantile. The resulting 100(1 − α)% CI is given by

\begin{aligned}
{\hat{ξ}}_{q}^{L} &= \inf \{t; \hat{F} (t) \geq q - z_{1 - α / 2} \sqrt{\hat{V} [\hat{F} ({\hat{ξ}}_{q})]}\}, \\
{\hat{ξ}}_{q}^{U} &= \inf \{t; \hat{F} (t) \geq q + z_{1 - α / 2} \sqrt{\hat{V} [\hat{F} ({\hat{ξ}}_{q})]}\},
\end{aligned}

where $z_{1 - α / 2}$ is the (1 − α/2) quantile of the standard normal distribution and $\hat{V} [\hat{F} ({\hat{ξ}}_{q})]$ denotes the estimated variance of $\hat{F} (t)$ evaluated at ${\hat{ξ}}_{q}$. Sitter and Wu [38] provided empirical evidence that the Woodruff method achieves approximately correct coverage even for extreme quantiles (large or small q).
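A minimal sketch of the Woodruff inversion is given below. It assumes that $\hat{F}$ is available on a sorted grid, that a variance estimate for $\hat{F}({\hat{ξ}}_{q})$ has already been obtained (for instance by bootstrap), and that the infimum is taken over grid points; the function name `woodruff_ci` and the toy inputs are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def woodruff_ci(grid, f_hat, q, var_hat, alpha=0.05):
    """Woodruff interval: build a CI for F_hat at the quantile, then invert F_hat.
    grid: sorted evaluation points; f_hat: estimated distribution function on grid;
    var_hat: estimated variance of F_hat evaluated at the quantile estimate."""
    grid = np.asarray(grid, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(var_hat)

    def inverse(p):
        # inf{t : F_hat(t) >= p}, restricted to the grid; falls back to the largest point.
        idx = np.flatnonzero(f_hat >= p)
        return grid[idx[0]] if idx.size else grid[-1]

    return inverse(q - half_width), inverse(q + half_width)

# Toy example: F_hat(t) = t / 100 on t = 1, ..., 100, with a variance chosen so that
# the half-width on the probability scale is 0.045 (hypothetical values).
grid = np.arange(1, 101)
z975 = NormalDist().inv_cdf(0.975)
print(woodruff_ci(grid, grid / 100.0, q=0.5, var_hat=(0.045 / z975) ** 2))
```

In the toy example the probability-scale interval is roughly (0.455, 0.545), which maps back to the grid points 46 and 55.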

5. Simulation Studies

To evaluate the performance of the finite distribution function estimators, ${\hat{F}}_{IPW} (t)$, ${\hat{F}}_{REG} (t)$, and ${\hat{F}}_{DR} (t)$, we conducted simulation studies based on two populations: (i) a synthetic finite population from Chen et al. [15] and (ii) the 2023 Korean Survey of Household Finances and Living Conditions.

The variances of the finite distribution function estimators were obtained via a bootstrap procedure following Chen et al. [15]:

Step 1: From the nonprobability sample S_A and the probability sample S_B, draw bootstrap samples $S_{A}^{(j)}$ and $S_{B}^{(j)}$ of sizes n_A and n_B, respectively, by simple random sampling with replacement, for j = 1, …, J with J = 1000 replicates.
Step 2: For each bootstrap replicate, compute ${\hat{F}}^{(j)} (t)$.
Step 3: Using $\{{\hat{F}}^{(j)} (t)\}_{j = 1}^{J}$, calculate the bootstrap variance estimator v_BT.
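The bootstrap steps above can be sketched as a small generic routine. The text does not give the formula for v_BT, so the sketch assumes the usual sample variance of the J replicates with divisor J − 1. The callable `estimate`, which maps resampled index arrays for S_A and S_B to a value of $\hat{F}(t)$, is a hypothetical interface introduced here.

```python
import numpy as np

def bootstrap_variance(estimate, n_a, n_b, n_boot=1000, seed=0):
    """Bootstrap variance of an estimator of F_y(t).
    estimate(idx_a, idx_b) must recompute the estimator (including any model fits)
    on the resampled units of S_A and S_B given by the index arrays."""
    rng = np.random.default_rng(seed)
    reps = np.empty(n_boot)
    for j in range(n_boot):
        idx_a = rng.integers(0, n_a, size=n_a)   # SRS with replacement from S_A
        idx_b = rng.integers(0, n_b, size=n_b)   # SRS with replacement from S_B
        reps[j] = estimate(idx_a, idx_b)
    return reps.var(ddof=1)                      # assumed form of v_BT

# Toy usage: the "estimator" is the unweighted share of S_A values <= 0 (hypothetical data).
rng_demo = np.random.default_rng(3)
y_demo = rng_demo.normal(size=200)
v_bt = bootstrap_variance(lambda ia, ib: np.mean(y_demo[ia] <= 0.0), 200, 100)
print(v_bt)
```

For the IPW and DR estimators, the propensity score model would be refitted inside `estimate` on each replicate, so that the variability of $\hat{θ}$ and $\hat{β}$ is reflected in v_BT.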

Performance was evaluated over R = 3000 simulation replications using percentage relative bias (%RB) and relative root mean squared error (RRMSE), where

\% RB = \frac{1}{R} \sum_{r = 1}^{R} \frac{{\hat{θ}}^{(r)} - θ}{θ} \times 100, \quad RRMSE = \frac{1}{θ} \sqrt{\frac{1}{R} \sum_{r = 1}^{R} {({\hat{θ}}^{(r)} - θ)}^{2}},

with ${\hat{θ}}^{(r)}$ denoting the estimate from replication r and θ the target parameter. For the finite distribution function, the bootstrap variance, and quantiles, the corresponding choices were:

The finite distribution function: ${\hat{θ}}^{(r)} = {\hat{F}}^{(r)} (t)$, $θ = F_{y} (t)$,
Bootstrap variance: ${\hat{θ}}^{(r)} = v_{BT}^{(r)}$, $θ = V$,
Quantile: ${\hat{θ}}^{(r)} = {\hat{ξ}}_{q}^{(r)}$, $θ = ξ_{q}$,

where V denotes the simulation-based variance of $\hat{F} (t)$ computed from 10,000 replications.
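The two performance measures are direct transcriptions of the formulas above. In the sketch below, the function names and the toy replicate values are hypothetical.

```python
import numpy as np

def pct_rb(estimates, truth):
    """Percentage relative bias: mean of (theta_hat - theta) / theta, times 100."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates - truth) / truth) * 100.0

def rrmse(estimates, truth):
    """Relative root mean squared error: RMSE divided by the target parameter."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - truth) ** 2)) / truth

# Two hypothetical replicates that miss the target by -0.1 and +0.1:
# the bias cancels, while the RRMSE is 0.1.
print(pct_rb([0.9, 1.1], 1.0), rrmse([0.9, 1.1], 1.0))
```

Both measures divide by θ, so they become unstable when the target is close to zero, for example F_y(t) at a very small t.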

The coverage probability of the CI based on the bootstrap variance (%CP_v) was evaluated as

\% {CP}_{v} = \frac{1}{R} \sum_{r = 1}^{R} I (| {\hat{F}}^{(r)} (t) - F_{y} (t) | \leq z_{1 - α / 2} \sqrt{v_{BT}^{(r)}}) .

The performance of the Woodruff CI was assessed by its coverage probability (%CP_ξ), lower error rate (%L), and upper error rate (%U):

\begin{aligned}
\% {CP}_{ξ} &= \frac{1}{R} \sum_{r = 1}^{R} I (ξ_{q}^{L (r)} < ξ_{q} < ξ_{q}^{U (r)}), \\
\% L &= \frac{1}{R} \sum_{r = 1}^{R} I (ξ_{q} < ξ_{q}^{L (r)}), \\
\% U &= \frac{1}{R} \sum_{r = 1}^{R} I (ξ_{q}^{U (r)} < ξ_{q}),
\end{aligned}

where $ξ_{q}^{L (r)}$ and $ξ_{q}^{U (r)}$ denote, respectively, the lower and upper Woodruff CI bounds for the qth quantile in replication r.
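These three rates can be computed from the stored interval bounds as sketched below. The displayed formulas define proportions; the sketch multiplies by 100 on the assumption that the rates are reported as percentages, as the % prefix suggests. The function name and the toy bounds are hypothetical.

```python
import numpy as np

def woodruff_rates(lower, upper, xi_q):
    """Coverage, lower error, and upper error rates (in percent) over replications."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    cp = np.mean((lower < xi_q) & (xi_q < upper)) * 100.0   # interval contains the true quantile
    l_err = np.mean(xi_q < lower) * 100.0                   # true quantile below the interval
    u_err = np.mean(upper < xi_q) * 100.0                   # true quantile above the interval
    return cp, l_err, u_err

# Four hypothetical replications with true quantile 1.8:
# two cover it, one lies entirely above it, one lies entirely below it.
print(woodruff_rates([1.0, 3.0, 1.0, 1.0], [2.0, 4.0, 2.0, 1.5], xi_q=1.8))
```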

5.1. Study 1

Following the simulation design of Chen et al. [15], we generated a finite population of size N = 20,000. The study variable y and auxiliary variables x were generated from

y_{i} = 2 + x_{1 i} + x_{2 i} + x_{3 i} + x_{4 i} + σ ϵ_{i}, i = 1, 2, \dots, N,

where $(x_{1i}, x_{2i}, x_{3i}, x_{4i})$ follow the design in Chen et al. [15], and the error terms follow $ϵ_{i} ∼ N(0, 1)$. The parameter σ was chosen such that the correlation coefficient ρ between y and the linear predictor $x^{⊤} β$ equals 0.5.
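One way to obtain such a σ, assuming the errors are independent of the covariates, follows from $ρ = sd (x^{⊤} β) / \sqrt{Var (x^{⊤} β) + σ^{2}}$, which gives $σ = sd (x^{⊤} β) \sqrt{1 / ρ^{2} - 1}$. The sketch below applies this to placeholder covariates; the standard normal covariates are an assumption for illustration and are not the covariate design of Chen et al. [15].

```python
import numpy as np

def sigma_for_correlation(lin_pred, rho):
    """Error scale sigma such that corr(y, lin_pred) = rho when y = lin_pred + sigma * eps,
    with eps of unit variance and independent of lin_pred."""
    return np.std(lin_pred) * np.sqrt(1.0 / rho**2 - 1.0)

rng = np.random.default_rng(0)
N = 20000
x = rng.normal(size=(N, 4))            # placeholder covariates (hypothetical)
lin_pred = 2.0 + x.sum(axis=1)         # 2 + x1 + x2 + x3 + x4
sigma = sigma_for_correlation(lin_pred, rho=0.5)
y = lin_pred + sigma * rng.normal(size=N)
rho_hat = np.corrcoef(y, lin_pred)[0, 1]
print(sigma, rho_hat)                  # rho_hat should be close to 0.5
```

For ρ = 0.5 the rule reduces to σ = √3 times the standard deviation of the linear predictor, so the error term carries three quarters of the variance of y.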

We consider four model specification scenarios:

TT: Both δ and M are correctly specified.
TF: δ is correctly specified, but M is misspecified, with x_4i omitted from the model.
FT: M is correctly specified, but δ is misspecified, with x_4i omitted from the model.
FF: Both models are misspecified, with x_4i omitted from each model.

The analysis uses a nonprobability sample S_A of size n_A = 500 and a probability sample S_B of size n_B = 1000. Table 1 reports %RB and RRMSE for the proposed finite distribution function estimators. Under TT, all estimators exhibit low bias and error, demonstrating stable performance. Under TF and FT, the DR estimator attains lower bias and error than the alternatives, highlighting the advantages of the doubly robust property. By contrast, under FF, performance deteriorates markedly for all estimators.

Table 1. %RB and RRMSE of the finite distribution function estimators (Study 1).

Table 2 compares the bootstrap variance estimators in terms of %RB and %CP_v. Under TT, all variance estimators perform satisfactorily. Under TF and FT, despite model misspecification, the variance estimator associated with the DR method retains low bias and a %CP_v close to 95%, indicating stable reliability and accuracy. Conversely, under FF, coverage performance deteriorates markedly across all methods.

Table 2. %RB and %CP_v of bootstrap variance estimators (Study 1).

Table 3 summarizes the results for the quantile estimators. Mirroring the findings for the finite distribution function estimators, all methods perform well under the TT scenario. Under TF and FT, the DR-based quantiles remain stable, confirming the robustness of the doubly robust approach. By contrast, under FF, overall estimation accuracy deteriorates.

Table 3. %RB and RRMSE of quantile estimators (Study 1).

Table 4 reports the Woodruff CI results for the quantile estimators, including %CP_ξ, %L, and %U. Consistent with previous findings, all methods perform well under the TT scenario. Under TF and FT, the DR-based intervals maintain a %CP_ξ close to the nominal 95% with balanced tail errors, indicating high reliability. By contrast, under FF, coverage deteriorates substantially across methods. %CP_ξ falls below the nominal level and both tail error rates increase, signaling degraded interval performance.

Table 4. %CP_ξ, %L, and %U of Woodruff confidence intervals (Study 1).

5.2. Study 2

In the second simulation study, we treat the 2023 Korean Survey of Household Finances and Living Conditions (N = 16,730) as the finite population and repeatedly draw subsamples from it. Table 5 summarizes the key variables used in the experiment and their definitions.

Table 5. Variables and definitions from the Korean Survey of Household Finances and Living Conditions (2023).

The nonprobability sample S_A was generated to mimic structures commonly observed in practice. The propensity score model was specified as a logistic regression,

\log \{\frac{π_{i}^{A}}{1 - π_{i}^{A}}\} = ζ_{0} + ζ_{1} EDU + ζ_{2} SNG + ζ_{3} APT + ζ_{4} DEBT, i = 1, \dots, N,

where ζ₀ was chosen so that $\sum_{i = 1}^{N} π_{i}^{A} = n_{A}$. Under this design, households with higher educational attainment of the household head, non-single households, apartment residents, and households without debt were more likely to be included in S_A. The nonprobability sample S_A was then selected by Poisson sampling with inclusion probabilities $π_{i}^{A}$.
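The calibration of ζ₀ and the Poisson selection step can be sketched as follows. Because the slope coefficients and the variable coding are not reported in this section, the sketch takes the non-intercept part of the linear predictor as a given array and fills it with placeholder values; the bisection solver and all names are assumptions introduced for illustration.

```python
import numpy as np

def solve_intercept(lin_pred, n_a, lo=-50.0, hi=50.0, n_iter=200):
    """Bisection for zeta_0 such that the propensities sum to n_a.
    lin_pred holds zeta_1*EDU + ... + zeta_4*DEBT for every population unit."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        total = np.sum(1.0 / (1.0 + np.exp(-(mid + lin_pred))))
        if total < n_a:        # the sum is increasing in the intercept
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def poisson_sample(pi, rng):
    """Poisson sampling: unit i is selected independently with probability pi[i]."""
    return np.flatnonzero(rng.random(pi.size) < pi)

rng = np.random.default_rng(2023)
N, n_a = 16730, 500
lin_pred = 0.5 * rng.normal(size=N)                 # placeholder for the covariate part (hypothetical)
zeta0 = solve_intercept(lin_pred, n_a)
pi_a = 1.0 / (1.0 + np.exp(-(zeta0 + lin_pred)))
s_a = poisson_sample(pi_a, rng)
print(pi_a.sum(), s_a.size)                         # expected size n_a; realized size is random
```

Under Poisson sampling the realized size of S_A varies around n_A from one replication to the next, which is worth keeping in mind when interpreting results stated for n_A = 500.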

The probability sample S_B was stratified into nine strata defined by GEO, HOME, and SIZE. A mixed-allocation scheme—combining Neyman and proportional allocation—was used to determine stratum-specific sample sizes, followed by simple random sampling without replacement within each stratum. The sample sizes were set to n_A = 500 and n_B = 1000.

The study variable of interest was current income (INCOME). Because the true outcome regression model was unknown, we included EXP1 and EXP2—the covariates with comparatively strong explanatory power—as regressors in the working model. This setup allows us to assess the impact of model misspecification on estimation performance and to isolate efficiency gains attributable to the DR estimator. We consider two scenarios regarding the propensity score model:

A: Correctly specified propensity score model.
B: Misspecified propensity score model (excluding SNG and DEBT).

Table 6 reports the results for the distribution function estimators. Overall, the REG estimator performs reasonably well, although its bias and error are somewhat larger at lower quantiles than at middle and upper quantiles, likely reflecting the limited explanatory power of the auxiliary variables and the possible overrepresentation of high-income households. Under Scenario A, the IPW and DR estimators both exhibit low bias and error, confirming the effectiveness of propensity score adjustment. Under Scenario B, the REG estimator is the most stable, while the DR estimator inherits some bias from the misspecified IPW component and thus loses efficiency. In summary, when the propensity score model is correctly specified, the IPW, REG, and DR estimators all yield stable results. When it is misspecified, only the REG and DR estimators perform well, with the REG estimator performing best. These findings highlight that the choice of estimator may depend critically on the availability and explanatory power of the auxiliary variables.
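For readers who want to reproduce summaries of this kind, the two performance measures can be computed from replicated estimates as sketched below. The definitions used here (mean relative error in percent, and root mean squared error divided by the true value) are the conventional ones and are assumed to match those given earlier in the paper; the function names and the toy numbers are illustrative only.

```python
import numpy as np


def percent_relative_bias(estimates, truth):
    """%RB: average of (estimate - truth) / truth over replications, in percent."""
    estimates = np.asarray(estimates, dtype=float)
    return 100.0 * np.mean(estimates - truth) / truth


def relative_rmse(estimates, truth):
    """RRMSE: root mean squared error divided by the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - truth) ** 2)) / truth


# Toy replications of an estimator of F(t) whose true value is 0.25.
toy_estimates = [0.24, 0.26, 0.25, 0.27]
rb = percent_relative_bias(toy_estimates, 0.25)
rrmse = relative_rmse(toy_estimates, 0.25)
```

A large %RB paired with an RRMSE of similar magnitude, as for the misspecified IPW rows of Table 6, indicates that the error is dominated by bias rather than by sampling variability.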

Table 6. %RB and RRMSE of the finite distribution function estimators (Study 2).

Table 7 compares the bootstrap variance estimators in terms of %RB and %CP_v. Consistent with the findings for the finite distribution function estimators, the REG estimator shows degraded variance performance at lower quantiles. The IPW estimator maintains a %CP_v close to the nominal 95% under Scenario A, but its %CP_v declines markedly under Scenario B. The DR estimator achieves both low bias and stable %CP_v across scenarios, indicating reliable variance estimation.

Table 7. %RB and %CP_v of bootstrap variance estimators (Study 2).

Table 8 compares the quantile estimators in terms of %RB and RRMSE. The REG estimator shows substantial bias at lower quantiles, whereas the IPW estimator performs well under Scenario A but deteriorates under Scenario B. The DR estimator maintains moderate bias and error across both scenarios, yielding comparatively stable performance overall.

Table 8. %RB and RRMSE of quantile estimators (Study 2).

Table 9 reports results for the Woodruff confidence intervals of the quantile estimators, including %CP_ξ, %L, and %U. The IPW estimator attains a %CP_ξ close to the nominal 95% under Scenario A, but coverage drops sharply under Scenario B, accompanied by a sharply inflated %U, indicating sensitivity to propensity score misspecification. The REG estimator performs well at the middle and upper quantiles, but shows increased %L at lower quantiles. The DR estimator maintains a stable %CP_ξ across scenarios, with only a slight increase in %U under Scenario B.
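The mechanics behind these three measures can be sketched as follows. The snippet builds a Woodruff-type interval by inverting an estimated distribution function at q ∓ z·se, then tallies coverage and tail misses. It is a simplified stand-in: a plain weighted empirical distribution function replaces the IPW, REG, and DR estimators, the standard error `se_F` is supplied by the caller rather than obtained from the paper's bootstrap, and %L and %U are taken to count true values falling below the lower limit and above the upper limit, respectively. All function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def weighted_quantile(y, w, q):
    """Smallest observed y whose weighted distribution function is at least q."""
    order = np.argsort(y)
    y_sorted = y[order]
    cum = np.cumsum(w[order]) / np.sum(w)
    k = np.searchsorted(cum, q, side="left")
    return y_sorted[min(k, len(y_sorted) - 1)]


def woodruff_interval(y, w, q, se_F, level=0.95):
    """Invert the distribution function at q -/+ z * se_F.

    se_F is an estimated standard error of F_hat evaluated at the
    estimated q-th quantile (assumed to be supplied, e.g. by a bootstrap).
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lo_q = max(q - z * se_F, 0.0)
    hi_q = min(q + z * se_F, 1.0)
    return weighted_quantile(y, w, lo_q), weighted_quantile(y, w, hi_q)


def coverage_summary(intervals, truth):
    """Return (%CP, %L, %U) over replicated intervals for a fixed true quantile."""
    lower, upper = np.asarray(intervals, dtype=float).T
    pct_L = 100.0 * np.mean(truth < lower)   # truth below the lower limit
    pct_U = 100.0 * np.mean(truth > upper)   # truth above the upper limit
    return 100.0 - pct_L - pct_U, pct_L, pct_U


# Toy data: equal weights on the values 1, ..., 100.
y = np.arange(1.0, 101.0)
w = np.ones_like(y)
median_hat = weighted_quantile(y, w, 0.5)
ci = woodruff_interval(y, w, q=0.5, se_F=0.05)
cp, pct_L, pct_U = coverage_summary([(1, 3), (4, 6), (0, 1.5)], truth=2.0)
```

Under this convention, an estimator that overstates the quantile pushes the whole interval above the true value, so the misses accumulate in %L; an estimator that understates it inflates %U instead.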

Table 9. %CP_ξ, %L and %U of Woodruff confidence intervals (Study 2).

In summary, two simulation experiments were conducted to evaluate the proposed methods. First, when both models were correctly specified, all estimators (IPW, REG, and DR) performed well, while under double misspecification, all methods failed to provide reliable results. Second, under single-model misspecification, the DR estimator consistently maintained low bias and stable inference, confirming the robustness of the approach. Third, in the empirical application using the 2023 Korean Survey of Household Finances and Living Conditions, the central role of auxiliary variables was evident, with the DR estimator showing the most reliable performance overall, especially in the lower tail.

6. Conclusions

This study examined methods for estimating the finite population distribution function and quantiles within a data integration framework that combines probability and nonprobability samples. We considered three estimators—IPW, REG, and DR. Our contribution extends DR methodology beyond mean estimation to distributional estimands and establishes that the proposed DR estimator remains asymptotically unbiased for the finite distribution function when either the propensity score model or the outcome regression model is correctly specified. This property makes the DR approach especially valuable in survey practice, where model misspecification is almost unavoidable.

The simulation evidence, based on both a synthetic population and the 2023 Korean Survey of Household Finances and Living Conditions, underscores these advantages. When both models were correct, all estimators performed well. With single-model misspecification, DR retained stable performance, while IPW and REG broke down depending on the misspecified model. Under double misspecification, no method succeeded, highlighting the importance of at least one correctly specified model. In the empirical application using the 2023 Korean survey, the role of auxiliary variables was also evident: when auxiliary variables were strong, both REG and DR improved in accuracy, but when they were weak, the REG estimator showed clear bias and coverage deterioration in the lower tail. Taken together, these findings demonstrate that DR estimation offers a practical safeguard against model misspecification in quantile inference.

Future research might, first of all, extend the evaluation of the proposed estimators to more challenging and realistic survey settings, such as small sample sizes, heavy-tailed outcomes, and highly unbalanced propensity scores. Next, more refined approaches to variance estimation would be valuable, since conventional bootstrap methods are prone to overestimating variability in integrated data settings. This underscores the need for bootstrap methodologies specifically designed for the data integration framework, which properly incorporate both the sampling weights from the probability sample and the inclusion probabilities from the nonprobability sample. Finally, further work could consider relaxing the Missing at Random assumption by exploring methods that address Not Missing at Random mechanisms or undercoverage and by incorporating modern tools such as nonparametric or machine learning approaches with richer auxiliary information.

Author Contributions

Conceptualization and methodology, S.K. and K.-S.K.; software and data curation, D.J.; writing—original draft, S.K.; writing—review and editing, S.K., D.J. and K.-S.K.; supervision and funding acquisition, K.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF), Grant No. RS-2022-NR068754.

Data Availability Statement

The original data presented in the study are openly available at https://mdis.kostat.go.kr/ (accessed on 1 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IPW	Inverse Probability Weighting
REG	Regression
DR	Doubly Robust
OLS	Ordinary Least Squares
CI	Confidence Interval
MAR	Missing at Random
NMAR	Not Missing at Random


Table 1. %RB and RRMSE of the finite distribution function estimators (Study 1).

		$t_{1} = ξ_{0.25}$		$t_{2} = ξ_{0.50}$		$t_{3} = ξ_{0.75}$
Scenario	Estimator	%RB	RRMSE	%RB	RRMSE	%RB	RRMSE
TT	${\hat{F}}_{IPW} (t)$	0.60	0.10	0.30	0.06	0.23	0.03
	${\hat{F}}_{REG} (t)$	0.22	0.08	0.01	0.04	−0.28	0.02
	${\hat{F}}_{DR} (t)$	0.24	0.10	0.01	0.05	0.06	0.03
TF	${\hat{F}}_{IPW} (t)$	0.60	0.10	0.30	0.06	0.23	0.03
	${\hat{F}}_{REG} (t)$	−30.27	0.31	−24.23	0.25	−17.02	0.17
	${\hat{F}}_{DR} (t)$	0.53	0.10	0.25	0.06	0.20	0.03
FT	${\hat{F}}_{IPW} (t)$	−30.06	0.31	−23.91	0.24	−16.88	0.17
	${\hat{F}}_{REG} (t)$	0.22	0.08	0.01	0.04	−0.28	0.02
	${\hat{F}}_{DR} (t)$	0.33	0.09	0.31	0.05	0.09	0.03
FF	${\hat{F}}_{IPW} (t)$	−30.06	0.31	−23.91	0.24	−16.88	0.17
	${\hat{F}}_{REG} (t)$	−30.27	0.31	−24.23	0.25	−17.02	0.17
	${\hat{F}}_{DR} (t)$	−30.01	0.31	−23.88	0.24	−16.86	0.17

Table 2. %RB and %CP_v of bootstrap variance estimators (Study 1).

		$t_{1} = ξ_{0.25}$		$t_{2} = ξ_{0.50}$		$t_{3} = ξ_{0.75}$
Scenario	Estimator	%RB	%CP_v	%RB	%CP_v	%RB	%CP_v
TT	$v_{IPW, BT}$	6.11	95.50	6.98	95.10	7.16	95.40
	$v_{REG, BT}$	3.07	95.00	3.50	94.90	5.51	96.40
	$v_{DR, BT}$	2.39	95.10	3.45	95.50	5.02	96.00
TF	$v_{IPW, BT}$	6.11	95.50	6.98	95.10	7.16	95.40
	$v_{REG, BT}$	6.01	0.60	5.36	0.00	9.52	0.00
	$v_{DR, BT}$	5.04	95.90	5.42	95.00	5.05	95.60
FT	$v_{IPW, BT}$	3.01	2.20	4.01	0.00	8.60	0.00
	$v_{REG, BT}$	3.07	95.00	3.50	94.90	5.51	96.40
	$v_{DR, BT}$	1.78	95.50	3.26	95.80	6.33	96.00
FF	$v_{IPW, BT}$	3.01	2.20	4.01	0.00	8.60	0.00
	$v_{REG, BT}$	6.01	0.60	5.36	0.00	9.52	0.00
	$v_{DR, BT}$	2.98	2.30	3.78	0.00	8.47	0.00

Table 3. %RB and RRMSE of quantile estimators (Study 1).

		$t_{1} = ξ_{0.25}$		$t_{2} = ξ_{0.50}$		$t_{3} = ξ_{0.75}$
Scenario	Estimator	%RB	RRMSE	%RB	RRMSE	%RB	RRMSE
TT	${\hat{ξ}}_{q}^{IPW}$	−1.57	0.18	−0.56	0.06	−0.33	0.04
	${\hat{ξ}}_{q}^{REG}$	−0.31	0.13	−0.04	0.05	0.33	0.03
	${\hat{ξ}}_{q}^{DR}$	−0.94	0.17	−0.21	0.06	−0.21	0.04
TF	${\hat{ξ}}_{q}^{IPW}$	−1.57	0.18	−0.56	0.06	−0.33	0.04
	${\hat{ξ}}_{q}^{REG}$	60.70	0.62	30.24	0.31	22.65	0.23
	${\hat{ξ}}_{q}^{DR}$	−1.40	0.18	−0.45	0.07	−0.35	0.05
FT	${\hat{ξ}}_{q}^{IPW}$	60.60	0.63	29.49	0.30	22.84	0.23
	${\hat{ξ}}_{q}^{REG}$	−0.31	0.13	−0.04	0.05	0.33	0.03
	${\hat{ξ}}_{q}^{DR}$	−0.97	0.15	−0.46	0.05	−0.29	0.04
FF	${\hat{ξ}}_{q}^{IPW}$	60.60	0.63	29.49	0.30	22.84	0.23
	${\hat{ξ}}_{q}^{REG}$	60.70	0.62	30.24	0.31	22.65	0.23
	${\hat{ξ}}_{q}^{DR}$	60.48	0.62	29.48	0.30	22.81	0.23

Table 4. %CP_ξ, %L, and %U of Woodruff confidence intervals (Study 1).

		$t_{1} = ξ_{0.25}$			$t_{2} = ξ_{0.50}$			$t_{3} = ξ_{0.75}$
Scenario	Estimator	%CP_ξ	%L	%U	%CP_ξ	%L	%U	%CP_ξ	%L	%U
TT	${\hat{ξ}}_{q}^{IPW}$	95.70	1.83	2.47	95.37	2.20	2.43	96.37	1.73	1.90
	${\hat{ξ}}_{q}^{REG}$	94.43	3.30	2.27	95.07	3.23	1.70	94.70	4.27	1.03
	${\hat{ξ}}_{q}^{DR}$	94.53	2.47	3.00	94.93	2.43	2.63	96.00	2.10	1.90
TF	${\hat{ξ}}_{q}^{IPW}$	95.70	1.83	2.47	95.37	2.20	2.43	96.37	1.73	1.90
	${\hat{ξ}}_{q}^{REG}$	0.57	99.43	0.00	0.03	99.97	0.00	0.03	99.97	0.00
	${\hat{ξ}}_{q}^{DR}$	95.10	2.10	2.80	95.03	2.47	2.50	95.97	1.87	2.17
FT	${\hat{ξ}}_{q}^{IPW}$	2.53	97.47	0.00	0.03	99.97	0.00	0.03	99.97	0.00
	${\hat{ξ}}_{q}^{REG}$	94.43	3.30	2.27	95.07	3.23	1.70	94.70	4.27	1.03
	${\hat{ξ}}_{q}^{DR}$	94.60	2.40	3.00	95.00	2.23	2.77	95.50	2.13	2.37
FF	${\hat{ξ}}_{q}^{IPW}$	2.53	97.47	0.00	0.03	99.97	0.00	0.03	99.97	0.00
	${\hat{ξ}}_{q}^{REG}$	0.57	99.43	0.00	0.03	99.97	0.00	0.03	99.97	0.00
	${\hat{ξ}}_{q}^{DR}$	2.70	97.30	0.00	0.03	99.97	0.00	0.03	99.97	0.00

Table 5. Variables and definitions from the Korean Survey of Household Finances and Living Conditions (2023).

Variable	Description
INCOME	Current income
EDU	Educational attainment
GEO	Metropolitan status: In metropolitan area (1), Not in metropolitan area (2)
SNG	One-person household: Yes (1), No (2)
APT	Residence in an apartment: Yes (1), No (2)
SIZE	Net floor area: classified into 4 groups by size
HOME	Housing types
DEBT	Any debt held by the household: Yes (1), No (2)
EXP1	Consumption expenditure
EXP2	Non-consumption expenditure

Table 6. %RB and RRMSE of the finite distribution function estimators (Study 2).

		$t_{1} = ξ_{0.25}$		$t_{2} = ξ_{0.50}$		$t_{3} = ξ_{0.75}$
Scenario	Estimator	%RB	RRMSE	%RB	RRMSE	%RB	RRMSE
A	${\hat{F}}_{IPW} (t)$	0.11	0.07	0.02	0.04	0.05	0.03
	${\hat{F}}_{REG} (t)$	−7.80	0.09	−2.60	0.04	0.55	0.02
	${\hat{F}}_{DR} (t)$	0.19	0.06	0.10	0.04	0.15	0.02
B	${\hat{F}}_{IPW} (t)$	15.91	0.18	9.19	0.10	3.67	0.04
	${\hat{F}}_{REG} (t)$	−7.80	0.09	−2.60	0.04	0.55	0.02
	${\hat{F}}_{DR} (t)$	6.48	0.09	3.14	0.05	1.08	0.02

Table 7. %RB and %CP_v of bootstrap variance estimators (Study 2).

		$t_{1} = ξ_{0.25}$		$t_{2} = ξ_{0.50}$		$t_{3} = ξ_{0.75}$
Scenario	Estimator	%RB	%CP_v	%RB	%CP_v	%RB	%CP_v
A	$v_{IPW, BT}$	10.03	96.10	11.95	96.07	7.61	95.67
	$v_{REG, BT}$	16.34	62.37	25.00	88.10	17.29	94.53
	$v_{DR, BT}$	14.67	96.67	16.86	96.67	12.18	95.23
B	$v_{IPW, BT}$	7.40	47.20	9.65	42.50	7.46	66.67
	$v_{REG, BT}$	16.34	62.37	25.00	88.10	17.29	94.53
	$v_{DR, BT}$	13.26	84.90	16.73	88.10	12.47	92.57

Table 8. %RB and RRMSE of quantile estimators (Study 2).

		$t_{1} = ξ_{0.25}$		$t_{2} = ξ_{0.50}$		$t_{3} = ξ_{0.75}$
Scenario	Estimator	%RB	RRMSE	%RB	RRMSE	%RB	RRMSE
A	${\hat{ξ}}_{q}^{IPW}$	−0.18	0.06	−0.14	0.05	−0.28	0.04
	${\hat{ξ}}_{q}^{REG}$	7.57	0.09	2.96	0.04	−0.82	0.03
	${\hat{ξ}}_{q}^{DR}$	−0.35	0.06	−0.23	0.04	−0.47	0.04
B	${\hat{ξ}}_{q}^{IPW}$	−14.06	0.15	−10.80	0.12	−6.94	0.08
	${\hat{ξ}}_{q}^{REG}$	7.57	0.09	2.96	0.04	−0.82	0.03
	${\hat{ξ}}_{q}^{DR}$	−6.32	0.08	−3.91	0.06	−2.12	0.04

Table 9. %CP_ξ, %L and %U of Woodruff confidence intervals (Study 2).

		$t_{1} = ξ_{0.25}$			$t_{2} = ξ_{0.50}$			$t_{3} = ξ_{0.75}$
Scenario	Estimator	%CP_ξ	%L	%U	%CP_ξ	%L	%U	%CP_ξ	%L	%U
A	${\hat{ξ}}_{q}^{IPW}$	96.43	1.67	1.90	96.33	1.47	2.20	95.87	2.13	2.00
	${\hat{ξ}}_{q}^{REG}$	58.43	41.57	0.00	83.67	16.23	0.10	95.67	1.53	2.80
	${\hat{ξ}}_{q}^{DR}$	96.87	1.60	1.53	96.90	1.33	1.77	95.70	1.87	2.43
B	${\hat{ξ}}_{q}^{IPW}$	44.57	0.00	55.43	43.03	0.00	56.97	71.57	0.03	28.40
	${\hat{ξ}}_{q}^{REG}$	58.43	41.57	0.00	83.67	16.23	0.10	95.67	1.53	2.80
	${\hat{ξ}}_{q}^{DR}$	83.60	0.00	16.40	88.70	0.03	11.27	94.37	0.47	5.17

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
