Article

Doubly Robust Estimation of the Finite Population Distribution Function Using Nonprobability Samples

1
Department of Statistics and Data Science, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
2
Social Statistics Bureau, Statistics Korea, 189 Cheongsa-ro, Seo-gu, Daejeon 35208, Republic of Korea
3
Department of Statistics, University of Seoul, 163 Seoulsiripdae-ro, Dongdaemun-gu, Seoul 02504, Republic of Korea
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(19), 3227; https://doi.org/10.3390/math13193227
Submission received: 10 September 2025 / Revised: 2 October 2025 / Accepted: 5 October 2025 / Published: 8 October 2025

Abstract

The growing use of nonprobability samples in survey statistics has motivated research on methodological adjustments that address the selection bias inherent in such samples. Most studies, however, have concentrated on the estimation of the population mean. In this paper, we extend our focus to the finite population distribution function and quantiles, which are fundamental to distributional analysis and inequality measurement. Within a data integration framework that combines probability and nonprobability samples, we propose two estimators, a regression estimator and a doubly robust estimator, and discuss their asymptotic properties. Furthermore, we derive quantile estimators and construct Woodruff confidence intervals using a bootstrap method. Simulation results based on both a synthetic population and the 2023 Korean Survey of Household Finances and Living Conditions demonstrate that the proposed estimators perform stably across scenarios, supporting their applicability to the production of policy-relevant indicators.

1. Introduction

Probability sampling, introduced in Neyman’s seminal work [1], has long been regarded as the gold standard for finite population inference in survey statistics [2]. In contrast, nonprobability samples were historically used mainly in observational studies or as supplementary sources for probability sampling. In recent years, however, probability sampling has faced increasing challenges, including rising costs, incomplete sampling frames, declining response rates, and increasingly strict data privacy regulations [3,4]. These difficulties have renewed interest in nonprobability samples as a practical alternative for population inference [5,6].
A key characteristic of nonprobability samples is their unknown selection mechanism, which, if ignored, can lead to substantial selection bias [7,8,9]. Moreover, the theoretical foundations for inference with nonprobability samples remain insufficiently developed, and standardized criteria for evaluating the reliability of resulting estimates are largely lacking.
These challenges have motivated the development of alternative strategies for improving population inference. Among them, an increasingly prominent approach is data integration, which improves population inference by combining information from different sources. Earlier applications focused on merging two probability samples [10,11]. More recently, the framework has been extended to integrate nonprobability with probability samples to correct for the selection bias inherent in such data.
Building on this framework, subsequent research has proposed several approaches to mitigate selection bias. For example, Elliott and Valliant [12] proposed an approach that models the selection mechanism of the nonprobability sample and applies the inverse of estimated propensity scores as weights. Alternatively, Kim et al. [13] introduced a mass imputation method, which fits an outcome regression model using study variables observed in the nonprobability sample and predicts corresponding values for units in the probability sample. A central limitation of both approaches, however, is that their validity depends entirely on correct model specification.
To overcome these limitations, doubly robust estimation methods have attracted considerable attention because they maintain key asymptotic properties when at least one of the two underlying models is correctly specified [6,14,15,16]. Existing applications of doubly robust estimation, however, have largely concentrated on measures of central tendency, such as the population mean. The mean is intuitive and easily interpretable, but it is highly sensitive to extreme values and may be inadequate when information about specific regions of the distribution is crucial, for instance, in studies of income inequality or health outcomes. By contrast, the finite population distribution function and quantiles provide information across the entire distribution, enabling more comprehensive analyses [17,18].
Within probability sampling theory, a simple design-based estimator of the finite distribution function is the Horvitz–Thompson estimator [19], but it does not incorporate auxiliary information. To exploit such information, Chambers and Dunstan [20] proposed a model-based estimator that is asymptotically model-unbiased, though not design-unbiased. Rao et al. [21] subsequently developed an estimator that is asymptotically unbiased under both the sampling design and the model. Wu and Sitter [22] derived corresponding variance estimators, and Rueda et al. [23] introduced a calibration estimator of the finite distribution function. Extending beyond this probability sample foundation to settings with uncontrolled sample selection, only a few studies have attempted to estimate the finite distribution function and quantiles using nonprobability samples. For example, Castro-Martín et al. [24] developed estimators based on either a selection model or an outcome regression model, whereas Cobo et al. [25] proposed several estimators grounded in inverse probability weighting and calibration techniques.
In this study, we extend the doubly robust estimator for the population mean proposed by Chen et al. [15] to the finite distribution function under a data integration framework. For comparison, we first consider the conventional inverse probability weighting approach, and then propose two estimators: a regression estimator and a doubly robust estimator. The proposed doubly robust estimator is asymptotically unbiased for the finite distribution function if either the propensity score model or the outcome regression model is correctly specified.
The remainder of the paper is structured as follows. Section 2 presents the basic setup and assumptions. Section 3 defines the three estimators of the finite distribution function and analyzes their asymptotic properties. Section 4 applies these estimators to the estimation of population quantiles and describes the construction of Woodruff confidence intervals. Section 5 evaluates the performance of the proposed methods through simulation studies. Finally, Section 6 concludes with the implications of the study and directions for future research.

2. Basic Setup

Consider a finite population $U = \{1, 2, \ldots, N\}$ of size $N$. Each unit $i \in U$ is associated with a vector of auxiliary variables $x_i$ and a study variable $y_i$. The parameter of interest in this study is the finite population distribution function, defined by
$$F_y(t) = \frac{1}{N} \sum_{i \in U} I(y_i \le t), \quad -\infty < t < \infty, \tag{1}$$
where I(A) is the indicator function that equals 1 if the event A is true and 0 otherwise.
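As a concrete illustration of this definition, the following minimal sketch evaluates the finite population distribution function on a toy population. The function name `finite_population_cdf` and the five-unit population are illustrative assumptions, not part of the paper.

```python
import numpy as np

def finite_population_cdf(y, t):
    """Evaluate F_y(t) = (1/N) * sum_i I(y_i <= t) at one or more points t."""
    y = np.asarray(y, dtype=float)
    t = np.atleast_1d(np.asarray(t, dtype=float))
    # Rows index evaluation points, columns index population units.
    return (y[None, :] <= t[:, None]).mean(axis=1)

# Toy population of size N = 5 (hypothetical values).
y_pop = np.array([1.0, 2.0, 2.0, 5.0, 9.0])
F_vals = finite_population_cdf(y_pop, [0.0, 2.0, 6.0])
print(F_vals)  # F_y(0) = 0, F_y(2) = 3/5, F_y(6) = 4/5
```

In practice the full vector of $y_i$ is unobserved, which is why the estimators in the following sections are needed.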
Suppose two samples are drawn from the population. The first is a nonprobability sample $S_A$ of size $n_A$, in which both the auxiliary variables $x_i$ and the study variable $y_i$ are observed for each unit $i \in S_A$. The second is a probability sample $S_B$ of size $n_B$, in which the auxiliary variables $x_i$ are observed along with their inclusion probabilities $\pi_i^B = P(i \in S_B)$ under a given sampling design $p$.
In this setup, the nonprobability sample is not representative of the population, whereas the probability sample does not contain observations on the study variable. Therefore, an integrated approach is required for population inference. This approach relies on common auxiliary variables observed in both samples, which serve as a bridge linking the study variable with the design information. Such a data integration framework is well-established and has been widely applied in prior studies [6,12,15,16,26,27,28]. Following Chen et al. [15], we adopt this framework under the assumption that the two samples are drawn independently, which simplifies the analysis and enhances theoretical validity.
To utilize nonprobability samples for population inference, we assume an implicit probabilistic selection mechanism, with unknown inclusion probabilities. This corresponds to the concept of an unknown probability sample discussed by Wu [29]. Under this assumption, the issue of undercoverage is excluded from the scope of this study.
To formalize the selection mechanism, let the indicator variable $R_i$ be defined as
$$R_i = \begin{cases} 1, & \text{if } i \in S_A, \\ 0, & \text{if } i \notin S_A. \end{cases}$$
Its conditional expectation is given by
$$\pi_i^A = E_\delta[R_i \mid x_i, y_i] = P(R_i = 1 \mid x_i, y_i), \tag{2}$$
where $\delta$ denotes the selection mechanism model. This probability $\pi_i^A$ is commonly referred to as the propensity score in observational studies [30] and as the participation probability in survey sampling [6,31].
We adopt the concept of an unknown probability sample and impose the following assumptions on the propensity score, as introduced in Chen et al. [15].
A1. Given the auxiliary variables $x_i$, the study variable $y_i$ and the selection indicator $R_i$ are conditionally independent.
A2. For all units $i \in U$, $\pi_i^A > 0$.
A3. Given $x_1, x_2, \ldots, x_N$, the selection indicators $R_1, R_2, \ldots, R_N$ are independent.
Assumptions A1 and A2 jointly constitute the strong ignorability condition [30]. Under this condition, $\pi_i^A$ in Equation (2) depends only on the auxiliary variables $x_i$.

3. Estimators of the Finite Distribution Function

In this section, we present three estimators of the finite distribution function defined in Equation (1) within the data integration framework: the inverse probability weighted (IPW) estimator, the regression (REG) estimator, and the doubly robust (DR) estimator. Their statistical properties are investigated under a joint randomization framework that consists of three components [16]:
(i) δ: The selection mechanism for the nonprobability sample;
(ii) p: The probability sampling design;
(iii) M: The outcome regression model for the study variable y.
Specifically, the IPW estimator is analyzed under the δp-framework, the REG estimator under the Mp-framework, and the DR estimator under either the δp- or the Mp-framework, without specification of which one. Notably, all frameworks incorporate the probability sampling design p.
To establish the asymptotic properties of these estimators, we next introduce the regularity conditions required for our theoretical results. Following Chen et al. [15], we adopt conditions C1, C4, C5, and C6, which we reformulate as regularity conditions B1–B4 in this paper by substituting $I(y_i \le t)$ for $y_i$ and $G(t - x_i^\top \beta)$ for $m(x_i, \beta)$. To ensure the validity of the Taylor expansion, we further impose the following additional condition:
B5. The error distribution function $G(t)$ is twice continuously differentiable for all $t$, and its first derivative is denoted by $g(t)$.
In addition, we assume an increasing population framework for asymptotic analysis. That is, we consider a sequence of finite populations indexed by $\nu$, denoted as $\{U_\nu : \nu = 1, 2, \ldots\}$, with corresponding samples $S_{A,\nu}$ and $S_{B,\nu}$. As $\nu \to \infty$, the population size $N_\nu$ and the sample sizes $n_{A,\nu}$ and $n_{B,\nu}$ all diverge to infinity. For simplicity, the subscript $\nu$ is omitted hereafter, and asymptotics are expressed in terms of $N \to \infty$.

3.1. Inverse Probability Weighted Estimator

The IPW method is widely used to correct for selection bias in nonprobability samples, where the inclusion probability $\pi_i^A$ is assumed to be a function of the auxiliary variables $x_i$ and the unknown parameters $\theta$ of a participation model:
$$\pi_i^A = \pi(x_i, \theta).$$
Estimating $\pi_i^A$ directly, however, requires auxiliary information for the entire population, which is rarely available in practice. In response, Chen et al. [15] proposed a pseudo-likelihood approach that incorporates the design weights from a probability sample. The participation model parameters are first estimated as $\hat{\theta}$, which are then used to compute the estimated inclusion probabilities $\hat{\pi}_i^A = \pi(x_i, \hat{\theta})$. The corresponding pseudo-weight is $\hat{d}_i^A = 1/\hat{\pi}_i^A$, which is used to construct a Hájek-type estimator of the finite distribution function. This quasi-randomization approach enables valid inference from nonprobability samples [12,32]. Using the pseudo-weights $\hat{d}_i^A$, the IPW estimator of the finite distribution function at a fixed point $t$ is given by
$$\hat{F}_{\mathrm{IPW}}(t) = \frac{1}{\hat{N}^A} \sum_{i \in S_A} \hat{d}_i^A \, I(y_i \le t), \tag{3}$$
where $\hat{N}^A = \sum_{i \in S_A} \hat{d}_i^A$.
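The two steps described above (estimate $\theta$, then form the Hájek-type estimator) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the participation model is taken to be logistic, and the pseudo-score equation $\sum_{i \in S_A} x_i - \sum_{i \in S_B} d_i^B \pi(x_i, \theta) x_i = 0$ is our recollection of the pseudo-likelihood approach of Chen et al. [15], which this section does not spell out. The function names, the toy population, and all numeric settings are hypothetical.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_participation_model(xA, xB, dB, n_iter=50, tol=1e-10):
    """Newton-Raphson for the assumed logistic pseudo-score
    sum_{S_A} x_i - sum_{S_B} d_i^B * pi(x_i, theta) * x_i = 0."""
    theta = np.zeros(xA.shape[1])
    for _ in range(n_iter):
        pB = expit(xB @ theta)
        score = xA.sum(axis=0) - xB.T @ (dB * pB)
        hess = -(xB * (dB * pB * (1.0 - pB))[:, None]).T @ xB
        step = np.linalg.solve(hess, score)
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

def ipw_cdf(t, yA, xA, theta):
    """Hajek-type IPW estimator of F_y(t) with pseudo-weights 1 / pi_hat."""
    dA = 1.0 / expit(xA @ theta)
    return np.sum(dA * (yA <= t)) / np.sum(dA)

# Hypothetical toy population; design matrices include an intercept column.
rng = np.random.default_rng(0)
N, nB = 5000, 500
x = np.column_stack([np.ones(N), rng.normal(size=N)])
y = 1.0 + x[:, 1] + rng.normal(size=N)
in_A = rng.random(N) < expit(-2.5 + 0.8 * x[:, 1])   # informative selection into S_A
idx_B = rng.choice(N, size=nB, replace=False)        # S_B: simple random sample
dB = np.full(nB, N / nB)                             # design weights d_i^B

theta_hat = fit_participation_model(x[in_A], x[idx_B], dB)
F_ipw = ipw_cdf(1.0, y[in_A], x[in_A], theta_hat)
F_naive = np.mean(y[in_A] <= 1.0)                    # unweighted, ignores selection
print(np.mean(y <= 1.0), F_naive, F_ipw)
```

Printing the population value next to the unweighted and weighted estimates makes the effect of the pseudo-weights visible on a single draw; a proper comparison would repeat the draw many times.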
Theorem 1. Under A1–A3 and B1–B5, and if the propensity score model is correctly specified as a logistic regression model, $\hat{F}_{\mathrm{IPW}}(t)$ is an asymptotically δp-unbiased estimator of $F_y(t)$ for a fixed point $t$.
Proof. The asymptotic property of the IPW estimator for the population mean $\mu_y$, denoted as $\hat{\mu}_y = (\hat{N}^A)^{-1} \sum_{i \in S_A} y_i / \hat{\pi}_i^A$, is given by Chen et al. [15] as $\hat{\mu}_y - \mu_y = O_p(n_A^{-1/2})$. The finite distribution function estimator $\hat{F}_{\mathrm{IPW}}(t)$ can be viewed as a plug-in estimator with $y_i$ replaced by $I(y_i \le t)$. Therefore,
$$\hat{F}_{\mathrm{IPW}}(t) - F_y(t) = O_p(n_A^{-1/2}), \quad -\infty < t < \infty.$$
Moreover, since both $\hat{F}_{\mathrm{IPW}}(t)$ and $F_y(t)$ lie in $[0, 1]$, we have $|\hat{F}_{\mathrm{IPW}}(t) - F_y(t)| \le 1$. Hence,
$$E_{\delta p}\left[\hat{F}_{\mathrm{IPW}}(t) - F_y(t)\right] = O(n_A^{-1/2}).$$
Thus, $\hat{F}_{\mathrm{IPW}}(t)$ is an asymptotically δp-unbiased estimator of the finite distribution function $F_y(t)$.    □
The IPW estimator is asymptotically unbiased under the assumption of strong ignorability, that is, when the selection mechanism is fully explained by the auxiliary variables. However, if the propensity score model is misspecified, asymptotic unbiasedness may fail. Moreover, even under correct specification, extreme propensity score values (close to 0 or 1) may yield highly unstable inverse probability weights, inflating the variance of the estimator [33].

3.2. Regression Estimator

The REG estimator is obtained by fitting a regression model of yi on xi using the nonprobability sample SA and then applying the fitted model to the probability sample SB to estimate the finite distribution function. Since this procedure imputes the study variable for all units in SB, it is commonly referred to as mass imputation.
This approach relies on a superpopulation framework, where the finite population $U = \{1, 2, \ldots, N\}$ is regarded as a random sample drawn from the following outcome regression model:
$$y_i = x_i^\top \beta + \epsilon_i, \quad i = 1, 2, \ldots, N, \tag{4}$$
where the error terms $\epsilon_i$ are independent, with $E_M(\epsilon_i) = 0$ and $\mathrm{Var}_M(\epsilon_i) = \sigma^2$.
The subscript $M$ indicates that the corresponding expectation or variance is taken under the model (4). By Assumption A1, we have $E_M[y_i \mid x_i, R_i] = E_M[y_i \mid x_i]$ and $\mathrm{Var}_M[y_i \mid x_i, R_i] = \mathrm{Var}_M[y_i \mid x_i]$, ensuring that the regression model remains valid for the nonprobability sample.
A straightforward model-based approach to estimating the finite distribution function is to substitute $I(y_i \le t)$ with the plug-in indicator $I(x_i^\top \hat{\beta} \le t)$. Since $E_M[I(x_i^\top \hat{\beta} \le t)] \ne P(y_i < t)$ in general, this naive substitution may result in bias [17]. To address this issue, Chambers and Dunstan [20] introduced a model-based estimator constructed from regression residuals.
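Let $G(\cdot)$ denote the distribution function of $\epsilon_i$. For a unit $i$, the estimator of $G(t - x_i^\top \beta)$ is defined as
$$\hat{G}_i^{\mathrm{REG}}(t) = \frac{1}{n_A} \sum_{j \in S_A} I\left(y_j - x_j^\top \hat{\beta} \le t - x_i^\top \hat{\beta}\right), \tag{5}$$
where $\hat{\beta}$ is the ordinary least-squares (OLS) estimator of $\beta$.
Based on Equation (5), the regression estimator of $F_y(t)$ at a fixed point $t$ is given by
$$\hat{F}_{\mathrm{REG}}(t) = \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, \hat{G}_i^{\mathrm{REG}}(t), \tag{6}$$
where $d_i^B = 1/\pi_i^B$ is the design weight of unit $i$ in the probability sample and $\hat{N}^B = \sum_{i \in S_B} d_i^B$.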
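Equations (5) and (6) can be sketched in a few lines. The code below is an illustrative sketch, assuming design matrices that already contain an intercept column; the function name `reg_cdf` and the toy inputs are hypothetical.

```python
import numpy as np

def reg_cdf(t, yA, xA, xB, dB):
    """Regression (mass imputation) estimator of F_y(t).

    OLS is fitted on S_A; for each unit i in S_B the empirical distribution
    of the S_A residuals is evaluated at t - x_i' beta_hat and then averaged
    with the design weights d_i^B.
    """
    beta = np.linalg.lstsq(xA, yA, rcond=None)[0]
    resid_sorted = np.sort(yA - xA @ beta)
    thresholds = t - xB @ beta
    # side="right" counts the residuals that are <= each threshold.
    G_hat = np.searchsorted(resid_sorted, thresholds, side="right") / resid_sorted.size
    return np.sum(dB * G_hat) / np.sum(dB)

# Toy check (hypothetical data): y is exactly linear in x, so residuals are ~0.
xA = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
yA = np.array([0.0, 1.0, 2.0])
xB = np.array([[1.0, 0.0], [1.0, 3.0]])
dB = np.array([1.0, 3.0])
F_reg = reg_cdf(1.5, yA, xA, xB, dB)
print(F_reg)  # unit with x = 0 contributes G = 1, unit with x = 3 contributes G = 0
```

Sorting the residuals once and using `searchsorted` avoids forming the full $n_B \times n_A$ indicator matrix, which matters when both samples are large.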
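Theorem 2. Under A1–A3 and B1–B5, $\hat{F}_{\mathrm{REG}}(t)$ is an asymptotically Mp-unbiased estimator of $F_y(t)$ for a fixed point $t$.
Proof. Following Chambers et al. [34], we can show that under conditions B1–B5,
$$\begin{aligned}
E_M\left[\hat{G}_i^{\mathrm{REG}}(t)\right]
&= \frac{1}{n_A} \sum_{j \in S_A} E_M\left[ I\left(\epsilon_j \le t_i + (x_j - x_i)^\top (\hat{\beta} - \beta)\right) \right] \\
&= \frac{1}{n_A} \sum_{j \in S_A} E_{\Delta_{(j)}}\left[ G\left(t_i + (x_j - x_i)^\top \Delta_{(j)}\right) \right] + O(n_A^{-1}) \\
&= \frac{1}{n_A} \sum_{j \in S_A} E_{\Delta_{(j)}}\left[ G(t_i) + g(t_i)\,(x_j - x_i)^\top \Delta_{(j)} + \frac{1}{2}\, g'(t_i) \left\{(x_j - x_i)^\top \Delta_{(j)}\right\}^2 \right] + O(n_A^{-1}) \\
&= G(t_i) + O(n_A^{-1}),
\end{aligned}$$
where $t_i = t - x_i^\top \beta$, $\Delta_{(j)} = \hat{\beta}_{(j)} - \beta$, and $\hat{\beta}_{(j)}$ denotes the leave-one-out OLS estimator based on $S_A \setminus \{j\}$.
Using this result, the expectation of the bias can be evaluated as
$$\begin{aligned}
E\left[\hat{F}_{\mathrm{REG}}(t) - F_y(t)\right]
&= E_p E_M\left[\hat{F}_{\mathrm{REG}}(t) - F_y(t)\right] \\
&= E_p\left[ \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, G(t_i) - \frac{1}{N} \sum_{i=1}^{N} G(t_i) \right] + O(n_A^{-1}).
\end{aligned}$$
The first term after the second equality converges to zero, because the Hájek estimator $(\hat{N}^B)^{-1} \sum_{i \in S_B} d_i^B G(t_i)$ is asymptotically p-unbiased for $N^{-1} \sum_{i=1}^{N} G(t_i)$. Therefore, $\hat{F}_{\mathrm{REG}}(t)$ is an asymptotically Mp-unbiased estimator of the finite population distribution function $F_y(t)$.    □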
When the outcome regression model is correctly specified, the REG estimator is highly efficient and supports broader use of nonprobability samples. However, if the regression model fails to capture the true distribution, bias may arise, and the method becomes sensitive to misspecification. The next section introduces the DR estimator, which remains asymptotically unbiased, provided that either the propensity score model or the outcome regression model is correctly specified.

3.3. Doubly Robust Estimator

The asymptotic unbiasedness of the IPW estimator in Equation (3) and the REG estimator in Equation (6) relies on the correct specification of their respective working models. In practice, however, ensuring such correctness is often difficult, which motivates the development of procedures that remain valid under model misspecification. The DR estimator was introduced to address this issue and has been regarded as a successful approach since Robins et al. [35].
To construct a DR estimator for the finite distribution function, we require an analog of $\hat{G}_i^{\mathrm{REG}}(t)$ in Equation (5) that estimates the error distribution $G(\cdot)$ and remains valid under joint randomization. Because $\hat{G}_i^{\mathrm{REG}}(t)$ is derived under the Mp-framework, it cannot be directly applied under the δp-framework. Accordingly, we extend the method of Rao et al. [21] and propose a new estimator of the error distribution that is valid under such joint randomization:
$$\hat{G}_i^{\mathrm{DR}}(t) = \frac{1}{\hat{N}^A} \sum_{j \in S_A} \hat{d}_j^A \, I\left(y_j - x_j^\top \hat{\beta} \le t - x_i^\top \hat{\beta}\right). \tag{7}$$
Based on $\hat{G}_i^{\mathrm{DR}}(t)$, defined in Equation (7), the DR estimator of the finite distribution function $F_y(t)$ at a fixed point $t$ is then given by
$$\hat{F}_{\mathrm{DR}}(t) = \frac{1}{\hat{N}^A} \sum_{i \in S_A} \hat{d}_i^A \left\{ I(y_i \le t) - \hat{G}_i^{\mathrm{DR}}(t) \right\} + \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, \hat{G}_i^{\mathrm{DR}}(t).$$
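The DR estimator combines a pseudo-weighted residual distribution with the two weighted averages above. The sketch below is illustrative only: it takes the pseudo-weights $\hat{d}_i^A$ as given (however they were estimated), uses unweighted OLS for $\hat{\beta}$ as in Section 3.2, and the name `dr_cdf` and the toy data are hypothetical.

```python
import numpy as np

def dr_cdf(t, yA, xA, dA, xB, dB):
    """Doubly robust estimator of F_y(t) from pseudo-weights dA and design weights dB."""
    beta = np.linalg.lstsq(xA, yA, rcond=None)[0]
    resid = yA - xA @ beta
    order = np.argsort(resid)
    resid_sorted = resid[order]
    # Weighted ECDF of residuals; entry k is the weight share of the k smallest.
    cum_w = np.concatenate(([0.0], np.cumsum(dA[order]) / dA.sum()))

    def G_dr(x):
        # Equation (7): weighted share of residuals <= t - x' beta_hat.
        k = np.searchsorted(resid_sorted, t - x @ beta, side="right")
        return cum_w[k]

    bias_correction = np.sum(dA * ((yA <= t) - G_dr(xA))) / dA.sum()
    prediction = np.sum(dB * G_dr(xB)) / dB.sum()
    return bias_correction + prediction

# Hypothetical toy data for a structural sanity check.
rng = np.random.default_rng(1)
xA = np.column_stack([np.ones(30), rng.normal(size=30)])
yA = xA @ np.array([1.0, 2.0]) + rng.normal(size=30)
dA = rng.uniform(5.0, 50.0, size=30)

# If S_B coincided with S_A (same units, same weights), the two G terms
# would cancel and the DR estimator would reduce to the IPW estimator.
F_dr_same = dr_cdf(1.0, yA, xA, dA, xA, dA)
F_ipw = np.sum(dA * (yA <= 1.0)) / dA.sum()
print(F_dr_same, F_ipw)
```

Note that, unlike the IPW and REG estimators, this form is not guaranteed to be monotone in $t$ or to stay within $[0, 1]$ in finite samples, which is worth checking before inverting it for quantiles.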
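Theorem 3. Under regularity conditions A1–A3 and B1–B5, if at least one of the models, the propensity score model or the outcome regression model, is correctly specified, $\hat{F}_{\mathrm{DR}}(t)$ is an asymptotically unbiased estimator of $F_y(t)$ at a fixed point $t$ under the δp- or Mp-framework.
Proof.
(i) When the propensity score model is correctly specified
The doubly robust estimator can be rewritten as
$$\hat{F}_{\mathrm{DR}}(t) = \hat{F}_{\mathrm{IPW}}(t) - \frac{1}{\hat{N}^A} \sum_{i \in S_A} \hat{d}_i^A \, \hat{G}_i^{\mathrm{DR}}(t) + \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, \hat{G}_i^{\mathrm{DR}}(t).$$
The second and third terms on the right-hand side are Hájek estimators of $N^{-1} \sum_{i=1}^{N} \hat{G}_i^{\mathrm{DR}}(t)$ based on the nonprobability sample $S_A$ and the probability sample $S_B$, respectively, and therefore cancel out asymptotically. Given the asymptotic δp-unbiasedness of $\hat{F}_{\mathrm{IPW}}(t)$, $\hat{F}_{\mathrm{DR}}(t)$ is also asymptotically δp-unbiased.
(ii) When the outcome regression model is correctly specified
Similarly to the proof for the REG estimator, we can show
$$\begin{aligned}
E_M\left[\hat{G}_i^{\mathrm{DR}}(t)\right]
&= \frac{1}{\hat{N}^A} \sum_{j \in S_A} \hat{d}_j^A \, E_M\left[ I\left(\epsilon_j \le t_i + (x_j - x_i)^\top (\hat{\beta} - \beta)\right) \right] \\
&= \frac{1}{\hat{N}^A} \sum_{j \in S_A} \hat{d}_j^A \, E_{\Delta_{(j)}}\left[ G\left(t_i + (x_j - x_i)^\top \Delta_{(j)}\right) \right] + O(n_A^{-1}) \\
&= \frac{1}{\hat{N}^A} \sum_{j \in S_A} \hat{d}_j^A \, E_{\Delta_{(j)}}\left[ G(t_i) + g(t_i)\,(x_j - x_i)^\top \Delta_{(j)} + \frac{1}{2}\, g'(t_i) \left\{(x_j - x_i)^\top \Delta_{(j)}\right\}^2 \right] + O(n_A^{-1}) \\
&= G(t_i) + O(n_A^{-1}).
\end{aligned}$$
It follows that the expected bias is
$$\begin{aligned}
E\left[\hat{F}_{\mathrm{DR}}(t) - F_y(t)\right]
&= E_p\left[ \frac{1}{\hat{N}^A} \sum_{i \in S_A} \hat{d}_i^A \, E_M\left\{ I(y_i \le t) - \hat{G}_i^{\mathrm{DR}}(t) \right\} + \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, E_M\left\{ \hat{G}_i^{\mathrm{DR}}(t) \right\} - \frac{1}{N} \sum_{i \in U} E_M\left\{ I(y_i \le t) \right\} \right] \\
&= E_p\left[ \frac{1}{\hat{N}^B} \sum_{i \in S_B} d_i^B \, G(t_i) - \frac{1}{N} \sum_{i=1}^{N} G(t_i) \right] + O(n_A^{-1}),
\end{aligned}$$
where the second equality uses $E_M\{I(y_i \le t)\} = G(t_i)$ under the model. The first term after the second equality converges to zero, because the Hájek estimator based on $S_B$ is asymptotically p-unbiased for $N^{-1} \sum_{i=1}^{N} G(t_i)$. Therefore, under the Mp-framework, $\hat{F}_{\mathrm{DR}}(t)$ is an asymptotically unbiased estimator of the finite population distribution function $F_y(t)$.  □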
The asymptotic unbiasedness of the DR estimator requires the estimated coefficients to satisfy probability-limit conditions; specifically, for the propensity score model parameters $\hat{\theta}$ and the outcome regression model parameters $\hat{\beta}$, there exist fixed vectors $\theta^*$ and $\beta^*$ such that $\operatorname{plim} \hat{\theta} = \theta^*$ and $\operatorname{plim} \hat{\beta} = \beta^*$. If the propensity score model is correctly specified, then $\theta^*$ equals the true parameter $\theta$, and if the outcome regression model is correctly specified, then $\beta^*$ equals the true parameter $\beta$. Under misspecification, by contrast, these probability limits need not coincide with the true parameters, and the limiting value may not have a meaningful interpretation.

4. Quantile Estimation

An important application of the finite distribution function estimators is the estimation of population quantiles, defined as
$$\xi_q = \inf\{t \,;\, F_y(t) \ge q\}.$$
Quantiles provide informative summaries of distributional features, including central tendency, spread, and asymmetry, and are also useful for assessing outliers. Because estimators of the finite distribution function are typically step functions, linear interpolation is employed to obtain a unique estimate of the $q$th quantile [21,36,37]. The quantile estimator $\hat{\xi}_q$ is expressed as
$$\hat{\xi}_q = a + \frac{q - \hat{F}(a)}{\hat{F}(b) - \hat{F}(a)}\,(b - a),$$
where $a = \max\{t \,;\, \hat{F}(t) \le q\}$ and $b = \min\{t \,;\, \hat{F}(t) \ge q\}$.
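The interpolation rule can be sketched as below. The code assumes the estimated distribution function is supplied as sorted jump points with strictly increasing values, and it adds two conventions that the text does not specify: it returns the jump point itself when $\hat{F}$ equals $q$ exactly there (so that $a = b$), and it clips to the first or last jump point when $q$ lies outside the range of $\hat{F}$. The function name is hypothetical.

```python
import numpy as np

def interpolated_quantile(q, points, cdf_values):
    """Linearly interpolated q-th quantile of a step-function CDF estimate.

    points: sorted jump points; cdf_values: F_hat at those points
    (assumed strictly increasing).
    """
    points = np.asarray(points, dtype=float)
    F = np.asarray(cdf_values, dtype=float)
    k = np.searchsorted(F, q, side="left")   # first index with F[k] >= q
    if k == 0:
        return points[0]                     # q below the range of F_hat (assumed convention)
    if k == len(F):
        return points[-1]                    # q above the range of F_hat (assumed convention)
    if F[k] == q:
        return points[k]                     # a == b, no interpolation needed
    a, b = points[k - 1], points[k]
    return a + (q - F[k - 1]) / (F[k] - F[k - 1]) * (b - a)

pts = [1.0, 2.0, 5.0, 9.0]
F_hat = [0.2, 0.6, 0.8, 1.0]
median = interpolated_quantile(0.5, pts, F_hat)
print(median)  # 1 + (0.5 - 0.2) / (0.6 - 0.2) * (2 - 1) = 1.75
```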
A widely used method for constructing a confidence interval (CI) for a quantile estimator was proposed by Woodruff [36]. The key idea is to first construct a CI for the estimated finite distribution function and then invert this interval to derive a CI for the quantile. The resulting $100(1 - \alpha)\%$ CI is given by
$$\hat{\xi}_q^{L} = \inf\left\{t \,;\, \hat{F}(t) \ge q - z_{1-\alpha/2} \sqrt{\hat{V}[\hat{F}(\hat{\xi}_q)]}\right\}, \qquad \hat{\xi}_q^{U} = \inf\left\{t \,;\, \hat{F}(t) \ge q + z_{1-\alpha/2} \sqrt{\hat{V}[\hat{F}(\hat{\xi}_q)]}\right\},$$
where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$ quantile of the standard normal distribution and $\hat{V}[\hat{F}(\hat{\xi}_q)]$ denotes the estimated variance of $\hat{F}(t)$ evaluated at $\hat{\xi}_q$. Sitter and Wu [38] provided empirical evidence that the Woodruff method achieves approximately correct coverage even for extreme quantiles (large or small $q$).
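The inversion step can be sketched as follows, assuming the estimated distribution function is given as sorted jump points with nondecreasing values and that a variance estimate at $\hat{\xi}_q$ is already available (for example, from a bootstrap). Clipping the bounds to the first or last jump point when the target level falls outside the range of $\hat{F}$ is our convention, and the function name is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def woodruff_ci(q, points, cdf_values, var_F, alpha=0.05):
    """Woodruff interval: invert F_hat at q -/+ z * sqrt(var_F)."""
    points = np.asarray(points, dtype=float)
    F = np.asarray(cdf_values, dtype=float)
    half_width = NormalDist().inv_cdf(1.0 - alpha / 2.0) * np.sqrt(var_F)

    def inverse(p):
        # inf{t : F_hat(t) >= p}, clipped to the observed jump points.
        k = np.searchsorted(F, p, side="left")
        return points[min(k, len(points) - 1)]

    return inverse(q - half_width), inverse(q + half_width)

# Toy step function with jumps of 0.01 at t = 1, ..., 100 (hypothetical).
pts = np.arange(1, 101, dtype=float)
F_hat = pts / 100.0
lower, upper = woodruff_ci(0.5, pts, F_hat, var_F=0.05 ** 2)
print(lower, upper)  # levels 0.5 -/+ 1.96 * 0.05 map to t = 41 and t = 60
```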

5. Simulation Studies

To evaluate the performance of the finite distribution function estimators, $\hat{F}_{\mathrm{IPW}}(t)$, $\hat{F}_{\mathrm{REG}}(t)$, and $\hat{F}_{\mathrm{DR}}(t)$, we conducted simulation studies based on two populations: (i) a synthetic finite population from Chen et al. [15] and (ii) the 2023 Korean Survey of Household Finances and Living Conditions.
The variances of the finite distribution function estimators were obtained via a bootstrap procedure following Chen et al. [15]:
  • From the nonprobability sample $S_A$ and the probability sample $S_B$, draw bootstrap samples $S_A^{(j)}$ and $S_B^{(j)}$ of sizes $n_A$ and $n_B$, respectively, by simple random sampling with replacement, for each of $J = 1000$ replicates ($j = 1, \ldots, J$).
  • For each bootstrap replicate, compute $\hat{F}^{(j)}(t)$.
  • Using $\{\hat{F}^{(j)}(t)\}_{j=1}^{J}$, calculate the bootstrap variance estimator $v_{\mathrm{BT}}$.
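The three steps above can be sketched generically. The code below is an illustration, not the authors' implementation: samples are represented as dictionaries of equal-length arrays, the estimator is any function of the two resampled dictionaries (so that all model fitting is repeated within each replicate), and the variance uses the $1/(J-1)$ divisor, which is an assumption since the text does not give the formula for $v_{\mathrm{BT}}$.

```python
import numpy as np

def bootstrap_variance(estimator, sample_A, sample_B, J=1000, seed=0):
    """Bootstrap variance of estimator(sample_A, sample_B).

    Each sample is a dict of equal-length NumPy arrays; S_A and S_B are
    resampled independently, with replacement, at their original sizes.
    """
    rng = np.random.default_rng(seed)
    nA = len(next(iter(sample_A.values())))
    nB = len(next(iter(sample_B.values())))
    reps = np.empty(J)
    for j in range(J):
        ia = rng.integers(0, nA, size=nA)
        ib = rng.integers(0, nB, size=nB)
        boot_A = {key: val[ia] for key, val in sample_A.items()}
        boot_B = {key: val[ib] for key, val in sample_B.items()}
        reps[j] = estimator(boot_A, boot_B)
    return reps.var(ddof=1)   # assumed divisor J - 1

# Hypothetical usage: unweighted proportion of y <= 49.5 in S_A.
sample_A = {"y": np.arange(100.0)}
sample_B = {"d": np.ones(10)}
v_bt = bootstrap_variance(lambda A, B: np.mean(A["y"] <= 49.5), sample_A, sample_B)
print(v_bt)  # should be near 0.5 * 0.5 / 100 = 0.0025
```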
Performance was evaluated over $R = 3000$ simulation replications using percentage relative bias (%RB) and relative root mean squared error (RRMSE), where
$$\%\mathrm{RB} = \frac{1}{R} \sum_{r=1}^{R} \frac{\hat{\theta}^{(r)} - \theta}{\theta} \times 100, \qquad \mathrm{RRMSE} = \frac{1}{\theta} \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left(\hat{\theta}^{(r)} - \theta\right)^2},$$
with $\hat{\theta}^{(r)}$ denoting the estimate from replication $r$ and $\theta$ the target parameter. For the finite distribution function, the bootstrap variance, and quantiles, the corresponding choices were
  • The finite distribution function: $\hat{\theta}^{(r)} = \hat{F}^{(r)}(t)$, $\theta = F_y(t)$,
  • Bootstrap variance: $\hat{\theta}^{(r)} = v_{\mathrm{BT}}^{(r)}$, $\theta = V$,
  • Quantile: $\hat{\theta}^{(r)} = \hat{\xi}_q^{(r)}$, $\theta = \xi_q$,
where $V$ denotes the simulation-based variance of $\hat{F}(t)$ computed from 10,000 replications.
The coverage probability of the CI based on the bootstrap variance (%CP$_v$) was evaluated as
$$\%\mathrm{CP}_v = \frac{1}{R} \sum_{r=1}^{R} I\left( \left|\hat{F}^{(r)}(t) - F_y(t)\right| \le z_{1-\alpha/2} \sqrt{v_{\mathrm{BT}}^{(r)}} \right).$$
The performance of the Woodruff CI was assessed by its coverage probability (%CP$_\xi$), lower error rate (%L), and upper error rate (%U):
$$\%\mathrm{CP}_\xi = \frac{1}{R} \sum_{r=1}^{R} I\left( \xi_{qL}^{(r)} < \xi_q < \xi_{qU}^{(r)} \right), \qquad \%\mathrm{L} = \frac{1}{R} \sum_{r=1}^{R} I\left( \xi_q < \xi_{qL}^{(r)} \right), \qquad \%\mathrm{U} = \frac{1}{R} \sum_{r=1}^{R} I\left( \xi_{qU}^{(r)} < \xi_q \right),$$
where $\xi_{qL}^{(r)}$ and $\xi_{qU}^{(r)}$ denote, respectively, the lower and upper Woodruff CI bounds for the $q$th quantile in replication $r$.
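These summary measures are simple averages over replications and can be sketched directly. The function names and the toy inputs below are hypothetical; the rates are returned as proportions, matching the formulas as written, and would be multiplied by 100 to report percentages.

```python
import numpy as np

def percent_rb(estimates, truth):
    """Percentage relative bias over replications."""
    estimates = np.asarray(estimates, dtype=float)
    return 100.0 * np.mean((estimates - truth) / truth)

def rrmse(estimates, truth):
    """Relative root mean squared error over replications."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - truth) ** 2)) / truth

def interval_rates(lower, upper, truth):
    """Coverage, lower error rate, and upper error rate of a set of CIs."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    coverage = np.mean((lower < truth) & (truth < upper))
    lower_error = np.mean(truth < lower)    # interval lies entirely above the truth
    upper_error = np.mean(upper < truth)    # interval lies entirely below the truth
    return coverage, lower_error, upper_error

# Toy inputs: two replications around a true value of 1.
rb = percent_rb([0.9, 1.1], 1.0)
err = rrmse([0.9, 1.1], 1.0)
cp, le, ue = interval_rates([0.0, 2.0], [3.0, 4.0], 1.0)
print(rb, err, cp, le, ue)
```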

5.1. Study 1

Following the simulation design of Chen et al. [15], we generated a finite population of size $N = 20{,}000$. The study variable $y$ and auxiliary variables $x$ were generated from
$$y_i = 2 + x_{1i} + x_{2i} + x_{3i} + x_{4i} + \sigma \epsilon_i, \quad i = 1, 2, \ldots, N,$$
where $(x_{1i}, x_{2i}, x_{3i}, x_{4i})$ follow the design in Chen et al. [15], and the error terms $\epsilon_i \sim N(0, 1)$. The parameter $\sigma$ was chosen such that the correlation coefficient $\rho$ between $y$ and the linear predictor $x^\top \beta$ equals 0.5.
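The calibration of $\sigma$ follows from $\rho = \mathrm{sd}(x^\top\beta) / \sqrt{\mathrm{Var}(x^\top\beta) + \sigma^2}$ when the errors are independent of the covariates, which gives $\sigma = \mathrm{sd}(x^\top\beta)\sqrt{1/\rho^2 - 1}$. The sketch below applies this formula; because the covariate design of Chen et al. [15] is not reproduced in this paper, the four covariates are replaced by independent standard normal placeholders, which is purely an assumption for illustration.

```python
import numpy as np

def sigma_for_correlation(linear_predictor, rho):
    """Error scale giving corr(y, linear predictor) = rho for y = lp + sigma * eps."""
    return np.std(linear_predictor) * np.sqrt(1.0 / rho ** 2 - 1.0)

rng = np.random.default_rng(0)
N = 20_000
X = rng.normal(size=(N, 4))          # placeholder covariates, NOT the design in [15]
lp = 2.0 + X.sum(axis=1)             # linear predictor 2 + x1 + x2 + x3 + x4
sigma = sigma_for_correlation(lp, rho=0.5)
y = lp + sigma * rng.normal(size=N)

print(sigma, np.corrcoef(y, lp)[0, 1])  # empirical correlation should be near 0.5
```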
We consider four model specification scenarios:
  • TT: Both δ and M are correctly specified.
  • TF: δ is correctly specified, but M is misspecified, with x4i omitted from the model.
  • FT: M is correctly specified, but δ is misspecified, with x4i omitted from the model.
  • FF: Both models are misspecified, with x4i omitted from each model.
The analysis uses a nonprobability sample SA of size nA = 500 and a probability sample SB of size nB = 1000. Table 1 reports %RB and RRMSE for the proposed finite distribution function estimators. Under TT, all estimators exhibit low bias and error, demonstrating stable performance. Under TF and FT, the DR estimator attains lower bias and error than the alternatives, highlighting the advantages of the doubly robust property. By contrast, under FF, performance deteriorates markedly for all estimators.
Table 2 compares the bootstrap variance estimators in terms of %RB and %CPv. Under TT, all variance estimators perform satisfactorily. Under TF and FT, despite model misspecification, the variance estimator associated with the DR method retains low bias and a %CPv close to 95%, indicating stable reliability and accuracy. Conversely, under FF, coverage performance deteriorates markedly across all methods.
Table 3 summarizes the results for the quantile estimators. Mirroring the findings for the finite distribution function estimators, all methods perform well under the TT scenario. Under TF and FT, the DR-based quantiles remain stable, confirming the robustness of the doubly robust approach. By contrast, under FF, overall estimation accuracy deteriorates.
Table 4 reports the Woodruff CI results for the quantile estimators, including %CPξ, %L, and %U. Consistent with previous findings, all methods perform well under the TT scenario. Under TF and FT, the DR-based intervals maintain a %CPξ close to the nominal 95% with balanced tail errors, indicating high reliability. By contrast, under FF, coverage deteriorates substantially across methods. %CPξ falls below the nominal level and both tail error rates increase, signaling degraded interval performance.

5.2. Study 2

In the second simulation study, we treat the 2023 Korean Survey of Household Finances and Living Conditions (N = 16,730) as the finite population and repeatedly draw subsamples from it. Table 5 summarizes the key variables used in the experiment and their definitions.
The nonprobability sample SA was generated to mimic structures commonly observed in practice. The propensity score model was specified as a logistic regression,
$$\log \frac{\pi_i^A}{1 - \pi_i^A} = \zeta_0 + \zeta_1\,\mathrm{EDU} + \zeta_2\,\mathrm{SNG} + \zeta_3\,\mathrm{APT} + \zeta_4\,\mathrm{DEBT}, \quad i = 1, \ldots, N,$$
where $\zeta_0$ was chosen so that $\sum_{i=1}^{N} \pi_i^A = n_A$. Under this design, households with higher educational attainment of the household head, non-single households, apartment residents, and households without debt were more likely to be included in $S_A$. The nonprobability sample $S_A$ was then selected by Poisson sampling with inclusion probabilities $\pi_i^A$.
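Choosing $\zeta_0$ so that the inclusion probabilities sum to $n_A$ is a one-dimensional root-finding problem, because the sum is strictly increasing in $\zeta_0$. The sketch below solves it by bisection and then draws a Poisson sample. The slope coefficients, the 0/1 coding of the covariates, and the simulated covariate values are all hypothetical placeholders; the paper does not report them in this section.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def solve_intercept(eta, n_target, lo=-50.0, hi=50.0, tol=1e-10):
    """Bisection for zeta_0 such that sum_i expit(zeta_0 + eta_i) = n_target."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if expit(mid + eta).sum() < n_target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
N, n_A = 16_730, 500
# Placeholder binary covariates standing in for EDU, SNG, APT, DEBT.
EDU, SNG, APT, DEBT = (rng.integers(0, 2, size=N) for _ in range(4))
zeta = np.array([0.5, -0.3, 0.3, -0.4])          # hypothetical slopes
eta = zeta[0] * EDU + zeta[1] * SNG + zeta[2] * APT + zeta[3] * DEBT

zeta0 = solve_intercept(eta, n_A)
pi_A = expit(zeta0 + eta)
in_SA = rng.random(N) < pi_A     # Poisson sampling: independent Bernoulli draws
print(zeta0, pi_A.sum(), in_SA.sum())
```

Under Poisson sampling the realized sample size is random with expectation $n_A$, so `in_SA.sum()` will vary around 500 from draw to draw.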
The probability sample SB was stratified into nine strata defined by GEO, HOME, and SIZE. A mixed-allocation scheme—combining Neyman and proportional allocation—was used to determine stratum-specific sample sizes, followed by simple random sampling without replacement within each stratum. The sample sizes were set to nA = 500 and nB = 1000.
The study variable of interest was current income (INCOME). Because the true outcome regression model was unknown, we included EXP1 and EXP2—the covariates with comparatively strong explanatory power—as regressors in the working model. This setup allows us to assess the impact of model misspecification on estimation performance and to isolate efficiency gains attributable to the DR estimator. We consider two scenarios regarding the propensity score model:
  • A: Correctly specified propensity score model.
  • B: Misspecified propensity score model (excluding SNG and DEBT).
Table 6 reports the results for the distribution function estimators. Overall, the REG estimator performs reasonably well, although its bias and error are somewhat larger at lower quantiles than at middle and upper quantiles, likely reflecting the limited explanatory power of the auxiliary variables and the possible overrepresentation of high-income households. Under Scenario A, the IPW estimator and the DR estimator both exhibit low bias and error, confirming the effectiveness of propensity score adjustment. Under Scenario B, the REG estimator is the most stable, while the DR estimator inherits some bias from the misspecified IPW component and thus loses efficiency. In summary, when the propensity score model is correctly specified, the IPW estimator, the REG estimator, and the DR estimator all yield stable results. However, when the propensity score model is misspecified, only the REG estimator and the DR estimator perform well, with the REG estimator performing best. These findings highlight that the choice of estimator may critically depend on the availability and explanatory power of the auxiliary variables.
Table 7 compares the bootstrap variance estimators in terms of %RB and %CPv. Consistent with the findings for the finite distribution function estimators, the REG estimator shows degraded variance performance at lower quantiles. The IPW estimator maintains a %CPv close to 95% under Scenario A, but its %CPv declines markedly under Scenario B. The DR estimator achieves both low bias and stable %CPv across scenarios, indicating reliable variance estimation.
Table 8 compares the quantile estimators in terms of %RB and RRMSE. The REG estimator shows substantial bias at lower quantiles, whereas the IPW estimator performs well under Scenario A but deteriorates under Scenario B. The DR estimator maintains moderate bias and error across both scenarios, yielding comparatively stable performance overall.
Table 9 reports results for the Woodruff confidence intervals of the quantile estimators, namely %CPξ, %L, and %U. The IPW estimator attains a %CPξ close to the nominal 95% under Scenario A, but coverage drops sharply under Scenario B, accompanied by an upward bias in %U, indicating sensitivity to propensity score misspecification. The REG estimator performs well at the middle and upper quantiles, but shows increased %L at lower quantiles. The DR estimator maintains a stable %CPξ across scenarios, with only a slight upward bias in %U under Scenario B.
In summary, two simulation experiments were conducted to evaluate the proposed methods. First, when both models were correctly specified, all estimators (IPW, REG, and DR) performed well, while under double misspecification, all methods failed to provide reliable results. Second, under single-model misspecification, the DR estimator consistently maintained low bias and stable inference, confirming the robustness of the approach. Third, in the empirical application using the 2023 Korean Survey of Household Finances and Living Conditions, the central role of auxiliary variables was evident, with the DR estimator showing comparatively the most reliable performance, especially in the lower tail.

6. Conclusions

This study examined methods for estimating the finite population distribution function and quantiles within a data integration framework that combines probability and nonprobability samples. We considered three estimators: IPW, REG, and DR. Our contribution extends DR methodology beyond mean estimation to distributional estimands and establishes that the proposed DR estimator remains asymptotically unbiased for the finite population distribution function when either the propensity score model or the outcome regression model is correctly specified. This property makes the DR approach especially valuable in survey practice, where model misspecification is almost unavoidable.
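To make the double robustness property concrete, the sketch below evaluates one common DR form from the nonprobability sampling literature, applied to the indicator I(y ≤ t). It is an illustration under stated assumptions and not necessarily the exact estimator proposed in this paper: the Hájek-type normalization, the function name `dr_cdf`, and all argument names are hypothetical, and the propensities and model predictions are taken as already estimated inputs.

```python
import numpy as np

def dr_cdf(t, y_np, pi_np, g_np, d_p, g_p):
    """Illustrative doubly robust estimate of F(t) (assumed generic form).

    y_np  : outcomes observed in the nonprobability sample
    pi_np : estimated participation propensities for those units
    g_np  : model predictions of P(y <= t | x) for nonprobability units
    d_p   : design weights of the probability sample
    g_p   : model predictions of P(y <= t | x) for probability sample units
    """
    y_np = np.asarray(y_np, dtype=float)
    pi_np = np.asarray(pi_np, dtype=float)
    g_np = np.asarray(g_np, dtype=float)
    d_p = np.asarray(d_p, dtype=float)
    g_p = np.asarray(g_p, dtype=float)

    indicator = (y_np <= t).astype(float)
    w = 1.0 / pi_np
    # Propensity-weighted residuals: corrects the outcome model using sample A.
    residual_term = np.sum(w * (indicator - g_np)) / np.sum(w)
    # Design-weighted predictions: projects the outcome model onto the population.
    prediction_term = np.sum(d_p * g_p) / np.sum(d_p)
    return residual_term + prediction_term

# Hypothetical inputs. With all predictions set to zero the estimator
# reduces to the normalized IPW estimator of F(t).
print(dr_cdf(2.5, [1, 2, 3, 4], [0.5, 0.25, 0.5, 0.25],
             [0, 0, 0, 0], [10, 10], [0, 0]))
```

The structure shows why one correct model suffices in this generic form: if the outcome predictions are right, the weighted residuals average out to roughly zero whatever the weights are, and if the propensities are right, the residual term corrects whatever error the prediction term carries. In finite samples the value is not guaranteed to lie in [0, 1] or to be monotone in t, which matters when the estimate is inverted to obtain quantiles.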
The simulation evidence, based on both a synthetic population and the 2023 Korean Survey of Household Finances and Living Conditions, underscores these advantages. When both models were correct, all estimators performed well. With single-model misspecification, DR retained stable performance, while IPW and REG broke down depending on the misspecified model. Under double misspecification, no method succeeded, highlighting the importance of at least one correctly specified model. In the empirical application using the 2023 Korean survey, the role of auxiliary variables was also evident: when auxiliary variables were strong, both REG and DR improved in accuracy, but when they were weak, the REG estimator showed clear bias and coverage deterioration in the lower tail. Taken together, these findings demonstrate that DR estimation offers a practical safeguard against model misspecification in quantile inference.
Future research could proceed in three directions. First, the evaluation of the proposed estimators could be extended to more challenging and realistic survey settings, such as small sample sizes, heavy-tailed outcomes, and highly unbalanced propensity scores. Second, more refined approaches to variance estimation would be valuable, since conventional bootstrap methods are prone to overestimating variability in integrated data settings. This calls for bootstrap methodologies designed specifically for the data integration framework, which properly incorporate both the sampling weights from the probability sample and the inclusion probabilities from the nonprobability sample. Finally, further work could relax the Missing at Random (MAR) assumption by addressing Not Missing at Random (NMAR) mechanisms or undercoverage, and could incorporate modern tools such as nonparametric or machine learning approaches with richer auxiliary information.

Author Contributions

Conceptualization and methodology, S.K. and K.-S.K.; software and data curation, D.J.; writing—original draft, S.K.; writing—review and editing, S.K., D.J. and K.-S.K.; supervision and funding acquisition, K.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Research Foundation of Korea (NRF), Grant No. RS-2022-NR068754.

Data Availability Statement

The original data presented in the study are openly available at https://mdis.kostat.go.kr/ (accessed on 1 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IPW     Inverse Probability Weighting
REG     Regression
DR      Doubly Robust
OLS     Ordinary Least Squares
CI      Confidence Interval
MAR     Missing at Random
NMAR    Not Missing at Random

References

  1. Neyman, J. On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. J. R. Stat. Soc. 1934, 97, 558–625.
  2. Kim, J.K. A gentle introduction to data integration in survey sampling. Surv. Stat. 2022, 85, 19–29.
  3. Baker, R.; Brick, J.M.; Bates, N.A.; Battaglia, M.; Couper, M.P.; Dever, J.A.; Gile, K.J.; Tourangeau, R. Summary report of the AAPOR task force on non-probability sampling. J. Surv. Stat. Methodol. 2013, 1, 90–143.
  4. Keiding, N.; Louis, T.A. Perils and potentials of self-selected entry to epidemiological studies and surveys. J. R. Stat. Soc. Ser. A Stat. Soc. 2016, 179, 319–376.
  5. Rancourt, E. Admin-First as a statistical paradigm for Canadian official statistics: Meaning, challenges and opportunities. In Proceedings of the Statistics Canada 2018 International Methodology Symposium, Ottawa, ON, Canada, 6–9 November 2018.
  6. Beaumont, J.F. Are probability surveys bound to disappear for the production of official statistics? Surv. Methodol. 2020, 46, 1–29.
  7. Harms, T.; Duchesne, P. On calibration estimation for quantiles. Surv. Methodol. 2006, 32, 37.
  8. Meng, X.L. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat. 2018, 12, 685–726.
  9. Bethlehem, J. Selection bias in web surveys. Int. Stat. Rev. 2010, 78, 161–188.
  10. Wu, C. Combining information from multiple surveys through the empirical likelihood method. Can. J. Stat. 2004, 32, 15–26.
  11. Kim, J.K.; Rao, J.N. Combining data from two independent surveys: A model-assisted approach. Biometrika 2012, 99, 85–100.
  12. Elliott, M.R.; Valliant, R. Inference for nonprobability samples. Stat. Sci. 2017, 32, 249–264.
  13. Kim, J.K.; Park, S.; Chen, Y.; Wu, C. Combining non-probability and probability survey samples through mass imputation. J. R. Stat. Soc. Ser. A Stat. Soc. 2021, 184, 941–963.
  14. Kim, J.K.; Haziza, D. Doubly robust inference with missing data in survey sampling. Stat. Sin. 2014, 24, 375–394.
  15. Chen, Y.; Li, P.; Wu, C. Doubly robust inference with nonprobability survey samples. J. Am. Stat. Assoc. 2020, 115, 2011–2021.
  16. Wu, C. Statistical inference with non-probability survey samples. Surv. Methodol. 2022, 48, 283–311.
  17. Valliant, R.; Dorfman, A.H.; Royall, R.M. Finite Population Sampling and Inference: A Prediction Approach; Wiley: New York, NY, USA, 2000.
  18. Särndal, C.E.; Swensson, B.; Wretman, J. Model Assisted Survey Sampling; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2003.
  19. Horvitz, D.G.; Thompson, D.J. A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 1952, 47, 663–685.
  20. Chambers, R.L.; Dunstan, R. Estimating distribution functions from survey data. Biometrika 1986, 73, 597–604.
  21. Rao, J.; Kovar, J.; Mantel, H. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 1990, 77, 365–375.
  22. Wu, C.; Sitter, R.R. Variance estimation for the finite population distribution function with complete auxiliary information. Can. J. Stat. 2001, 29, 289–307.
  23. Rueda, M.; Martínez, S.; Martínez, H.; Arcos, A. Estimation of the distribution function with calibration methods. J. Stat. Plan. Inference 2007, 137, 435–448.
  24. Castro-Martín, L.; Rueda, M.d.M.; Ferri-García, R. Inference from non-probability surveys with statistical matching and propensity score adjustment using modern prediction techniques. Mathematics 2020, 8, 879.
  25. Cobo, B.; Martínez, S.; Rueda, M. Estimation of the distribution function and quantiles through data integration. Stat. Pap. 2025, 66, 111.
  26. Rivers, D. Sampling for web surveys. In Proceedings of the Joint Statistical Meetings; American Statistical Association: Alexandria, VA, USA, 2007; Volume 4, p. 1320.
  27. Vavreck, L.; Rivers, D. The 2006 cooperative congressional election study. J. Elections Public Opin. Parties 2008, 18, 355–366.
  28. Lee, S.; Valliant, R. Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociol. Methods Res. 2009, 37, 319–343.
  29. Wu, C. Author’s response to comments on “Statistical inference with non-probability survey samples”. Surv. Methodol. 2022, 48, 367–373.
  30. Rosenbaum, P.R.; Rubin, D.B. The central role of the propensity score in observational studies for causal effects. Biometrika 1983, 70, 41–55.
  31. Rao, J. On making valid inferences by integrating data from surveys and other sources. Sankhya B 2021, 83, 242–272.
  32. Kott, P.S. A note on handling nonresponse in sample surveys. J. Am. Stat. Assoc. 1994, 89, 693–696.
  33. Kang, J.D.; Schafer, J.L. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Stat. Sci. 2007, 22, 523–539.
  34. Chambers, R.; Dorfman, A.H.; Hall, P. Properties of estimators of the finite population distribution function. Biometrika 1992, 79, 577–582.
  35. Robins, J.M.; Rotnitzky, A.; Zhao, L.P. Estimation of regression coefficients when some regressors are not always observed. J. Am. Stat. Assoc. 1994, 89, 846–866.
  36. Woodruff, R.S. Confidence intervals for medians and other position measures. J. Am. Stat. Assoc. 1952, 47, 635–646.
  37. Lohr, S.L. Sampling: Design and Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2021.
  38. Sitter, R.R.; Wu, C. A note on Woodruff confidence intervals for quantiles. Stat. Probab. Lett. 2001, 52, 353–358.
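
Table 1. %RB and RRMSE of the finite distribution function estimators (Study 1). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %RB ($t_1$) | RRMSE ($t_1$) | %RB ($t_2$) | RRMSE ($t_2$) | %RB ($t_3$) | RRMSE ($t_3$) |
|---|---|---|---|---|---|---|---|
| TT | $\hat{F}_{\mathrm{IPW}}(t)$ | 0.60 | 0.10 | 0.30 | 0.06 | 0.23 | 0.03 |
| TT | $\hat{F}_{\mathrm{REG}}(t)$ | 0.22 | 0.08 | 0.01 | 0.04 | −0.28 | 0.02 |
| TT | $\hat{F}_{\mathrm{DR}}(t)$ | 0.24 | 0.10 | 0.01 | 0.05 | 0.06 | 0.03 |
| TF | $\hat{F}_{\mathrm{IPW}}(t)$ | 0.60 | 0.10 | 0.30 | 0.06 | 0.23 | 0.03 |
| TF | $\hat{F}_{\mathrm{REG}}(t)$ | −30.27 | 0.31 | −24.23 | 0.25 | −17.02 | 0.17 |
| TF | $\hat{F}_{\mathrm{DR}}(t)$ | 0.53 | 0.10 | 0.25 | 0.06 | 0.20 | 0.03 |
| FT | $\hat{F}_{\mathrm{IPW}}(t)$ | −30.06 | 0.31 | −23.91 | 0.24 | −16.88 | 0.17 |
| FT | $\hat{F}_{\mathrm{REG}}(t)$ | 0.22 | 0.08 | 0.01 | 0.04 | −0.28 | 0.02 |
| FT | $\hat{F}_{\mathrm{DR}}(t)$ | 0.33 | 0.09 | 0.31 | 0.05 | 0.09 | 0.03 |
| FF | $\hat{F}_{\mathrm{IPW}}(t)$ | −30.06 | 0.31 | −23.91 | 0.24 | −16.88 | 0.17 |
| FF | $\hat{F}_{\mathrm{REG}}(t)$ | −30.27 | 0.31 | −24.23 | 0.25 | −17.02 | 0.17 |
| FF | $\hat{F}_{\mathrm{DR}}(t)$ | −30.01 | 0.31 | −23.88 | 0.24 | −16.86 | 0.17 |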
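
Table 2. %RB and %CPv of bootstrap variance estimators (Study 1). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %RB ($t_1$) | %CPv ($t_1$) | %RB ($t_2$) | %CPv ($t_2$) | %RB ($t_3$) | %CPv ($t_3$) |
|---|---|---|---|---|---|---|---|
| TT | $v_{\mathrm{IPW,BT}}$ | 6.11 | 95.50 | 6.98 | 95.10 | 7.16 | 95.40 |
| TT | $v_{\mathrm{REG,BT}}$ | 3.07 | 95.00 | 3.50 | 94.90 | 5.51 | 96.40 |
| TT | $v_{\mathrm{DR,BT}}$ | 2.39 | 95.10 | 3.45 | 95.50 | 5.02 | 96.00 |
| TF | $v_{\mathrm{IPW,BT}}$ | 6.11 | 95.50 | 6.98 | 95.10 | 7.16 | 95.40 |
| TF | $v_{\mathrm{REG,BT}}$ | 6.01 | 0.60 | 5.36 | 0.00 | 9.52 | 0.00 |
| TF | $v_{\mathrm{DR,BT}}$ | 5.04 | 95.90 | 5.42 | 95.00 | 5.05 | 95.60 |
| FT | $v_{\mathrm{IPW,BT}}$ | 3.01 | 2.20 | 4.01 | 0.00 | 8.60 | 0.00 |
| FT | $v_{\mathrm{REG,BT}}$ | 3.07 | 95.00 | 3.50 | 94.90 | 5.51 | 96.40 |
| FT | $v_{\mathrm{DR,BT}}$ | 1.78 | 95.50 | 3.26 | 95.80 | 6.33 | 96.00 |
| FF | $v_{\mathrm{IPW,BT}}$ | 3.01 | 2.20 | 4.01 | 0.00 | 8.60 | 0.00 |
| FF | $v_{\mathrm{REG,BT}}$ | 6.01 | 0.60 | 5.36 | 0.00 | 9.52 | 0.00 |
| FF | $v_{\mathrm{DR,BT}}$ | 2.98 | 2.30 | 3.78 | 0.00 | 8.47 | 0.00 |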
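
Table 3. %RB and RRMSE of quantile estimators (Study 1). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %RB ($t_1$) | RRMSE ($t_1$) | %RB ($t_2$) | RRMSE ($t_2$) | %RB ($t_3$) | RRMSE ($t_3$) |
|---|---|---|---|---|---|---|---|
| TT | $\hat{\xi}_q^{\mathrm{IPW}}$ | −1.57 | 0.18 | −0.56 | 0.06 | −0.33 | 0.04 |
| TT | $\hat{\xi}_q^{\mathrm{REG}}$ | −0.31 | 0.13 | −0.04 | 0.05 | 0.33 | 0.03 |
| TT | $\hat{\xi}_q^{\mathrm{DR}}$ | −0.94 | 0.17 | −0.21 | 0.06 | −0.21 | 0.04 |
| TF | $\hat{\xi}_q^{\mathrm{IPW}}$ | −1.57 | 0.18 | −0.56 | 0.06 | −0.33 | 0.04 |
| TF | $\hat{\xi}_q^{\mathrm{REG}}$ | 60.70 | 0.62 | 30.24 | 0.31 | 22.65 | 0.23 |
| TF | $\hat{\xi}_q^{\mathrm{DR}}$ | −1.40 | 0.18 | −0.45 | 0.07 | −0.35 | 0.05 |
| FT | $\hat{\xi}_q^{\mathrm{IPW}}$ | 60.60 | 0.63 | 29.49 | 0.30 | 22.84 | 0.23 |
| FT | $\hat{\xi}_q^{\mathrm{REG}}$ | −0.31 | 0.13 | −0.04 | 0.05 | 0.33 | 0.03 |
| FT | $\hat{\xi}_q^{\mathrm{DR}}$ | −0.97 | 0.15 | −0.46 | 0.05 | −0.29 | 0.04 |
| FF | $\hat{\xi}_q^{\mathrm{IPW}}$ | 60.60 | 0.63 | 29.49 | 0.30 | 22.84 | 0.23 |
| FF | $\hat{\xi}_q^{\mathrm{REG}}$ | 60.70 | 0.62 | 30.24 | 0.31 | 22.65 | 0.23 |
| FF | $\hat{\xi}_q^{\mathrm{DR}}$ | 60.48 | 0.62 | 29.48 | 0.30 | 22.81 | 0.23 |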
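
Table 4. %CPξ, %L, and %U of Woodruff confidence intervals (Study 1). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %CPξ ($t_1$) | %L ($t_1$) | %U ($t_1$) | %CPξ ($t_2$) | %L ($t_2$) | %U ($t_2$) | %CPξ ($t_3$) | %L ($t_3$) | %U ($t_3$) |
|---|---|---|---|---|---|---|---|---|---|---|
| TT | $\hat{\xi}_q^{\mathrm{IPW}}$ | 95.70 | 1.83 | 2.47 | 95.37 | 2.20 | 2.43 | 96.37 | 1.73 | 1.90 |
| TT | $\hat{\xi}_q^{\mathrm{REG}}$ | 94.43 | 3.30 | 2.27 | 95.07 | 3.23 | 1.70 | 94.70 | 4.27 | 1.03 |
| TT | $\hat{\xi}_q^{\mathrm{DR}}$ | 94.53 | 2.47 | 3.00 | 94.93 | 2.43 | 2.63 | 96.00 | 2.10 | 1.90 |
| TF | $\hat{\xi}_q^{\mathrm{IPW}}$ | 95.70 | 1.83 | 2.47 | 95.37 | 2.20 | 2.43 | 96.37 | 1.73 | 1.90 |
| TF | $\hat{\xi}_q^{\mathrm{REG}}$ | 0.57 | 99.43 | 0.00 | 0.03 | 99.97 | 0.00 | 0.03 | 99.97 | 0.00 |
| TF | $\hat{\xi}_q^{\mathrm{DR}}$ | 95.10 | 2.10 | 2.80 | 95.03 | 2.47 | 2.50 | 95.97 | 1.87 | 2.17 |
| FT | $\hat{\xi}_q^{\mathrm{IPW}}$ | 2.53 | 97.47 | 0.00 | 0.03 | 99.97 | 0.00 | 0.03 | 99.97 | 0.00 |
| FT | $\hat{\xi}_q^{\mathrm{REG}}$ | 94.43 | 3.30 | 2.27 | 95.07 | 3.23 | 1.70 | 94.70 | 4.27 | 1.03 |
| FT | $\hat{\xi}_q^{\mathrm{DR}}$ | 94.60 | 2.40 | 3.00 | 95.00 | 2.23 | 2.77 | 95.50 | 2.13 | 2.37 |
| FF | $\hat{\xi}_q^{\mathrm{IPW}}$ | 2.53 | 97.47 | 0.00 | 0.03 | 99.97 | 0.00 | 0.03 | 99.97 | 0.00 |
| FF | $\hat{\xi}_q^{\mathrm{REG}}$ | 0.57 | 99.43 | 0.00 | 0.03 | 99.97 | 0.00 | 0.03 | 99.97 | 0.00 |
| FF | $\hat{\xi}_q^{\mathrm{DR}}$ | 2.70 | 97.30 | 0.00 | 0.03 | 99.97 | 0.00 | 0.03 | 99.97 | 0.00 |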

Table 5. Variables and definitions from the Korean Survey of Household Finances and Living Conditions (2023).

| Variable | Description |
|---|---|
| INCOME | Current income |
| EDU | Educational attainment |
| GEO | Metropolitan status: In metropolitan area (1), Not in metropolitan area (2) |
| SNG | One-person household: Yes (1), No (2) |
| APT | Residence in an apartment: Yes (1), No (2) |
| SIZE | Size of net floor area: Classified into 4 groups by size |
| HOME | Housing types |
| DEBT | Any debt held by the household: Yes (1), No (2) |
| EXP1 | Consumption expenditure |
| EXP2 | Non-consumption expenditure |
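
Table 6. %RB and RRMSE of the finite distribution function estimators (Study 2). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %RB ($t_1$) | RRMSE ($t_1$) | %RB ($t_2$) | RRMSE ($t_2$) | %RB ($t_3$) | RRMSE ($t_3$) |
|---|---|---|---|---|---|---|---|
| A | $\hat{F}_{\mathrm{IPW}}(t)$ | 0.11 | 0.07 | 0.02 | 0.04 | 0.05 | 0.03 |
| A | $\hat{F}_{\mathrm{REG}}(t)$ | −7.80 | 0.09 | −2.60 | 0.04 | 0.55 | 0.02 |
| A | $\hat{F}_{\mathrm{DR}}(t)$ | 0.19 | 0.06 | 0.10 | 0.04 | 0.15 | 0.02 |
| B | $\hat{F}_{\mathrm{IPW}}(t)$ | 15.91 | 0.18 | 9.19 | 0.10 | 3.67 | 0.04 |
| B | $\hat{F}_{\mathrm{REG}}(t)$ | −7.80 | 0.09 | −2.60 | 0.04 | 0.55 | 0.02 |
| B | $\hat{F}_{\mathrm{DR}}(t)$ | 6.48 | 0.09 | 3.14 | 0.05 | 1.08 | 0.02 |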
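
Table 7. %RB and %CPv of bootstrap variance estimators (Study 2). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %RB ($t_1$) | %CPv ($t_1$) | %RB ($t_2$) | %CPv ($t_2$) | %RB ($t_3$) | %CPv ($t_3$) |
|---|---|---|---|---|---|---|---|
| A | $v_{\mathrm{IPW,BT}}$ | 10.03 | 96.10 | 11.95 | 96.07 | 7.61 | 95.67 |
| A | $v_{\mathrm{REG,BT}}$ | 16.34 | 62.37 | 25.00 | 88.10 | 17.29 | 94.53 |
| A | $v_{\mathrm{DR,BT}}$ | 14.67 | 96.67 | 16.86 | 96.67 | 12.18 | 95.23 |
| B | $v_{\mathrm{IPW,BT}}$ | 7.40 | 47.20 | 9.65 | 42.50 | 7.46 | 66.67 |
| B | $v_{\mathrm{REG,BT}}$ | 16.34 | 62.37 | 25.00 | 88.10 | 17.29 | 94.53 |
| B | $v_{\mathrm{DR,BT}}$ | 13.26 | 84.90 | 16.73 | 88.10 | 12.47 | 92.57 |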
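
Table 8. %RB and RRMSE of quantile estimators (Study 2). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %RB ($t_1$) | RRMSE ($t_1$) | %RB ($t_2$) | RRMSE ($t_2$) | %RB ($t_3$) | RRMSE ($t_3$) |
|---|---|---|---|---|---|---|---|
| A | $\hat{\xi}_q^{\mathrm{IPW}}$ | −0.18 | 0.06 | −0.14 | 0.05 | −0.28 | 0.04 |
| A | $\hat{\xi}_q^{\mathrm{REG}}$ | 7.57 | 0.09 | 2.96 | 0.04 | −0.82 | 0.03 |
| A | $\hat{\xi}_q^{\mathrm{DR}}$ | −0.35 | 0.06 | −0.23 | 0.04 | −0.47 | 0.04 |
| B | $\hat{\xi}_q^{\mathrm{IPW}}$ | −14.06 | 0.15 | −10.80 | 0.12 | −6.94 | 0.08 |
| B | $\hat{\xi}_q^{\mathrm{REG}}$ | 7.57 | 0.09 | 2.96 | 0.04 | −0.82 | 0.03 |
| B | $\hat{\xi}_q^{\mathrm{DR}}$ | −6.32 | 0.08 | −3.91 | 0.06 | −2.12 | 0.04 |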
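
Table 9. %CPξ, %L, and %U of Woodruff confidence intervals (Study 2). Thresholds: $t_1 = \xi_{0.25}$, $t_2 = \xi_{0.50}$, $t_3 = \xi_{0.75}$.

| Scenario | Estimator | %CPξ ($t_1$) | %L ($t_1$) | %U ($t_1$) | %CPξ ($t_2$) | %L ($t_2$) | %U ($t_2$) | %CPξ ($t_3$) | %L ($t_3$) | %U ($t_3$) |
|---|---|---|---|---|---|---|---|---|---|---|
| A | $\hat{\xi}_q^{\mathrm{IPW}}$ | 96.43 | 1.67 | 1.90 | 96.33 | 1.47 | 2.20 | 95.87 | 2.13 | 2.00 |
| A | $\hat{\xi}_q^{\mathrm{REG}}$ | 58.43 | 41.57 | 0.00 | 83.67 | 16.23 | 0.10 | 95.67 | 1.53 | 2.80 |
| A | $\hat{\xi}_q^{\mathrm{DR}}$ | 96.87 | 1.60 | 1.53 | 96.90 | 1.33 | 1.77 | 95.70 | 1.87 | 2.43 |
| B | $\hat{\xi}_q^{\mathrm{IPW}}$ | 44.57 | 0.00 | 55.43 | 43.03 | 0.00 | 56.97 | 71.57 | 0.03 | 28.40 |
| B | $\hat{\xi}_q^{\mathrm{REG}}$ | 58.43 | 41.57 | 0.00 | 83.67 | 16.23 | 0.10 | 95.67 | 1.53 | 2.80 |
| B | $\hat{\xi}_q^{\mathrm{DR}}$ | 83.60 | 0.00 | 16.40 | 88.70 | 0.03 | 11.27 | 94.37 | 0.47 | 5.17 |
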
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kwon, S.; Jang, D.; Kim, K.-S. Doubly Robust Estimation of the Finite Population Distribution Function Using Nonprobability Samples. Mathematics 2025, 13, 3227. https://doi.org/10.3390/math13193227
