Article

Methods to Counter Self-Selection Bias in Estimations of the Distribution Function and Quantiles

by
María del Mar Rueda
1,*,
Sergio Martínez-Puertas
2 and
Luis Castro-Martín
3
1
Department of Statistics and O.R. and Institute of Mathematics, University of Granada, 18071 Granada, Spain
2
Department of Mathematics, University of Almería, 04120 Almería, Spain
3
Andalusian School of Public Health, University of Granada, 18011 Granada, Spain
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(24), 4726; https://doi.org/10.3390/math10244726
Submission received: 7 November 2022 / Revised: 1 December 2022 / Accepted: 8 December 2022 / Published: 12 December 2022

Abstract:
Many surveys are performed using non-probability methods such as web surveys, social networks surveys, or opt-in panels. The estimates made from these data sources are usually biased and must be adjusted to make them representative of the target population. Techniques to mitigate this selection bias in non-probability samples often involve calibration, propensity score adjustment, or statistical matching. In this article, we consider the problem of estimating the finite population distribution function in the context of non-probability surveys and show how some methodologies formulated for linear parameters can be adapted to this functional parameter, both theoretically and empirically, thus enhancing the accuracy and efficiency of the estimates made.

1. Introduction

Distribution function estimation is an important topic in survey research. This approach offers valuable benefits in the context of probability surveys and has been the focus of much research attention in recent years. It is especially useful when the underlying goal is to determine the proportion of values of a study variable that are less than or equal to a certain value. For example, knowledge of the distribution function makes it possible to obtain the reliability function, which is commonly used in life data analysis and reliability engineering [1]. Furthermore, the distribution function allows us to examine whether two samples originate from the same population [2].
Additionally, the finite population distribution function can be used to calculate certain parameters, such as population quantiles. In several areas of study [3,4,5], quantiles are of special interest. For example, rates of extreme pediatric obesity are defined as the body mass index at or above the 99th percentile [6]. In another area, that of ozone concentrations, the 5th percentile is a measure of the baseline condition, while the 95th reflects peak concentration levels [7]. In economics, some variables, such as income, have skewed distributions, and in this case quantiles provide a more suitable measure of location than the mean [8,9]. Also in this field, quantiles allow us to obtain measures such as the poverty line and the poverty gap [10,11], as well as inequality indicators such as the headcount index, which measures the proportion of individuals classified as poor within a given population [12]. Other analyses of inequality, such as those focusing on wages or income distribution, also require measures based on percentile ratios [9]. In these cases, estimating the distribution function is also more useful than calculating totals and means [13]. In view of these considerations, some studies have focused on the auxiliary population information available at the estimation stage to obtain more accurate values for the distribution function and quantiles [14,15,16,17]. One means of incorporating auxiliary information to develop new estimators of the distribution function is to employ the calibration method, which was originally designed to estimate population totals [18]. An extensive body of research has been conducted in this area, and various implementations of the calibration approach have been applied in the probability survey context to obtain estimators of the distribution function and the quantiles [19,20,21,22,23,24,25,26,27].
The use of calibration techniques has also been considered for estimating the distribution function when a probability survey is subject to non-response [28].
As part of the global commitment to fight poverty and social exclusion, many government agencies wish to know the proportion of the population living below the poverty line, in order to monitor the effectiveness of their policies [29]. One way to obtain this information is to conduct probabilistic surveys, based on representative samples of the target population. The aim of survey sampling theory is to maximize the reliability of the estimates thus obtained.
For a sample to be considered probabilistic, and therefore valid for drawing inferences regarding the population, it must be selected by a mechanism under which every individual in the target population has a known, non-null probability of inclusion.
In recent years, alternative data sources to probabilistic samples have been considered, such as big data and web surveys. These approaches offer certain advantages over traditional probability sampling: estimates in near real time may be obtained, data access is easier, and data collection costs are lower. In these non-traditional methods, the data generating process is different and the subsequent analysis is based on non-probability samples. Despite the above advantages, this method also presents serious issues, especially the fact that the selection procedure for the units included in the sample is unknown and so the estimates obtained may be biased, since the sample itself does not necessarily provide a valid picture of the entire population. In other words, the sample is potentially exposed to self-selection bias [30,31].
Many studies of survey sampling have been undertaken to reduce selection bias in the methods used to estimate population totals and means, and this research has been reviewed in [32,33,34,35,36,37], among others. The methods considered include inverse probability weighting [38,39], inverse sampling [34], mass imputation [40], doubly robust methods [31], kernel smoothing methods [41], statistical matching combined with propensity score adjustment [42], and calibration combined with propensity score adjustment [39,43]. However, despite the extensive literature available on using calibration techniques to estimate the distribution function and the population mean under conditions of self-selection bias, little attention has been paid to the development of efficient methods for estimating the population distribution function under these conditions.
To address this research gap, we propose a general framework for drawing statistical inferences for the distribution function with non-probability survey samples when auxiliary information is available. We discuss different methods of adjusting for self-selection bias, depending on the type of information available, applying calibration, propensity score, and statistical matching techniques.
The rest of the paper is organized as follows: in Section 2, we review the estimation of the distribution function from probability and non-probability samples, in order to establish the basic framework and the notation employed. In Section 3, we then propose several estimators for the distribution function, based on calibration, propensity score adjustment (PSA) and statistical matching (SM), taking into consideration that the non-smooth nature of the finite population distribution function produces certain complexities, which are resolved in different ways. The properties of the proposed estimators are described in Section 4, after which we present the results obtained from the simulation studies performed with these estimators. In the final section, we summarize the main conclusions drawn and suggest possible lines of further research in this area.

2. Basic Setup for Estimating the Distribution Function

Let $U$ denote a finite population of size $N$, $U = \{1, \ldots, i, \ldots, N\}$. Let $s_V$ be a self-selected sample of size $n_V$, self-selected from $U$. Let $y$ be the variable of interest in the survey estimation. We assume that $y_k$ is known for all sample units. Our goal is to estimate the distribution function $F_y(t)$ for the study variable $y$, which can be defined as follows:
$$F_y(t) = \frac{1}{N} \sum_{k \in U} \Delta(t - y_k)$$
where $\Delta(\cdot)$ denotes the Heaviside function, given by:
$$\Delta(t - y_k) = \begin{cases} 1 & \text{if } t \geq y_k \\ 0 & \text{if } t < y_k. \end{cases}$$
In the absence of auxiliary information, the distribution function $F_y(t)$ can be estimated by the naive estimator, defined by
$$\hat{F}_Y^{Na}(t) = \frac{1}{n_V} \sum_{k \in s_V} \Delta(t - y_k).$$
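As a quick illustration, the naive estimator is simply the empirical CDF of the convenience sample. A minimal sketch in Python (the function and variable names are ours, for illustration only):

```python
import numpy as np

def naive_F(y_sample, t):
    """Naive estimator of F_y(t): the share of sampled values <= t.
    Biased whenever the sample is self-selected."""
    return np.mean(np.asarray(y_sample) <= t)

# Toy convenience sample of incomes
y_v = np.array([12.0, 18.5, 21.0, 30.0, 45.0])
print(naive_F(y_v, 21.0))  # 3 of the 5 values are <= 21.0 -> 0.6
```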
If the convenience sample s V suffers from selection bias, the above estimator will provide biased results.
Let $R$ be an indicator variable of an element being in $s_V$, such that
$$R_k = \begin{cases} 1 & k \in s_V \\ 0 & k \notin s_V. \end{cases}$$
If we know $\{R_k : k \in U\}$, the error of $\hat{F}_Y^{Na}(t)$ will be:
$$\hat{F}_Y^{Na}(t) - F_y(t) = \frac{1}{n_V} \sum_{k \in U} R_k \Delta(t - y_k) - \frac{1}{N} \sum_{k \in U} \Delta(t - y_k) = \frac{1}{f} \mathrm{Cov}(R, \Delta(t - y))$$
with $f = n_V/N$ and $\mathrm{Cov}(R, \Delta(t - y)) = \frac{1}{N} \sum_{k \in U} (R_k - \bar{R}_N)(\Delta(t - y_k) - F_y(t))$, where $\bar{R}_N = \frac{1}{N} \sum_{k \in U} R_k$.
Taking the expectation of this difference, we obtain the selection bias of the estimator, as follows:
$$B = E_R\big(\hat{F}_Y^{Na}(t) - F_y(t)\big) = \frac{1}{f} E_R\big(\mathrm{Cov}(R, \Delta(t - y))\big)$$
where E R denotes the expectation with respect to the random mechanism for R k .
The mean squared error is obtained by:
$$MSE = \frac{1}{f^2} E_R\big(\mathrm{Cov}(R, \Delta(t - y))^2\big) = \frac{1}{f^2} E_R\big(\mathrm{Corr}(R, \Delta(t - y))^2\big) \, \mathrm{Var}(R) \, \mathrm{Var}(\Delta(t - y))$$
$$= E_R\big(\mathrm{Corr}(R, \Delta(t - y))^2\big) \times \Big(\frac{1}{f} - 1\Big) \times \mathrm{Var}(\Delta(t - y))$$
because $\mathrm{Var}(R) = \frac{1}{N} \sum_{k \in U} (R_k - \bar{R}_N)^2 = f(1 - f)$.
Therefore, a non-probability sampling design with $E_R\big(\mathrm{Corr}(R, \Delta(t - y))^2\big) \neq 0$ means that the analysis results are subject to selection bias. This is the main problem addressed in our study.

3. Proposed Estimators

The key to successful weighting to eliminate bias in self-selection surveys lies in the use of appropriate auxiliary information. To address this question, let us consider $J$ auxiliary variables $x_1, \ldots, x_J$ and let $x_k = (x_{1k}, \ldots, x_{Jk})^T$ be the vector of auxiliary variables at unit $k$.
We distinguish three different cases, called InfoTP, InfoES, and InfoEP, depending on the information at hand [44]:
  • InfoTP: Only the population totals of the auxiliary variables, $\sum_U x_k = X$, are known.
  • InfoES: Information is available at the level of a probability sample conducted on the same target population as the non-probability survey, with good coverage and high response rates. The vector of auxiliary variables $x_k$ is known for every unit in this reference sample.
  • InfoEP: Information is available at the level of the population $U$: the vector of auxiliary variables $x_k$ is known for every $k \in U$.
Below, we consider various adjustment methods, depending on the type of information available.

3.1. InfoTP

The calibration method, originally developed by Deville and Särndal [18] for the estimation of totals, can be adapted to estimate the distribution function. This approach enables us to incorporate the auxiliary information available through the auxiliary vector $x_k$ in several ways [19,20,24,25,26,27].
In the case of InfoTP, the calibration can be performed on the totals: given a pseudo-distance $G(\cdot, \cdot)$, and denoting $w_{vk} = N/n_V$, we seek new calibrated weights $w_k^{c1}$ that are the solution to the following minimization problem
$$\min_{w_k} \sum_{k \in s_V} G(w_k, w_{vk})$$
subject to
$$\sum_{k \in s_V} w_k x_k = X. \qquad (5)$$
The resulting calibrated estimator of the distribution function is given by:
$$\hat{F}_Y^{c1}(t) = \frac{1}{N} \sum_{k \in s_V} w_k^{c1} \Delta(t - y_k).$$
Ref. [18] proposes a family of pseudo-distances $G(\cdot, \cdot)$ with which to develop calibration estimators. One of the distances proposed is the chi-square distance, given by
$$\Phi_s = \sum_{k \in s_V} \frac{(w_k - w_{vk})^2}{w_{vk} q_k} \qquad (7)$$
where the $q_k$ are positive weights, usually taken as uniform ($1/q_k = 1$), although unequal weights $1/q_k$ are sometimes preferred.
The calibrated weights $w_k^{c1}$ resulting from the minimization of (7) subject to the conditions (5) are given by:
$$w_k^{c1} = w_{vk} + w_{vk} q_k \, \gamma \cdot x_k$$
where
$$\gamma = \Big(X - \sum_{k \in s_V} w_{vk} x_k\Big)^T \Big(\sum_{k \in s_V} w_{vk} q_k x_k x_k^T\Big)^{-1}.$$
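These closed-form weights are straightforward to compute. Below is a sketch under the uniform choice $q_k = 1$ (the function and variable names are ours, not the authors'):

```python
import numpy as np

def calibrate_totals(X_pop, x_sample, N):
    """Chi-square calibration (q_k = 1) of the uniform weights w_vk = N/n_V
    so that the weighted sample totals reproduce the population totals X_pop."""
    n = x_sample.shape[0]
    w_v = np.full(n, N / n)
    # gamma solves (sum w_v x_k x_k^T) gamma = X - sum w_v x_k
    T = (x_sample * w_v[:, None]).T @ x_sample
    gamma = np.linalg.solve(T, X_pop - x_sample.T @ w_v)
    return w_v * (1.0 + x_sample @ gamma)

# The calibrated weights reproduce the known totals exactly
rng = np.random.default_rng(0)
x = rng.uniform(size=(50, 2))
X_pop = np.array([60.0, 55.0])
w = calibrate_totals(X_pop, x, N=100)
print(np.allclose(x.T @ w, X_pop))  # True
```

With these weights, the estimator $\hat{F}_Y^{c1}(t)$ is just the weighted mean of the indicators $\Delta(t - y_k)$ divided by $N$.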
In the estimation of totals and means, previous research has shown that the exclusive use of calibration fails to eliminate self-selection bias if this approach is not combined with other methods, such as propensity score adjustment (PSA) [39,43]. Thus, in terms of bias reduction, the results of the calibration and PSA combination clearly surpass those obtained with only calibration weighting [43].
In order to incorporate methods such as PSA and to develop new estimators that overcome the problems encountered with the $\hat{F}_Y^{c1}(t)$ estimator, we consider other scenarios as follows.

3.2. InfoES

Let $s_R$ be a probability sample of size $n_R$ selected from $U$ under a probability sampling design $(s_R, p_R)$ in which $\pi_k = \sum_{s_R \ni k} p_R(s_R) > 0$ is the first-order inclusion probability for individual $k$. The covariates $x_k$ are common to both samples, but we only have measurements of the variable of interest $y$ for the individuals in the convenience sample. The original design weight of individual $k$ in the reference (probability) sample is denoted by $w_{Rk} = 1/\pi_k$.
First, we consider a calibration method for reweighting based on the proposal given in [25], calibrating on the pseudo-variable:
$$g_k = \hat{\beta}^T x_k \quad \text{for } k = 1, 2, \ldots, N$$
$$\hat{\beta} = \Big(\sum_{k \in s_V} x_k x_k^T\Big)^{-1} \sum_{k \in s_V} x_k y_k.$$
The new weights $w_k^{c2}$ are obtained by minimizing the chi-square distance (7) subject to the following conditions:
$$\frac{1}{N} \sum_{k \in s_V} w_k^{c2} \Delta(t_j - g_k) = \frac{1}{N} \sum_{k \in s_R} w_{Rk} \Delta(t_j - g_k), \quad j = 1, 2, \ldots, P \qquad (11)$$
where the points $t_j$, $j = 1, \ldots, P$, are chosen arbitrarily, with $t_1 < t_2 < \cdots < t_P$, and the $q_k$ are positive constants.
The resulting calibrated estimator of the distribution function is given by:
$$\hat{F}_Y^{c2}(t) = \frac{1}{N} \sum_{k \in s_V} w_k^{c2} \Delta(t - y_k)$$
in which the calibrated weights $w_k^{c2}$ are given by:
$$w_k^{c2} = w_{vk} + w_{vk} q_k \, \frac{\lambda \cdot \Delta(\mathbf{t}_P - g_k)}{N} \qquad (13)$$
with
$$\lambda = N^2 \Big(\hat{F}_{GR}(\mathbf{t}_P) - \frac{1}{N} \sum_{k \in s_V} w_{vk} \Delta(\mathbf{t}_P - g_k)\Big)^T \Big(\sum_{k \in s_V} w_{vk} q_k \, \Delta(\mathbf{t}_P - g_k) \Delta(\mathbf{t}_P - g_k)^T\Big)^{-1}$$
and
$$\Delta(\mathbf{t}_P - g_k)^T = \big(\Delta(t_1 - g_k), \Delta(t_2 - g_k), \ldots, \Delta(t_P - g_k)\big)$$
$$\hat{F}_{GR}(\mathbf{t}_P)^T = \Big(\frac{1}{N} \sum_{k \in s_R} w_{Rk} \Delta(t_1 - g_k), \frac{1}{N} \sum_{k \in s_R} w_{Rk} \Delta(t_2 - g_k), \ldots, \frac{1}{N} \sum_{k \in s_R} w_{Rk} \Delta(t_P - g_k)\Big)$$
The calibrated weights (13) and the weights $w_{Rk}$, for the samples $s_V$ and $s_R$, respectively, give the same estimates for the distribution function of the pseudo-variable $g$ when evaluated over the set of points $t_j$.
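To make the mechanics concrete, here is a compact sketch of this weighting with $q_k = 1$: the Heaviside indicators at the points $t_j$ play the role of auxiliary variables, and the calibration targets are the design-weighted CDF values of $g$ in the reference sample (all names are ours; this is an illustration, not the authors' code):

```python
import numpy as np

def c2_weights(y_v, x_v, x_r, w_r, t_points, N):
    """Calibrate the uniform weights N/n_V over s_V so that the weighted CDF of
    the pseudo-variable g matches its design-weighted CDF over s_R at t_points
    (chi-square distance, q_k = 1)."""
    n = len(y_v)
    beta = np.linalg.solve(x_v.T @ x_v, x_v.T @ y_v)  # OLS fit on s_V
    g_v, g_r = x_v @ beta, x_r @ beta
    Z = (g_v[:, None] <= t_points[None, :]).astype(float)  # Delta(t_j - g_k)
    targets = np.array([(w_r * (g_r <= t)).sum() for t in t_points])
    w_v = np.full(n, N / n)
    T = (Z * w_v[:, None]).T @ Z
    lam = np.linalg.solve(T, targets - Z.T @ w_v)
    return w_v * (1.0 + Z @ lam), g_v, g_r

# Deterministic toy data: y is exactly linear in x
x1 = np.linspace(1.0, 2.0, 40)
x_v = np.column_stack([x1, 3.0 - x1])
y_v = x_v @ np.array([2.0, 1.0])
xr1 = np.linspace(1.1, 1.9, 60)
x_r = np.column_stack([xr1, 3.0 - xr1])
w_r, N = np.full(60, 200 / 60), 200
t_pts = np.array([4.2, 4.5, 4.8])
w, g_v, g_r = c2_weights(y_v, x_v, x_r, w_r, t_pts, N)
# The calibration equations hold at every t_j
print(all(abs((w * (g_v <= t)).sum() - (w_r * (g_r <= t)).sum()) < 1e-8 for t in t_pts))  # True
```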
In the case of InfoES information, the most popular adjustment method in non-probability settings is propensity score adjustment [38,39,43,45,46,47]. This method, developed by [48], can be used to estimate the distribution function, as described below.
Under PSA, it is assumed that each element of $U$ has a probability (propensity) of being selected for $s_V$, which can be formulated as
$$\pi_k^v = Pr(R_k = 1 \mid x_k, y_k).$$
We assume that the response selection mechanism is ignorable and follows a parametric model:
$$\pi_k^v = Pr(R_k = 1 \mid x_k) = m(x_k, \lambda),$$
for a known function $m(\cdot)$ with continuous second derivatives with respect to an unknown parameter $\lambda$.
We estimate the propensity scores $\pi_k^v$ by using data from both the self-selected and the probability samples. The maximum likelihood estimator (MLE) of $\pi_k^v$ is $\hat{\pi}_k^v = m(\hat{\lambda}, x_k)$, where $\hat{\lambda}$ is the value of $\lambda$ that maximizes the pseudo-log-likelihood function:
$$\tilde{l}(\lambda) = \sum_{k \in s_V} \log\frac{m(\lambda, x_k)}{1 - m(\lambda, x_k)} + \sum_{k \in s_R} \frac{1}{\pi_k} \log\big(1 - m(\lambda, x_k)\big).$$
The resulting propensities can then be used to calculate new weights, $w_k^{PSA} = 1/\hat{\pi}_k^v$. Thus, we define the inverse propensity weighting estimator of the distribution function as:
$$\hat{F}_Y^{IPS}(t) = \frac{1}{N} \sum_{k \in s_V} w_k^{PSA} \Delta(t - y_k). \qquad (17)$$
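Under a logistic model, $\log\{m/(1-m)\} = \lambda^T x_k$, so the pseudo-log-likelihood simplifies and can be maximized by Newton's method. A sketch follows (our own minimal implementation, not the authors' code; an intercept column is assumed to be included in the covariate matrices):

```python
import numpy as np

def psa_weights(x_v, x_r, w_r, iters=25):
    """Maximize l(lam) = sum_{s_V} lam'x_k - sum_{s_R} w_Rk log(1 + exp(lam'x_k))
    by Newton's method (logistic m), and return w_k^PSA = 1/pi_hat for s_V."""
    lam = np.zeros(x_v.shape[1])
    for _ in range(iters):
        p_r = 1.0 / (1.0 + np.exp(-(x_r @ lam)))  # m(lam, x_k) on s_R
        grad = x_v.sum(axis=0) - x_r.T @ (w_r * p_r)
        hess = -(x_r * (w_r * p_r * (1.0 - p_r))[:, None]).T @ x_r
        lam -= np.linalg.solve(hess, grad)  # Newton ascent step
    pi_hat = 1.0 / (1.0 + np.exp(-(x_v @ lam)))
    return 1.0 / pi_hat

def F_ips(y_v, w_psa, t, N):
    """Inverse propensity weighting estimator of F_y(t)."""
    return (w_psa * (np.asarray(y_v) <= t)).sum() / N

# Intercept-only check: pi_hat = n_V / N, so the weights sum to N
w = psa_weights(np.ones((30, 1)), np.ones((100, 1)), np.ones(100))
print(round(w.sum()))  # 100
```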
Another PSA-based estimator can be obtained using the weights $w_k^{PSA2} = (1 - \hat{\pi}_k^v)/\hat{\pi}_k^v$ [49]. In this respect, Refs. [39,47] proposed other PSA weights whereby the combined sample $(s_V \cup s_R)$ is grouped into $g$ equally sized strata of similar propensity scores, from which an average propensity is calculated for each group.
The estimator (17) can be obtained as a special case of the general framework on inference for the general parameter proposed in [50]. The latter authors present an estimator that uses the propensity score for each individual in the survey weighted by the estimating equation under logistic regression, thus obtaining the asymptotic variance of the estimator.
A third approach to dealing with InfoES information is that of statistical matching, by which imputed values are created for all elements in the probability sample. This method was introduced by [40] and is based on modeling the relationship between $y_k$ and $x_k$, using the self-selected sample $s_V$ to predict $y_k$ for the probability sample. The question then is how to predict the values $y_k$.
To do so, let us assume a working population model $E_m(y \mid x) = M(x, \beta)$, where $\beta$ is an unknown parameter. We further assume that the population model holds for the sample $s_V$. Using the data from this sample, we can obtain an estimator $\hat{\beta}_v$ which is consistent for $\beta$ under the assumed model. From $\hat{\beta}_v$, we then propose the matching estimator for the distribution function:
$$\hat{F}_Y^{SM}(t) = \frac{1}{N} \sum_{k \in s_R} \Delta(t - \hat{y}_k)/\pi_k \qquad (18)$$
where $\hat{y}_k = M(x_k, \hat{\beta}_v)$ is the predicted value of $y_k$ under the above model. The estimator (18) is consistent if the model for the study variable is correctly specified.
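With a linear working model, the matching estimator amounts to fitting OLS on $s_V$, imputing $\hat{y}_k$ for the probability sample, and averaging the design-weighted indicators. A minimal sketch (names are ours):

```python
import numpy as np

def F_sm(y_v, x_v, x_r, pi_r, t, N):
    """Matching estimator of F_y(t): OLS fit on the convenience sample s_V,
    imputation of y over s_R, then a design-weighted mean of the indicators."""
    beta_v = np.linalg.solve(x_v.T @ x_v, x_v.T @ y_v)
    y_hat_r = x_r @ beta_v
    return ((y_hat_r <= t) / pi_r).sum() / N

# Toy check with an exactly linear y: the imputed values are 2, 4, 6, 8
x_v = np.array([[1.0], [2.0], [3.0]])
y_v = np.array([2.0, 4.0, 6.0])
x_r = np.array([[1.0], [2.0], [3.0], [4.0]])
pi_r = np.full(4, 0.5)  # SRS of 4 units from N = 8
print(F_sm(y_v, x_v, x_r, pi_r, 5.0, 8))  # (2 + 2) / 8 = 0.5
```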
A more complex estimator for the distribution function can be constructed using the idea of double robust estimation [31], which is based on the following considerations. Firstly, the propensity score adjusted estimator (17) requires that the propensity score model be correctly specified. Moreover, the imputation-based estimator (18) requires that the working population model be correctly specified. An estimator is called doubly robust if the estimator is consistent whenever one of these two models is correctly specified [51]. Hence, the double robust estimator of the distribution function is defined as:
$$\hat{F}_Y^{DR}(t) = \frac{1}{N} \Big(\sum_{k \in s_R} \Delta(t - \hat{y}_k)/\pi_k + \sum_{k \in s_V} w_k^{PSA} \big(\Delta(t - y_k) - \Delta(t - \hat{y}_k)\big)\Big). \qquad (19)$$
The estimator (19) is double robust because it is consistent if either the model for the participation probabilities or the model for the study variable is correctly specified.
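Given propensity weights for $s_V$ and model predictions for both samples, the doubly robust combination is a short computation. A sketch with hypothetical inputs (names are ours):

```python
import numpy as np

def F_dr(y_v, y_hat_v, y_hat_r, w_psa, pi_r, t, N):
    """Doubly robust estimator of F_y(t): prediction term over the probability
    sample s_R plus an inverse-propensity-weighted correction over s_V."""
    pred = ((np.asarray(y_hat_r) <= t) / np.asarray(pi_r)).sum()
    corr = (np.asarray(w_psa)
            * ((np.asarray(y_v) <= t).astype(float)
               - (np.asarray(y_hat_v) <= t))).sum()
    return (pred + corr) / N

# Tiny numeric check: prediction term 2, correction term -2, estimate 0
val = F_dr(y_v=[1.0, 3.0], y_hat_v=[2.0, 2.0], y_hat_r=[1.0, 4.0],
           w_psa=[2.0, 2.0], pi_r=[0.5, 0.5], t=2.0, N=8)
print(val)  # 0.0
```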

3.3. InfoEP

In the case of InfoEP, an initial possibility is to consider a similar calibrated estimator, based on the proposal given in [25]. The new weights $w_k^{c3}$ are obtained by minimizing the chi-square distance (7) subject to the following conditions:
$$\frac{1}{N} \sum_{k \in s_V} w_k^{c3} \Delta(t_j - g_k) = F_g(t_j), \quad j = 1, 2, \ldots, P$$
where F g ( t j ) is the finite distribution function of g at the points t j , j = 1 , 2 , , P .
The resulting calibrated estimator of the distribution function is:
$$\hat{F}_Y^{c3}(t) = \frac{1}{N} \sum_{k \in s_V} w_k^{c3} \Delta(t - y_k)$$
where the calibrated weights $w_k^{c3}$ are given by:
$$w_k^{c3} = w_{vk} + w_{vk} q_k \, \frac{\theta \cdot \Delta(\mathbf{t}_P - g_k)}{N}$$
with
$$\theta = N^2 \Big(F_g(\mathbf{t}_P) - \frac{1}{N} \sum_{k \in s_V} w_{vk} \Delta(\mathbf{t}_P - g_k)\Big)^T \Big(\sum_{k \in s_V} w_{vk} q_k \, \Delta(\mathbf{t}_P - g_k) \Delta(\mathbf{t}_P - g_k)^T\Big)^{-1}$$
$$F_g(\mathbf{t}_P)^T = \big(F_g(t_1), F_g(t_2), \ldots, F_g(t_P)\big).$$
$\hat{F}_Y^{c3}(t)$ reproduces the distribution function of the pseudo-variable $g$ exactly when evaluated over the set of points $t_j$, $j = 1, 2, \ldots, P$.
We define a model-based estimator based on the non-probability sample as
$$\hat{F}_Y^{DR2}(t) = \frac{1}{N} \Big(\sum_{k \in U \setminus s_V} \Delta(t - \hat{y}_k) + \sum_{k \in s_V} \Delta(t - y_k)\Big)$$
and a model-assisted estimator by
$$\hat{F}_Y^{DR3}(t) = \frac{1}{N} \Big(\sum_{k \in U} \Delta(t - \hat{y}_k) + \sum_{k \in s_V} w_k^{PSA} \big(\Delta(t - y_k) - \Delta(t - \hat{y}_k)\big)\Big)$$
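The model-based estimator has an especially simple form under InfoEP: observed indicators over $s_V$ plus predicted indicators over the rest of $U$. A sketch (names are ours):

```python
import numpy as np

def F_dr2(y_v, y_hat_rest, t, N):
    """Model-based estimator of F_y(t): observed y over the convenience sample
    s_V plus model predictions y_hat for the non-sampled part of U."""
    return ((np.asarray(y_hat_rest) <= t).sum()
            + (np.asarray(y_v) <= t).sum()) / N

# N = 5: two observed values and three predicted ones
print(F_dr2([1.0, 3.0], [2.0, 4.0, 6.0], 3.0, 5))  # (2 + 1) / 5 = 0.6
```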

4. Properties of Proposed Estimators

When estimating the distribution function, the estimator considered, $\hat{F}_Y(t)$, should satisfy the following distribution function properties:
(i) $\hat{F}_Y(t)$ should be continuous on the right;
(ii) $\hat{F}_Y(t)$ should be monotonically nondecreasing;
(iii) $\lim_{t \to -\infty} \hat{F}_Y(t) = 0$;
(iv) $\lim_{t \to +\infty} \hat{F}_Y(t) = 1$.
If an estimator of the distribution function $\hat{F}_Y(t)$ is a genuine distribution function, i.e., $\hat{F}_Y(t)$ meets the above conditions, it can be used directly for estimating the quantiles [16]. Specifically, the quantile of order $\alpha$ can be estimated as:
$$\hat{Q}_\alpha = \inf\{t : \hat{F}_Y(t) \geq \alpha\} = \hat{F}_Y^{-1}(\alpha)$$
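When $\hat{F}_Y$ is a genuine distribution function, this generalized inverse can be evaluated over a grid of candidate points, e.g., the observed sample values. A sketch (our own illustration):

```python
import numpy as np

def quantile_from_F(candidates, F_hat, alpha):
    """Q_alpha = inf{t : F_hat(t) >= alpha}, searched over sorted candidates.
    Valid when F_hat is a genuine (monotone, right-continuous) distribution
    function."""
    ts = np.sort(np.unique(candidates))
    for t in ts:
        if F_hat(t) >= alpha:
            return t
    return ts[-1]

# Median of a small sample via the naive CDF
y = np.array([5.0, 1.0, 3.0, 9.0, 7.0])
F = lambda t: np.mean(y <= t)
print(quantile_from_F(y, F, 0.5))  # 5.0
```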
Since the Heaviside function is continuous on the right, it is clear that all the proposed estimators satisfy conditions (i) and (iii).
In general, however, the estimator $\hat{F}_Y^{c1}(t)$ does not satisfy conditions (ii) or (iv). In order to meet condition (ii), let us consider the specific pseudo-distances $G(\cdot, \cdot)$ that guarantee positive calibrated weights $w_k^{c1} > 0$. In this respect, Ref. [18] proposed some pseudo-distances which always produce positive weights whilst avoiding extremely large ones. Some of these pseudo-distances may be considered in the estimator $\hat{F}_Y^{c1}(t)$ in order to satisfy condition (ii). In addition, to meet condition (iv), we can add the constraint:
$$N = \sum_{k \in s_V} w_k \qquad (25)$$
to condition (5).
Similarly, conditions (ii) and (iv) are not generally met by the estimators $\hat{F}_Y^{c2}(t)$ or $\hat{F}_Y^{c3}(t)$. Regarding condition (ii), and following [25], the weights $w_k^{c2}$ and $w_k^{c3}$ are always positive if $q_k = c$ for all units in the population. Thus, under the usual uniform choice $1/q_k = 1$, both estimators satisfy condition (ii). To meet condition (iv), in the case of the estimator $\hat{F}_Y^{c2}(t)$, we can add the constraint (25) to the conditions (11), while, for the estimator $\hat{F}_Y^{c3}(t)$, we can take a value $t_P$ large enough that $F_g(t_P) = 1$.
The estimator $\hat{F}_Y^{IPS}(t)$ based on the weights $w_k^{PSA}$ verifies condition (ii) if the weights satisfy $w_k^{PSA} \geq 0$, whereas if $\hat{F}_Y^{IPS}(t)$ is based on $w_k^{PSA2}$, it meets condition (ii) when $w_k^{PSA} \geq 1$. Consequently, if the estimated propensities satisfy $0 \leq \hat{\pi}_k^v \leq 1$, the estimator $\hat{F}_Y^{IPS}(t)$ based on either $w_k^{PSA}$ or $w_k^{PSA2}$ satisfies condition (ii). Thus, condition (ii) can be met through the model selected to estimate the propensities, $m(\lambda, x_k)$. For example, a widespread option for the estimation of propensities is the logistic regression model
$$m(\lambda, x_k) = \frac{\exp(\lambda^T x_k)}{1 + \exp(\lambda^T x_k)}$$
which guarantees $0 \leq \hat{\pi}_k^v \leq 1$. Hence, if we choose this model, condition (ii) is met by $\hat{F}_Y^{IPS}(t)$ regardless of whether we use the weight $w_k^{PSA}$ or the weight $w_k^{PSA2}$.
To ensure that condition (iv) is met with the estimator $\hat{F}_Y^{IPS}(t)$, it can be divided by the sum of the weights, that is, $\sum_{k \in s_V} w_k^{PSA}$ or $\sum_{k \in s_V} w_k^{PSA2}$.
The estimator $\hat{F}_Y^{SM}(t)$ satisfies condition (ii) but not condition (iv). To ensure the latter, again we can divide $\hat{F}_Y^{SM}(t)$ by the sum of its weights, that is, $\sum_{k \in s_R} 1/\pi_k$.
Finally, whereas the estimator $\hat{F}_Y^{DR2}(t)$ satisfies all the conditions, $\hat{F}_Y^{DR}(t)$ and $\hat{F}_Y^{DR3}(t)$ do not meet conditions (ii) or (iv). Condition (iv) can be met by both $\hat{F}_Y^{DR}(t)$ and $\hat{F}_Y^{DR3}(t)$ when they are divided by the sum of their respective weights, but these estimators are not, in general, monotone nondecreasing functions and are therefore not genuine distribution functions. In both cases, we might consider the general procedure described in [16] to obtain monotone nondecreasing versions $\tilde{F}_Y^{DR}(t)$ and $\tilde{F}_Y^{DR3}(t)$. However, this procedure always increases the computational cost when estimating quantiles.

5. Simulation Study

In this section, we conduct a Monte Carlo study to compare the efficiency of the estimators presented in Section 3.2. The simulation study was programmed in R and Python. New code was developed to calculate the estimators considered. Python was only used for training and applying the machine learning models, in order to take advantage of the package Optuna [52] for hyperparameter optimization. However, R was chosen as the main programming language, since the functions wtd.quantile, from the package reldist [53], and qgeneric, from the package flexsurv [54], facilitate the implementation of custom quantiles. To show that the superiority of some estimators depends on the data, we define various setups based on different sampling strategies for the probability and nonprobability samples. In this analysis, only InfoES information is used.

5.1. Data

The dataset used in the simulation was collected between 2011 and 2012, in the Spanish Life Conditions Survey [55]. Using criteria harmonized for all European Union countries, the Living Conditions Survey generates a reference source of statistics on income distribution and social exclusion within Europe. The dataset was filtered to rule out individuals and variables with large quantities of missing data. Following this procedure, the resulting pseudopopulation had a size of N = 28210 .
The following variables were used in the simulation:
- Demographics
  • $COM$: 1 if the individual has a computer at home, and 0 otherwise;
  • $SEX$: 1 if the individual is male, and 0 otherwise;
  • $AGE$: the individual’s age in years;
  • $AREA_{ME}$: 1 if the individual lives in a medium-density population area, and 0 otherwise;
  • $AREA_{LOW}$: 1 if the individual lives in a low-density population area, and 0 otherwise.
- Analysis variable
  • $INC$: Household expenses in EUR.
Let us consider two setups. In the first, the sampling procedure is the same as that used to select the sample in the Spanish Life Conditions Survey: the probability sample is obtained by stratified cluster sampling, whereby the strata are defined by the NUTS2 regions and the clusters are composed of the households within these regions, extracted with probabilities proportional to household size. The number of households to be selected, $m$, is estimated by dividing $n_R$ (the sample size of $s_{R1}$) by the mean household size. For $n_R = 2000$, $m = 902$. According to this procedure, the final sample size of $s_{R1}$ is $n_{R1} = 2003$.
In the second setup, the reference probability sample is drawn by Midzuno sampling with probabilities proportional to the minimum household income necessary for basic subsistence.
To generate the nonprobability sample, s V , the following scenarios were considered:
  • SC1: Simple random sampling from the population with $COM = 1$.
  • SC2: Unequal probability sampling from the full pseudopopulation, where the probability of selection for the $i$-th individual, $p_i$, is given as follows:
    $$p_i = \frac{1}{1 + \exp(2\,COM + 0.2\,SEX + 0.01\,AGE + 0.2\,AREA_{ME} + 0.4\,AREA_{LOW})}$$
  • SC3: Unequal probability sampling from the full pseudopopulation, where the probability of selection for the $i$-th individual, $p_i$, is given as follows:
    $$p_i = (AGE - 1925)^3/(1995 - 1925)^3$$
These participation mechanisms create weights with different models and levels of variability. Figure 1, Figure 2 and Figure 3 show the resulting histograms of propensities.
By this procedure, we obtained nonprobability samples with sizes n V = 2000 , 4000 and 6000.

5.2. Simulation

In each simulation, the following parameters were estimated:
  • The quantiles $Q_{0.25}$, $Q_{0.5}$, and $Q_{0.75}$.
  • The distribution function $F_y(t)$ at the points $Q_{0.25}$, $Q_{0.5}$, and $Q_{0.75}$.
The following methods for estimating these parameters were compared:
  • The naive estimator, using the sample distribution function of $s_V$ to draw inferences.
  • The proposed calibrated estimator $\hat{F}_Y^{c2}(t)$, where $t_j$ for $j = 1, 2, 3$ corresponds to $Q_{0.25}$, $Q_{0.5}$, and $Q_{0.75}$.
  • The proposed PSA estimator $\hat{F}_Y^{IPS}(t)$.
  • The proposed SM estimator $\hat{F}_Y^{SM}(t)$.
  • The proposed DR estimator $\hat{F}_Y^{DR}(t)$.
All five demographic variables were considered as potential predictors of the propensities and of the predicted values $\hat{y}_i$, for both the logistic and the linear regression models. In addition, a state-of-the-art machine learning method, XGBoost [56], was used as an alternative to these two models in order to evaluate the effect of the method used to estimate propensities and predict values. Refs. [57,58] show that this technique can improve the representativity of self-selection surveys with respect to other prediction methods.
The quantile α is estimated as follows:
$$\hat{Q}_\alpha = \inf\{t : \hat{F}_Y(t) \geq \alpha\} = \hat{F}_Y^{-1}(\alpha)$$
where F ^ Y is one of the five above estimators of F y .
One thousand simulations were run for each context. The resulting mean bias, standard deviation, and root mean square error were measured in relative numbers to make them comparable across different scenarios. The formulas used for their calculation were:
$$RBias(\%) = \frac{\sum_{i=1}^{1000} \hat{\theta}^{(i)}/1000 - \theta_N}{\theta_N} \cdot 100$$
$$RSD(\%) = \sqrt{\frac{\sum_{i=1}^{1000} (\hat{\theta}^{(i)} - \bar{\hat{\theta}})^2}{999}} \cdot \frac{100}{\theta_N}$$
$$RMSE(\%) = \sqrt{RBias^2 + RSD^2}$$
where $\hat{\theta}^{(i)}$ is the estimate of a parameter $\theta_N$ in the $i$-th simulation and $\bar{\hat{\theta}}$ is the mean of the 1000 estimates.
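These three measures are easily computed from the vector of Monte Carlo estimates. A sketch (names are ours):

```python
import numpy as np

def relative_metrics(estimates, theta_N):
    """Relative bias, relative standard deviation and relative RMSE (all in %)
    of Monte Carlo estimates of a parameter with true value theta_N."""
    est = np.asarray(estimates, dtype=float)
    rbias = (est.mean() - theta_N) * 100.0 / theta_N
    rsd = est.std(ddof=1) * 100.0 / theta_N  # divisor len(est) - 1
    rmse = np.sqrt(rbias**2 + rsd**2)
    return rbias, rsd, rmse

# Estimates scattered symmetrically around the true value 10
rb, rs, rm = relative_metrics([9.0, 10.0, 11.0], 10.0)
print(rb, round(rs, 2), round(rm, 2))  # 0.0 10.0 10.0
```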

5.3. Results

The relative bias of estimators is shown in Table 1, for all scenarios and sample sizes.
These results show that the performance of the methods is very similar for each of the probability sample selections considered. The following comments refer to the first columns, i.e., those corresponding to the situation in which the probability sample is chosen through a stratified cluster scheme.
The naive estimator for all parameters in Scenario 1, where there is also coverage bias, reflects a very large degree of bias, which is not eliminated by increasing the sample size. The calibration estimator achieves a considerable reduction in the bias when XGBoost is used to predict the values but does not achieve a significant reduction in the bias with linear regression. For some parameters, this bias is even greater than that of the naive estimator.
As expected, in Scenario 1, the PSA-based estimators do not eliminate the self-selection bias, since there is no relationship between the variables of interest and the probability of participation, and the machine learning method used to predict the propensities has little influence. These results are comparable to those reported by [43], who observed that it is important to add covariates related to the study goal in order to make PSA useful.
On the contrary, with the SM method, the ML technique is of decisive importance: the estimators based on linear regression perform very badly, in general, since there is no linear relationship between the values to be predicted and the covariates. However, the XGBoost method works well in the case of nonlinearity and allows us to select the useful covariates in the prediction. A noteworthy finding is the large bias shown by the regression-based estimator for the quantiles $Q_{0.25}$ and $Q_{0.75}$, while the version based on XGBoost achieves a very significant error reduction. A very similar pattern of behavior was observed in all cases between the SM and the DR estimators.
With Scenario 2, the estimators present a different behavior pattern. The probability of participation depends on all the covariates, and the PSA method reduces the self-selection bias considerably in all cases. The ML method has less impact, and the degree of bias reduction achieved is similar for the two methods. Comparable results were obtained with SM and DR and with the methods based on calibration. In these cases, the bias reduction in relation to the values obtained with the naive estimator is very large and does not depend on the ML method used. No clear pattern emerged as to which of the methods was the best: for some parameters ($Q_{0.25}$ and $Q_{0.5}$), the calibration method worked better, while for others ($F_y(Q_{0.5})$), PSA achieved the greatest reduction in bias, and in yet others ($Q_{0.75}$) the best estimates were produced by SM and DR.
In Scenario 3, where the probability of participating depends only on the age covariate, the calibration, DR, and SM estimators also performed well, obtaining a good level of bias reduction. The estimator based on PSA with logistic regression was the best of all in this respect, for all parameters. However, when XGBoost was used, this decrease in bias was not observed for some parameters. This may be because XGBoost is very sensitive to the choice of hyperparameters; in these simulations, the default values were used and no hyperparameter optimization was performed. Table 2 shows the relative RMSE of these estimators for each scenario.
In Scenario 1, the estimators that use linear or logistic regression are the least efficient, due to the bias that is present. Calibrated SM and DR estimators based on XGBoost improve efficiency by reducing bias. Moreover, the RMSE reduction is very strong in some parameters ( Q 0.75 and F y ( Q 0.75 ) ). However, the PSA-based estimates do not produce a significant reduction in RMSE because the propensities cannot be modeled from the covariates.
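The bias and RMSE figures reported in Tables 1 and 2 follow the usual Monte Carlo definitions, sketched below with made-up replicates (the normalization by the true parameter value is assumed here): relative bias averages the estimation errors over the B simulation runs, while relative RMSE averages the squared errors before taking the root, so it also penalizes dispersion.

```python
import math

def relative_bias_pct(estimates, true_value):
    """100 * mean(theta_hat_b - theta) / theta over B Monte Carlo runs."""
    b = len(estimates)
    return 100.0 * sum(e - true_value for e in estimates) / (b * true_value)

def relative_rmse_pct(estimates, true_value):
    """100 * sqrt(mean((theta_hat_b - theta)^2)) / theta."""
    b = len(estimates)
    mse = sum((e - true_value) ** 2 for e in estimates) / b
    return 100.0 * math.sqrt(mse) / true_value

replicates = [0.52, 0.48, 0.55, 0.45]    # hypothetical estimates of F_y(Q_0.5) = 0.5
print(relative_bias_pct(replicates, 0.5))   # 0.0: positive and negative errors cancel
print(relative_rmse_pct(replicates, 0.5))   # ~7.62: the dispersion still costs accuracy
```

This distinction is why an unbiased but noisy estimator can still have a large RMSE, and why removing bias (as SM and DR with XGBoost do here) translates directly into the RMSE gains seen in Table 2.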
In Scenarios 2 and 3, all the proposed methods effectively reduce the error in the estimates, with the exception of PSA with XGBoost in some cases, as discussed above.
To determine whether this problem encountered with the XGBoost method in some situations can be resolved with an appropriate choice of hyperparameters, we repeated the simulation using a hyperparameter optimization process based on the Tree-structured Parzen Estimator (TPE) algorithm [59]. In this procedure, the error is estimated by cross-validation on the logistic loss obtained by each possible model over the training data. Accordingly, this process could be replicated in a real-world scenario.
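The scoring inside that optimization loop can be sketched as follows (pure Python with hypothetical data; the TPE search itself would be delegated to a library such as Optuna [52], which proposes each new hyperparameter configuration to score): every candidate model is ranked by the K-fold cross-validated logistic loss of its predicted propensities.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean logistic loss of predicted propensities against 0/1 labels
    (label 1 = the unit belongs to the non-probability sample)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)          # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds (shuffling omitted for brevity)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_log_loss(labels, fit_predict, k=5):
    """K-fold cross-validated logistic loss of one candidate model.
    `fit_predict(train_idx, test_idx)` is a hypothetical callable that fits
    the model (e.g., an XGBoost classifier with the trial's hyperparameters)
    on the training indices and returns propensities for the test indices."""
    losses = []
    for test_idx in kfold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        preds = fit_predict(train_idx, test_idx)
        losses.append(log_loss([labels[i] for i in test_idx], preds))
    return sum(losses) / len(losses)

# a constant 0.5 predictor scores the logistic loss log(2) on any labels
const = lambda train_idx, test_idx: [0.5] * len(test_idx)
print(round(cv_log_loss([1, 0, 1, 0], const, k=2), 4))  # 0.6931
```

Because this score uses only the training data, the same trial-and-score loop is available to a practitioner with a single non-probability sample, which is what makes the procedure replicable in a real-world scenario.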
Table 3 and Table 4 show the bias and RMSE values for the estimators with this new simulation for Scenario 2 and Setup 2 (the worst scenario for PSA with the default XGBoost method). Similar results were obtained for all other situations, but for reasons of space they are not shown in this paper.
These results clearly show that, by optimizing the hyperparameters, we have considerably reduced the bias and error of the estimators.

6. Discussion

In recent years, the use of survey-based online research has expanded considerably. Web surveys are an attractive option in many fields of sociological investigation due to their low fieldwork costs and rapid data collection. However, this survey mode is also subject to many limitations in terms of accurately representing the target population, and the estimates thus obtained are highly likely to present coverage and/or self-selection bias. Various correction techniques, such as calibration, propensity score adjustment, and statistical matching, have been proposed as a means of reducing or eliminating these forms of bias.
Our paper focuses on the question of estimating the distribution function. This issue is important: the distribution function is a basic statistic underlying many others, and for purposes such as assessing and comparing finite populations it can be more revealing than simple means and totals. Indeed, many previous studies have considered how calibration techniques may be applied to the estimation of the distribution function in the context of a probability survey [19,20,21,22,23,24,25,26,27], and even to overcome the problem of non-response [28]. However, to our knowledge, very few studies, if any, have addressed this issue from the standpoint of a non-probability survey. Accordingly, we analyze the efficiency obtained by bias-correction techniques such as calibration, propensity score adjustment, and statistical matching in various situations within a non-probability survey context. In this analysis, we consider the performance of several estimators in terms of reducing self-selection bias, using a representative survey sample as a proxy for the target population. Among the estimators proposed for the distribution function, F ^ y c 1 ( t ) needs specific pseudo-distances G ( . , . ) in order to satisfy condition (ii). The estimator F ^ Y D R 2 ( t ) is always a genuine distribution function, and under favorable conditions the estimators F ^ Y c 2 ( t ) and F ^ Y c 3 ( t ) also yield a genuine distribution function. Moreover, with minor modifications, the estimators F ^ Y I P S ( t ) (under a logistic regression model) and F ^ Y S M ( t ) also satisfy the distribution function conditions. On the other hand, the estimators F ^ Y D R ( t ) and F ^ Y D R 3 ( t ) are not, in general, monotonically nondecreasing functions; therefore, when estimating quantiles, an additional process, which increases the computational cost, must be applied.
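For the estimators that are not monotonically nondecreasing, that additional process can be as simple as the following sketch (illustrative code, not the exact procedure of the paper): a running maximum over an increasing grid of evaluation points restores monotonicity, and the alpha-quantile is then read off as the smallest grid point at which the monotonized estimate reaches alpha.

```python
def monotonize(f_values):
    """Running maximum: the smallest nondecreasing majorant of the pointwise
    estimates F_hat(t_1), F_hat(t_2), ... on an increasing grid."""
    out, cur = [], float("-inf")
    for v in f_values:
        cur = max(cur, v)
        out.append(min(cur, 1.0))   # a distribution function never exceeds 1
    return out

def quantile_from_df(grid, f_values, alpha):
    """Q_hat(alpha) = inf{t in grid : monotonized F_hat(t) >= alpha}."""
    for t, f in zip(grid, monotonize(f_values)):
        if f >= alpha:
            return t
    return grid[-1]

grid = [1, 2, 3, 4, 5]
f_hat = [0.1, 0.35, 0.30, 0.8, 1.0]        # non-monotone raw DR-type estimate
print(monotonize(f_hat))                   # [0.1, 0.35, 0.35, 0.8, 1.0]
print(quantile_from_df(grid, f_hat, 0.5))  # 4
```

The monotonization pass is linear in the grid size, so the extra computational cost mentioned above comes mainly from having to evaluate the estimator on a whole grid of points rather than at a single t.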
All the estimators included in our proposal can be used under both linear and nonlinear models. F ^ Y I P S ( t ) , F ^ Y S M ( t ) , F ^ Y D R ( t ) , F ^ Y D R 2 ( t ) , and F ^ Y D R 3 ( t ) are directly applicable in either case. Although the calibrated estimators F ^ Y c 2 ( t ) and F ^ Y c 3 ( t ) assume a linear model, due to the pseudo-variable g k , the combination with XGBoost enables them to be used with other models too, and they can also cover the nonlinear case through the procedure described in [60]. The behavior of all these estimators is demonstrated through simulation studies.
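As a reminder of the calibration step underlying these estimators, here is a one-auxiliary-variable sketch (illustrative Python with hypothetical values; in the estimators above, the auxiliary variable is replaced by pseudo-variables such as g k ): under the chi-square distance, the design weights are minimally perturbed so that the calibrated weights reproduce the known population total of x.

```python
def calibrate_weights(d, x, t_x):
    """Linear (chi-square distance) calibration with one auxiliary variable:
    w_k = d_k * (1 + lam * x_k), with lam chosen so that sum(w_k * x_k)
    matches the known population total t_x."""
    sum_dx = sum(dk * xk for dk, xk in zip(d, x))
    sum_dxx = sum(dk * xk * xk for dk, xk in zip(d, x))
    lam = (t_x - sum_dx) / sum_dxx
    return [dk * (1.0 + lam * xk) for dk, xk in zip(d, x)]

d = [10.0, 10.0, 10.0]   # design weights: weighted total of x is 60
x = [1.0, 2.0, 3.0]      # auxiliary variable with known population total 66
w = calibrate_weights(d, x, 66.0)
print(round(sum(wk * xk for wk, xk in zip(w, x)), 6))  # 66.0: benchmark reproduced
```

Other pseudo-distances G ( . , . ) lead to different weight adjustments (and, as noted above, some are needed for the estimates to remain genuine distribution functions), but the benchmarking principle is the same.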
Although further investigation is needed, our results show that self-selection bias can be greatly reduced by any of the four methods considered, particularly when appropriate covariates and a suitable machine learning technique are used, both in estimating propensities and in predicting values. However, our investigation did not enable us to determine which method is best in all situations: for each parameter and bias-reduction method, different behavior patterns were obtained. Nevertheless, in general, the calibration method based on XGBoost is fairly efficient in any situation.
Although the methods proposed are shown to be effective in reducing the MSE of quantile and distribution function estimates in various situations, certain limitations must be acknowledged. For the PSA-based method, for example, the amount of bias reduction achieved depends on how well the propensity model predictors predict the outcome. If the propensity model is poorly fitted, the PSA estimates may even be more biased than the naive estimates. The same applies to estimators based on SM, which need a good model in order to accurately predict the y-values. For the distribution function, the issue is even more complex: even if there is a good linear relationship between y and the covariates x, this relationship is not necessarily transferred to the jump functions Δ ( t − y ) . In practice, it is often difficult to decide whether the auxiliary variables contain all the components needed to characterize the selection mechanism and the superpopulation model. Therefore, when selecting the covariates and the functional form of the model, it is essential to use flexible ML techniques.
Finally, the present study does not address the question of the estimation of variance. Plug-in estimators can be used to construct variance estimators from the expression of the asymptotic variance, but the issue is not simple, as the variance depends on the probability of the sample s R being selected and on the selection mechanism described by the propensity model. In estimating the variance for nonlinear parameters, jackknife and bootstrap techniques [61] might be useful and should be considered in future research in this area.
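As a pointer for that future work, the generic delete-one jackknife (a textbook sketch, not the variance estimator of this paper; see [61]) recomputes the estimator with each unit removed and combines the resulting pseudo-values.

```python
def jackknife_variance(sample, estimator):
    """Delete-one jackknife: v = ((n - 1) / n) * sum_i (theta_(i) - theta_bar)^2,
    where theta_(i) is the estimate computed with unit i removed and
    theta_bar is the mean of the n leave-one-out estimates."""
    n = len(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    return (n - 1) / n * sum((t - theta_bar) ** 2 for t in leave_one_out)

mean = lambda s: sum(s) / len(s)
data = [2.0, 4.0, 6.0, 8.0]
print(jackknife_variance(data, mean))   # ~1.667, i.e., s^2/n for the sample mean
```

Adapting this to the present setting is the nontrivial part: the resampling must respect both the design of the reference sample s R and the estimated propensity model, which is precisely why the variance question is left open here.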

Author Contributions

M.d.M.R., S.M.-P. and L.C.-M. contributed equally to the conceptualization of this study, its methodology, software, and original draft preparation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Ministerio de Ciencia, Innovación y Universidades (Grant No. PID2019-106861RB-I00), IMAG-Maria de Maeztu CEX2020-001105-M/AEI/10.13039/501100011033 and FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades (FQM170-UGR20).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Acal, C.; Ruiz-Castro, J.E.; Aguilera, A.M.; Jiménez-Molinos, F.; Roldán, J.B. Phase-type distributions for studying variability in resistive memories. J. Comput. Appl. Math. 2019, 345, 23–32.
  2. Alba-Fernández, M.V.; Batsidis, A.; Jiménez-Gamero, M.D.; Jodrá, P. A class of tests for the two-sample problem for count data. J. Comput. Appl. Math. 2017, 318, 220–229.
  3. Decker, R.A.; Haltiwanger, J.; Jarmin, R.S.; Miranda, J. Declining business dynamism: What we know and the way forward. Am. Econ. Rev. 2016, 106, 203–207.
  4. Gallagher, C.M.; Meliker, J.R. Blood and urine cadmium, blood pressure, and hypertension: A systematic review and meta-analysis. Environ. Health Perspect. 2010, 118, 1676–1684.
  5. Medialdea, L.; Bogin, B.; Thiam, M.; Vargas, A.; Marrodán, M.D.; Dossou, N.I. Severe acute malnutrition morphological patterns in children under five. Sci. Rep. 2021, 11, 4237.
  6. Vander Wal, J.S.; Mitchell, E.R. Psychological complications of pediatric obesity. Pediatr. Clin. 2011, 58, 1393–1401.
  7. Wilson, R.C.; Fleming, Z.L.; Monks, P.S.; Clain, G.; Henne, S.; Konovalov, I.B.; Menut, L. Have primary emission reduction measures reduced ozone across Europe? An analysis of European rural background ozone trends 1996–2005. Atmos. Chem. Phys. 2012, 12, 437–454.
  8. Decker, R.; Haltiwanger, J.; Jarmin, R.; Miranda, J. The role of entrepreneurship in US job creation and economic dynamism. J. Econ. Perspect. 2014, 28, 3–24.
  9. Dickens, R.; Manning, A. Has the national minimum wage reduced UK wage inequality? J. R. Stat. Soc. Ser. A Stat. Soc. 2004, 167, 613–626.
  10. De Haan, J.; Pleninger, R.; Sturm, J.E. Does financial development reduce the poverty gap? Soc. Indic. Res. 2022, 161, 1–27.
  11. Jolliffe, D.; Prydz, E.B. Estimating international poverty lines from comparable national thresholds. J. Econ. Inequal. 2016, 14, 185–198.
  12. Martínez, S.; Illescas, M.; Martínez, H.; Arcos, A. Calibration estimator for Head Count Index. Int. J. Comput. Math. 2020, 97, 51–62.
  13. Sedransk, N.; Sedransk, J. Distinguishing among distributions using data from complex sample designs. J. Am. Stat. Assoc. 1979, 74, 754–760.
  14. Chambers, R.L.; Dunstan, R. Estimating distribution functions from survey data. Biometrika 1986, 73, 597–604.
  15. Chen, J.; Wu, C. Estimation of distribution function and quantiles using the model-calibrated pseudo empirical likelihood method. Stat. Sin. 2002, 12, 1223–1239.
  16. Rao, J.N.K.; Kovar, J.G.; Mantel, H.J. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 1990, 77, 365–375.
  17. Silva, P.L.D.; Skinner, C.J. Estimating distribution functions with auxiliary information using poststratification. J. Off. Stat. 1995, 11, 277–294.
  18. Deville, J.C.; Särndal, C.E. Calibration estimators in survey sampling. J. Am. Stat. Assoc. 1992, 87, 376–382.
  19. Arcos, A.; Martínez, S.; Rueda, M.; Martínez, H. Distribution function estimates from dual frame context. J. Comput. Appl. Math. 2017, 318, 242–252.
  20. Harms, T.; Duchesne, P. On calibration estimation for quantiles. Surv. Methodol. 2006, 32, 37–52.
  21. Martínez, S.; Rueda, M.; Arcos, A.; Martínez, H. Optimum calibration points estimating distribution functions. J. Comput. Appl. Math. 2010, 233, 2265–2277.
  22. Martínez, S.; Rueda, M.; Martínez, H.; Arcos, A. Optimal dimension and optimal auxiliary vector to construct calibration estimators of the distribution function. J. Comput. Appl. Math. 2017, 318, 444–459.
  23. Martínez, S.; Rueda, M.; Illescas, M. The optimization problem of quantile and poverty measures estimation based on calibration. J. Comput. Appl. Math. 2022, 45, 113054.
  24. Mayor-Gallego, J.A.; Moreno-Rebollo, J.L.; Jiménez-Gamero, M.D. Estimation of the finite population distribution function using a global penalized calibration method. AStA Adv. Stat. Anal. 2019, 103, 1–35.
  25. Rueda, M.; Martínez, S.; Martínez, H.; Arcos, A. Estimation of the distribution function with calibration methods. J. Stat. Plan. Inference 2007, 137, 435–448.
  26. Singh, H.P.; Singh, S.; Kozak, M. A family of estimators of finite-population distribution function using auxiliary information. Acta Appl. Math. 2008, 104, 115–130.
  27. Wu, C. Optimal calibration estimators in survey sampling. Biometrika 2003, 90, 937–951.
  28. Rueda, M.; Martínez, S.; Illescas, M. Treating nonresponse in the estimation of the distribution function. Math. Comput. Simul. 2021, 186, 136–144.
  29. Bradshaw, J.; Mayhew, E. Understanding extreme poverty in the European Union. Eur. J. Homelessness 2010, 4, 171–186.
  30. Bethlehem, J. Selection Bias in Web Surveys. Int. Stat. Rev. 2010, 78, 161–188.
  31. Chen, Y.; Li, P.; Wu, C. Doubly Robust Inference with Nonprobability Survey Samples. J. Am. Stat. Assoc. 2019, 115, 2011–2021.
  32. Beaumont, J.F. Are probability surveys bound to disappear for the production of official statistics? Surv. Methodol. 2020, 46, 1–29.
  33. Buelens, B.; Burger, J.; van den Brakel, J.A. Comparing Inference Methods for Non-probability Samples. Int. Stat. Rev. 2018, 86, 322–343.
  34. Kim, J.K.; Wang, Z. Sampling techniques for big data analysis. Int. Stat. Rev. 2019, 87, S177–S191.
  35. Rao, J.N.K. On Making Valid Inferences by Integrating Data from Surveys and Other Sources. Sankhya B 2020, 83, 242–272.
  36. Valliant, R. Comparing alternatives for estimation from nonprobability samples. J. Surv. Stat. Methodol. 2020, 8, 231–263.
  37. Yang, S.; Kim, J.K. Statistical data integration in survey sampling: A review. Jpn. J. Stat. Data Sci. 2020, 3, 625–650.
  38. Lee, S. Propensity Score Adjustment as a Weighting Scheme for Volunteer Panel Web Surveys. J. Off. Stat. 2006, 22, 329–349.
  39. Lee, S.; Valliant, R. Estimation for Volunteer Panel Web Surveys Using Propensity Score Adjustment and Calibration Adjustment. Sociol. Method Res. 2009, 37, 319–343.
  40. Rivers, D. Sampling for Web Surveys. In Proceedings of the Joint Statistical Meetings, Salt Lake City, UT, USA, 29 July–2 August 2007.
  41. Wang, L.; Graubard, B.I.; Katki, H.A.; Li, Y. Improving external validity of epidemiologic cohort analyses: A kernel weighting approach. J. R. Stat. Soc. Ser. A Stat. Soc. 2020, 183, 1293–1311.
  42. Castro-Martín, L.; Rueda, M.D.M.; Ferri-García, R. Combining statistical matching and propensity score adjustment for inference from non-probability surveys. J. Comput. Appl. Math. 2021, 404, 113414.
  43. Ferri-García, R.; Rueda, M.M. Efficiency of Propensity Score Adjustment and calibration on the estimation from non-probabilistic online surveys. SORT Stat. Oper. Res. Trans. 2018, 42, 1–10.
  44. Rueda, M.; Ferri-García, R.; Castro, L. The R package NonProbEst for estimation in non-probability surveys. R J. 2020, 12, 406–418.
  45. Elliott, M.R.; Valliant, R. Inference for Nonprobability Samples. Stat. Sci. 2017, 32, 249–264.
  46. Ferri-García, R.; Rueda, M.D.M. Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys. PLoS ONE 2020, 15, e0231500.
  47. Valliant, R.; Dever, J.A. Estimating Propensity Adjustments for Volunteer Web Surveys. Sociol. Method Res. 2011, 40, 105–137.
  48. Rosenbaum, P.R.; Rubin, D.B. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 1983, 70, 41–55.
  49. Schonlau, M.; Couper, M. Options for Conducting Web Surveys. Stat. Sci. 2017, 32, 279–292.
  50. Castro-Martín, L.; Rueda, M.D.M.; Ferri-García, R. Estimating General Parameters from Non-Probability Surveys Using Propensity Score Adjustment. Mathematics 2020, 8, 2096.
  51. Wu, C.; Thompson, M.E. Sampling Theory and Practice; Springer Nature: Cham, Switzerland, 2020.
  52. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631.
  53. Handcock, M.S. Relative Distribution Methods; Version 1.7-1. 2022. Available online: https://CRAN.R-project.org/package=reldist (accessed on 20 October 2022).
  54. Jackson, C.H. flexsurv: A platform for parametric survival modeling in R. J. Stat. Softw. 2016, 70, i08.
  55. National Institute of Statistics. Life Conditions Survey—Microdata; National Institute of Statistics: Washington, DC, USA, 2012.
  56. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  57. Castro-Martín, L.; Rueda, M.D.M.; Ferri-García, R. Inference from Non-Probability Surveys with Statistical Matching and Propensity Score Adjustment Using Modern Prediction Techniques. Mathematics 2020, 8, 879.
  58. Castro-Martín, L.; Rueda, M.D.M.; Ferri-García, R.; Hernando-Tamayo, C. On the Use of Gradient Boosting Methods to Improve the Estimation with Data Obtained with Self-Selection Procedures. Mathematics 2021, 9, 2991.
  59. Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Volume 24.
  60. Rueda, M.; Sánchez-Borrego, I.; Arcos, A.; Martínez, S. Model-calibration estimation of the distribution function using nonparametric regression. Metrika 2010, 71, 33–44.
  61. Wolter, K.M. Introduction to Variance Estimation, 2nd ed.; Springer: New York, NY, USA, 2007.
Figure 1. Histogram of population propensities in SC1.
Figure 2. Histogram of population propensities in SC2.
Figure 3. Histogram of population propensities in SC3.
Table 1. Bias (%) for each reference probability sample, parameter, non-probability sampling and size, method and machine learning model (linear/logistic regression or XGBoost). Within each block, the six columns are Q0.25, Q0.5, Q0.75, Fy(Q0.25), Fy(Q0.5), Fy(Q0.75); the first block corresponds to the stratified reference sample and the second to the proportional one.

SC1 2000 | Naive   | 23.3 17.9 13.4 −32.6 −20.9 −10.0 | 23.3 17.9 13.4 −32.6 −20.9 −10.0
         | Cal Reg | 6.7 12.9 17.7 −0.6 −17.2 −20.7 | −10.3 2.0 12.0 40.7 1.4 −5.9
         | Cal XGB | 0.2 3.6 9.8 7.1 4.2 −0.3 | 1.1 6.3 9.8 5.5 1.0 −0.4
         | PSA Reg | 16.8 14.6 11.1 −23.3 −16.5 −8.2 | 21.5 18.0 13.9 −30.0 −20.4 −10.1
         | PSA XGB | 19.6 13.1 10.5 −28.1 −15.6 −8.8 | 28.4 20.9 12.5 −38.2 −24.4 −10.5
         | SM Reg  | −4×10^8 25.8 3×10^8 −0.6 −17.3 −20.7 | −2×10^9 −0.8 4×10^7 40.7 1.4 −5.9
         | SM XGB  | −4.1 −4.2 0.6 7.3 4.1 −0.4 | −2.3 −1.2 0.6 5.7 0.9 −0.4
         | DR Reg  | −4×10^8 25.8 3×10^8 −0.6 −17.2 −20.7 | −2×10^9 −0.9 4×10^7 40.7 1.4 −5.9
         | DR XGB  | −4.1 −4.3 0.5 7.3 4.2 −0.3 | −2.4 −1.2 0.6 5.6 1.0 −0.4
SC1 4000 | Naive   | 23.3 18.1 13.6 −32.7 −21.0 −10.1 | 23.3 18.1 13.6 −32.7 −21.0 −10.1
         | Cal Reg | −6.3 2.9 13.7 30.6 −1.6 −10.2 | −11.2 1.4 11.7 42.9 2.6 −5.2
         | Cal XGB | 0.2 3.7 9.9 6.9 4.2 −0.4 | 1.2 6.3 9.9 5.2 1.0 −0.5
         | PSA Reg | 16.8 14.6 11.3 −23.4 −16.6 −8.3 | 21.5 18.0 13.9 −30.0 −20.4 −10.1
         | PSA XGB | 19.9 14.0 13.0 −28.0 −16.4 −10.4 | 28.1 22.1 13.7 −37.1 −25.0 −11.4
         | SM Reg  | −2×10^9 4.3 2×10^8 30.6 −1.6 −10.2 | −2×10^9 −2.3 6×10^7 42.9 2.5 −5.2
         | SM XGB  | −4.2 −4.2 0.6 7.2 4.1 −0.4 | −2.2 −1.2 0.6 5.5 0.9 −0.5
         | DR Reg  | −2×10^9 4.3 2×10^8 30.6 −1.6 −10.2 | −2×10^9 −2.4 6×10^7 42.9 2.6 −5.2
         | DR XGB  | −4.1 −4.2 0.6 7.1 4.2 −0.3 | −2.3 −1.2 0.6 5.4 1.0 −0.5
SC1 6000 | Naive   | 23.3 18.0 13.5 −32.7 −20.9 −10.0 | 23.3 18.0 13.5 −32.7 −20.9 −10.0
         | Cal Reg | 0.7 8.1 15.8 13.9 −9.9 −15.8 | −10.9 1.6 11.7 42.0 2.1 −5.5
         | Cal XGB | 0.2 3.6 9.7 6.8 4.1 −0.4 | 1.2 6.2 9.7 5.1 1.0 −0.5
         | PSA Reg | 16.8 14.6 11.4 −23.5 −16.5 −8.3 | 21.4 17.9 13.9 −30.0 −20.3 −10.1
         | PSA XGB | 20.1 14.6 14.8 −27.6 −17.2 −11.4 | 27.6 22.5 14.3 −36.0 −25.1 −11.9
         | SM Reg  | −7×10^8 15.8 1×10^8 13.9 −10.0 −15.8 | −3×10^9 −1.7 9×10^7 42.0 2.1 −5.5
         | SM XGB  | −4.3 −4.1 0.6 7.0 4.2 −0.4 | −2.1 −1.2 0.6 5.4 1.0 −0.5
         | DR Reg  | −7×10^8 15.8 1×10^8 13.9 −9.9 −15.8 | −3×10^9 −1.8 9×10^7 42.0 2.1 −5.5
         | DR XGB  | −4.2 −4.2 0.6 7.0 4.2 −0.4 | −2.3 −1.3 0.6 5.3 1.0 −0.5
SC2 2000 | Naive   | 12.1 10.2 7.8 −18.2 −11.6 −5.6 | 12.1 10.2 7.8 −18.2 −11.6 −5.6
         | Cal Reg | −1.4 0.1 5.9 6.7 4.0 −0.4 | −0.7 3.1 5.9 5.0 1.1 −0.6
         | Cal XGB | −1.5 0.0 5.8 7.1 4.2 −0.3 | −0.8 3.2 5.9 5.5 1.0 −0.4
         | PSA Reg | −5.2 −4.3 −2.9 9.5 4.9 2.0 | −0.2 0.1 −0.2 0.2 0.1 0.2
         | PSA XGB | 5.4 2.4 2.5 −8.5 −2.4 −2.1 | 15.2 10.8 5.9 −22.5 −12.3 −4.7
         | SM Reg  | −4.3 −4.0 0.6 6.7 4.0 −0.4 | −2.1 −1.1 0.6 5.0 1.0 −0.6
         | SM XGB  | −4.1 −4.2 0.5 7.2 4.2 −0.3 | −2.3 −1.2 0.6 5.6 0.9 −0.4
         | DR Reg  | −4.3 −4.0 0.6 6.7 4.0 −0.4 | −2.2 −1.2 0.6 5.0 1.1 −0.6
         | DR XGB  | −4.1 −4.3 0.5 7.1 4.2 −0.3 | −2.4 −1.2 0.6 5.5 1.0 −0.4
SC2 4000 | Naive   | 11.8 10.1 7.7 −17.8 −11.5 −5.6 | 11.8 10.1 7.7 −17.8 −11.5 −5.6
         | Cal Reg | −1.4 −0.1 5.8 6.6 4.1 −0.4 | −0.7 3.0 5.8 5.0 1.1 −0.5
         | Cal XGB | −1.6 −0.2 5.7 7.0 4.2 −0.4 | −0.9 3.1 5.8 5.4 1.0 −0.4
         | PSA Reg | −5.4 −4.3 −2.9 9.8 4.9 2.0 | −0.5 −0.1 −0.4 0.8 0.2 0.3
         | PSA XGB | 3.5 2.0 3.2 −5.6 −1.8 −2.7 | 13.9 10.8 6.4 −20.3 −12.1 −5.2
         | SM Reg  | −4.3 −4.0 0.6 6.6 4.0 −0.4 | −2.1 −1.1 0.6 4.9 1.1 −0.5
         | SM XGB  | −4.1 −4.1 0.6 7.3 4.1 −0.4 | −2.2 −1.2 0.6 5.6 0.9 −0.5
         | DR Reg  | −4.3 −4.0 0.5 6.6 4.1 −0.4 | −2.1 −1.2 0.6 5.0 1.1 −0.5
         | DR XGB  | −4.1 −4.2 0.6 7.0 4.2 −0.3 | −2.4 −1.2 0.6 5.4 1.0 −0.5
SC2 6000 | Naive   | 11.4 9.9 7.4 −17.3 −11.1 −5.4 | 11.4 9.9 7.4 −17.3 −11.1 −5.4
         | Cal Reg | −1.5 −0.2 5.6 6.6 4.1 −0.4 | −0.7 2.9 5.7 4.9 1.1 −0.5
         | Cal XGB | −1.6 −0.3 5.6 7.0 4.2 −0.4 | −0.9 3.0 5.6 5.3 1.0 −0.5
         | PSA Reg | −5.3 −4.4 −2.8 9.8 4.9 1.9 | −0.6 −0.2 −0.5 0.9 0.3 0.4
         | PSA XGB | 1.8 1.4 3.6 −3.0 −1.2 −3.1 | 12.3 10.3 6.5 −18.1 −11.3 −5.1
         | SM Reg  | −4.3 −4.0 0.6 6.5 4.1 −0.4 | −2.1 −1.1 0.6 4.9 1.1 −0.5
         | SM XGB  | −4.2 −4.2 0.6 7.3 4.1 −0.4 | −2.2 −1.1 0.6 5.6 1.0 −0.5
         | DR Reg  | −4.3 −4.0 0.5 6.6 4.1 −0.4 | −2.1 −1.2 0.5 4.9 1.1 −0.5
         | DR XGB  | −4.1 −4.2 0.6 7.0 4.2 −0.3 | −2.3 −1.2 0.6 5.3 1.1 −0.5
SC3 2000 | Naive   | 9.8 8.9 6.9 −14.2 −10.2 −4.9 | 9.8 8.9 6.9 −14.2 −10.2 −4.9
         | Cal Reg | −2.1 −0.5 5.3 7.4 4.3 −0.3 | −1.3 2.6 5.4 5.5 1.1 −0.6
         | Cal XGB | −1.9 −0.5 5.3 7.0 4.2 −0.3 | −1.2 2.8 5.3 5.5 1.0 −0.4
         | PSA Reg | 1.7 0.7 −0.6 −3.0 −0.6 0.5 | 3.4 2.3 0.7 −6.1 −2.5 −0.4
         | PSA XGB | 10.5 5.7 5.3 −14.5 −5.6 −4.0 | 13.1 9.4 6.6 −17.8 −10.4 −5.0
         | SM Reg  | −4.5 −4.4 0.6 7.4 4.3 −0.3 | −2.5 −1.2 0.6 5.5 1.1 −0.6
         | SM XGB  | −4.1 −4.1 0.6 7.1 4.2 −0.3 | −2.4 −1.1 0.6 5.6 1.0 −0.4
         | DR Reg  | −4.5 −4.5 0.5 7.4 4.3 −0.3 | −2.6 −1.2 0.6 5.5 1.1 −0.6
         | DR XGB  | −4.1 −4.2 0.5 7.1 4.2 −0.3 | −2.5 −1.2 0.6 5.5 1.1 −0.4
SC3 4000 | Naive   | 10.0 9.0 6.8 −14.3 −10.2 −4.9 | 10.0 9.0 6.8 −14.3 −10.2 −4.9
         | Cal Reg | −1.9 −0.6 5.2 6.8 4.2 −0.4 | −1.1 2.6 5.3 5.0 1.1 −0.5
         | Cal XGB | −1.8 −0.6 5.2 6.7 4.2 −0.3 | −1.2 2.7 5.2 5.3 1.1 −0.5
         | PSA Reg | 1.6 1.0 −0.4 −2.7 −1.0 0.3 | 3.3 2.7 0.7 −5.8 −2.9 −0.5
         | PSA XGB | 10.2 5.9 5.8 −14.1 −5.6 −4.6 | 12.7 9.8 7.2 −16.7 −11.0 −5.5
         | SM Reg  | −4.3 −4.2 0.5 6.8 4.1 −0.4 | −2.1 −1.1 0.6 5.0 1.1 −0.5
         | SM XGB  | −4.2 −4.1 0.5 6.9 4.2 −0.4 | −2.4 −1.2 0.6 5.4 1.0 −0.5
         | DR Reg  | −4.4 −4.2 0.5 6.8 4.2 −0.4 | −2.2 −1.2 0.6 5.0 1.1 −0.5
         | DR XGB  | −4.1 −4.2 0.6 6.9 4.2 −0.3 | −2.5 −1.2 0.6 5.3 1.1 −0.5
SC3 6000 | Naive   | 10.0 9.0 6.7 −14.4 −10.2 −4.9 | 10.0 9.0 6.7 −14.4 −10.2 −4.9
         | Cal Reg | −1.8 −0.6 5.2 6.6 4.1 −0.4 | −1.1 2.6 5.2 4.9 1.1 −0.5
         | Cal XGB | −1.8 −0.6 5.2 6.7 4.2 −0.4 | −1.2 2.6 5.2 5.2 1.1 −0.5
         | PSA Reg | 1.6 0.9 −0.3 −2.9 −0.8 0.2 | 3.2 2.5 0.8 −5.8 −2.7 −0.5
         | PSA XGB | 9.8 6.4 6.5 −13.5 −6.0 −5.3 | 12.6 10.3 7.7 −16.2 −11.4 −6.0
         | SM Reg  | −4.3 −4.0 0.5 6.6 4.1 −0.4 | −2.1 −1.2 0.6 4.9 1.1 −0.5
         | SM XGB  | −4.2 −4.2 0.6 6.9 4.2 −0.4 | −2.3 −1.1 0.6 5.4 1.0 −0.5
         | DR Reg  | −4.3 −4.0 0.5 6.6 4.1 −0.4 | −2.1 −1.2 0.5 4.9 1.1 −0.5
         | DR XGB  | −4.2 −4.2 0.6 6.9 4.2 −0.3 | −2.4 −1.1 0.6 5.2 1.1 −0.5
Table 2. RMSE (%) for each reference probability sample, parameter, non-probability sampling and size, method, and machine learning model (linear/logistic regression or XGBoost). Within each block, the six columns are Q0.25, Q0.5, Q0.75, Fy(Q0.25), Fy(Q0.5), Fy(Q0.75); the first block corresponds to the stratified reference sample and the second to the proportional one.

SC1 2000 | Naive   | 23.4 18.0 13.5 32.8 21.0 10.0 | 23.4 18.0 13.5 32.8 21.0 10.0
         | Cal Reg | 24.0 21.8 19.1 54.4 32.2 27.5 | 21.7 13.4 13.3 59.3 21.6 15.6
         | Cal XGB | 1.1 3.8 9.9 7.3 4.2 0.4 | 1.5 6.5 9.9 5.6 1.0 0.5
         | PSA Reg | 16.9 14.7 11.2 23.5 16.6 8.3 | 21.6 18.0 13.9 30.1 20.5 10.1
         | PSA XGB | 20.3 13.7 11.4 29.0 16.2 9.3 | 28.7 21.2 12.8 38.7 24.7 10.8
         | SM Reg  | 9×10^8 45.5 7×10^8 54.4 32.2 27.5 | 3×10^9 28.4 2×10^8 59.3 21.6 15.6
         | SM XGB  | 4.2 4.3 0.7 7.5 4.2 0.5 | 2.4 1.3 0.8 5.8 1.0 0.6
         | DR Reg  | 9×10^8 45.4 7×10^8 54.4 32.2 27.5 | 3×10^9 28.5 2×10^8 59.3 21.6 15.6
         | DR XGB  | 4.2 4.3 0.7 7.4 4.2 0.4 | 2.5 1.3 0.7 5.7 1.0 0.6
SC1 4000 | Naive   | 23.4 18.2 13.6 32.8 21.1 10.1 | 23.4 18.2 13.6 32.8 21.1 10.1
         | Cal Reg | 22.3 16.7 15.3 59.4 25.5 19.8 | 21.4 12.7 13.0 59.6 20.9 14.8
         | Cal XGB | 0.8 3.7 9.9 7.0 4.2 0.4 | 1.4 6.4 10.0 5.3 1.0 0.5
         | PSA Reg | 16.9 14.7 11.4 23.5 16.6 8.3 | 21.5 18.0 13.9 30.1 20.4 10.1
         | PSA XGB | 20.2 14.3 13.5 28.5 16.8 10.6 | 28.3 22.2 13.8 37.4 25.1 11.5
         | SM Reg  | 3×10^9 35.3 5×10^8 59.4 25.5 19.8 | 3×10^9 27.4 3×10^8 59.6 20.9 14.8
         | SM XGB  | 4.3 4.2 0.7 7.3 4.2 0.5 | 2.3 1.2 0.7 5.7 1.0 0.6
         | DR Reg  | 3×10^9 35.2 5×10^8 59.4 25.5 19.8 | 3×10^9 27.4 3×10^8 59.6 20.9 14.8
         | DR XGB  | 4.2 4.2 0.7 7.2 4.2 0.4 | 2.4 1.3 0.6 5.4 1.0 0.5
SC1 6000 | Naive   | 23.3 18.1 13.5 32.8 20.9 10.0 | 23.3 18.1 13.5 32.8 20.9 10.0
         | Cal Reg | 23.1 19.5 17.4 56.8 29.3 24.2 | 21.5 13.0 13.0 59.5 21.2 15.1
         | Cal XGB | 0.6 3.7 9.7 6.8 4.2 0.4 | 1.3 6.2 9.8 5.2 1.0 0.5
         | PSA Reg | 16.9 14.6 11.5 23.5 16.5 8.3 | 21.4 18.0 13.9 30.1 20.4 10.1
         | PSA XGB | 20.4 14.8 15.1 27.9 17.4 11.6 | 27.8 22.6 14.4 36.2 25.2 11.9
         | SM Reg  | 1×10^9 41.0 3×10^8 56.8 29.3 24.2 | 4×10^9 27.8 3×10^8 59.5 21.2 15.1
         | SM XGB  | 4.4 4.2 0.7 7.2 4.2 0.5 | 2.2 1.2 0.7 5.5 1.0 0.6
         | DR Reg  | 1×10^9 41.0 3×10^8 56.8 29.3 24.2 | 4×10^9 27.9 3×10^8 59.5 21.2 15.1
         | DR XGB  | 4.2 4.2 0.6 7.1 4.2 0.4 | 2.4 1.3 0.6 5.3 1.1 0.5
SC2 2000 | Naive   | 12.4 10.4 8.0 18.5 11.8 5.8 | 12.4 10.4 8.0 18.5 11.8 5.8
         | Cal Reg | 1.5 1.0 6.0 6.7 4.0 0.4 | 0.9 3.2 6.1 5.0 1.1 0.6
         | Cal XGB | 1.7 1.1 6.0 7.2 4.2 0.4 | 1.1 3.4 6.0 5.6 1.0 0.5
         | PSA Reg | 5.6 4.6 3.2 10.2 5.2 2.2 | 1.8 1.6 1.3 3.2 1.7 0.8
         | PSA XGB | 7.3 4.8 4.2 11.5 5.2 3.5 | 15.8 11.3 6.6 23.6 12.9 5.3
         | SM Reg  | 4.3 4.0 0.7 6.7 4.0 0.4 | 2.1 1.1 0.6 5.0 1.1 0.6
         | SM XGB  | 4.2 4.2 0.7 7.4 4.2 0.5 | 2.4 1.3 0.7 5.8 1.0 0.5
         | DR Reg  | 4.3 4.0 0.6 6.7 4.0 0.4 | 2.2 1.2 0.6 5.0 1.1 0.6
         | DR XGB  | 4.1 4.3 0.7 7.2 4.2 0.4 | 2.5 1.3 0.7 5.6 1.0 0.5
SC2 4000 | Naive   | 11.9 10.2 7.8 18.0 11.6 5.6 | 11.9 10.2 7.8 18.0 11.6 5.6
         | Cal Reg | 1.5 0.7 5.8 6.6 4.1 0.4 | 0.8 3.1 5.9 5.0 1.1 0.5
         | Cal XGB | 1.6 0.8 5.8 7.1 4.2 0.4 | 1.0 3.2 5.8 5.4 1.1 0.5
         | PSA Reg | 5.5 4.5 3.0 10.1 5.0 2.0 | 1.2 1.1 0.9 2.3 1.2 0.6
         | PSA XGB | 5.0 3.9 4.1 8.0 3.9 3.4 | 14.3 11.0 6.8 21.0 12.4 5.4
         | SM Reg  | 4.3 4.0 0.6 6.6 4.0 0.4 | 2.1 1.1 0.6 4.9 1.1 0.5
         | SM XGB  | 4.2 4.2 0.7 7.4 4.1 0.5 | 2.3 1.3 0.7 5.7 1.0 0.5
         | DR Reg  | 4.3 4.0 0.5 6.6 4.1 0.4 | 2.1 1.2 0.6 5.0 1.1 0.5
         | DR XGB  | 4.1 4.2 0.6 7.1 4.2 0.4 | 2.4 1.3 0.6 5.5 1.1 0.5
SC2 6000 | Naive   | 11.5 9.9 7.5 17.4 11.2 5.4 | 11.5 9.9 7.5 17.4 11.2 5.4
         | Cal Reg | 1.5 0.6 5.7 6.6 4.1 0.4 | 0.8 2.9 5.7 4.9 1.1 0.5
         | Cal XGB | 1.7 0.6 5.6 7.1 4.2 0.4 | 1.0 3.0 5.7 5.3 1.1 0.5
         | PSA Reg | 5.4 4.4 2.9 10.0 5.0 2.0 | 1.1 0.9 0.8 2.1 1.0 0.6
         | PSA XGB | 3.3 3.1 4.1 5.4 2.9 3.5 | 12.7 10.5 6.7 18.6 11.6 5.3
         | SM Reg  | 4.3 4.0 0.6 6.6 4.1 0.4 | 2.1 1.1 0.6 4.9 1.1 0.5
         | SM XGB  | 4.3 4.2 0.6 7.4 4.1 0.4 | 2.2 1.1 0.6 5.7 1.0 0.5
         | DR Reg  | 4.3 4.0 0.5 6.6 4.1 0.4 | 2.1 1.2 0.5 4.9 1.1 0.5
         | DR XGB  | 4.2 4.2 0.6 7.0 4.2 0.4 | 2.3 1.2 0.6 5.3 1.1 0.5
SC3 2000 | Naive   | 10.1 9.1 7.1 14.5 10.4 5.1 | 10.1 9.1 7.1 14.5 10.4 5.1
         | Cal Reg | 2.2 1.1 5.4 7.4 4.3 0.4 | 1.5 2.8 5.5 5.6 1.1 0.6
         | Cal XGB | 2.1 1.1 5.4 7.1 4.3 0.4 | 1.4 2.9 5.5 5.6 1.1 0.5
         | PSA Reg | 6.1 4.3 3.3 10.3 5.0 2.3 | 6.1 4.4 3.2 10.5 5.0 2.2
         | PSA XGB | 11.7 7.1 6.5 16.3 7.2 4.9 | 14.2 10.1 7.4 19.6 11.3 5.6
         | SM Reg  | 4.5 4.5 0.6 7.4 4.3 0.4 | 2.5 1.2 0.7 5.6 1.1 0.6
         | SM XGB  | 4.2 4.2 0.8 7.3 4.2 0.5 | 2.4 1.2 0.8 5.7 1.0 0.6
         | DR Reg  | 4.5 4.5 0.6 7.4 4.3 0.4 | 2.6 1.2 0.6 5.6 1.1 0.6
         | DR XGB  | 4.2 4.2 0.7 7.2 4.2 0.4 | 2.5 1.2 0.7 5.6 1.1 0.5
SC3 4000 | Naive   | 10.1 9.0 6.9 14.4 10.3 4.9 | 10.1 9.0 6.9 14.4 10.3 4.9
         | Cal Reg | 1.9 0.8 5.3 6.8 4.2 0.4 | 1.2 2.7 5.3 5.0 1.1 0.5
         | Cal XGB | 1.9 0.9 5.3 6.8 4.2 0.4 | 1.3 2.7 5.3 5.3 1.1 0.5
         | PSA Reg | 3.9 3.0 2.2 6.8 3.3 1.5 | 4.6 3.6 2.1 7.9 4.0 1.5
         | PSA XGB | 10.7 6.7 6.4 15.0 6.5 5.1 | 13.3 10.2 7.5 17.6 11.4 5.8
         | SM Reg  | 4.3 4.2 0.6 6.8 4.1 0.4 | 2.2 1.1 0.6 5.0 1.1 0.5
         | SM XGB  | 4.2 4.2 0.7 7.1 4.2 0.4 | 2.4 1.2 0.7 5.5 1.0 0.6
         | DR Reg  | 4.4 4.2 0.5 6.8 4.2 0.4 | 2.3 1.2 0.6 5.0 1.1 0.5
         | DR XGB  | 4.2 4.2 0.6 7.0 4.2 0.4 | 2.5 1.2 0.6 5.4 1.1 0.5
SC3 6000 | Naive   | 10.0 9.0 6.8 14.5 10.2 4.9 | 10.0 9.0 6.8 14.5 10.2 4.9
         | Cal Reg | 1.8 0.7 5.2 6.6 4.1 0.4 | 1.1 2.6 5.3 4.9 1.1 0.5
         | Cal XGB | 1.9 0.8 5.2 6.8 4.2 0.4 | 1.3 2.7 5.2 5.2 1.1 0.5
         | PSA Reg | 3.1 2.4 1.7 5.5 2.6 1.2 | 4.0 3.1 1.7 7.0 3.5 1.2
         | PSA XGB | 10.2 6.9 6.9 14.2 6.6 5.5 | 12.9 10.5 7.9 16.7 11.6 6.2
         | SM Reg  | 4.3 4.1 0.5 6.6 4.1 0.4 | 2.1 1.2 0.6 4.9 1.1 0.5
         | SM XGB  | 4.2 4.2 0.7 7.0 4.2 0.4 | 2.4 1.1 0.7 5.4 1.1 0.5
         | DR Reg  | 4.3 4.1 0.5 6.6 4.1 0.4 | 2.1 1.2 0.5 4.9 1.1 0.5
         | DR XGB  | 4.2 4.2 0.6 7.0 4.2 0.4 | 2.5 1.2 0.6 5.3 1.1 0.5
Table 3. Bias (%) including hyperparameter optimization when overfitting. The columns are Q0.25, Q0.5, Q0.75, Fy(Q0.25), Fy(Q0.5), Fy(Q0.75), under the proportional reference sample.

SC2 2000 | Naive         | 12.1 10.2 7.8 −18.2 −11.6 −5.6
         | Cal Reg       | −0.7 3.1 5.9 5.0 1.1 −0.6
         | Cal XGB       | −0.8 3.2 5.9 5.5 1.0 −0.4
         | PSA Reg       | −0.2 0.1 −0.2 0.2 0.1 0.2
         | PSA XGB       | 15.2 10.8 5.9 −22.5 −12.3 −4.7
         | PSA XGB (opt) | 2.4 2.7 2.1 −4.1 −2.8 −1.4
         | SM Reg        | −2.1 −1.1 0.6 5.0 1.0 −0.6
         | SM XGB        | −2.3 −1.2 0.6 5.6 0.9 −0.4
         | DR Reg        | −2.2 −1.2 0.6 5.0 1.1 −0.6
         | DR XGB        | −2.4 −1.2 0.6 5.5 1.0 −0.4
Table 4. RMSE (%) including hyperparameter optimization when overfitting. The columns are Q0.25, Q0.5, Q0.75, Fy(Q0.25), Fy(Q0.5), Fy(Q0.75), under the proportional reference sample.

SC2 2000 | Naive         | 12.4 10.4 8.0 18.5 11.8 5.8
         | Cal Reg       | 0.9 3.2 6.1 5.0 1.1 0.6
         | Cal XGB       | 1.1 3.4 6.0 5.6 1.0 0.5
         | PSA Reg       | 1.8 1.6 1.3 3.2 1.7 0.8
         | PSA XGB       | 15.8 11.3 6.6 23.6 12.9 5.3
         | PSA XGB (opt) | 3.0 3.2 2.6 5.2 3.4 1.8
         | SM Reg        | 2.1 1.1 0.6 5.0 1.1 0.6
         | SM XGB        | 2.4 1.3 0.7 5.8 1.0 0.5
         | DR Reg        | 2.2 1.2 0.6 5.0 1.1 0.6
         | DR XGB        | 2.5 1.3 0.7 5.6 1.0 0.5
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

Rueda, M.d.M.; Martínez-Puertas, S.; Castro-Martín, L. Methods to Counter Self-Selection Bias in Estimations of the Distribution Function and Quantiles. Mathematics 2022, 10, 4726. https://doi.org/10.3390/math10244726
