Estimating General Parameters from Non-Probability Surveys Using Propensity Score Adjustment

: This study introduces a general framework for inference on a general parameter using nonprobability survey data when a probability sample with auxiliary variables, common to both samples, is available. The proposed framework covers parameters from inequality measures and distribution function estimates, but the scope of the paper is broader. We develop a rigorous framework for general parameter estimation by solving survey-weighted estimating equations which involve propensity score estimation for the units in the nonprobability sample. This development includes the expression of the variance estimator, as well as some alternatives which are discussed under the proposed framework. We carried out a simulation study using data from a real-world survey, in which the application of the estimation methods showed the effectiveness of the proposed design-based inference on several general parameters. The results show a large decrease in bias and MSE for all response patterns and both PSA methods, which demonstrates the robustness of the adjustment method. The size of the reduction differs across them: PSA with the reference sample drawn under a stratified design, s_r1, provided a smaller RMSE when the convenience sample was drawn using NP1, while PSA with the reference sample drawn with probabilities proportional to income, s_r2, provided much less biased estimates overall when the selection mechanism followed NP2, NP3 or NP4.


Introduction
Nonprobability samples are increasingly common in the empirical sciences. The rise of online and smartphone surveys, along with the decline of response rates in traditional survey modes, has contributed to the popularization of volunteer surveys in which sampling is non-probabilistic. Moreover, the development of Big Data involves the analysis of large-scale datasets whose acquisition is conditioned by data availability rather than by a probabilistic selection; such datasets can therefore be considered large nonprobability samples of a population [1].
The lack of a probability sampling scheme can be responsible for selection bias. Following the description in [1,2], we can distinguish the target population, U_T, the subpopulation that a given selection method can potentially cover, U_pc, and the fraction of the subpopulation that is finally covered, U_fc, whose individuals might participate in the survey. Selection bias occurs when the characteristics of the individuals in U_fc differ significantly from those in U_T in a way that could affect final estimates. Typically, differences between individuals in U_T and individuals in U_pc are caused by a lack of coverage induced by the survey administration mode (for example, an online questionnaire cannot be administered to the population without internet access), while differences between U_pc and U_fc are caused by the variability in the propensities to participate across socio-demographic groups (for example, an online questionnaire accessible on a thematic website might only be filled in by visitors of the website, who may have specific interests that could influence the results).
Following the rise of nonprobability samples, a class of methods for reducing selection bias has been proposed in recent decades. These methods were developed from different perspectives according to the availability of auxiliary information. We can mention calibration, Propensity Score Adjustment (PSA), Statistical Matching and superpopulation modelling as the most relevant techniques for mitigating the selection bias produced by coverage and self-selection errors.
Calibration weighting was originally developed by [3] as a method to correct representation issues in samples with coverage or non-response errors. It only requires a vector of auxiliary variables, available for each individual of the sample, and the population totals of those variables. Calibration can remove selection bias in nonprobability samples if the selection mechanism is ignorable [4], and although it was originally developed for parametric estimation, further work [5][6][7] has extended calibration to the estimation of the distribution function, quantiles and poverty measures.
Propensity Score Adjustment (PSA) and Statistical Matching require, apart from the nonprobability sample, a probability sample to perform the adjustments. PSA was originally developed for balancing groups in non-randomized clinical trials [8] and was adapted for non-response adjustments shortly after [9,10]. The application of PSA for removing bias in nonprobability surveys was theoretically developed in [11,12]. Statistical Matching was first proposed in [13] and extended in [14] for non-response adjustments. The difference between the two methods lies in the sample used in the estimators: PSA estimates the propensity of each individual of the nonprobability sample to participate in the survey and then uses this propensity to construct the weights of the estimators, while Statistical Matching fits a prediction model using data from the nonprobability sample, applies it to the probability sample to predict the values of the target variable y, and uses the predictions in the parametric estimators. To the best of our knowledge, PSA and Statistical Matching have not been developed for nonparametric estimation.
Superpopulation modelling requires data from the complete census of the target population for the covariates used in the adjustment, which is assumed to be a realization (sample) of a superpopulation where the (unknown) target values follow a model. It is based on the works by [15,16], where the main idea is to fit a regression model on the target variable with data from the nonprobability sample, and use the model to predict the values of the target variable for each individual in the population. The prediction can be used for estimation using a model-based approach or some alternative versions such as model-assisted and model-calibrated. LASSO models [17] and Machine Learning predictors [18,19] have been studied as alternatives to ordinary least squares regression in superpopulation modelling.
Societal interest in poverty and inequality has increased in recent decades, given the successive economic cycles and crises. In this context, official poverty rates and the percentage of people in poverty (or under a poverty threshold) are important measures of a country's wealth. The common characteristic of many poverty measures is their complexity. The survey sampling literature usually focuses on the estimation of linear parameters. However, the variable of interest in poverty studies is usually a measure of wages or income, for which the distribution function becomes a relevant tool because it is required to calculate the proportion of people with low income, the poverty gap and other measures. Estimators for the cumulative distribution function, quantiles [20,21] and poverty measures [22] can be found in the literature on probability samples, but there is hardly any work on the estimation of these parameters when the samples are obtained from volunteers.
In this paper, we aim to develop a framework for statistical inference on a general parameter with nonprobability survey samples when a reference probability sample is available. After introducing the problem of mean estimation for volunteer samples in Section 2, we consider in Section 3 the estimation of a general parameter through general estimating equations. Section 4 presents a new estimator for a general parameter that uses PSA to estimate the propensity score of each individual in the survey-weighted estimating equation, together with the major theoretical results. Results from simulation studies are reported in Section 5, and Section 6 presents the concluding remarks.

Approaches to Estimation of a Mean for Volunteer Online Samples
Let U_T be the target population with N elements and s_v a nonprobability sample drawn from a subset U_v of U_T, with size n_v ≤ N. Let y be the target variable of the survey, whose mean in the population U_T is denoted by $\bar{Y}$. The sample estimate of $\bar{Y}$ is computed with the Horvitz-Thompson estimator
$$\hat{\bar{Y}} = \frac{1}{N} \sum_{i \in s_v} w_i y_i,$$
where w is a vector of weights that accounts for the lack of representativity of s_v caused by selection bias. If no auxiliary information is given, the weight is the same for every unit, w_i = N/n_v, which requires assuming that the sample was drawn under a simple random sampling scheme. This is a naïve assumption given that s_v is not probabilistic; that is, the probability of being in the sample is unknown and/or null for any of the units in U_T. Let x be a matrix of covariates measured in s_v along with y. If the population totals of the covariates, X, are available, the mean can be estimated using a vector of weights obtained by calibration, w^CAL. The calibration weights minimize the distance to the original weights d_i,
$$\sum_{i \in s_v} G(w_i, d_i),$$
subject to the calibration equations
$$\sum_{i \in s_v} w_i \mathbf{x}_i = \mathbf{X}.$$
Some choices for the distance G(·,·) were listed in [3], along with the resulting estimators. Calibration weighting for the treatment of selection bias was studied in [4], where post-stratification, a special case of calibration [23], was used to mitigate the bias caused by different selection mechanisms, showing its efficacy when the selection of the units of s_v is Missing At Random (MAR).
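As an illustration of the calibration step described above, the following is a minimal sketch of linear-distance (GREG) calibration; the function name and the toy totals are ours, not from the paper, and production work would normally rely on a dedicated survey package.

```python
import numpy as np

def calibrate_linear(d, x, totals):
    """Linear-distance (GREG) calibration: adjust the design weights d so
    that the weighted sample totals of the covariates x match the known
    population totals, staying close to the original weights."""
    x = np.atleast_2d(x)                    # n x p covariate matrix
    M = (d[:, None] * x).T @ x              # p x p matrix: sum_i d_i x_i x_i^T
    lam = np.linalg.solve(M, totals - x.T @ d)
    return d * (1.0 + x @ lam)              # calibrated weights

# toy usage: force the weighted count and the weighted total of x1
# to match assumed population totals N = 200 and X_1 = 10
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x = np.column_stack([np.ones(50), x1])      # calibrate on N and the total of x1
d = np.full(50, 4.0)                        # naive weights N/n_v with N = 200
w = calibrate_linear(d, x, totals=np.array([200.0, 10.0]))
```

By construction, the calibrated weights reproduce the population totals exactly, which is the defining property of the calibration equations above.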
If a reference sample, s_r, drawn from the population U_T is available and a number of covariates x have been measured both in s_v and s_r, two procedures can be applied to reduce the selection bias present in s_v. Let I_v be the indicator variable of an element being in s_v; that is, I_vi = 1 if i ∈ s_v and I_vi = 0 otherwise. Propensity Score Adjustment (PSA) assumes that each element of U_T has a probability (propensity) of being selected for s_v,
$$\pi_{vi} = \Pr(I_{vi} = 1 \mid \mathbf{x}_i, y_i),$$
where π_vi is the propensity of the i-th individual to participate in s_v. The random mechanism behind this probability is the selection mechanism that governs the nonprobability sample. If the selection is Missing Completely At Random (MCAR), then π_vi = Pr(I_vi = 1) and the selection bias is null, while if the selection is MAR then π_vi = Pr(I_vi = 1 | x_i) and the selection mechanism is considered ignorable. This does not mean that the selection bias should be ignored, but rather that it can be treated with the right techniques.
In PSA, we consider the situation where the true propensities are not known and therefore have to be estimated; we do so by combining s_v and s_r into a single sample. The probability that I_v = 1 is then estimated using a prediction model, traditionally logistic regression:
$$\hat{\pi}_{vi} = \frac{\exp(\mathbf{x}_i^T \hat{\lambda})}{1 + \exp(\mathbf{x}_i^T \hat{\lambda})}.$$
Alternative models, such as non-linear regression and Machine Learning classification algorithms, have been studied in the literature as substitutes for logistic regression (see [24] for a review). The resulting propensities can be used to build new weights, w^PSA, in several ways:
• A simple inverse probability weighting, w_i^{PSA1} = 1/π̂_vi, is proposed by [25], which is a similar approach to the formula used in [26], w^PSA2.
• Alternatively, the individuals of the combined sample (s_v ∪ s_r) can be grouped into g equally-sized strata of similar propensity scores, from which an average propensity is calculated for each group. Let π̄_g be the mean of the propensities in the g-th stratum. [2] use these means to calculate the new weights as their inverses, w_i = 1/π̄_{g_i}, where g_i denotes the stratum to which the i-th individual of s_v belongs.

• A similar approach can be found in [12], but instead of using the means, a correction factor is calculated for each stratum from s_rg and s_vg, the individuals from the probability and the nonprobability sample, respectively, that belong to the g-th stratum, and w, the vector of design weights of the reference sample. The final weights are obtained by multiplying the original weights by the correction factor.
PSA has been proven to successfully remove selection bias when prognostic covariates are chosen [11] and further adjustments, such as calibration, are applied in the estimation [2,12,27]. A recent paper [28] shows a real application of PSA to web panel surveys where the reductions in bias, although present, were not large enough to consider the estimates unbiased.
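The PSA weighting schemes above can be sketched as follows. This is an illustrative implementation, assuming a logistic model fitted on the combined sample; the function name and the toy data are ours, and the stratified variant groups the volunteer sample only, a simplification of the combined-sample stratification described in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def psa_weights(x_v, x_r, n_strata=None):
    """Fit a logistic model for membership in the volunteer sample on the
    pooled sample (volunteers labelled 1, reference units 0) and return
    inverse-propensity weights w_i = 1/pi_hat_i for s_v. If n_strata is
    given, each propensity is replaced by the mean propensity of its
    equally sized stratum before inverting."""
    X = np.vstack([x_v, x_r])
    z = np.concatenate([np.ones(len(x_v)), np.zeros(len(x_r))])
    pi = LogisticRegression().fit(X, z).predict_proba(x_v)[:, 1]
    if n_strata is not None:
        # assign volunteers to equally sized strata of similar propensity
        order = np.argsort(pi)
        g = np.empty(len(pi), dtype=int)
        g[order] = np.arange(len(pi)) * n_strata // len(pi)
        pi = np.array([pi[g == k].mean() for k in range(n_strata)])[g]
    return 1.0 / pi

# toy example: volunteers over-represent individuals with large covariates
rng = np.random.default_rng(0)
x_v = rng.normal(loc=1.0, size=(300, 2))   # volunteer sample covariates
x_r = rng.normal(loc=0.0, size=(300, 2))   # reference sample covariates
w = psa_weights(x_v, x_r, n_strata=5)
```

The stratified variant trades some bias correction for stability: averaging propensities within strata bounds the influence of extreme estimated propensities.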
As an alternative to PSA, Statistical Matching is another method to mitigate selection bias when a reference sample is available. To this end, a prediction model for y with x as the explanatory variables is built using data from s_v. The model is subsequently applied to the reference sample, and the estimates are obtained from the predicted values ŷ in s_r:
$$\hat{\bar{Y}}_{SM} = \frac{1}{N} \sum_{i \in s_r} d_i \hat{y}_i,$$
where d_i are the design weights of s_r. The choice of prediction model has been studied in the literature; the usual method is linear regression, but other approaches such as donor imputation [13] or Machine Learning algorithms [19,29] have been listed as alternatives. Under certain conditions, Statistical Matching can reduce bias and mean square error to a greater extent than PSA [29].
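A minimal Statistical Matching sketch, with ordinary least squares as the prediction model and a Hajek-type weighted mean over the reference sample; the function name and the toy data are illustrative, not from the paper.

```python
import numpy as np

def matching_mean(x_v, y_v, x_r, d_r):
    """Statistical Matching sketch: fit OLS for y on the volunteer sample,
    predict y for every unit of the reference sample, and average the
    predictions with the reference sample's design weights d_r."""
    Xv = np.column_stack([np.ones(len(x_v)), x_v])
    beta, *_ = np.linalg.lstsq(Xv, y_v, rcond=None)     # fitted on s_v
    y_hat = np.column_stack([np.ones(len(x_r)), x_r]) @ beta
    return float(np.sum(d_r * y_hat) / np.sum(d_r))     # Hajek-type mean

# if y is exactly linear in x, the estimate is the weighted mean of the
# predictions 2 + 3*x_r (here (5 + 8)/2 = 6.5)
x_v = np.array([0.0, 1.0, 2.0, 3.0])
y_v = 2.0 + 3.0 * x_v
x_r = np.array([1.0, 2.0])
est = matching_mean(x_v, y_v, x_r, d_r=np.array([1.0, 1.0]))
```

Note that only the covariates, never y itself, are needed on the reference sample, which is what distinguishes matching from PSA.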
When a complete census of the entire target population is available, with information on the covariates present in s v , superpopulation modelling can be applied to remove selection bias [19]. In this paper we consider the case when auxiliary information is available only from a reference probability survey.

Estimation of a General Parameter by Using PSA
Let y be the variable of interest in a survey and y_i the value of the i-th unit for that variable, i = 1, . . . , N. Suppose we want to estimate a finite population parameter θ_N of dimension p ≥ 1 defined as the solution of the census estimating equations
$$U(\theta_N) = \frac{1}{N} \sum_{i \in U_T} u_i(y_i, \theta_N) = 0,$$
where u_i(y_i, θ_N) is a function of θ_N. Examples of unidimensional parameters of interest include the population mean, with u_i(y_i, θ) = y_i − θ, and the distribution function at a point t, with u_i(y_i, θ) = 1(y_i ≤ t) − θ. We denote by θ̂ the solution of the survey-weighted equation
$$\hat{U}(\theta) = \frac{1}{N} \sum_{i \in s_v} \frac{u_i(y_i, \theta)}{\pi_{vi}} = 0.$$
It is clear that E_r(Û(θ_N)) = U(θ_N), where r stands for the model of the selection mechanism for the sample s_v, that is, the true model that generates the propensity scores. If the π_vi are known, we can obtain a consistent estimator of θ_N by solving the equation above. To study the properties of this estimator we adopt a quasi-probability or pseudo-design-based approach ([19]) and treat the volunteer sample as a realization of Poisson sampling with probabilities π_vi. For any sampling design that satisfies certain regularity conditions, the solution of Û(θ) = 0 is a consistent estimator of the parameter θ_N (see [30]). Poisson sampling satisfies these conditions, so the consistency of the estimator follows immediately from the result of [30]. The asymptotic normality of the estimator is demonstrated by [31], who also obtains its asymptotic variance. From that expression, and taking into account that in Poisson sampling the draws are independent, so that the second-order probabilities are given by π_vij = π_vi π_vj, we can obtain the variance of θ̂:
$$V(\hat{\theta}) \approx [J(\theta_N)]^{-1} \, \mathrm{var}(\hat{U}(\theta_N)) \, [J(\theta_N)^{-1}]^T,$$
where J(θ) = (1/N) ∑_{U_T} ∂u_i(y_i, θ)/∂θ and
$$\mathrm{var}(\hat{U}(\theta)) = \frac{1}{N^2} \sum_{i \in U_T} \frac{1 - \pi_{vi}}{\pi_{vi}} \, u_i(y_i, \theta)\, u_i(y_i, \theta)^T.$$
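With known propensities, solving the survey-weighted estimating equation for a scalar parameter reduces to a one-dimensional root-finding problem, as in this sketch (the function name and the toy data are ours; for u(y, θ) = y − θ with constant propensities the solution is simply the sample mean, which makes the example easy to check).

```python
import numpy as np
from scipy.optimize import brentq

def solve_weighted_ee(u, y, pi, lo, hi):
    """Solve sum_{i in s_v} u(y_i, theta)/pi_i = 0 for a scalar theta by
    root finding, given known propensities pi_i and a bracket [lo, hi]."""
    return brentq(lambda theta: float(np.sum(u(y, theta) / pi)), lo, hi)

# example: u(y, theta) = y - theta gives the Hajek-type mean; with
# constant propensities it coincides with the ordinary sample mean
rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=500)
pi = np.full(500, 0.1)
theta_hat = solve_weighted_ee(lambda y, t: y - t, y, pi, y.min(), y.max())
```

The same solver handles the distribution function and quantile cases by swapping in the corresponding u, e.g. u(y, θ) = 1(y ≤ θ) − α for the α-quantile.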

Estimation of a General Parameter with Estimated Propensities
The propensity scores π_vi are not known and cannot be estimated using the nonprobability sample s_v alone, so additional information must be included. Let s_r be a reference probability sample of size n_r, selected from U_T under a sampling design (s_d, p_d) in which the first-order inclusion probabilities, π_pi = ∑_{s_r ∋ i} p_d(s_r), i = 1, . . . , n_r, are known and non-null. The covariates x of the propensity model have been measured both in s_v and s_r, while the variable of interest y is only available for the individuals in s_v.
Suppose that the propensity scores can be modelled parametrically as
$$\pi_{vi} = m(\lambda_o, \mathbf{x}_i)$$
for some known function m(·) with continuous second derivatives with respect to an unknown parameter λ_o. We estimate the propensity scores using data from both the volunteer and the probability sample. The maximum likelihood estimator (MLE) of π_vi is m(λ̂, x_i), where λ̂ maximizes the log-likelihood function
$$l(\lambda) = \sum_{i \in s_v} \log\left( \frac{m(\lambda, \mathbf{x}_i)}{1 - m(\lambda, \mathbf{x}_i)} \right) + \sum_{i \in U_T} \log\left(1 - m(\lambda, \mathbf{x}_i)\right).$$
As is usual in survey sampling, since not all units of the population have been sampled, we consider the pseudo-likelihood, in which the population sum is replaced by its design-weighted estimate from s_r:
$$l_{pl}(\lambda) = \sum_{i \in s_v} \log\left( \frac{m(\lambda, \mathbf{x}_i)}{1 - m(\lambda, \mathbf{x}_i)} \right) + \sum_{i \in s_r} d_i \log\left(1 - m(\lambda, \mathbf{x}_i)\right),$$
where d_i = 1/π_pi are the design weights of the reference sample. We thus propose a two-step procedure:
Step 1: Calculate λ̂_pl by solving the score equations ∂l_pl(λ)/∂λ = 0.
Step 2: Calculate θ̂_v as the solution of the estimating equation
$$\hat{U}_V(\theta) = \frac{1}{N} \sum_{i \in s_v} \frac{u_i(y_i, \theta)}{m(\hat{\lambda}_{pl}, \mathbf{x}_i)} = 0.$$
We consider the following asymptotic framework for the theoretical development, equivalent to the framework in [32]. Let U_Tν be a sequence of finite populations of size N_ν. Each U_Tν has an associated nonprobability sample s_vν of size n_vν and an associated probability sample s_pν of size n_pν. We assume that the population size N_ν → ∞, the nonprobability sample size n_vν → ∞ and the probability sample size n_pν → ∞ as ν → ∞. For notational simplicity the index ν is suppressed in the rest of the paper. The properties of the estimator θ̂_v are developed under both the model for the propensity scores and the survey design for the probability sample.
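The two-step procedure can be sketched for the logistic case and the population mean. Under a logistic m, the pseudo-log-likelihood simplifies to l_pl(λ) = ∑_{s_v} x_i'λ − ∑_{s_r} d_i log(1 + e^{x_i'λ}); the function name and the synthetic data below are illustrative assumptions, not from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_mean(x_v, y_v, x_r, d_r):
    """Two-step procedure under a logistic propensity model:
    Step 1 maximizes the pseudo-log-likelihood
        l_pl(lam) = sum_{s_v} x_i'lam - sum_{s_r} d_i log(1 + exp(x_i'lam));
    Step 2 solves sum_{s_v} (y_i - theta)/pi_hat_i = 0 (Hajek form)."""
    Xv = np.column_stack([np.ones(len(x_v)), x_v])
    Xr = np.column_stack([np.ones(len(x_r)), x_r])
    # logaddexp(0, u) = log(1 + e^u), computed without overflow
    neg_lpl = lambda lam: -np.sum(Xv @ lam) + np.sum(d_r * np.logaddexp(0.0, Xr @ lam))
    lam = minimize(neg_lpl, np.zeros(Xv.shape[1])).x        # Step 1
    pi = 1.0 / (1.0 + np.exp(-(Xv @ lam)))                  # estimated propensities
    return float(np.sum(y_v / pi) / np.sum(1.0 / pi))       # Step 2

# synthetic data: volunteers skew towards large x, reference weights d = 25
rng = np.random.default_rng(2)
x_r = rng.normal(size=400)
x_v = rng.normal(loc=0.5, size=300)
y_v = 1.0 + x_v + rng.normal(scale=0.1, size=300)
est = two_step_mean(x_v, y_v, x_r, np.full(400, 25.0))
```

The pseudo-log-likelihood is concave in λ for the logistic model, so a generic quasi-Newton optimizer suffices for Step 1.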
We make the following assumptions:
• A.1. The estimating function u_i(y_i, θ, λ) is twice differentiable with respect to θ and λ.
• A.2. The propensities and the sampling design ensure that Û_V(θ) − U(θ) = O_p(n^{−1/2}) for any θ ∈ Θ.
• A.3. The propensities and the sampling design ensure that Û_V(θ) is asymptotically normal with mean U(θ) and variance entries of order O(n^{−1}) for any fixed θ ∈ Θ.

Theorem 1. Under conditions A.1, A.2 and A.3, θ̂_v is a consistent and asymptotically normal estimator of θ.
Proof. Under the assumed conditions, Û_V(θ) = U(θ) + O_p(n^{−1/2}); thus, by the mean value theorem, θ̂_v has the same asymptotic behaviour as θ̂, which is consistent for θ and asymptotically normally distributed (see Section 3).
Variance estimation for θ̂_v can be handled by combining the two estimating equations, the score equation for λ and Û_v, into a single system, as is done in [33].
Let ψ = (θ, λ), let ψ̂ = (θ̂_v, λ̂_pl) and let Ψ(ψ) denote the combined system of estimating functions. Since θ̂_v and λ̂ are consistent estimators of their respective parameters, we can write ψ̂ = ψ + o_p(1), and a Taylor series expansion gives
$$\hat{\psi} - \psi = -\left[ \frac{\partial \Psi(\psi)}{\partial \psi} \right]^{-1} \Psi(\psi) + o_p(n^{-1/2}).$$
Thus the asymptotic variance of ψ̂ is given by
$$V(\hat{\psi}) = \left[ \frac{\partial \Psi(\psi)}{\partial \psi} \right]^{-1} V(\Psi(\psi)) \left[ \frac{\partial \Psi(\psi)}{\partial \psi} \right]^{-T}.$$
Taking into account the two random mechanisms and conditioning appropriately, we have
$$V(\hat{U}(\theta, \lambda)) = V_p E_r(\hat{U}(\theta, \lambda)) + E_p V_r(\hat{U}(\theta, \lambda)),$$
where r stands for the model of the selection mechanism for the sample s_v and p refers to the probability sampling design for s_p.
The asymptotic variance of ψ̂ depends on the probability of selecting the sample s_p under the given sampling design and on the selection mechanism described by the propensity model. Plug-in estimators can be used to construct variance estimators for all the required components, but this is not a simple issue.
In practice, as described in [7], jackknife [34] and bootstrap [35] techniques may be more advantageous for variance estimation of nonlinear parameters because of their wide applicability to different cases and conditions. Direct application of bootstrap methods to estimate the variance-covariance matrix of ψ̂ involves solving the equation Û(θ, λ) = 0 repeatedly for each bootstrap sample. A Multiplier Bootstrap with Estimating Functions was proposed by [36].
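A naive resampling bootstrap for a generic estimator can be sketched as follows; the function name is ours, and a full treatment of the two-sample PSA setting would resample s_v and s_r separately and refit the propensity model in each replicate.

```python
import numpy as np

def bootstrap_var(estimator, sample, B=500, seed=0):
    """Naive bootstrap variance: resample the rows of `sample` with
    replacement B times, recompute the estimator on each replicate and
    return the empirical variance of the replicates. Paired variables
    (e.g. y and x for the same unit) should be columns of `sample` so
    that units are resampled jointly."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    reps = [estimator(sample[rng.integers(0, n, n)]) for _ in range(B)]
    return float(np.var(reps, ddof=1))

# example: bootstrap variance of a sample mean, roughly s^2/n
rng = np.random.default_rng(3)
y = rng.normal(size=100)
v = bootstrap_var(np.mean, y)
```

For nonlinear parameters such as the Gini coefficient, the same loop applies with `estimator` replaced by the corresponding estimating-equation solver, which is exactly the repeated solving mentioned above.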

Data
Data for the simulation study come from a wave of the Spanish Living Conditions Survey collected between 2011 and 2012 [37], which contains an annual thematic module that, in 2012, was dedicated to household conditions. The survey follows a two-stage cluster sampling design, where the primary units are the households and the secondary units are their members. In 2012, the final sample included 33,573 individuals. For this study, the dataset was filtered to rule out individuals and variables with high quantities of missing data. After this procedure, the dataset employed as the pseudopopulation of the study had a size of N = 28,210 individuals and p = 60 available variables.
From this pseudopopulation, two probability samples of size n_r were drawn according to the following sampling strategies: • The first sample, s_r1, was drawn with stratified cluster sampling, where the strata were defined by the Autonomous Communities (NUTS2 regions) and the clusters were the households, which were drawn with probabilities proportional to the household size. The number of households to be selected, m, was obtained by dividing n_r by the mean household size in order to reach the target size of n_r = 2000, resulting in m = 902 households. The final sample size of s_r1 was n_r1 = 2003.

• The second sample, s_r2, was drawn with unequal probability sampling, where the probabilities were proportional to the minimum income that the individual's household needs to make ends meet (variable HS130 in [37]).
The nonprobability sample, s_v, was drawn with unequal probability sampling from the full pseudopopulation, where the probability of selection of the i-th individual, p_i, was a function of the indicator x_1i, with x_1i = 1 when the i-th individual has a computer at home and x_1i = 0 otherwise. The rationale behind this sampling procedure is to mimic the self-selection mechanisms that take place in real nonprobability surveys.
We considered three sample sizes, n_v = 2000, 4000, 6000. For each procedure and sample size, 1000 simulation runs were performed, drawing a new sample in each run.

Simulation
In each simulation, the parameters to be estimated were the following:

• The Gini coefficient [38], which measures income inequality.
• The proportion of individuals with a disposable income below the at-risk-of-poverty threshold L. This measure is also referred to as the poverty incidence, poverty proportion, poverty risk or Head Count Index (HCI) [39], and is estimated as
$$\hat{H} = \frac{\sum_{i \in s_v} w_i \, 1(y_i < L)}{\sum_{i \in s_v} w_i}.$$
• The interquartile range, estimated as $\hat{Q}_{0.75} - \hat{Q}_{0.25}$.
• The interdecile range, estimated as $\hat{Q}_{0.9} - \hat{Q}_{0.1}$, where $\hat{Q}_{\alpha}$ denotes the α-quantile estimated from the weighted sample.
Every parameter was estimated with and without applying PSA so that we could evaluate its performance. In order to estimate the propensities, a logistic regression model was chosen. 1000 simulations were executed for each scenario. The resulting mean bias, standard deviation and Root Mean Square Error were expressed in relative terms to make them comparable across different scenarios:
$$\mathrm{RBias} = \frac{\bar{\hat{\theta}} - \theta}{\theta} \times 100, \qquad \mathrm{RSD} = \frac{\sqrt{\frac{1}{1000}\sum_{i=1}^{1000} \left(\hat{\theta}^{(i)} - \bar{\hat{\theta}}\right)^2}}{\theta} \times 100, \qquad \mathrm{RRMSE} = \frac{\sqrt{\frac{1}{1000}\sum_{i=1}^{1000} \left(\hat{\theta}^{(i)} - \theta\right)^2}}{\theta} \times 100,$$
with $\hat{\theta}^{(i)}$ the estimate in the i-th simulation and $\bar{\hat{\theta}}$ the mean of the 1000 estimates.
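The Monte Carlo performance measures can be computed with a few lines of code. This sketch uses the standard relative bias, standard deviation and RMSE definitions (all in %), consistent with the measures reported in the tables; the function name is ours.

```python
import numpy as np

def mc_relative_metrics(estimates, theta):
    """Relative mean bias, relative standard deviation and relative RMSE
    (all in %) of a set of Monte Carlo estimates of the true value theta."""
    est = np.asarray(estimates, dtype=float)
    rbias = 100.0 * (est.mean() - theta) / theta
    rsd = 100.0 * est.std(ddof=1) / theta
    rrmse = 100.0 * np.sqrt(np.mean((est - theta) ** 2)) / theta
    return rbias, rsd, rrmse

# two estimates symmetric around the true value: zero bias, RRMSE of 10%
rbias, rsd, rrmse = mc_relative_metrics([1.1, 0.9], theta=1.0)
```

Expressing the three measures relative to θ is what makes them comparable across parameters with very different scales, such as the Gini coefficient and the interdecile range.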

Results
The relative mean bias of the estimates can be observed in Tables 1-3. PSA reduces the bias in all situations, especially in the estimation of the HCI. PSA using the reference sample drawn with probabilities proportional to income, s_r2, provided much less biased estimates overall. The relative standard deviation of the estimates can be observed in Tables 4-6. The standard deviation remained stable across the estimates of the Gini coefficient, IQR and IDR, with even small gains for the latter when using the reference sample drawn with probabilities proportional to the minimum income to make ends meet, s_r2, but it increased after applying PSA in the estimation of the HCI. The relative Root Mean Square Error of the estimates can be observed in Tables 7-9. As a result of the stability of the standard deviation and the reduction in bias, the RMSE of the estimates of the four parameters follows a pattern similar to that observed for the bias. Although the RMSE is reduced after applying PSA in all cases, PSA was more efficient when the reference sample was drawn with probabilities proportional to the minimum income to make ends meet, s_r2. PSA performance can be deeply affected by the selection mechanism, which may lead to model misspecification in the propensity estimation. To test the limitations and robustness of the proposed approach, we repeated the simulation with different non-response patterns. The selection procedures can be described as follows: NP1 is Simple Random Sampling Without Replacement (SRSWOR) from the population fraction of individuals with a computer at home, U_v, while NP2, NP3 and NP4 are unequal probability selection schemes in which the probability of selection of the i-th individual is a function of the covariates. Procedure NP1 is a typical case of coverage error (which is a type of selection bias itself [1]).
The third scheme, NP3, represents a cubic relationship between age and the probability of selection, with young people having the highest probabilities and the probability decreasing as age increases. The last scheme, NP4, has two components: one dichotomous and the other cosine-shaped. Tables 10 and 11 show the bias and relative MSE results for the HCI parameter, for which the selection bias of the unweighted estimator is large. The results show a large decrease in bias and MSE for all response patterns and both PSA methods, which demonstrates the robustness of the adjustment method. The size of the reduction differs across them: PSA with the reference sample drawn under a stratified design, s_r1, provided a smaller RMSE when the convenience sample was drawn using NP1, while PSA with the reference sample drawn with probabilities proportional to income, s_r2, provided much less biased estimates overall when the selection mechanism followed NP2, NP3 or NP4.

Conclusions
Technological development has made large amounts of inexpensive data (commonly known as Big Data) available for researchers to use for inference. New survey administration methods have also favoured the rise of data from nonprobability samples. Inferences from Big Data and nonprobability surveys suffer from important sources of error [4,24,28]. Given the characteristics of these data collection procedures, selection bias is particularly relevant.
Despite the growing interest in nonprobability data (whether coming from Big Data or from nonprobability surveys), there is still a lack of rigorous theory for making statistical inferences on general parameters through estimating equations. The current paper aims to fill this gap by establishing a theoretical framework for the estimation of general parameters with nonprobability samples.
The results of our simulation study provide strong evidence of the efficiency of methods based on estimating equations with estimated propensities. However, it must be noted that the efficiency depends on the selection mechanism of the nonprobability sample and on the availability of covariates for propensity estimation. In our simulations, Propensity Score Adjustment was more efficient when the propensity of being in the nonprobability sample was less related to the variable of interest. This behaviour has been observed in the literature on PSA for parametric estimation [11,24].
We used parametric methods to obtain the estimated propensities, but machine learning techniques such as regression trees, spline regression or random forests could be used instead. Recently, [24,29] presented simulation studies where decision trees, k-nearest neighbours, Naive Bayes, Random Forest, Gradient Boosting Machines and Model-Averaged Neural Networks are used for propensity score estimation. These studies compare the empirical efficiency of linear models and Machine Learning prediction algorithms in the estimation of linear parameters, but the corresponding theory is more complex and has not yet been developed. Another way to reduce the bias of the PSA estimates is to combine PSA with other techniques, such as Statistical Matching or calibration. [27] apply a combination of propensity score adjustment and calibration on auxiliary variables in a real volunteer survey aimed at a population for which a complete census was available. [32] propose a doubly robust estimator for population mean estimation that incorporates the model-based estimation framework into PSA methods, improving their efficiency and making them robust to model misspecification. Further research should focus on extending those methods to general parameter estimation.
Author Contributions: The authors contributed equally to this work in conceptualization, methodology, software and original draft preparation. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.