Inference from Non-Probability Surveys with Statistical Matching and Propensity Score Adjustment Using Modern Prediction Techniques

Abstract: Online surveys are increasingly common in social and health studies, as they provide fast and inexpensive results in comparison to traditional modes. However, these surveys often work with biased samples, as data collection is frequently non-probabilistic because of the lack of internet coverage in certain population groups and the self-selection procedures that many online surveys rely on. Several procedures have been proposed to mitigate this bias, such as propensity score adjustment (PSA) and statistical matching. In PSA, the propensity to participate in a nonprobability survey is estimated with the help of a probability reference survey and then used to obtain weighted estimates. In statistical matching, the nonprobability sample is used to train models that predict the values of the target variable, and the predictions of those models for the probability sample are used to estimate population values. In this study, both methods are compared using three datasets to simulate pseudopopulations from which nonprobability and probability samples are drawn and used to estimate population parameters. In addition, the study compares the use of linear models and machine learning prediction algorithms for propensity estimation in PSA and for predictive modeling in statistical matching. The results show that statistical matching outperforms PSA in terms of bias reduction and Root Mean Square Error (RMSE), and that simpler prediction models, such as linear and k-Nearest Neighbors, provide better outcomes than bagging algorithms.


Introduction
Surveys are a fundamental tool for data collection in areas such as social studies and health sciences. Probability sampling methods have been widely adopted by researchers in those areas, as well as in official statistics. The main reason is that probability sampling provides valid statistical inferences about large finite populations from relatively small samples, based on a solid mathematical theory that combines a random sample design with an approximately design-unbiased estimator.
Over the last decade, new alternatives to survey sample data have become popular as data sources. Examples are big data and web surveys, which have the potential of providing estimates in nearly real time, easier data access, and lower data collection costs than traditional probability sampling [1]. Very often, the data-generating process of such sources is nonprobabilistic, given that the probability of being part of the sample is not known and/or is null for some groups of the target population; as a result, these methods produce nonprobability samples. There are serious issues with the use of nonprobability survey samples; the most relevant is that the data-generating process is unknown and may suffer from coverage, nonresponse, and selection biases, which may not be ignorable and could deeply affect estimates [2]. These biases tend to become more disruptive as the population size grows, regardless of the sample size [3].
In order to correct the selection bias produced by non-random selection mechanisms, several inference procedures have been proposed in the literature. A first class of methods includes statistical models aiming to predict the values of the target variable for the non-sampled units of the population [4][5][6]. Specifying an appropriate superpopulation model capable of capturing the variation of the target variables is essential for this model-based approach. It also requires auxiliary features x to be available for each unit of both the observed and the unobserved parts of the population, a requirement that is often difficult to meet in practice.
Other studies combine a nonprobability sample with a reference probability sample, either to construct models for the units in the latter or to adjust selection probabilities. The most important methods in this setting are statistical matching and propensity score adjustment (PSA). Many works study the properties and performance of PSA [7][8][9][10][11][12], but the literature on statistical matching in this context remains scarce.
In this article, we apply machine learning prediction techniques to build statistical matching estimators and compare their performance with PSA estimators. Since no sampling design is available that would allow us to determine the main statistical properties (sampling distribution, expected value, variance, etc.) of random quantities calculated from the nonprobability sample, we cannot derive theoretical properties of the resulting estimators; instead, their behavior is studied through simulation studies that also include several propensity score techniques. Although PSA performance was compared with linear calibration in [10] and the combination of PSA with machine learning was already studied in [12], to the best of our knowledge, this is the first time that these methodologies are compared in practice and the first time that machine learning techniques are used for estimation with statistical matching from nonprobability samples.
The description of the conducted study is organized as follows: In Section 2, we introduce the notation and explain the estimation problem that can be solved with the aforementioned methods. In Sections 3 and 4, we describe the mathematical foundations of PSA and statistical matching, respectively, and their properties according to previous research. In Section 5, we briefly explain the ideas behind each of the algorithms tested in the study. In Section 6, we describe the data and the simulation study used to compare the performance of PSA and statistical matching, as well as the metrics used to measure it. Finally, in Sections 7 and 8, we show the results of the study and discuss some of their implications in the comparison between methods.

Background
Suppose that the finite population U consists of i = 1, ..., N subjects. Let y be a survey variable and y_i the value of y attached to the i-th unit, i = 1, ..., N.
Let s_v be a volunteer nonprobability sample of size n_v, obtained from U_v ⊂ U, in which the study variable y is observed.
Without any auxiliary information, the population mean Ȳ is usually estimated with the unweighted sample mean Ŷ, which produces biased estimates of the population mean. The size and direction of the bias depend on the proportion of the population with no chance of inclusion in the sample (coverage bias) and on the differences in inclusion probabilities among the members of the population with a non-zero probability of taking part in the survey (selection bias) [2,13]. The selection bias cannot be estimated in practice for most survey variables of interest. We consider the situation where a probability sample is available and compare two inference methods to treat selection biases in a general framework. Let s_r be the reference probability sample selected under the sampling design (s_d, p_d), with π_i the first-order inclusion probability of the i-th individual. Let us assume that in s_r we observe some other study variables that are common to both samples, denoted by x. The available data are {(i, y_i, x_i), i ∈ s_v} and {(i, x_i), i ∈ s_r}. We are interested in estimating a linear parameter θ_N = Σ_U a_i y_i, where the a_i are known constants. Examples include the population total T_y = Σ_U y_i, the population mean Ȳ, and the population proportion p_A = Σ_U y_i / N, where y_i = 1 if unit i belongs to the group of interest A, and 0 otherwise.

Propensity Score Adjustment
The most popular adjustment method in nonprobability settings is propensity score adjustment (PSA), or propensity weighting. This method, first developed by [14], was originally intended to correct confounding bias in the experimental design context, and it is the most widely used method in practice [2,[7][8][9][10]12,[15][16][17]. In this approach, the propensity of an individual to participate in the volunteer survey is estimated by pooling the data from both samples, s_r and s_v, and training a model (usually logistic regression) on the indicator variable δ, with δ_k = 1 if k ∈ s_v and δ_k = 0 if k ∈ s_r. We assume that the selection mechanism of s_v is ignorable; that is,

Pr(δ_k = 1 | x_k, y_k) = Pr(δ_k = 1 | x_k) = p_k(x_k).

We also assume that the mechanism follows a parametric model:

p_k(x_k) = p(x_k, γ)

for some vector γ. We obtain the pseudo maximum likelihood estimate of the parameter γ and use the inverse of the estimated response propensity as a weight for constructing the estimator [11]:

θ̂_PSA = Σ_{k ∈ s_v} y_k / p̂_k(x_k),    (4)

where p̂_k(x_k) denotes the estimated response propensity for the individual k ∈ s_v. Alternative estimators can be constructed by slightly modifying the formula in (4) [18]. Other alternatives involve the stratification of propensities into a fixed number of groups, with the idea of grouping individuals with similar volunteering propensities. For instance, in [7,8], adjustment factors f_c are obtained for the c-th stratum of individuals:

f_c = ( Σ_{k ∈ s_r^c} d_k^r / Σ_{k ∈ s_r} d_k^r ) / ( Σ_{j ∈ s_v^c} d_j^v / Σ_{j ∈ s_v} d_j^v ),

where s_r^c and s_v^c are the individuals from the s_r and s_v samples, respectively, who belong to the c-th group, while d_k^r = 1/π_k and d_j^v = 1/p_j are the design weights of the k-th individual of the reference sample and the j-th individual of the volunteer sample, respectively. The final weights are:

w_j = f_c · d_j^v,  j ∈ s_v^c.    (10)

The weights are then used in the Horvitz-Thompson estimator. The approach used in [9] does not require the calculation of f_c.
It only uses the average propensity within each group,

p̄_c = ( Σ_{k ∈ s_r^c ∪ s_v^c} p̂_k ) / ( n_r^c + n_v^c ),

where n_r^c and n_v^c are the number of individuals from the reference and the volunteer sample, respectively, that belong to the c-th group. The mean propensity of the group each member of the volunteer sample belongs to is then used in the Horvitz-Thompson estimator:

θ̂ = Σ_c Σ_{k ∈ s_v^c} y_k / p̄_c.

PSA in nonprobability online surveys has been proven to be efficient if the selection mechanism is ignorable and the right covariates are used for modeling [7]. If some of these conditions do not hold, the use of PSA can induce biased estimates that require further adjustments [9]. The combination of PSA and calibration has shown successful results in terms of bias removal [8,10].
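To make the procedure concrete, the following sketch estimates propensities by pseudo maximum likelihood for a logistic model and applies inverse-propensity weighting, in the spirit of the estimator in (4). The population, selection mechanism, sample fractions, and learning rate are entirely hypothetical, and a normalized (Hajek-type) mean is used; this is an illustrative sketch, not the implementation used in the study (which relies on the R package caret).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: x drives both y and the propensity to volunteer.
N = 20000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=N)

# Volunteer (nonprobability) sample: inclusion grows with x -> selection bias.
s_v = rng.random(N) < 0.3 / (1 + np.exp(-(x - 0.5)))
# Reference probability sample: simple random sampling with pi = 0.05.
s_r = rng.random(N) < 0.05
d_r = 1 / 0.05                     # design weight of the reference units

# Pseudo maximum likelihood for a logistic propensity model:
# gradient = sum_{s_v} x_i - sum_{s_r} d_r * p(x_i) * x_i
Xv = np.column_stack([np.ones(s_v.sum()), x[s_v]])
Xr = np.column_stack([np.ones(s_r.sum()), x[s_r]])
beta = np.zeros(2)
for _ in range(5000):
    pr = 1 / (1 + np.exp(-Xr @ beta))
    grad = Xv.sum(axis=0) - d_r * (Xr * pr[:, None]).sum(axis=0)
    beta += 0.05 * grad / N

p_hat = 1 / (1 + np.exp(-Xv @ beta))    # estimated propensities on s_v
w = 1 / p_hat                           # inverse-propensity weights

y_naive = y[s_v].mean()                 # unweighted, biased
y_psa = np.sum(w * y[s_v]) / np.sum(w)  # PSA-weighted (Hajek) mean
```

The weighted mean pulls the estimate back toward the population mean because low-propensity volunteers receive large weights.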
Several machine learning models have been suggested as alternatives to logistic regression for the estimation of propensity scores in the experimental design context, with promising results. Refs. [19,20] examined the performance of various classification and regression trees (CART) for PSA in sample balancing. Other applications of machine learning algorithms in PSA involve their use in nonresponse adjustments; more precisely, Random Forests have been studied as propensity predictors [21]. Regarding the nonprobability sampling context covered in this study, [12] presented a simulation study using decision trees, k-Nearest Neighbors, Naive Bayes, Random Forests, and Gradient Boosting Machines that supports the view given in [6] that machine learning methods can be used to remove selection bias in nonprobability samples. All of those algorithms, along with Discriminant Analysis and Model Averaged Neural Networks, are used for propensity estimation in this study. Further details can be consulted in Section 5.

Statistical Matching
Statistical matching (also known as data fusion, data merging, or synthetic matching) is a model-based approach introduced by [22] and further developed by [23] for nonresponse in probability samples. The idea in this context is to model the relationship between y_k and x_k using the volunteer sample s_v in order to predict y_k for the reference sample.
Suppose that the finite population {(i, y_i, x_i), i ∈ U} can be viewed as a random sample from the superpopulation model:

y_i = m(x_i) + e_i,  i = 1, ..., N,

where m(x_i) = E_m(y_i | x_i) and the random vector e = (e_1, ..., e_N) is assumed to have zero mean. Under the design-based approach, the usual estimator of a population linear parameter is the Horvitz-Thompson estimator given by:

θ̂ = Σ_{k ∈ s_r} d_k a_k y_k,

where d_k = 1/π_k is the sampling weight of unit k. This estimator is design-unbiased, consistent for θ, and asymptotically normally distributed under mild conditions [24]. It cannot be calculated here because y_k is not observed for the units k ∈ s_r; thus, we substitute y_k by the predicted values from the above model. The matching estimator is then given by:

θ̂_SM = Σ_{k ∈ s_r} d_k a_k ŷ_k,

where ŷ_k is the predicted value of y_k.
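A minimal sketch of this two-step procedure (fit a model on the volunteer sample, predict on the reference sample, plug the predictions into the weighted estimator of the mean) is given below with a synthetic population and a linear working model; all data and sample fractions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population; the volunteer sample over-represents large x.
N = 20000
x = rng.normal(size=N)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=N)
s_v = rng.random(N) < 0.3 / (1 + np.exp(-(x - 0.5)))  # volunteer sample
s_r = rng.random(N) < 0.05                            # SRS reference sample
d = 1 / 0.05                                          # design weight d_k

# Step 1: train m(x) on the volunteer sample (here, linear regression).
Xv = np.column_stack([np.ones(s_v.sum()), x[s_v]])
beta, *_ = np.linalg.lstsq(Xv, y[s_v], rcond=None)

# Step 2: predict y for the reference sample and plug the predictions
# into the weighted (Horvitz-Thompson-type) estimator of the mean.
Xr = np.column_stack([np.ones(s_r.sum()), x[s_r]])
y_hat = Xr @ beta
mean_match = np.sum(d * y_hat) / np.sum(d * np.ones(s_r.sum()))

mean_naive = y[s_v].mean()   # unweighted volunteer mean, biased
```

Because the working model is estimated on the biased sample but evaluated on the probability sample, the matching estimate recovers the population mean when the model is adequate.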
The key issue is how to predict the values of y_k. Usually, the linear regression model E_m(y_i | x_i) = x_i'β is considered for estimation; it is easy to implement in most existing statistical packages, but several drawbacks have to be considered. Parametric models require assumptions regarding variable selection, the functional form and distributions of variables, and the specification of interactions. If any of these assumptions are incorrect, the bias reduction could be incomplete or nonexistent. In contrast to statistical modeling approaches, which assume a data model with parameters estimated from the data, machine learning algorithms aim to extract the relationship between an outcome and its predictors without an a priori data model. These methods have not been widely applied in the statistical matching literature. Here, we propose the use of machine learning methods as an alternative to linear regression modeling. The prediction methods considered in this article are described in the following section.

Generalized Linear Models (GLM)
The most basic regression model consists of calculating the coefficients β of a linear regression from the input data. The coefficients that minimize the Ordinary Least Squares criterion are estimated with the formula β̂ = (X'X)^(-1) X'Y. This method is only stable as long as X'X is relatively close to the identity matrix [25]. Quite often, covariates suffer from multicollinearity. For those cases, ridge regression adds an identity term to control the instability: β̂ = (X'X + kI)^(-1) X'Y, where k ≥ 0 can be chosen arbitrarily or via parameter tuning. β̂ can also be interpreted as the posterior mean under a normal prior with zero mean and covariance (σ²/k)I [26]. From that point of view, Bayesian estimates can be obtained via Gibbs sampling.
Instead, the Least Absolute Shrinkage and Selection Operator (LASSO) regression [27] introduces a penalty according to the following optimization problem:

β̂ = argmin_β Σ_i (y_i − x_i'β)²  subject to  Σ_j |β_j| ≤ t

(equivalently, a penalized form with a parameter α multiplying Σ_j |β_j| can be used), where t is a hyperparameter that forces the shrinkage of the coefficients. In this case, coefficients are allowed to be exactly zero. The main difference is therefore that LASSO allows the optimization procedure to select variables, while ridge regression may produce very small coefficients in some cases without them ever reaching zero. Alternatively, LASSO coefficients can be estimated as the posterior mode under Laplace priors; Bayesian estimates can then be calculated as described in [28]. Ridge and LASSO are both considered standard penalized regression models [29].
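The stabilizing effect of the ridge term can be checked numerically. The sketch below contrasts the OLS and ridge solutions on two almost collinear synthetic covariates (the value of k is arbitrary). LASSO has no closed form and would require an iterative solver such as coordinate descent.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two almost collinear covariates; only x1 truly drives y.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

# OLS: beta = (X'X)^(-1) X'Y  -- unstable when X'X is near-singular.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: beta = (X'X + kI)^(-1) X'Y  -- the kI term controls the instability.
k = 1.0
b_ridge = np.linalg.solve(X.T @ X + k * np.eye(2), X.T @ y)
```

Ridge shrinks every eigencomponent of the OLS solution by λ/(λ + k), so its coefficient vector always has a strictly smaller norm while the combined effect of the two collinear covariates is preserved.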
For PSA, these methods can be used to estimate the propensities. First, the target variable for the model is defined as δ_k = 1 if k ∈ s_v and δ_k = 0 if k ∈ s_r. The pseudo maximum likelihood can then be optimized via logistic regression or any of the penalized variants described above.
For statistical matching, the target variable for the model is the survey variable itself. Therefore, the model is trained with the volunteer sample and then used to obtain the estimated responses for the reference sample.

Discriminant Analysis
When the predicted variable is discrete, Discriminant Analysis can be used to classify individuals. Let y be the dependent variable with K classes, π_k the prior probability that an individual belongs to the k-th class, X the n × p matrix of covariates, and f_k(x) the distribution of x conditional on y belonging to the k-th class. As described in [30], Linear Discriminant Analysis (LDA) assigns to an individual the class that maximizes the posterior probability:

Pr(y = k | x) = π_k f_k(x) / (π_1 f_1(x) + ... + π_K f_K(x)).    (15)

Assuming that X | y = k follows a multivariate Gaussian distribution N_p(μ_k, Σ), LDA works by assigning an instance the class for which the coefficient

δ_k(x_i) = x_i' Σ^(-1) μ_k − (1/2) μ_k' Σ^(-1) μ_k + log π_k

is largest. Note that the decision depends on a linear combination of the multivariate Gaussian distribution parameters; hence the name Linear Discriminant Analysis. When used for PSA, K = 2 and, as a result, the outcome of LDA is the posterior probability obtained in (15) for the class δ = 1. LDA can provide good results; however, its simplicity can be a handicap in cases where the relationships between covariates and target are complex, and its performance worsens when the covariates are correlated [31]. For these reasons, alternatives considering smoothing, such as Penalized Discriminant Analysis (FDA) or Shrinkage Discriminant Analysis (SDA), can be used. The former expands the covariate matrix and applies penalization coefficients in the calculation of thresholds [32], while the latter performs a shrinkage of covariates similar to that of the ridge or LASSO models.
LDA is only suitable for classification and, therefore, it cannot be used for statistical matching when the survey variable is continuous. However, its probabilistic nature makes it appropriate for estimating propensities in PSA, as described above.
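As an illustration of LDA used for propensity estimation, the sketch below pools a hypothetical volunteer sample (δ = 1) with a reference sample (δ = 0) and takes the posterior probability of the class δ = 1 as the estimated propensity. The covariate shift between the two groups is synthetic, and scikit-learn's LDA stands in for the caret implementation used in the study.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)

# delta = 1 for volunteer units, 0 for reference units (K = 2 classes).
n = 1000
x_v = rng.normal(loc=0.5, size=(n, 2))  # volunteers: shifted covariates
x_r = rng.normal(loc=0.0, size=(n, 2))  # reference: baseline covariates
X = np.vstack([x_v, x_r])
delta = np.r_[np.ones(n), np.zeros(n)]

lda = LinearDiscriminantAnalysis().fit(X, delta)
# The posterior P(delta = 1 | x) is used directly as the estimated propensity.
p_hat = lda.predict_proba(x_v)[:, 1]
```

Because the volunteer units sit in the region where the class δ = 1 dominates, their estimated propensities are above one half on average.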

Decision Trees, Bagged Trees, and Random Forests
Decision trees sequentially split the input data via conditional clauses until reaching a terminal node, which assigns a specific class or value. This process results in the following estimate of the expectation E_m(y_i | x_i):

Ê_m(y_i | x_i) = ȳ(s_{J_i}),

where ȳ(s_{J_i}) denotes the mean of y among the members of the sampled population s meeting the criteria of the i-th individual's terminal node, J_i.
Bagged trees combine this approach with bagging [33]. Bagging averages the predictions of multiple weak learners (in this case, m unpruned trees). So that they complement each other, each is trained on a bootstrapped subsample of the complete dataset. Therefore:

Ê_m(y_i | x_i) = (1/m) Σ_{j=1}^m ȳ(s_{J_i^j}),

where ȳ(s_{J_i^j}) denotes the mean of y among the members of the sampled population s meeting the criteria of the terminal node of the j-th tree reached by the i-th individual, J_i^j. This technique is known to improve the accuracy of predictions [34]. Alternatively, Random Forests can also be used for both regression and classification using weak learners [35]. In this algorithm, the input variables for each weak learner are a random subset of all of the covariates, instead of the whole x_i vector as in bagged trees.
This approach is easy to apply for statistical matching. As usual, a model is trained using the volunteer sample in order to predict a response based on the covariates. Said model is then applied to the reference sample covariates. However, tree-based models are not good for estimating probabilities [36]. They can still be used for PSA, taking the proportion of weak classifiers that agree as the estimated propensity.
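A brief sketch of matching with bagged trees follows, using scikit-learn rather than the caret setup of the study, on a synthetic nonlinear population. For simplicity both samples are drawn completely at random, so the point here is the nonlinear fit rather than bias correction.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)

# Synthetic population with a nonlinear regression function m(x).
N = 5000
x = rng.uniform(-2, 2, size=(N, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.1, size=N)
s_v = rng.random(N) < 0.3   # "volunteer" sample used for training
s_r = rng.random(N) < 0.1   # reference sample used for prediction

# Bagging: 50 trees, each grown on a bootstrap resample of s_v.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                       random_state=0).fit(x[s_v], y[s_v])
y_hat = bag.predict(x[s_r])    # predictions for the reference sample
mean_match = y_hat.mean()      # matching estimate of the population mean
```

Replacing BaggingRegressor with RandomForestRegressor would additionally randomize the covariates available to each tree, as described above.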

Gradient Boosting Machine
The Gradient Boosting Machine (GBM) also works as an ensemble of weak learners. Boosting is an iterative process that trains subsequent models giving more importance to the data on which previous models failed. This idea can be interpreted as an optimization problem [37] and is therefore suitable for the gradient descent algorithm [38]. The estimates for y are then:

ŷ_i = J(x_i) · v,

where J(x_i) stands for the matrix of terminal-node predictions of the m decision trees and v is a vector representing the weight of each tree. GBM has improved on previous state-of-the-art models in some cases [39]. GBM can be used for PSA and statistical matching in the same way as the previous ensemble models.
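For PSA, a boosted ensemble can output class-1 probabilities that play the role of propensities. The sketch below uses scikit-learn's GradientBoostingClassifier on a hypothetical pooled sample; the covariate shift and sample sizes are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(5)

# Pooled sample: delta = 1 for volunteers, 0 for reference units.
n = 2000
x_v = rng.normal(loc=1.0, size=(n, 1))  # volunteer covariates (shifted)
x_r = rng.normal(loc=0.0, size=(n, 1))  # reference covariates
X = np.vstack([x_v, x_r])
delta = np.r_[np.ones(n), np.zeros(n)]

gbm = GradientBoostingClassifier(random_state=0).fit(X, delta)
p_hat = gbm.predict_proba(x_v)[:, 1]   # estimated propensities for s_v
w = 1 / np.clip(p_hat, 1e-3, None)     # inverse-propensity weights
```

The clipping guards against the extreme probabilities that tree-based ensembles can produce, which would otherwise yield unstable weights.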

k-Nearest Neighbors
k-Nearest Neighbors is "one of the most fundamental and simple classification methods" [40]. It requires no training in the usual sense: the algorithm simply averages the value of the target variable over the k individuals closest to the individual being estimated (its k nearest neighbors), given a certain distance defined on the covariates. This is:

ŷ_i = (1/k) Σ_{j=1}^k y_{(j)},

where x_(1), ..., x_(n−1) are the individuals ordered from closest to furthest from x_i and y_(j) is the y-value of x_(j). Choosing the right k is important for the proper performance of the algorithm. For classification, k-Nearest Neighbors usually predicts the most repeated label among the k nearest neighbors; however, it can instead use the label proportions to estimate probabilities. This idea can be applied to PSA, taking δ_k = 1 if k ∈ s_v and δ_k = 0 if k ∈ s_r, as before. For statistical matching, k-Nearest Neighbors can be used directly to predict the responses.
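A bare-bones version of this prediction rule (averaging y over the k closest training units) takes only a few lines; the training sample and query points below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical "volunteer" training data and new covariate values to predict.
x_train = rng.uniform(0, 1, size=200)
y_train = 3 * x_train + rng.normal(scale=0.1, size=200)
x_new = np.array([0.25, 0.75])

def knn_predict(x0, k=5):
    # Average y over the k training units closest to x0.
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

preds = np.array([knn_predict(x0) for x0 in x_new])
```

Since y ≈ 3x here, the predictions at x = 0.25 and x = 0.75 land near 0.75 and 2.25, respectively.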

Naive Bayes
The Naive Bayes algorithm is a classifier (that is, it can only be used to predict discrete variables) based on the Bayes theorem. In this study, Naive Bayes has been used only for propensity estimation in PSA. In this case, the Bayes theorem relates the probability of being part of the volunteer sample to the occurrence of a given covariate vector x_i, that is, the values of the covariates for a given individual i.
The Naive Bayes classifier is simple in its reasoning, but it can provide precise results in PSA under certain conditions [12]. On the other hand, predictions from Naive Bayes can become unstable when covariates with high cardinality (e.g., numerical variables) are present, as discrete domains are required for the computation of probabilities [41].
As was the case with discriminant analysis, Naive Bayes works naturally with probabilities; therefore, it is suitable for estimating propensities in PSA, but not for statistical matching if the survey variable is continuous.

Neural Networks with Bayesian Regularization
Neural networks calculate the expectation of y_i as:

Ê_m(y_i | x_i) = g( Σ_k v_k f_k(x_i) + b ),

where g and the f_k stand for activation functions, v_k is the weight of the k-th neuron, and b is the activation threshold [42]. The inputs follow an iterative process through one or more hidden layers until reaching the last layer, which produces the final output. The weights are initialized randomly and then optimized via gradient descent with the backpropagation algorithm [43].
Overfitting is an important problem for neural networks, so prior distributions can be imposed on the v_k weights as a regularization method. The weights are then optimized to maximize the posterior density or the likelihood, as described in [42]. Another option is bagging of neural networks, as explained in [44]: the same neural network model is fitted using different seeds, and the results are averaged to obtain the predictions. This approach is known as Model Averaged Neural Networks.
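Model averaging over seeds can be sketched as follows; scikit-learn's MLPRegressor is used here as a stand-in for the Bayesian-regularized networks of the study, and the data, architecture, and number of repeats are illustrative only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)

# Hypothetical nonlinear regression problem.
x = rng.uniform(-1, 1, size=(500, 1))
y = x[:, 0] ** 2 + rng.normal(scale=0.05, size=500)

# Fit the same architecture with different seeds and average the predictions.
preds = []
for seed in range(3):
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                       random_state=seed)
    preds.append(net.fit(x, y).predict(x))
avg_pred = np.mean(preds, axis=0)
```

Averaging over random initializations smooths out seed-dependent artifacts of any single fitted network.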
Neural networks have already been considered for superpopulation modeling [45]. They are the state of the art for many domains [46]; Bayesian neural networks in particular are "fairly robust with respect to the problems of overfitting and hyperparameter choice" [47].
Since they work as universal approximators [48], neural networks can be used for PSA and statistical matching in the same way as generalized linear models.

Data
All of the experiments were performed using three different populations. In addition, two different sampling strategies were selected for each one in order to recreate the behavior of the estimates under the lack of representativeness of the potentially sampled subpopulation and under selection mechanisms tied to individual features (e.g., voluntariness).
The first population, which will be referred to as P1, corresponds to the microdata of the Spanish Life Conditions Survey (2012 edition) [49]. It collects data on economic and life condition variables for 28,610 adults living in Spain. We took the mean self-reported health, measured on a scale from 1 to 5, as the objective variable to estimate. The algorithms were trained using the 56 most related variables, excluding "health issues in the last six months", "chronic conditions", "household income difficulties", and "civil status" (as they are too correlated with the target variable). The first sampling strategy for this population, which will be referred to as P1S1, was simple random sampling excluding the individuals without internet access. In the second sampling strategy, P1S2, we additionally included a propensity to participate in the sample using the formula Pr(yr) = (yr² − 1900²)/(1996² − 1900²), where yr is the year in which the individual was born. This way, linear models should have more trouble learning the relationships.
BigLucy [50], P2, was chosen as the second population. It consists of various financial variables of 85,396 industrial companies of a city for a particular fiscal year. The target variable chosen was the annual income in the previous fiscal year. The algorithms were trained using the size of the company (small, medium, or big), the number of employees, the company's income tax, and whether it is ISO certified. The first sampling strategy for this population, P2S1, was simple random sampling among the companies with SPAM options that are not small companies. This approach tested whether the models were able to correctly estimate the annual income for companies that were not represented in the training data. The second sampling strategy, P2S2, was simple random sampling among the companies with SPAM options, including a propensity to participate calculated as Pr(taxes) = min(taxes²/30, 1), where taxes is the company's income tax. This scenario is similar to the previous one but implies a quadratic dependence.
The Bank Marketing Data Set [51], P3, is the third population. It includes information on 41,188 phone calls related to direct marketing campaigns of a Portuguese banking institution. Our goal is to estimate the mean contact duration. A total of 18 variables were used for training; we excluded two of the dataset variables (the month and whether the client has subscribed to a term deposit) in order to make the inference more difficult. For the first sampling strategy, P3S1, we applied simple random sampling among the clients contacted more than three times. For the second sampling strategy, P3S2, we applied simple random sampling among the clients contacted more than twice. Surprisingly, the weaker filter led to worse estimates in some cases.

Simulation
Each population and sampling strategy was simulated using various sample sizes: 1000, 2000, and 5000. The same size was taken for the convenience sample and for the reference sample. For each sample size, 500 simulations were executed. In each simulation, PSA (using the weights defined in Formula (4) in Section 3), PSA with stratification (using the weights defined in Formula (10)), and statistical matching were applied. For statistical matching, the following regression algorithms were used: linear regression (glm), ridge regression with and without Bayesian priors (bridge and ridge, respectively), LASSO regression via penalized maximum likelihood (glmnet), the LARS-EN algorithm (lasso) and the same with Bayesian priors on the estimates (blasso), k-Nearest Neighbors (knn), Bagged Trees (treebag), Gradient Boosting Machine (gbm), and Bayesian-regularized Neural Networks (brnn).
These represent standard variants of different model types: linear regression, penalized regression, Bayesian models, prototype models, trees, gradient boosting, neural networks, and discriminant analysis. All of the methods were trained using default hyperparameters, except for k-Nearest Neighbors, Naive Bayes, and C4.5, whose performance improved greatly after hyperparameter tuning; this tuning was performed via bootstrap. The framework used for training, optimization, and prediction was caret [52], an R [53] package. Three metrics are considered for evaluating each scenario: relative mean bias, relative standard deviation, and relative Root Mean Square Error (RMSE).
These metrics are computed as:

RBias = (p̂_y − p_y) / p_y,
RSD = sqrt( (1/500) Σ_{i=1}^{500} (p̂_{y,i} − p̂_y)² ) / p_y,
RRMSE = sqrt( (1/500) Σ_{i=1}^{500} (p̂_{y,i} − p_y)² ) / p_y,

where p_y is the true value of the target variable, p̂_y is the mean of the 500 estimations of p_y, and p̂_{y,i} is the estimation of p_y in the i-th simulation. In order to rank the estimators, the mean efficiency, the median efficiency, and the number of times each estimator has been among the best are measured. An estimator is considered to be among the best when its RMSE differs from the minimum RMSE by less than 1%. The efficiency compares the RMSE of an estimator with that of a baseline, where the baseline is the RMSE of using the sample average as the estimation.
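Assuming the usual definitions of these relative metrics (each scaled by the true value of the parameter), they can be computed as below; the numbers are toy estimates, not results from the study.

```python
import numpy as np

def relative_metrics(estimates, true_value):
    # Relative mean bias, relative standard deviation, and relative RMSE,
    # each scaled by the true value of the parameter.
    est = np.asarray(estimates, dtype=float)
    rbias = (est.mean() - true_value) / true_value
    rsd = est.std() / true_value
    rrmse = np.sqrt(np.mean((est - true_value) ** 2)) / true_value
    return rbias, rsd, rrmse

rbias, rsd, rrmse = relative_metrics([9.0, 10.0, 11.0, 10.0], 10.0)
```

Note the identity RRMSE² = RBias² + RSD², so an unbiased but variable estimator and a biased but stable one can have the same RMSE.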
To complete the comparison analysis, the results of relative bias, standard deviation, and RMSE were analyzed using linear mixed-effects regression. This approach provides estimates of the effect size of each adjustment method and algorithm. Datasets were considered random effects, while adjustment (matching, PSA, or PSA with propensity stratification) and algorithm (glm, gbm, glmnet, knn, and treebag, as they were the only algorithms used in all adjustments) were the fixed effects variables. All three response variables take non-negative values, and the interpretation is the same: The lower their value is, the better (less biased and/or less variable) the estimations are. Following this rule, negative Beta coefficients indicate that a given factor is contributing to better estimations, and vice versa for positive coefficients.

Results
Tables A1-A3 in the Appendix A show, respectively, the resulting biases, deviations, and RMSEs. In general, it can be observed that statistical matching outperforms PSA, which outperforms PSA with stratification. Nevertheless, all three methods consistently reduce the sample bias.
In terms of machine learning algorithms, basic linear models seem the most robust. Others, like Naive Bayes, can achieve outstanding results, but only in some cases. It is also interesting to note that C4.5 trees can even lead to worse estimates than simply using the sample average.
The final ranking confirming this impression can be seen in Table 1. Statistical matching with linear or ridge regression has the best mean efficiency and has been among the best more often than the rest of the approaches. This is not a surprise, since their simplicity avoids overfitting, presumably one of the main problems of matching. Ridge regression should be preferred if the data suffer from multicollinearity; otherwise, linear regression alone is very effective (and faster). The results of the linear mixed-effects modeling can be consulted in Tables 2-4. It is noticeable that linear models, LASSO with the LARS-EN algorithm, and k-Nearest Neighbors outperform Bagged Trees in all metrics (modulus of relative bias, relative standard deviation, and RMSE), while there is no evidence that GBM differs from any of them. Regarding adjustment methods, PSA (both with and without propensity stratification) showed significantly larger bias and RMSE than statistical matching. In the case of the standard deviation, there is also evidence that PSA without propensity stratification provides higher values than matching. Altogether, these results indicate that matching has a larger effect on bias reduction than PSA.

Discussion
Nonprobability samples are increasingly common due to growing internet penetration and the subsequent rise of online questionnaires. These questionnaires are a faster, less expensive, and more convenient method of information collection than traditional ones. However, samples obtained with this technique suffer from several sources of bias: despite the increasing internet penetration, large population groups (less educated or elderly people) are still not properly represented. In addition, questionnaires are often administered with non-probabilistic sampling methods (e.g., snowballing), which imply that the selection is controlled by the interviewees themselves, causing a selection bias.
In this study, we focus on two of the methods proposed to reduce the biases produced by nonprobability sampling: PSA and matching. We also compare the outcomes when the predictive modeling required in both methods is done through linear regression and through machine learning algorithms. PSA and matching require a probability sample on which the target variable has not been measured. The unit sampling performed in the simulations captures different self-selection scenarios in nonprobability sampling, while probability samples are drawn by simple random sampling with no sources of bias. This canonical setting is not the usual one, as reference samples are mostly drawn with complex sampling methods and some bias is typically present. Further research could take these imperfect situations into account.
Results show that statistical matching provides better results than PSA on bias reduction and RMSE, regardless of the dataset and selection mechanism. In addition, linear models and k-nearest neighbors provided, on average, better results in terms of bias reduction than more complex models, such as GBM and Bagged Trees. These results are relevant since, even though there are comparative studies between adjustment techniques in nonprobability surveys [11,54], to the best of our knowledge, no comparison has been done before between these two methods.
Before closing, several limitations of our analysis should be mentioned. Given that the datasets used for simulation are real-life examples, we cannot determine whether a selection bias mechanism is Missing At Random (MAR) or Missing Not At Random (MNAR), as the causal relationships are not known. The selection mechanism is known to make a difference in how challenging bias reduction can be, but it was not possible to assess this in the present study.
In the near future, we plan to explore how to combine the PSA and statistical matching techniques. Shrinkage is a natural way to improve the available estimates in terms of mean squared error and has been used by many authors in other contexts (e.g., [55][56][57]). The idea is to shrink the estimator θ̂_SM towards the estimator θ̂_PSA and obtain θ̂_srk = K θ̂_SM + (1 − K) θ̂_PSA, where K is a constant satisfying 0 < K < 1.
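With hypothetical values for the two estimates, the composite estimator reads:

```python
def shrink(theta_sm, theta_psa, K=0.5):
    # Convex combination: theta_srk = K * theta_SM + (1 - K) * theta_PSA.
    assert 0 < K < 1
    return K * theta_sm + (1 - K) * theta_psa

theta_srk = shrink(10.2, 10.8, K=0.75)  # hypothetical estimates
```

Choosing K (for instance, by minimizing an estimate of the mean squared error) is precisely the open question this future line of work would address.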
Another possibility takes into account that most machine learning models allow weighting of the training data. The weights obtained via PSA for the volunteer sample could therefore be used when training the model employed for statistical matching, since that model is trained on the same sample.