An Approach to Integrating a Non-Probability Sample in the Population Census

: Population censuses are increasingly using administrative information and sampling as alternatives to collecting detailed data from individuals. Non-probability samples can also be an additional, relatively inexpensive data source, although they require special treatment. In this paper, we consider methods for integrating a non-representative volunteer sample into a population census survey, where the complementary probability sample is drawn from the rest of the population. We investigate two approaches to correcting non-probability sample selection bias: adjustment using propensity scores, which models participation in the voluntary sample, and doubly robust estimation, which has the property of persisting possible misspeciﬁcation of the latter model. We combine the estimators of population parameters that correct the selection bias with the estimators based on a representative union of both samples. Our analysis shows that the availability of detailed auxiliary information simpliﬁes the applied estimation procedures, which are efﬁcient in the Lithuanian census survey. Our ﬁndings also reveal the biased nature of the non-probability sample. For instance, when estimating the proportions of professed religions, smaller religious communities exhibit a higher participation rate than other groups. The combination of estimators corrects such selection bias. Our methodology for combining the voluntary and probability samples can be applied to other sample surveys.


Introduction
Population censuses are traditionally understood as large-scale surveys conducted once every ten years, where all individuals provide their data through census questionnaires.However, such data collections are expensive and require a lot of other resources.Therefore, new ways of conducting population censuses are being discussed more frequently, particularly as administrative data become more accessible [1][2][3][4].Lithuania is an example, as Statistics Lithuania (the State Data Agency) has conducted the Population and Housing Census 2021 primarily based on administrative data from state registers and information systems.
Unfortunately, certain census variables, such as professed religion or mother tongue, cannot be derived from administrative sources.For these variables, a sample survey could be a compromise that supports the idea of optimizing the population census.Probability sampling methods, along with sample design-based inference, are an accepted approach to surveying finite populations in many areas of statistics [5], particularly in official statistics.Probability samples are also being used in censuses, as seen in [6].On the other hand, the use of alternative data sources, such as big data or non-probability samples, has been studied extensively in recent years since these data are cheaper and much easier to obtain [7][8][9][10].However, unlike with probability samples, the inclusion mechanisms into non-probability samples are typically unknown; therefore, inclusion probabilities need to be estimated to correct the sample selection bias.Even a very large non-probability sample may lead to worse estimation results than a small probability sample if the sample selection bias is not taken into account [11].To our knowledge, non-probability samples have not yet been directly used in censuses, including contemporary corrections of their biases.We present the results of our research on the integration of such a non-probability sample in a population census, which might be useful for other researchers and practitioners working with censuses and sample surveys.
We considered the sampling framework created in the survey of the Lithuanian census: firstly, the data were collected through the voluntary (non-probability) sample and then the probability sample was drawn from the rest of the census population.Our scenario is similar to the one considered in [12,13] but differs from another one often provided in the literature, where the study variables were assumed to be unobserved in a probability sample [14,15].Moreover, we had access to complete auxiliary information from administrative sources and previous censuses, which may not have always been the case in other surveys.This enabled us to simplify the estimation procedures used in [14] and apply non-probability sample integration similar to [12,13].
Our goal is to efficiently combine both the non-probability and probability samples to estimate the parameters of interest.In Section 2.3.1, we consider a natural post-stratified generalized regression (calibrated) estimator of population means, as described in [12].This estimator is based on the union of the probability and voluntary samples, with inclusion probabilities initially set to one for units in the latter sample.In Section 2.3.2,we explore an alternative to the post-stratified estimator, as presented in [14], which is the inverse probability weighting estimator based on estimated inclusion probabilities (propensity scores) for the non-probability sample.We adapt the methodology of [14] to our framework to derive the variance estimation formula for this estimator.Finally, we combine both estimators in Section 2.3.5 by taking into account their estimated variances.In Section 2.3.4,we investigate the doubly robust estimator for the non-probability sample, which provides protection against possible misspecification of the propensity score model [14].This estimator incorporates model-based prediction estimators for the parameters of interest and exploits complete auxiliary information.We combine the doubly robust estimator with an analogous generalized difference estimator from [16], which we describe in Section 2.3.3.Our aim is to determine which combination works best, at least in our application to the population census.
The application of the considered estimators to the Lithuanian census survey is elaborated on in Section 3. A discussion of the results of the study and some future insights are reviewed in Section 4. The most relevant conclusions are outlined in Section 5.By summarizing our findings supported by the analysis of the real census data, it is possible to benefit from the voluntary sample, especially if the estimators based on it are combined with those using the probability sample, and good auxiliary information is available.

Sampling from the Finite Population
We consider any continuous or binary study variable y with the fixed values y 1 , . . ., y N in a finite population U = {1, . . ., N} of size N.We estimate the population mean or proportion To estimate population parameters (1), assume that, at first, a non-probability sample s A of size n A is obtained from U , and a sample s B of size n B is drawn according to any probability sampling design without replacement from the rest of the survey population U \{s A } afterward.We can then interpret that the union of both samples s = s A ∪ s B of size n = n A + n B is drawn according to the probability sampling design p(•) with inclusion into the sample probabilities π k = P p {k ∈ s} > 0, k ∈ U , where we set Hereinafter, we use the notation P p , E p , and V p to denote the probability, expectation, and variance, calculated according to the randomness induced by p(•), respectively.We write d k = 1/π k to denote the sampling weights.

Auxiliary Data and Outcome Regression Model
We associate with the unit k ∈ U the values x k of the auxiliary variables x, and assume that these values are known for all population units.Hence, the complete auxiliary information is available.
Suppose that the relationship between the variables y and x can be described by a semiparametric outcome regression model ξ: where β and σ 2 are unknown parameters, v k = v(x k ) is a known function of x k , and m(x k , β) has a known form as well, for example, m(x k , β) = x k β.Here, E ξ and V ξ denote the expectation and variance with respect to the model ξ.We assume (without any loss of generality) that 1 is the first component of the vector x k for all k ∈ U .

Estimation of Population Parameters 2.3.1. Post-Stratified Generalized Regression Estimator
Let us consider the combined sample s with the accompanying probability sampling design p(•).Taking m(x k , β) = x k β in (2), we have the linear regression model, which is used to build the generalized regression estimator [17] μGR = 1 of (1), where The quantity B estimates the population characteristic which is, in turn, the generalized least squares estimator of β of the linear regression model, which is also called the assisting model.Estimator (3) is equivalent to the calibrated estimator [18] μGR = 1 often used in practice, where the calibration weights w k , k ∈ s, are chosen to minimize the distance function subject to the calibration equations Estimator ( 3) is approximately design-unbiased, i.e., E p ( μGR ) ≈ µ.The generalized regression estimator (3) is also referred to as the post-stratified estimator in [12], with two post-strata, i.e., s A and U \{s A }.The authors of [12] argue that such estimation is efficient if the non-probability sample s A is very large.
According to [17], a design-consistent estimator of the variance V p ( μGR ) is ψGR = 1 where π kl = P p {k, l ∈ s} > 0 are the second-order inclusions into the sample probabilities.

Inverse Probability Weighting Estimator Based on the Propensity Score Model
Let us consider only the non-probability sample s A .The non-probability sample itself does not represent the target population, and naive estimators based on it are typically biased [11].The main obstacle is the unknown selection mechanism for a unit to be included in the sample.
Let R k = I(k ∈ s A ) be the indicator variable for a unit k ∈ U selected to the sample s A .The probabilities are called the propensity scores, where the subscript q refers to the propensity score model.Probabilities ( 6) are analogous to the inclusion into the sample probabilities π k (for probability samples) since they describe the inclusion into the sample s A .The propensity scores π A k , k ∈ s A , need to be estimated before using them to weigh the units of the non-probability sample.
The following assumptions are considered to simplify the propensity score model [14]: A1 The indicator R k and the study variable y k are independent given the covariates x k .A2 All units have a nonzero propensity score: π A k > 0 for all k ∈ U .A3 The indicators R k and R l are independent, given x k and x l for k = l.

Due to assumption A1, we have π
for all k ∈ U , and the selection mechanism is called ignorable.It is similar to the notion of missing at random (MAR) used in missing data analyses [19].
We model the propensity scores π where θ is the model parameter with the unknown true value θ 0 .The propensity score estimates πA k under the logistic regression model ( 7) are obtained from the maximum likelihood estimator πA k = π(x k , θ), where θ maximizes the log-likelihood function The maximum likelihood estimator θ is found by solving the score equations A conventional way to do it is to apply the Newton-Raphson iterative procedure.The estimated propensity scores πA k = π(x k , θ), k ∈ s A , are then used to compute the inverse probability weighting (IPW) estimator [14] μIPW = 1 of the population mean µ.This is the adaptation of the Hájek estimator used for the probability samples.Estimator (8) can correct the sample selection bias efficiently if the propensity score model is well-specified.We construct the estimator of variance V q ( μIPW ) using asymptotic properties of estimator (8).Let U ν be a sequence of finite populations of size N ν , indexed by ν.For each U ν , there is an associated non-probability sample s A,ν of size n A,ν .The population size N ν → ∞ and the sample size n A,ν → ∞ as ν → ∞.Further, the index ν is suppressed to simplify the notation.Consider the following regularity conditions [14]: C1 The population size N and the sample size n A satisfy lim N→∞ n A /N = f A ∈ (0, 1).C2 There exist c 1 and c 2 such that 0 < c 1 < Nπ A k /n A ≤ c 2 for all units k ∈ U .C3 The finite population and the propensity scores satisfy N −1 ∑ k∈U y 2 k = O(1), as well as Proposition 1.Under assumptions A1-A3 and regularity conditions C1-C3, and assuming the logistic regression model (7) for the propensity scores, estimator (8) is asymptotically unbiased, i.e., μIPW ), and an asymptotic variance of (8) can be derived as Proof.The proposition is actually the corollary of Theorem 1 in [14] for complete auxiliary data.An inspection of the proof of the latter theorem leads to simpler assumptions required for the proposition statement.
Using the asymptotic variance from Proposition 1, the variance of estimator ( 8) can be estimated by where given the non-probability sample s A .

Generalized Difference Estimator
Consider the combined sample s together with the sampling design p(•).If the outcome regression model ( 2) is not assumed to be linear, one can apply the generalized difference estimator [16] μGD = 1 to estimate the population mean µ.The estimator exploits the available complete auxiliary data.Here, β can be the quasi-maximum likelihood estimator of the regression parameter β based on the dataset {(y k , x k ), k ∈ s} by [20].
A design-consistent estimator of the variance V p ( μGD ) is provided in [16] as

Doubly Robust Estimator
Let us use only the non-probability sample s A .A drawback of the IPW estimator (8) may be its sensitivity to a misspecified model for the propensity scores, especially if some units have very small values in πA k [10].The efficiency and robustness of the IPW estimator can be improved by incorporating outcome regression model (2), also called the prediction model.The doubly robust (DR) estimator for the mean µ is [14] μDR = 1 Due to the so-called model transportability implied by the assumption A1 [21], model (2), fitted using standard methods based on the dataset {(y k , x k ), k ∈ s A }, can be applied to compute the predicted values m(x k , β) for all k ∈ U used in (12).A doubly robust estimator ( 12) is an analog of the generalized difference estimator (10).
We turn to the estimation of variance V q ( μDR ).Let β 0 be the unknown true value of parameter β in the prediction model.Consider additional regularity conditions imposed on the mean function m(x, β) as given in [14]: C4 For each x, ∂m(x, β)/∂β is continuous in β and |∂m(x, β)/∂β| ≤ h(x, β) for β in the neighborhood of β 0 , and for β in the neighborhood of β 0 , and Proposition 2. Estimator (12) is doubly robust in the sense that it is a consistent estimator of the mean µ if either the propensity score model or the prediction model is correctly specified.Under assumptions A1-A3 and regularity conditions C1-C5, and assuming correctly specified logistic regression model (7) for the propensity scores, an asymptotic variance of (12) can be derived as Proof.The proposition follows from Theorem 2 of [14] for the complete auxiliary data.
Remark 1. Regularity conditions C4-C5 are redundant for the linear outcome regression model.
Using asymptotic variance (13), a simple plug-in variance estimator for doubly robust estimator (12) is where given the non-probability sample s A .

Composite Estimators
We linearly combine design-based post-stratified estimators (3) and ( 10) based on the combined sample s with, correspondingly, model-based IPW and DR estimators supported only by the non-probability sample s A .We consider two composite estimators and of population mean (1).Similar combinations are investigated in [13].Here, the quantities λ1 and λ2 estimate optimal coefficients of the combinations, ignoring covariance terms since the estimators we combine, in principle, do not have common sources of randomness.Compositions (15) and ( 16) give more weight to the estimators with smaller variances.The respective variance estimators are and The interpretation of variance estimators ( 17) and ( 18) is that the variances of the designbased estimators may be reduced by the factors λ1 and λ2 , respectively.

Motivation
In 2020, the COVID-19 pandemic highlighted the need to promptly produce statistical information from national statistical institutions.This led Statistics Lithuania (the State Data Agency) to take on a new role as the governing organization for state data, forming a unified database of the main state registers and information systems with a vast amount of data that are ready to be used for statistical purposes.Therefore, Statistics Lithuania was able to carry out the following 2021 census based on administrative data from these registers and information systems: residents, real estate, address registers, and the State Social Insurance Fund Board (Sodra) database, among others.
However, as some variables of interest could not be obtained from any administrative source, a statistical survey for such data collection had to be launched.Hence, a statistical survey on population by ethnicity, native language, and religion was conducted in 2021.It aimed to evaluate population proportions for the following variables: religion professed (16 categories), mother tongue (more than 12 categories), knowledge of other languages (16 languages), and ethnicity.For the latter variable, mass imputation was used since relevant information was known from the Ethnicity Register for approximately 87% of the census population.The research was conducted to achieve the objective of efficiently estimating these proportions by exploiting complete data from the previous censuses and other auxiliary information.

Sample Selection
The survey sample s ⊂ U was drawn and consisted of three parts, At first, a voluntary online survey was carried out from 15 January to 28 February, 2021, which allowed for the collection of statistical data from approximately 2% of the census population (about 54,000 respondents), resulting in the non-probability sample s A .(ii) After the end of the online survey, a sampling frame for probability sampling was constructed.It excluded certain addresses, e.g., if at least one individual from the address participated in the online survey, if it was an institution, if more than 15 individuals were permanent residents, among other rules.These units, which were not included in the sampling frame, comprised the part s O of the sample s. (iii) Lastly, the probability sample s B was drawn from the sampling frame U \{s A ∪ s O }, which was divided into H = 113 strata according to the municipality intersected with the area of residence, i.e., urban or rural.The number of addresses sampled from a particular stratum was proportional to the size of the stratum, resulting in around 40,000 addresses sampled from the Population Register in total; approximately 6% of the census population was interviewed through the telephone survey (about 171,000 respondents).
The working sampling design p(•) is characterized by the inclusion probabilities: where N h denotes the size of the hth stratum, n h is the number of addresses selected, and m k is the number of individuals in the corresponding address.The sample part s O is treated as a separate post-stratum.

Imputation of Missing Values
The response rate in the probability sample s B reached approximately 88%.Missing values in the whole sample s were filled in using three imputation methods: historical, deductive, and k-nearest neighbor.
Missing values were first filled in using historical information from the 2011 and 2001 censuses consecutively, as variables of interest were fully known for the populations of those censuses.The remaining missing values accounted for 2.3% of the sample.
Additional sociodemographic characteristics of previous and current censuses, such as age, gender, marital status, household structure, country of birth, citizenship, education, and employment status, were used for further deductive imputation.For instance, if the same religion was observed for each household member except one, the corresponding religion was imputed where missing.After the deductive imputation, only 0.3% of the sample remained with missing values.
Eventually, the remaining missing values in the sample were filled in by applying the k-nearest neighbor method [22].

Application to Religion Proportions
We focus on the non-probability sample integration for the estimation of religion proportions as the results are similar for every proportion of interest.
When we obtained the non-probability sample from the online survey, the question arose if the collected data could be used for estimation.We first checked the representativeness of the voluntary sample using sociodemographic characteristics known for the entire 2021 census population.A comparison of some proportions of sociodemographic characteristics between the voluntary sample and the whole population showed the biased nature of the non-probability sample.The results provided in Table 1 suggest that people with higher education, as well as those who are employed and married, tend to participate in such online surveys.Another interesting observation made was the willingness of some ethnic communities to participate in the online survey and represent their community.For instance, Polish people in Lithuania accounted for 35% of the voluntary sample but only about 7% of the whole population.Additionally, we compared the religion proportions of 2011 religion in the 2021 census population for the online survey respondents and the entire population; see Table 2.It was observed that the representatives of smaller religious communities were more likely to participate in the survey.For instance, the proportion of the Karaites religious community in the voluntary sample was 1307% larger than the corresponding proportion in the population.The sociodemographic variables of Table 1 and the religion variables of Table 2 contain information that can explain the chance of being selected in the voluntary sample.Hence, these variables were used as covariates in the propensity score model.
To estimate the religion proportions in the 2021 census population, we first considered the post-stratified generalized regression and generalized difference estimators given by ( 3) and (10), respectively.The calibrated weights in (4) were calculated by taking such auxiliary information as binary variables on age groups, gender, and religions professed in 2011 intersected with counties in the calibration equations, while the same auxiliary variables as in the propensity score model were used for estimator (10).Comparing the results of the post-stratified calibrated estimator with the proportions of the previous censuses in Table 3 (and based on external evaluations), the estimator μGR tends to underestimate smaller religious communities due to the lack of data.On the other hand, the generalized difference estimator μGD seems to produce slightly higher estimates for the majority of these smaller religions.As we observed relatively more representatives of minor religions in the voluntary sample (see Table 2), we expected smaller variances of the estimators based only on this non-probability sample with a condition that a selection bias is properly corrected.The IPW estimator is designed to correct such bias by incorporating the propensity scores evaluated using the auxiliary variables of Tables 1 and 2.
We integrated the non-probability sample through the combination μC1 of the poststratified generalized regression (calibrated) and IPW estimators.According to Table 4, it seems that the first composite estimator corrected the underestimation.Alternatively, we considered the combination μC2 of the generalized difference estimator with its analog DR estimator based on the auxiliary variables of Tables 1 and 2. The second composite estimator seems to have produced even higher estimates for smaller religious communities; however, it also came with larger variances (see Table 5).
Nevertheless, as the calibrated and generalized difference estimators seemed to evaluate larger proportions of interest accurately, they were used in compositions (15) and (16) with weights equal to 1.That is, for religions None, Not indicated, and Roman Catholics, the proportions were evaluated using only the design-based calibrated or generalized difference estimators.Table 5 provides comparisons of the relative percent difference between (i) the smoothed version of variance estimator (5) and variance estimator ( 17), (ii) the smoothed version of variance estimator (11) and variance estimator (18), and (iii) variance estimators (17) and (18).We used smoothed variances ψGRs and ψGDs instead of ψGR and ψGD in compositions (15) and ( 16), respectively, due to the estimation of very small proportions.For the smoothing of variance ( 5), similarly as in [23], we assumed that V p ( μGR ) ≈ K N γ , with N as the sum of values of 2011 variable of interest in the 2021 census population.Parameters K > 0 and γ ∈ R were evaluated through a log-log regression, using the data pairs ( ψGR , N) calculated from all categories of the variable of interest.The smoothing of variance (11) was performed analogously.The compositions given by ( 15) and ( 16) assign more weight to estimators with a smaller variance.The first two numerical columns in Table 5, which correspond to cases (i)-(ii), provide an indication of how much composite estimators can improve the estimation accuracy of the calibrated and generalized difference estimators, respectively.The last column of the table includes the relative percent differences between the variance estimates of composite estimators μC1 and μC2 .The first composite estimator gives more satisfactory results for the proportions of smaller religious communities.However, the second composite estimator seems to correct the estimation accuracy of proportions of larger religious groups.

Discussion
The generalized regression and generalized difference estimators considered here are traditional model-assisted estimators based on the union of the non-probability and probability samples, together with the probability sampling design p(•).Since the selection bias is corrected through the known inclusion probabilities, the application of these estimators may lead to valid design-based inferences.However, the latter estimators incorporate unweighted units of the non-probability sample, which does not add more efficiency unless the size of the non-probability sample is of the same magnitude as the population size [12].In the application to the Lithuanian census, the voluntary sample covers about 2% of the population, and such a contribution is too small.
Meanwhile, IPW and DR estimators exploit the non-probability sample in a more advanced way, i.e., through the propensity score and prediction models, and we may benefit from combining them with traditional design-based estimators.There are various ways to combine different data sources and estimators [24]; however, we choose to optimize the linear combinations of estimators by taking into account their estimated variances.This approach appears to be efficient in the application when estimating smaller population proportions that require larger sample sizes.Indeed, the first composite estimator improves the estimation accuracy for proportions of smaller religious communities as the estimated variances of the composition are smaller than the estimated variances of the generalized regression estimator by up to 19%.Alternatively, we consider the combination of the generalized difference estimator with the DR estimator as it allows us to better leverage the available complete auxiliary data through the prediction model.Surprisingly, while this composite estimator improves the estimation accuracy for proportions of larger religious communities, it does not give satisfactory results for smaller religions.
In the application, the detailed data available from administrative registers and previous complete censuses allowed us to efficiently apply the model-based IPW and DR estimators integrating the non-probability sample.In regard to future censuses, it is worth considering the possibility of collecting much larger non-probability samples by promoting voluntary participation more.
Our study is based on a strong MAR assumption for the variable of interest in the propensity score model.Although the analysis of the Lithuanian census data allowed us to identify variables that clearly explain voluntary survey participation (and, thus, the propensity score model may be specified quite well), this assumption might be relaxed in future research.In this way, future investigations could explore data integration under the assumption of the non-ignorable selection mechanism.

Conclusions
The non-probability and probability samples can be combined into a single sample and used in design-based post-stratified estimators of parameters.This approach is a safe but inefficient way to exploit the non-probability sample.
The post-stratified estimators are, therefore, linearly combined with the model-based estimators based only on the non-probability sample.By applying the estimated optimal combinations to the Lithuanian census survey, there is a significant improvement in estimating the population proportions.
The success of integrating a voluntary sample into a census survey depends on the availability of proper auxiliary information, such as complete data from previous censuses.In the future, such information could be obtained by collecting much larger non-probability samples.In addition, it would then be possible to forego the selection of a probability sample.

Table 1 .
Comparison of proportions of some sociodemographic characteristics in the voluntary sample and the whole population.

Table 2 .
Comparison of religion proportions in the voluntary sample and the whole population.

Table 4 .
Religion proportion estimates in 2021 census population.

Table 5 .
Comparison of the relative difference (in %): )