Article

An Approach to Integrating a Non-Probability Sample in the Population Census

by Ieva Burakauskaitė and Andrius Čiginas *
Institute of Data Science and Digital Technologies, Vilnius University, Akademijos Str. 4, LT-08412 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(8), 1782; https://doi.org/10.3390/math11081782
Submission received: 6 March 2023 / Revised: 30 March 2023 / Accepted: 4 April 2023 / Published: 8 April 2023
(This article belongs to the Special Issue Nonparametric Statistical Methods and Their Applications)

Abstract
Population censuses are increasingly using administrative information and sampling as alternatives to collecting detailed data from individuals. Non-probability samples can also be an additional, relatively inexpensive data source, although they require special treatment. In this paper, we consider methods for integrating a non-representative volunteer sample into a population census survey, where the complementary probability sample is drawn from the rest of the population. We investigate two approaches to correcting non-probability sample selection bias: adjustment using propensity scores, which models participation in the voluntary sample, and doubly robust estimation, which remains valid under possible misspecification of the latter model. We combine the estimators of population parameters that correct the selection bias with the estimators based on a representative union of both samples. Our analysis shows that the availability of detailed auxiliary information simplifies the applied estimation procedures, which are efficient in the Lithuanian census survey. Our findings also reveal the biased nature of the non-probability sample. For instance, when estimating the proportions of professed religions, smaller religious communities exhibit a higher participation rate than other groups. The combination of estimators corrects such selection bias. Our methodology for combining the voluntary and probability samples can be applied to other sample surveys.

1. Introduction

Population censuses are traditionally understood as large-scale surveys conducted once every ten years, where all individuals provide their data through census questionnaires. However, such data collections are expensive and require a lot of other resources. Therefore, new ways of conducting population censuses are being discussed more frequently, particularly as administrative data become more accessible [1,2,3,4]. Lithuania is an example, as Statistics Lithuania (the State Data Agency) has conducted the Population and Housing Census 2021 primarily based on administrative data from state registers and information systems.
Unfortunately, certain census variables, such as professed religion or mother tongue, cannot be derived from administrative sources. For these variables, a sample survey could be a compromise that supports the idea of optimizing the population census. Probability sampling methods, along with sample design-based inference, are an accepted approach to surveying finite populations in many areas of statistics [5], particularly in official statistics. Probability samples are also being used in censuses, as seen in [6]. On the other hand, the use of alternative data sources, such as big data or non-probability samples, has been studied extensively in recent years since these data are cheaper and much easier to obtain [7,8,9,10]. However, unlike with probability samples, the inclusion mechanisms into non-probability samples are typically unknown; therefore, inclusion probabilities need to be estimated to correct the sample selection bias. Even a very large non-probability sample may lead to worse estimation results than a small probability sample if the sample selection bias is not taken into account [11]. To our knowledge, non-probability samples have not yet been directly used in censuses, including contemporary corrections of their biases. We present the results of our research on the integration of such a non-probability sample in a population census, which might be useful for other researchers and practitioners working with censuses and sample surveys.
We considered the sampling framework created in the survey of the Lithuanian census: firstly, the data were collected through the voluntary (non-probability) sample and then the probability sample was drawn from the rest of the census population. Our scenario is similar to the one considered in [12,13] but differs from another one often provided in the literature, where the study variables were assumed to be unobserved in a probability sample [14,15]. Moreover, we had access to complete auxiliary information from administrative sources and previous censuses, which may not have always been the case in other surveys. This enabled us to simplify the estimation procedures used in [14] and apply non-probability sample integration similar to [12,13].
Our goal is to efficiently combine both the non-probability and probability samples to estimate the parameters of interest. In Section 2.3.1, we consider a natural post-stratified generalized regression (calibrated) estimator of population means, as described in [12]. This estimator is based on the union of the probability and voluntary samples, with inclusion probabilities initially set to one for units in the latter sample. In Section 2.3.2, we explore an alternative to the post-stratified estimator, as presented in [14], which is the inverse probability weighting estimator based on estimated inclusion probabilities (propensity scores) for the non-probability sample. We adapt the methodology of [14] to our framework to derive the variance estimation formula for this estimator. Finally, we combine both estimators in Section 2.3.5 by taking into account their estimated variances. In Section 2.3.4, we investigate the doubly robust estimator for the non-probability sample, which provides protection against possible misspecification of the propensity score model [14]. This estimator incorporates model-based prediction estimators for the parameters of interest and exploits complete auxiliary information. We combine the doubly robust estimator with an analogous generalized difference estimator from [16], which we describe in Section 2.3.3. Our aim is to determine which combination works best, at least in our application to the population census.
The application of the considered estimators to the Lithuanian census survey is elaborated on in Section 3. The results of the study and some future insights are discussed in Section 4. The most relevant conclusions are outlined in Section 5. In summary, our findings, supported by the analysis of real census data, show that it is possible to benefit from the voluntary sample, especially if the estimators based on it are combined with those using the probability sample and good auxiliary information is available.

2. Methods

2.1. Sampling from the Finite Population

We consider any continuous or binary study variable $y$ with the fixed values $y_1, \ldots, y_N$ in a finite population $U = \{1, \ldots, N\}$ of size $N$. We estimate the population mean or proportion
$$\mu = \frac{1}{N} \sum_{k \in U} y_k. \tag{1}$$
To estimate population parameter (1), assume that, at first, a non-probability sample $s_A$ of size $n_A$ is obtained from $U$, and a sample $s_B$ of size $n_B$ is afterward drawn, according to any probability sampling design without replacement, from the rest of the survey population $U \setminus \{s_A\}$. We can then interpret the union of both samples $s = s_A \cup s_B$ of size $n = n_A + n_B$ as drawn according to the probability sampling design $p(\cdot)$ with inclusion probabilities $\pi_k = P_p\{k \in s\} > 0$, $k \in U$, where we set $\pi_k = 1$ if $k \in s_A$. Hereinafter, we use the notation $P_p$, $E_p$, and $V_p$ to denote the probability, expectation, and variance, respectively, calculated according to the randomness induced by $p(\cdot)$. We write $d_k = 1/\pi_k$ to denote the sampling weights.

2.2. Auxiliary Data and Outcome Regression Model

We associate with the unit $k \in U$ the values $\mathbf{x}_k$ of the auxiliary variables $\mathbf{x}$, and assume that these values are known for all population units. Hence, the complete auxiliary information is available.
Suppose that the relationship between the variables $y$ and $\mathbf{x}$ can be described by a semiparametric outcome regression model $\xi$:
$$E_\xi(y_k \mid \mathbf{x}_k) = m(\mathbf{x}_k, \boldsymbol{\beta}) \quad \text{and} \quad V_\xi(y_k \mid \mathbf{x}_k) = v_k^2 \sigma^2, \quad k \in U, \tag{2}$$
where $\boldsymbol{\beta}$ and $\sigma^2$ are unknown parameters, $v_k = v(\mathbf{x}_k)$ is a known function of $\mathbf{x}_k$, and $m(\mathbf{x}_k, \boldsymbol{\beta})$ has a known form as well, for example, $m(\mathbf{x}_k, \boldsymbol{\beta}) = \mathbf{x}_k^\top \boldsymbol{\beta}$. Here, $E_\xi$ and $V_\xi$ denote the expectation and variance with respect to the model $\xi$. We assume (without any loss of generality) that 1 is the first component of the vector $\mathbf{x}_k$ for all $k \in U$.

2.3. Estimation of Population Parameters

2.3.1. Post-Stratified Generalized Regression Estimator

Let us consider the combined sample $s$ with the accompanying probability sampling design $p(\cdot)$. Taking $m(\mathbf{x}_k, \boldsymbol{\beta}) = \mathbf{x}_k^\top \boldsymbol{\beta}$ in (2), we have the linear regression model, which is used to build the generalized regression estimator [17]
$$\hat{\mu}_{\mathrm{GR}} = \frac{1}{N} \sum_{k \in s} d_k y_k + \left( \frac{1}{N} \sum_{k \in U} \mathbf{x}_k - \frac{1}{N} \sum_{k \in s} d_k \mathbf{x}_k \right)^{\top} \hat{\mathbf{B}} \tag{3}$$
of (1), where
$$\hat{\mathbf{B}} = \left( \sum_{k \in s} \frac{d_k \mathbf{x}_k \mathbf{x}_k^\top}{c_k} \right)^{-1} \sum_{k \in s} \frac{d_k \mathbf{x}_k y_k}{c_k}$$
with positive constants $c_k$, for instance, $c_k = v_k^2$. The quantity $\hat{\mathbf{B}}$ estimates the population characteristic
$$\mathbf{B} = \left( \sum_{k \in U} \frac{\mathbf{x}_k \mathbf{x}_k^\top}{c_k} \right)^{-1} \sum_{k \in U} \frac{\mathbf{x}_k y_k}{c_k},$$
which is, in turn, the generalized least squares estimator of $\boldsymbol{\beta}$ of the linear regression model, which is also called the assisting model. Estimator (3) is equivalent to the calibrated estimator [18]
$$\hat{\mu}_{\mathrm{GR}} = \frac{1}{N} \sum_{k \in s} w_k y_k, \tag{4}$$
often used in practice, where the calibration weights $w_k$, $k \in s$, are chosen to minimize the distance function
$$\sum_{k \in s} \frac{c_k (w_k - d_k)^2}{d_k}$$
subject to the calibration equations
$$\sum_{k \in s} w_k \mathbf{x}_k = \sum_{k \in U} \mathbf{x}_k.$$
Estimator (3) is approximately design-unbiased, i.e., $E_p(\hat{\mu}_{\mathrm{GR}}) \approx \mu$. The generalized regression estimator (3) is also referred to as the post-stratified estimator in [12], with two post-strata, i.e., $s_A$ and $U \setminus \{s_A\}$. The authors of [12] argue that such estimation is efficient if the non-probability sample $s_A$ is very large.
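According to [17], a design-consistent estimator of the variance $V_p(\hat{\mu}_{\mathrm{GR}})$ is
$$\hat{\psi}_{\mathrm{GR}} = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \left( 1 - \frac{\pi_k \pi_l}{\pi_{kl}} \right) \frac{(y_k - \mathbf{x}_k^\top \hat{\mathbf{B}})(y_l - \mathbf{x}_l^\top \hat{\mathbf{B}})}{\pi_k \pi_l}, \tag{5}$$
where $\pi_{kl} = P_p\{k, l \in s\} > 0$ are the second-order inclusion probabilities.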
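The equivalence between the regression form (3) and the calibrated form (4) can be checked numerically. The following is a minimal sketch, assuming $c_k = 1$ and a toy population of six units; the helper names `solve` and `greg_mean` and all data values are hypothetical and are not taken from the census application.

```python
def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def greg_mean(y_s, X_s, d_s, X_U):
    """Return the GREG estimate of the mean and the calibration weights (c_k = 1)."""
    N, p = len(X_U), len(X_U[0])
    # T = sum_s d_k x_k x_k'  and  t = sum_s d_k x_k y_k
    T = [[sum(d * x[i] * x[j] for d, x in zip(d_s, X_s)) for j in range(p)]
         for i in range(p)]
    t = [sum(d * x[i] * y for d, x, y in zip(d_s, X_s, y_s)) for i in range(p)]
    B_hat = solve(T, t)
    # Known population totals of x minus their design-weighted sample estimates.
    diff = [sum(x[i] for x in X_U) - sum(d * x[i] for d, x in zip(d_s, X_s))
            for i in range(p)]
    ht_total = sum(d * y for d, y in zip(d_s, y_s))
    mu_gr = (ht_total + sum(diff[i] * B_hat[i] for i in range(p))) / N
    # Calibration weights w_k = d_k (1 + diff' T^{-1} x_k) reproduce the x totals.
    lam = solve(T, diff)
    w = [d * (1.0 + sum(lam[i] * x[i] for i in range(p))) for d, x in zip(d_s, X_s)]
    return mu_gr, w


# Toy population of N = 6 units with x_k = (1, z_k); y_k = 1 + 2 z_k for illustration.
X_U = [[1.0, float(z)] for z in (1, 2, 3, 4, 5, 6)]
# Combined sample: two volunteers (pi_k = 1) and two units drawn with pi_k = 0.5.
X_s = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 6.0]]
y_s = [1 + 2 * x[1] for x in X_s]
d_s = [1.0, 1.0, 2.0, 2.0]

mu_gr, w = greg_mean(y_s, X_s, d_s, X_U)
print(mu_gr, w)
```

Because the toy study variable is exactly linear in the auxiliary variable, the estimate coincides with the true population mean, and the weighted sample totals of the auxiliary variables match the population totals.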

2.3.2. Inverse Probability Weighting Estimator Based on the Propensity Score Model

Let us consider only the non-probability sample $s_A$. The non-probability sample itself does not represent the target population, and naive estimators based on it are typically biased [11]. The main obstacle is the unknown selection mechanism for a unit to be included in the sample.
Let $R_k = I(k \in s_A)$ be the indicator variable for a unit $k \in U$ being selected into the sample $s_A$. The probabilities
$$\pi_k^A = E_q(R_k \mid \mathbf{x}_k, y_k) = P_q(R_k = 1 \mid \mathbf{x}_k, y_k), \quad k \in U, \tag{6}$$
are called the propensity scores, where the subscript $q$ refers to the propensity score model. Probabilities (6) are analogous to the inclusion probabilities $\pi_k$ (for probability samples) since they describe the inclusion into the sample $s_A$. The propensity scores $\pi_k^A$, $k \in s_A$, need to be estimated before using them to weight the units of the non-probability sample.
The following assumptions are considered to simplify the propensity score model [14]:
A1. The indicator $R_k$ and the study variable $y_k$ are independent given the covariates $\mathbf{x}_k$.
A2. All units have a nonzero propensity score: $\pi_k^A > 0$ for all $k \in U$.
A3. The indicators $R_k$ and $R_l$ are independent, given $\mathbf{x}_k$ and $\mathbf{x}_l$ for $k \neq l$.
Due to assumption A1, we have $\pi_k^A = P_q(R_k = 1 \mid \mathbf{x}_k, y_k) = P_q(R_k = 1 \mid \mathbf{x}_k)$ for all $k \in U$, and the selection mechanism is called ignorable. It is similar to the notion of missing at random (MAR) used in missing data analyses [19].
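We model the propensity scores $\pi_k^A = P_q(R_k = 1 \mid \mathbf{x}_k)$ parametrically using the inverse logit function
$$\pi_k^A = \pi(\mathbf{x}_k, \boldsymbol{\theta}) = \frac{\exp(\mathbf{x}_k^\top \boldsymbol{\theta})}{1 + \exp(\mathbf{x}_k^\top \boldsymbol{\theta})}, \tag{7}$$
where $\boldsymbol{\theta}$ is the model parameter with the unknown true value $\boldsymbol{\theta}_0$. The propensity score estimates $\hat{\pi}_k^A$ under the logistic regression model (7) are obtained as $\hat{\pi}_k^A = \pi(\mathbf{x}_k, \hat{\boldsymbol{\theta}})$, where $\hat{\boldsymbol{\theta}}$ is the maximum likelihood estimator, which maximizes the log-likelihood function
$$l(\boldsymbol{\theta}) = \sum_{k \in s_A} \log \frac{\pi(\mathbf{x}_k, \boldsymbol{\theta})}{1 - \pi(\mathbf{x}_k, \boldsymbol{\theta})} + \sum_{k \in U} \log \{ 1 - \pi(\mathbf{x}_k, \boldsymbol{\theta}) \} = \sum_{k \in s_A} \mathbf{x}_k^\top \boldsymbol{\theta} - \sum_{k \in U} \log \{ 1 + \exp(\mathbf{x}_k^\top \boldsymbol{\theta}) \}.$$
The maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ is found by solving the score equations
$$\mathbf{U}(\boldsymbol{\theta}) = \frac{\partial}{\partial \boldsymbol{\theta}} l(\boldsymbol{\theta}) = \sum_{k \in U} \{ R_k - \pi(\mathbf{x}_k, \boldsymbol{\theta}) \} \mathbf{x}_k = \mathbf{0}.$$
A conventional way to do this is to apply the Newton–Raphson iterative procedure.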
The estimated propensity scores $\hat{\pi}_k^A = \pi(\mathbf{x}_k, \hat{\boldsymbol{\theta}})$, $k \in s_A$, are then used to compute the inverse probability weighting (IPW) estimator [14]
$$\hat{\mu}_{\mathrm{IPW}} = \frac{1}{\hat{N}^A} \sum_{k \in s_A} \frac{y_k}{\hat{\pi}_k^A}, \quad \text{where} \quad \hat{N}^A = \sum_{k \in s_A} \frac{1}{\hat{\pi}_k^A}, \tag{8}$$
of the population mean $\mu$. This is an adaptation of the Hájek estimator used for probability samples. Estimator (8) can correct the sample selection bias efficiently if the propensity score model is well-specified.
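The fitting of the logistic propensity model over the whole population, followed by the Hájek-type IPW estimate, can be sketched as follows. This is only an illustration under assumed toy data: eight units, a single binary covariate, and hypothetical helper names `solve`, `propensity`, and `fit_propensity`. It relies on the covariates being known for every population unit, as in the setting described here.

```python
import math


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def propensity(x, theta):
    """Inverse logit of the linear predictor x' theta."""
    return 1.0 / (1.0 + math.exp(-sum(t * v for t, v in zip(theta, x))))


def fit_propensity(X_U, R, max_iter=25, tol=1e-10):
    """Newton-Raphson iterations for the logistic score equations over all of U."""
    p = len(X_U[0])
    theta = [0.0] * p
    for _ in range(max_iter):
        pi = [propensity(x, theta) for x in X_U]
        # Score vector: sum over U of (R_k - pi_k) x_k.
        score = [sum((r - pk) * x[i] for r, pk, x in zip(R, pi, X_U)) for i in range(p)]
        # Information matrix: sum over U of pi_k (1 - pi_k) x_k x_k'.
        info = [[sum(pk * (1 - pk) * x[i] * x[j] for pk, x in zip(pi, X_U))
                 for j in range(p)] for i in range(p)]
        step = solve(info, score)
        theta = [t + s for t, s in zip(theta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return theta


# Toy population: x_k = (1, g_k) with a binary group indicator g_k (hypothetical data).
X_U = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
R = [1, 0, 0, 0, 1, 1, 1, 0]   # participation indicators: 1 of 4 and 3 of 4 volunteers
y_A = [1, 0, 0, 1]             # study variable, observed only for the volunteers

theta_hat = fit_propensity(X_U, R)
pi_hat = [propensity(x, theta_hat) for x in X_U]
pi_A = [pk for pk, r in zip(pi_hat, R) if r == 1]

N_hat = sum(1.0 / pk for pk in pi_A)
mu_ipw = sum(y / pk for y, pk in zip(y_A, pi_A)) / N_hat
print(pi_hat, N_hat, mu_ipw)
```

In this saturated toy model, the fitted propensity scores equal the group participation rates, so the over-represented group is down-weighted relative to the naive sample mean.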
We construct the estimator of the variance $V_q(\hat{\mu}_{\mathrm{IPW}})$ using asymptotic properties of estimator (8). Let $U_\nu$ be a sequence of finite populations of size $N_\nu$, indexed by $\nu$. For each $U_\nu$, there is an associated non-probability sample $s_{A,\nu}$ of size $n_{A,\nu}$. The population size $N_\nu \to \infty$ and the sample size $n_{A,\nu} \to \infty$ as $\nu \to \infty$. Further, the index $\nu$ is suppressed to simplify the notation. Consider the following regularity conditions [14]:
C1. The population size $N$ and the sample size $n_A$ satisfy $\lim_{N \to \infty} n_A / N = f_A \in (0, 1)$.
C2. There exist $c_1$ and $c_2$ such that $0 < c_1 < N \pi_k^A / n_A \leq c_2$ for all units $k \in U$.
C3. The finite population and the propensity scores satisfy $N^{-1} \sum_{k \in U} y_k^2 = O(1)$, as well as $N^{-1} \sum_{k \in U} \| \mathbf{x}_k \|^3 = O(1)$, and $N^{-1} \sum_{k \in U} \pi_k^A (1 - \pi_k^A) \mathbf{x}_k \mathbf{x}_k^\top$ is a positive definite matrix.
Proposition 1.
Under assumptions A1–A3 and regularity conditions C1–C3, and assuming the logistic regression model (7) for the propensity scores, estimator (8) is asymptotically unbiased, i.e., $\hat{\mu}_{\mathrm{IPW}} - \mu = O_p(n_A^{-1/2})$, and an asymptotic variance of (8) can be derived as
$$V_q(\hat{\mu}_{\mathrm{IPW}}) = V_{\mathrm{IPW}} + o(n_A^{-1}),$$
where
$$V_{\mathrm{IPW}} = \frac{1}{N^2} \sum_{k \in U} (1 - \pi_k^A) \pi_k^A \left( \frac{y_k - \mu}{\pi_k^A} - \mathbf{b}^\top \mathbf{x}_k \right)^2$$
with $\pi_k^A = \pi(\mathbf{x}_k, \boldsymbol{\theta}_0) = \exp(\mathbf{x}_k^\top \boldsymbol{\theta}_0) / \{ 1 + \exp(\mathbf{x}_k^\top \boldsymbol{\theta}_0) \}$ and
$$\mathbf{b}^\top = \left\{ \sum_{k \in U} (1 - \pi_k^A)(y_k - \mu) \mathbf{x}_k^\top \right\} \left\{ \sum_{k \in U} \pi_k^A (1 - \pi_k^A) \mathbf{x}_k \mathbf{x}_k^\top \right\}^{-1}.$$
Proof.
The proposition is a corollary of Theorem 1 in [14] for complete auxiliary data. An inspection of the proof of the latter theorem leads to simpler assumptions required for the proposition statement. □
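Using the asymptotic variance from Proposition 1, the variance of estimator (8) can be estimated by
$$\hat{V}_{\mathrm{IPW}} = \frac{1}{(\hat{N}^A)^2} \sum_{k \in s_A} (1 - \hat{\pi}_k^A) \left( \frac{y_k - \hat{\mu}_{\mathrm{IPW}}}{\hat{\pi}_k^A} - \hat{\mathbf{b}}^\top \mathbf{x}_k \right)^2,$$
where
$$\hat{\mathbf{b}}^\top = \left\{ \sum_{k \in s_A} \left( \frac{1}{\hat{\pi}_k^A} - 1 \right) (y_k - \hat{\mu}_{\mathrm{IPW}}) \mathbf{x}_k^\top \right\} \left\{ \sum_{k \in U} \hat{\pi}_k^A (1 - \hat{\pi}_k^A) \mathbf{x}_k \mathbf{x}_k^\top \right\}^{-1},$$
given the non-probability sample $s_A$.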
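As a rough illustration of the plug-in variance estimator for the IPW estimator, the sketch below evaluates it for a small hypothetical data set in which the estimated propensity scores are simply taken as given (0.25 in one group and 0.75 in the other). The function names `solve` and `ipw_variance` and all numbers are assumptions made for the example.

```python
def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ipw_variance(y_A, X_A, pi_A, X_U, pi_U):
    """Plug-in variance estimate of the Hajek-type IPW mean."""
    p = len(X_U[0])
    N_hat = sum(1.0 / pk for pk in pi_A)
    mu = sum(y / pk for y, pk in zip(y_A, pi_A)) / N_hat
    # a = sum over s_A of (1 / pi_k - 1) (y_k - mu) x_k
    a = [sum((1.0 / pk - 1.0) * (y - mu) * x[i] for y, x, pk in zip(y_A, X_A, pi_A))
         for i in range(p)]
    # M = sum over U of pi_k (1 - pi_k) x_k x_k' (symmetric), so b solves M b = a.
    M = [[sum(pk * (1 - pk) * x[i] * x[j] for x, pk in zip(X_U, pi_U))
          for j in range(p)] for i in range(p)]
    b = solve(M, a)
    total = sum(
        (1 - pk) * ((y - mu) / pk - sum(bi * xi for bi, xi in zip(b, x))) ** 2
        for y, x, pk in zip(y_A, X_A, pi_A)
    )
    return total / N_hat ** 2


# Hypothetical population of 8 units with x_k = (1, g_k) and given propensity scores.
X_U = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
pi_U = [0.25] * 4 + [0.75] * 4
# Volunteers: one unit from group 0 and three units from group 1.
X_A = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
pi_A = [0.25, 0.75, 0.75, 0.75]
y_A = [1, 0, 0, 1]

v_ipw = ipw_variance(y_A, X_A, pi_A, X_U, pi_U)
print(v_ipw)
```

The second matrix is summed over the whole population because the auxiliary variables are known for all units; with a reference probability sample instead, that sum would have to be estimated.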

2.3.3. Generalized Difference Estimator

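Consider the combined sample $s$ together with the sampling design $p(\cdot)$. If the outcome regression model (2) is not assumed to be linear, one can apply the generalized difference estimator [16]
$$\hat{\mu}_{\mathrm{GD}} = \frac{1}{N} \left\{ \sum_{k \in s} d_k \left( y_k - m(\mathbf{x}_k, \hat{\boldsymbol{\beta}}) \right) + \sum_{k \in U} m(\mathbf{x}_k, \hat{\boldsymbol{\beta}}) \right\} \tag{10}$$
to estimate the population mean $\mu$. The estimator exploits the available complete auxiliary data. Here, $\hat{\boldsymbol{\beta}}$ can be the quasi-maximum likelihood estimator of the regression parameter $\boldsymbol{\beta}$ based on the dataset $\{ (y_k, \mathbf{x}_k), k \in s \}$, following [20].
A design-consistent estimator of the variance $V_p(\hat{\mu}_{\mathrm{GD}})$ is provided in [16] as
$$\hat{\psi}_{\mathrm{GD}} = \frac{1}{N^2} \sum_{k \in s} \sum_{l \in s} \left( 1 - \frac{\pi_k \pi_l}{\pi_{kl}} \right) \frac{\left( y_k - m(\mathbf{x}_k, \hat{\boldsymbol{\beta}}) \right) \left( y_l - m(\mathbf{x}_l, \hat{\boldsymbol{\beta}}) \right)}{\pi_k \pi_l}. \tag{11}$$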

2.3.4. Doubly Robust Estimator

Let us use only the non-probability sample $s_A$. A drawback of the IPW estimator (8) may be its sensitivity to a misspecified model for the propensity scores, especially if some units have very small values of $\hat{\pi}_k^A$ [10]. The efficiency and robustness of the IPW estimator can be improved by incorporating outcome regression model (2), also called the prediction model. The doubly robust (DR) estimator for the mean $\mu$ is [14]
$$\hat{\mu}_{\mathrm{DR}} = \frac{1}{\hat{N}^A} \sum_{k \in s_A} \frac{y_k - m(\mathbf{x}_k, \hat{\boldsymbol{\beta}})}{\hat{\pi}_k^A} + \frac{1}{N} \sum_{k \in U} m(\mathbf{x}_k, \hat{\boldsymbol{\beta}}). \tag{12}$$
Due to the so-called model transportability implied by the assumption A1 [21], model (2), fitted using standard methods based on the dataset $\{ (y_k, \mathbf{x}_k), k \in s_A \}$, can be applied to compute the predicted values $m(\mathbf{x}_k, \hat{\boldsymbol{\beta}})$ for all $k \in U$ used in (12). The doubly robust estimator (12) is an analog of the generalized difference estimator (10).
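The parallel structure of the generalized difference and doubly robust estimators can be made concrete with a short sketch. Both take the predicted values $m(\mathbf{x}_k, \hat{\boldsymbol{\beta}})$ as given inputs; the function names `gd_mean` and `dr_mean`, the predictions, and the data are hypothetical and serve only to show the arithmetic.

```python
def gd_mean(y_s, m_s, d_s, m_U):
    """Generalized difference estimate: design-weighted residual total plus
    the population total of predictions, divided by N."""
    N = len(m_U)
    residual_total = sum(d * (y - m) for y, m, d in zip(y_s, m_s, d_s))
    return (residual_total + sum(m_U)) / N


def dr_mean(y_A, m_A, pi_A, m_U):
    """Doubly robust estimate: propensity-weighted mean residual of the
    volunteers plus the population mean of predictions."""
    N_hat = sum(1.0 / pk for pk in pi_A)
    correction = sum((y - m) / pk for y, m, pk in zip(y_A, m_A, pi_A)) / N_hat
    return correction + sum(m_U) / len(m_U)


# Hypothetical predictions for a population of N = 6 units.
m_U = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

# Combined sample: two volunteers (d_k = 1) and two probability-sample units (d_k = 2).
y_s = [3.0, 5.0, 9.0, 14.0]
m_s = [3.0, 5.0, 9.0, 13.0]
d_s = [1.0, 1.0, 2.0, 2.0]
mu_gd = gd_mean(y_s, m_s, d_s, m_U)

# Volunteers only, with estimated propensity scores taken as given.
y_A = [1.0, 0.0, 0.0, 1.0]
pi_A = [0.25, 0.75, 0.75, 0.75]
mu_dr_no_model = dr_mean(y_A, [0.0] * 4, pi_A, [0.0] * 8)   # reduces to the IPW estimate
mu_dr_perfect = dr_mean(y_A, y_A, pi_A, [0.5] * 8)          # reduces to the mean prediction
print(mu_gd, mu_dr_no_model, mu_dr_perfect)
```

The two limiting cases at the end hint at the double robustness: with no prediction model the estimate falls back to IPW, while with perfectly fitting predictions the propensity scores no longer matter.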
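We turn to the estimation of the variance $V_q(\hat{\mu}_{\mathrm{DR}})$. Let $\boldsymbol{\beta}_0$ be the unknown true value of parameter $\boldsymbol{\beta}$ in the prediction model. Consider additional regularity conditions imposed on the mean function $m(\mathbf{x}, \boldsymbol{\beta})$ as given in [14]:
C4. For each $\mathbf{x}$, $\partial m(\mathbf{x}, \boldsymbol{\beta}) / \partial \boldsymbol{\beta}$ is continuous in $\boldsymbol{\beta}$ and $| \partial m(\mathbf{x}, \boldsymbol{\beta}) / \partial \boldsymbol{\beta} | \leq h(\mathbf{x}, \boldsymbol{\beta})$ for $\boldsymbol{\beta}$ in the neighborhood of $\boldsymbol{\beta}_0$, and $N^{-1} \sum_{k \in U} h(\mathbf{x}_k, \boldsymbol{\beta}_0) = O(1)$.
C5. For each $\mathbf{x}$, $\partial^2 m(\mathbf{x}, \boldsymbol{\beta}) / \partial \boldsymbol{\beta} \partial \boldsymbol{\beta}^\top$ is continuous in $\boldsymbol{\beta}$ and $\max_{i,j} | \partial^2 m(\mathbf{x}, \boldsymbol{\beta}) / \partial \beta_i \partial \beta_j | \leq k(\mathbf{x}, \boldsymbol{\beta})$ for $\boldsymbol{\beta}$ in the neighborhood of $\boldsymbol{\beta}_0$, and $N^{-1} \sum_{k \in U} k(\mathbf{x}_k, \boldsymbol{\beta}_0) = O(1)$.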
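Proposition 2.
Estimator (12) is doubly robust in the sense that it is a consistent estimator of the mean $\mu$ if either the propensity score model or the prediction model is correctly specified. Under assumptions A1–A3 and regularity conditions C1–C5, and assuming a correctly specified logistic regression model (7) for the propensity scores, an asymptotic variance of (12) can be derived as
$$V_q(\hat{\mu}_{\mathrm{DR}}) = V_{\mathrm{DR}} + o(n_A^{-1}),$$
where
$$V_{\mathrm{DR}} = \frac{1}{N^2} \sum_{k \in U} (1 - \pi_k^A) \pi_k^A \left( \frac{y_k - m(\mathbf{x}_k, \boldsymbol{\beta}_0) - h_N}{\pi_k^A} - \mathbf{b}_{\mathrm{DR}}^\top \mathbf{x}_k \right)^2$$
with $\pi_k^A = \pi(\mathbf{x}_k, \boldsymbol{\theta}_0) = \exp(\mathbf{x}_k^\top \boldsymbol{\theta}_0) / \{ 1 + \exp(\mathbf{x}_k^\top \boldsymbol{\theta}_0) \}$ and
$$\mathbf{b}_{\mathrm{DR}}^\top = \left\{ \sum_{k \in U} (1 - \pi_k^A) \left( y_k - m(\mathbf{x}_k, \boldsymbol{\beta}_0) - h_N \right) \mathbf{x}_k^\top \right\} \left\{ \sum_{k \in U} \pi_k^A (1 - \pi_k^A) \mathbf{x}_k \mathbf{x}_k^\top \right\}^{-1}, \quad h_N = \frac{1}{N} \sum_{k \in U} \left( y_k - m(\mathbf{x}_k, \boldsymbol{\beta}_0) \right).$$
Proof.
The proposition follows from Theorem 2 of [14] for the complete auxiliary data. □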
Remark 1.
Regularity conditions C4–C5 are redundant for the linear outcome regression model.
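Using asymptotic variance (13), a simple plug-in variance estimator for doubly robust estimator (12) is
$$\hat{V}_{\mathrm{DR}} = \frac{1}{(\hat{N}^A)^2} \sum_{k \in s_A} (1 - \hat{\pi}_k^A) \left( \frac{y_k - m(\mathbf{x}_k, \hat{\boldsymbol{\beta}}) - \hat{h}_N}{\hat{\pi}_k^A} - \hat{\mathbf{b}}_{\mathrm{DR}}^\top \mathbf{x}_k \right)^2,$$
where
$$\hat{\mathbf{b}}_{\mathrm{DR}}^\top = \left\{ \sum_{k \in s_A} \left( \frac{1}{\hat{\pi}_k^A} - 1 \right) \left( y_k - m(\mathbf{x}_k, \hat{\boldsymbol{\beta}}) - \hat{h}_N \right) \mathbf{x}_k^\top \right\} \left\{ \sum_{k \in U} \hat{\pi}_k^A (1 - \hat{\pi}_k^A) \mathbf{x}_k \mathbf{x}_k^\top \right\}^{-1}, \quad \hat{h}_N = \frac{1}{\hat{N}^A} \sum_{k \in s_A} \frac{y_k - m(\mathbf{x}_k, \hat{\boldsymbol{\beta}})}{\hat{\pi}_k^A},$$
given the non-probability sample $s_A$.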

2.3.5. Composite Estimators

We linearly combine design-based post-stratified estimators (3) and (10) based on the combined sample $s$ with, correspondingly, model-based IPW and DR estimators supported only by the non-probability sample $s_A$. We consider two composite estimators
$$\hat{\mu}_{C1} = \hat{\lambda}_1 \hat{\mu}_{\mathrm{GR}} + (1 - \hat{\lambda}_1) \hat{\mu}_{\mathrm{IPW}} \quad \text{with} \quad \hat{\lambda}_1 = \frac{\hat{V}_{\mathrm{IPW}}}{\hat{\psi}_{\mathrm{GR}} + \hat{V}_{\mathrm{IPW}}} \tag{15}$$
and
$$\hat{\mu}_{C2} = \hat{\lambda}_2 \hat{\mu}_{\mathrm{GD}} + (1 - \hat{\lambda}_2) \hat{\mu}_{\mathrm{DR}} \quad \text{with} \quad \hat{\lambda}_2 = \frac{\hat{V}_{\mathrm{DR}}}{\hat{\psi}_{\mathrm{GD}} + \hat{V}_{\mathrm{DR}}} \tag{16}$$
of population mean (1). Similar combinations are investigated in [13]. Here, the quantities $\hat{\lambda}_1$ and $\hat{\lambda}_2$ estimate optimal coefficients of the combinations, ignoring covariance terms since the estimators we combine, in principle, do not have common sources of randomness. Compositions (15) and (16) give more weight to the estimators with smaller variances. The respective variance estimators are
$$\hat{V}_{C1} = \hat{\lambda}_1 \hat{\psi}_{\mathrm{GR}} \tag{17}$$
and
$$\hat{V}_{C2} = \hat{\lambda}_2 \hat{\psi}_{\mathrm{GD}}. \tag{18}$$
The interpretation of variance estimators (17) and (18) is that the variances of the design-based estimators may be reduced by the factors $\hat{\lambda}_1$ and $\hat{\lambda}_2$, respectively.
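The arithmetic of the composite estimators and their variance estimators is simple enough to sketch directly. The function name `composite` and the input numbers below are hypothetical; they only illustrate how the weight shifts toward the estimator with the smaller estimated variance.

```python
def composite(mu_design, var_design, mu_model, var_model):
    """Combine a design-based and a model-based estimate of the same mean.

    The weight lam on the design-based estimate is the share of the
    model-based variance in the sum of both variances, so the estimate with
    the smaller variance receives more weight.
    """
    lam = var_model / (var_design + var_model)
    mu_c = lam * mu_design + (1.0 - lam) * mu_model
    var_c = lam * var_design  # equals var_design * var_model / (var_design + var_model)
    return mu_c, var_c, lam


# Hypothetical estimates of a small proportion and their estimated variances.
mu_c, var_c, lam = composite(mu_design=0.010, var_design=4e-6,
                             mu_model=0.012, var_model=1e-6)
print(mu_c, var_c, lam)
```

Since the composite variance estimate can be written symmetrically as the product of the two variances divided by their sum, it never exceeds the smaller of the two input variances.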

3. Application to the Survey of the Lithuanian Census

3.1. Motivation

In 2020, the COVID-19 pandemic highlighted the need to promptly produce statistical information from national statistical institutions. This led Statistics Lithuania (the State Data Agency) to take on a new role as the governing organization for state data, forming a unified database of the main state registers and information systems with a vast amount of data that are ready to be used for statistical purposes. Therefore, Statistics Lithuania was able to carry out the following 2021 census based on administrative data from these registers and information systems: residents, real estate, address registers, and the State Social Insurance Fund Board (Sodra) database, among others.
However, as some variables of interest could not be obtained from any administrative source, a statistical survey for such data collection had to be launched. Hence, a statistical survey on population by ethnicity, native language, and religion was conducted in 2021. It aimed to evaluate population proportions for the following variables: religion professed (16 categories), mother tongue (more than 12 categories), knowledge of other languages (16 languages), and ethnicity. For the latter variable, mass imputation was used since relevant information was known from the Ethnicity Register for approximately 87% of the census population. The research was conducted to achieve the objective of efficiently estimating these proportions by exploiting complete data from the previous censuses and other auxiliary information.

3.2. Sample Selection

The survey sample $s \subset U$ was drawn and consisted of three parts, $s = s_A \cup s_O \cup s_B$:
(i) At first, a voluntary online survey was carried out from 15 January to 28 February 2021, which allowed for the collection of statistical data from approximately 2% of the census population (about 54,000 respondents), resulting in the non-probability sample $s_A$.
(ii) After the end of the online survey, a sampling frame for probability sampling was constructed. It excluded certain addresses, e.g., if at least one individual from the address participated in the online survey, if it was an institution, if more than 15 individuals were permanent residents, among other rules. These units, which were not included in the sampling frame, comprised the part $s_O$ of the sample $s$.
(iii) Lastly, the probability sample $s_B$ was drawn from the sampling frame $U \setminus \{ s_A \cup s_O \}$, which was divided into $H = 113$ strata according to the municipality intersected with the area of residence, i.e., urban or rural. The number of addresses sampled from a particular stratum was proportional to the size of the stratum, resulting in around 40,000 addresses sampled from the Population Register in total; approximately 6% of the census population was interviewed through the telephone survey (about 171,000 respondents).
The working sampling design $p(\cdot)$ is characterized by the inclusion probabilities: $\pi_k = 1$ if $k \in s_A \cup s_O$, and
$$\pi_k \approx \frac{m_k n_h}{N_h} \quad \text{if } k \in s_B,$$
where $N_h$ denotes the size of the $h$th stratum, $n_h$ is the number of addresses selected, and $m_k$ is the number of individuals in the corresponding address. The sample part $s_O$ is treated as a separate post-stratum.

3.3. Imputation of Missing Values

The response rate in the probability sample s B reached approximately 88%. Missing values in the whole sample s were filled in using three imputation methods: historical, deductive, and k-nearest neighbor.
Missing values were first filled in using historical information from the 2011 and 2001 censuses consecutively, as variables of interest were fully known for the populations of those censuses. The remaining missing values accounted for 2.3% of the sample.
Additional sociodemographic characteristics of previous and current censuses, such as age, gender, marital status, household structure, country of birth, citizenship, education, and employment status, were used for further deductive imputation. For instance, if the same religion was observed for each household member except one, the corresponding religion was imputed where missing. After the deductive imputation, only 0.3% of the sample remained with missing values.
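The household rule given as an example of deductive imputation can be sketched as follows. This is one possible reading of the rule, stated here as an assumption: a value is imputed only when exactly one household member has a missing value and all other members report the same religion. The function name `impute_household_religion`, the record layout, and the data are hypothetical.

```python
from collections import defaultdict


def impute_household_religion(records):
    """Fill a single missing religion from unanimous household members.

    records: list of dicts with keys 'household' and 'religion'
    (religion is None when missing). Records are updated in place.
    """
    by_household = defaultdict(list)
    for rec in records:
        by_household[rec["household"]].append(rec)
    for members in by_household.values():
        missing = [m for m in members if m["religion"] is None]
        observed = {m["religion"] for m in members if m["religion"] is not None}
        # Impute only if exactly one value is missing and all other members agree.
        if len(missing) == 1 and len(observed) == 1:
            missing[0]["religion"] = next(iter(observed))
    return records


records = [
    {"household": 1, "religion": "Roman Catholic"},
    {"household": 1, "religion": "Roman Catholic"},
    {"household": 1, "religion": None},      # imputed from the two other members
    {"household": 2, "religion": "Roman Catholic"},
    {"household": 2, "religion": "Karaite"},
    {"household": 2, "religion": None},      # members disagree, left missing
    {"household": 3, "religion": None},      # single-person household, left missing
]
impute_household_religion(records)
print([r["religion"] for r in records])
```

Values that such rules cannot resolve would remain missing and be passed on to the next imputation step.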
Eventually, the remaining missing values in the sample were filled in by applying the k-nearest neighbor method [22].

3.4. Application to Religion Proportions

We focus on the non-probability sample integration for the estimation of religion proportions as the results are similar for every proportion of interest.
When we obtained the non-probability sample from the online survey, the question arose if the collected data could be used for estimation. We first checked the representativeness of the voluntary sample using sociodemographic characteristics known for the entire 2021 census population. A comparison of some proportions of sociodemographic characteristics between the voluntary sample and the whole population showed the biased nature of the non-probability sample. The results provided in Table 1 suggest that people with higher education, as well as those who are employed and married, tend to participate in such online surveys. Another interesting observation made was the willingness of some ethnic communities to participate in the online survey and represent their community. For instance, Polish people in Lithuania accounted for 35% of the voluntary sample but only about 7% of the whole population.
Additionally, we compared the proportions of religions professed in 2011, computed within the 2021 census population, between the online survey respondents and the entire population; see Table 2. It was observed that the representatives of smaller religious communities were more likely to participate in the survey. For instance, the proportion of the Karaites religious community in the voluntary sample was 1307% larger than the corresponding proportion in the population.
The sociodemographic variables of Table 1 and the religion variables of Table 2 contain information that can explain the chance of being selected in the voluntary sample. Hence, these variables were used as covariates in the propensity score model.
To estimate the religion proportions in the 2021 census population, we first considered the post-stratified generalized regression and generalized difference estimators given by (3) and (10), respectively. The calibrated weights in (4) were calculated by taking such auxiliary information as binary variables on age groups, gender, and religions professed in 2011 intersected with counties in the calibration equations, while the same auxiliary variables as in the propensity score model were used for estimator (10). Comparing the results of the post-stratified calibrated estimator with the proportions of the previous censuses in Table 3 (and based on external evaluations), the estimator $\hat{\mu}_{\mathrm{GR}}$ tends to underestimate smaller religious communities due to the lack of data. On the other hand, the generalized difference estimator $\hat{\mu}_{\mathrm{GD}}$ seems to produce slightly higher estimates for the majority of these smaller religions.
As we observed relatively more representatives of minor religions in the voluntary sample (see Table 2), we expected smaller variances of the estimators based only on this non-probability sample with a condition that a selection bias is properly corrected. The IPW estimator is designed to correct such bias by incorporating the propensity scores evaluated using the auxiliary variables of Table 1 and Table 2.
We integrated the non-probability sample through the combination $\hat{\mu}_{C1}$ of the post-stratified generalized regression (calibrated) and IPW estimators. According to Table 4, it seems that the first composite estimator corrected the underestimation. Alternatively, we considered the combination $\hat{\mu}_{C2}$ of the generalized difference estimator with its analog DR estimator based on the auxiliary variables of Table 1 and Table 2. The second composite estimator seems to have produced even higher estimates for smaller religious communities; however, it also came with larger variances (see Table 5).
Nevertheless, as the calibrated and generalized difference estimators seemed to evaluate larger proportions of interest accurately, they were used in compositions (15) and (16) with weights equal to 1. That is, for religions None, Not indicated, and Roman Catholics, the proportions were evaluated using only the design-based calibrated or generalized difference estimators.
Table 5 provides comparisons of the relative percent difference between (i) the smoothed version of variance estimator (5) and variance estimator (17), (ii) the smoothed version of variance estimator (11) and variance estimator (18), and (iii) variance estimators (17) and (18). We used smoothed variances $\hat{\psi}_{\mathrm{GR}}^{s}$ and $\hat{\psi}_{\mathrm{GD}}^{s}$ instead of $\hat{\psi}_{\mathrm{GR}}$ and $\hat{\psi}_{\mathrm{GD}}$ in compositions (15) and (16), respectively, due to the estimation of very small proportions. For the smoothing of variance (5), similarly as in [23], we assumed that $V_p(\hat{\mu}_{\mathrm{GR}}) \approx K \tilde{N}^{\gamma}$, with $\tilde{N}$ as the sum of values of the 2011 variable of interest in the 2021 census population. Parameters $K > 0$ and $\gamma \in \mathbb{R}$ were evaluated through a log–log regression, using the data pairs $(\hat{\psi}_{\mathrm{GR}}, \tilde{N})$ calculated from all categories of the variable of interest. The smoothing of variance (11) was performed analogously.
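The log–log regression used for variance smoothing amounts to an ordinary least squares fit of the logarithm of the variance estimate on the logarithm of the category size. A minimal sketch is given below; the function names `fit_power_law` and `smooth_variance` and the input values are hypothetical and were generated to follow a power law exactly, so they do not reproduce any figure from the study.

```python
import math


def fit_power_law(var_estimates, sizes):
    """Least squares fit of log(var) = log(K) + gamma * log(size)."""
    lx = [math.log(n) for n in sizes]
    ly = [math.log(v) for v in var_estimates]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    gamma = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    K = math.exp(my - gamma * mx)
    return K, gamma


def smooth_variance(K, gamma, size):
    """Smoothed variance K * size ** gamma for a category of the given size."""
    return K * size ** gamma


# Hypothetical category sizes and variance estimates, one pair per category.
sizes = [100.0, 1000.0, 10000.0, 100000.0]
var_estimates = [2e-12 * n ** 0.9 for n in sizes]

K_hat, gamma_hat = fit_power_law(var_estimates, sizes)
smoothed = [smooth_variance(K_hat, gamma_hat, n) for n in sizes]
print(K_hat, gamma_hat, smoothed)
```

With real variance estimates the points would scatter around the fitted line, and the smoothed values would replace the noisy category-level estimates in the composition weights.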
The compositions given by (15) and (16) assign more weight to estimators with a smaller variance. The first two numerical columns in Table 5, which correspond to cases (i)–(ii), provide an indication of how much composite estimators can improve the estimation accuracy of the calibrated and generalized difference estimators, respectively. The last column of the table includes the relative percent differences between the variance estimates of composite estimators $\hat{\mu}_{C1}$ and $\hat{\mu}_{C2}$. The first composite estimator gives more satisfactory results for the proportions of smaller religious communities. However, the second composite estimator seems to correct the estimation accuracy of proportions of larger religious groups.

4. Discussion

The generalized regression and generalized difference estimators considered here are traditional model-assisted estimators based on the union of the non-probability and probability samples, together with the probability sampling design $p(\cdot)$. Since the selection bias is corrected through the known inclusion probabilities, the application of these estimators may lead to valid design-based inferences. However, the latter estimators incorporate unweighted units of the non-probability sample, which does not add more efficiency unless the size of the non-probability sample is of the same magnitude as the population size [12]. In the application to the Lithuanian census, the voluntary sample covers about 2% of the population, and such a contribution is too small.
Meanwhile, IPW and DR estimators exploit the non-probability sample in a more advanced way, i.e., through the propensity score and prediction models, and we may benefit from combining them with traditional design-based estimators. There are various ways to combine different data sources and estimators [24]; however, we choose to optimize the linear combinations of estimators by taking into account their estimated variances. This approach appears to be efficient in the application when estimating smaller population proportions that require larger sample sizes. Indeed, the first composite estimator improves the estimation accuracy for proportions of smaller religious communities as the estimated variances of the composition are smaller than the estimated variances of the generalized regression estimator by up to 19%. Alternatively, we consider the combination of the generalized difference estimator with the DR estimator as it allows us to better leverage the available complete auxiliary data through the prediction model. Surprisingly, while this composite estimator improves the estimation accuracy for proportions of larger religious communities, it does not give satisfactory results for smaller religions.
In the application, the detailed data available from administrative registers and previous complete censuses allowed us to efficiently apply the model-based IPW and DR estimators integrating the non-probability sample. With regard to future censuses, it is worth considering the possibility of collecting much larger non-probability samples by promoting voluntary participation more actively.
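A doubly robust estimator of a population mean can be sketched as below, in a form commonly used when auxiliary data are available for every population unit: the inverse-propensity-weighted residuals of the volunteer sample correct the population mean of the model predictions. This is an illustrative Python sketch rather than the implementation used in the application; the function name and arguments are hypothetical, and the propensity scores and predictions are taken as given instead of being fitted.

```python
import numpy as np

def dr_mean(y_vol, pi_vol, m_vol, m_pop):
    """Doubly robust estimate of a population mean.

    y_vol  : outcomes observed in the non-probability sample
    pi_vol : estimated participation propensities of those units
    m_vol  : prediction-model values for those units
    m_pop  : prediction-model values for all population units
    """
    y_vol, pi_vol, m_vol = (np.asarray(a, dtype=float) for a in (y_vol, pi_vol, m_vol))
    w = 1.0 / pi_vol
    # Propensity-weighted mean of the prediction residuals in the volunteer sample.
    correction = np.sum(w * (y_vol - m_vol)) / np.sum(w)
    return correction + np.mean(m_pop)

# Exact predictions: the residual term vanishes and the mean prediction remains.
exact = dr_mean([1, 0, 1], [0.5, 0.2, 0.1], [1, 0, 1], [1, 0, 0, 1])
# Zero predictions: the estimator reduces to a weighted (Hajek-type) IPW mean.
ipw_only = dr_mean([1, 0], [0.5, 0.25], [0, 0], [0, 0, 0])
print(exact, ipw_only)
```

The two calls show the two limiting cases: when the prediction model is exact the propensity scores do not matter, and when the predictions carry no information the estimate relies entirely on the propensity weights.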
Our study is based on a strong MAR assumption for the variable of interest in the propensity score model. Although the analysis of the Lithuanian census data allowed us to identify variables that clearly explain voluntary survey participation (and, thus, the propensity score model may be specified quite well), this assumption might be relaxed in future research. In this way, future investigations could explore data integration under the assumption of the non-ignorable selection mechanism.
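The role of the MAR assumption can be illustrated with a small simulation. In the Python sketch below, participation depends only on an auxiliary variable known for the whole population, so a logistic propensity model fitted on that variable is enough to remove the selection bias by inverse probability weighting. The data-generating values, the function names, and the plain Newton-Raphson fit are assumptions made for illustration; they are not taken from the census application.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Newton-Raphson fit of a logistic participation model; returns fitted propensities."""
    X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (r - p)                          # score vector
        hess = (X1 * (p * (1.0 - p))[:, None]).T @ X1  # observed information
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-X1 @ beta))

def ipw_mean(y_vol, pi_vol):
    """Weighted (Hajek-type) IPW mean over the volunteer sample."""
    w = 1.0 / np.asarray(pi_vol)
    return np.sum(w * np.asarray(y_vol)) / np.sum(w)

rng = np.random.default_rng(1)
N = 200_000
x = rng.binomial(1, 0.3, size=N)                       # auxiliary variable known for all units
y = rng.binomial(1, np.where(x == 1, 0.20, 0.05))      # outcome depends on x
r = rng.binomial(1, np.where(x == 1, 0.06, 0.01))      # participation depends on x only (MAR)

pi_hat = fit_logistic(x, r)
vol = r == 1
naive = y[vol].mean()
ipw = ipw_mean(y[vol], pi_hat[vol])
true = y.mean()
print(f"true={true:.4f}  naive={naive:.4f}  ipw={ipw:.4f}")
```

If participation also depended on the outcome itself, given the auxiliary variable, the same weights would no longer remove the bias, which is the non-ignorable setting left for future research.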

5. Conclusions

The non-probability and probability samples can be combined into a single sample and used in design-based post-stratified estimators of parameters. This approach is a safe but inefficient way to exploit the non-probability sample.
The post-stratified estimators are, therefore, linearly combined with the model-based estimators based only on the non-probability sample. By applying the estimated optimal combinations to the Lithuanian census survey, there is a significant improvement in estimating the population proportions.
The success of integrating a voluntary sample into a census survey depends on the availability of proper auxiliary information, such as complete data from previous censuses. In the future, such information could be obtained by collecting much larger non-probability samples. In addition, it would then be possible to forego the selection of a probability sample.
Applications such as ours may be more efficient if the MAR assumption for the propensity score model is abandoned. This would lead to more complex estimation procedures that can be explored in the future.

Author Contributions

Conceptualization, A.Č.; methodology, A.Č.; software, A.Č. and I.B.; validation, A.Č. and I.B.; formal analysis, A.Č. and I.B.; investigation, A.Č. and I.B.; resources, A.Č. and I.B.; data curation, A.Č. and I.B.; writing—original draft preparation, I.B.; writing—review and editing, A.Č. and I.B.; visualization, A.Č. and I.B.; supervision, A.Č.; project administration, A.Č.; funding acquisition, A.Č. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are not publicly available due to privacy concerns, as the data contain sensitive information on an individual level.

Acknowledgments

This research was conducted as part of the Lithuanian Population and Housing Census 2021 and was supported by Statistics Lithuania (the State Data Agency), where both authors of the paper are employed.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MAR: missing at random
IPW: inverse probability weighting
DR: doubly robust

References

  1. Axelson, M.; Holmberg, A.; Jansson, I.; Westling, S. A register-based census: The Swedish experience. In Administrative Records for Survey Methodology; Chun, A.Y., Larsen, M.D., Durrant, G., Reiter, J.P., Eds.; Wiley: Hoboken, NJ, USA, 2021; pp. 179–204.
  2. Bernardini, A.; Brown, J.; Chipperfield, J.; Bycroft, C.; Chieppa, A.; Cibella, N.; Dunnet, G.; Hawkes, M.; Hleihel, A.; Law, E.; et al. Evolution of the person census and the estimation of population counts in New Zealand, United Kingdom, Italy and Israel. Stat. J. IAOS 2022, 38, 1221–1237.
  3. Bycroft, C. Census transformation in New Zealand: Using administrative data without a population register. Stat. J. IAOS 2015, 31, 401–411.
  4. Mule, V.T., Jr.; Keller, A. Administrative records applications for the 2020 census. In Administrative Records for Survey Methodology; Chun, A.Y., Larsen, M.D., Durrant, G., Reiter, J.P., Eds.; Wiley: Hoboken, NJ, USA, 2021; pp. 205–229.
  5. Tille, Y. Sampling and Estimation from Finite Populations; Wiley Series in Survey Methodology; Wiley: Hoboken, NJ, USA, 2020.
  6. Argüeso, A.; Vega, J.L. A population census based on registers and a “10% survey” methodological challenges and conclusions. Stat. J. IAOS 2014, 30, 35–39.
  7. Beaumont, J.F. Are probability surveys bound to disappear for the production of official statistics? Surv. Methodol. 2020, 46, 71–96.
  8. Kim, J.-K. A gentle introduction to data integration in survey sampling. Surv. Stat. 2022, 85, 19–29.
  9. Rao, J.N.K. On making valid inferences by integrating data from surveys and other sources. Sankhya B 2021, 83, 242–272.
  10. Wu, C. Statistical inference with non-probability survey samples. Surv. Methodol. 2022, 48, 283–311.
  11. Meng, X.-L. Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat. 2018, 12, 685–726.
  12. Kim, J.-K.; Tam, S.-M. Data integration by combining big data and survey sample data for finite population inference. Int. Stat. Rev. 2021, 89, 382–401.
  13. Tam, S.-M.; Kim, J.-K. Big data ethics and selection-bias: An official statistician’s perspective. Stat. J. IAOS 2018, 34, 577–588.
  14. Chen, Y.; Li, P.; Wu, C. Doubly robust inference with nonprobability survey samples. J. Am. Stat. Assoc. 2020, 115, 2011–2021.
  15. Castro-Martín, L.; Rueda, M.d.M.; Ferri-García, R. Estimating general parameters from non-probability surveys using propensity score adjustment. Mathematics 2020, 8, 2096–2109.
  16. Wu, C.; Sitter, R.R. A model-calibration approach to using complete auxiliary information from survey data. J. Am. Stat. Assoc. 2001, 96, 185–193.
  17. Särndal, C.-E.; Swensson, B.; Wretman, J. Model Assisted Survey Sampling; Springer Series in Statistics; Springer: New York, NY, USA, 1992.
  18. Deville, J.C.; Särndal, C.-E. Calibration estimators in survey sampling. J. Am. Stat. Assoc. 1992, 87, 376–382.
  19. Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592.
  20. McCullagh, P.; Nelder, J.A. Generalized Linear Models; Chapman and Hall: New York, NY, USA, 1989.
  21. Kim, J.K.; Park, S.; Chen, Y.; Wu, C. Combining non-probability and probability survey samples through mass imputation. J. R. Stat. Soc. Ser. A 2021, 184, 941–963.
  22. Kowarik, A.; Templ, M. Imputation with the R Package VIM. J. Stat. Softw. 2016, 74, 1–16.
  23. Dick, P. Modelling net undercoverage in the 1991 Canadian census. Surv. Methodol. 1995, 21, 45–54.
  24. Yang, S.; Kim, J.-K. Statistical data integration in survey sampling: A review. Jpn. J. Stat. Data Sci. 2020, 3, 625–650.

Table 1. Comparison of proportions of some sociodemographic characteristics in the voluntary sample and the whole population.

| Variable | Category | Voluntary Sample | Population | Difference in % |
|---|---|---|---|---|
| Ethnicity | Pole | 0.35 | 0.07 | 441 |
| Education | higher | 0.48 | 0.20 | 134 |
| County | Vilnius | 0.64 | 0.29 | 121 |
| Employment | employed | 0.63 | 0.45 | 41 |
| Age group | ≥30, <50 | 0.37 | 0.27 | 37 |
| Marital status | married | 0.52 | 0.42 | 25 |
| Gender | male | 0.41 | 0.46 | −11 |
| Ethnicity | Lithuanian | 0.56 | 0.85 | −34 |
| Education | (lower) secondary | 0.24 | 0.37 | −35 |
| Education | primary | 0.09 | 0.20 | −55 |

Table 2. Comparison of religion proportions in the voluntary sample and the whole population.

| Religion | Voluntary Sample | Population | Difference in % |
|---|---|---|---|
| Karaites | 0.00130 | 0.00009 | 1307 |
| New Apostolic Church | 0.00161 | 0.00014 | 1049 |
| Evangelical Reformed Believers | 0.00833 | 0.00207 | 302 |
| Other | 0.01596 | 0.00514 | 211 |
| Pentecostalists | 0.00198 | 0.00067 | 194 |
| Greek Catholics (Uniats) | 0.00048 | 0.00021 | 131 |
| Evangelical Lutherans | 0.01311 | 0.00585 | 124 |
| Judaists | 0.00074 | 0.00035 | 112 |
| Baptists and Free Churches | 0.00083 | 0.00048 | 74 |
| Sunni Muslims | 0.00130 | 0.00085 | 52 |
| Not indicated | 0.07621 | 0.10090 | −24 |
| Seventh Day Adventist Church | 0.00026 | 0.00032 | −20 |
| None | 0.07580 | 0.06424 | 18 |
| Old Believers | 0.00615 | 0.00683 | −10 |
| Orthodox | 0.04047 | 0.03787 | 7 |
| Roman Catholics | 0.75548 | 0.77398 | −2 |

Table 3. Religion proportions in 2001, 2011, and 2021 census populations.

| Religion | $\mu(2001)$ | $\mu(2011)$ | $\hat{\mu}_{GR}$ | $\hat{\mu}_{GD}$ |
|---|---|---|---|---|
| Roman Catholics | 0.78391 | 0.77233 | 0.73664 | 0.74101 |
| Not indicated | 0.05671 | 0.10112 | 0.15701 | 0.15025 |
| None | 0.09696 | 0.06146 | 0.05408 | 0.05477 |
| Orthodox | 0.04150 | 0.04113 | 0.03433 | 0.03482 |
| Old Believers | 0.00806 | 0.00767 | 0.00434 | 0.00419 |
| Evangelical Lutherans | 0.00565 | 0.00604 | 0.00389 | 0.00398 |
| Other | 0.00282 | 0.00493 | 0.00566 | 0.00625 |
| Evangelical Reformed Believers | 0.00208 | 0.00221 | 0.00122 | 0.00126 |
| Pentecostalists | 0.00037 | 0.00061 | 0.00117 | 0.00158 |
| Sunni Muslims | 0.00075 | 0.00089 | 0.00058 | 0.00064 |
| Baptists and Free Churches | 0.00034 | 0.00044 | 0.00017 | 0.00016 |
| Judaists | 0.00039 | 0.00040 | 0.00025 | 0.00029 |
| Greek Catholics (Uniats) | 0.00010 | 0.00023 | 0.00030 | 0.00038 |
| Seventh Day Adventist Church | 0.00016 | 0.00030 | 0.00014 | 0.00013 |
| New Apostolic Church | 0.00012 | 0.00014 | 0.00015 | 0.00020 |
| Karaites | 0.00008 | 0.00010 | 0.00008 | 0.00010 |

Table 4. Religion proportion estimates in 2021 census population.

| Religion | $\hat{\mu}_{GR}$ | $\hat{\mu}_{C1}$ | $\hat{\mu}_{GD}$ | $\hat{\mu}_{C2}$ |
|---|---|---|---|---|
| Roman Catholics | 0.73664 | 0.73349 | 0.74101 | 0.73811 |
| Not indicated | 0.15701 | 0.15452 | 0.15025 | 0.14832 |
| None | 0.05408 | 0.05319 | 0.05477 | 0.05401 |
| Orthodox | 0.03433 | 0.03804 | 0.03482 | 0.03592 |
| Old Believers | 0.00434 | 0.00503 | 0.00419 | 0.00486 |
| Evangelical Lutherans | 0.00389 | 0.00460 | 0.00398 | 0.00475 |
| Other | 0.00566 | 0.00636 | 0.00625 | 0.00709 |
| Evangelical Reformed Believers | 0.00122 | 0.00151 | 0.00126 | 0.00175 |
| Pentecostalists | 0.00117 | 0.00126 | 0.00158 | 0.00191 |
| Sunni Muslims | 0.00058 | 0.00069 | 0.00064 | 0.00094 |
| Baptists and Free Churches | 0.00017 | 0.00024 | 0.00016 | 0.00037 |
| Judaists | 0.00025 | 0.00031 | 0.00029 | 0.00051 |
| Greek Catholics (Uniats) | 0.00030 | 0.00034 | 0.00038 | 0.00060 |
| Seventh Day Adventist Church | 0.00014 | 0.00017 | 0.00013 | 0.00031 |
| New Apostolic Church | 0.00015 | 0.00017 | 0.00020 | 0.00035 |
| Karaites | 0.00008 | 0.00009 | 0.00010 | 0.00022 |

Table 5. Comparison of the relative difference (in %): (i) $(\hat{\psi}_{GRs} - \hat{V}_{C1}) / \hat{V}_{C1}$, (ii) $(\hat{\psi}_{GDs} - \hat{V}_{C2}) / \hat{V}_{C2}$, (iii) $(\hat{V}_{C1} - \hat{V}_{C2}) / \hat{V}_{C2}$.
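
| Religion | (i) | (ii) | (iii) |
|---|---|---|---|
| New Apostolic Church | 1 | 6 | −88 |
| Karaites | 4 | 43 | −88 |
| Greek Catholics (Uniats) | 2 | 16 | −84 |
| Seventh Day Adventist Church | 4 | 22 | −79 |
| Judaists | 6 | 35 | −76 |
| Pentecostalists | 1 | 3 | −73 |
| Baptists and Free Churches | 9 | 42 | −71 |
| Sunni Muslims | 6 | 20 | −65 |
| Evangelical Reformed Believers | 6 | 12 | −46 |
| Other | 2 | 2 | −14 |
| Evangelical Lutherans | 7 | 7 | −7 |
| Old Believers | 19 | 22 | 4 |
| Orthodox | 18 | 9 | 146 |
| None | 0 | 0 | 261 |
| Not indicated | 0 | 0 | 366 |
| Roman Catholics | 0 | 0 | 1389 |
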
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Burakauskaitė, I.; Čiginas, A. An Approach to Integrating a Non-Probability Sample in the Population Census. Mathematics 2023, 11, 1782. https://doi.org/10.3390/math11081782
