3.1. Motivation
In 2020, the COVID-19 pandemic highlighted the need for national statistical institutions to produce statistical information promptly. This led Statistics Lithuania (the State Data Agency) to take on a new role as the governing organization for state data, forming a unified database of the main state registers and information systems, with a vast amount of data ready to be used for statistical purposes. As a result, Statistics Lithuania was able to carry out the subsequent 2021 census based on administrative data from these registers and information systems, including the residents, real estate, and address registers and the State Social Insurance Fund Board (Sodra) database, among others.
However, some variables of interest could not be obtained from any administrative source, so a statistical survey on population by ethnicity, native language, and religion was conducted in 2021. It aimed to evaluate population proportions for the following variables: religion professed (16 categories), mother tongue (more than 12 categories), knowledge of other languages (16 languages), and ethnicity. For the latter variable, mass imputation was used since relevant information was known from the Ethnicity Register for approximately 87% of the census population. The objective of the research was to estimate these proportions efficiently by exploiting complete data from the previous censuses and other auxiliary information.
3.3. Imputation of Missing Values
The response rate in the probability sample reached approximately 88%. Missing values in the whole sample s were filled in using three imputation methods: historical, deductive, and k-nearest neighbor.
Missing values were first filled in using historical information from the 2011 and 2001 censuses consecutively, as variables of interest were fully known for the populations of those censuses. The remaining missing values accounted for 2.3% of the sample.
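The consecutive use of the two previous censuses can be illustrated with a minimal sketch. The function name, the record layout, and the toy values are hypothetical and serve only to show the fallback order (2011 first, then 2001); they are not taken from the survey data.

```python
from typing import Optional

def impute_historical(value_2021: Optional[str],
                      value_2011: Optional[str],
                      value_2001: Optional[str]) -> Optional[str]:
    """Keep the 2021 answer if present; otherwise fall back to 2011, then 2001."""
    for candidate in (value_2021, value_2011, value_2001):
        if candidate is not None:
            return candidate
    return None  # still missing: left for the later imputation steps

# Toy records (invented): (2021 answer, 2011 census value, 2001 census value)
records = [
    ("Roman Catholic", "Roman Catholic", None),
    (None, "Orthodox", "Orthodox"),
    (None, None, "Old Believers"),
    (None, None, None),
]
imputed = [impute_historical(*r) for r in records]
```

Records for which all three sources are empty stay missing and are passed on to the deductive and k-nearest neighbor steps.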
Additional sociodemographic characteristics of previous and current censuses, such as age, gender, marital status, household structure, country of birth, citizenship, education, and employment status, were used for further deductive imputation. For instance, if the same religion was observed for each household member except one, the corresponding religion was imputed where missing. After the deductive imputation, only 0.3% of the sample remained with missing values.
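The household rule given as an example above can be sketched as follows. This is an illustrative reading of that single rule only; the field names `household` and `religion` and the toy data are assumptions, and the actual deductive imputation used further characteristics.

```python
from collections import defaultdict

def impute_household_religion(persons):
    """Fill a missing religion when exactly one household member lacks it
    and all other members of that household report the same religion."""
    by_household = defaultdict(list)
    for person in persons:
        by_household[person["household"]].append(person)

    for members in by_household.values():
        observed = {m["religion"] for m in members if m["religion"] is not None}
        missing = [m for m in members if m["religion"] is None]
        # One missing member, and a single religion shared by everyone else.
        if len(missing) == 1 and len(observed) == 1:
            missing[0]["religion"] = next(iter(observed))
    return persons

# Toy data (invented): household 1 qualifies, household 2 is mixed,
# household 3 has a single member with nothing to borrow from.
people = [
    {"household": 1, "religion": "Roman Catholic"},
    {"household": 1, "religion": "Roman Catholic"},
    {"household": 1, "religion": None},
    {"household": 2, "religion": "Orthodox"},
    {"household": 2, "religion": "Roman Catholic"},
    {"household": 2, "religion": None},
    {"household": 3, "religion": None},
]
impute_household_religion(people)
```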
Finally, the remaining missing values in the sample were filled in by applying the k-nearest neighbor method [22].
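For a categorical variable such as religion, a k-nearest neighbor step can be sketched as a majority vote among the closest donors. The text does not state the distance measure, the value of k, or the matching variables, so the Euclidean distance, k = 3, and the two toy features below are assumptions made purely for illustration.

```python
import math
from collections import Counter

def knn_impute(donors, recipient_features, k=3):
    """Impute a category by majority vote among the k nearest donors.

    donors: list of (features, label) pairs with observed labels.
    recipient_features: feature tuple of the unit with a missing label.
    """
    ranked = sorted(donors, key=lambda d: math.dist(d[0], recipient_features))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy donors (invented): features are (age, married indicator).
donors = [
    ((30, 1), "A"),
    ((32, 1), "A"),
    ((31, 0), "B"),
    ((70, 0), "B"),
    ((68, 1), "B"),
]
imputed_label = knn_impute(donors, (31, 1), k=3)
```

In practice the features would need to be scaled or encoded consistently before distances are computed; that step is omitted here.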
3.4. Application to Religion Proportions
We focus on the non-probability sample integration for the estimation of religion proportions as the results are similar for every proportion of interest.
When we obtained the non-probability sample from the online survey, the question arose whether the collected data could be used for estimation. We first checked the representativeness of the voluntary sample using sociodemographic characteristics known for the entire 2021 census population. A comparison of some proportions of sociodemographic characteristics between the voluntary sample and the whole population showed the biased nature of the non-probability sample. The results provided in Table 1 suggest that people with higher education, as well as those who are employed and married, tend to participate in such online surveys. Another notable observation was the willingness of some ethnic communities to participate in the online survey and represent their community. For instance, Polish people in Lithuania accounted for 35% of the voluntary sample but only about 7% of the whole population.
Additionally, we compared the proportions of religion professed in 2011, within the 2021 census population, between the online survey respondents and the entire population; see Table 2. Representatives of smaller religious communities were more likely to participate in the survey. For instance, the proportion of the Karaite religious community in the voluntary sample was 1307% larger than the corresponding proportion in the population.
The sociodemographic variables of Table 1 and the religion variables of Table 2 contain information that can explain the chance of being selected in the voluntary sample. Hence, these variables were used as covariates in the propensity score model.
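To make the idea of a propensity score model concrete, the sketch below fits a logistic regression of the participation indicator on a single binary covariate. The logistic form, fitting on a fully observed toy population, the plain gradient-ascent fitting routine, and the covariate (a hypothetical higher-education indicator) are all assumptions for illustration; the text does not specify the form of the model or how it was estimated.

```python
import math

def fit_logistic(X, r, lr=0.5, n_iter=2000):
    """Fit P(r = 1 | x) = 1 / (1 + exp(-(b0 + b'x))) by gradient ascent
    on the mean log-likelihood. Returns [b0, b1, ..., bp]."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, ri in zip(X, r):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            resid = ri - 1.0 / (1.0 + math.exp(-z))
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def propensity(beta, x):
    """Estimated participation probability for covariate vector x."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Toy population (invented): covariate = higher education (1/0),
# response = participated in the online survey (1/0).
population = ([((1.0,), 1)] * 6 + [((1.0,), 0)] * 4 +
              [((0.0,), 1)] * 2 + [((0.0,), 0)] * 8)
X = [x for x, _ in population]
r = [ri for _, ri in population]

beta = fit_logistic(X, r)
p_high = propensity(beta, (1.0,))  # participation rate in the toy data: 6/10
p_low = propensity(beta, (0.0,))   # participation rate in the toy data: 2/10
```

With one binary covariate the fitted propensities approach the observed participation rates in each group, which makes the toy example easy to check by hand.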
To estimate the religion proportions in the 2021 census population, we first considered the post-stratified generalized regression and generalized difference estimators given by (3) and (10), respectively. The calibrated weights in (4) were calculated by including binary variables on age groups, gender, and religions professed in 2011 intersected with counties as auxiliary information in the calibration equations, while the same auxiliary variables as in the propensity score model were used for estimator (10). A comparison of the results of the post-stratified calibrated estimator with the proportions of the previous censuses in Table 3 (together with external evaluations) suggests that estimator (3) tends to underestimate smaller religious communities due to the lack of data. On the other hand, the generalized difference estimator (10) seems to produce slightly higher estimates for the majority of these smaller religions.
As we observed relatively more representatives of minor religions in the voluntary sample (see Table 2), we expected smaller variances for estimators based only on this non-probability sample, provided that the selection bias is properly corrected. The IPW estimator is designed to correct such bias by incorporating the propensity scores evaluated using the auxiliary variables of Table 1 and Table 2.
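The way inverse probability weighting counteracts over-representation can be seen in a small sketch. The ratio (Hájek-type) form used below is an assumption, since the exact IPW formula is not restated in this section, and the responses and propensity scores are invented.

```python
def ipw_proportion(y, propensities):
    """Hájek-type IPW estimate of a proportion.

    y: 0/1 indicators of belonging to the category of interest.
    propensities: estimated participation probabilities, each in (0, 1].
    """
    weights = [1.0 / p for p in propensities]
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

# Toy voluntary sample (invented): members of the category of interest
# are assumed to be five times more likely to participate than the others.
y = [1, 1, 0, 0]
pi_hat = [0.5, 0.5, 0.1, 0.1]

naive = sum(y) / len(y)               # unweighted sample proportion: 0.5
adjusted = ipw_proportion(y, pi_hat)  # weights 2, 2, 10, 10 give 4 / 24
```

Units with a high estimated propensity receive small weights, so an over-represented group contributes less to the estimate than it does to the raw sample proportion.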
We integrated the non-probability sample through the combination of the post-stratified generalized regression (calibrated) and IPW estimators. According to Table 4, it seems that the first composite estimator corrected the underestimation. Alternatively, we considered the combination of the generalized difference estimator with its analog DR estimator based on the auxiliary variables of Table 1 and Table 2. The second composite estimator seems to have produced even higher estimates for smaller religious communities; however, it also came with larger variances (see Table 5).
Nevertheless, as the calibrated and generalized difference estimators seemed to evaluate larger proportions of interest accurately, they were used in compositions (15) and (16) with weights equal to 1. That is, for the religion categories None, Not indicated, and Roman Catholics, the proportions were evaluated using only the design-based calibrated or generalized difference estimators.
Table 5 provides comparisons of the relative percent difference between (i) the smoothed version of variance estimator (5) and variance estimator (17), (ii) the smoothed version of variance estimator (11) and variance estimator (18), and (iii) variance estimators (17) and (18). Because very small proportions were being estimated, we used the smoothed versions of variance estimators (5) and (11), rather than the unsmoothed ones, in compositions (15) and (16), respectively. For the smoothing of variance (5), similarly to [23], we assumed a two-parameter model relating the variance to the sum of the values of the 2011 variable of interest in the 2021 census population. The two parameters were evaluated through a log–log regression, using data pairs calculated from all categories of the variable of interest. The smoothing of variance (11) was performed analogously.
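A log–log regression of this kind can be sketched under the assumption of a power-law model v = a * t^b, where v is a variance estimate and t is the corresponding 2011 total for a category. This functional form is a common choice for variance smoothing but is an assumption here, as are the symbol names and the synthetic data pairs.

```python
import math

def fit_power_law(totals, variances):
    """Least-squares fit of log(v) = log(a) + b * log(t); returns (a, b)."""
    lx = [math.log(t) for t in totals]
    ly = [math.log(v) for v in variances]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    sxx = sum((x - mean_x) ** 2 for x in lx)
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def smoothed_variance(a, b, total):
    """Variance predicted by the fitted curve for a given total."""
    return a * total ** b

# Synthetic (total, variance) pairs lying exactly on v = 2 * t^(-1.5),
# one pair per category of the variable of interest.
totals = [50.0, 200.0, 1000.0, 5000.0, 20000.0]
variances = [2.0 * t ** -1.5 for t in totals]

a_hat, b_hat = fit_power_law(totals, variances)
v_smooth = smoothed_variance(a_hat, b_hat, 200.0)
```

Pooling all categories in one regression lets the very small categories borrow strength from the larger ones, which is the motivation for smoothing when direct variance estimates of tiny proportions are unstable.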
The compositions given by (15) and (16) assign more weight to estimators with a smaller variance. The first two numerical columns in Table 5, which correspond to cases (i)–(ii), indicate how much the composite estimators can improve the estimation accuracy of the calibrated and generalized difference estimators, respectively. The last column of the table contains the relative percent differences between the variance estimates of the two composite estimators, that is, between (17) and (18). The first composite estimator gives more satisfactory results for the proportions of smaller religious communities. However, the second composite estimator seems to improve the estimation accuracy of proportions of larger religious groups.
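The principle that the estimator with the smaller variance receives more weight can be illustrated with a generic variance-weighted combination. The weight formula below, the independence of the two component estimators, and all numeric values are assumptions for illustration; the exact expressions (15)–(18) are not restated in this section.

```python
def composite(theta_a, var_a, theta_b, var_b, weight_a=None):
    """Combine two estimates as weight_a * theta_a + (1 - weight_a) * theta_b.

    If weight_a is not given, it is set to var_b / (var_a + var_b), so the
    component with the smaller variance gets the larger weight. The returned
    variance assumes the two components are independent.
    """
    if weight_a is None:
        weight_a = var_b / (var_a + var_b)
    estimate = weight_a * theta_a + (1.0 - weight_a) * theta_b
    variance = weight_a ** 2 * var_a + (1.0 - weight_a) ** 2 * var_b
    return estimate, variance, weight_a

# Invented numbers for a small category: component A is noisier than B.
est, var, w = composite(theta_a=0.010, var_a=4e-6, theta_b=0.014, var_b=1e-6)

# Fixing the weight at 1 reproduces component A alone, as described above
# for the largest categories.
est_large, var_large, _ = composite(0.74, 2e-5, 0.80, 1e-5, weight_a=1.0)
```

In the toy case the composite variance is below both component variances, which is the kind of gain summarized by the relative percent differences in Table 5.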