1. Introduction
In surveys on socially or personally sensitive topics, respondents who are asked a question directly tend to refuse to answer or to give a false answer. To solve this problem, ref. [1] proposed a randomized response model (RRM) that obtains sensitive information while protecting the identity and confidentiality of the respondent through an indirect response generated with a randomization device. Since then, many researchers have suggested various randomized response models to improve the quality of estimation.
Subsequently, refs. [2,3,4] organized, summarized, and systematized the randomized response models, ref. [5] applied two-stage cluster sampling to a randomized response model, and ref. [6] improved the practicality of the randomized response model by proposing a version based on PPS sampling. Meanwhile, the authors of [7] suggested an unrelated question randomized response method to estimate the mean number of persons with a rare sensitive attribute using the Poisson distribution. Examples of rare sensitive attributes include the proportion of people with AIDS who have persistent relationships with strangers, the proportion of people who witnessed murders, and the number of girls raped by their own fathers. Examples of rare unrelated attributes include the proportion of people born exactly at 12 o’clock, the proportion of babies born blind, and the proportion of triplets delivered by women. Refs. [8,9] suggested stratified two-stage randomized response models for estimating a rare sensitive attribute under the Poisson distribution.
Furthermore, ref. [10] proposed a partial randomized response model using the Poisson distribution, providing an alternative approach to estimating rare sensitive attributes under simple random sampling and stratified sampling. Their model demonstrated higher efficiency than Suman and Singh’s model. However, it is also difficult to apply in actual surveys when the population is clustered. In that case, extending Narjis and Shabbir’s model, which is more efficient than Suman and Singh’s model, to cluster sampling is expected to offer a practical solution for estimating rare sensitive attributes in real surveys.
In this study, we propose a method for estimating rare sensitive attributes when the survey question is highly sensitive and the population is composed of clusters of varying sizes. We apply probability proportional to size (PPS) sampling, which assigns sampling probabilities in proportion to the size of the clusters, to the partial randomized response model of [10]. In Section 2, we first introduce the partial randomized response model and then propose estimation methods using PPS with replacement, PPS without replacement, and two-stage equal probability sampling. In Section 3, we compare the efficiency of the estimation methods, and in Section 4, we present the conclusions and implications of the study.
2. PPS Estimation for a Rare Sensitive Attribute by Partial Randomized Response Model
This section considers the case in which the survey questions are very sensitive and the population is composed of N clusters, each containing a number of sub-units. A two-stage selection method is used: n clusters are selected with PPS or with equal probability from the population, and survey units are then selected through simple random sampling within each selected cluster. This design is applied to the partial randomized response model using the Poisson distribution proposed by [10] in order to estimate a rare sensitive attribute.
In Section 2.1, we review Narjis and Shabbir’s partial randomized response model. Section 2.2 considers selecting clusters via PPS sampling with replacement, Section 2.3 considers PPS sampling without replacement, and Section 2.4 examines equal probability sampling.
2.1. Narjis and Shabbir’s Partial Randomized Response Model
In the partial randomized response model, a sample of size n is selected via simple random sampling with replacement from the population. Each selected individual uses two randomization devices and is asked to report a response according to the outcomes of the devices, as follows.
The first-stage randomization device consists of the following statements:
- (1)
I have the sensitive attribute A with probability T.
- (2)
Go to the second-stage randomization device, with probability 1 − T.
The second-stage randomization device consists of the following statements:
- (1)
I have the sensitive attribute A.
- (2)
Forced to say No.
- (3)
Draw one more card.
These statements are selected with probabilities $P_1$, $P_2$, and $P_3$, respectively, where $P_1 + P_2 + P_3 = 1$.
If statement (3) appears on the respondent’s card, the respondent draws a second card without replacing the first one. If statement (3) reappears in the second draw, the respondent is asked to report his/her actual status. The respondent answers “Yes” (or “No”) if his/her actual status matches (or does not match) the statement on the card.
The probability of getting a “Yes” from the respondent is given by:
where k is the total number of cards in the second-stage randomization device.
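The card-drawing procedure described above can be checked numerically. The following is a minimal sketch and is not claimed to reproduce Equation (1): it assumes that the second-stage deck holds k cards, of which `n1`, `n2`, and `n3` carry statements (1), (2), and (3), that the second draw is made from the remaining k − 1 cards, and that π denotes the population proportion having attribute A. The function names, the card counts, and the symbol π are illustrative assumptions. Under these assumptions the probability of a “Yes” factors as c·π, and the sketch computes c both in closed form and by enumerating every ordered pair of draws.

```python
from fractions import Fraction
from itertools import permutations


def yes_coefficient(T, k, n1, n2, n3):
    """Closed-form c with P(Yes) = c * pi, under the assumed deck layout."""
    if n1 + n2 + n3 != k or k < 2:
        raise ValueError("card counts must sum to k, and k must be at least 2")
    p1 = Fraction(n1, k)  # first draw shows statement (1)
    p3 = Fraction(n3, k)  # first draw shows statement (3): draw again
    # One statement-(3) card is gone; a truthful answer follows if the second
    # card shows statement (1) or statement (3) again.
    truthful_on_redraw = Fraction(n1 + n3 - 1, k - 1)
    return T + (1 - T) * (p1 + p3 * truthful_on_redraw)


def yes_coefficient_by_enumeration(T, k, n1, n2, n3):
    """Same coefficient, obtained by listing all ordered two-card draws."""
    deck = [1] * n1 + [2] * n2 + [3] * n3
    truthful = 0
    total = 0
    for first, second in permutations(deck, 2):  # draws without replacement
        total += 1
        if first == 1 or (first == 3 and second in (1, 3)):
            truthful += 1
    return T + (1 - T) * Fraction(truthful, total)


T = Fraction(1, 2)
c = yes_coefficient(T, k=10, n1=6, n2=2, n3=2)
assert c == yes_coefficient_by_enumeration(T, k=10, n1=6, n2=2, n3=2)
print(c)  # 79/90
```

Exact fractions are used so that the closed form and the enumeration can be compared without rounding error.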
As before, assuming that
and
, then
(finite). Equation (
1) can be rewritten as
Let be a random sample of n observations from the Poisson distribution with parameter .
The maximum-likelihood estimator of
is given by:
The variance of the estimator
is given by:
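As a hedged illustration of this estimation step, the sketch below assumes that the observations are independent Poisson counts with mean c·λ, where c is the known coefficient linking the probability of a “Yes” to the sensitive proportion and λ is the parameter of the rare sensitive attribute. Under that assumption the maximum-likelihood estimator is the sample mean divided by c, and its variance is λ/(nc). The names `lambda_mle`, `lambda_mle_variance`, and `c` are hypothetical and are not taken from the source.

```python
def lambda_mle(observations, c):
    """MLE of lambda when each observation is Poisson with mean c * lambda."""
    n = len(observations)
    if n == 0 or c <= 0:
        raise ValueError("need at least one observation and a positive c")
    return sum(observations) / (n * c)


def lambda_mle_variance(lam, n, c):
    """Variance of the MLE: Var(sum y) / (n c)^2 = n c lam / (n c)^2."""
    return lam / (n * c)


# Illustrative data only (not from the paper).
y = [0, 1, 0, 2, 1]
print(lambda_mle(y, c=0.8))
print(lambda_mle_variance(lam=1.0, n=len(y), c=0.8))
```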
2.2. Estimation by PPS When PSUs Are Selected with Replacement
Suppose
n primary sampling units (PSUs) of size
have been selected from the population of
N clusters with selection probability
with replacement and the secondary sampling units (SSUs) of
size are selected from each chosen primary unit using SRSWR. We apply the two-stage sampling procedure to Narjis and Shabbir’s partial randomized response model to estimate a rare sensitive attribute. Each person selected via the two-stage sampling procedure is asked to answer “Yes” or “No” using Narjis and Shabbir’s randomization devices for the ith cluster, as described in Table 1 (first randomization device) and Table 2 (second randomization device).
If Question 3 of the second randomization device appears on the respondent’s card, the respondent draws another card from the same device without replacing the first card. If Question 3 reappears in the second draw, the respondent is asked to report “Yes” or “No” according to his/her true response to the sensitive question.
For the first and second randomization devices of the ith cluster, denoted here $R_{1i}$ and $R_{2i}$, $T_i$ is the selection probability of the rare sensitive question in $R_{1i}$, $\pi_i$ is the population proportion of the rare sensitive attribute in the ith cluster, and $P_{1i}$ is the selection probability of the rare sensitive question in $R_{2i}$. Further, $P_{2i}$ is the selection probability of the forced answer “No” in $R_{2i}$, $P_{3i}$ is the selection probability of the statement “Draw one more card” in $R_{2i}$, and $k_i$ is the number of cards in the deck of $R_{2i}$.
The probability of answering “Yes” from the respondent in cluster
i is given by
To clarify the response process, Figure 1 presents a flow chart of the probability of answering “Yes” in the ith cluster.
Since the attribute in cluster i is very rare in the population, if we assume and , then (finite).
Let
be a random sample of
observations from the Poisson distribution with parameter
in cluster
i, then the estimator
of
, the parameter of a rare sensitive attribute of cluster
i, is given by
When respondents are selected via simple random sampling with replacement from the
ith cluster, which was selected with replacement using sampling probability
, the estimator
of the parameter λ of the rare sensitive attribute is given by:
where
.
Theorem 1. The estimator is an unbiased estimator of the parameter λ.
Proof. Since
for each cluster and
We have
where
we can obtain
□
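The unbiasedness argument can be illustrated with a small exact calculation. The sketch below is a Hansen-Hurwitz-type estimator under two assumptions that are not confirmed by the source: that λ is the sum of the cluster-level parameters, and that the cluster-level estimators are replaced by their expectations so that only the first-stage selection is examined. The function names and the numerical values are illustrative.

```python
from fractions import Fraction


def hansen_hurwitz(sampled_estimates, sampled_probs):
    """Mean of estimate_i / p_i over the n with-replacement draws."""
    n = len(sampled_estimates)
    return sum(e / p for e, p in zip(sampled_estimates, sampled_probs)) / n


def expected_single_draw(cluster_values, selection_probs):
    """E[value_i / p_i] for one draw: sum over clusters of p_i * value_i / p_i."""
    return sum(p * (v / p) for v, p in zip(cluster_values, selection_probs))


# Illustrative cluster parameters and selection probabilities (sum to 1).
lambdas = [Fraction(1), Fraction(2), Fraction(3)]
probs = [Fraction(1, 6), Fraction(1, 3), Fraction(1, 2)]

# Each draw has expectation equal to the total, so the mean of n draws does too.
assert expected_single_draw(lambdas, probs) == sum(lambdas)

# One possible sample: clusters 0 and 2 drawn.
print(hansen_hurwitz([lambdas[0], lambdas[2]], [probs[0], probs[2]]))  # 6
```

Because these illustrative selection probabilities happen to be proportional to the cluster values, every sample returns the total exactly, which shows why size-proportional selection can reduce variance.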
Theorem 2. The variance of is given by
Proof. By [11], we have
where
and
Because
, we have
Thus, we determine the variance of as shown in (8). □
Also, the estimator of
is given by
On the other hand, when the sampling probabilities of
n PSUs are proportional to each cluster size
, then
, which is called PPS sampling. When a sample of
n PSUs is selected via PPS sampling with replacement and
SSUs are selected using simple random sampling with replacement from each PSU, the estimator
of
is as follows
And the variance of
and its estimator are, respectively,
and
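A minimal sketch of the PPS special case follows. It assumes that the selection probability of a cluster is its size divided by the total population size, so that a Hansen-Hurwitz-type estimator reduces to the total size divided by n, multiplied by the sum of the cluster estimates divided by their sizes. The names `pps_probabilities` and `pps_wr_estimate` and all numbers are hypothetical.

```python
from fractions import Fraction


def pps_probabilities(cluster_sizes):
    """Selection probability of each cluster, proportional to its size."""
    total = sum(cluster_sizes)
    return [Fraction(size, total) for size in cluster_sizes]


def pps_wr_estimate(sampled_estimates, sampled_sizes, total_size):
    """(total_size / n) * sum(estimate_i / size_i) over the sampled clusters."""
    n = len(sampled_estimates)
    ratio_sum = sum(e / s for e, s in zip(sampled_estimates, sampled_sizes))
    return total_size * ratio_sum / n


sizes = [100, 300, 600]
print(pps_probabilities(sizes))  # probabilities 1/10, 3/10, 3/5

# With equal cluster sizes, every cluster has probability 1/N.
assert pps_probabilities([50, 50, 50, 50]) == [Fraction(1, 4)] * 4

# Two sampled clusters with illustrative cluster-level estimates.
print(pps_wr_estimate([Fraction(2), Fraction(9)], [100, 300], sum(sizes)))  # 25
```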
2.3. Estimation by PPS When PSUs Are Selected without Replacement
Suppose n PSUs have been selected without replacement from the population of N clusters with unequal selection probabilities, and the SSUs are then selected from each chosen PSU via SRSWR. We apply this two-stage sampling procedure to Narjis and Shabbir’s partial randomized response model to estimate a rare sensitive attribute.
The estimator
of
, the parameter of a rare sensitive attribute obtained using the above sampling procedure is given by
where
is the inclusion probability of survey unit
i.
And the variance of
is given by:
where
is the joint inclusion probability of survey units
i and
j.
Also, the estimator of
is given by
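Estimators built from inclusion probabilities are commonly of the Horvitz-Thompson form, in which each sampled cluster contributes its value divided by its inclusion probability. The sketch below illustrates that form on a hypothetical without-replacement design with three clusters and samples of size two. It assumes that the target is the sum of the cluster parameters and uses the true cluster values in place of the cluster-level estimators; it is not claimed to be the exact estimator of the paper.

```python
from fractions import Fraction


def inclusion_probabilities(design, num_clusters):
    """First-order inclusion probabilities from a list of (sample, probability)."""
    incl = [Fraction(0)] * num_clusters
    for sample, prob in design:
        for i in sample:
            incl[i] += prob
    return incl


def horvitz_thompson(sample, values, incl):
    """Sum of value_i / inclusion_probability_i over the sampled clusters."""
    return sum(values[i] / incl[i] for i in sample)


# Hypothetical unequal-probability design: every possible sample of two
# clusters out of three, with its selection probability.
design = [
    ((0, 1), Fraction(1, 6)),
    ((0, 2), Fraction(1, 3)),
    ((1, 2), Fraction(1, 2)),
]
values = [Fraction(1), Fraction(2), Fraction(3)]

incl = inclusion_probabilities(design, num_clusters=3)
expectation = sum(p * horvitz_thompson(s, values, incl) for s, p in design)
assert expectation == sum(values)  # unbiased for the total under this design
print(incl)
```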
2.4. Estimation via Two-Stage Equal Probability Sampling
Suppose
n PSUs of size
have been selected from the population of
N clusters by SRSWR and the SSUs of size
are selected again from each chosen PSU via SRSWR. We apply the two-stage equal probability sampling procedure to Narjis and Shabbir’s partial randomized response model to estimate a rare sensitive attribute. The estimator
of
, the parameter of a rare sensitive attribute, obtained using the above procedure is given by
where
.
and
where
.
3. Efficiency Comparisons for the PPS vs. Equal Probability Sampling
Narjis and Shabbir’s partial randomized response model was developed under the assumption of simple random sampling and stratified random sampling, and its efficiency was compared with that of the estimators of [9]. It would therefore be reasonable to compare that existing estimator with the estimators proposed in this paper. However, the increase in variance under cluster sampling relative to simple random sampling or stratified sampling is already covered in standard sampling textbooks. In this paper, for a population consisting of N clusters, we therefore compare the PPS with replacement estimator with the two-stage equal probability estimator.
Now, the difference between the variance (17) of two-stage equal probability sampling and the variance (11) of PPS with replacement sampling is given as follows under
In (19), the difference vanishes when all cluster sizes are equal. In that case, the selection probability of PPS sampling with replacement becomes 1/N for every cluster, which equals that of two-stage equal probability sampling with replacement. Hence, the two designs have the same efficiency.
If the cluster sizes are unequal, the first term on the right-hand side of (19) becomes large, while the second term remains relatively small. Hence, estimation using PPS sampling with replacement is more efficient than estimation using two-stage equal probability sampling with replacement.
The relationships among the estimators in a cluster sampling design are summarized as follows.
Now, we compare efficiency by calculating the relative efficiencies (RE) between different sampling methods, namely unequal probability sampling with replacement (:ppzwr), PPS sampling with replacement (:ppswr), and two-stage equal probability sampling with replacement (:wr), for varying parameter combinations in a numerical example.
Three relative efficiencies are considered. For the first, a value greater than one means that unequal probability sampling with replacement (:ppzwr) is more efficient than two-stage equal probability sampling with replacement (:wr). For the second, a value greater than one means that PPS sampling with replacement (:ppswr) is more efficient than unequal probability sampling with replacement (:ppzwr). For the third, a value greater than one means that PPS sampling with replacement (:ppswr) is more efficient than two-stage equal probability sampling with replacement (:wr).
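For concreteness, a relative efficiency of this kind can be computed as a ratio of variances. The sketch below assumes the convention that the variance of the baseline design is divided by the variance of the candidate design, so that a value above one favors the candidate. The function name and the variances are illustrative and are not results from the paper.

```python
def relative_efficiency(var_baseline, var_candidate):
    """Variance ratio; a value above 1 means the candidate design is more efficient."""
    if var_baseline <= 0 or var_candidate <= 0:
        raise ValueError("variances must be positive")
    return var_baseline / var_candidate


# Illustrative variances only.
re = relative_efficiency(var_baseline=4.0, var_candidate=2.5)
print(re > 1)  # True: the candidate design has the smaller variance
```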
In calculating the REs, we set the parameters for the ith cluster as follows.
10,000; ; ; ; ,
; ; ; ; ,
;
, , , ;
;
, , .
We also assume the selection probabilities for the ith cluster as follows.
varying from 0.2 to 0.8 by 0.2.
In order to compare the efficiency of the proposed estimators from numerical examples, we summarized the relative efficiencies according to various parameter values with their mean values.
From
Table 3, it can be seen that for all the parametric combinations, the mean values of
are greater than one, which indicates that the unequal probability sampling with replacement estimator
is more efficient than the two-stage estimator,
, as the sensitive attribute value
decreases, and in contrast, if sensitive attribute
increases, then the efficiency of
decreases. In addition, the variation in
with respect to
indicates that the
increases as the values of selection probability
increase.
As shown in
Table 4, the probability proportional to size estimator,
, is more efficient than the unequal probability sampling with replacement estimator,
. As the sensitive attribute value
increases, and in contrast, as
decreases, the probability proportional estimator decreases in efficiency.
As shown in
Table 5, the probability proportional to size estimator,
, is more efficient than the two-stage sampling with replacement estimator,
. As the sensitive attribute value
decreases, and in contrast, as
decreases, the probability proportional estimator decreases in efficiency.
In summary, an examination of the efficiency of a partial randomized response model for rare sensitive attributes based on a cluster sampling design with numerical examples shows the following trends:
- (1)
Between
and
, efficiency decreases as a rare sensitive attribute
increases (refer to
Table 4).
- (2)
Between
and
, efficiency increases as
increases, and efficiency is relatively low at specific values of
(refer to
Table 5).
- (3)
Between
and
, efficiency increases as
decreases, similar to the relation between
and
, where efficiency sharply increases at specific values of
(refer to
Table 6).
- (4)
The number of cards does not significantly impact efficiency.