A Probability Proportional to Size Estimation of a Rare Sensitive Attribute Using a Partial Randomized Response Model with Poisson Distribution

Abstract: In this paper, we propose a partial randomized response model based on the Poisson distribution to efficiently estimate a rare sensitive attribute under probability proportional to size (PPS) sampling when the population is composed of several different and sensitive clusters. We obtain estimators of the rare sensitive attribute, together with their variances and variance estimators, under PPS sampling and under two-stage equal probability sampling. We then compare the efficiency of two estimators of the rare sensitive attribute: one obtained via PPS sampling with replacement and the other obtained via two-stage equal probability sampling with replacement. The comparison confirms that the estimator obtained via PPS sampling with replacement is more efficient than the one obtained via two-stage equal probability sampling with replacement when the cluster sizes differ.


Introduction
In a survey on a socially and personally very sensitive topic, respondents who are asked a question directly tend to refuse to answer or to give a false answer. To solve this problem, ref. [1] proposed a randomized response model (RRM) that obtains sensitive information while protecting the identity or confidentiality of the respondent through an indirect response using a randomization device. Since then, many researchers have suggested various randomized response models to improve the quality of estimation.
Subsequently, refs. [2-4] organized, summarized and systematized the randomized response models, ref. [5] applied two-stage cluster sampling to a randomized response model, and ref. [6] sought to improve the practicality of the randomized response model by suggesting a version that uses PPS sampling. Meanwhile, the authors of [7] suggested an unrelated question randomized response method to estimate the mean number of participants with a rare sensitive attribute using the Poisson distribution. Examples of rare sensitive attributes include the proportion of people with AIDS who have persistent relationships with strangers, the proportion of people who witnessed murders, and the number of girls raped by their own fathers; examples of rare unrelated attributes include the proportion of people born exactly at 12 o'clock, the proportion of babies born blind, and the proportion of triplets delivered by women. Refs. [8,9] suggested stratified two-stage randomized response models for estimating a rare sensitive attribute under the Poisson distribution. Furthermore, ref. [10] proposed a partial randomized response model using the Poisson distribution, providing an alternative approach to estimating rare sensitive attributes through simple random estimation and stratified estimation. Their model demonstrated higher efficiency than Suman and Singh's model. However, this research also faces limitations when applied to actual surveys if the population is clustered. Therefore, when the population is clustered, applying Narjis and Shabbir's model, which is more efficient than Suman and Singh's model, is expected to offer a practical solution for estimating rare sensitive attributes in real surveys.
In this study, we propose a method for estimating rare sensitive attributes when the survey question is highly sensitive and the population is composed of clusters of varying sizes. We apply the probability proportional to size (PPS) sampling method, which assigns sampling probabilities in proportion to the size of the clusters, to the partial randomized response model of [10]. In Section 2, we first introduce the partial randomized response model and then propose estimation methods using PPS with replacement, PPS without replacement, and two-stage equal probability sampling. In Section 3, we compare the efficiency of the estimation methods, and finally, in Section 4, we present the conclusions and implications of the study.

PPS Estimation for a Rare Sensitive Attribute by Partial Randomized Response Model
This section considers the case in which the survey questions are very sensitive and the population is composed of $N$ clusters, each containing $M_i$ ($i = 1, 2, \ldots, N$) sub-units. A two-stage selection method is used: $n$ clusters are selected from the population with PPS or with equal probability, and then $m_i$ ($i = 1, 2, \ldots, n$) survey units are selected through simple random sampling in each selected cluster. This design is applied to the partial randomized response model using the Poisson distribution proposed by [10] in order to estimate a rare sensitive attribute.
In Section 2.1, we review Narjis and Shabbir's partial randomized response model, and in Section 2.2 we consider selecting the clusters via PPS sampling with replacement. Selection of clusters by PPS sampling without replacement is considered in Section 2.3, and selection by equal probability sampling is examined in Section 2.4.

Narjis and Shabbir's Partial Randomized Response Model
In the partial randomized response model, a sample of size $n$ is selected via simple random sampling with replacement from the population. Each individual selected in the sample uses two randomization devices ($R_1$, $R_2$) and is requested to report his/her response according to the following outcomes of the devices.
The first-stage randomization device $R_1$ consists of the following statements:
(1) I have the sensitive attribute $A$, with probability $T$.
(2) Go to the randomization device $R_2$, with probability $1 - T$.
The second-stage randomization device $R_2$ consists of the following statements:
(1) I have the sensitive attribute $A$.
(2) Forced answer "No".
(3) Draw one more card.
These statements appear with probabilities $P_1$, $P_2$ and $P_3$, respectively, where $\sum_{i=1}^{3} P_i = 1$. If statement (3) appears on the respondent's card, the respondent draws again without replacing that card. If statement (3) reappears in the second draw, the respondent is asked to report his/her actual status. The respondent answers "Yes" if his/her actual status matches the statement on the card and "No" if it does not.
The probability of getting a "Yes" from the respondent is given by:
where $k$ is the total number of cards in the randomization device $R_2$.
The maximum-likelihood estimator of $\lambda_0$ is given by:
$\hat{\lambda}_p =$
The variance of the estimator $\hat{\lambda}_p$ is given by:
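The response mechanism described above can be checked numerically. The following is a minimal Python sketch, not the authors' formula: `yes_probability` is a closed form derived here directly from the verbal description of the two devices, under the assumptions that statement (2) of $R_2$ is a forced "No" and that a repeated statement (3) leads to a truthful answer; `simulate_yes_rate` replays the same procedure with a physical deck. All function and parameter names are hypothetical.

```python
import random

def yes_probability(pi, T, n_attr, n_no, n_draw):
    """P('Yes') implied by the device description, with R2 given as card counts."""
    k = n_attr + n_no + n_draw            # total number of cards in R2
    p1, p3 = n_attr / k, n_draw / k       # first-draw probabilities of statements (1) and (3)
    # Second draw uses the remaining k - 1 cards. A "Yes" can only arise from the
    # sensitive statement or from a repeated statement (3); both are answered truthfully.
    second = (n_attr + n_draw - 1) / (k - 1) * pi
    via_r2 = p1 * pi + p3 * second        # forced "No" cards never yield a "Yes"
    return T * pi + (1 - T) * via_r2

def simulate_answer(has_attribute, T, deck, rng):
    """Play the two-stage device once and return True for a 'Yes' answer."""
    if rng.random() < T:                  # R1: answer the sensitive statement directly
        return has_attribute
    cards = list(deck)
    card = cards.pop(rng.randrange(len(cards)))
    if card == "DRAW":                    # draw again without replacing the first card
        card = cards.pop(rng.randrange(len(cards)))
        if card == "DRAW":                # statement (3) again: report actual status
            return has_attribute
    return has_attribute if card == "A" else False

def simulate_yes_rate(pi, T, n_attr, n_no, n_draw, trials, seed=0):
    """Monte Carlo estimate of P('Yes') over independent respondents."""
    rng = random.Random(seed)
    deck = ["A"] * n_attr + ["NO"] * n_no + ["DRAW"] * n_draw
    yes = 0
    for _ in range(trials):
        has_attribute = rng.random() < pi
        yes += simulate_answer(has_attribute, T, deck, rng)
    return yes / trials
```

For a deck of 10 cards split 5/3/2 with $T = 0.5$, comparing `yes_probability(pi, 0.5, 5, 3, 2)` with `simulate_yes_rate(pi, 0.5, 5, 3, 2, trials=100000)` gives a quick consistency check of the derivation against the simulated procedure.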

Estimation by PPS When PSUs Are Selected with Replacement
Suppose $n$ primary sampling units (PSUs) of size $M_i$ ($i = 1, 2, \ldots, n$) are selected with replacement from the population of $N$ clusters with selection probability $\varphi_i$, and secondary sampling units (SSUs) of size $m_i$ ($i = 1, 2, \ldots, n$) are selected from each chosen PSU using simple random sampling with replacement (SRSWR). We apply this two-stage sampling procedure to Narjis and Shabbir's partial randomized response model to estimate a rare sensitive attribute. Each person selected via the two-stage sampling procedure is requested to answer "Yes" or "No" using Narjis and Shabbir's randomization devices; the first and second randomization devices for the $i$th cluster are shown in Tables 1 and 2, respectively.
If Question 3 in randomization device $R_{2i}$ appears on the respondent's card, the respondent draws another card from $R_{2i}$ without replacing the first one. If Question 3 reappears in the second draw, the respondent is asked to report "Yes" or "No" according to his/her true response to the sensitive question. The notation for the first and second randomization devices in the $i$th cluster is as follows: $T_i$ is the selection probability of the rare sensitive question in $R_{1i}$; $\pi_i$ is the population proportion of the rare sensitive attribute; $P_{i1}$ is the selection probability of the rare sensitive question in $R_{2i}$; $P_{i2}$ is the selection probability of the forced answer "No" in $R_{2i}$; $P_{i3}$ is the selection probability of the statement "Draw one more card" in $R_{2i}$; and $k_i$ is the number of cards in the deck of $R_{2i}$.
The probability of a "Yes" answer from a respondent in cluster $i$ is given by
To clarify the response process, Figure 1 presents a flow chart for the probability of answering "Yes" in the $i$th cluster. Since the attribute $A_i$ in cluster $i$ is very rare in the population, we assume that $m_i \to \infty$ and $l_{i0} \to 0$ in such a way that $m_i l_{i0} = \lambda_{i0}$ remains finite.
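The rare-attribute limit used here is the classical Poisson approximation to the binomial. The Python sketch below illustrates it under the interpretation (an assumption of this example) that $l_{i0}$ is the per-respondent probability of a "Yes" in cluster $i$, so that the number of "Yes" answers among $m_i$ respondents is binomial and approaches a Poisson distribution with mean $\lambda_{i0} = m_i l_{i0}$. Function names are hypothetical.

```python
import math

def binomial_pmf(y, m, l):
    """P(Y = y) for Y ~ Binomial(m, l)."""
    return math.comb(m, y) * l**y * (1 - l) ** (m - y)

def poisson_pmf(y, lam):
    """P(Y = y) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**y / math.factorial(y)

def max_pmf_gap(m, lam, y_max=20):
    """Largest pointwise gap between Binomial(m, lam/m) and Poisson(lam) on 0..y_max."""
    l = lam / m                      # per-respondent "Yes" probability, so m * l = lam
    return max(
        abs(binomial_pmf(y, m, l) - poisson_pmf(y, lam))
        for y in range(y_max + 1)
    )
```

Holding $\lambda_{i0}$ fixed and increasing `m` shrinks the gap, for example when comparing `max_pmf_gap(100, 2.0)` with `max_pmf_gap(10000, 2.0)`.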
Let $y_{i1}, y_{i2}, \ldots, y_{im_i}$ be a random sample of $m_i$ observations from the Poisson distribution with parameter $\lambda_{i0}$ in cluster $i$. Then the estimator $\hat{\lambda}_i$ of $\lambda_i$, the parameter of a rare sensitive attribute of cluster $i$, is given by
$\hat{\lambda}_i =$
When respondents are selected via simple random sampling with replacement from the $i$th cluster, which is itself selected with replacement with sampling probability $\varphi_i$, the estimator $\hat{\lambda}_{ppzwr}$ of $\lambda$, the parameter of a rare sensitive attribute, is given by:
where
Theorem 1. The estimator $\hat{\lambda}_{ppzwr}$ is an unbiased estimator of the parameter $\lambda$.
Proof. Since $y_{ij} \sim$ iid $Po(\lambda_{i0})$ for each cluster and We have , where we can obtain
Theorem 2. The variance of $\hat{\lambda}_{ppzwr}$ is given by
Proof. By [11], we have Because $y_{ij} \sim$ iid $Po(\lambda_{i0})$, we have . Thus, we obtain the variance of $\hat{\lambda}_{ppzwr}$ as shown in (8).
Also, the estimator of $V(\hat{\lambda}_{ppzwr})$ is given by
On the other hand, when the sampling probabilities of the $n$ PSUs are proportional to the cluster sizes $M_i$, then $\varphi_i = M_i/M_0$, where $M_0 = \sum_{i=1}^{N} M_i$ is the total number of sub-units; this is called PPS sampling. When a sample of $n$ PSUs is selected via PPS sampling with replacement and $m_i$ SSUs are selected using simple random sampling with replacement from each PSU, the estimator $\hat{\lambda}_{ppswr}$ of $\lambda$ is as follows
And the variance of $\hat{\lambda}_{ppswr}$ and its estimator are, respectively, and
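To make the with-replacement design concrete, the following Python sketch uses the textbook Hansen-Hurwitz form of an unequal-probability estimator and its usual variance estimator. It is a sketch under stated assumptions, not a reproduction of the paper's equations: the target is assumed to be the size-weighted mean $\lambda = \sum_{i=1}^{N} M_i \lambda_i / M_0$, the cluster-level estimates $\hat{\lambda}_i$ are taken as given inputs, and all names are hypothetical.

```python
def hansen_hurwitz(sample_sizes, sample_lam_hat, sample_phi, total_size):
    """Hansen-Hurwitz estimate and variance estimate for PSUs drawn with replacement.

    sample_sizes   : M_i for each of the n draws (repeated clusters appear repeatedly)
    sample_lam_hat : cluster-level estimates lambda_hat_i for the same draws
    sample_phi     : selection probabilities phi_i for the same draws
    total_size     : M_0, the total number of sub-units in the population (assumed known)
    """
    n = len(sample_sizes)
    # z_i = M_i * lambda_hat_i / (M_0 * phi_i); each z_i estimates the assumed target.
    z = [
        M * lam / (total_size * phi)
        for M, lam, phi in zip(sample_sizes, sample_lam_hat, sample_phi)
    ]
    estimate = sum(z) / n
    # Standard with-replacement variance estimator: spread of the z_i around their mean.
    var_hat = sum((zi - estimate) ** 2 for zi in z) / (n * (n - 1))
    return estimate, var_hat

def pps_probabilities(sample_sizes, total_size):
    """Selection probabilities phi_i = M_i / M_0 used by PPS sampling."""
    return [M / total_size for M in sample_sizes]
```

With `sample_phi = pps_probabilities(sample_sizes, total_size)` every $z_i$ reduces to $\hat{\lambda}_i$, so under these assumptions the PPS estimate is simply the average of the sampled cluster estimates.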

Estimation by PPS When PSUs Are Selected without Replacement
Suppose $n$ PSUs of size $M_i$ ($i = 1, 2, \ldots, n$) are selected without replacement from the population of $N$ clusters with selection probability $\phi_i$, and the SSUs of size $m_i$ are selected from each chosen PSU via SRSWR. We apply this two-stage sampling procedure to Narjis and Shabbir's RRT to estimate a rare sensitive attribute.
The estimator $\hat{\lambda}_{ppswor}$ of $\lambda$, the parameter of a rare sensitive attribute, obtained using the above sampling procedure is given by
$\hat{\lambda}_{ppswor}$
where $\phi_i$ is the inclusion probability of survey unit $i$.
And the variance of $\hat{\lambda}_{ppswor}$ is given by:
where $\phi_{ij}$ is the joint inclusion probability of survey units $i$ and $j$. Also, the estimator of $V(\hat{\lambda}_{ppswor})$ is given by

Estimation via Two-Stage Equal Probability Sampling
Suppose $n$ PSUs of size $M_i$ ($i = 1, 2, \ldots, n$) are selected from the population of $N$ clusters by SRSWR, and the SSUs of size $m_i$ are again selected from each chosen PSU via SRSWR. We consider this two-stage equal probability sampling procedure for Narjis and Shabbir's RRT for estimating a rare sensitive attribute. The estimator $\hat{\lambda}_{wr}$ of $\lambda$, the parameter of a rare sensitive attribute, obtained using the above procedure is given by
$\hat{\lambda}_{wr}$
where $\bar{M} = M_0/N$.

Efficiency Comparisons for the PPS vs. Equal Probability Sampling
Narjis and Shabbir's RRT model was developed under the assumption of simple random sampling and stratified random sampling, and its efficiency was compared with that of the estimators of [9]. Therefore, it is reasonable to compare the existing estimator with the estimator proposed in this paper using Narjis and Shabbir's model. However, the increase in variance under cluster sampling relative to simple random sampling or stratified sampling is already covered in standard sampling textbooks. In this paper, therefore, for a population consisting of $N$ clusters, we compare the PPS with replacement estimator with the two-stage equal probability estimator.
Now, the difference between the variance (17) of two-stage equal probability sampling and the variance (11) of PPS with replacement sampling is given as follows under
In other words, if the cluster sizes are equal, the selection probability of PPS sampling with replacement becomes $1/N$ and is equal to that of two-stage equal probability sampling with replacement. Hence, the two estimators have the same efficiency.
If the cluster sizes $M_i$ are unequal, the first term on the right-hand side of (19), $\sum_{i=1}^{N} (M_i - \bar{M})^2 \lambda_i^2$, becomes large, whereas the second term on the right-hand side of (19) remains relatively small. Hence, estimation using PPS sampling with replacement is more efficient than estimation using two-stage equal probability sampling with replacement.
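The effect of unequal cluster sizes can be illustrated with a small Python sketch. It evaluates only the between-cluster component of the textbook Hansen-Hurwitz variance, treating the cluster parameters $\lambda_i$ as known and ignoring within-cluster sampling and randomized response noise, so it is a simplified stand-in for the variances referred to as (11) and (17) rather than a reproduction of them. The target $\lambda = \sum_i M_i \lambda_i / M_0$ and all names are assumptions of this example.

```python
def between_cluster_variance(sizes, lams, phi, n):
    """Between-PSU variance of the Hansen-Hurwitz estimator for n draws with replacement."""
    M0 = sum(sizes)
    lam = sum(M * l for M, l in zip(sizes, lams)) / M0   # assumed target parameter
    # Each draw yields z_i = M_i * lambda_i / (M0 * phi_i) with probability phi_i.
    return sum(
        p * (M * l / (M0 * p) - lam) ** 2
        for M, l, p in zip(sizes, lams, phi)
    ) / n

def compare_designs(sizes, lams, n):
    """Return (equal-probability variance, PPS variance) for the same population."""
    N, M0 = len(sizes), sum(sizes)
    v_equal = between_cluster_variance(sizes, lams, [1 / N] * N, n)
    v_pps = between_cluster_variance(sizes, lams, [M / M0 for M in sizes], n)
    return v_equal, v_pps
```

For equal sizes such as `compare_designs([100, 100, 100], [2.0, 2.1, 1.9], 2)` the two components coincide, while for an illustrative unequal configuration such as `compare_designs([50, 100, 350], [2.0, 2.1, 1.9], 2)` the PPS component is the smaller one.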
Table 3 summarizes the relationship between the estimators in a cluster sampling design. We now compare efficiency through a numerical example by calculating the relative efficiencies (RE) between different sampling methods, namely unequal probability sampling with replacement ($\hat{\lambda}_{ppzwr}$), PPS sampling with replacement ($\hat{\lambda}_{ppswr}$) and two-stage equal probability sampling with replacement ($\hat{\lambda}_{wr}$), for varying parameter combinations.
A value of RE$_1$ greater than one means that unequal probability sampling with replacement ($\hat{\lambda}_{ppzwr}$) is more efficient than two-stage equal probability sampling with replacement ($\hat{\lambda}_{wr}$); a value of RE$_2$ greater than one means that PPS sampling with replacement ($\hat{\lambda}_{ppswr}$) is more efficient than unequal probability sampling with replacement ($\hat{\lambda}_{ppzwr}$); and a value of RE$_3$ greater than one means that PPS sampling with replacement ($\hat{\lambda}_{ppswr}$) is more efficient than two-stage equal probability sampling with replacement ($\hat{\lambda}_{wr}$).
In order to compare the efficiency of the proposed estimators from numerical examples, we summarized the relative efficiencies according to various parameter values with their mean values.
From Table 4, it can be seen that for all the parametric combinations, the mean values of RE$_1$ are greater than one, which indicates that the unequal probability sampling with replacement estimator $\hat{\lambda}_{ppzwr}$ is more efficient than the two-stage estimator $\hat{\lambda}_{wr}$. The gain grows as the sensitive attribute value $\lambda$ decreases; in contrast, as $\lambda$ increases, the efficiency of $\hat{\lambda}_{ppzwr}$ decreases. In addition, the variation in RE$_1$ with respect to $k_i$ indicates that RE$_1$ increases as the selection probability $T_i$ increases. As shown in Table 5, the probability proportional to size estimator $\hat{\lambda}_{ppswr}$ is more efficient than the unequal probability sampling with replacement estimator $\hat{\lambda}_{ppzwr}$. Its efficiency increases as the sensitive attribute value $\lambda$ increases and, in contrast, decreases as $\lambda$ decreases.
As shown in Table 6, the probability proportional to size estimator $\hat{\lambda}_{ppswr}$ is more efficient than the two-stage sampling with replacement estimator $\hat{\lambda}_{wr}$. Its efficiency increases as the sensitive attribute value $\lambda$ decreases and, in contrast, decreases as $\lambda$ increases.
In summary, an examination of the efficiency of a partial randomized response model for rare sensitive attributes based on a cluster sampling design with numerical examples shows the following trends:
(1) Between ppzwr and wr, efficiency decreases as the rare sensitive attribute $\lambda$ increases (refer to Table 4).
(2) Between ppswr and ppzwr, efficiency increases as $\lambda$ increases, and efficiency is relatively low at specific values of $\lambda$ (refer to Table 5).
(3) Between ppswr and wr, efficiency increases as $\lambda$ decreases, similar to the relation between ppswr and ppzwr, where efficiency sharply increases at specific values of $\lambda$ (refer to Table 6).
(4) The number of cards $k_i$ does not significantly impact efficiency.

Conclusions
In this paper, for a population composed of several different and sensitive clusters, we suggest a randomized method for efficiently estimating a rare sensitive attribute by applying the PPS sampling method to the partial randomized response model of [10]. By applying PPS sampling and two-stage equal probability sampling, we obtain estimators of the rare sensitive attribute together with their variances and variance estimators. We compare the efficiency of two estimators of the rare sensitive attribute when the cluster sizes differ: one obtained using PPS sampling with replacement and the other obtained using two-stage equal probability sampling with replacement. The comparison confirms that the estimator obtained using PPS sampling with replacement is more efficient than the estimator based on two-stage equal probability sampling with replacement when the cluster sizes differ from each other.

Figure 1. Response flow using partial randomization device for the ith cluster.

Table 3. The relationship between different estimators for cluster sampling.

Table 4. The mean values of RE$_1$ for $\hat{\lambda}_{ppzwr}$ vs. $\hat{\lambda}_{wr}$.

Table 6. The mean values of RE$_3$ for $\hat{\lambda}_{ppswr}$ vs. $\hat{\lambda}_{wr}$.