An Estimation of Sensitive Attribute Applying Geometric Distribution under Probability Proportional to Size Sampling

In this paper, we extended Yennum et al.'s model, in which geometric distribution is used as a randomization device for a population that consists of different-sized clusters, and clusters are obtained by probability proportional to size (PPS) sampling. Estimators of a sensitive parameter, their variances, and their variance estimators are derived under PPS sampling and equal probability two-stage sampling, respectively. We also applied these sampling schemes to Yennum et al.'s generalized model. Numerical studies were carried out to compare the efficiencies of the proposed sampling methods for each case of Yennum et al.'s model and Yennum et al.'s generalized model.


Introduction
The randomized response model (RRM) was suggested by Warner [1] to estimate the true population proportion of sensitive characteristics, such as illegal gambling, drug abuse, tax evasion, the extent of illegal income, and the experience of abortion, among others [2][3][4].
Since Warner's work, many scholars have developed the RRM in various ways. The authors of [5,6] arranged, summarized, and systematized various RRMs and emphasized their importance. In [7], two-stage cluster sampling was applied to the RRM for surveying sensitive attributes in a population consisting of equal-sized clusters, and [8] considered the cluster RRM for a population consisting of different-sized clusters, where the clusters are selected by probability proportional to size (PPS) sampling.
Recently, Yennum et al. [9] suggested a new randomization device to gather sensitive data in two stages under the assumption of a geometric distribution and generalized their model to the generalized geometric distribution using the model of [10]. Yennum et al.'s work assumes that the respondents are selected by simple random sampling with replacement, but a real survey selects respondents through various sampling schemes, often from a large sample of clusters. For example, to estimate the true population proportion of drug abuse among high school students, it is possible to use a randomization device like Yennum et al.'s model with sampling proportional to size by taking the school as the primary sampling unit and the students as the secondary sampling units.
From this point of view, we extend Yennum et al.'s model, in which geometric distribution is used as a randomization device based on a population that consists of different-sized clusters, and the clusters are selected by PPS sampling. Estimators of a sensitive parameter, their variances, and their variance estimators are derived by PPS sampling and equal probability two-stage sampling, respectively.
We also apply these methods to the case of Yennum et al.'s generalized model.

An Estimation of Sensitive Attributes with Probability Proportional to Size Sampling under Yennum et al.'s Model
In Section 2, we consider a new sampling scheme to estimate sensitive attributes using Yennum et al.'s model, in which geometric distribution is used as a randomization device when $n$ clusters are selected with probability proportional to size (PPS) sampling or equal probability sampling from a population that consists of $N$ clusters of size $M_i$ $(i = 1, 2, \cdots, N)$, and $m_i$ $(i = 1, 2, \cdots, n)$ units are selected by simple random sampling from each sampled cluster.
In Section 2.1, we consider the sampling method for the clusters via PPS sampling with replacement. Clusters by PPS sampling without replacement are considered in Section 2.2, and clusters by equal probability sampling are examined in Section 2.3.

PPS Sampling with Replacement
Let the population be composed of $N$ clusters. In the first stage, a sample of $n$ first sampling units (FSUs) is selected with replacement, with selection probability $p_i$ for the $i$th cluster. In the second stage, $m_i$ second sampling units (SSUs) are drawn by simple random sampling with replacement (SRSWR) from each FSU and are guided to carry out Yennum et al.'s randomization device.
The randomization device consists of two stages. The first randomization device for the $i$th cluster is an urn containing white and black balls, where the selection probability of a white ball is $W_i$, and the selection probability of a black ball is $1 - W_i$.
The second randomization device is composed of two urns. In the first urn, balls bearing the statement "I have a sensitive attribute" are selected with probability $P_i$, and balls bearing the statement "I do not have a sensitive attribute" are selected with probability $1 - P_i$. In the second urn, balls bearing the statement "I do not have a sensitive attribute" are selected with probability $T_i$, and balls bearing the statement "I have a sensitive attribute" are selected with probability $1 - T_i$.
In the first stage, for the $i$th cluster, each interviewee draws a ball from the first randomization device, that is, the urn with the white and black balls. When he or she selects a white ball, he or she is guided to pick balls from the first urn of the second randomization device, one after another, with replacement, until the first ball containing a statement matching his or her own status appears.
We assume that $X_{i1}$ is the total number of balls drawn until he or she obtains the first ball with his or her own status of having a sensitive attribute in the $i$th cluster, and $X_{i2}$ is the total number of balls drawn until he or she obtains the first ball with his or her own status of not having a sensitive attribute in the $i$th cluster. Similarly, when he or she draws a black ball, he or she is guided to pick balls from the second urn of the second randomization device, one after another, with replacement, until the first ball containing a statement matching his or her own status appears; the total number of balls drawn is $Y_{i1}$ for an interviewee who does not have the sensitive attribute and $Y_{i2}$ for an interviewee who has it.
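The two-stage device described above can be sketched as a small simulation. This is only an illustrative sketch: the function names `draw_until_match` and `simulate_response` are hypothetical, and it assumes that the reported count includes the final matching ball and that all selection probabilities lie strictly between 0 and 1 in practice.

```python
import random


def draw_until_match(p_match, rng):
    """Count draws (with replacement) up to and including the first matching ball.

    Assumption: the reported number includes the matching ball itself.
    """
    count = 1
    while rng.random() >= p_match:
        count += 1
    return count


def simulate_response(has_attribute, W, P, T, rng):
    """Return the number of balls one interviewee reports.

    W: probability of a white ball in the first-stage urn.
    P: probability of an "I have a sensitive attribute" ball in the first urn.
    T: probability of an "I do not have a sensitive attribute" ball in the second urn.
    """
    if rng.random() < W:
        # White ball: use the first urn of the second device.
        p_match = P if has_attribute else 1.0 - P
    else:
        # Black ball: use the second urn of the second device.
        p_match = (1.0 - T) if has_attribute else T
    return draw_until_match(p_match, rng)


if __name__ == "__main__":
    rng = random.Random(2020)
    answers = [simulate_response(True, 0.7, 0.7, 0.6, rng) for _ in range(5)]
    print(answers)
```

The interviewer only sees the reported count, not the colour of the first ball or the urn used, which is what protects the respondent's status.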
For the $i$th cluster, using the randomization device in Figure 1, the total numbers of balls taken by interviewees, $X_{i1}$, $X_{i2}$, $Y_{i1}$, and $Y_{i2}$, follow geometric distributions. Let $\pi_i$ and $1 - \pi_i$ be the true population proportions of persons who have a sensitive attribute ($A_i$ and $A_i^c$, respectively) for the $i$th cluster. Assume that each interviewee in the $i$th cluster is drawn by SRSWR.
For the $i$th cluster, the total number of balls selected by an interviewee through the proposed two-stage device is distributed as one of the following random variables:

$$X_{i1} \sim Ge(P_i), \quad X_{i2} \sim Ge(1 - P_i), \quad Y_{i1} \sim Ge(T_i), \quad Y_{i2} \sim Ge(1 - T_i),$$

where $Ge(\cdot)$ represents the geometric distribution with the given success probability.
Let $Z_{ij}$ be the $j$th observed answer in the $i$th cluster; $Z_{ij}$ can be expressed as

$$Z_{ij} = \begin{cases} X_{i1} & \text{with probability } W_i \pi_i, \\ X_{i2} & \text{with probability } W_i (1 - \pi_i), \\ Y_{i1} & \text{with probability } (1 - W_i)(1 - \pi_i), \\ Y_{i2} & \text{with probability } (1 - W_i)\pi_i. \end{cases} \tag{1}$$

Since a $Ge(p)$ variable that counts all draws up to and including the first match has mean $1/p$, we can find the expected value of $Z_{ij}$ as follows:

$$E(Z_{ij}) = \frac{W_i \pi_i}{P_i} + \frac{W_i (1 - \pi_i)}{1 - P_i} + \frac{(1 - W_i)(1 - \pi_i)}{T_i} + \frac{(1 - W_i)\pi_i}{1 - T_i}. \tag{2}$$

The expected value (2) can be expressed as follows:

$$E(Z_{ij}) = \pi_i \left[ \frac{W_i (1 - 2P_i)}{P_i (1 - P_i)} + \frac{(1 - W_i)(2T_i - 1)}{T_i (1 - T_i)} \right] + \frac{W_i}{1 - P_i} + \frac{1 - W_i}{T_i}. \tag{3}$$

Now the estimator $\hat{\pi}_i$ for the true population proportion $\pi_i$ in the $i$th cluster is given by:

$$\hat{\pi}_i = \frac{\bar{Z}_i - \dfrac{W_i}{1 - P_i} - \dfrac{1 - W_i}{T_i}}{\dfrac{W_i (1 - 2P_i)}{P_i (1 - P_i)} + \dfrac{(1 - W_i)(2T_i - 1)}{T_i (1 - T_i)}}, \qquad \bar{Z}_i = \frac{1}{m_i} \sum_{j=1}^{m_i} Z_{ij}. \tag{4}$$

When the interviewees are drawn by SRSWR from the $i$th cluster selected with replacement by the sampling probability $p_i$, the estimator $\hat{\pi}_{ppswr}$ of the true population proportion $\pi = \sum_{i=1}^{N} M_i \pi_i / M_0$ for a sensitive character, where $M_0 = \sum_{i=1}^{N} M_i$, is given by:

$$\hat{\pi}_{ppswr} = \frac{1}{n M_0} \sum_{i=1}^{n} \frac{M_i \hat{\pi}_i}{p_i}. \tag{5}$$
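As a hedged illustration of the cluster-level estimator, the sketch below inverts the expected response to recover $\pi_i$ from the mean reported count. It assumes that a geometric count with success probability $p$ has mean $1/p$, so that $E(Z_{ij}) = \pi_i[W_i/P_i + (1-W_i)/(1-T_i)] + (1-\pi_i)[W_i/(1-P_i) + (1-W_i)/T_i]$; the function names are hypothetical.

```python
def expected_response(pi, W, P, T):
    """Expected reported count, assuming a geometric count has mean 1/p."""
    with_attr = W / P + (1.0 - W) / (1.0 - T)
    without_attr = W / (1.0 - P) + (1.0 - W) / T
    return pi * with_attr + (1.0 - pi) * without_attr


def estimate_cluster_proportion(z_values, W, P, T):
    """Method-of-moments estimate of pi_i from the reported counts in one cluster."""
    z_bar = sum(z_values) / len(z_values)
    intercept = W / (1.0 - P) + (1.0 - W) / T  # E(Z) when pi_i = 0
    slope = (W * (1.0 - 2.0 * P) / (P * (1.0 - P))
             + (1.0 - W) * (2.0 * T - 1.0) / (T * (1.0 - T)))
    if slope == 0.0:
        raise ValueError("W, P, T make the response uninformative about pi_i")
    return (z_bar - intercept) / slope


if __name__ == "__main__":
    # Feeding the exact expected count back in recovers the assumed proportion.
    mean_count = expected_response(0.3, 0.7, 0.7, 0.6)
    print(estimate_cluster_proportion([mean_count], 0.7, 0.7, 0.6))
```

The guard on `slope` reflects that some combinations of $W_i$, $P_i$, and $T_i$ make the mean count independent of $\pi_i$, in which case the estimator is undefined.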

Theorem 1:
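The estimator $\hat{\pi}_{ppswr}$ of the true population proportion of a sensitive attribute $\pi$ under the PPS with replacement sampling scheme is an unbiased estimator.
Proof: Let $E_1$ and $E_2$ denote the expectations over the first and second stages, respectively. Then:

$$E(\hat{\pi}_{ppswr}) = E_1 \left[ \frac{1}{n M_0} \sum_{i=1}^{n} \frac{M_i}{p_i} E_2(\hat{\pi}_i) \right],$$

and since:

$$E_2(\hat{\pi}_i) = \pi_i,$$

we can obtain:

$$E(\hat{\pi}_{ppswr}) = \frac{1}{M_0} \sum_{i=1}^{N} p_i \frac{M_i \pi_i}{p_i} = \frac{1}{M_0} \sum_{i=1}^{N} M_i \pi_i = \pi.$$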
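The unbiasedness argument can be checked numerically for the first stage. The sketch below assumes the usual Hansen-Hurwitz form $\hat{\pi}_{ppswr} = (nM_0)^{-1}\sum_{i=1}^{n} M_i\hat{\pi}_i/p_i$, replaces each $\hat{\pi}_i$ by its expectation $\pi_i$, and averages a single draw over all clusters; the function name and the toy population are hypothetical.

```python
def pps_wr_estimate(cluster_estimates, cluster_sizes, selection_probs, total_size):
    """Hansen-Hurwitz type estimate from n clusters drawn with replacement."""
    n = len(cluster_estimates)
    weighted = sum(M * est / p for est, M, p
                   in zip(cluster_estimates, cluster_sizes, selection_probs))
    return weighted / (n * total_size)


if __name__ == "__main__":
    sizes = [1000, 2000, 3000, 4000]      # hypothetical cluster sizes M_i
    true_pi = [0.1, 0.2, 0.3, 0.4]        # hypothetical cluster proportions pi_i
    probs = [0.4, 0.3, 0.2, 0.1]          # any selection probabilities summing to 1
    M0 = sum(sizes)
    population_pi = sum(M * pi for M, pi in zip(sizes, true_pi)) / M0

    # Expectation of the n = 1 estimator over the first-stage draw.
    expectation = sum(p * pps_wr_estimate([pi], [M], [p], M0)
                      for p, M, pi in zip(probs, sizes, true_pi))
    print(population_pi, expectation)
```

The two printed values agree up to floating-point rounding even though the selection probabilities here are deliberately not proportional to size, which matches the statement that unbiasedness holds for any $p_i$.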

Theorem 2:
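The variance of $\hat{\pi}_{ppswr}$ under the two-stage procedure, in which a sample of $n$ FSUs is selected with replacement with sampling probability $p_i$ for the unit $i$ from the population of $N$ clusters with $M_i$ elements in the $i$th cluster, and $m_i$ SSUs are drawn by SRSWR from each FSU, is given by:

$$V(\hat{\pi}_{ppswr}) = \frac{1}{n M_0^2} \left[ \sum_{i=1}^{N} p_i \left( \frac{M_i \pi_i}{p_i} - M_0 \pi \right)^2 + \sum_{i=1}^{N} \frac{M_i^2}{p_i} V(\hat{\pi}_i) \right], \tag{6}$$

where:

$$V(\hat{\pi}_i) = \frac{V(Z_{ij})}{m_i \left[ \dfrac{W_i (1 - 2P_i)}{P_i (1 - P_i)} + \dfrac{(1 - W_i)(2T_i - 1)}{T_i (1 - T_i)} \right]^2}.$$

Proof: Since the expected values of $Z_{ij}$ and $Z_{ij}^2$ are

$$E(Z_{ij}) = \pi_i \left[ \frac{W_i}{P_i} + \frac{1 - W_i}{1 - T_i} \right] + (1 - \pi_i) \left[ \frac{W_i}{1 - P_i} + \frac{1 - W_i}{T_i} \right], \tag{7}$$

$$E(Z_{ij}^2) = \pi_i \left[ \frac{W_i (2 - P_i)}{P_i^2} + \frac{(1 - W_i)(1 + T_i)}{(1 - T_i)^2} \right] + (1 - \pi_i) \left[ \frac{W_i (1 + P_i)}{(1 - P_i)^2} + \frac{(1 - W_i)(2 - T_i)}{T_i^2} \right], \tag{8}$$

then, based on (7) and (8), the variance of $Z_{ij}$ is:

$$V(Z_{ij}) = E(Z_{ij}^2) - \left[ E(Z_{ij}) \right]^2, \tag{9}$$

and, since the $Z_{ij}$ are independent, the variance of $\hat{\pi}_i$ can be expressed as $V(Z_{ij})$ divided by $m_i$ times the squared denominator of (4), as stated above. Since $V(\hat{\pi}_{ppswr}) = V_1 E_2(\hat{\pi}_{ppswr}) + E_1 V_2(\hat{\pi}_{ppswr})$, the first and second terms are given, respectively, as:

$$V_1 E_2(\hat{\pi}_{ppswr}) = \frac{1}{n M_0^2} \sum_{i=1}^{N} p_i \left( \frac{M_i \pi_i}{p_i} - M_0 \pi \right)^2, \qquad E_1 V_2(\hat{\pi}_{ppswr}) = \frac{1}{n M_0^2} \sum_{i=1}^{N} \frac{M_i^2}{p_i} V(\hat{\pi}_i).$$

Then, we can obtain the variance (6). Moreover, an unbiased estimator of $V(\hat{\pi}_{ppswr})$ is given by

$$\hat{V}(\hat{\pi}_{ppswr}) = \frac{1}{n(n - 1)} \sum_{i=1}^{n} \left( \frac{M_i \hat{\pi}_i}{M_0 p_i} - \hat{\pi}_{ppswr} \right)^2.$$

If the FSUs are selected with probability proportional to their sizes $M_i$, then $p_i = M_i / M_0$. For this reason, we call this method "probability proportional to size" (PPS) sampling. When a sample of FSUs is selected by PPS sampling with replacement via the sampling probability $p_i = M_i / M_0$ for the $i$th cluster, and $m_i$ SSUs are selected by SRSWR from each FSU, the estimator $\hat{\pi}_{ppswr}$ of $\pi$ is given by:

$$\hat{\pi}_{ppswr} = \frac{1}{n} \sum_{i=1}^{n} \hat{\pi}_i,$$

and the variance of $\hat{\pi}_{ppswr}$ and its estimator are as follows:

$$V(\hat{\pi}_{ppswr}) = \frac{1}{n} \left[ \sum_{i=1}^{N} \frac{M_i}{M_0} (\pi_i - \pi)^2 + \sum_{i=1}^{N} \frac{M_i}{M_0} V(\hat{\pi}_i) \right], \qquad \hat{V}(\hat{\pi}_{ppswr}) = \frac{1}{n(n - 1)} \sum_{i=1}^{n} (\hat{\pi}_i - \hat{\pi}_{ppswr})^2.$$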
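The ingredients of this variance can be sketched in code. The example below is an assumption-laden illustration rather than the authors' implementation: it takes the first two moments of a geometric count with success probability $p$ to be $1/p$ and $(2-p)/p^2$, and uses the standard between-draw variance estimator for sampling with replacement; all function names are hypothetical.

```python
def response_moments(pi, W, P, T):
    """Return (mean, variance) of one reported count Z_ij.

    Assumption: a geometric count with success probability p has
    E(X) = 1/p and E(X^2) = (2 - p)/p^2.
    """
    mean = (pi * (W / P + (1 - W) / (1 - T))
            + (1 - pi) * (W / (1 - P) + (1 - W) / T))
    second = (pi * (W * (2 - P) / P**2 + (1 - W) * (1 + T) / (1 - T)**2)
              + (1 - pi) * (W * (1 + P) / (1 - P)**2 + (1 - W) * (2 - T) / T**2))
    return mean, second - mean**2


def pps_wr_variance_estimate(cluster_estimates, cluster_sizes, selection_probs, total_size):
    """Between-draw variance estimator for n >= 2 clusters drawn with replacement."""
    n = len(cluster_estimates)
    terms = [M * est / (total_size * p) for est, M, p
             in zip(cluster_estimates, cluster_sizes, selection_probs)]
    center = sum(terms) / n  # equals the point estimate of pi
    return sum((t - center) ** 2 for t in terms) / (n * (n - 1))


if __name__ == "__main__":
    print(response_moments(0.3, 0.7, 0.7, 0.6))
    print(pps_wr_variance_estimate([0.25, 0.35, 0.30],
                                   [1000, 3000, 4000],
                                   [0.1, 0.3, 0.4], 10000))
```

A convenient property of the with-replacement design is that this estimator needs only the per-draw terms $M_i\hat{\pi}_i/(M_0 p_i)$ and no separate estimate of the within-cluster variance.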

PPS Sampling without Replacement
In this subsection, we consider PPS sampling without replacement to estimate the true population proportion of a sensitive character by applying Yennum et al.'s model, in which $n$ FSUs are drawn by PPS sampling without replacement from the population of $N$ clusters with $M_i$ elementary units for the $i$th cluster, and $m_i$ SSUs are drawn by SRSWR from each FSU.
From this two-stage sampling, the estimator $\hat{\pi}_{ppswor}$ of $\pi$ is: where $\theta_i$ is the first-order inclusion probability for the $i$th cluster. The variance of $\hat{\pi}_{ppswor}$ is given by: where $\theta_{ij}$ is the second-order inclusion probability of the $i$th and $j$th clusters. Furthermore, the variance estimator of $\hat{\pi}_{ppswor}$ is as follows:

Two-Stage Equal Probability Sampling
In this subsection, we consider a two-stage equal probability sampling design to estimate the true population proportion of a sensitive characteristic by applying Yennum et al.'s model, in which $n$ FSUs are drawn by simple random sampling without replacement (SRSWOR) from a population of $N$ clusters with $M_i$ elementary units for the $i$th cluster, and $m_i$ SSUs are drawn by SRSWR from each FSU.
From this two-stage sampling, the estimator $\hat{\pi}_{wr}$ of $\pi$ is given by: where $\hat{\pi}_i$ is an estimator of the true population proportion for a sensitive characteristic for the $i$th cluster, which is the same as (4). The variance of $\hat{\pi}_{wr}$ and its estimator are given as: where $\bar{M} = M_0 / N$.

An Estimation of Sensitive Attributes with Probability Proportional to Size Sampling Under Yennum et al.'s Generalized Model
We consider Yennum et al.'s generalized model, in which generalized geometric distribution is used as a randomization device when $n$ clusters are sampled by PPS sampling or equal probability sampling from the population, which consists of $N$ clusters with size $M_i$ $(i = 1, 2, \cdots, N)$, and $m_i$ $(i = 1, 2, \cdots, n)$ units are drawn by simple random sampling from each sampled cluster.
We develop the sampling schemes for PPS sampling with replacement in Section 3.1 and those for PPS sampling without replacement in Section 3.2. Finally, equal probability sampling is presented in Section 3.3.

PPS Sampling with Replacement
Let the population be composed of $N$ clusters. In the first stage, a sample of $n$ FSUs is drawn with replacement with the sampling probability $p_i$ for the $i$th cluster. In the second stage, $m_i$ SSUs are selected by SRSWR from each FSU and guided to apply Yennum et al.'s generalized randomization device.
If the interviewees in the $i$th cluster choose a white ball during the first stage, and if they have a sensitive attribute $A$ (or $A^c$), then they are guided to draw balls with replacement from the first urn of the second-stage device until they take $k_{i2}$ (or $k_{i1}$) successive balls with their actual status for the first time and are then asked to report the total number of balls drawn as $X_{i1}$ (or $X_{i2}$).
If the interviewee in the $i$th cluster draws a black ball in the first stage, and if they have a sensitive attribute $A^c$ (or $A$), then they are guided to draw balls with replacement from the second urn of the second-stage device until they take $k_{i2}$ (or $k_{i1}$) successive balls with their actual status for the first time and are then asked to report the total number of balls drawn as $Y_{i1}$ (or $Y_{i2}$).
For the $i$th cluster, using the randomization device in Figure 1, the total numbers of balls taken by interviewees, $X_{i1}$, $X_{i2}$, $Y_{i1}$, and $Y_{i2}$, follow generalized geometric distributions. Let $\pi_i$ and $1 - \pi_i$ be the true population proportions of persons who have a sensitive attribute $A$ and $A^c$, respectively, for the $i$th cluster. Assume that each interviewee in the $i$th cluster is drawn by SRSWR.
For the $j$th surveyed answer in the $i$th cluster, $Z_{ij}$ can be expressed as: The expected value of $Z_{ij}$ is given by: Then, the formula (22) can be expressed as: The estimator $\hat{\pi}_{iG}$ of the population proportion $\pi_i$ for the $i$th cluster is given by: where: and: When the interviewees are sampled by SRSWR for the $i$th cluster selected with replacement by sampling probability $p_i$, the estimator $\hat{\pi}_{Gppswr}$ of the true population proportion $\pi$ of a sensitive attribute is:
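The stopping rule of the generalized device, drawing until $k$ successive balls match the interviewee's status, can be sketched as follows. The function name is hypothetical, and the sketch assumes that the reported total counts every ball drawn, including the final run of $k$ matches.

```python
import random


def draw_until_k_successive(p_match, k, rng):
    """Count draws (with replacement) until k matching balls appear in a row.

    Assumption: the reported total includes the k balls of the final run.
    """
    total = 0
    run = 0
    while run < k:
        total += 1
        if rng.random() < p_match:
            run += 1   # extend the current run of matching balls
        else:
            run = 0    # a non-matching ball resets the run
    return total


if __name__ == "__main__":
    rng = random.Random(7)
    print([draw_until_k_successive(0.7, 2, rng) for _ in range(5)])
```

With `k = 1` the rule reduces to drawing until the first matching ball, that is, to the device of the ungeneralized model.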

Theorem 3:
The estimator $\hat{\pi}_{Gppswr}$ of the true population proportion $\pi$ of a sensitive character is an unbiased estimator. Proof: and, since: we can obtain:

Theorem 4:
The variance of $\hat{\pi}_{Gppswr}$ under a two-stage sampling scheme, in which a sample of $n$ FSUs is selected with replacement by sampling probability $p_i$ for the $i$th cluster from the population of $N$ clusters consisting of $M_i$ elements for the $i$th cluster, and $m_i$ SSUs are drawn by SRSWR from each FSU, is given by: where: Proof: The total numbers of balls taken by interviewees for the $i$th cluster, $X_{i1}$, $X_{i2}$, $Y_{i1}$, and $Y_{i2}$, are random variables with variances: From (21), to derive the variance of $\hat{\pi}_{Gppswr}$, we can obtain the expected values of $Z_{ij}$ and $Z_{ij}^2$ as follows: and: We can then obtain the variance (28). Also, an unbiased estimator of $V(\hat{\pi}_{Gppswr})$ is given by:

PPS Sampling Without Replacement
In this subsection, we consider PPS sampling without replacement to estimate the true population proportion of a sensitive characteristic by applying Yennum et al.'s generalized model, in which $n$ FSUs are drawn by PPS sampling without replacement from a population of $N$ clusters with $M_i$ elementary units for the $i$th cluster, and $m_i$ SSUs are drawn by SRSWR from each FSU.
From this procedure, the estimator $\hat{\pi}_{Gppswor}$ of $\pi$ is given by: where $\theta_i$ is the first-order inclusion probability for the $i$th cluster. The variance of $\hat{\pi}_{Gppswor}$ is given by: where $\theta_{ij}$ is the second-order inclusion probability for the $i$th and $j$th clusters. Also, the variance estimator of $\hat{\pi}_{Gppswor}$ is:

Two-Stage Equal Probability Sampling
In this subsection, we consider a two-stage equal probability sampling scheme to estimate the true population proportion of a sensitive attribute by applying Yennum et al.'s generalized model, in which $n$ FSUs are drawn by SRSWOR from a population of $N$ clusters consisting of $M_i$ elementary units for the $i$th cluster, and $m_i$ SSUs are drawn by SRSWR from each FSU.
From this procedure, the estimator $\hat{\pi}_{Gwr}$ of the true population proportion $\pi$ for a sensitive attribute is given by: where the estimator $\hat{\pi}_{iG}$ is the estimator of a sensitive characteristic of the $i$th cluster, which is the same as (24). The variance and variance estimator of $\hat{\pi}_{Gwr}$ are: and: $\hat{V}(\hat{\pi}_{Gwr}) = N^2 \cdots$ respectively, where $\bar{M} = M_0 / N$.

PPSWR Sampling versus Equal Probability Two-Stage Sampling in Yennum et al.'s Model
If we assume $N - 1 \approx N$, then the difference between the variance of equal probability two-stage sampling, (19), and the variance of PPS with replacement sampling, (6), is given by: In (43), we can see that $V(\hat{\pi}_{wr}) = V(\hat{\pi}_{ppswr})$ under the condition $M_i = \bar{M} = M_0 / N$; i.e., if the cluster sizes are equal, the selection probabilities of the PPS with replacement sampling are all $N^{-1}$ and equal to those of equal probability two-stage replacement sampling.
If the cluster sizes $M_i$ are significantly different, then $\sum_{i=1}^{N} (M_i - \bar{M})^2 \pi_i^2$, the first term on the right side of (43), has large values, and the second term has relatively small values. Hence, the estimation by PPS with replacement sampling is more efficient than that by equal probability two-stage replacement sampling.
We used the relative efficiency (RE) to compare the efficiency of the two sampling methods, PPS with replacement sampling and equal probability two-stage replacement sampling:

$$RE_1 = \frac{V(\hat{\pi}_{wr})}{V(\hat{\pi}_{ppswr})} \times 100\,(\%).$$

Values of $RE_1$ over 100% indicate that the estimator obtained by the PPS with replacement sampling method was more efficient than the estimator obtained by the equal probability two-stage replacement sampling.
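As a small illustration of how the relative efficiency is read, the sketch below assumes that RE is the ratio of the equal probability two-stage variance to the PPS with replacement variance, expressed as a percentage, so that values above 100 favour PPS sampling. The function name and the numeric inputs are hypothetical and are not taken from Table 1.

```python
def relative_efficiency(var_equal_prob, var_pps):
    """Relative efficiency (%) of PPS sampling against equal probability sampling.

    Assumption: RE = 100 * V(equal probability) / V(PPS with replacement).
    """
    if var_pps <= 0:
        raise ValueError("the PPS variance must be positive")
    return 100.0 * var_equal_prob / var_pps


if __name__ == "__main__":
    # Hypothetical variances, for illustration only.
    re = relative_efficiency(0.0020, 0.0016)
    print(f"RE = {re:.1f}%", "PPS more efficient" if re > 100 else "PPS not more efficient")
```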
From Table 1, when the selection probability $W$ of the first-stage randomization device increased from 0.1 to 0.9 in steps of 0.2, and the second-stage selection probabilities $T$ and $P$ increased from 0.6 to 0.8 in steps of 0.1 and from 0.65 to 0.90 in steps of 0.05, respectively, the REs increased for a fixed proportion of a sensitive attribute (particularly when the selection probability $T$ of the second randomization device increased), and the RE increased according to the conditions of $P$ and $\pi_i$. On the other hand, the RE increased when the first-stage selection probability $W$ was less than 0.5 and the values of $T$, $P$, and $\pi_i$ (from 0.1 to 0.4) decreased, but the RE decreased when the value of $W$ was greater than 0.5 under fixed values of $T$, $P$, and $\pi_i$. Furthermore, the greater the true population proportion of a sensitive attribute $\pi_i$, the higher the overall efficiency under Yennum et al.'s model, as shown by the values of the bottom cells in Table 1. This result agrees with the typical sampling survey methodology as the true population proportion of a sensitive attribute $\pi_i$ increases.

PPSWR Sampling versus Equal Probability Two-Stage Sampling in Yennum et al.'s Generalized Model
If we assume $N - 1 \approx N$, then the difference between the variance of the equal probability two-stage sampling scheme (41) and the variance of the PPS with replacement sampling scheme (28) is given by: In (44), we can see that $V(\hat{\pi}_{Gwr}) = V(\hat{\pi}_{Gppswr})$ under the condition $M_i = \bar{M} = M_0 / N$, i.e., if the cluster sizes are equal, the selection probabilities of the PPS with replacement sampling are all $N^{-1}$ and equal to those of the equal probability two-stage replacement sampling.
If the cluster sizes $M_i$ are significantly different, then $\sum_{i=1}^{N} (M_i - \bar{M})^2 \pi_i^2$, the first term of the right-hand side in (44), has large values, and the second term has relatively small values. Hence, the estimation by PPS with replacement sampling is more efficient than that by equal probability two-stage replacement sampling.
We used the relative efficiency (RE) to compare the efficiency of the two sampling designs (PPS with replacement sampling and equal probability two-stage replacement sampling):

$$RE_2 = \frac{V(\hat{\pi}_{Gwr})}{V(\hat{\pi}_{Gppswr})} \times 100\,(\%).$$

Values of $RE_2$ over 100% indicate that the estimator obtained by the PPS with replacement sampling method was more efficient than the estimator obtained by equal probability two-stage replacement sampling. Table 2 shows the results of the REs obtained by increasing the true population proportion $\pi_i$ from 0.1 to 0.4 by 0.1. The selection probabilities of the randomized response model ($W$, $T$, and $P$) are shown in Section 4.1. In calculating the REs, we set the parameters as follows: $M_0 = 10{,}000$, $M_1 = 1{,}000$, $M_2 = 2{,}000$, $M_3 = 3{,}000$, $M_4 = 4{,}000$, $m_0 = 1{,}000$, $m_1 = 100$, $m_2 = 200$, $m_3 = 300$, $m_4 = 400$, $p_1 = 0.235$, $p_2 = 0.441$, $p_3 = 0.609$, $p_4 = 0.715$, $k_1 = 2$, $k_2 = 1$.
From the results of Table 2, the efficiencies vary according to changes in the probabilities of selection during the first stage W and the second stage T and P in the randomization device, but when the first-stage selection probability W is fixed, and the second-stage selection probabilities T and P increase, then the relative efficiency of the PPS sampling is better than that of the equal probability two-stage sampling in Yennum et al.'s model.

Conclusions
We extended Yennum et al.'s model, in which geometric distribution is used as a randomization device for a population consisting of different-sized clusters, and clusters are selected by PPS sampling. Estimators for the true population proportion of a sensitive attribute, their variances, and their variance estimators are derived under PPS sampling and equal probability two-stage sampling.
We also applied these sampling schemes to Yennum et al.'s generalized model. Although the experiments assumed sampling with replacement, we expect similar results for sampling without replacement, as per typical sampling theory.
From the numerical study, we found that the efficiency of the two-stage sampling with probability proportional to size depends on the given parameter values, but Yennum et al.'s generalized model was the more efficient choice for most combinations of parameters (over around 80%).