Abstract
In this paper, we extended Yennum et al.’s model, in which geometric distribution is used as a randomization device for a population that consists of different-sized clusters, and clusters are obtained by probability proportional to size (PPS) sampling. Estimators of a sensitive parameter, their variances, and their variance estimators are derived under PPS sampling and equal probability two-stage sampling, respectively. We also applied these sampling schemes to Yennum et al.’s generalized model. Numerical studies were carried out to compare the efficiencies of the proposed sampling methods for each case of Yennum et al.’s model and Yennum et al.’s generalized model.
1. Introduction
The randomized response model (RRM) was suggested by Warner [1] to estimate the true population proportion of sensitive characteristics, such as illegal gambling, drug abuse, tax evasion, the extent of illegal income, and the experience of abortion, among others [2,3,4].
Since Warner’s work, many scholars have developed the RRM in various ways. Chaudhuri and Mukerjee [5] and Ryu et al. [6] arranged, summarized, and systematized various RRMs and emphasized their importance. Lee and Hong [7] applied two-stage cluster sampling to the RRM for a population consisting of equal-sized clusters, and Lee [8] considered the cluster RRM for a population consisting of different-sized clusters, where the clusters are selected by probability proportional to size (PPS) sampling.
Recently, Yennum et al. [9] suggested a new randomization device to gather sensitive data in two stages under the assumption of a geometric distribution, and they generalized their model to encompass the generalized geometric distribution using the model of Hussain et al. [10].
Yennum et al.’s work assumes that the respondents are selected by simple random sampling with replacement, whereas a real survey selects respondents under various sampling schemes.
Now, we can consider a large sample of clusters. For example, to estimate the true population proportion of drug abuse among high school students, it is possible to use a randomization device like that of Yennum et al.’s model under PPS sampling, treating the schools as the primary sampling units and the students as the secondary sampling units.
From this point of view, we extend Yennum et al.’s model, in which geometric distribution is used as a randomization device based on a population that consists of different-sized clusters, and the clusters are selected by PPS sampling. Estimators of a sensitive parameter, their variances, and their variance estimators are derived by PPS sampling and equal probability two-stage sampling, respectively.
We also apply these methods to the case of Yennum et al.’s generalized model. Numerical studies are carried out to compare the efficiencies of the suggested methods in each case of Yennum et al.’s model and Yennum et al.’s generalized model.
2. An Estimation of Sensitive Attributes with Probability Proportional to Size Sampling under Yennum et al.’s Model
In this section, we consider a new sampling scheme to estimate sensitive attributes using Yennum et al.’s model, in which geometric distribution is used as a randomization device when clusters are selected by probability proportional to size (PPS) sampling or equal probability sampling from a population that consists of clusters of size, and units are selected by simple random sampling from each sampled cluster.
In Section 2.1, we consider the selection of clusters by PPS sampling with replacement. The selection of clusters by PPS sampling without replacement is considered in Section 2.2, and the selection of clusters by equal probability sampling is examined in Section 2.3.
2.1. PPS Sampling with Replacement
Let the population be composed of clusters. In the first stage, a sample of first sampling units (FSUs) is selected with replacement by the selection probability for the th cluster. In the second stage, second sampling units (SSUs) are drawn by simple random sampling with replacement (SRSWR) from each FSU and are guided to carry out Yennum et al.’s randomization device.
First of all, the randomization device consists of two elements. The first randomization device for the th cluster consists of two kinds of urns with white and black balls, where the selection probability of a white ball is , and the selection probability of a black ball is .
On the other hand, the second randomization device is composed of two urns with balls. In the first urn, balls carrying a slip of paper with the statement “I have a sensitive attribute” are selected with probability , and the other balls, carrying the statement “I do not have a sensitive attribute”, are selected with probability . In the second urn, balls carrying the statement “I do not have a sensitive attribute” are selected with probability , and balls carrying the statement “I have a sensitive attribute” are selected with probability .
In the first stage, for the th cluster, each interviewee draws a ball from the first randomization device, that is, the urn with the white and black balls. An interviewee who selects a white ball is guided to pick balls from the first urn of the second randomization device, one after another, with replacement, until the first ball containing a statement matching his or her own status appears.
We assume that is the total number of balls drawn before he or she obtains the first ball including his or her own status in the th cluster, and is the total number of balls drawn before he or she obtains the first ball with his or her own status of not having a sensitive attribute in the th cluster. Similarly, when he or she draws a black ball, he or she is guided to pick balls from the second urn of the second randomization device, one after another, with replacement, until the first ball containing a statement matching his or her own status appears.
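The two-stage device described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the authors' implementation: `p_white`, `theta1`, and `theta2` are hypothetical names for the selection probability of a white ball and for the statement probabilities in the two urns, the reported count is assumed to include the matching draw, and the moment estimator simply inverts the expected response implied by these assumptions.

```python
import random


def draw_until_match(p_match, rng):
    """Draw with replacement until the first matching ball; return the number of draws."""
    draws = 1
    while rng.random() >= p_match:
        draws += 1
    return draws


def simulate_response(has_attribute, p_white, theta1, theta2, rng):
    """Reported count for one respondent (urn layout is an assumption)."""
    if rng.random() < p_white:
        # White ball: first urn, where "I have ..." appears with probability theta1.
        p_match = theta1 if has_attribute else 1.0 - theta1
    else:
        # Black ball: second urn, where "I do not have ..." appears with probability theta2.
        p_match = 1.0 - theta2 if has_attribute else theta2
    return draw_until_match(p_match, rng)


def conditional_means(p_white, theta1, theta2):
    """Expected count for respondents with and without the attribute."""
    mean_yes = p_white / theta1 + (1.0 - p_white) / (1.0 - theta2)
    mean_no = p_white / (1.0 - theta1) + (1.0 - p_white) / theta2
    return mean_yes, mean_no


def moment_estimate(responses, p_white, theta1, theta2):
    """Method-of-moments estimate of the sensitive proportion in one cluster."""
    mean_yes, mean_no = conditional_means(p_white, theta1, theta2)
    z_bar = sum(responses) / len(responses)
    return (z_bar - mean_no) / (mean_yes - mean_no)


def simulate_cluster(pi_true, m, p_white, theta1, theta2, seed=0):
    """Simulate m respondents drawn with replacement and return the estimate."""
    rng = random.Random(seed)
    responses = [
        simulate_response(rng.random() < pi_true, p_white, theta1, theta2, rng)
        for _ in range(m)
    ]
    return moment_estimate(responses, p_white, theta1, theta2)
```

The estimate is only defined when the two conditional means differ, and in small samples it can fall outside the interval from 0 to 1.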
For the th cluster, the randomization device is illustrated in Figure 1.
Figure 1. Randomization device for the th cluster.
For the th cluster, the total number of balls drawn by each interviewee through the proposed two-stage device follows the distribution of one of the following random variables: , , and , where represents the geometric distribution with a success probability. Let and be the true population proportions of persons who have a sensitive attribute ( and , respectively) for the th cluster. Assume that each interviewee in the th cluster is drawn by SRSWR.
Let be the th observed answer in the th cluster; can be expressed as
Then, we can find the expected value of as follows:
The expected value (2) can be expressed as follows:
where .
Now the estimator for the true population proportion in the th cluster is given by:
When the interviewees are drawn by SRSWR from the th cluster selected with replacement by the sampling probability , the estimator of the true population proportion for a sensitive character is given by:
where .
Theorem 1:
The estimator of the true population proportion of a sensitive attribute under the PPS with replacement sampling scheme is an unbiased estimator.
Proof:
and since:
we can obtain:
□
Theorem 2:
The variance of is obtained from a two-stage procedure, such that a sample of FSUs of size is selected with replacement with sampling probability for the unit from the population of clusters with size elements in the th cluster, and the SSUs with size are drawn by SRSWR from each FSU, as given by:
where:
Proof:
Given , , , , where represents the geometric distribution with a success probability. Since the expected values of and are
then:
Based on (7) and (8), the variance of is:
and, since is independent, the variance of can be expressed by:
Since , then the first and second terms are given, respectively, as:
and
Then, we can obtain the variance (10).
Moreover, an unbiased estimator of is given by
□
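For the with-replacement design, a generic Hansen-Hurwitz-type computation may help fix ideas. The sketch below uses the textbook form for a population proportion and is not claimed to reproduce the numbered equations of this section. The names are hypothetical: `cluster_estimates` stands for the within-cluster estimates obtained from the randomized responses, `cluster_sizes` for the sizes of the sampled clusters, `draw_probs` for their per-draw selection probabilities, and `total_size` for the population size.

```python
def ppswr_estimate(cluster_estimates, cluster_sizes, draw_probs, total_size):
    """Hansen-Hurwitz-type estimate of a proportion and its variance estimate.

    One entry per draw; a cluster drawn twice appears twice.
    """
    n = len(cluster_estimates)
    if n < 2:
        raise ValueError("at least two draws are needed for the variance estimate")
    # Per-draw contributions: estimated cluster total, scaled by draw probability.
    z = [
        size * est / (total_size * prob)
        for est, size, prob in zip(cluster_estimates, cluster_sizes, draw_probs)
    ]
    estimate = sum(z) / n
    # Standard with-replacement variance estimator based on between-draw spread.
    variance_estimate = sum((zi - estimate) ** 2 for zi in z) / (n * (n - 1))
    return estimate, variance_estimate
```

When the draw probabilities are proportional to the cluster sizes, each contribution reduces to the within-cluster estimate itself, so the overall estimate is the simple mean of the cluster estimates.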
If the FSUs are selected proportional to size with , then . For this reason, we call this method “probability proportional to size” (PPS) sampling. When a sample of FSUs is selected by PPS sampling with replacement via the sampling probability for the th cluster, and SSUs are selected by SRSWR from each FSU, the estimator of is given by:
and the variance of and its estimator are as follows:
2.2. PPS Sampling without Replacement
In this subsection, we consider PPS sampling without replacement to estimate the true population proportion of a sensitive character by applying Yennum et al.’s model, in which FSUs are drawn by PPS sampling without replacement from the population of clusters with elementary units for the th cluster, and SSUs are drawn by SRSWR from each FSU.
From this two-stage sampling, the estimator of is:
where is the first inclusion probability for the th cluster.
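As an illustration of weighting by first-order inclusion probabilities, the following sketch computes a Horvitz-Thompson-type estimate of a population proportion. It is the generic textbook form, given under the assumption that it matches the structure used here. The argument names (`cluster_estimates`, `cluster_sizes`, `inclusion_probs`, `total_size`) are hypothetical.

```python
def ht_estimate(cluster_estimates, cluster_sizes, inclusion_probs, total_size):
    """Horvitz-Thompson-type estimate of a proportion from distinct sampled clusters."""
    # Each sampled cluster's estimated total is inflated by 1 / inclusion probability.
    weighted_total = sum(
        size * est / incl
        for est, size, incl in zip(cluster_estimates, cluster_sizes, inclusion_probs)
    )
    return weighted_total / total_size
```

For example, two sampled clusters of sizes 100 and 300 with estimates 0.2 and 0.4 and inclusion probabilities 0.2 and 0.6, in a population of 1000 units, give an estimate of 0.3.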
The variance of is given by:
where is the second inclusion probability of the th and th clusters.
Furthermore, the variance estimator of is as follows:
2.3. Two-Stage Equal Probability Sampling
In this subsection, we consider a two-stage equal probability sampling design to estimate the true population proportion of a sensitive characteristic by applying Yennum et al.’s model, in which FSUs are drawn by simple random sampling without replacement (SRSWOR) from a population of clusters with elementary units for the th cluster, and SSUs are drawn by SRSWR from each FSU.
From this two-stage sampling, the estimator of is given by:
where is an estimator of the true population proportion for a sensitive characteristic for the th cluster, which is the same as (4).
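A minimal sketch of the equal probability case follows, assuming the usual expansion estimator for clusters drawn by SRSWOR. The names `n_clusters_population` and `total_size` are hypothetical labels for the number of clusters in the population and the total number of elementary units.

```python
def equal_prob_estimate(cluster_estimates, cluster_sizes,
                        n_clusters_population, total_size):
    """Two-stage equal probability estimate of a population proportion."""
    n = len(cluster_estimates)
    # Estimated totals of the sampled clusters.
    sampled_total = sum(
        size * est for est, size in zip(cluster_estimates, cluster_sizes)
    )
    # Expand by N / n to the population, then divide by the population size.
    return n_clusters_population * sampled_total / (n * total_size)
```

Unlike the PPS case, large clusters enter with the same selection probability as small ones, so their estimated totals dominate the sum directly.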
The variance of and its estimator are given as:
where .
3. An Estimation of Sensitive Attributes with Probability Proportional to Size Sampling Under Yennum et al.’s Generalized Model
We consider Yennum et al.’s generalized model, in which generalized geometric distribution is used as a randomization device when clusters are sampled by PPS sampling or equal probability sampling from the population, which consists of clusters with size , and units are drawn by simple random sampling from each sampled cluster.
We develop the sampling schemes for PPS sampling with replacement in Section 3.1 and those for PPS sampling without replacement in Section 3.2. Finally, equal probability sampling is presented in Section 3.3.
3.1. PPS Sampling with Replacement
Let the population be composed of clusters. In the first stage, a sample of FSUs is drawn with replacement with the sampling probability for the th cluster. In the second stage, SSUs are selected by SRSWR from each FSU and guided to apply Yennum et al.’s generalized randomization device.
If an interviewee in the th cluster draws a white ball in the first stage and has a sensitive attribute (or ), then he or she is guided to draw balls with replacement from the first urn of the second-stage device until (or ) successive balls matching his or her actual status appear for the first time, and is then asked to report the total number of balls drawn as (or ).
If an interviewee in the th cluster draws a black ball in the first stage and has a sensitive attribute (or ), then he or she is guided to draw balls with replacement from the second urn of the second-stage device until (or ) successive balls matching his or her actual status appear for the first time, and is then asked to report the total number of balls drawn as (or ).
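The stopping rule of the generalized device, drawing until a run of successive matching balls appears, can be sketched as follows. The names `p_match` (probability that a single draw matches the respondent's status, assumed strictly between 0 and 1) and `k` (required run length) are hypothetical. The closed-form mean is the standard waiting-time result for a run of `k` successes in independent trials, not a formula quoted from this paper.

```python
import random


def draws_until_run(p_match, k, rng):
    """Draw with replacement until k successive matching balls; return total draws."""
    draws = 0
    run = 0
    while run < k:
        draws += 1
        if rng.random() < p_match:
            run += 1   # extend the current run of matching balls
        else:
            run = 0    # a non-matching ball resets the run
    return draws


def expected_draws(p_match, k):
    """Mean number of draws until the first run of k matches (0 < p_match < 1)."""
    return (1.0 - p_match ** k) / ((1.0 - p_match) * p_match ** k)
```

With `k = 1` the mean reduces to `1 / p_match`, the ordinary geometric case.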
For the th cluster, using the randomization device in Figure 1, the total number of balls taken by interviewees , , , and are distributed via generalized geometric distribution. Let and be the true population proportion of persons who have a sensitive attribute and for the th cluster. Assume that each interviewee in the th cluster is drawn by SRSWR.
For the th surveyed answer in the th cluster, can be expressed as:
The expected value of is given by:
Then, the formula (22) can be expressed as:
The estimator of the population proportion for the th cluster is given by:
where:
and:
When the interviewees are sampled by SRSWR from the th cluster selected with replacement by sampling probability , the estimator of the true population proportion of a sensitive attribute is:
where .
Theorem 3:
The estimator of the true population proportion of a sensitive character is an unbiased estimator.
Proof:
and, since:
we can obtain:
□
Theorem 4:
The variance of is obtained by a two-stage sampling scheme, such that a sample of FSU is selected with replacement by sampling probability for the th cluster from the population of clusters consisting of elements for the th cluster, and SSUs are drawn by SRSWR from each FSU, as given by:
where:
Proof:
The total number of balls taken by interviewees for the th cluster, and , are random variables with variances:
From (21), to derive the variance of , we can obtain the expected values of and as follows:
Since ,
and:
We can then obtain the variance (28). Also, an unbiased estimator of is given by:
□
3.2. PPS Sampling Without Replacement
In this subsection, we consider PPS sampling without replacement to estimate the true population proportion of a sensitive characteristic by applying Yennum et al.’s generalized model, in which FSUs are drawn by PPS sampling without replacement from a population of clusters with elementary units for the th cluster, and SSUs are drawn by SRSWR from each FSU.
From this procedure, the estimator of is given by:
where is the first inclusion probability for the th cluster.
The variance of is given by:
where is the second inclusion probability for th and th clusters.
Also, the variance estimator of is:
3.3. Two-Stage Equal Probability Sampling
In this subsection, we consider a two-stage equal probability sampling scheme to estimate the true population proportion of a sensitive attribute by applying Yennum et al.’s generalized model, in which FSUs are drawn by SRSWOR from a population of clusters consisting of elementary units for the th cluster, and SSUs are drawn by SRSWR from each FSU.
From this procedure, the estimator of the true population proportion for a sensitive attribute is given by:
where is the estimator of the true population proportion of a sensitive characteristic for the th cluster, which is the same as (24).
The variance and variance estimator of are:
and:
respectively, where .
4. Efficiency Comparisons
4.1. PPSWR Sampling versus Equal Probability Two-Stage Sampling in Yennum et al.’s Model
If we assume , then the difference between the variance of equal probability two-stage sampling, (19), and the variance of PPS with replacement sampling, (6), is given by:
In (43), we can see that under the condition ; i.e., if the cluster sizes are equal, the selection probabilities of the PPS with replacement sampling are all and equal to those of equal probability two-stage replacement sampling.
If the cluster sizes differ significantly, then , the first term on the right side of (43), has large values, and the second term, , has relatively small values. Hence, the estimation by PPS with replacement sampling is more efficient than that by equal probability two-stage replacement sampling.
We used the relative efficiency (RE) to compare the efficiency of the two sampling methods (PPS with replacement sampling and equal probability two-stage replacement sampling):
Values of over 100% indicate that the estimator obtained by the PPS with replacement sampling method was more efficient than the estimator obtained by the equal probability two-stage replacement sampling.
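To illustrate why unequal cluster sizes favor PPS sampling, the sketch below compares only the first-stage (between-cluster) variance components of the two with-replacement designs, using standard textbook expressions. It deliberately ignores the second-stage variance from the randomized responses, so it does not reproduce the REs reported in this paper. The direction of the ratio (equal probability variance over PPS variance, times 100) is inferred from the statement that values over 100% favor PPS. All names are hypothetical.

```python
def first_stage_variances(cluster_sizes, cluster_props, n_draws):
    """Between-cluster variance of the estimated proportion under two designs."""
    total = sum(cluster_sizes)
    n_clusters = len(cluster_sizes)
    pi = sum(m * p for m, p in zip(cluster_sizes, cluster_props)) / total
    # PPS with replacement, draw probability m / total: contribution is p itself.
    var_pps = sum(
        (m / total) * (p - pi) ** 2 for m, p in zip(cluster_sizes, cluster_props)
    ) / n_draws
    # Equal probability with replacement, draw probability 1 / n_clusters.
    var_equal = sum(
        (n_clusters * m * p / total - pi) ** 2
        for m, p in zip(cluster_sizes, cluster_props)
    ) / (n_clusters * n_draws)
    return var_equal, var_pps


def relative_efficiency(var_equal, var_pps):
    """RE in percent; values above 100 favor PPS with replacement."""
    return 100.0 * var_equal / var_pps
```

With equal cluster sizes the two variances coincide and the RE is 100%. As the sizes diverge, the equal probability variance grows and the RE rises above 100%. The ratio is undefined when all cluster proportions are identical, because the PPS component is then zero.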
In calculating REs, we set the parameters as follows:
Table 1 reports the REs as the selection probability of the first-stage randomization device increases from 0.1 to 0.9 in steps of 0.2 and the selection probabilities of the second-stage randomization devices increase from 0.6 to 0.8 in steps of 0.1 and from 0.65 to 0.90 in steps of 0.05. For a fixed proportion of a sensitive attribute, the REs increase (particularly as the selection probability of the second randomization device increases), and the RE increases according to the conditions of and .
Table 1. The relative efficiencies (REs) of a sensitive estimator between the probability proportional to size (PPS) sampling with replacement and the equal probability two-stage sampling with replacement in Yennum et al.’s model for changing and .
On the other hand, RE increased when the first-stage selection probability was less than 0.5, and the values of , , and (from 0.1 to 0.4) decreased, but the RE decreased when the value of was greater than 0.5 under a fixed value for , , and .
Furthermore, the greater the true population proportion of a sensitive attribute , the higher the overall efficiency of Yennum et al.’s model, as shown by the values of the bottom cells in Table 1. This result agrees with the typical sampling survey methodology as the true population proportion of a sensitive attribute increases.
4.2. PPSWR Sampling versus Equal Probability Two-Stage Sampling in Yennum et al.’s Generalized Model
If we assume , then the difference between the variance of equal probability two-stage sampling scheme (41) and the variance of the PPS with replacement sampling scheme (28) is given by:
In (44), we can see that under the condition , i.e., if the cluster sizes are equal, the selection probabilities of the PPS with replacement sampling are all and equal to those of the equal probability two-stage replacement sampling.
If the cluster sizes, , differ significantly, then , the first term on the right-hand side of (44), has large values, and the second term, , has relatively small values. Hence, the estimation by PPS with replacement sampling is more efficient than that by equal probability two-stage replacement sampling.
We used the relative efficiency (RE) to compare the efficiency of the two sampling designs (PPS with replacement sampling and equal probability two-stage replacement sampling):
Values of over 100% indicate that the estimator obtained by the PPS with replacement sampling method was more efficient than the estimator obtained by equal probability two-stage replacement sampling.
Table 2 shows the REs obtained by increasing the true population proportion from 0.1 to 0.4 in steps of 0.1. The selection probabilities of the randomized response model (, and ) are the same as those given in Section 4.1.
Table 2. The REs for a sensitive estimator between the PPS with replacement sampling and equal probability two-stage sampling with replacement in Yennum et al.’s generalized model for changing and .
In calculating the REs, we set the parameters as follows:
From the results of Table 2, the efficiencies vary according to changes in the probabilities of selection during the first stage and the second stage and in the randomization device, but when the first-stage selection probability is fixed, and the second-stage selection probabilities and increase, then the relative efficiency of the PPS sampling is better than that of the equal probability two-stage sampling in Yennum et al.’s model.
5. Conclusions
We extended Yennum et al.’s model, in which geometric distribution is used as a randomization device for a population consisting of different-sized clusters, and clusters are selected by PPS sampling. Estimators for the true population proportion of a sensitive attribute, their variances, and their variance estimators are derived under PPS sampling and equal probability two-stage sampling.
We also applied these sampling designs to the case of Yennum et al.’s generalized model. Numerical studies were carried out to compare the efficiencies of the proposed methods in each case of Yennum et al.’s model and Yennum et al.’s generalized model under sampling with replacement.
Although the numerical studies assumed sampling with replacement, we expect similar results for sampling without replacement, as in typical sampling theory.
From the numerical study, we found that the efficiency of the two-stage sampling for probability proportional to size depends on the given parameter values, but the efficiency of Yennum et al.’s generalized model is preferred for most combinations of parameters over around 80%.
Author Contributions
Conceptualization, G.-S.L.; methodology, C.-K.S.; writing—original draft preparation, K.-H.H.; writing—review and editing, C.-K.S.; project administration and funding acquisition, G.-S.L.
Funding
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2018R1D1A3B07044007).
Acknowledgments
We would like to thank the anonymous reviewers for their very careful reading and valuable comments/suggestions.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Warner, S.L. Randomized response: A survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 1965, 60, 63–69.
- Cochran, W.G. Sampling Techniques, 3rd ed.; John Wiley and Sons: New York, NY, USA, 1977.
- Fox, J.A.; Tracy, P.E. Randomized Response: A Method for Sensitive Survey; Sage Publications: Newbury Park, CA, USA, 1986.
- Kuk, A.Y.C. Asking sensitive questions indirectly. Biometrika 1990, 77, 436–438.
- Chaudhuri, A.; Mukerjee, R. Randomized Response: Theory and Techniques; Marcel Dekker Inc.: New York, NY, USA, 1988.
- Ryu, J.B.; Hong, K.H.; Lee, G.S. Randomized Response Model; Freedom Academy: Seoul, Korea, 1993.
- Lee, G.S.; Hong, K.H. Randomized response model by two-stage cluster sampling. Korean Commun. Stat. 1998, 5, 99–105.
- Lee, G.S. A Study on the Randomized Response Technique by PPS Sampling. Korean J. Appl. Stat. 2006, 19, 69–80.
- Yennum, N.Y.; Sedory, S.A.; Singh, S. Improved strategy to collect sensitive data by using geometric distribution as a randomization device. Commun. Stat. Theory Methods 2019, 48, 5777–5795.
- Hussain, Z.; Shabbir, J.; Pervez, Z.; Shah, S.F.; Khan, M. Generalized geometric distribution of order k: A flexible choice to randomize the response. Commun. Stat. Simul. Comput. 2017, 46, 4708–4721.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).