Hierarchical Bayesian Modeling and Randomized Response Method for Inferring the Sensitive-Nature Proportion

: In this study, a new three-statement randomized response estimation method is proposed to improve the drawback that the maximum likelihood estimation method could generate a negative value to estimate the sensitive-nature proportion (SNP) when its true value is small. The Bayes estimator of the SNP is obtained via using a hierarchical Bayesian modeling procedure. Moreover, a hybrid algorithm using Gibbs sampling in Metropolis–Hastings algorithms is used to obtain the Bayes estimator of the SNP. The highest posterior density interval of the SNP is obtained based on the empirical distribution of Markov chains. We use the term 3RR-HB to denote the proposed method here. Monte Carlo simulations show that the quality of 3RR-HB procedure is good and that it can improve the drawback of the maximum likelihood estimation method. The proposed 3RR-HB procedure is simple for use. An example regarding the homosexual proportion of college freshmen is used for illustration.


Introduction
Warner [1] is the pioneer to propose the randomized response (RR) method for evaluating the sensitive-nature proportion (SNP). An individual in the RR method of [1] is required to answer "yes" or "no" to either the statement "I am a member of group A" or "I am not a member of group A" where group A is a sensitive-nature statement. Since the interviewees do not need to release the selected statement to the interviewer, the interviewees can be supposed to confide in the interviewer the true answer.
Let θ denote the population SNP and n be the sample size of interviewees. In the RR survey, Y interviewees answer "yes". It is trivial to show that Y follows a binomial distribution with the sample size n and proportion of ω = P(yes) = θ p s + (1 − θ)(1 − p s ), where p s denotes the proportion of selecting the sensitive-nature statement. Denote Y ∼ Bin(n, ω) . Warner [1] obtained the maximum likelihood estimator of θ, denoted byθ, and its variance is as follows: and We use the term MLE to denote maximum likelihood estimator/estimate here. Equations (1) and (2) can be used to construct the confidence interval of θ. The MLE proposed by [1] is valid only if p s = 0.5 and 0 ≤ θ ≤ 1. When the value of θ is small, the MLE of θ could be negative. This fact makes the method of [1] invalid to infer θ, as its true value is small.
New RR methods have been proposed after [1]. Greenberg et al. [2] proposed an unrelated question RR methods. They proposed a theoretical framework to infer the model parameters for the design of two statements. A new RR procedure was proposed by Mangat and Singh [3]. Their method used two randomization devices to design the RR strategy. Mangat and Singh [3] demonstrated that their new strategy was more efficient than the usual strategy of [1].
Kuk [4] proposed an alternative method to perform an RR survey. The design method of [4] does not require direct answers from the interviewees, and such a design can enhance the confidence of interviewees to tell the true answer. The method of [4] can be applied to both qualitative and quantitative questions. Kuk [4] suggested to collect data for a mixture distribution, and the problem can reduce to the estimation of a mixture proportion.
Chaudhuri [5] emphasized the protection of the interviewee's privacy and also studied the impact of simple random sampling design on the final conclusions. Chaudhuri [5] illustrated two existing RR devices for indicating how an estimator along with an estimated measure of its error could be developed when the RR sample may be drawn adopting a complex survey design involving unequal selection probabilities with or without replacement.
Christofides [6] proposed a generalized RR (GRR) technique to eliminate a major bias in surveys of the population SNP resulting from an interviewee's refusal when using the RR method of [1]. Chang et al. [7] considered a simple generalization for some existing investigations and suggested suitable selection strategies for design parameters. They also discussed the superiority of their proposed strategies over the RR strategy of [1].
Hsieh et al. [8] proposed a modified GRR (MGRR) approach for a multi-level attribute using a single sensitive item. The MGRR approach has some merits over the other counterparts. Hsieh et al. [8] suggested using the Markov chain Monte Carlo (MCMC) method to obtain the Bayes estimator of the SNP instead of the maximum likelihood estimation method. We use the term BE to denote Bayes estimator/Bayes estimate here. Examples about using Bayesian methods for real applications can be found in the book of Gelman et al. [9].
Bar-Lev et al. [10] presented a Bayesian approach to four RR models. They used truncated beta distributions in a common conjugate prior structure to obtain the BE of the SNP. Barabesi and Marcheselli [11] proposed a Bayesian estimation procedure to obtain the BE of the SNP based on Frankin's RR procedure. They conducted a simulation study to evaluate the quality of their proposed method.
Barabesi and Marcheselli [12] proposed a Bayesian method to the joint estimation of the SNPs and sensitivity level of a stigmatizing attribute via applying a two-stage RR design. The MGRR method is designed for a multi-level attribute using a single sensitivenature statement. Hsieh et al. [8] suggested using the MCMC approach to obtain the BE of the SNP. The MGRR method is effective to obtain a reliable BE of the SNP. However, the MGRR method could be too complicated to implement for users.
Bayesian estimation methods are useful for modeling multi-faceted or nonlinear practical phenomena other than the maximum likelihood estimation method. Among all popular Bayesian estimation methods, the hierarchical Bayesian (HB) modeling method can be run with multiple hierarchical levels for estimating the parameters of posterior distribution. If grouped observations are used in a survey, hierarchical modeling is a relevant design to obtain the reliable BEs of model parameters.
The example in Section 4 of this study is based on college students from different groups to study the homosexual proportion in a region during different years. Hence, the HB modeling method is helpful to obtain the reliable BEs of model parameters. The HB modeling method has been commonly applied in many different areas when the information on several different levels of observational units is available; see [12] for comprehensive discussions. It is helpful to apply hierarchical analysis forms to understand multi-parameter problems and design computational strategies.
However, heavy computation loading is a problem to obtain BEs of the model parameters using the HB modeling method. Taking advantage of the recent advances of computer power, it becomes easier to reduce the impact of computation loading when using HB modeling methods for data analysis. Some applications using the HB modeling method other than the RR design can be found in [13][14][15][16][17][18][19][20][21][22]. The HB modeling method is not yet applied for RR design. The implementation of the proposed 3RR-HB modeling method is discussed in Section 3.

Motivation and Organization
The traditional RR methods of [1,2] are simple to use. However, the MLE of the SNP could be negative when the true value of the SNP is small. In order to make interviewees more confident to tell the true answer in the survey, it is helpful to use more than one non-sensitive-nature statement in the RR method. Hence, we extend the traditional RR design of [2] to a three-statement RR method, which contains one sensitive-nature and two non-sensitive-nature statements.
To escape the trap of obtaining a negative MLE of the SNP, the HB modeling method is adopted to improve the drawback of the maximum likelihood estimation method when the true value of the SNP is small. In order to overcome the complexity of numerical computation to obtain the BE of the SNP, the hybrid algorithm of using Gibbs sampling in the Metropolis-Hastings algorithm is proposed to implement the MCMC method to obtain the BEs of model parameters. The main contribution of this study is to propose a 3RR-HB procedure to obtain the BE of the SNP. Moreover, the highest posterior density interval (HPDI) of the SNP is constructed. The proposed 3RR-HB procedure can improve the drawback of using a negative MLE to estimate the population SNP when its true value is small.
The rest of this paper is organized as follows: In Section 3, we present the data structure and introduce the proposed 3RR-HB procedure. In Section 4, an example regarding the homosexual proportion of college freshmen is used to demonstrate the applications of the proposed 3RR-HB procedure. A Monte Carlo simulation study is also conducted in Section 4 to study the weakness of maximum likelihood estimation method and evaluate the quality of the proposed 3RR-HB procedure. Some concluding remarks are given in Section 5.

Materials and Methods
Conducting a RR survey with two non-sensitive-nature and one sensitive-nature statements as follows: (i) I am in Group A; (ii) I am in Group B; and (iii) I am in Group C. Group A is a sensitive-nature statement; Group B and Group C are non-sensitivenature statements. Interviewees have probabilities p s , p 1 and p 2 to randomly answers the Statements (i), (ii) and (iii). In this study, the values of p s , p 1 and p 2 are pre-assigned.
The interviewees do not need to release which statement they have replied to the interviewer. Let θ, δ 1 and δ 2 denote the probabilities of individual answering "yes" under the Statement (i), (ii) and (iii), respectively and ω denote the probability of answering "yes" in the sample. It is trivial to shown that ω = P(yes) = p s θ + p 1 δ 1 + p 2 δ 2 . In this study, the values of δ 1 and δ 2 are known in the RR design.
Let the sample size be n, in which Y of them answer "yes". It can be shown that Y ∼ Bin(n, ω). The log-likelihood function based on the data (n, y) can be presented as The MLE of θ to maximize (ω | n, y) can be presented bŷ whereȲ = y n . We note thatθ is valid only if the working condition of is true. Unfortunately, Equation (5) is often violated when the value of θ is small. The fact makes the maximum likelihood estimation method unreliable for the cases of small θ. We will show that the failure rate of P(θ < 0) is high via using the Monte Carlo simulation method in Section 4. The Bayesian inference method is used to obtain the BE of θ. Let ω be random and follow the prior distribution of Beta, ω ∼ Beta(α, β): where α > 0 and β > 0 are hyper-parameters. The posterior distribution can be presented by In some occasions, we may collect RR samples from k(> 1) different regions or time periods. One example is to evaluate the homosexual proportions of the freshmen who enrolled in a university over different years. Since the enrolled freshmen come from different cities year by year, it is reasonable to assume that the proportion of homosexual freshmen varies year by year. Therefore, the SNPs are the proportions of homosexual freshmen in k years, denoted by θ 1 , θ 2 , · · · , θ k .
These values of θ 1 , θ 2 , · · · , θ k are different. In this study, we are interested in studying the trend of the population proportion of the homosexual freshmen in a university over years. A 3RR-HB method is developed to obtain the BEs of SNPs along with i = 1, 2, · · · , k. If all SNPs are same; that is, θ i = θ for i = 1, 2, · · · , k.
Taking summation to the both sides of the equation Replacing (8) by the p s,iθi based on the ith subsample, the BE of θ can be presented bỹ Assume that RR sub-samples were collected from k different regions or time periods and the values of θ 1 , θ 2 , · · · , θ k are different. Let (n i , Y i ) denote the ith RR sample and Y i ∼ Bin(n i , ω i ), where ω i = p s,i θ i + p 1,i δ 1,i + p 2,i δ 2,i for i = 1, 2, · · · , k. Let N = (n 1 , n 2 , · · · , n k ) and Y = (Y 1 , Y 2 , · · · , Y k ), and the data structure can be simplified as (N, Y). To avoid subjectively setting up the values of hyper-parameters, the HB modeling method is used to develop the proposed Bayesian inference procedure. Let ω i ∼ Beta(α, β) for i = 1, 2, · · · , k. The density function of ω i can be denoted by Using the square loss function for the Bayesian inference, the BE of ω, denoted bỹ ω, is the posterior mean based on the π(ω | n, y, α, β) in Equation (7), and we can present Therefore, the BE of θ can be obtained bỹ It is trivial thatθ > 0 ifω > p 1 δ 1 + p 2 δ 2 . It could be subjective to select the values of α and β for Bayesian inference. Hence, the HB modeling method is used in the proposed Bayesian estimation procedure to obtain the BEs of model parameters.
To implement the HB modeling method, we need to assume the second layer of prior distribution. Let α and β follow a hyper-prior distribution with the structure of a product of two Gamma distributions: where and For simplification, let Θ = (α, β, ω 1 , · · · , ω k ). The full posterior distribution can be presented by Moreover, the conditional posterior of ω i , given α and β can be presented by ω i ∼ Beta(y i + α, n i − y i + β), i = 1, 2, · · · , k. The value of θ i during the MCMC computation can be updated byθ Based on the condition of Given the values of α and (N, Y), we can shown that where c m (α, N, Y) = min y i + α p 1,i δ 1,i + p 2,i δ 2,i − (n i + α), i = 1, 2, ..., k .
The set Ω β|α,N,Y can be used to guaranteeθ ( * ) i > 0 for i = 1, 2, · · · , k. Hence, Ω β|α,N,Y can be a reference set to select the hyper-parameter β when a value of α is generated and the data of (N, Y) are collected.
The proposed hybrid algorithm is constructed as follows: Let Θ −1 = (β, ω 1 , · · · , ω k ) and Θ −2 = (α, ω 1 , · · · , ω k ) denote the vector of parameters by removing α and β from Θ, respectively. After algebraic computation, the marginal density distributions of α and β can be obtained, respectively, by and In order to overcome the difficulty to update α and β via using Equations (18) and (19) in the Gibbs sampling procedure, the Metropolis and Hastings algorithm is used to update α and β. Hence, the proposed hybrid algorithm for implementing the HB modeling method can be followed based on the following steps: Step 1: otherwise, α (j+1) = α (j) .
Step 4: Repeat Step 1 to Step 3 B times, where B is a big positive integer. Perform the burn-in step by removing the leading B 1 Markov chains. The BE of parameter is obtained via using the remainder (B − B 1 ) Markov chains. Since the square loss function is considered in this study, the BEsα,β andθ i , i = 1, 2, · · · , k.
Some proposals with the property of q(τ (j) |τ ( * ) ) = q(τ ( * ) |τ (j) ), where τ is the target parameter for update, can be selected to reduce the computation loading of MCMC-for example, the normal or uniform distribution. When such proposals are used to implement the MCMC approach, Equations (20) and (21) can reduce to respectively. In this study, the normal distribution is considered as the proposal for MCMC approach. Generate α ∼ N(α (j) , 1) and β ∼ N(β (j) , 1). If α < 0, we do not update α; if β / ∈ Ω β|α,N,Y , we do not update β. The obtained Markov chains based on the proposed 3RR-HB procedure can also be used to construct the empirical distribution of BE. Then, the HPDI of the model parameter can be obtained via using the empirical distribution of BE. The applications of the proposed 3RR-HB procedure and its quality will be studied in Section 4 via using a real example and Monte Carlo simulations.

Homosexual College Freshmen Survey Example
A RR survey was first conducted in October of 2019 and repeated in October of 2020 to evaluate the homosexual proportion of freshmen in a university, located in north Taiwan. College students from different majors were interviewed and asked to answer "yes" or "no" to one of the following three statements: (i) Is the last code of your ID card even? (ii) Is the last code of your student ID card even? (iii) Are you homosexual? Clearly, only Statement (iii) is a sensitive-nature statement.
Each student in the RR survey randomly draw a ball from a urn, which contains five white, five green and five red balls. If the drawn ball is white, answer Statement (i); if the drawn ball is green, answer Statement (ii); otherwise, answer Statement (iii). Students do not need to release the ball's color to the interviewer. The three-statement RR design can make students fully confide the true answer in the interviewer.
Using the normal proposals for the proposed MCMC procedure to generate α ∼ N(α ( * ) , 1) and β ∼ N(β ( * ) , 1). If α < 0, we do not update α; if β / ∈ Ω β|α,N,Y , we do not update β. All the estimation results are reported in Tables 2 and 3.  Tables 2 and 3, we can find the strength of the proposed 3RR-HB procedure. Table 2 shows that the obtained BEs based on the proposed 3RR-HB procedure are reliable. All bias and standard errors (SEs) of BEs based on the proposed 3RR-HB procedure with 10,000 iterations, where B = 50, 000 and B 1 = 10, 000, are small. Moreover, all acceptance rates in Table 3 are larger than 50%. Since ω ( * ) i ∼ Beta(y i + α (j+1) , n i − y i + β (j+1) ), all the generated values are accepted. Therefore, the acceptance rates of ω 1 and ω 2 are 100%. Moreover, the obtained BE of the homosexual proportion of college freshmen in the region is about 9% in 2019 and 8% in 2020.
The empirical distribution can be constructed based on the obtained Markov chains of BEs. Using the values of ξ 1 , η 1 , ξ 2 and η 2 of C-III in Table 1, the 90% HPDI of θ 1 and θ 2 are (0, 0.174) and (0, 0.168), respectively. Overall, we can conclude that the point homosexual proportion of college freshmen in the region in 2019 and 2020 was about 9% and 8% and then up to 17.4% and 16.8%, respectively, under the considering of sampling error with 90% confidence. The length of HPDI is long as the sample size is small. If the confident coefficient 95% is used instead of 90%, the length of the HPDI of θ i will be longer than the 90% HPDI for i = 1, 2.
Let θ 1 = θ 2 = 0.05 and 0.1. The proposed 3RR-HB procedure with B = 50,000, B 1 = 10,000 is implemented. Moreover, the BEs of θ i , i = 1, 2 are obtained based on the values of ξ 1 , η 1 , ξ 2 , η 2 of C-III in Table 1. Repeat each simulation procedure 10,000 times, and then the bias and mean squared error (MSE) of each BE are evaluated based on the 10,000 runs. All simulation results are reported in Figure 1 and Tables 4-6.   Figure 1 displays the scatter plot of the first 1, 000 MLEs ofθ 1 andθ 2 for n = 200 and θ 1 = θ 2 = 0.05. We can see that many pairs ofθ 1 andθ 2 are in Zones B, C, or D, and those pairs are invalid MLEs due toθ 1 < 0 orθ 2 < 0. Only the pairs ofθ 1 andθ 2 in Zone A are valid due to the required conditions θ 1 > 0 and θ 2 > 0.
We found that 7020 MLEs of θ 1 with the proportion of 0.702 and 6,930 MLEs of θ 2 with the proportion of 0.603 in Table 4 were positive when n = 200 and θ 1 = θ 2 = 0.05. In a scan of Table 4, we find that the maximum likelihood estimation method has a high risk to generate invalid values ofθ 1 andθ 2 if the true values of θ 1 and θ 2 are closed to zero. The sample proportions of P(θ 1 > 0) and P(θ 2 > 0) are increased as n is increased. However, as the RR method goes through an interview process, it could be difficult to collect an large sample to obtain reliable MLEs of θ 1 and θ 2 .
As many negative MLEs were found, the bias and MSE of MLE are not reliable. Tables 5 and 6 only report the bias and MSE of BE. In view of Tables 5 and 6, we found that the bias and MSE of BE is declined as n increased. The bias and MSEs ofθ 1 andθ 2 were small. Hence, the simulation results indicate that the proposed 3RR-HB procedure can be a reliable method to evaluate SNP.

Conclusions
In this paper, we proposed a 3RR-HB procedure to infer the SNP by considering a hierarchical structure for the prior distribution in Bayesian modeling. Moreover, the Beta-Binomial distribution was applied to characterize the RR samples. In order to overcome the computation complexity, the hybrid algorithm of using Gibbs sampling in Metropolis-Hastings algorithm was adopted to update the model parameters during MCMC computation. The proposed 3RR-HB procedure method is simple and minimally subjective for use.
A data set regarding the homosexual proportion of college freshmen was used to illustrate the applications of the proposed 3RR-HB procedure. We also conducted Monte Carlo simulations to study the performance of the proposed 3RR-HB procedure. The simulation results showed that the proposed 3RR-HB procedure was reliable to obtain the BEs of model parameters. Moreover, the 3RR-HB procedure can help users to escape the drawback of using invalid MLE to estimate the SNP.
The design of equal probabilities for the three statements was used to obtain RR samples. Such a design will reduce the chance of interviewees to select the sensitive-nature statement. However, such a design can enhance the willing of interviewees to confide in the interviewer the true answer. The equal-probability design is a trade-off. Practitioners can use unequal-probability design to obtain RR samples to implement the proposed 3RR-HB procedure based on their considerations.
We only used one sensitive-nature statement to obtain RR samples. It will be interesting to expand the proposed method for the RR method containing two or more sensitive-nature statements. How to establish the HB modeling inference procedure for the RR method with two or more sensitive-nature statements is an open question that can be studied in the future.