Scrambling Reports: New Estimators for Estimating the Population Mean of Sensitive Variables

Abstract: Warner proposed a methodology called randomized response techniques, which, through the random scrambling of sensitive variables, allows the non-response rate to be reduced and the response bias to be diminished. In this document, we present a randomized response technique using simple random sampling. The scrambling of the sensitive variable is performed through the selection of a report R_i, i = 1, 2, 3. In order to evaluate the accuracy and efficiency of the proposed estimators, a simulation was carried out with two databases, where the sensitive variables are the destruction of poppy crops in Guerrero, Mexico, and the age at first sexual intercourse. The results show that more accurate estimates are obtained with the proposed model.


Introduction
When carrying out survey sampling, the goal of the sampler is to collect, based on a sample, the greatest amount of information in order to estimate a certain characteristic of the population under study. To accomplish the objective of having accurate and truthful measurements, the sampler must have a sufficient amount of financial and methodological resources. If the sampler cannot solve any of the aforementioned issues, in practice, problems will arise in the collection of the information of interest, and these problems are a component of so-called "sampling errors". These errors are mainly due to a lack of response (non-response) or response bias. In addition, these sampling errors increase when the information to be obtained is about a sensitive characteristic. That is, respondents are more likely to avoid answering or give untruthful responses to questions on topics such as drugs, sexual violence, alcoholism, crime, etc.
We can find in the literature different techniques or methodologies to obtain answers to direct questions of a sensitive nature, such as the bogus pipeline developed by Jones [...]. In the first work on randomized responses, Warner [3] considered a dichotomous population U of size N; that is, the elements of the population are classified according to their possible responses into the group (A), consisting of people who have the sensitive characteristic Y, and the group (Ā), consisting of people who do not have the sensitive characteristic Y. Using simple random sampling (SRS), a sample s of size n is selected in order to estimate the proportion of people with the sensitive qualitative characteristic, $\pi_A$. Using the following model, he scrambled the sensitive response of the respondent, assisted by a randomization device that selects the sensitive question with probability P. Hence, the probability of a "yes" response is $P\pi_A + (1 - P)(1 - \pi_A)$. Extensions to deal with quantitative sensitive variables were developed by Greenberg et al. [6], Eriksson et al. [7], Huang [8], Bouza [9], Arnab [10], Singh and Gorey [11], Hussain and Shahid [12], Narjis et al. [13], Bouza et al. [14], Hussain et al. [15], and Azeem and Ali [16], among other works. Another utility of RR techniques is their applicability to sensitive issues, such as health areas (see Murtaza et al. [17]), social issues (see Chong et al. [18]), and drug use (see Perri et al. [19] and Kirtasze et al. [20]), among other sensitive issues. We present a variation of Saleem et al.'s [21] paper, in which the authors proposed a scrambling procedure for quantitative sensitive variables.
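Warner's design for a dichotomous characteristic can be illustrated with a short simulation. The following is a minimal sketch, not code from the cited works: the function names, the seed, and the values P = 0.7 and $\pi_A$ = 0.3 are assumptions chosen only for illustration. Each respondent answers the sensitive question with probability P and its complement otherwise, and the proportion is recovered from the observed share of "yes" answers.

```python
import random


def warner_responses(has_trait, p, rng):
    """Scramble true memberships with Warner's randomization device.

    With probability p the respondent answers "Do you belong to A?";
    otherwise the respondent answers "Do you belong to the complement of A?".
    """
    answers = []
    for trait in has_trait:
        asked_about_a = rng.random() < p
        answers.append(trait if asked_about_a else not trait)
    return answers


def warner_estimate(answers, p):
    """Estimate the proportion of A from scrambled yes/no answers."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the proportion unidentifiable")
    yes_share = sum(answers) / len(answers)
    return (yes_share - (1 - p)) / (2 * p - 1)


# Illustrative values (assumptions, not taken from the paper).
rng = random.Random(2023)
true_pi = 0.3
sample = [rng.random() < true_pi for _ in range(20000)]
scrambled = warner_responses(sample, p=0.7, rng=rng)
pi_hat = warner_estimate(scrambled, p=0.7)
print(f"estimated proportion: {pi_hat:.3f}")
```

The interviewer never learns which question was answered, yet the estimate stays close to the true proportion for large samples; the price is a larger variance than direct questioning, which grows as P approaches 0.5.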
In this study, we used two databases to evaluate the estimators. One of them was obtained from a census on the cultivation of illegal drugs in the State of Guerrero, Mexico (see México Unido Contra La Delincuencia [22]). The sensitive variable is the area devoted to such crops. This research is very important because, despite efforts to curb the production of illicit drugs, their cultivation continues to increase. A goal of the involved authorities is to examine behavior when using scrambling techniques to provide farmers with the confidence that their reports are not going to stigmatize them. Eradication efforts of such crops have an impact on ecosystems, as policies disproportionately affect not only smallholders, pushing them to marginality, but also programs such as the aerial spraying of herbicides, which affect biodiversity by fragmenting and degrading forest habitat and wildlife. Severe damage to the environment, which may be a consequence of eradication policies, imposes the need to periodically review their effects from societal perspectives. Sample surveys should be developed periodically. The other database was provided by research on the age of first sexual intercourse (see Secretaria De Salud [23]). Early sexual activity in adolescence has multiple short- and long-term negative impacts on further emotional development and the quality of health, both mental and physical. Different studies maintain that having sex before the age of 13 increases the likelihood of sexually transmitted infections and other unhealthy behaviors, such as alcohol abuse. It is also associated with delinquency, violence, intergenerational health problems due to unintended pregnancies, etc. See Epstein et al. [24] for a discussion on these facts. Previous reliability studies on first intercourse have given some idea of the rates of falsified answers. See Brener et al. [25] as an example. Obtaining truthful answers while protecting privacy is possible with the use of RR techniques. They also provide higher rates of response from surveyed persons.
The content of this document is organized as follows. In the first part, we propose a variation of the model proposed by Saleem et al. [21] under SRS. The goal of this variation is to improve Saleem et al.'s [21] model in terms of precision, resulting in an R3 report with an unbiased estimator of the mean under specified conditions. In the second section, we evaluate the quality of the estimators in terms of accuracy and efficiency. We developed numerical studies on the behavior of the RR techniques presented using the two databases. Both studies provide recommendations on the use of the estimators derived for the considered scrambling procedures. Numerical and graphical studies were performed using simulations.

Materials and Methods
Proposed RR Scrambling Procedure Using SRSWR
Randomized response techniques increase the participation of respondents to direct questions regarding a sensitive characteristic by providing them with confidence when reporting the value of their sensitive characteristic Y. Otherwise, the sampler is generally faced with a high proportion of non-responses and/or false responses. In practice, RR techniques that are better at scrambling the sensitive value Y will be perceived with more confidence by the respondents, who are then more likely to supply its true value. We propose a variation of the work of Saleem et al. [21]. The RR technique they proposed is a compulsory randomized response technique, in which the respondent's response is randomly scrambled by one of the following three reports, R1, R2, and R3: [...]. They individually scramble the true value of Y. Take g ∈ [0,1] and [...] ∈ {−1,1}, which are independent constants known and/or generated by the sampler. S is an auxiliary or scrambling variable, with mean E(S) = $\mu_S$ = 0 and variance $\sigma_S^2$ fixed by the sampler. The report is Z* [...].
Our proposal substitutes the last alternative report with R(3) = $Y_i/S_i$, where S has mean $\mu_S > 0$ and variance $\sigma_S^2$. It is also a compulsory randomized response technique. Now, the respondent's response is randomly scrambled by R1, R2, or R(3). Therefore, the RR model Z is given by: [...]. SRSWR (simple random sampling with replacement) is used to select a sample s of size n from a population U in the reports. It is of interest to know the population characteristics of the sensitive value Y. Looking at the characteristics for R1 and R2 proposed by Saleem et al. [21]: [...]. Their proposal of an estimator for the Z* model is [...].
We propose the following estimator of the variance: [...]. Our proposal uses R(3) = $Y_i/S_i$ instead of R3 = $Y_i S_i$. It seems that respondents will perceive that R(3) provides more confidence in scrambling $Y_i$. The next lemma gives the statistical properties of an estimation of the population mean based on the reports R(3) of the sampled units i = 1, …, n.

Lemma 1. The estimator of the mean of Y using the scrambling procedure R(3) [...]
Proof. Expectation. Note that it is a ratio estimator. The expectation of R(3) is derived by using a Taylor series approximation [...]. Hence, the estimator is an approximately unbiased estimator of $\mu_Y$. Variance of the estimator. The variance of R(3) under the model is [...], where both expectations are obtained using a Taylor series approximation, as developed by Singh [26]. Then, the lemma is proved. □
Since the estimator is not unbiased, the bias is: [...]. The sampler is able to diminish this bias using a variable S such that [...]. All these conditions are possible as long as $\mu_S$ and $\sigma_S^2$ are fixed by the researcher, as we pointed out above. Note that [...] → ∞, and hence, R(3) is consistent.
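The role of the Taylor series approximation in a ratio report can be checked numerically. The sketch below is an illustration under stated assumptions, not the estimator of Lemma 1, whose expression is not legible here: it assumes S is uniform on (0.8, 1.2) so that the exact value of E(1/S) is available in closed form, uses a synthetic gamma population, and compares two hypothetical estimators of the mean, one that rescales the mean report by $\mu_S$ and one that divides it by the second-order approximation E(1/S) ≈ (1/$\mu_S$)(1 + $\sigma_S^2/\mu_S^2$).

```python
import math
import random
import statistics


def taylor_inverse_mean(mu_s, var_s):
    """Second-order Taylor approximation of E(1/S) around mu_s."""
    return (1.0 / mu_s) * (1.0 + var_s / mu_s**2)


def ratio_reports(population, n, low, high, rng):
    """Draw an SRSWR sample and scramble each value as Y / S.

    S ~ Uniform(low, high) is an assumption made for this sketch.
    """
    sample = rng.choices(population, k=n)  # sampling with replacement
    return [y / rng.uniform(low, high) for y in sample]


# Scrambling variable bounded away from zero (assumed distribution).
low, high = 0.8, 1.2
mu_s = (low + high) / 2
var_s = (high - low) ** 2 / 12

exact = math.log(high / low) / (high - low)  # exact E(1/S) for a uniform S
approx = taylor_inverse_mean(mu_s, var_s)

# Synthetic positive population (hypothetical, not the paper's data).
rng = random.Random(11)
population = [rng.gammavariate(0.25, 140.0) for _ in range(1157)]
true_mean = statistics.fmean(population)

reports = ratio_reports(population, 470, low, high, rng)
mean_report = statistics.fmean(reports)
naive = mu_s * mean_report        # ignores the curvature of 1/S
corrected = mean_report / approx  # uses the Taylor approximation

print(f"E(1/S): exact {exact:.5f}, Taylor {approx:.5f}")
print(f"true mean {true_mean:.2f}, naive {naive:.2f}, corrected {corrected:.2f}")
```

Because 1/S is convex, E(1/S) exceeds 1/$\mu_S$, so rescaling by $\mu_S$ alone overstates the mean on average; the gap shrinks as $\sigma_S^2/\mu_S^2$ decreases, which is why the choice of the distribution of S controls the bias.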
Our proposal uses the estimator [...]. An estimation theory for this RR scrambling procedure is given in Lemma 2.
Lemma 2. The use of the Z report has the following characteristics: [...], which is an estimator of the population mean of Y;
[...], which is the variance of the estimator;
(iv) [...], which is a natural estimator for the variance. The lemma is proved. □
Note that the bias is: [...]. With the same conditions fixed for the R(3) report, we have [...] ≅ 1. Then, the expression of the bias will be zero, and the choice of the researcher to use the proposed report R(3), that is, to have g = 0, will make the estimate unbiased.

Results
In this section, we evaluate the accuracy and efficiency of the estimators. Because the expectation of the R3 report by Saleem et al. [21] is zero, it is not possible to make a comparison with the R(3) report, so only the Z* and Z models were compared using simple random sampling with replacement (SRSWR). We present two ways to analyze the behavior of the estimators: the first is numerical and the second is graphical. To carry out the analysis, two different databases were used. For each one, two simulations of 1000 iterations were carried out, and the averages were computed. We fixed the probability of choosing between R1 and R2 at 0.5, because we want to have the same probability of choosing R1 or R2, since addition and subtraction are inverse processes of each other. Furthermore, in each database, we ran the simulation twice, fixing g = 0.7 for the first run and g = 0.3 for the second run. The values of the auxiliary variable S were fixed in such a way that the reports R produce results similar to the data in the databases.
This evaluation was performed with the following measurements. The ratio of the relative errors is the measure used to evaluate the comparative accuracy of $\hat{\mu}_Y$ between the estimators of models Z* and Z, which is [...]

Simulation with Data of Illicit Crops in Guerrero, Mexico
In the first database, we considered a sensitive variable to be the amount, in hectares, of destruction of poppy crops by the federal government in Mexico; we only used data from the State of Guerrero [22]. We considered that variable to be sensitive due to the media and social repercussions for the State of Guerrero, since it is a state where the majority of inhabitants make a living from tourism. The parameters of the sensitive variable are N = 1157, $\mu_Y$ = 35.0968, and $\sigma^2$ = 4947.115. The data used for the simulation cover the period 2015-2021. We used a sampling error of 2.5; therefore, n = 470 for SRSWR. Table 1 shows the numerical results of the estimations and measures for the models Z* and Z. Table 2 shows the results of the accuracy and efficiency of Z* against Z. The numerical results in Table 2 show that, for the accuracy of the estimation of the sensitive value Y with respect to the parameter $\mu_Y$, it is better to use our proposed model Z than the Z* model because its estimate is closer to the true parameter $\mu_Y$ = 35.0968 and thus is more accurate. This is confirmed by the relative errors in the parameter, which are smaller for both cases, g = 0.7 and g = 0.3. Regarding efficiency, it is better to use the Z* model than the proposed Z model, since it provides smaller values of the variance estimator.
In Table 1, we can confirm what was described above; in addition, we can specify that scrambling the sensitive value Y with R(3) (g = 0.3) provides more accuracy than R1/R2 (g = 0.7, for Z* and Z) and R3 (g = 0.3) in Z*. In addition, the ACP results for Z* show the inaccuracy of its estimator. On the other hand, the Z* model provides smaller values of ACV and AL.
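The expression used to obtain n from the fixed sampling error is not legible in this text. As a hedged check, one standard finite-population expression, n = N$\sigma^2$ / ((N − 1)e² + $\sigma^2$), truncated to an integer, reproduces both sample sizes reported in this document (n = 470 here and n = 1087 for the second database). Treating it as the formula actually used is an assumption, and the function name below is hypothetical.

```python
import math


def sample_size(pop_size, variance, error):
    """Sample size for a fixed sampling error (assumed formula).

    n = N * sigma^2 / ((N - 1) * e^2 + sigma^2)
    """
    return pop_size * variance / ((pop_size - 1) * error**2 + variance)


# Parameters reported for the two databases.
n_crops = math.floor(sample_size(1157, 4947.115, 2.5))
n_age = math.floor(sample_size(7240, 12.79736, 0.1))

print(n_crops, n_age)  # 470 1087
```

Rounding to the nearest integer instead of truncating would give 1088 for the second database, so truncation is part of the assumption.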

Simulation with Data about First Sexual Intercourse
In the second database, we used data from the National Health and Nutrition Survey 2021 [23] collected by the Ministry of Health of Mexico. From these data, as the sensitive variable Y, we selected the question, "At what age did you have your first sexual intercourse?" The responses have numeric values between 1 and 49, with N = 7240, $\mu_Y$ = 18.1221, and $\sigma^2$ = 12.79736. It should be noted that this question from the survey was only posed to women and men between 20 and 49 years old. We set the sampling error to 0.1 for SRSWR, and the resulting n is 1087. As in the previous simulation, we show the results of accuracy and efficiency in Tables 3 and 4. Regarding accuracy and efficiency when using Z* or Z, the numerical results in Table 4 coincide with the conclusions of the previous simulation; that is, the estimation is more accurate when using our proposed model than when using Z*. Again, as in the previous simulation, Table 3 shows that the R(3) report (g = 0.3) is more accurate than the others, and the percentage of replicates for which the CI covers $\mu_Y$ is zero when using the Z* model. In addition, it is better to use Z* than Z to reduce the variance.
For the first database, the sample size increases through n = 25, 50, …, 1000, and for the second database, the sample size increases through n = 50, 100, …, 2000. In the next figures, we can observe the accuracy and efficiency using both designs when we fixed the probability of choosing between R1 and R2 at 0.5, with g = 0.7 and g = 0.3.

Graphical Simulation
In Figure 1, in terms of the accuracy of the estimator $\hat{\mu}_Y$, it can be seen that it is better to use the Z model than the Z* model; as in the numerical results, it is more accurate to use the R(3) report (g = 0.3). Using the Z* model over the Z model with any report produces the minimum variance in the results. The graphs in Figure 2 agree with all the results already shown, where it is better to use the Z model for greater accuracy and the Z* model for the minimum variance.

Discussion
In this document, we propose a new randomized response technique, which allows us to obtain information on a variable of interest Y considered sensitive. In the study of the behavior of the proposed estimators, as already mentioned in this document, we treated the following as sensitive variables: the amount, in hectares, of destruction of poppy crops by the federal government of Mexico in the State of Guerrero and "At what age did you have your first sexual intercourse?" As a consequence of this study, for the first sensitive variable, it is preferable for researchers to use the proposed Z model to more accurately estimate the amount of poppy destruction. This is important in the national context because the current federal government [27] implements drug prevention programs and licit crop programs in order to reduce poppy crops, and the budgets for these programs are assigned on the basis of such estimates; the estimates should therefore be as close to reality as possible. Otherwise, there would be an underestimation, causing an inadequate budget for the implementation of the programs, or an overestimation, which would cause other programs in other areas to have a lower budget. Neither sampling error is acceptable in a country such as Mexico.
In the analysis of the sensitive question "At what age did you have your first sexual intercourse?", the same considerations can be made since the Z model provides the best estimate of the true value. On the other hand, a researcher in the area of health [28] may be interested not only in the estimated value of a sensitive characteristic but also in the values between which its true value lies, that is, in building confidence intervals. In that case, it is better to use Z* due to its minimum variance, since it will provide shorter confidence intervals and, hypothetically, estimates with greater precision. This last statement is valid for unbiased estimators.
As a limitation of this work, the estimators in our proposal are as biased as in the work of Saleem et al. [21]. In our case, this is due to the use of ratio estimators, which, by their nature, are biased. In addition, the ratio report R(3) is more difficult to apply in practice than an addition, subtraction, or multiplication report. Finally, only a simple random sampling design was used.
For the aforementioned issues, it is recommended that, in future works, the estimators of the Z model under simple random sampling (SRS) be extended to stratified simple random sampling (SSRS). This variation is for the purpose of determining under which conditions it is better to use SRS or SSRS with the Z and Z* models, defining the gain in accuracy and optimal allocation, and so on. In addition, it would be desirable to propose other estimators that are not of the ratio type to make comparisons, in terms of accuracy and efficiency, with the estimators proposed in this document.
Warner's proposal for estimating the population proportion $\pi_A$ of a sensitive characteristic A is $\hat{\pi}_A = \frac{\hat{\lambda} - (1 - P)}{2P - 1}$, where $\hat{\lambda}$ denotes the sample proportion of "yes" responses.
On the other hand, we have several measures to evaluate the efficiency of the estimator of the variance of the estimated mean in each model; these are: (i) the average coefficient of variation, $ACV = 100 \left( \sqrt{\hat{\sigma}^2_{\hat{\mu}_Y}} / \hat{\mu}_Y \right)$; (ii) the actual coverage percentage, ACP = percentage of replicates for which the CI covers $\mu_Y$, where the 95% confidence interval for $\mu_Y$ is $\left( \hat{\mu}_Y - 1.96\sqrt{\hat{\sigma}^2_{\hat{\mu}_Y}},\ \hat{\mu}_Y + 1.96\sqrt{\hat{\sigma}^2_{\hat{\mu}_Y}} \right)$; (iii) the average length of the confidence intervals, AL; and (iv) the average of the ratio of variances, [...].
The sample size n = [...] was calculated with a fixed sampling error.
Another way to analyze the behavior of Z* and Z with both databases is by visualizing the values of the following statistics: [...]
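The efficiency measures ACV, ACP, and AL can be computed from simulation replicates as in the following minimal sketch. It is an illustration under stated assumptions: it uses the plain sample mean under SRSWR on unscrambled values rather than the Z or Z* estimators, a synthetic population with roughly the reported mean and variance of the age variable, and a hypothetical function name.

```python
import math
import random


def evaluate_replicates(population, n, replicates, rng):
    """Return ACV, ACP and AL for the SRSWR sample mean over many replicates."""
    true_mean = sum(population) / len(population)
    cvs, lengths, covered = [], [], 0
    for _ in range(replicates):
        sample = rng.choices(population, k=n)  # sampling with replacement
        mean_hat = sum(sample) / n
        s2 = sum((y - mean_hat) ** 2 for y in sample) / (n - 1)
        var_hat = s2 / n  # estimated variance of the sample mean
        half_width = 1.96 * math.sqrt(var_hat)
        cvs.append(100 * math.sqrt(var_hat) / mean_hat)
        lengths.append(2 * half_width)
        if mean_hat - half_width <= true_mean <= mean_hat + half_width:
            covered += 1
    return {
        "ACV": sum(cvs) / replicates,      # average coefficient of variation (%)
        "ACP": 100 * covered / replicates, # actual coverage percentage
        "AL": sum(lengths) / replicates,   # average confidence-interval length
    }


# Synthetic population (hypothetical): 7240 values with mean about 18.1
# and variance about 12.8, not the survey data themselves.
rng = random.Random(7)
population = [rng.gauss(18.1, 3.58) for _ in range(7240)]
measures = evaluate_replicates(population, n=1087, replicates=1000, rng=rng)
print(measures)
```

For an approximately unbiased estimator, ACP should be near the nominal 95%; an ACP far below that level, even with small ACV and AL, signals that short intervals are centered on the wrong value.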

Figure 1. Accuracy and efficiency under Z* and Z in database of illicit crops.

Figure 2. Accuracy and efficiency under Z* and Z in database of first sexual intercourse.

Table 1. Estimates and measures to evaluate the estimators of the models.

Table 2. Accuracy and efficiency of the estimators.

Table 3. Estimates and measures to evaluate the estimators of the models.

Table 4. Accuracy and efficiency of the estimators.