Article

Kernel Density Estimation for Joint Scrambling in Sensitive Surveys

1 Department of Mathematics, Harvard University, 33 Lowell Mail Center, 10 Holyoke Place, Cambridge, MA 02138, USA
2 Department of Mathematics and Statistics, University of North Carolina at Greensboro, 116 Petty Building, Greensboro, NC 27412, USA
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(13), 2134; https://doi.org/10.3390/math13132134
Submission received: 6 June 2025 / Revised: 25 June 2025 / Accepted: 25 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Innovations in Survey Statistics and Survey Sampling)

Abstract

Randomized response models aim to protect respondent privacy when sampling sensitive variables but consequently compromise estimator efficiency. We propose a new sampling method, titled joint scrambling, which preserves all true responses while protecting privacy by asking each respondent to jointly speak both their true response and multiple random responses in an arbitrary order. We give a kernel density estimator for the density function with asymptotically equivalent mean squared error for the optimal bandwidth yet greater generality than existing techniques for randomized response models. We also give consistent, unbiased estimators for a general class of estimands including the mean. For the cumulative distribution function, this estimator is more computationally efficient with asymptotically lower mean squared error than existing approaches. All results are verified via simulation and evaluated with respect to natural generalizations of existing privacy notions.

1. Introduction

The validity of sampling methodologies for random variables and corresponding estimators assumes respondent honesty. With sensitive variables, respondents naturally hesitate to divulge private and potentially compromising information. Thus, sensitive variable sampling methods exhibit a trade-off between privacy and efficiency in which schemes are engineered to protect respondent privacy without significantly compromising the performance of the respective estimators. Most celebrated is the randomized response model, initially posited by Warner [1] for a binary random variable, Y. Warner's randomization is achieved by asking the direct or indirect question with probability $p$ or $1-p$, respectively, recording the corresponding response $Z = Y$ or $Z = 1 - Y$. Greenberg et al. [2] replaces this indirect question with an unrelated question $X \sim \mathrm{Bern}(\pi_X)$, generally with known parameter $\pi_X$. In this manner, if a respondent answers "yes" to the question of interest, for $0 < p < 1$ the interviewer cannot tell with certainty whether the respondent is admitting to the sensitive question or the unrelated/indirect question, and thus respondent privacy is protected. Of course, as $|p - (1-p)|$ increases, privacy declines, as the interviewer can use the fact that one question is more likely than another to their predictive advantage. As $|p - 1/2|$ decreases, by the privacy and efficiency trade-off, this additional protection comes at the cost of larger estimator variance.
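As an illustration, Warner's binary model can be simulated and the prevalence recovered by the method-of-moments estimator implied by $\mathrm{E}(Z) = p\pi + (1-p)(1-\pi)$. This is a minimal sketch; the prevalence 0.3 and design probability $p = 0.7$ are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def warner_estimate(z, p):
    # E(Z) = p*pi + (1-p)*(1-pi)  =>  pi_hat = (mean(Z) - (1-p)) / (2p - 1)
    return (z.mean() - (1 - p)) / (2 * p - 1)

# hypothetical setting: true prevalence 0.3, direct question asked w.p. p = 0.7
pi_y, p, n = 0.3, 0.7, 20_000
y = rng.random(n) < pi_y                    # true sensitive status
direct = rng.random(n) < p                  # which question each respondent drew
z = np.where(direct, y, ~y).astype(float)   # reported (randomized) response
pi_hat = warner_estimate(z, p)
```

As $p \to 1/2$ the denominator $2p - 1$ shrinks, inflating the estimator's variance, which is exactly the privacy/efficiency trade-off described above.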
Gupta et al. [3] aimed to compromise this trade-off via optionality, in which respondents use randomized response if and only if the question is sensitive for them, responding truthfully otherwise. Accordingly, the true response of a portion of respondents, that is, those who find the question non-sensitive, is preserved, decreasing the variance in comparison with the absence of optionality. Our proposed sampling method, joint scrambling, is motivated by a similar desire: to preserve all true responses of respondents without compromising privacy.
The aforementioned methods yield analogs for quantitative sensitive variables. Greenberg et al. [4] parallels the binary case, replacing the known parameter $\pi_X$ with the generally known distribution of X. For Warner, the concept of an indirect question becomes obsolete, replacing Z with a linear scrambling $Z = Y + S$ where S is some independent noise drawn from a known distribution [5]. In this vein, authors have considered alternative scrambling schemes, including multiplicative scrambling $Z = MY$ [6] and a general scrambling of the form $Z = MY + S$ where the multiplicative scrambling noise M is independent of the linear scrambling noise S [7]. Besides increasing variance, all scrambling mechanisms where M, S are continuous random variables almost surely never report a respondent's true response, as this occurs solely when $M = 1$ and $S = 0$, which happens with probability 0.
Summarizing the privacy protection of randomized response, Chapter 5 of Chaudhuri and Mukerjee [8] states that respondents, “must be convinced that their privacy is well guarded before they will be persuaded to make available damaging and incriminating documents”. However, beyond the mathematics of confidentiality protection, in practice, privacy protection is effective insofar as it is simple and explainable to the respondent.
We propose joint scrambling as an explainable sampling methodology for sensitive random variables that is well suited for kernel density estimation. In doing so, we also prove the asymptotic dominance of our distribution function estimator over existing methods and give mean estimators asymptotically on par with the efficiency of [5] while still reaping privacy benefits.

2. Joint Scrambling

The motivation of joint scrambling is to maintain respondent privacy while preserving the true answers of all respondents. We do so by requesting each respondent to respond with not just their true answer but a list of answers including their true answer and r other responses that are randomly sampled from some known distribution.
Formally, we begin by fixing the sampling parameter $r \ge 0$. Our sampling method proceeds as follows. The k-th respondent is asked to privately record $\mathbf{Y}_k = (Y_{k1}, \ldots, Y_{k,r+1})$ where

$$Y_{k1} = Y_k, \qquad Y_{ki} \overset{\text{i.i.d.}}{\sim} F_S \quad (2 \le i \le r+1)$$

such that $Y_k$ is their true response and $F_S$ is the known distribution of some randomization device. The respondent then speaks their private recordings in increasing order, which the interviewer records as $\mathbf{Z}_k = (Z_{k1}, \ldots, Z_{k,r+1})$ where

$$Z_{ki} = Y_{k(i)} \quad (1 \le i \le r+1)$$

is the i-th order statistic of the k-th respondent's recordings.
In this fashion, the respondent always reports their true response, as $Y_k \in \{Z_{k1}, \ldots, Z_{k,r+1}\}$, yet the interviewer is inhibited from deducing which of $\{Z_{k1}, \ldots, Z_{k,r+1}\}$ is the true response. Joint scrambling refers to the technique of masking a signal by shuffling scrambling variables and reporting a joint distribution as opposed to perturbing the signal itself.
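The collection step above can be sketched directly: each respondent's true value is concatenated with r scrambling draws and the row is sorted. The distributions below (Y normal, S uniform) are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_scramble(y, r, sample_s, rng):
    """Each respondent reports their true answer together with r draws from
    the scrambling distribution, sorted so the interviewer cannot tell
    which entry is genuine (the order statistics Z_k(1), ..., Z_k(r+1))."""
    n = len(y)
    s = sample_s(rng, (n, r))                        # r scrambling draws each
    z = np.column_stack([np.asarray(y).reshape(-1, 1), s])
    return np.sort(z, axis=1)                        # reported joint response

# hypothetical survey: Y ~ N(5, 1), scrambling S ~ Unif(0, 10), r = 2
y = rng.normal(5.0, 1.0, size=1000)
z = joint_scramble(y, r=2,
                   sample_s=lambda g, shape: g.uniform(0.0, 10.0, shape),
                   rng=rng)
```

Every row of `z` contains the respondent's true value, unlike additive or multiplicative scrambling, which almost surely destroy it.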

3. Distribution Estimators for Continuous Variables

We first aim to construct computationally efficient estimators with asymptotically equivalent or superior mean squared error. We begin with estimating quantitative distributions, specifically the probability density function f Y and the cumulative distribution function F Y . For convenience, we use the following notation:
$$M_a^b(g) = \int t^a g^b(t)\,dt \qquad (g : \mathbb{R} \to \mathbb{R})$$
Existing work on such estimators for randomized response models derives from the following kernel density estimator of the reported responses, $Z_1, \ldots, Z_n$, with bandwidth h and known kernel $k : \mathbb{R} \to \mathbb{R}$:

$$\hat f_Z(z) = \frac{1}{nh}\sum_{i=1}^{n} k\!\left(\frac{z - Z_i}{h}\right) = \frac{1}{n}\sum_{i=1}^{n} k_h(z - Z_i) \tag{1}$$

where $k_h(t) = k(t/h)/h$. By Wand and Jones [9], this estimator admits the following optimal bandwidth with respect to the asymptotic mean integrated squared error:

$$h_Z = \left\{\frac{M_0^2(k)}{M_2^1(k)^2\, M_0^2(f_Y'')}\right\}^{1/5} n^{-1/5}$$
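A minimal sketch of the base estimator (1); the Gaussian kernel and the simulated data are our illustrative choices, not the paper's prescription.

```python
import numpy as np

def kde(z, grid, h):
    """f_hat_Z(z0) = (1/n) * sum_i k_h(z0 - Z_i), with k_h(t) = k(t/h)/h."""
    u = (grid[:, None] - z[None, :]) / h              # (grid, n) differences
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)      # Gaussian kernel k(u)
    return k.mean(axis=1) / h

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, 2000)          # hypothetical reported responses
grid = np.linspace(-6.0, 6.0, 1201)
f_hat = kde(z, grid, h=0.3)
```

The estimate is a proper density up to discretization: nonnegative and integrating to one.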

3.1. Density Estimation

Current work only regards randomized response models using purely additive [10] or multiplicative scrambling [11] with a specific distribution, typically uniform. We begin by giving the following assumptions, as standard in the literature [9,10,11].
Assumption 1. 
Suppose $S \sim \mathrm{Unif}(0, T)$.
Assumption 2. 
Let $f_Y : \mathbb{R} \to \mathbb{R}$ be a density such that the second derivative $f_Y''$ is continuous, square-integrable, and monotone over $(-\infty, -M) \cup (M, \infty)$ for some $M > 0$. This substantiates the use of Taylor series expansion in this section's proofs, which also appear in Appendix A. We assume that the kernel $k : \mathbb{R} \to \mathbb{R}$ is a bounded probability density function, symmetric about the origin, with a finite fourth moment and a bounded, integrable first derivative. Since $k(-x) = k(x)$ for $x \in \mathbb{R}$, we also have

$$M_a^1(k) = \begin{cases} 1 & a = 0 \\ 0 & a \equiv 1 \pmod 2,\ 0 < a \le 4 \end{cases}$$
We also naturally restrict h, specifically the sequence $h_n$, such that $\lim_{n\to\infty} h = 0$.
Assumption 3. 
We restrict $h = \omega(n^{-1/3})$, meaning $\lim_{n\to\infty} nh^3 = \infty$.
Under Assumptions 1, 2, and 3, for the multiplicative scrambling $Z = S \cdot Y$, [11] proposed the following density estimator with its asymptotic mean squared error:

$$\hat f_Y^M(y) = -yT^2\, \hat f_Z'(yT) \tag{3}$$

$$\mathrm{AMSE}\{\hat f_Y^M(y)\} = y^2T^4 f_Z(yT)\, M_0^2(k')\,(nh^3)^{-1} + \tfrac{1}{4}\, y^2T^4 \{f_Z'''(yT)\, M_2^1(k)\}^2 h^4$$

$$\mathrm{AMISE}(\hat f_Y^M) = T^4 M_0^2(k')\, M_2^1(f_Z)\,(nh^3)^{-1} + \tfrac{1}{4}\, T^4 M_2^1(k)^2\, M_2^2(f_Z''')\, h^4 \tag{4}$$

$$\mathrm{MISE}\{\hat f_Y^M(y)\} = \mathrm{AMISE}\{\hat f_Y^M(y)\} + o\{(nh^3)^{-1} + h^4\}$$

where the derivatives of $f_Z$ are evaluated at $yT$. Minimizing (4) with respect to h, [11] derives an asymptotic optimal bandwidth.
$$h_M = \left\{\frac{3\, M_0^2(k')\, M_2^1(f_Z)}{M_2^1(k)^2\, M_2^2(f_Z''')}\right\}^{1/7} n^{-1/7}$$
For clarity, Assumption 3 restricts the convergence rate of h with respect to n to permit computation of an optimal kernel bandwidth via minimization of asymptotic mean squared errors and mean integrated squared errors. An exact mean integrated squared error may be derived in the manner of Section 2.6 of Wand and Jones [9], but it follows from Wand and Jones [9] and Mostafa and Ahmad [12] that this truncation of the mean squared error to the asymptotic mean squared error via Taylor expansion incurs an error of $o\{(nh^3)^{-1} + h^4\}$. Thus, asymptotically,
$$h_M = \Theta(n^{-1/7}), \qquad \mathrm{MISE}\{\hat f_Y^M(y)\} = \Theta(n^{-4/7}) + o(n^{-4/7}) = \Theta(n^{-4/7}) \tag{5}$$
Assumption 4. 
We restrict $h = \omega(n^{-1})$, meaning $\lim_{n\to\infty} nh = \infty$.
Assumption 5. 
Suppose $Y \ge 0$, such that, under Assumption 1, this also implies $Z \ge 0$.
Under Assumptions 1, 2, 4, and 5, Shou and Gupta [10] derived similar results for the additive scrambling $Z = Y + S$:

$$\hat f_Y^A(y) = \frac{1}{T}\int_y^{y+T} \hat f_Z(z)\,dz \tag{6}$$

$$\mathrm{AMSE}(\hat f_Y^A) = \frac{h^4 M_2^1(k)^2}{4T^2}\left\{\int_y^{y+T} f_Z''(z)\,dz\right\}^2 + \frac{M_0^2(k)}{nhT^2}\int_y^{y+T} q(z)\,dz + \left\{\frac{1}{T}\int_y^{y+T} f_Z(z)\,dz\right\}^2 + f_Y^2(y) - \frac{2}{T}\int_y^{y+T} f_Z(z)\, f_Y(y)\,dz$$

$$\mathrm{AMISE}\{\hat f_Y^A(y)\} = \int \frac{h^4 M_2^1(k)^2}{4T^2}\left\{\int_y^{y+T} f_Z''(z)\,dz\right\}^2 dy + \frac{M_0^2(k)}{nhT^2}\int\!\!\int_y^{y+T} q(z)\,dz\,dy + \int\left\{\frac{1}{T}\int_y^{y+T} f_Z(z)\,dz\right\}^2 dy + M_0^2(f_Y) - \frac{2}{T}\int\!\!\int_y^{y+T} f_Z(z)\, f_Y(y)\,dz\,dy$$

$$h_A = \left[\frac{M_0^2(k)\int\!\int_y^{y+T} q(z)\,dz\,dy}{M_2^1(k)^2 \int\{\int_y^{y+T} f_Z''(z)\,dz\}^2\,dy}\right]^{1/5} n^{-1/5}$$
Similar to the multiplicative case, it follows from Wand and Jones [9] and Mostafa and Ahmad [12] that truncating the mean squared error to the asymptotic mean squared error via Taylor approximation incurs an error of $o\{(nh)^{-1} + h^4\}$. Thus, asymptotically,
$$h_A = \Theta(n^{-1/5}), \qquad \mathrm{MISE}\{\hat f_Y^A(y)\} = \Theta(n^{-4/5}) + o(n^{-4/5}) = \Theta(n^{-4/5}) \tag{7}$$
In the randomized response domain, additive scrambling is typically more attractive than the multiplicative alternative, and we also see that the mean integrated squared error of the former's estimator (7) asymptotically dominates that of the latter (5), since $n^{-4/5} = o(n^{-4/7})$.
À la joint scrambling, our estimator modifies the kernel density estimate in (1) to estimate $f_Y$.
Theorem 1. 
Under solely Assumptions 2 and 4, we give the following consistent density estimator:

$$\hat f_Y^{\mathrm{JS}}(y) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{r+1} k_h(y - Z_{ij}) - r\,(k_h * f_S)(y) \tag{8}$$

where $*$ is the convolution operator given by $(k_h * f_S)(y) = \int k_h(y - t)\, f_S(t)\,dt$. We also derive the asymptotic mean squared error, asymptotic mean integrated squared error, asymptotic optimal bandwidth, and corresponding mean integrated squared error for (8).

$$\mathrm{AMSE}\{\hat f_Y^{\mathrm{JS}}(y)\} = (nh)^{-1} M_0^2(k)\{f_Y(y) + r f_S(y)\} + \tfrac{1}{4} h^4 f_Y''(y)^2 M_2^1(k)^2$$

$$\mathrm{AMISE}\{\hat f_Y^{\mathrm{JS}}(y)\} = (nh)^{-1}(1 + r)\, M_0^2(k) + \tfrac{1}{4} h^4 M_2^1(k)^2 M_0^2(f_Y'')$$

$$h_{\mathrm{JS}} = \left\{\frac{(r+1)\, M_0^2(k)}{M_2^1(k)^2\, M_0^2(f_Y'')}\right\}^{1/5} n^{-1/5} = (r+1)^{1/5}\, h_Z \tag{9}$$

$$\mathrm{MISE}\{\hat f_Y^{\mathrm{JS}}(y)\} = \mathrm{AMISE}\{\hat f_Y^{\mathrm{JS}}(y)\} + o\{(nh)^{-1} + h^4\} \tag{10}$$
Thus, asymptotically,
$$h_{\mathrm{JS}} = \Theta(r^{1/5} n^{-1/5}), \qquad \mathrm{MISE}\{\hat f_Y^{\mathrm{JS}}(y)\} = O(r^{4/5} n^{-4/5}) = \Theta(n^{-4/5}) \tag{11}$$
Remark 1. 
Since, jointly, the $Z_{ij}$ are not i.i.d. as in typical kernel density estimation, adapting the estimator of (1) to the joint scrambling setting as shown in the first term of (8) yields a bias that does not vanish as $n \to \infty$. The second term of (8) is an additive correction which mitigates this bias and permits convergence as described by (11). Further, Theorem 1 does not restrict the distribution of S or Y. While Ahmad [11] does generalize Assumption 1 to distributions of the form $F_S(s) \propto s^{\beta}$ for $0 \le s \le T$, the removal of Assumptions 1 and 5 provides the statistician with greater scrambling generality than the randomized response variants (6) and (3).
We see that the joint scrambling estimator in (8) is asymptotically equivalent in mean integrated squared error to the additive scrambling estimator in (6) and asymptotically dominates the multiplicative scrambling estimator in (3).
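A minimal sketch of estimator (8) with a Gaussian kernel and $S \sim \mathrm{Unif}(0, T)$, for which the convolution correction has the closed form $\{\Phi(y/h) - \Phi((y-T)/h)\}/T$; the data-generating choices (Y normal, $T = 10$, $r = 2$) are ours for illustration.

```python
import numpy as np
from math import erf, sqrt

_phi_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))  # N(0,1) CDF

def f_hat_js(y_grid, z, r, h, T):
    """(1/n) sum_{i,j} k_h(y - Z_ij) minus the correction r * (k_h * f_S)(y),
    evaluated in closed form for the uniform scrambling density f_S = 1/T."""
    n = z.shape[0]
    u = (y_grid[:, None, None] - z[None, :, :]) / h
    kern = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)
    kde_term = kern.sum(axis=(1, 2)) / n
    conv = (_phi_cdf(y_grid / h) - _phi_cdf((y_grid - T) / h)) / T
    return kde_term - r * conv

# hypothetical data: Y ~ N(5, 1), r = 2 scramblings from Unif(0, 10)
rng = np.random.default_rng(0)
n, r, T = 2000, 2, 10.0
y = rng.normal(5.0, 1.0, n)
z = np.sort(np.column_stack([y[:, None], rng.uniform(0, T, (n, r))]), axis=1)
grid = np.linspace(-5.0, 15.0, 801)
f_hat = f_hat_js(grid, z, r, h=0.3, T=T)
```

The uncorrected sum integrates to $r + 1$; subtracting the convolution term, which integrates to r, restores total mass one.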

Simulation Study

We use simulated data to evaluate the claimed scaling (11) of the joint scrambling density estimator's mean integrated squared error.
The tight match in the left plot of Figure 1 corroborates the claim that $\mathrm{MISE}(\hat f_Y^{\mathrm{JS}}) = \Theta(n^{-4/5})$ for constant r. While the right plot of Figure 1 does not match as closely, we observe that the simulated scaling with respect to r for constant n is much slower than the proven bound $\Theta(r^{4/5})$ in (11). That is, while the bound applies in all cases, in certain settings the error terms in the derivation of (11) partially cancel, leading to better scaling in practice.
We also argue that the joint scrambling estimator is better suited for practical application given its computational advantages. Computing the kernel density estimate for a given z requires O(n) computation to evaluate the summation in (1). The additive scrambling estimator in (6) integrates $\hat f_Z(z)$ for $z \in [y, y+T]$. Since the integral in (6) does not have a closed form, it must be numerically approximated. Advancements in numerical integration such as the Runge–Kutta family have replaced the naive Euler's method with higher-order approximations, typically of order four or five [13], but increasing approximation precision requires repeated evaluation of the expensive integrand, which inflates runtime. Of course, the multiplicative scrambling estimator in (3) avoids this integration at the cost of lower statistical efficiency. Since $k_h$ is a known function, the derivative $k_h'$ can typically be computed exactly or via numerical differentiation, meaning $\hat f_Z'(y)$ takes roughly O(n) time. While the convolution correction term in (8) is an integral, since both $k_h$ and $f_S$ are known functions, this term can be computed exactly or numerically integrated, which contributes negligibly to the asymptotic time complexity as evaluation of the integrand is O(1). Then, computing the modified kernel density estimate summation in (8) is simply O(n). Using Gauss–Kronrod quadrature [14] for numerical integration and finite differencing for numerical differentiation, we demonstrate these computational considerations via simulation in Table 1.
Comparing estimator runtime performance, T, across each method in Table 1, we see that Shou and Gupta [10] has the largest runtime, followed by joint scrambling and Ahmad [11]. This is natural as we expect numerical integration to be a computational bottleneck, particularly when the integrand has non-constant time complexity as in the additive scrambling case. On the other hand, numerical differentiation and summation for a known kernel are less intensive.
For the integrated variance, IV, Ahmad [11] yields the worst result, followed by joint scrambling and then Shou and Gupta [10]. However, the integrated absolute bias, IAB, is lowest for joint scrambling, followed by varying results for Shou and Gupta [10] and Ahmad [11]. In this manner, the joint scrambling estimator compensates for a higher integrated variance with a low integrated absolute bias, resulting in the lowest mean integrated squared error, MISE, followed by Shou and Gupta [10] and Ahmad [11].
Despite maintaining the same asymptotic mean integrated squared error as existing techniques, we conclude that the joint scrambling density estimator provides greater scrambling generality and computational efficiency, in addition to superior mean integrated squared error in practice.

3.2. Cumulative Distribution Estimation

Prior work on cumulative distribution estimators, $\hat F_Y$, for randomized response models is limited to a single estimator in the additive case when $Y \ge 0$ [10]. Generalizing this estimator to all Y, under Assumptions 1, 2, and 4,

$$\hat F_Y^A(y) = \int_{-\infty}^{y+T} \hat f_Z(w)\,dw - \frac{1}{T}\int_y^{y+T} (z - y)\, \hat f_Z(z)\,dz \tag{12}$$
While Shou and Gupta [10] omitted further analysis of (12), in Appendix A.2, we derive the asymptotic behavior of the mean integrated squared error using the optimal bandwidth for the traditional kernel density estimator in (1) via Wand and Jones [9].
$$\mathrm{MISE}(\hat F_Y^A) = \Omega(n^{-4/5}) \tag{13}$$
Since $F_S$ is known, we modify the empirical cumulative distribution function to estimate $F_Y$ à la joint scrambling.
Theorem 2. 
We give the following consistent, unbiased estimator for the cumulative distribution function $F_Y$:

$$\hat F_Y^{\mathrm{JS}}(y) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{r+1} \mathbb{I}(Z_{ij} \le y) - r F_S(y) \tag{14}$$

We also derive the exact mean squared error for (14).

$$\mathrm{MSE}\{\hat F_Y^{\mathrm{JS}}(y)\} = \big[F_Y(y)\{1 - F_Y(y)\} + r F_S(y)\{1 - F_S(y)\}\big]\, n^{-1} \tag{15}$$
Corollary 1. 
Maximizing (15),
max y im ( Y ) MSE { F ^ Y JS ( y ) } = ( r + 1 ) 4 n = Θ ( r n 1 ) = Θ ( n 1 ) MISE ( F ^ Y J S ) = Ω ( n 1 )
It follows from Corollary 1 that the mean squared error in (15) asymptotically dominates that of Shou and Gupta [10]. This also exhibits the aforementioned privacy–efficiency trade-off, as increasing r increases privacy but also increases the mean squared error linearly. Of course, since r is a fixed design parameter, it does not affect the asymptotic properties of the estimator's mean squared error, as $\Theta(rn^{-1}) = \Theta(n^{-1})$.
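Estimator (14) is a single pass over the reported values. A minimal sketch, again with our hypothetical choices Y ~ N(5, 1) and $S \sim \mathrm{Unif}(0, 10)$:

```python
import numpy as np

def F_hat_js(y, z, r, F_S):
    """Empirical CDF of all n(r+1) reported values (scaled by r+1), minus
    the known scrambling contribution r * F_S(y), as in (14)."""
    return (z <= y).sum() / z.shape[0] - r * F_S(y)

rng = np.random.default_rng(1)
n, r, T = 4000, 2, 10.0
y_true = rng.normal(5.0, 1.0, n)
z = np.sort(np.column_stack([y_true[:, None], rng.uniform(0, T, (n, r))]), axis=1)
F_S = lambda y: np.clip(y / T, 0.0, 1.0)    # Unif(0, T) CDF
est_median = F_hat_js(5.0, z, r, F_S)       # should be near F_Y(5) = 0.5
```

Note the exact tail behavior: below the support the estimate is 0 and above it is $(r+1) - r = 1$, with no numerical integration anywhere.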

Simulation Study

We give Figure 2 as a verification that the sample mean squared error from simulation agrees with the theoretical result in (15).
Beyond statistical efficiency, the proposed estimator in (14) is less computationally expensive than existing work: it simply requires an O(n) summation over the reported responses and yields an exact estimate. In contrast, the numerical integration of the kernel density estimate in (12) poses computational challenges similar to those of the density estimator of Shou and Gupta [10]. We demonstrate this via simulation in Table 2.
We observe from Table 2 that the comparison of joint scrambling's cumulative distribution estimator to Shou and Gupta [10] is analogous to that of the density estimator. The joint scrambling estimator has significantly lower runtime, T, and mean integrated squared error, MISE, where the latter is achieved by compensating for a higher integrated variance, IV, with a low integrated absolute bias, IAB. This advantage in mean integrated squared error concurs with the asymptotic comparison.
We conclude that the joint scrambling cumulative distribution function estimator is asymptotically more efficient than (12) both statistically and computationally.

4. Mean Estimation

Randomized response models generally regard a mean estimand, $\mu_Y = \mathrm{E}(Y)$. Analogous to the construction of (14) from the empirical cumulative distribution function, we construct a mean estimator empirically, using the sample mean and $\mu_S = \mathrm{E}(S)$.
Theorem 3. 
We give the following consistent, unbiased estimator of $\mu_Y$:

$$\hat\mu_Y^{\mathrm{JS}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{r+1} Z_{ij} - r\,\mathrm{E}(S) \tag{16}$$

We also give the variance for (16).

$$\mathrm{var}(\hat\mu_Y^{\mathrm{JS}}) = \{\mathrm{var}(Y) + r\,\mathrm{var}(S)\}\, n^{-1} = \Theta(rn^{-1}) = \Theta(n^{-1})$$
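A sketch of estimator (16): sum each respondent's reported list, average, and subtract the known scrambling contribution. The simulated distributions are hypothetical.

```python
import numpy as np

def mu_hat_js(z, r, mu_S):
    """Mean of per-respondent totals minus r * E(S), as in (16)."""
    return z.sum(axis=1).mean() - r * mu_S

rng = np.random.default_rng(2)
n, r, T = 5000, 2, 10.0
y_true = rng.normal(5.0, 1.0, n)
z = np.sort(np.column_stack([y_true[:, None], rng.uniform(0, T, (n, r))]), axis=1)
mu_hat = mu_hat_js(z, r, mu_S=T / 2)        # E(S) = 5 for S ~ Unif(0, 10)
```

Sorting within each row does not affect the row sum, so the order-statistic reporting costs nothing for mean estimation.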

Comparison with Additive Scrambling

Given reported responses $Z_1, \ldots, Z_n$, Warner [5] gives the following mean estimator and variance for the additive scrambling model $Z = Y + S$:

$$\hat\mu_Y = \frac{1}{n}\sum_{i=1}^{n} Z_i - \mu_S, \qquad \mathrm{var}(\hat\mu_Y) = \{\mathrm{var}(Y) + \mathrm{var}(S)\}\, n^{-1}$$
Typically, the scrambling is centered such that $\mathrm{E}(S) = 0$, making $\hat\mu_Y$ equivalent to the sample mean. For Warner's additive model, it is known that the privacy/efficiency trade-off is parameterized solely by the distribution $F_S$. For joint scrambling and the corresponding mean estimator (16), the privacy/efficiency trade-off is parameterized by both r and $F_S$, where the estimator variance is linear in r and $\mathrm{var}(S)$. Interestingly, for $r = 1$, the variances of the joint scrambling and Warner's additive mean estimators are equal:
$$\mathrm{var}(\hat\mu_Y^{\mathrm{JS}}) = \mathrm{var}(\hat\mu_Y) = \{\mathrm{var}(Y) + \mathrm{var}(S)\}\, n^{-1}$$

5. A General Class of Estimators

We unify our cumulative distribution function estimator (14) and mean estimator (16) under a general class of estimators.
Theorem 4. 
Let g be some function on random variables and consider the estimand $\theta = \mathrm{E}[g(Y)]$. Then, the following is a consistent, unbiased estimator of $\theta$ with the corresponding variance:

$$\hat\theta^{\mathrm{JS}} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{r+1} g(Z_{ij}) - r\,\mathrm{E}[g(S)], \qquad \mathrm{var}(\hat\theta^{\mathrm{JS}}) = \big[\mathrm{var}\{g(Y)\} + r\,\mathrm{var}\{g(S)\}\big]\, n^{-1}$$
Remark 2. 
Applying Theorem 4 to $g(Y) = \mathbb{I}(Y \le y)$ gives $\hat F_Y^{\mathrm{JS}}(y)$ in (14). Taking $g(Y) = Y$ gives (16).
While the density estimator is not in this general class but rather uses the kernel density estimator of (1), we see that the convolution correction term of (8) is the density analogue of the $r\,\mathrm{E}\{g(S)\}$ correction term in Theorem 4. Both terms seek to eliminate estimator bias originating from S. Theorem 4 allows our results to extend to mixed distributions and a large class of estimands, $\mathrm{E}\{g(Y)\}$.
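Theorem 4 is one line of code: transform every reported value by g, average, and subtract $r\,\mathrm{E}[g(S)]$. The sketch below recovers the mean and CDF special cases of Remark 2, plus a second moment; the distributions are hypothetical.

```python
import numpy as np

def theta_hat_js(z, r, g, Eg_S):
    """General estimator of E[g(Y)]: apply g elementwise to the reported
    joint responses, average per respondent, subtract the known r * E[g(S)]."""
    return g(z).sum(axis=1).mean() - r * Eg_S

rng = np.random.default_rng(3)
n, r, T = 5000, 2, 10.0
y_true = rng.normal(5.0, 1.0, n)
z = np.sort(np.column_stack([y_true[:, None], rng.uniform(0, T, (n, r))]), axis=1)

# Remark 2 special cases, plus E(Y^2) = 26 for Y ~ N(5, 1)
mean_est = theta_hat_js(z, r, lambda t: t, Eg_S=T / 2)          # E(S) = 5
cdf_est  = theta_hat_js(z, r, lambda t: (t <= 5.0), Eg_S=0.5)   # F_S(5) = 0.5
m2_est   = theta_hat_js(z, r, lambda t: t**2, Eg_S=T**2 / 3)    # E(S^2) = 100/3
```

Unbiasedness only needs $\mathrm{E}[g(S)]$ to be known, which is why the class extends to mixed distributions without modification.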

6. Privacy

Of course, the extent of privacy protection from joint scrambling depends on r and the scrambling distribution $F_S$. While many privacy notions exist, including differential privacy [15], which has seen substantial recent interest, we opt for comparison via standard privacy benchmarks in survey sampling and defer further discussion to Section 8. Particularly, we extend the traditional privacy measure $\phi = \mathrm{E}\{(Z - Y)^2\}$ from Yan et al. [16] to the joint response case by replacing Z with $\tilde Y = \mathrm{E}(Y \mid \mathbf{Z})$, the expected true response given the respondent's reported joint response.
In the one-dimensional case of $r = 0$, $\tilde Y = \mathrm{E}(Y \mid Z) = Z$ agrees with the original construction. For joint scrambling, the joint response is $\mathbf{Z} = (Z_1, \ldots, Z_{r+1})$, the random variables corresponding to the i-th number spoken to the interviewer. Ideally, with perfect scrambling $S \overset{d}{=} Y$, the interviewer can do no better than choosing a random element of $\{Z_{k1}, \ldots, Z_{k,r+1}\}$, which correctly predicts $Y_k = Y_{k1}$ with probability $1/(r+1)$.
Proposition 1. 
Ideal scrambling occurs when $S \overset{d}{=} Y$ and yields the following properties:

$$\tilde Y = \mathrm{E}(Y \mid \mathbf{Z}) = \frac{1}{r+1}\sum_{j=1}^{r+1} Z_j, \qquad \phi = \mathrm{E}\{(\tilde Y - Y)^2\} = \left(1 - \frac{1}{r+1}\right)\mathrm{var}(Y)$$
We see that $\phi/\mathrm{var}(Y) = 1 - 1/(r+1)$ is independent of the variance of the underlying sensitive variable, Y, and is simply the failure probability of the random-guess prediction. For this reason, we call this ratio the normalized privacy. Unfortunately, a general calculation of $\phi$ when S is not distributed as Y is difficult.
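Proposition 1 is easy to check by simulation: under ideal scrambling, $\tilde Y$ is the row average, and the normalized privacy should approach $1 - 1/(r+1)$, here $2/3$ for $r = 2$. The N(5, 1) choice for both Y and S is ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 20_000, 2
y = rng.normal(5.0, 1.0, n)
s = rng.normal(5.0, 1.0, (n, r))        # ideal scrambling: S =_d Y
z = np.column_stack([y[:, None], s])    # ordering is irrelevant for the mean
y_tilde = z.mean(axis=1)                # E(Y | Z) under ideal scrambling
phi_hat = ((y_tilde - y) ** 2).mean()   # simulated privacy E{(Y~ - Y)^2}
ratio = phi_hat / y.var()               # normalized privacy, ~ 1 - 1/(r+1)
```

As r grows the ratio tends to 1, i.e., the interviewer's best predictor becomes no better than guessing the population mean.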
While joint scrambling protects privacy better than randomized response with respect to $\tilde Y = \mathrm{E}(Y \mid \mathbf{Z})$, the expected response given the reported responses, it accomplishes this by leaking less sensitive information regarding Y. With one reported response in randomized response models, beyond $\mathrm{E}(Y \mid Z)$, no further information can be conveyed. However, for a joint response, reporting $Z_1, \ldots, Z_{r+1}$ conveys information about Y, specifically that $Z_1 \le Y \le Z_{r+1}$ and, more specifically, $Y \in \{Z_1, \ldots, Z_{r+1}\}$. When $r = 0$, joint scrambling becomes direct survey sampling with no privacy protection. For $r > 0$, provided Y and S have the same support, the interviewer can never almost surely know Y. Whether the information that $Y \in \{Z_1, \ldots, Z_{r+1}\}$ is sensitive or not depends on the realized responses $\{z_1, \ldots, z_{r+1}\}$ and the nature of the question. Suppose researchers poll students under the age of 21 regarding the number of alcoholic drinks they consume weekly. If $\min\{Z_1, \ldots, Z_{r+1}\} > 0$, even if Y is unknown to the interviewer, the positivity of the individual's reported responses incriminates the underage student and inhibits privacy. If $\min\{Z_1, \ldots, Z_{r+1}\} = 0$, the extent to which privacy is inhibited depends on the distributions $F_S$ and $F_Y$. However, for alternative sensitive questions, such as annual income, $\mathbb{I}(Y > 0)$ may not necessarily be sensitive.
Ultimately, the researcher must select privacy metrics, such as those proposed in this work, that encapsulate the privacy needs of the particular statistical question. The usage of joint scrambling versus any sensitive variable sampling method hinges on this discretion to gauge the privacy protection of the method, according to this metric, against the statistical and computational efficiency of the respective estimators.

7. Real Data Example

Since joint scrambling is first proposed in this work, real survey data collected via joint scrambling are not yet publicly available. Consequently, we consider a hypothetical example involving real-world data.
The National Survey on Drug Use and Health (NSDUH) from the Substance Abuse and Mental Health Services Administration (SAMHSA) is an annual survey considered to be the leading source of population-based statistical data on alcohol, drug use, and related health information. Data is collected via face-to-face interviews in people’s homes, though web-based interviews are also offered as of 2020 due to COVID-related changes to survey methodology.
While joint scrambling with a discrete scrambling S can be applied for discrete quantitative random variables, we desire a continuous random variable such that density estimation is well-posed. In particular, we choose the Body Mass Index (BMI), a health metric computed from one’s height and weight to estimate body fat. The 2023 NSDUH survey denotes the corresponding column “BMI2” with range [9.3, 68.6] and computes this via the standard formula from a respondent’s height and weight.
Suppose that SAMHSA were to return the NSDUH to fully in-person interviews. Since personal health data particularly related to one’s body fat may be sensitive when asked face-to-face, suppose SAMHSA further decides to use joint scrambling.
Using the 2023 NSDUH survey data, we simulate the effect of such a decision on the joint scrambling density estimator for various choices of bandwidth h and scrambling count r. For a choice of (r, h), for each response in the real survey, we concatenate r independent draws from the scrambling distribution $S \sim \mathrm{Unif}(9.3, 68.6)$ to the true response, order the concatenated responses for each respondent, and consider this to be the hypothetical survey data as if collected with joint scrambling. This scrambling distribution is natural as it is simply uniform over the BMI range.
From the left plot of Figure 3, we see that as the number of scramblings, r, increases, our density estimates worsen from the most solid to the most dotted orange line. One nuance revealed by this is that the joint scrambling density estimator may be negative if the kernel summation term is smaller than the convolution correction term. We also see that the best considered density estimate, that is, for r = 2 with the optimal bandwidth h = h JS , closely matches the empirical distribution.
From the right plot of Figure 3, we observe how this best density estimate changes as h deviates from the optimal bandwidth in either direction. Comparing the dotted lines to the solid in the right plot, we observe that the estimate does not change much in this case. However, other settings may be more sensitive to h.
From (9), we see that the optimal bandwidth choice depends on $f_Y''$, which is unknown. This is a nuance of kernel density estimation more broadly, and recent work in standard kernel density estimation settings has yielded a variety of bandwidth selection methods [17] and adaptive bandwidth approaches [18]. However, for the sake of comparison, we compute $h_{\mathrm{JS}}$ by first taking the true responses and performing standard kernel density estimation with a standard Gaussian kernel to approximate $f_Y$. Since this kernel is smooth, we can also approximate $f_Y''$ by instead summing the second derivative of the kernel. This allows us to derive $h_{\mathrm{JS}}$ via (9).
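The plug-in construction above can be sketched as follows. The Gaussian pilot and Silverman's rule for the pilot bandwidth are our illustrative choices; for the Gaussian kernel, $M_0^2(k) = 1/(2\sqrt{\pi})$ and $M_2^1(k) = 1$.

```python
import numpy as np

def h_js_pilot(y, r):
    """Plug-in bandwidth (9): estimate M_0^2(f_Y'') by differentiating a
    pilot Gaussian KDE of the true responses twice, then plug into
    h_JS = {(r+1) M_0^2(k) / (M_2^1(k)^2 M_0^2(f_Y''))}^{1/5} n^{-1/5}."""
    n = len(y)
    h0 = 1.06 * y.std() * n ** (-1 / 5)                 # Silverman pilot
    grid = np.linspace(y.min() - 4 * h0, y.max() + 4 * h0, 2000)
    u = (grid[:, None] - y[None, :]) / h0
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    f2 = ((u**2 - 1) * phi).sum(axis=1) / (n * h0**3)   # pilot f_Y'' estimate
    R_f2 = (f2**2).sum() * (grid[1] - grid[0])          # M_0^2(f_Y'')
    R_k, mu2 = 1 / (2 * np.sqrt(np.pi)), 1.0            # Gaussian kernel moments
    return ((r + 1) * R_k / (mu2**2 * R_f2)) ** (1 / 5) * n ** (-1 / 5)

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, 1000)
```

By (9), the returned bandwidth inflates by exactly $(r+1)^{1/5}$ relative to the no-scrambling case.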
We believe that the intuition from the literature on bandwidth selection can be applied to joint scrambling, and the specifics of bandwidth selection under scrambling are of interest in future work. Fortunately, the bandwidth choice does not affect the actual data collection, meaning that multiple bandwidths can be experimented with in practice.

8. Discussion

The proposed joint scrambling scheme can be interpreted as a particular form of local differential privacy [19]. Since density estimation in differentially private mechanisms is an area of active research [20], the kernel density estimator in (8) is significant. Moreover, the simplicity and estimand diversity of joint scrambling is particularly compelling in the survey sampling setting.
With recent work on randomized response, model complexity has increased to account for optionality [3], mixed randomization, untruthfulness [21], measurement error [22], combinations of the above, and more. Yet, the effectiveness of privacy protection is mediated by respondent understanding of their privacy guarantees, which hinges on model simplicity and explainability. The method proposed in Section 2 proceeds with limited respondent overhead, and we argue that, in tandem with its mathematical advantages, joint scrambling is a strong privacy-protecting alternative to randomized response in practice because it is interpretable.
Given the asymptotically superior mean squared errors of our cumulative distribution estimator, we are keen on expanding beyond the estimator class E { g ( Y ) } of Theorem 4 to quantile estimands and others to fully develop the theory of joint scrambling. We are also interested in bandwidth selection techniques for our density estimator and hope this work particularly encourages research in density estimation for sensitive surveys.

Author Contributions

Conceptualization, A.C.A.; methodology, A.C.A.; software, A.C.A.; validation, A.C.A. and S.G.; formal analysis, A.C.A.; investigation, A.C.A.; resources, S.G.; writing—original draft preparation, A.C.A.; writing—review and editing, A.C.A. and S.G.; visualization, A.C.A.; supervision, S.G.; project administration, S.G.; funding acquisition, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by NSF grant DMS-2244160.

Data Availability Statement

Simulation results are available open-access at the following GitHub repository: https://github.com/arulandu/joint-scrambling/releases/tag/v1.0.0 (accessed on 24 June 2025). The 2023 NSDUH survey data is also publicly available at the following link: https://www.samhsa.gov/data/data-we-collect/nsduh-national-survey-drug-use-and-health/datafiles (accessed on 24 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs for Section 3

Appendix A.1. Proof of Theorem 1

Proof. 
By Taylor expansion, for any random variable $A$,
\[
f_A(y - ht) = f_A(y) - h t f_A'(y) + \tfrac{1}{2} h^2 t^2 f_A''(y) + o(h^2).
\]
By this, the substitution $t \mapsto (y - t)/h$, and the observation that $t\,k(t)$ is odd,
\[
\begin{aligned}
E\{k_h(y - A)\} &= \int k_h(y - t) f_A(t)\,dt = \int k\{(y - t)/h\}\, f_A(t)\,\frac{dt}{h} = \int k(t)\, f_A(y - ht)\,dt \\
&= f_A(y) + \tfrac{1}{2} h^2 f_A''(y) M_2^1(k) + o(h^2), \\
E\{k_h(y - A)^2\} &= \int k_h(y - t)^2 f_A(t)\,dt = \frac{1}{h}\int k(t)^2 f_A(y - ht)\,dt \\
&= \frac{1}{h} f_A(y) M_0^2(k) + \tfrac{1}{2} h f_A''(y) M_2^2(k) + o(h).
\end{aligned}
\]
By Assumption 2, since $\lim_{n \to \infty} h = 0$, for any $\epsilon \in [0, 1)$,
\[
\tfrac{1}{2} h f_A''(y) M_2^2(k) + o(h) = o(h^{\epsilon}), \qquad
\tfrac{1}{2} h^2 f_A''(y) M_2^1(k) + o(h^2) = o(h^{1+\epsilon}).
\]
Then,
\[
\begin{aligned}
n^{-1}\operatorname{var}\{k_h(y - A)\}
&= n^{-1} E\{k_h(y - A)^2\} - n^{-1}\big[E\{k_h(y - A)\}\big]^2 \\
&= (nh)^{-1} f_A(y) M_0^2(k) + n^{-1}\big[o(h^{\epsilon}) - \{f_A(y) + o(h^{1+\epsilon})\}^2\big] \\
&= (nh)^{-1} f_A(y) M_0^2(k) + o\{(nh)^{-1}\}.
\end{aligned}
\]
Applying this result to S and Y, by independence,
\[
\begin{aligned}
\operatorname{var}\{\hat f_Y^{\mathrm{JS}}(y)\}
&= \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{r+1}\operatorname{var}\{k_h(y - Z_{ij})\}
 = n^{-1}\big[\operatorname{var}\{k_h(y - Y)\} + r\,\operatorname{var}\{k_h(y - S)\}\big] \\
&= (nh)^{-1} M_0^2(k)\{f_Y(y) + r f_S(y)\} + o\{(nh)^{-1}\}, \\
E\{\hat f_Y^{\mathrm{JS}}(y)\}
&= \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{r+1} E\{k_h(y - Z_{ij})\} - r\int k_h(y - t) f_S(t)\,dt \\
&= E\{k_h(y - Y)\} + r\,E\{k_h(y - S)\} - r\int k_h(y - t) f_S(t)\,dt = E\{k_h(y - Y)\}, \\
\operatorname{bias}\{\hat f_Y^{\mathrm{JS}}(y)\}
&= E\{\hat f_Y^{\mathrm{JS}}(y)\} - f_Y(y) = \tfrac{1}{2} h^2 f_Y''(y) M_2^1(k) + o(h^2).
\end{aligned}
\]
By the bias–variance decomposition of MSE,
\[
\begin{aligned}
\operatorname{MSE}\{\hat f_Y^{\mathrm{JS}}(y)\}
&= \operatorname{var}\{\hat f_Y^{\mathrm{JS}}(y)\} + \operatorname{bias}\{\hat f_Y^{\mathrm{JS}}(y)\}^2 \\
&= (nh)^{-1} M_0^2(k)\{f_Y(y) + r f_S(y)\} + \tfrac{1}{4} h^4 f_Y''(y)^2 M_2^1(k)^2 + o\{(nh)^{-1} + h^4\}, \\
\operatorname{MISE}(\hat f_Y^{\mathrm{JS}})
&= \int E\{\hat f_Y^{\mathrm{JS}}(y) - f_Y(y)\}^2\,dy = \int \operatorname{MSE}\{\hat f_Y^{\mathrm{JS}}(y)\}\,dy \\
&= (nh)^{-1}(1 + r) M_0^2(k) + \tfrac{1}{4} h^4 M_2^1(k)^2 M_0^2(f_Y'') + o\{(nh)^{-1} + h^4\}.
\end{aligned}
\]
It could be argued that the $o\{(nh)^{-1}\}$ term can be replaced with $o(n^{-1})$. Regardless, minimizing the leading terms of the MISE over $h$ yields
\[
h^* = \left\{ \frac{(r+1)\, M_0^2(k)}{n\, M_2^1(k)^2\, M_0^2(f_Y'')} \right\}^{1/5} = \Theta\{(r/n)^{1/5}\}, \qquad
\operatorname{MISE}(\hat f_Y^{\mathrm{JS}}) = \Theta\{(r/n)^{4/5}\}.
\]
This proves the claim. □
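As an illustration of the estimator and bandwidth order analyzed above, the following Python sketch (our own construction, not the authors' code; the Gaussian kernel, the closed-form correction for $S \sim \mathrm{Unif}(0,T)$, and all names are assumptions) evaluates $\hat f_Y^{\mathrm{JS}}(y) = n^{-1}\sum_i\sum_j k_h(y - Z_{ij}) - r\int k_h(y-t)f_S(t)\,dt$ at a point:

```python
import numpy as np
from math import erf, sqrt, pi

rng = np.random.default_rng(0)

def gauss_kernel(u):
    # Standard normal kernel k(u)
    return np.exp(-0.5 * u ** 2) / sqrt(2 * pi)

def gauss_cdf(u):
    # CDF K of the standard normal kernel
    return 0.5 * (1.0 + erf(u / sqrt(2)))

def js_kde(y, Z, r, h, T):
    """Joint scrambling KDE at y: the usual KDE over all reported values
    minus the known scramble contribution r * E{k_h(y - S)}."""
    n = Z.shape[0]
    data_part = gauss_kernel((y - Z.ravel()) / h).sum() / (n * h)
    correction = (r / T) * (gauss_cdf(y / h) - gauss_cdf((y - T) / h))
    return data_part - correction

# Each respondent reports their true Y plus r draws of S, in random order.
n, r, T = 5000, 2, 4.0
Y = rng.normal(0.0, 1.0, n)
S = rng.uniform(0.0, T, size=(n, r))
Z = np.column_stack([Y, S])
rng.permuted(Z, axis=1, out=Z)   # shuffle each row: order carries no information

h = (r / n) ** 0.2               # bandwidth of the optimal order Theta{(r/n)^{1/5}}
est = js_kde(0.0, Z, r, h, T)    # estimate of f_Y(0) for Y ~ N(0, 1)
```

For $S \sim \mathrm{Unif}(0,T)$ the correction integral reduces to $(r/T)\{K(y/h) - K((y-T)/h)\}$, where $K$ is the kernel's CDF, so no numerical quadrature is needed.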

Appendix A.2. Proof of Equation (13)

Proof. 
By Assumption 1, since $S \sim \mathrm{Unif}(0, T)$,
\[
\begin{aligned}
F_Y(y) &= \operatorname{pr}(Z - S \le y) = \int f_Z(z)\{1 - F_S(z - y)\}\,dz \\
&= \int_0^{y} f_Z(z)\,dz + \int_{y}^{y+T} f_Z(z)\left(1 - \frac{z - y}{T}\right) dz \\
&= \int_0^{y+T} f_Z(z)\,dz - \frac{1}{T}\int_{y}^{y+T} (z - y)\, f_Z(z)\,dz.
\end{aligned}
\]
Then, using the Taylor expansion in the proof of Theorem 1 and integration by parts,
\[
\begin{aligned}
\operatorname{bias}\{\hat F_Y(y)\}
&= \int_0^{y+T} E\{\hat f_Z(z)\}\,dz - \frac{1}{T}\int_{y}^{y+T} (z - y)\, E\{\hat f_Z(z)\}\,dz - F_Y(y) \\
&= \int_0^{y+T}\left\{\tfrac{1}{2} h^2 f_Z''(z) M_2^1(k) + o(h^2)\right\} dz
 - \frac{1}{T}\int_{y}^{y+T}(z - y)\left\{\tfrac{1}{2} h^2 f_Z''(z) M_2^1(k) + o(h^2)\right\} dz \\
&= \frac{h^2 M_2^1(k)}{2}\{f_Z'(y+T) - f_Z'(0)\} - \frac{h^2 M_2^1(k)}{2T}\int_{y}^{y+T}(z - y)\, f_Z''(z)\,dz + o(h^2) \\
&= \frac{h^2 M_2^1(k)}{2}\left[ f_Z'(y+T) - f_Z'(0) - \frac{1}{T}\left\{ (z - y) f_Z'(z)\Big|_{y}^{y+T} - \int_{y}^{y+T} f_Z'(z)\,dz \right\} \right] + o(h^2) \\
&= \frac{h^2 M_2^1(k)}{2}\left[ \frac{1}{T}\{f_Z(y+T) - f_Z(y)\} - f_Z'(0) \right] + o(h^2).
\end{aligned}
\]
Let $\alpha(y) = \{f_Z(y+T) - f_Z(y)\}/T - f_Z'(0)$. Then,
\[
\operatorname{bias}\{\hat F_Y(y)\}^2
= \frac{M_2^1(k)^2\, \alpha(y)^2\, h^4}{4} + M_2^1(k)\,\alpha(y)\, h^2\, o(h^2) + o(h^4)
= \Theta(h^4) + o(h^4) + o(h^4) = \Theta(h^4).
\]
By (2), $h_Z^* = \Theta(n^{-1/5})$. Then,
\[
\operatorname{MSE}\{\hat F_Y(y)\} = \operatorname{var}\{\hat F_Y(y)\} + \operatorname{bias}\{\hat F_Y(y)\}^2
= \operatorname{var}\{\hat F_Y(y)\} + \Theta(n^{-4/5}) = \Omega(n^{-4/5}).
\]
It follows that $\operatorname{MISE}(\hat F_Y) = \Omega(n^{-4/5})$ as well. This proves the claim. □

Appendix A.3. Proof of Theorem 2

Proof. 
This holds by Theorem 4 with $g(Y) = I(Y \le y)$. □

Appendix A.4. Proof of Corollary 1

Proof. 
The quadratic $q(t) = t(1 - t)$ attains its global maximum of $1/4$ at $t = 1/2$. Then,
\[
\operatorname{MSE}\{\hat F_Y^{\mathrm{JS}}(y)\} = n^{-1}\big[ q\{F_Y(y)\} + r\, q\{F_S(y)\} \big] \le \frac{r + 1}{4n}.
\]
The claim follows. □
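The bound in Corollary 1 can be checked by Monte Carlo. The sketch below is our own illustration (all names hypothetical), using the estimator form of Theorem 2 with the indicator choice of $g$, namely $\hat F_Y^{\mathrm{JS}}(y) = n^{-1}\sum_i\sum_j I(Z_{ij} \le y) - r F_S(y)$:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)

def F_hat(y, Z, r, F_S):
    # Joint scrambling CDF estimate: empirical CDF over all reported
    # values minus the known scramble contribution r * F_S(y).
    n = Z.shape[0]
    return (Z <= y).sum() / n - r * F_S(y)

n, r, T = 1000, 2, 4.0
y = 0.5
F_S = lambda t: min(max(t / T, 0.0), 1.0)    # Unif(0, T) CDF
F_Y_true = 0.5 * (1.0 + erf(y / sqrt(2)))    # N(0, 1) CDF at y

sq_errs = []
for _ in range(400):
    Ytrue = rng.normal(0.0, 1.0, n)
    S = rng.uniform(0.0, T, size=(n, r))
    Z = np.column_stack([Ytrue, S])
    sq_errs.append((F_hat(y, Z, r, F_S) - F_Y_true) ** 2)

mse = float(np.mean(sq_errs))
bound = (r + 1) / (4 * n)   # Corollary 1 bound: (r+1)/(4n)
```

Under this configuration the exact MSE is $n^{-1}[q\{F_Y(y)\} + r\,q\{F_S(y)\}]$, which sits well inside the worst-case bound.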

Appendix B. Proofs for Section 4

Proof of Theorem 3

Proof. 
This holds by Theorem 4 with $g(Y) = Y$. □

Appendix C. Proofs for Section 5

Proof of Theorem 4

Proof. 
By the nature of joint scrambling,
\[
E(\hat\theta_{\mathrm{JS}})
= \frac{1}{n}\sum_{i=1}^{n} E\left\{\sum_{j=1}^{r+1} g(Z_{ij})\right\} - r\,E\{g(S)\}
= \frac{1}{n}\sum_{i=1}^{n}\big[ E\{g(Y)\} + r\,E\{g(S)\} \big] - r\,E\{g(S)\}
= E\{g(Y)\}.
\]
It follows that θ ^ JS is unbiased. Similarly,
\[
\operatorname{var}(\hat\theta_{\mathrm{JS}})
= \frac{n}{n^2}\operatorname{var}\left\{\sum_{j=1}^{r+1} g(Z_{1j})\right\}
= n^{-1}\big[\operatorname{var}\{g(Y)\} + r\,\operatorname{var}\{g(S)\}\big].
\]
Since the variance tends to 0 as $n \to \infty$, this unbiased estimator is consistent. □
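The unbiasedness and variance just derived can be checked by simulation. The following Python sketch (our own code with hypothetical names; taking $g(Y) = Y$ recovers the mean estimator of Theorem 3) averages $\hat\theta_{\mathrm{JS}}$ over repeated surveys:

```python
import numpy as np

rng = np.random.default_rng(1)

def theta_hat(Z, r, Eg_S, g=lambda x: x):
    # Estimator of E{g(Y)}: sum g over all reported values, divide by the
    # number of respondents, then subtract the known scramble contribution.
    n = Z.shape[0]
    return g(Z).sum() / n - r * Eg_S

n, r = 2000, 3
estimates = []
for _ in range(500):
    Y = rng.normal(5.0, 2.0, n)              # E(Y) = 5, var(Y) = 4
    S = rng.uniform(0.0, 4.0, size=(n, r))   # E(S) = 2, var(S) = 4/3
    Z = np.column_stack([Y, S])
    estimates.append(theta_hat(Z, r, Eg_S=2.0))

mc_mean = float(np.mean(estimates))  # should sit near E(Y) = 5
mc_var = float(np.var(estimates))    # approx [var(Y) + r var(S)] / n = 0.004
```

The empirical mean and variance of the 500 estimates track $E(Y)$ and $n^{-1}[\operatorname{var}(Y) + r\operatorname{var}(S)]$, as the proof predicts.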

Appendix D. Proofs for Section 6

Proof of Proposition 1

Proof. 
If $S$ has the same distribution as $Y$, then given $Z$, there is a $1/(r+1)$ chance that $Y = Z_i$ for each $i = 1, \dots, r+1$. By the nature of expected value, this proves the left-hand claim. Without loss of generality, let $Z_1 = Y$. Substituting,
\[
\begin{aligned}
E\{(\tilde Y - Y)^2\}
&= \frac{1}{(r+1)^2}\, E\left\{\left(\sum_{i=2}^{r+1} Z_i - r Z_1\right)^2\right\} \\
&= \frac{1}{(r+1)^2}\left[ \sum_{i=2}^{r+1} E(Z_i^2) + 2\sum_{2 \le j < i \le r+1} E(Z_i)E(Z_j) - 2r\, E(Z_1)\sum_{i=2}^{r+1} E(Z_i) + r^2 E(Z_1^2) \right] \\
&= \frac{1}{(r+1)^2}\big[ r\, E(Y^2) + r(r-1) E(Y)^2 - 2r\, E(Y)\{r\, E(Y)\} + r^2 E(Y^2) \big] \\
&= \frac{r^2 + r}{(r+1)^2}\{E(Y^2) - E(Y)^2\} = \left(1 - \frac{1}{r+1}\right)\operatorname{var}(Y).
\end{aligned}
\]
This proves the right-hand claim. □
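Proposition 1 can also be verified empirically. The sketch below (our own construction; names hypothetical) draws the scrambles from the distribution of $Y$ and checks that the averaging attack $\tilde Y$ attains mean squared error $(1 - 1/(r+1))\operatorname{var}(Y)$:

```python
import numpy as np

rng = np.random.default_rng(2)

# When S is distributed like Y, a natural adversarial point guess for Y
# given the reported multiset is the average Ytilde of the r+1 values.
n, r = 200_000, 2
Y = rng.normal(0.0, 1.0, n)
S = rng.normal(0.0, 1.0, size=(n, r))   # scrambles drawn from the law of Y
Z = np.column_stack([Y, S])
Ytilde = Z.mean(axis=1)

mse = float(np.mean((Ytilde - Y) ** 2))
target = (1.0 - 1.0 / (r + 1)) * 1.0    # (1 - 1/(r+1)) var(Y) = 2/3 here
```

With $r = 2$ and $\operatorname{var}(Y) = 1$, the attack's MSE concentrates near $2/3$, matching the right-hand claim.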

References

1. Warner, S.L. Randomized response: A survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 1965, 60, 63–69.
2. Greenberg, B.G.; Abul-Ela, A.L.A.; Simmons, W.R.; Horvitz, D.G. The unrelated question randomized response model: Theoretical framework. J. Am. Stat. Assoc. 1969, 64, 520–539.
3. Gupta, S.; Gupta, B.; Singh, S. Estimation of sensitivity level of personal interview survey questions. J. Stat. Plan. Inference 2002, 100, 239–247.
4. Greenberg, B.G.; Kuebler, R.R., Jr.; Abernathy, J.R.; Horvitz, D.G. Application of the randomized response technique in obtaining quantitative data. J. Am. Stat. Assoc. 1971, 66, 243–250.
5. Warner, S.L. The linear randomized response model. J. Am. Stat. Assoc. 1971, 66, 884–888.
6. Eichhorn, B.H.; Hayre, L.S. Scrambled randomized response methods for obtaining sensitive quantitative data. J. Stat. Plan. Inference 1983, 7, 307–316.
7. Diana, G.; Perri, P.F. A class of estimators for quantitative sensitive data. Stat. Pap. 2011, 52, 633–650.
8. Chaudhuri, A.; Mukerjee, R. Randomized Response: Theory and Techniques; Routledge: London, UK, 2020.
9. Wand, M.P.; Jones, M.C. Kernel Smoothing; CRC Press: Boca Raton, FL, USA, 1994.
10. Shou, W.; Gupta, S. Kernel density estimation using additive randomized response technique (RRT) models. Commun. Stat. Simul. Comput. 2023, 1–10.
11. Ahmad, I.A. Kernel estimation in a continuous randomized response model. In Handbook of Applied Econometrics and Statistical Inference; CRC Press: Boca Raton, FL, USA, 2002; pp. 119–136.
12. Mostafa, S.A.; Ahmad, I.A. Kernel density estimation from complex surveys in the presence of complete auxiliary information. Metrika 2019, 82, 295–338.
13. Butcher, J.C. A history of Runge–Kutta methods. Appl. Numer. Math. 1996, 20, 247–260.
14. Notaris, S.E. Gauss–Kronrod quadrature formulae: A survey of fifty years of research. Electron. Trans. Numer. Anal. 2016, 45, 371–404.
15. Dwork, C. Differential privacy. In Automata, Languages and Programming, Proceedings of the 33rd International Colloquium, ICALP 2006, Venice, Italy, 10–14 July 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1–12.
16. Yan, Z.; Wang, J.; Lai, J. An efficiency and protection degree-based comparison among the quantitative randomized response strategies. Commun. Stat. Theory Methods 2008, 38, 400–408.
17. Jones, M.C.; Marron, J.S.; Sheather, S.J. A brief survey of bandwidth selection for density estimation. J. Am. Stat. Assoc. 1996, 91, 401–407.
18. Carlos, H.A.; Shi, X.; Sargent, J.; Tanski, S.; Berke, E.M. Density estimation and adaptive bandwidths: A primer for public health practitioners. Int. J. Health Geogr. 2010, 9, 1–8.
19. Xiong, X.; Liu, S.; Li, D.; Cai, Z.; Niu, X. A comprehensive survey on local differential privacy. Secur. Commun. Netw. 2020, 2020, 8829523.
20. Kroll, M. On density estimation at a fixed point under local differential privacy. Electron. J. Stat. 2021, 15, 1783–1813.
21. Gupta, S.; Parker, M.; Khalil, S. A Ratio Estimator for the Mean Using a Mixture Optional Enhance Trust (MOET) Randomized Response Model. Mathematics 2024, 12, 3617.
22. Zahid, E.; Shabbir, J. Estimation of population mean in the presence of measurement error and non response under stratified random sampling. PLoS ONE 2018, 13, e0191572.
Figure 1. Mean integrated squared error of (8) averaged over 100 trials (dot-solid), together with a regression (dashed) of the asymptotic form β_± x^{±4/5}, fitted with β_− = 0.255 (left) and β_+ = 0.0002 (right). We fix Y ∼ N(0, 1) and S ∼ Unif(0, 4), varying n with r = 2 (left) and varying r with n = 10^2 (right). Integrated squared error was computed via Simpson's rule on a mesh of 100 evenly spaced y ∈ [−3, 3].
Figure 2. Sample mean squared error (dots) of the estimator in (14) versus the theoretical mean squared error (solid) in (15). We fix r = 2, n = 10^3, Y ∼ N(0, 1), and S ∼ Unif(−1, 1), with 10^4 sample trials.
Figure 3. Empirical distribution of BMI in the 2023 NSDUH survey data (histogram) against the density estimated by the joint scrambling estimator on the hypothetical survey data, for various r with fixed optimal bandwidth h^* (left) and for various bandwidths h with fixed r = 2 (right). Both the solid orange line (left) and the solid purple line (right) show the density estimate for r = 2 with the optimal bandwidth h = h^*_JS.
Table 1. Efficiency and runtime comparison of density estimators. n, number of respondents; h^*, optimal bandwidth; T, runtime in seconds; IAB, integrated absolute bias; IV, integrated variance; MISE, mean integrated squared error. We fix Y ∼ N(0, 1), S ∼ Unif(0, T) with T = 2, and r = 2. We use the kernel k(x) = f_K(x), where K ∼ N(0, 1/4). All values in columns IAB, IV, and MISE are approximated via Simpson's rule on a mesh of 100 evenly spaced points in [−3, 3] and have been multiplied by 10^2. All values are averaged over 10^2 trials.
Method              | n    | h^*   | T     | IAB   | IV    | MISE
--------------------|------|-------|-------|-------|-------|------
Joint Scrambling    | 10^2 | 1.051 | 0.106 | 12.23 | 0.599 | 0.930
Shou and Gupta [10] | 10^2 | 0.987 | 0.238 | 28.91 | 0.152 | 2.040
Ahmad [11]          | 10^2 | 0.366 | 0.024 | 25.03 | 9.553 | 14.29
Joint Scrambling    | 10^3 | 0.663 | 0.439 | 5.476 | 0.165 | 0.232
Shou and Gupta [10] | 10^3 | 0.623 | 3.049 | 26.04 | 0.017 | 1.549
Ahmad [11]          | 10^3 | 0.263 | 0.221 | 22.82 | 1.336 | 6.190
Joint Scrambling    | 10^4 | 0.418 | 3.468 | 1.754 | 0.026 | 0.034
Shou and Gupta [10] | 10^4 | 0.393 | 64.82 | 24.84 | 0.002 | 1.400
Ahmad [11]          | 10^4 | 0.190 | 2.226 | 22.82 | 0.176 | 5.259
Table 2. A comparison of cumulative distribution function estimators. n, number of respondents; h^*, optimal bandwidth; T, runtime in seconds; IAB, integrated absolute bias; IV, integrated variance; MISE, mean integrated squared error. We fix Y ∼ N(0, 1) and S ∼ Unif(0, T) with T = 2 and r = 2. We use the kernel k(x) = f_K(x), where K ∼ N(0, 1/4). All values in columns IAB, IV, and MISE are approximated via Simpson's rule on a mesh of 100 evenly spaced points in [−3, 3] and have been multiplied by 10^2. All values are averaged over 10^2 trials.
Method              | n    | h^*   | T     | IAB   | IV    | MISE
--------------------|------|-------|-------|-------|-------|------
Joint Scrambling    | 10^2 | 0.418 | 0.003 | 0.017 | 1.133 | 1.150
Shou and Gupta [10] | 10^2 | 0.987 | 1.107 | 1.815 | 0.350 | 2.165
Joint Scrambling    | 10^3 | 0.418 | 0.002 | 0.002 | 0.118 | 0.120
Shou and Gupta [10] | 10^3 | 0.623 | 10.74 | 1.373 | 0.035 | 1.409
Joint Scrambling    | 10^4 | 0.418 | 0.004 | 0.000 | 0.013 | 0.013
Shou and Gupta [10] | 10^4 | 0.393 | 108.6 | 1.188 | 0.004 | 1.193


