Article

Robust Estimation Method against Poisoning Attacks for Key-Value Data with Local Differential Privacy

by Hikaru Horigome 1,*, Hiroaki Kikuchi 2, Masahiro Fujita 1 and Chia-Mu Yu 3

1 Mitsubishi Electric Corporation, 5-1-1 Ofuna, Kamakura 247-8501, Japan
2 Graduate School of Advanced Mathematical Science, Meiji University, 4-21-1 Nakano, Tokyo 164-8525, Japan
3 Department of Electronics and Electrical Engineering, National Yang Ming Chiao Tung University (NYCU), 1001 University Rd., Hsinchu 300, Taiwan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 6368; https://doi.org/10.3390/app14146368
Submission received: 16 April 2024 / Revised: 21 June 2024 / Accepted: 6 July 2024 / Published: 22 July 2024
(This article belongs to the Special Issue Progress and Research in Cybersecurity and Data Privacy)

Abstract: Local differential privacy (LDP) protects user information from potential threats by randomizing data on individual devices before transmission to untrusted collectors. This approach enables collectors to derive user statistics by analyzing randomized data, thereby presenting a promising avenue for privacy-preserving data collection. For key–value data, in which discrete and continuous values coexist, PrivKV has been introduced as an LDP protocol for secure collection. However, this framework is susceptible to poisoning attacks, in which malicious users manipulate the analysis results without detection. To address this vulnerability, we propose an expectation maximization (EM)-based estimation algorithm combined with a cryptographic protocol for secure random sampling. Our LDP protocol, emPrivKV, offers two key advantages: it improves the accuracy of statistical information estimated from randomized data, and it enhances resilience against the manipulation of statistics, that is, poisoning attacks. This study presents empirical results of applying the emPrivKV protocol to both synthetic and open datasets, highlighting a notable improvement in the precision of statistical value estimation and robustness against poisoning attacks. On the MovieLens dataset, emPrivKV reduced the frequency gain and the mean gain to 17.1% and 25.9% of those of PrivKV, respectively, when the number of fake users was 0.1 times that of genuine users. Our findings contribute to the ongoing discourse on refining LDP protocols for key–value data in scenarios involving privacy-sensitive information.

1. Introduction

Wearable devices are seamlessly integrated into our daily lives and capture various key–value data that span health metrics, physical activity tracking, sleep patterns, location data, biometric identifiers, and environmental conditions. For example, consider the key–value data generated from smartphone app usage, where the key represents the application name and the corresponding value indicates the time spent on each app; e.g., a user may spend 2 h on YouTube, 1 h on Twitter, and 0.5 h on Instagram. In this scenario, the app name serves as the key, while the respective durations constitute the values. These key–value pairs not only provide a snapshot of the user's digital behavior, but also unveil insights into their preferences and social interactions. Although such data are valuable for personalization and app optimization, they raise privacy concerns relating to tracking the online activities of individuals. The aggregation of these seemingly benign key–value pairs could enable the inference of aspects of a user's lifestyle, underscoring the need to carefully consider the privacy implications.
Local differential privacy (LDP) is a privacy-enhancing technique for mitigating these privacy challenges. In this approach, individual users locally perturb their personal data before transmission to an untrusted server. Various LDP protocols have been introduced to address distinct data types. Erlingsson et al. proposed the LDP framework RAPPOR [1], which represents a significant advancement in privacy-preserving techniques. Ye et al. presented PrivKV [2], an LDP scheme that is tailored for the secure collection of key–value data and is characterized by a two-dimensional structure accommodating both discrete and continuous values. The landscape of LDP protocols for key–value data has been further enriched by the introduction of several alternative schemes [3,4], reflecting the ongoing exploration within the realm of LDP.
However, the localized nature of perturbation in LDP protocols introduces a vulnerability to "poisoning attacks". In such attacks, malicious entities introduce fake users who transmit counterfeit data for specific keys with the intention of manipulating the analytical outcomes of the server, such as altering the frequency of particular keys or their mean reputation scores. If a fraudulent user submits fabricated key–value data that follow the prescribed LDP protocol, the server cannot detect the forgery because of the privacy guarantee offered by LDP. Cao et al. [5] studied a family of poisoning attacks on LDP schemes. Furthermore, Wu et al. [6] identified three distinct types of poisoning attacks that specifically target PrivKV, demonstrating its vulnerability. They proposed defense mechanisms against these poisoning attacks; however, these defenses necessitate the prolonged observation of data collection for optimal efficacy.
In this study, we investigate the vulnerability of the LDP protocol for key–value data to poisoning attacks. First, we introduce a cryptographic protocol known as oblivious transfer (OT) [7] to prevent intentional key selection by fake users. In contrast to the conventional approach of conducting random sampling locally, our protocol involves the server collaboratively in the secure sampling process, providing proactive security, as opposed to the reactive methods employed in prior studies. Second, we demonstrate the vulnerability of PrivKV to poisoning attacks through its estimation algorithm. Because the estimation relies on a single observed frequency for each key, it is vulnerable to manipulation, particularly when the number of targeted keys is small. To address this limitation, we propose integrating an expectation maximization (EM) algorithm [8]. By iteratively estimating posterior probabilities, the EM algorithm enforces consistency across all observed values, thereby improving accuracy even in scenarios with a large user base.
We performed empirical investigations using synthetic data and common open datasets to assess the robustness of the proposed protocol against various types of poisoning attacks. The outcomes of these experiments provided a basis for comparative analysis, allowing us to evaluate the efficacy of our proposed scheme compared with conventional approaches such as PrivKV and PrivKVM. This examination aims to elucidate the resilience and performance of the protocol under varying conditions, thereby providing valuable insights into the discourse on privacy-preserving data collection methodologies.
Our contributions are as follows.
  • Novel LDP Algorithm: We introduce a novel LDP algorithm that is designed to enhance robustness against specific types of poisoning attacks. This newly proposed algorithm leverages an iterative process involving Bayesian posterior probabilities, thereby improving the accuracy of the estimates while maintaining resilience against the impact of poisoning data.
  • Experimental Validation: The proposed protocol was empirically validated through a series of experiments employing both synthetic and publicly available open datasets. The experimental results demonstrate the robustness of the proposed algorithm. Comparative analyses reveal superior performance in terms of the estimation accuracy and resistance to poisoning attacks compared with the PrivKV protocol. These findings substantiate the efficacy and practical applicability of the proposed method in the context of privacy-preserving data collection frameworks.
The preliminary version of this work was presented at Modeling Decisions for Artificial Intelligence (MDAI) 2023 [9]. This paper adds three elements: a review and comparison of related works, experimental results and a cost analysis for the OT protocol, and experimental results on synthetic data.

2. Related Works

2.1. Privacy Preservation for Key–Value Data

Differential privacy (DP) was proposed by Dwork [10]. DP is a central model that theoretically guarantees the privacy of statistical information publications [11,12] and machine learning models [13,14]. Duchi et al. proposed LDP [15], which does not require trustworthiness of the collectors. Academic research has been conducted on LDP owing to its stronger privacy guarantees. Kairouz et al. [16] proposed an LDP protocol for discrete values using randomized response (RR) [17]. Erlingsson et al. proposed an LDP protocol known as RAPPOR [1] using a Bloom filter [18] and RR to improve the estimation accuracy. Duchi et al. proposed the LDP protocol stochastic rounding (SR) [19] and Nguyên et al. proposed the LDP protocol Harmony [20] for continuous values. Wang et al. [21] proposed the piecewise mechanism (PM) and hybrid mechanism to improve the estimation accuracy for continuous values. LDP protocols including PrivKV, PrivKVM [2], PCKV [3], and PrivKVM* [4] have been proposed for key–value data that consist of a combination of discrete and continuous values.
In addition, advanced methods have been proposed to improve the estimation accuracy. Ren et al. [22] and Fanti et al. [23] proposed a method that applies the EM algorithm [8] and LASSO regression [24] to the LDP protocol RAPPOR. Li et al. proposed the EM with a smoothing algorithm [25] that applies the EM algorithm to SR and PM.
The LDP protocol is vulnerable to poisoning attacks that manipulate the estimated values by intentionally modifying the protocol [26]. Poisoning attacks have been studied for several LDP protocols [5,6,27].
Cao et al. [5] evaluated the robustness of discrete-value LDP protocols against poisoning attacks and demonstrated that attackers could increase the estimated frequency of specific target items by injecting fake users into the system. Wu et al. [6] studied the robustness of LDP protocols for key–value data, such as PrivKV and PCKV, against poisoning attacks. Li et al. [27] studied targeted poisoning attacks on LDP protocols for estimating the mean and variance of numerical data.
Furthermore, prior works [28,29,30] studied poisoning attacks on LDP protocols for specific use cases. Wang et al. [28] investigated poisoning attacks on multidimensional preference data, including both implicit preference data, such as the item list bought by a user, and explicit preference data, such as ranking data over candidates. Imola et al. [29] studied poisoning attacks on LDP protocols for degree estimation in graphs. Sasada et al. [30] considered targeted poisoning attacks on LDP protocols for location data estimation.
Previous studies [31,32,33] investigated poisoning attacks against machine learning models employing DP or LDP. Data poisoning attacks on machine learning involve manipulating the training dataset so that the trained model changes to an attacker-desired one. Borgnia et al. [31] demonstrated that differentially private data augmentation can mitigate poisoning attacks. Ma et al. [32] investigated the impact of poisoning attacks on machine learning models trained using differentially private algorithms. Naseri et al. [33] investigated the robustness against poisoning attacks in federated learning using DP and LDP.
Several simple countermeasures against poisoning attacks on LDP have been proposed in existing studies. Wu et al. [6] developed one-class classification (OCC) [34] using an isolation forest [35] and a method based on an anomaly score for detecting fake users. Similarly, Li et al. [27] proposed a sampling-then-clustering (STC) method [36] using k-means clustering. Li et al. [37] proposed an optimization method for setting parameters that are robust against poisoning attacks on LDP.
A straightforward approach is to apply a standard randomized response mechanism, such as that in [17,38], to each coordinate, which incurs a large communication cost proportional to the size of the range of items. Some prior studies [39,40,41] assumed that each user holds exactly one item from a large range of items. However, this assumption is too strong to generalize the problem. Zhou et al. [41] combined the sparse scheme with a new random binning and achieved an optimal communication cost of $O(k \log k)$, where $k$ is the maximum number of non-zeros independent of the range size, and optimal accuracy.
To detect fake users, Wu et al. [6] proposed two methods: (1) OCC-based detection, where observations of multiple rounds for each user provide the feature vector used for outlier detection, which distinguishes between genuine and fake groups, and (2) anomaly score-based detection, where the anomalous behavior of sending the same key in multiple rounds is detected based on the frequencies of keys in multiple rounds for each user. They reported that these defense methods are effective when the number of targeted keys is small. Li et al. [27] proposed a clustering-based method in which multiple subsets of users are sampled and clustering algorithms such as k-means are used to detect attackers. However, these methods assume that each user sends data over multiple rounds, implying that real-time detection is not feasible.

2.2. Novelty of This Research

In this paper, we propose an estimation method using the EM algorithm for the perturbation method, similar to the LDP protocol PrivKV [2]. Some studies have leveraged attacker detection [6,27] and parameter optimization [37] to improve the robustness against poisoning attacks on LDP. However, to the best of our knowledge, no studies have evaluated the robustness of these estimation methods against poisoning attacks on LDP. We also propose defense methods using the OT protocol against poisoning attacks. OT is a proactive measure, whereas methodologies such as OCC [6] and STC [27] are reactive. This is the greatest difference between our study and the previous works. In our approach, random item selection in LDP is performed in collaboration between the user and server; hence, no malicious user sends fake data to manipulate the estimation. This is an ideal countermeasure under the assumption that the building blocks are secure. In contrast, OCC and STC are probabilistic measures that allow small fake data to be sent to the LDP. By combining EM and OT, we aim to improve the estimation accuracy and robustness to poisoning attacks.
Why do we apply the EM algorithm to fixed-length vectors containing missing values? Rather than estimating the key frequency and average value independently via maximum likelihood estimation (MLE), the EM algorithm estimates the marginal probabilities of the intermediate states from the outputs, yielding a more robust joint estimation. We therefore propose an application of the EM algorithm that is specific to PrivKV and aim to improve the estimation accuracy of the statistical values.
We also report the effects of the estimated value manipulation by poisoning attacks on the LDP protocol of the proposed method.

3. Local Differential Privacy

3.1. Fundamental Definition

LDP is a privacy-preserving technique in which individual users randomize their own data before submitting them to a service provider. This randomization process helps to protect the privacy of each user’s data while allowing accurate computation of the aggregate statistics. LDP ensures that sensitive information remains confidential at the local level, providing a decentralized solution for data privacy. LDP is defined as follows:
Definition 1. 
A randomized algorithm $Q$ satisfies $\epsilon$-LDP if, for all pairs of values $v$ and $v'$ in domain $V$, for all subsets $S$ of range $Z$ ($S \subseteq Z$), and for $\epsilon \geq 0$, the following inequality always holds:

$$\Pr[Q(v) \in S] \leq e^{\epsilon} \Pr[Q(v') \in S].$$
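To make Definition 1 concrete, the following minimal Python check (the function name rr_prob is ours, for illustration) numerically verifies that binary randomized response with reporting probability $e^{\epsilon}/(1+e^{\epsilon})$ satisfies the $\epsilon$-LDP inequality:

```python
import math

def rr_prob(output: int, v: int, eps: float) -> float:
    """Randomized response: report the true bit v w.p. e^eps / (1 + e^eps)."""
    p = math.exp(eps) / (1 + math.exp(eps))
    return p if output == v else 1 - p

# epsilon-LDP check: for every output z and every input pair (v, v'),
# Pr[Q(v) = z] <= e^eps * Pr[Q(v') = z] must hold.
eps = 1.0
for z in (0, 1):
    for v in (0, 1):
        for v_prime in (0, 1):
            assert rr_prob(z, v, eps) <= math.exp(eps) * rr_prob(z, v_prime, eps) + 1e-12
```

Here the worst-case ratio is exactly $e^{\epsilon}$, which is why randomized response is a tight instantiation of the definition.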

3.2. PrivKV

PrivKV takes input data in key–value form, a two-dimensional data structure of discrete ("key") and continuous ("value") variables. For example, the key–value set $\{\langle \mathrm{YouTube}, 2 \rangle, \langle \mathrm{Twitter}, 1 \rangle, \langle \mathrm{Instagram}, 0.5 \rangle\}$ represents the pairs of apps and the time spent on each app. PrivKV estimates the frequency and mean value of each key. The PrivKV approach combines two LDP protocols: RR [16] for randomizing keys and the value perturbation protocol (VPP) [20] for perturbing values. Although the number of dimensions is restricted to two, the key–value pair is a primitive data structure that is commonly used in many applications. PrivKV consists of three steps for frequency and mean estimation: sampling, perturbation, and estimation.

3.2.1. Sampling

Suppose that there are $n$ users, and let $S_i$ be the set of key–value tuples $\langle k, v \rangle$ possessed by the $i$-th user. In PrivKV, the set of tuples is transformed into a $d$-dimensional vector, where $d$ represents the cardinality of the key domain $K$ and a missing key is denoted as $\langle k, v \rangle = \langle 0, 0 \rangle$. For instance, the set of key–values $S_i = \{\langle k_1, v_1 \rangle, \langle k_4, v_4 \rangle, \langle k_5, v_5 \rangle\}$ is converted into the $d = 5$ dimensional vector $S_i = \langle \langle 1, v_1 \rangle, \langle 0, 0 \rangle, \langle 0, 0 \rangle, \langle 1, v_4 \rangle, \langle 1, v_5 \rangle \rangle$, in which keys $k_1$, $k_4$, and $k_5$ are indicated by a 1 in the corresponding positions. PrivKV employs 1-out-of-$d$ random sampling to select one element $\langle k_a, v_a \rangle$ from the $d$-dimensional vector $S_i$ of key–value data.
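A minimal Python sketch of this encoding and the 1-out-of-$d$ sampling step (the helper names encode and sample_one are ours, not from the PrivKV paper):

```python
import random

def encode(user_kv: dict, d: int) -> list:
    """Encode a user's key-value set as a d-dimensional vector of
    <k, v> tuples; missing keys become <0, 0>."""
    return [(1, user_kv[a]) if a in user_kv else (0, 0.0) for a in range(d)]

def sample_one(vector: list):
    """1-out-of-d random sampling: pick one index a and its tuple."""
    a = random.randrange(len(vector))
    return a, vector[a]

# Example: d = 5, the user holds keys k1, k4, k5 (indices 0, 3, 4).
S_i = encode({0: 0.8, 3: -0.2, 4: 0.5}, d=5)
a, (k, v) = sample_one(S_i)   # e.g., a = 3 with (k, v) = (1, -0.2)
```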

3.2.2. Perturbation

This process consists of two steps: perturbing the values and perturbing the keys. It uses the VPP from Harmony [20] for the selected tuple. The value $v_a$ in the key–value pair is discretized as

$$v'_a = \begin{cases} 1 & \text{w.p. } \frac{1+v_a}{2}, \\ -1 & \text{w.p. } \frac{1-v_a}{2}. \end{cases} \quad (1)$$

The discretized value $v'_a$ of the tuple $\langle 1, v'_a \rangle$ is perturbed to yield $v^+_a = VPP(v'_a, \epsilon_2)$, which is defined as

$$v^+_a = \begin{cases} v'_a & \text{w.p. } p_2 = \frac{e^{\epsilon_2}}{1+e^{\epsilon_2}}, \\ -v'_a & \text{w.p. } q_2 = \frac{1}{1+e^{\epsilon_2}}, \end{cases} \quad (2)$$

where $\epsilon_2$ is the privacy budget for values. Based on Equations (1) and (2), the value of a "missing" tuple $\langle 0, 0 \rangle$ is replaced by $v^+_a = VPP(v_a, \epsilon_2)$, where $v_a$ is selected uniformly from $[-1, 1]$.
A key is perturbed by the RR scheme [16] as

$$\langle k^*_a, v^+_a \rangle = \begin{cases} \langle 1, v^+_a \rangle & \text{w.p. } p_1 = \frac{e^{\epsilon_1}}{1+e^{\epsilon_1}}, \\ \langle 0, 0 \rangle & \text{w.p. } q_1 = \frac{1}{1+e^{\epsilon_1}}, \end{cases} \quad (3)$$

where $v^+_a$ is perturbed as described above. A "missing" tuple $\langle 0, 0 \rangle$ is randomized as

$$\langle k^*_a, v^+_a \rangle = \begin{cases} \langle 0, 0 \rangle & \text{w.p. } p_1 = \frac{e^{\epsilon_1}}{1+e^{\epsilon_1}}, \\ \langle 1, v^+_a \rangle & \text{w.p. } q_1 = \frac{1}{1+e^{\epsilon_1}}. \end{cases} \quad (4)$$
Each user submits the perturbed tuple $\langle k^*_a, v^+_a \rangle$ together with the index $a$ of the tuple.
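The two perturbation steps can be rendered in Python as follows (a simplified sketch of Equations (1)–(4); the function names are ours and input validation is omitted):

```python
import math
import random

def vpp(v: float, eps2: float) -> int:
    """VPP: discretize v in [-1, 1] to +/-1 per Eq. (1), then keep it
    w.p. p2 = e^eps2 / (1 + e^eps2) and flip it otherwise, per Eq. (2)."""
    v_disc = 1 if random.random() < (1 + v) / 2 else -1
    p2 = math.exp(eps2) / (1 + math.exp(eps2))
    return v_disc if random.random() < p2 else -v_disc

def perturb_tuple(k: int, v: float, eps1: float, eps2: float):
    """Perturb one tuple <k, v>: a missing tuple <0, 0> first receives a
    uniform value from [-1, 1]; the key then follows RR, per Eqs. (3)-(4)."""
    if k == 0:
        v = random.uniform(-1, 1)
    v_plus = vpp(v, eps2)
    p1 = math.exp(eps1) / (1 + math.exp(eps1))
    if random.random() < p1:               # report truthfully w.p. p1
        return (1, v_plus) if k == 1 else (0, 0)
    else:                                  # flip the key w.p. q1 = 1 - p1
        return (0, 0) if k == 1 else (1, v_plus)
```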

3.2.3. Estimating

Let $f_i$ be the true frequency of key $k_i$, and let $f'_i$ be the observed frequency of $k^*_i = 1$ among the perturbed vectors. The frequency estimation using MLE is

$$\hat{f}_i = \frac{n(p_1 - 1) + f'_i}{2p_1 - 1}, \quad (5)$$

where

$$p_1 = \frac{e^{\epsilon_1}}{1+e^{\epsilon_1}}. \quad (6)$$

Let $m_i$ be the true mean of key $k_i$, and let $n_i(1)$ and $n_i(-1)$ be the counts of $v_i = 1$ and $v_i = -1$ in the perturbed vectors. Their calibrated estimates $\hat{n}_i(1)$ and $\hat{n}_i(-1)$ are obtained in the same manner as Equation (5):

$$\hat{n}_i(1) = \frac{n(p_2 - 1) + n_i(1)}{2p_2 - 1}, \quad (7)$$
$$\hat{n}_i(-1) = \frac{n(p_2 - 1) + n_i(-1)}{2p_2 - 1}, \quad (8)$$

where

$$p_2 = \frac{e^{\epsilon_2}}{1+e^{\epsilon_2}}. \quad (9)$$

Therefore, the mean estimation using MLE is

$$\hat{m}_i = \frac{\hat{n}_i(1) - \hat{n}_i(-1)}{n}. \quad (10)$$
According to the composition theorem of DP [10], the sequential composition of randomization algorithms with privacy budgets $\epsilon_1$ (for keys) and $\epsilon_2$ (for values) satisfies $(\epsilon_1 + \epsilon_2, 0)$-DP.
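A simplified sketch of the MLE estimators in Equations (5)–(10) (the helper name is ours; obs is the list of perturbed tuples reported for key $k_i$, and the calibration count n is used exactly as in the equations above):

```python
import math

def estimate_key(obs: list, n: int, eps1: float, eps2: float):
    """MLE frequency and mean estimates for one key from its perturbed
    tuples obs = [(k*, v+), ...], following Eqs. (5)-(10)."""
    p1 = math.exp(eps1) / (1 + math.exp(eps1))
    p2 = math.exp(eps2) / (1 + math.exp(eps2))
    f_obs = sum(1 for k, _ in obs if k == 1)
    f_hat = (n * (p1 - 1) + f_obs) / (2 * p1 - 1)        # Eq. (5)
    n_pos = sum(1 for k, v in obs if k == 1 and v == 1)
    n_neg = sum(1 for k, v in obs if k == 1 and v == -1)
    n_pos_hat = (n * (p2 - 1) + n_pos) / (2 * p2 - 1)    # Eq. (7)
    n_neg_hat = (n * (p2 - 1) + n_neg) / (2 * p2 - 1)    # Eq. (8)
    m_hat = (n_pos_hat - n_neg_hat) / n                  # Eq. (10)
    return f_hat, m_hat
```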

3.2.4. PrivKVM

The weakness of PrivKV is its accuracy. Because it samples only one tuple out of $d$ candidates, the estimated mean tends toward 0 as the domain size $d$ increases. To improve the performance, Ye et al. proposed an iterative refinement of PrivKV known as PrivKVM [2].
Instead of assigning a uniform random value $v_a$ to a "missing" tuple $\langle 0, 0 \rangle$, PrivKVM uses the first-round estimates in the second iteration. The estimated frequencies are updated by repeating the iteration a predetermined number of times. The privacy budgets are assigned as $\epsilon_{11} = \epsilon_1$, $\epsilon_{12} = \epsilon_{13} = \cdots = \epsilon_{1c} = 0$ and $\epsilon_{21} = \epsilon_{22} = \cdots = \epsilon_{2c} = \epsilon_2 / c$, where $c$ is the number of iterations.

3.3. Poisoning Attack

A poisoning attack in LDP [5,6,27] involves an attacker injecting fake users into the system and manipulating the analysis results of the server by having the fake users send carefully crafted data.
Figure 1 shows the framework of a poisoning attack. We assume that attackers can inject $m$ fake users into the system and that they have access to public information about the targeted LDP scheme, such as the privacy budget $\epsilon$ and the perturbation procedures. Alongside the set of $n$ genuine users, the server thus estimates the frequency and mean over $n + m$ users. We assume that the attacker targets $r$ keys out of $d$ and intentionally manipulates the estimated frequencies and mean values of this target set.
Wu et al. [6] proposed three types of poisoning attacks, listed below (a minimal simulation sketch follows the list):
  • Maximum gain attack (M2GA): All fake users craft optimal fake outputs to maximize the frequency and mean gains. That is, each fake user selects a targeted key $k$ (randomly from the $r$ targeted keys) and sends $\langle 1, 1 \rangle$ to the server.
  • Random message attack (RMA): Each fake user uniformly selects a message from the output domain and sends $\langle 0, 0 \rangle$, $\langle 1, 1 \rangle$, or $\langle 1, -1 \rangle$ with probabilities 1/2, 1/4, and 1/4, respectively.
  • Random key–value pair attack (RKVA): Each fake user randomly selects a key $k$ from the target key set, pairs it with the value 1, and perturbs $\langle 1, 1 \rangle$ according to the protocol.
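In code, the three fake-user strategies can be sketched as follows (a hypothetical simulation that reuses the perturb_tuple sketch from Section 3.2.2; it is not part of the original attack implementations):

```python
import random

def fake_report(attack: str, target_keys: list, d: int,
                eps1: float = 1.0, eps2: float = 1.0):
    """Craft one fake user's report (index a, tuple <k*, v+>)."""
    if attack == "M2GA":      # optimal output: always hit a target key with <1, 1>
        return random.choice(target_keys), (1, 1)
    if attack == "RMA":       # uniform message over the valid output domain
        a = random.randrange(d)
        msg = random.choices([(0, 0), (1, 1), (1, -1)], weights=[2, 1, 1])[0]
        return a, msg
    if attack == "RKVA":      # random target key with value 1, honestly perturbed
        k = random.choice(target_keys)
        return k, perturb_tuple(1, 1.0, eps1, eps2)
    raise ValueError(attack)
```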

4. Proposed Algorithm

4.1. Concept

To prevent poisoning attacks with fake key–value data, we propose two defense methods: perturbation using oblivious transfer (OT; refer to Section 4.2) and EM-based estimation of the frequency and mean (refer to Section 4.3).
Figure 2 shows the framework of the proposed protocol, which is built on the OT protocol. First, attempts to increase the frequency of target keys by deliberately choosing which key to report are thwarted, because the random sampling is conducted by the server rather than by the (possibly fake) user. Even though the server selects the key, it learns nothing about the user's key–value data. Note that the privacy budget ($\epsilon_1$ and $\epsilon_2$) is used solely for the perturbation of keys and values; the sampling itself is secured by the cryptographic protocol (OT).
Next, we consider why the estimation itself may be vulnerable to poisoning attacks. The MLE used in PrivKV has low estimation accuracy for biased distributions because it is based on a single observed frequency per key; consequently, it is vulnerable when the number of targeted keys is small. We resolve this limitation using the EM algorithm, which iteratively estimates posterior probabilities and can improve the accuracy when the number of users $n$ is large and ample observation data are available, producing estimates that are consistent with all observed values.
Table 1 summarizes the approach for each step in PrivKV, namely sampling, perturbation, and estimation.

4.2. Oblivious Transfer

OT is a cryptographic protocol in which a sender transfers one of multiple pieces of information to a receiver without knowing which piece is sent. Naor and Pinkas [7] introduced a 1-out-of-N OT protocol using a 1-out-of-2 OT as the building block, as shown in Algorithm 1.
Our objective is to thwart M2GA attacks, in which fake users deliberately select target keys (or sets of keys) to increase the frequency and mean value of those specific keys. To achieve this, we replace the 1-out-of-$d$ random sampling of PrivKV with a 1-out-of-$d$ OT protocol between the user (A in OT), who holds the $d$ key–value pairs, and the server (B), which selects one element $\langle k_a, v_a \rangle$. However, the server cannot perform the subsequent perturbation steps, as it must remain unaware of whether the user has key $k_a$ and of the private value $v_a \in [-1, 1]$. Therefore, we rearrange the steps such that users perturb the keys and values before the server randomly selects a key–value pair via OT.
Algorithm 2 outlines the proposed perturbation process using the OT protocol for sampling. The perturbed key–value pairs are used to estimate the frequency and mean of each key. With this reordering, users must perturb the key–value pairs for all $d$ keys, which increases the user-side computational cost by a factor of $d$. However, we consider this increase negligible because the perturbation is lightweight compared with the cryptographic cost of the 1-out-of-$d$ OT. The resulting algorithm is robust against poisoning attacks.
Algorithm 1 1-out-of-N OT [7]
  Suppose A has $N$ messages $m_0, \ldots, m_{N-1} \in \{0, 1\}^n$, where $N = 2^{\ell}$.
  • A generates $\ell$ secret key pairs $(K_1^0, K_1^1), \ldots, (K_{\ell}^0, K_{\ell}^1)$.
  • A sends to B the ciphertexts $C_0, \ldots, C_{N-1}$, where $C_I = m_I \oplus F_{K_1^{I_1}}(I) \oplus \cdots \oplus F_{K_{\ell}^{I_{\ell}}}(I)$, $I$ is the $\ell$-bit string $I_1 \cdots I_{\ell} \in \{0, 1\}^{\ell}$, and $F_K$ is a pseudo-random function.
  • A and B perform $\ell$ 1-out-of-2 OTs on $(K_i^0, K_i^1)$ so that B learns $K_1^{t_1}, \ldots, K_{\ell}^{t_{\ell}}$, where $t = t_1 \cdots t_{\ell}$, $t_i \in \{0, 1\}$, is the index of the message that B chooses from the $N$ messages.
  • B decrypts $C_t$ using $K_1^{t_1}, \ldots, K_{\ell}^{t_{\ell}}$ to obtain $m_t$.
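The XOR-of-PRFs composition in Algorithm 1 can be sketched as follows, with the $\ell$ 1-out-of-2 OTs abstracted away (we simply hand B its chosen keys) and SHA-256 standing in for the pseudo-random function $F$; this illustrates the message flow only and is not a vetted, secure implementation:

```python
import hashlib
import secrets

def prf(key: bytes, i: int, n: int) -> bytes:
    """Stand-in PRF F_K(I): SHA-256 truncated to n bytes (illustration only)."""
    return hashlib.sha256(key + i.to_bytes(4, "big")).digest()[:n]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def sender_encrypt(msgs, ell):
    """A's side: generate ell key pairs and the N = 2^ell ciphertexts C_I."""
    n = len(msgs[0])
    keys = [(secrets.token_bytes(16), secrets.token_bytes(16)) for _ in range(ell)]
    cts = []
    for I, m in enumerate(msgs):
        c = m
        for j in range(ell):
            bit = (I >> (ell - 1 - j)) & 1       # bit I_j of index I
            c = xor(c, prf(keys[j][bit], I, n))  # mask with F_{K_j^{I_j}}(I)
        cts.append(c)
    return keys, cts

def receiver_decrypt(t, chosen_keys, cts, ell):
    """B's side: after the ell 1-out-of-2 OTs deliver K_j^{t_j}, peel the masks off C_t."""
    n = len(cts[t])
    m = cts[t]
    for j in range(ell):
        m = xor(m, prf(chosen_keys[j], t, n))
    return m

# Demo with d = 8 messages (ell = 3); B secretly chooses index t = 5.
msgs = [f"kv{i}".encode().ljust(8) for i in range(8)]
keys, cts = sender_encrypt(msgs, ell=3)
t = 5
chosen = [keys[j][(t >> (3 - 1 - j)) & 1] for j in range(3)]  # via 1-out-of-2 OTs
assert receiver_decrypt(t, chosen, cts, ell=3) == msgs[t]
```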
Proposition 1. 
An M2GA poisoning attack against the PrivKV scheme with 1-out-of-d OT for sampling key–value pairs yields frequency and mean gains that are at most as large as those of an RMA poisoning attack.
Proof of Proposition 1. 
With an OT protocol, the fake users in an M2GA attack are unable to intentionally select the targeted keys. They may craft an arbitrary value, but the server can detect any pair other than the valid perturbed pairs $\langle 0, 0 \rangle$, $\langle 1, 1 \rangle$, and $\langle 1, -1 \rangle$. Hence, the best the fake users can do is submit the valid perturbed pairs in arbitrary fractions, which is equivalent to an RMA attack. Therefore, the frequency and mean gains will be less than or equal to those of an RMA attack.    □
Algorithm 2 Perturbation of key–value pairs with OT
  • $S_1, \ldots, S_n$ ← key–value data of the $n$ users.
  • for all $u \in \{1, \ldots, n\}$ do: perturb all $\langle k_a, v_a \rangle \in S_u$:
  •     $v_a^+ \leftarrow VPP(v_a, \epsilon_2)$ and $k_a^* \leftarrow RR(k_a, \epsilon_1)$.
  •     User $u$, holding $\langle v_1^+, k_1^* \rangle, \ldots, \langle v_d^+, k_d^* \rangle$, performs a 1-out-of-$d$ OT with the server.
  • end for; return: the server holds $n$ perturbed key–value pairs.

4.3. EM Estimation for Key–Value Data

The EM algorithm operates through iterative steps, in which the posterior probabilities are refined using Bayes’ theorem [8]. We propose employing the EM algorithm to estimate the frequency and mean values from key–value data that are perturbed in PrivKV.
Algorithm 3 outlines the overall procedure of the proposed EM algorithm, which estimates the frequencies and means of the key–value data. Given the $n$ perturbed values $z_1, \ldots, z_n$, we iterate the estimation of the posterior probabilities for $x_1, \ldots, x_d$ as $\Theta^{(t)} = (\theta_1^{(t)}, \theta_2^{(t)}, \ldots, \theta_d^{(t)})$ until convergence.
Algorithm 3 EM algorithm for PrivKV
  • $\langle v^+, k^* \rangle$ ← the perturbed key–value pairs of the $n$ users.
  • $\Theta^{(0)}$ ← a uniform probability over $X = \{\langle 1, 1 \rangle, \langle 1, -1 \rangle, \langle 0, 1 \rangle, \langle 0, -1 \rangle\}$.
  • $t \leftarrow 1$
  • repeat
  •     (E-step) Estimate the posterior probabilities $\hat{\theta}_{u,i}^{(t)} \leftarrow \Pr[x_i \mid z_u] = \frac{\Pr[z_u \mid x_i]\,\theta_i^{(t-1)}}{\sum_{s=1}^{|X|} \Pr[z_u \mid x_s]\,\theta_s^{(t-1)}}$.
  •     (M-step) Update the marginal probabilities $\theta^{(t)} \leftarrow \frac{1}{n} \sum_{u=1}^{n} \hat{\theta}_u^{(t)}$.
  • until $|\theta_i^{(t+1)} - \theta_i^{(t)}| \leq \eta$
  • for all $a \in K$ do estimate
  •     $\hat{f}_a \leftarrow n\,(\theta_{\langle 1,1 \rangle}^{(t)} + \theta_{\langle 1,-1 \rangle}^{(t)})$ and $\hat{m}_a \leftarrow \frac{\theta_{\langle 1,1 \rangle}^{(t)} - \theta_{\langle 1,-1 \rangle}^{(t)}}{\theta_{\langle 1,1 \rangle}^{(t)} + \theta_{\langle 1,-1 \rangle}^{(t)}}$
  • end for; return $\langle \hat{f}_1, \hat{m}_1 \rangle, \ldots, \langle \hat{f}_d, \hat{m}_d \rangle$
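A vectorized sketch of Algorithm 3 for a single key (NumPy; the channel matrix $\Pr[z \mid x]$ below is our reading of the perturbation in Equations (1)–(4), and observations are encoded as column indices 0–2 for $\langle 1, 1 \rangle$, $\langle 1, -1 \rangle$, and $\langle 0, 0 \rangle$):

```python
import numpy as np

def em_privkv(reports, eps1, eps2, eta=1e-6, max_iter=1000):
    """EM estimation for one key. Hidden states X = [<1,1>, <1,-1>, <0,1>, <0,-1>];
    reports is an array of observation indices (0: <1,1>, 1: <1,-1>, 2: <0,0>)."""
    p1 = np.exp(eps1) / (1 + np.exp(eps1))
    p2 = np.exp(eps2) / (1 + np.exp(eps2))
    # P[x, z] = Pr[z | x]: key kept w.p. p1 (flipped otherwise), value kept w.p. p2.
    P = np.array([
        [p1 * p2,             p1 * (1 - p2),       1 - p1],   # x = <1,  1>
        [p1 * (1 - p2),       p1 * p2,             1 - p1],   # x = <1, -1>
        [(1 - p1) * p2,       (1 - p1) * (1 - p2), p1],       # x = <0,  1>
        [(1 - p1) * (1 - p2), (1 - p1) * p2,       p1],       # x = <0, -1>
    ])
    z = np.asarray(reports)
    theta = np.full(4, 0.25)                    # uniform initialization Theta(0)
    for _ in range(max_iter):
        post = P[:, z].T * theta                # E-step: Pr[x | z_u], shape (n, 4)
        post /= post.sum(axis=1, keepdims=True)
        theta_new = post.mean(axis=0)           # M-step: average posterior
        if np.abs(theta_new - theta).max() < eta:
            theta = theta_new
            break
        theta = theta_new
    n = len(z)
    f_hat = n * (theta[0] + theta[1])
    m_hat = (theta[0] - theta[1]) / (theta[0] + theta[1])
    return f_hat, m_hat
```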

5. Evaluation

5.1. Data

In this study, experiments were conducted using synthetic data and two open datasets. An overview of each dataset is presented in Table 2. For the synthetic data, we generated the data such that the number of keys held by each user and the value of each key followed a Gaussian distribution ($\mu = 0$, $\sigma = 10$). The MovieLens dataset [42] and the Clothing Fit dataset [43] were used in addition to the synthetic data.

5.2. Metrics

5.2.1. Accuracy Metrics

Given a dataset of key–value pairs provided by $n$ users, we employ emPrivKV, PrivKV, and PrivKVM ($c = 3$) to estimate the frequency $\hat{f}_i$ and mean value $\hat{m}_i$ for each key $k_i$. The mean square errors (MSEs) of these estimates are defined as

$$MSE_f = \frac{1}{|K|} \sum_{i=1}^{|K|} (\hat{f}_i - f_i)^2, \quad (11)$$
$$MSE_m = \frac{1}{|K|} \sum_{i=1}^{|K|} (\hat{m}_i - m_i)^2, \quad (12)$$

where $f_i$ and $m_i$ represent the actual frequency and mean of key $k_i$. Each estimation was repeated 10 times to assess the accuracy.

5.2.2. Robustness Metrics

The robustness of the estimation algorithm is determined by its ability to withstand poisoning attacks, that is, the extent to which a poisoning attempt fails to alter the estimation outcomes. We measured this robustness using the "frequency gain", defined as the cumulative difference between the poisoned and unpoisoned frequency estimates over the targeted keys. Formally, the frequency gain is expressed as

$$G_f(Y) = \sum_{k \in T} \mathbb{E}[\Delta \hat{f}_k], \quad (13)$$

where $\Delta \hat{f}_k = \tilde{f}_k - \hat{f}_k$ is the distance between the estimates, and $\tilde{f}_k$ is the estimated frequency when key $k$ is targeted by a poisoning attack.
Similarly, the "mean gain" is calculated as the aggregate difference between the poisoned and unpoisoned mean estimates:

$$G_m(Y) = \sum_{k \in T} \mathbb{E}[\Delta \hat{m}_k]. \quad (14)$$

Here, $\Delta \hat{m}_k = \tilde{m}_k - \hat{m}_k$, with $\tilde{m}_k$ representing the estimated mean value when key $k$ is subjected to a poisoning attack.
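Both metric families reduce to a few lines of code (a sketch with our own helper names):

```python
import numpy as np

def mse(estimates, truths):
    """MSE over the key domain K, per Eqs. (11)-(12)."""
    return float(np.mean((np.asarray(estimates) - np.asarray(truths)) ** 2))

def gain(poisoned, clean, targets):
    """Frequency or mean gain: the summed shift of the poisoned estimates
    over the targeted keys T, per Eqs. (13)-(14)."""
    return float(sum(poisoned[k] - clean[k] for k in targets))
```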

5.3. Experimental Results

5.3.1. MSE with Respect to ϵ

Figure 3a–c show the estimation accuracy with respect to the privacy budget $\epsilon$, as MSE distributions of the estimated frequencies for the synthetic data and the open datasets MovieLens and Clothing, respectively. Note that the MSE values of emPrivKV were the minimum for all datasets and all $\epsilon$. At $\epsilon = 0.1$, the MSE values of the estimated frequencies for the synthetic data, MovieLens, and Clothing were 44.1%, 0.5%, and 0.5% of those of PrivKV, respectively.
Similarly, the MSEs of the mean estimation are shown in Figure 4a–c. At $\epsilon = 0.1$, the MSE values of the estimated means for the synthetic data, MovieLens, and Clothing were 24.8%, 0.2%, and 0.2% of those of PrivKV, respectively.

5.3.2. Frequency Gain

Figure 5 shows the frequency gain distributions concerning the fraction of malicious users b, privacy budget ϵ , and number of target keys r for the three types of poisoning attacks (M2GA, RMA, and RKVA) using synthetic data (Gaussian distribution).
Notably, M2GA (Figure 5a) yielded the highest gains across all poisoning schemes. This outcome is unsurprising given that M2GA assumes the most potent control over the output by malicious users, thus posing the most significant threat to LDP schemes.
The emPrivKV results consistently exhibited the lowest gains for all poisoning attack types and parameter settings of b, ϵ , and r. As the proportion of malicious users b escalated, the gains for PrivKV increased accordingly (as shown in Figure 5a). In contrast, the gain of emPrivKV remained stable at 0.5 . For instance, at b = 0.25 , emPrivKV achieved 70.25 % of the gain of PrivKV, highlighting its superior resilience against the most severe poisoning attack (M2GA).
Figure 6 shows the frequency gains for the MovieLens dataset. The gain distributions were similar to those obtained using the Gaussian synthetic data, except for the effect of the fraction of malicious users parameter b (see Figure 6a,c). The gain did not depend on b for M2GA (Figure 6a) and was unstable for RKVA (Figure 6c).
The MovieLens data yielded greater gains than the synthetic data (by a factor of 2–5), primarily because of the different distributions of keys and the presence of numerous low-frequency keys (e.g., minor movie titles with minimal audiences). These low-frequency keys are more susceptible to low-resource poisoning attacks. Moreover, with an equivalent number of malicious users, the manipulated keys were already saturated in the MovieLens dataset, resulting in greater gains compared to the synthetic data scenario.

5.3.3. Mean Gain

Figure 7 shows the mean gains for the synthetic data. emPrivKV always had a smaller gain than PrivKV and PrivKVM. For example, the gain of emPrivKV at b = 0.2 was stable at around 1.0, which was 1/3 of that of PrivKV and 1/10 of that of PrivKVM. We observed similar results for the three LDP schemes on the MovieLens dataset (Figure 8a). Here, PrivKVM was the most vulnerable to poisoning attacks.
Furthermore, emPrivKV exhibited the smallest gain concerning the privacy budget ϵ , as depicted in Figure 8a. While the mean gains increased for PrivKV and PrivKVM with decreasing ϵ , emPrivKV maintained low gains, indicating minimal susceptibility to poisoning attacks. This underscores the robustness of emPrivKV.
Moreover, the gain exhibited a linear increase with the number of targeted keys r, as shown in Figure 8a,c. Notably, emPrivKV exhibited the lowest coefficient among all of the LDP schemes.
Interestingly, the LDP schemes did not exhibit significant differences in terms of RMA poisoning. Figure 8b shows that the differences in the gain escalated with an increase in the fraction of malicious users b.

5.3.4. Gains with OT

Figure 9 and Figure 10 show the frequency and mean gains when the OT protocol is used. Because M2GA is reduced to RMA under OT (Proposition 1), the gains from poisoning attacks on the proposed protocol can be bounded by the maximum of the RMA and RKVA gains. We summarize the frequency and mean gains in Table 3, Table 4 and Table 5. In our experiments on the MovieLens dataset with a fake-to-genuine user ratio of 1 to 10, the proposed emPrivKV reduced the frequency and mean gains to 17.1% and 25.9% of those of PrivKV, respectively (Table 3). For privacy budgets $\epsilon$ from 0.1 to 5, emPrivKV reduced the frequency and mean gains to as little as 0.6% and 1.6% of those of PrivKV, respectively, at $\epsilon = 0.1$ (Table 4). For 1 to 15 target keys, emPrivKV reduced the frequency and mean gains to 0.7% and 20.8% of those of PrivKV, respectively, with one target key (Table 5).
We conclude that our approach is effective for data that are perturbed using the LDP algorithm.

5.3.5. Costs of the OT Protocol

We estimated the overhead of the OT protocol under standard assumptions.
Calculation cost: With $d$ keys, $2\ell$ encryption-key operations are performed, where $d = 2^{\ell}$. If processing one encryption key costs 0.1 s, the computational cost is $0.1 \cdot 2\ell$ s. The computational cost with respect to the number of keys is shown in Figure 11a.
Communication cost: Assuming that one ciphertext takes $N$ bits, the user sends the server $2\ell N$ bits, the server returns $\ell N$ bits to the user, and finally the user sends $2\ell N$ bits of ciphertext to the server. The total communication cost is therefore $5\ell N$ bits. With $N = 2048$ bits and a communication speed of 100 Mbps, the communication takes $5\ell \cdot 2048 / (100 \cdot 10^6)$ s. The communication costs are shown in Figure 11b.
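Under these assumptions, the back-of-envelope overhead can be computed as follows (a sketch; the constants 0.1 s, 2048 bits, and 100 Mbps are the assumptions stated above, and the $2\ell$ and $5\ell N$ formulas follow our reconstruction):

```python
import math

def ot_costs(d, key_op_s=0.1, N_bits=2048, bandwidth_bps=100e6):
    """Estimated OT overhead for d = 2^ell keys: 2*ell key operations
    and 5*ell*N bits of traffic."""
    ell = math.ceil(math.log2(d))
    compute_s = key_op_s * 2 * ell
    comm_s = 5 * ell * N_bits / bandwidth_bps
    return compute_s, comm_s

# MovieLens-sized key domain (d = 10,677): about 2.8 s of key operations
# and roughly 1.4 ms of communication.
print(ot_costs(10677))
```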

5.4. Discussion

A comparison of previous studies relating to LDP with the proposed method is presented in Table 6. We show the robustness against poisoning attacks in terms of the frequency and mean gains. The frequency gains of PrivKV and PrivKVM without defensive measures were both 2.5, and their mean gains were 10.5 and 30.2, respectively. In contrast, the frequency and mean gains of the proposed method, which applies OT for defense and EM for estimation, were 0.4 and 2.9, respectively.
In this study, we did not evaluate the frequency and mean gains with the defense method applying OCC. In our approach, random item selection in LDP is performed in collaboration between the user and server; hence, no malicious user sends fake data to manipulate the estimation. This is an ideal countermeasure under the assumption that the building blocks are secure. On the other hand, OCC is a probabilistic measure that allows small amounts of fake data to be sent in LDP. Therefore, we consider the proposed protocol to be more resilient than OCC.
The experimental findings highlight the superior robustness of the emPrivKV scheme compared with other LDP schemes. Several factors contribute to this observation.
First, PrivKV relies on MLE, where the single highest frequency determines the expected value of the perturbation. Consequently, the manipulation of the highest frequency can significantly impact the scheme. In contrast, the EM algorithm iteratively adjusts the probabilities based on all observed frequencies. Therefore, even if the highest frequency is manipulated, the involvement of other elements helps to mitigate the manipulation of frequency.
Second, our estimation of the mean value incorporates not only the positive statistics ($v_k = 1$) but also the negative and missing statistics ($v_k = -1$ and 0). This renders the estimation more resilient to poisoning attacks, which explains why emPrivKV exhibits smaller mean gains.
Finally, by extrapolating our experimental gain results, we can gauge the overall robustness of the proposed protocol. According to Proposition 1, the M2GA becomes irrelevant when a perturbation with the OT protocol is employed.

6. Conclusions

We have studied the privacy preservation of key–value data using the LDP algorithm PrivKV. Our proposed emPrivKV scheme uses the OT protocol to prevent the intentional sampling of target keys and uses the EM algorithm for estimation. This improves the accuracy of statistical information estimated from randomized data. In our experiments using the MovieLens dataset, the MSEs of the estimated frequencies and mean values were reduced to 0.5% and 0.2% of those of PrivKV, respectively, at $\epsilon = 0.1$. In addition, the estimated frequencies and means of the keys were robust against fake-data poisoning attacks: on the MovieLens dataset, emPrivKV reduced the frequency and mean gains to 17.1% and 25.9% of those of PrivKV, respectively, when the number of fake users was 0.1 times that of genuine users.
The main future challenges raised in this paper can be summarized as follows. First, the robustness against poisoning attacks on machine learning models depends on the dataset [44]; we plan to conduct comprehensive experiments using multiple datasets to evaluate the robustness. Second, the computational resources of users are generally limited [45], and it is necessary to explore effective methods against poisoning attacks with less computational overhead. Finally, it is challenging to set an appropriate level of protection for sensitive data [46]. In our results, as data are collected more privately, our protocol tends to become more vulnerable to poisoning attacks. From the perspective of both privacy and security against poisoning attacks, it is essential to determine the proper level of protection.

Author Contributions

Conceptualization, H.H. and H.K.; Methodology, H.H., H.K. and C.-M.Y.; Software, H.H.; Investigation, H.H.; Resources, H.K. and M.F.; Data curation, H.H.; Writing—original draft, H.H. and H.K.; Writing—review & editing, H.K., M.F. and C.-M.Y.; Visualization, H.H.; Supervision, H.K. and M.F.; Project administration, H.H. and M.F.; Funding acquisition, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

Part of this work was supported by JSPS KAKENHI Grant Number JP18H04099, JP23K11110 and JST, CREST Grant Number JPMJCR21M1, Japan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at https://grouplens.org/datasets/movielens/ and https://www.kaggle.com/datasets/rmisra/clothing-fit-dataset-for-size-recommendation/.

Conflicts of Interest

Authors Hikaru Horigome and Masahiro Fujita were employed by the company Mitsubishi Electric Corporation. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Erlingsson, Ú.; Pihur, V.; Korolova, A. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, Scottsdale, AZ, USA, 3–7 November 2014; pp. 1054–1067. [Google Scholar]
  2. Ye, Q.; Hu, H.; Meng, X.; Zheng, H. PrivKV: Key–value Data Collection with Local Differential Privacy. IEEE Secur. Priv. 2019, 5, 294–308. [Google Scholar]
  3. Gu, X.; Li, M.; Cheng, Y.; Xiong, L.; Cao, Y. PCKV: Locally Differentially Private Correlated key–value Data Collection with Optimized Utility. In Proceedings of the 29th USENIX Security Symposium, Virtual Event, 12–14 August 2020; pp. 967–984. [Google Scholar]
  4. Ye, Q.; Hu, H.; Meng, X.; Zheng, H.; Huang, K.; Fang, C.; Shi, J. PrivKVM*: Revisiting key–value Statistics Estimation with Local Differential Privacy. IEEE Trans. Dependable Secur. Comput. 2021, 20, 17–35. [Google Scholar] [CrossRef]
  5. Cao, X.; Jia, J.; Gong, N.Z. Data Poisoning Attacks to Local Differential Privacy Protocols. In Proceedings of the 30th USENIX Security Symposium, Virtual Event, 11–13 August 2021; pp. 947–964. [Google Scholar]
  6. Wu, Y.; Cao, X.; Jia, J.; Gong, N.Z. Poisoning Attacks to Local Differential Privacy Protocols for key–value Data. In Proceedings of the 31st USENIX Security Symposium, Boston, MA, USA, 10–12 August 2022; pp. 519–536. [Google Scholar]
  7. Naor, M.; Pinkas, B. Computationally Secure Oblivious Transfer. J. Cryptol. 2005, 18, 1–35. [Google Scholar] [CrossRef]
  8. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22. [Google Scholar] [CrossRef]
  9. Horigome, H.; Kikuchi, H.; Yu, C.-M. Local Differential Privacy protocol for making key–value data robust against poisoning attacks. In Proceedings of the 20th International Conference on Modeling Decisions for Artificial Intelligence, Umeå, Sweden, 19–22 June 2023; pp. 241–252. [Google Scholar]
  10. Dwork, C. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, Venice, Italy, 10–14 July 2006; pp. 1–12. [Google Scholar]
  11. Li, H.; Xiong, L.; Jiang, X.; Liu, J. Differentially private histogram publication for dynamic datasets: An adaptive sampling approach. Inf. Knowl. Manag. 2015, 2015, 1001–1010. [Google Scholar]
  12. Yang, X.; Wang, T.; Ren, X.; Yu, W. Survey on Improving Data Utility in Differentially Private Sequential Data Publishing. IEEE Trans. Big Data 2021, 7, 729–749. [Google Scholar] [CrossRef]
  13. Abadi, M.; Chu, A.; Goodfellow, I.; McMahan, H.B.; Mironov, I.; Talwar, K.; Zhang, L. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 308–318. [Google Scholar]
  14. Sarwate, A.D.; Chaudhuri, K. Signal Processing and Machine Learning with Differential Privacy: Algorithms and Challenges for Continuous Data. IEEE Signal Process. Mag. 2013, 30, 86–94. [Google Scholar] [CrossRef] [PubMed]
  15. Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Berkeley, CA, USA, 26–29 October 2013; pp. 429–438. [Google Scholar]
  16. Kairouz, P.; Oh, S.; Viswanath, P. Extremal mechanisms for Local Differential Privacy. In Proceedings of the Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014; pp. 2879–2887. [Google Scholar]
  17. Warner, S.L. Randomized response: A survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 1965, 60, 63–69. [Google Scholar] [CrossRef]
  18. Bloom, B.H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 1970, 13, 422–426. [Google Scholar] [CrossRef]
  19. Duchi, J.C.; Jordan, M.I.; Wainright, M.J. Minimax optimal procedures for locally private estimation. J. ACM 2014, 61, 1–57. [Google Scholar] [CrossRef]
  20. Nguyên, T.T.; Xiao, X.; Yang, Y.; Hui, S.C.; Shin, H.; Shin, J. Collecting and analyzing data from smart device users with Local Differential Privacy. arXiv 2016, arXiv:1606.05053. [Google Scholar]
  21. Wang, N.; Xiao, X.; Yang, Y.; Zhao, J.; Hui, S.C.; Shin, H.; Shin, J.; Yu, G. Collecting and analyzing multidimensional data with Local Differential Privacy. In Proceedings of the 35th IEEE International Conference on Data Engineering, Macau SAR, China, 8–11 April 2019; pp. 638–649. [Google Scholar]
  22. Ren, X.; Yu, C.M.; Yu, W.; Yang, S.; Yang, X.; McCann, J.A.; Yu, P.S. LoPub: High-Dimensional Crowdsourced Data Publication With Local Differential Privacy. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2151–2166. [Google Scholar] [CrossRef]
  23. Fanti, G.; Pihur, V.; Erlingsson, Ú. Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries. In Proceedings of the Privacy Enhancing Technologies Symposium, Darmstadt, Germany, 19–22 July 2016; pp. 41–61. [Google Scholar]
  24. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288. [Google Scholar] [CrossRef]
  25. Li, Z.; Wang, T.; Milan, L.Z.; Li, N.; Škoric, B. Estimating Numerical Distributions under Local Differential Privacy. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 621–635. [Google Scholar]
  26. Cheu, A.; Smith, A.; Ullman, J. Manipulation Attacks in Local Differential Privacy. In Proceedings of the 2021 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 24–27 May 2021; pp. 883–900. [Google Scholar]
  27. Li, X.; Li, N.; Sun, W.; Gong, N.Z.; Li, H. Fine-grained Poisoning Attack to Local Differential Privacy Protocols for Mean and Variance Estimation. In Proceedings of the USENIX Security Symposium, Anaheim, CA, USA, 9–11 August 2023; pp. 1739–1756. [Google Scholar]
  28. Wang, S.; Luo, X.; Qian, Y.; Du, J.; Lin, W.; Yang, W. Analyzing Preference Data with Local Privacy: Optimal Utility and Enhanced Robustness. IEEE Trans. Knowl. Data Eng. 2023, 35, 7753–7767. [Google Scholar] [CrossRef]
  29. Imola, A.; Chowdhury, R.; Chaudhuri, K. Robustness of locally differentially private graph analysis against poisoning. arXiv 2022, arXiv:2210.14376. [Google Scholar]
  30. Sasada, T.; Taenaka, Y.; Kadobayashi, Y. Oblivious Statistic Collection with Local Differential Privacy in Mutual Distrust. IEEE Access 2023, 11, 21374–21386. [Google Scholar] [CrossRef]
  31. Borgnia, E.; Geiping, J.; Cherepanova, V.; Fowl, L.; Gupta, A.; Ghiasi, A.; Huang, F.; Goldblum, M.; Goldstein, T. Dp-instahide: Provably defusing poisoning and backdoor attacks with differentially private data augmentations. arXiv 2021, arXiv:2103.02079. [Google Scholar]
  32. Ma, Y.; Zhu, X.; Hsu, J. Data poisoning against differentially-private learners: Attacks and defenses. arXiv 2019, arXiv:1903.09860. [Google Scholar]
  33. Naseri, M.; Hayes, J.; Cristofaro, E.D. Local and central differential privacy for robustness and privacy in federated learning. In Proceedings of the Network and Distributed System Security (NDSS) Symposium 2022, San Diego, CA, USA, 24–28 April 2022. [Google Scholar]
  34. Moya, M.M.; Hush, D.R. Network constraints and multi-objective optimization for one-class classification. Neural Netw. 1996, 9, 463–474. [Google Scholar] [CrossRef]
  35. Liu, F.T.; Ting, K.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 IEEE International Conference on Data Mining, NW Washington, DC, USA, 15–19 December 2008; pp. 413–422. [Google Scholar]
  36. Cao, X.; Jia, J.; Gong, N.Z. Provably secure federated learning against malicious clients. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 6885–6893. [Google Scholar]
  37. Li, M.; Berrett, T.B.; Yu, Y. On robustness and local differential privacy. Ann. Statist. 2023, 51, 717–737. [Google Scholar] [CrossRef]
  38. Balle, B.; Wang, Y.X. Improving the gaussian mechanism for differential privacy. In Proceedings of the Machine Learning Research, Stockholm, Sweden, 10–15 July 2018; pp. 394–403. [Google Scholar]
  39. Acharya, J.; Sun, Z.; Zhang, H. Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication. In Proceedings of the Machine Learning Research, Long Beach, CA, USA, 9–15 June 2019; pp. 1120–1129. [Google Scholar]
  40. Bassily, R.; Nissim, K.; Stemmer, U.; Thakurta, A. Practical Locally Private Heavy Hitters. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2285–2293. [Google Scholar]
  41. Wang, T.; Blocki, J.; Li, N.; Jha, S. Locally Differentially Private Protocols for Frequency Estimation. In Proceedings of the USENIX security Symposium, Vancouver, BC, Canada, 16–18 August 2017; pp. 729–745. [Google Scholar]
  42. MovieLens 10M Dataset. Available online: https://grouplens.org/datasets/movielens/ (accessed on 1 August 2022).
  43. Clothing Fit Dataset for Size Recommendation. Available online: https://www.kaggle.com/datasets/rmisra/clothing-fit-dataset-for-size-recommendation/ (accessed on 1 August 2022).
  44. Phan, T.C.; Tran, H.C. Consideration of Data Security and Privacy Using Machine Learning Techniques. Int. J. Data Inform. Intell. Comput. 2023, 2, 20–32. [Google Scholar]
  45. Singh, P.; Pandey, A.K. A Review on Cloud Data Security Challenges and existing Countermeasures in Cloud Computing. Int. J. Data Inform. Intell. Comput. 2022, 1, 23–33. [Google Scholar]
  46. Jones, K.I.; Suchithra, R. Information Security: A Coordinated Strategy to Guarantee Data Security in Cloud Computing. Int. J. Data Inform. Intell. Comput. 2023, 2, 11–31. [Google Scholar]
Figure 1. Poisoning attack.
Figure 2. Proposed method.
Figure 3. $MSE_f$ of the frequency estimates with regard to $\epsilon$. (a) Synthetic data. (b) MovieLens. (c) Clothing.
Figure 4. $MSE_m$ of the mean estimates with regard to $\epsilon$. (a) Synthetic data. (b) MovieLens. (c) Clothing.
Figure 5. Frequency gain of poisoning attacks (synthetic data). (a) M2GA. (b) RMA. (c) RKVA.
Figure 6. Frequency gains of poisoning attacks (MovieLens). (a) M2GA. (b) RMA. (c) RKVA.
Figure 7. Mean gain of poisoning attacks (synthetic data). (a) M2GA. (b) RMA. (c) RKVA.
Figure 8. Mean gain of poisoning attacks (MovieLens). (a) M2GA. (b) RMA. (c) RKVA.
Figure 9. Frequency gain of poisoning attacks with OT. (a) Synthetic data. (b) MovieLens.
Figure 10. Mean gain of poisoning attacks with OT. (a) Synthetic data. (b) MovieLens.
Figure 11. Costs of the OT protocol. (a) Calculation cost. (b) Communication cost.
Table 1. Comparison of defense approaches.

Step | PrivKV [2] | Our Work
1. Pre-sampling | 1-out-of-d sampling | —
2. Perturbing | Value: VPP(v, ϵ2); Key: RR(k, ϵ1) | Value: VPP(v, ϵ2); Key: RR(k, ϵ1)
3. Post-sampling | — | 1-out-of-d OT
4. Estimating | MLE | EM
Table 2. Open datasets.

Item | Synthetic Data (Gauss) | MovieLens [42] | Clothing [43]
Ratings | 5,000,000 | 10,000,054 | 192,544
Users (n) | 100,000 | 69,877 | 9657
Items (d) | 100 | 10,677 | 3183
Value range | [−1, 1] | [0.5, 5] | [1, 10]
Table 3. Gains of emPrivKV relative to PrivKV with respect to b (MovieLens).

Fraction of malicious users b | 0.001 | 0.005 | 0.01 | 0.05 | 0.1 | 0.15 | 0.2
Frequency gain [%] | 66.7 | 20.8 | 18.6 | 19.3 | 17.1 | 16.7 | 15.8
Mean gain [%] | 18.2 | 20.4 | 24.5 | 30.6 | 25.9 | 21.4 | 27.8
Table 4. Gains of emPrivKV relative to PrivKV with respect to ϵ (MovieLens).

Privacy budget ϵ | 0.1 | 0.5 | 1 | 2 | 3 | 4 | 5
Frequency gain [%] | 0.6 | 4.8 | 90.0 | 41.7 | 50.0 | 55.6 | 66.7
Mean gain [%] | 1.6 | 14.6 | 15.5 | 36.2 | 40.1 | 41.3 | 44.8
Table 5. Gains of emPrivKV relative to PrivKV with respect to r (MovieLens).

Number of target keys r | 1 | 3 | 5 | 7 | 9 | 11 | 13 | 15
Frequency gain [%] | 0.7 | 18.8 | 17.8 | 21.2 | 18.2 | 16.3 | 15.8 | 15.2
Mean gain [%] | 20.8 | 13.5 | 21.8 | 30.2 | 25.8 | 27.7 | 25.9 | 25.5
Table 6. Comparison with previous research, where fake users are 0.1 times the genuine users.

 | PrivKV [2] | PrivKVM [2] | Wu et al. [6] | Proposed Method
Defense method | N/A | N/A | OCC | OT
Estimation method | MLE | MLE | MLE | EM
Frequency gain | 2.5 | 2.5 | — | 0.4
Mean gain | 10.5 | 30.2 | — | 2.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

