Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques

Zhao, Feng; Fan, Song

doi:10.3390/info15100630

Open AccessArticle

Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques

by

Feng Zhao

and

Song Fan

^*

Cyberspace Security Institute, Chang’an Campus, Xi’an University of Posts and Telecommunication, Xi’an 710061, China

^*

Author to whom correspondence should be addressed.

Information 2024, 15(10), 630; https://doi.org/10.3390/info15100630

Submission received: 9 September 2024 / Revised: 10 October 2024 / Accepted: 11 October 2024 / Published: 12 October 2024

(This article belongs to the Special Issue Digital Privacy and Security, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

The real-time data collected by wearable devices enables personalized health management and supports public health monitoring. However, sharing these data with third-party organizations introduces significant privacy risks. As a result, protecting and securely sharing wearable device data has become a critical concern. This paper proposes a local differential privacy-preserving algorithm designed for continuous data streams generated by wearable devices. Initially, the data stream is sampled at key points to avoid prematurely exhausting the privacy budget. Then, an adaptive allocation of the privacy budget at these points enhances privacy protection for sensitive data. Additionally, the optimized square wave (SW) mechanism introduces perturbations to the sampled points. Afterward, the Kalman filter algorithm is applied to maintain data flow patterns and reduce prediction errors. Experimental validation using two real datasets demonstrates that, under comparable conditions, this approach provides higher data availability than existing privacy protection methods for continuous data streams.

Keywords:

local differential privacy; infinite data streams; privacy protection

Graphical Abstract

1. Introduction

With the rapid advancement of information technology, wearable devices have become essential tools for health monitoring, gaining global attention and widespread adoption [1]. Their real-time capabilities, affordability, and portability provide innovative solutions for both personal and professional healthcare. According to the latest Internet Data Center (IDC) data, the global and Chinese markets are experiencing continuous growth in wearable device shipments, including general health monitors, like smartwatches, and specialized medical devices, such as blood glucose meters and blood pressure monitors. These devices enable real-time health data monitoring and analysis in daily life, support the ongoing management of chronic diseases, and assist medical professionals in clinical decision-making. However, as the popularity and use of wearable medical devices continue to rise, ensuring privacy protection and data security has become increasingly critical. Users’ health data, including metrics such as heart rate, physical activity levels, and sleep patterns, are not only crucial for personal health management but also contain sensitive information that requires robust protection and lawful handling.

The real-time data streams collected by wearable devices need to be aggregated and analyzed by trusted third-party organizations to fully realize their social and personal benefits [2]. Consequently, there has been a growing focus on enhancing the availability of these data while simultaneously safeguarding the privacy of wearable device users. To address this challenge, Differential Privacy (DP) [3] has been widely adopted as a robust privacy protection framework. However, DP assumes that the server is trustworthy. In practice, servers may inadvertently or deliberately compromise user privacy due to curiosity or commercial interests, leading to potential privacy breaches. To mitigate the risk of privacy leakage by the server, Local Differential Privacy (LDP) [4] was introduced. LDP has the advantage of locally protecting a significant amount of end-user data. LDP shows great promise in enabling statistical analysis of data streams without relying on trusted third-party entities. Its goal is to preserve the privacy of individual data during both data collection and transmission. The core principle of LDP involves locally injecting noise at the data source to obscure the true values of individual data, allowing external entities to perform aggregated statistical analyses without compromising individual privacy.

However, the current method for protecting privacy in wearable device data streams using Local Differential Privacy (LDP) relies on the random response technique [5] to introduce noise into the data, thereby ensuring user privacy. While effective in protecting privacy, this approach significantly compromises data availability. To address these challenges, the authors [6] proposed the Pattern-LDP algorithm to enhance privacy protection in data streams. The algorithm first employs piecewise linear approximation to normalize the data stream and selects the furthest point within a fixed error threshold as the sampling point. However, Pattern-LDP requires continuous normalization, making it less suitable for the real-time, dynamic data streams generated by wearable devices. Additionally, when allocating the privacy budget, the algorithm only considers the speed of data stream fluctuations, neglecting the directional trends in these fluctuations, which are crucial indicators for describing dynamic real-time data streams. Furthermore, the algorithm does not optimize non-sampling data points, leading to poor availability of the published data streams and potential privacy leaks. Thus, there is an urgent need for a new and effective privacy protection method for streaming data generated by wearable devices—one that accommodates the real-time nature of dynamic data streams and allows wearable device users to locally perturb data before real-time uploading.

In response to the challenges mentioned above, this paper introduces a lightweight WIDS-LDP algorithm specifically designed for wearable devices. The algorithm consists of two primary components: the wearable device side and the device service provider side. First, on the wearable device side, significant data points are sampled based on trends and rates of fluctuation within the data stream. The privacy budget is then adaptively allocated to these sampled points, which are subsequently perturbed to protect user privacy. Second, on the device service provider side, post-processing optimization is applied to both the non-sampled and sampled points. This optimization enhances the overall utility of the published data, thereby minimizing privacy risks. The main contributions of this paper are summarized as follows:

(1): This paper proposes a local differential privacy protection framework specifically designed for wearable devices, aiming to enhance data availability while safeguarding the privacy of wearable device users.
(2): The adaptive privacy budget mechanism is optimized based on the characteristics of sampling points in the data stream, resulting in a more reasonable allocation of the privacy budget. Additionally, an improved SW mechanism is applied for perturbation, ensuring that data with smaller errors are output with higher probability.
(3): Comparisons with existing methods, using real datasets, demonstrate that this approach not only effectively protects user privacy but also preserves the availability of data streams.

2. Related Work

In this paper, we focus on local differential privacy (LDP) methods for protecting data streams. We start by introducing privacy protection techniques for sequential data streams and then provide an overview of LDP-based methods.

Privacy Protection for Time Series Data: Data stream privacy protection can be categorized into two types based on different usage scenarios: privacy protection for aggregate statistical analysis and privacy protection for time series analysis. For the first type, the authors in [7] proposed a variant of smooth projection hashing to construct a privacy protection scheme for aggregate statistical analysis. However, this scheme is relatively complex and unsuitable for wearable devices. For the second type, Zheng et al. [8] proposed a scheme involving querying within a similar range and then reporting the results. This solution is designed for similarity queries over time series data and is not applicable to the privacy protection of sensitive data. In [9], the authors introduced a novel method that combines cryptographic algorithms with emerging data mining technologies to ensure the privacy protection of time series data. The key idea of this scheme is to utilize the observation of plaintext DTW scores and promote scalable computation in the ciphertext domain through a customized security design. However, this work focuses on the privacy protection of sensitive data, which does not align with the requirement for differential privacy. These algorithms effectively guarantee the privacy and security of time series data streams but do not consider differential privacy.

Data Stream DP: Differential Privacy (DP) was first introduced by Dwork et al. [3] to balance data availability with privacy protection requirements. Local Differential Privacy (LDP) [4] can mitigate the risk of privacy leakage associated with DP, particularly when dealing with untrustworthy third-party servers. Guan et al. [10] introduced the EDPDCS clustering scheme, which incorporates a privacy-preserving clustering method within the Map-Reduce framework. This approach uses K-means clustering combined with differential privacy (Laplace noise) to enhance the accuracy of published data. Han et al. [11] proposed the PPM-HAD algorithm, which supports operations such as (mean, variance) addition and (minimum/maximum, median) aggregation. This mechanism is particularly effective for cloud servers and can strongly resist differential attacks, but it is not applicable to wearable devices. Saleheen [12] proposed the mSieve algorithm, which integrates data-driven technology with Laplace noise to obfuscate sensitive data on demand while maintaining differential privacy. However, this approach is associated with significant error and is, therefore, not well-suited for protecting sensitive data in wearable devices. Additionally, other technologies, such as the exponential mechanism [13], Fourier algorithm [14], and classification trees [15], can be integrated with differential privacy.

Wearable Devices LDP: Kim et al. [16] proposed a privacy-preserving aggregation algorithm based on LDP. This algorithm identifies key points from the original data, adaptively adds random noise to these points, and linearly connects the noisy key point values to reconstruct data curves. Although Kim et al.’s algorithm can reconstruct data flows, it suffers from significant errors due to excessive noise and an unreasonable data curve reconstruction method. Li et al. [17] improved upon this algorithm by integrating the concept of random response with LDP and adaptively adjusting the noise magnitude based on the characteristics of the original data. However, Li’s algorithm still uses a linear reconstruction method for predicting non-sampled points. Despite proposed enhancements using interpolation and fitting, the improvement remains limited. Additionally, Li’s privacy budget allocation strategy employs a uniform allocation scheme, which does not fully address the protection needs of critical data, thereby posing a risk of important data leakage. Zhang et al. [18] proposed the RE-Dpocpor algorithm, which utilizes the Laplace noise mechanism combined with adaptive sampling, filtering, and budget allocation algorithms to publish differential privacy-protected real-time health data collected over w consecutive days. However, because this scheme relies on information from future timestamps, it is not applicable to infinite data streams. Tu et al. [2] proposed a mean data publishing algorithm for wearable devices that offers high availability for this purpose. However, the algorithm requires users to calculate the mean global sensitivity in advance. As a result, when dealing with different big data statistics, the global sensitivity must be recalculated, leading to reduced availability. Furthermore, calculating global sensitivity necessitates the use of the entire dataset, which is not practical for the continuous data streams generated by wearable devices. Although the aforementioned privacy protection measures for wearable devices can safeguard data flow, they each have limitations. There remains a gap in privacy protection for unlimited data streams from wearable devices when utilizing local differential privacy techniques.

Therefore, it is essential to further investigate and develop a local differential privacy protection method that ensures the published values are as close to the true values as possible while effectively safeguarding user privacy. Additionally, the method should be applicable to wearable devices and capable of preserving the unique patterns of real-time dynamic data streams.

3. Basic Knowledge

In this section, we present the problem statement pertinent to our research topic and introduce the foundational theoretical concepts relevant to this paper.

3.1. Infinite Data Streams

Wearable devices collect data at fixed intervals, forming a data stream

S = (x_{1}, x_{2}, \dots, x_{n})

, where

x_{i}

denotes the data point at the

i

timestamp (

0 \leq i \leq n

), n denotes the length of the data stream, and S represents the wearable device. Device S allows users to customize timestamps and sampling intervals according to the requirements of different data types, facilitating the creation of data curves by connecting timestamped data points. Data collection and analysis are typically conducted over a defined period (e.g., 24 h) to generate a finite data stream. Wearable devices continuously collect data from users’ wrists as long as the device is worn, resulting in what is referred to as an infinite data stream. When applying differential privacy protection to finite data streams, a thorough analysis based on the data’s characteristics can enhance privacy protection effectiveness. However, for infinite data streams, the uncertainty of future data points requires predicting information for the next timestamp based on previous data, necessitating more sophisticated methods to ensure privacy protection.

3.2. Problem Statement

The data stream collected by a single wearable device is represented as a univariate discrete time series x. The aggregated set of multiple discrete time series x at discrete times n, where

0 \leq n \leq T

and T is the length of the sequence, is denoted as S. S represents the aggregated sequence of raw data. For example, S could be the aggregated numerical sequences of heart rate data collected from 20 individuals over a period of time. The goal of this article is to publish the sanitized version S* of the aggregated sequence S in real time. S* is the published data that satisfies local differential privacy. After aggregation and analysis, S* can provide valuable insights across various aspects. The algorithm’s usage scenario is illustrated in Figure 1.

3.3. Local Differential Privacy

Local Differential Privacy (LDP) is a privacy protection mechanism focused on safeguarding the privacy of individual users or devices. Its purpose is to protect privacy during the collection and transmission of individual data. This is achieved by adding noise to the data locally at the time of collection. The processed data, which are now privacy-protected, are then transmitted to the data collector. Typically, this provides a lighter level of privacy protection compared to other methods. The specific definition is provided in Formula (1):

\Pr [M (S) \in O] \leq e^{ε} \cdot \Pr [M (S^{'}) \in O]

(1)

Here, S and S′ are sibling datasets that differ by at most one data point. If the probability that the result of the random algorithm M applied to these two datasets satisfies the specified formula is as described, then the random algorithm M satisfies ε-local differential privacy. Here, ε represents the privacy budget [6], which specifies the level of privacy protection provided. A smaller ε indicates a stronger privacy guarantee but introduces more noise and reduces accuracy. Conversely, a larger ε provides weaker privacy protection but allows for higher accuracy.

3.4. W-Event Privacy

W-event privacy [19]: A mechanism M is said to satisfy w-event privacy if, for any two datasets D and D′ that differ in their data values on w events, and for any possible output S satisfying Formula (2), the following condition holds:

\Pr [M (D) \in O] \leq e^{ε} \cdot \Pr [M (D^{'}) \in O]

(2)

where ε is the privacy parameter. Here, ε quantifies the difference in the distribution of query results between D and D′, and

P r [M (D) \in O]

represents the probability that the mechanism’s output falls in the set O when the dataset is D.

4. Recommend Method

In this section, we propose a new privacy protection method for infinite data streams collected by wearable devices in real time, called WIDS-LDP (Wearables Infinite Data Stream-Local Differential Privacy).

4.1. WIDS-LDP

The framework design of WIDS-LDP is shown in Figure 2. The method includes two parts: the wearable device side and the device service provider side. First, the wearable device side performs salient point sampling, privacy budget allocation, and data perturbation. Second, the device service provider side performs post-processing optimization.

Salient point sampling: We use a method that combines linear fitting equations with the least squares approach (hereinafter referred to as LFLS) to sample salient points, effectively representing the patterns in real-time streaming data.

Privacy budget allocation and perturbation: Based on the salient points, as well as the volatility and fluctuation amplitude of the streaming data identified in the first part, the privacy budget is adaptively allocated, and perturbation is applied using the SW mechanism.

Post-processing optimization: Kalman filtering is used to predict non-sampling point data, while the perturbed sampling point data are predicted and updated to enhance accuracy.

4.2. Significant Point Sampling

In this section, we provide a detailed introduction to the LFLS algorithm. The algorithm begins by using the least squares method [20] to determine the mathematical representation of the linear fitting equation for the current significant point. Next, based on this linear fitting equation, it assesses whether the next significant point can be fitted into the current equation. This approach effectively models the fluctuation pattern of the flow data, allowing for accurate sampling of the pattern’s significant points.

4.2.1. First-Order Difference Method

Assuming the data stream is

S = {x_{1}, x_{2}, \dots, x_{n}}

, this method can determine whether the point at time

t

is significant once the point at time

t + 1

is known. The steps are as follows: First, calculate the first-order difference

{D i f f}_{i} = x_{t} - x_{t - 1}

corresponding to each data point in the stream. When the sign of the slope of the linear fitting equation between

x_{t}

and

x_{t + 1}

, and

x_{t - 1}

changes,

x_{t}

is identified as a sampling point.

4.2.2. LFLS

The fitting process of the LFLS algorithm is as follows: First, we select two adjacent significant points

A (a, x_{a})

and

B (b, x_{b})

, and derive the fitting straight line equation

F (x) = k x + b

, where the slope

k

is calculated as

k = (x_{b} - x_{a}) / (b - a)

. For the next significant point

C (c, x_{c})

, the straight-line equation connecting B and C is given by

F (x^{'}) = k x^{'} + b

, where

k = (x_{c} - x_{b}) / (c - b)

. Assume that the angle between

F (x)

and

F (x^{'})

does not exceed a certain threshold, meaning

\tan θ = |\frac{k - k^{'}}{1 + k k^{'}}| \leq \tan α

(3)

If the angle between the fitting equation of

A

and

B

and the line connecting

B

and

C

does not exceed the specified threshold,

C

can be considered to fit the linear equation of

A

and

B

. This allows us to calculate the best-fitting line for points

A

,

B

, and

C

. However, if the angle between the fitting lines of

A

and

B

and

C

exceeds the threshold, the linear fitting equation is recalculated starting from

C

. By applying this process, we generate a set of linear fitting equations. If adjacent fitting lines exhibit the same trend, the point is not deemed significant.

4.2.3. Dynamic Angle α

We use a PID controller to represent the fluctuation rate of the data stream. A greater fluctuation rate indicates a faster rate of change of the data flow, while a smaller fluctuation rate reflects a slower rate of change. The complete PID algorithm is shown below:

Δ_{k_{n}} = C_{p} E_{k_{n}} + \frac{C_{i}}{T_{i}} \sum_{j = n - T_{i} + 1}^{n} E_{k_{j}} + C_{d} \frac{E_{k_{n}} - E_{k_{n - 1}}}{k_{n} - k_{n - 1}}

(4)

Among them,

C_{p}

represents the gap between the target value and the actual value,

C_{i}

represents the accumulation of error over time, and

C_{d}

represents the error in predicting future values.

T_{i}

represents the number of errors in the cumulative integral error.

E_{k_{i}}

is the feedback error, expressed as

E_{k_{i}} = |x_{t} - x_{t}^{'}|, t

represents the timestamp,

x_{t}^{'}

represents the predicted value.

The formula α = λπ/2 indicates that changes in α lead to corresponding changes in λ. To dynamically adjust λ, we can change α. Since λ needs to remain within the range (0,1), we use an exponential function to determine its value. Accordingly, λ can be defined as follows:

λ = 1 - \exp (- (\frac{1}{|k|} + Δ_{k_{n}}))

(5)

Therefore, depending on the trend and rate of data flow, we can dynamically change the angle threshold to adaptively change the sampling interval to balance privacy and utility. The specific salient point sampling is shown in Algorithm 1.

Algorithm 1: Significant point sampling

Input: Raw data

S = (x_{1}, x_{2}, \dots, x_{n})

,

α

Output:

S^{'}

for t

\in

[2,n] do

if

t + 1 < n

then

k = x_{t} - x_{t - 1}

,

k^{'} = x_{t} - x_{t - 1}

end
Calculate

\tan θ = |\frac{k - k^{'}}{1 + k k^{'}}|

if

\tan θ

<

\tan α

x_{t} \in S

end
end

4.3. Budget Allocation and Perturbations

4.3.1. Adaptive Privacy Budget Allocation

We use the LBD (LDP Budget Distribution) model to adaptively allocate privacy budgets based on the characteristics of the sampling points. For a single sliding window, the privacy budget ε is evenly distributed to the difference budget and the release budget. First, the difference budget is evenly distributed to each timestamp. The perturbed data are used to predict the difference error. Then, the remaining privacy budget for the current timestamp is calculated and the difference budget is adaptively allocated. The specific adaptive privacy budget allocation is shown in Algorithm 2.

Algorithm 2: Adaptive privacy budget allocation

Input: Privacy budget

ε

, window size

ω

Output: Perturbation data

p = (p_{1}, p_{2}, \dots, p_{n})

Initialize

i = 1

for

t \in [t - 1, n]

do

ε_{t, 1} = ε / (2 ω)

Calculate remaining publication budget

ε_{r m} = \frac{ε}{2} - \sum_{i = t - ω + 1}^{t - 1} ε_{t, 2}

ε_{t, 2} = ε_{r m} / 2

ε_{i} = ε_{t, 1} + ε_{t, 2}

end

4.3.2. Data Perturbations

Gao et al. [21] applied an enhanced version of the SW mechanism [22] to perturb data streams. This improved method updates the perturbation probability and range of the traditional SW mechanism, resulting in perturbed data that more closely align with the original data curve. The mechanism is defined as follows:

b_{i} = (ε_{i} e^{ε_{i}} - e^{ε_{i}} + 1) / (2 e^{ε_{i}} (e^{ε_{i}} - 1 - ε_{i}))

(6)

The disturbance probability is expressed in Equation (7).

\{\begin{cases} p_{i} = e^{ε_{i}} / 2 b_{i} e^{ε_{i}} + 1, if |x_{i} - {\tilde{x}}_{i}| \leq b_{i} \\ q_{i} = (2 b_{i} e^{ε_{i}} + 1)^{- 1}, o t h e r w i s e \end{cases}

(7)

Therefore, the original data with smaller prediction errors are output with probability

p_{i}

, while the original data with larger prediction errors are output with probability

q_{i}

. This approach not only minimizes the introduction of noise but also ensures user privacy

4.4. Post-Processing Mechanism

Kalman et al. [23] proposed a method to solve linear filtering and prediction problems using linear state equations, called Kalman filtering. The algorithm consists of two parts: prediction and update.

Prediction: Estimate the state at the current time based on the posterior estimate (update value) at the previous time and derive the prior estimate (prediction value) at the current time. The specific process is as follows:

{\tilde{x}}_{t} = A {\hat{x}}_{t - 1} + B u_{t - 1}

(8)

{\tilde{R}}_{t} = A {\hat{P}}_{t - 1} A^{T} + Q

(9)

Here,

{\hat{x}}_{t - 1}

is the filtering result, also called the best result. A represents the state transfer matrix, and B represents the input control matrix.

{\hat{x}}_{t - 1}

represents the external operation at timestamp t − 1. Q is the covariance of the state transition. Updating means using the measured value at a certain moment to correct the predicted system state. The specific process is as follows:

K_{t} = \frac{{\tilde{P}}_{t} H^{T}}{H \tilde{P} H^{T} + R}

(10)

{\hat{x}}_{t} = {\hat{x}}_{t} + K_{t} ({\bar{x}}_{k} - H {\tilde{x}}_{t})

(11)

{\hat{P}}_{t} = (1 - K_{t} H) {\tilde{P}}_{t}

(12)

Here, H represents the transformation matrix of the state variables.

K_{t}

represents the Kalman gain. Algorithm 3 is shown below.

Algorithm 3: Adaptive privacy budget allocation

Input: Perturbation data

p = (p_{1}, p_{2}, \dots, p_{n})

Output: Release data

r = (r_{1}, r_{2}, \dots, r_{n})

for

t \in [t - 1, n]

do

{\hat{x}}_{i}^{-} = {\hat{x}}_{i - 1}

P_{i}^{-} = p_{i - 1} + Q

K_{K} = P_{i}^{-} {(P_{i}^{-} + R)}^{- 1}

{\hat{x}}_{i} = {\hat{x}}_{i}^{-} + K_{K} (x_{i} - {\hat{x}}_{i}^{-})

P_{i} = (1 - K_{K}) P_{i}^{-}

r_{i} = {\hat{x}}_{i}

end

4.5. Theoretical Analysis

Theorem 1.

Serial Combination Theorem.

Proof.

In the dataset D, assuming that the algorithm contains M random algorithms

A_{i}

,

A_{i}

satisfies

ε_{i}

-differential privacy, and the random processes between mechanisms are independent of each other, then the algorithm satisfies

\sum_{1 \leq i \leq M} ε_{i}

-differential privacy. □

Theorem 2.

WIDS-LDP satisfies

ε

-LDP.

Proof.

In the WIDS-LDP algorithm, the perturbation module and the privacy budget allocation module access the original data, and the others are operations on the perturbed data. According to [24], as long as the post-processing algorithm does not directly use the original data information, the post-processing algorithm is privacy-preserving. Therefore, if we can prove that the perturbation and privacy budget allocation modules satisfy

ε

-local differential privacy, then the solution in this paper satisfies it. □

According to [21], the SW perturbation module satisfies

ε_{i}

-local differential privacy. According to Algorithm 2 and Theorem 1, the whole algorithm satisfies

\sum_{1 \leq i \leq n} ε_{i}

-local differential privacy. Also, because

\sum_{1 \leq i \leq n} ε_{i} \leq ε

, the algorithm satisfies

ε

-LDP.

5. Experiment

In this section, we first describe the specific experimental settings of this study, followed by an introduction to the comparative scheme. Finally, we evaluate the performance of the proposed scheme using two real datasets. The evaluation focuses on three key aspects: (1) the impact of different window sizes on the error rate; (2) the impact of different privacy budgets on the error rate; and (3) the impact of different data lengths on the error rate. These results will provide a crucial basis for understanding the effectiveness and applicability of the proposed scheme.

5.1. Experimental Environment

The experimental part of this paper is completed on a personal computer equipped with an Intel(R) Core(TM) i7-8565UCPU, 8 GB RAM, and a 64-bit Windows 11 operating system. The algorithm is implemented using MATLAB 2020a and is compiled and run in this environment.

5.2. Real Dataset

PAMAP [25]: The PAMAP dataset has 9 subjects (8 males and 1 female) wearing three inertial measurement units and heart rate monitors to record 18 activities, with a total of more than 10 h of data collected. The heart rate data of 8 people were selected from the dataset as the raw data streams of the experiment, and the length of each data stream was 3000 (Table 1). The test subjects were numbered from 1 to 8 and recorded once every minute, so the total data were 3 × 8 K.

Taxi [26]: The dataset contains the real-time movement trajectories of 10,357 taxis. The real-time location was extracted every 10 min, with a total of 886 timestamps. The area was divided into 5 grids, that is, T = 5, and we obtained d = 10,357 data streams for each taxi.

Heart rate data are a crucial indicator for wearable devices in monitoring users’ health status, as they reflect physiological conditions, activity levels, and their changes. Additionally, the continuous nature of the data collection process aligns with the definition of a data stream. The PAMAP dataset has been extensively utilized in numerous related studies, and its findings are well-recognized, allowing us to compare our research with existing results to validate the effectiveness and innovation of our proposed method. Similarly, the TAXI dataset provides location information, and its data collection process also adheres to the definition of a data stream, which is pertinent to the research direction of this article.

5.3. Comparison Scheme

In this section, this article not only compares WIDS-LDP and PP-LDP, but also the following two solutions:

LDP Budget Distribution (LBD) [27]: The scheme allocates the privacy budget in an exponentially decreasing manner. The perturbation data distribution is

h = (d - 2 + e^{ε}) / ({(e^{ε} - 1)}^{2}), p_{i} \in [x_{i} - h, x_{i} + h]

.

Piecewise Mechanism (PM) [28]: The data perturbation of this scheme is defined as

h = (\frac{4}{3} e^{\frac{ε}{2}} / {(e^{\frac{ε}{2}} - 1)}^{2})

,

p_{i} \in [x_{i} - h, x_{i} + h]

.

5.4. Experiment Indicatorsr

This paper selects the mean relative error (MRE) as an indicator to measure the experimental error.

MRE = \frac{1}{n} \times \sum_{d = 1}^{n} \frac{|{AVG}_{actual} {(x}_{d} {) - AVG}_{est} {(x}_{d})|}{{AVG}_{actual} {(x}_{d})}

(13)

AVGest(x_d) and AVGactual(x_d) represent the estimated average and the actual average of x_d at timestamp t_d, respectively, and n denotes the sequence length. MRE is used as an indicator to measure data availability, with smaller values indicating lower errors and higher availability.

5.5. Experiment Parameters

In the adaptive sampling stage, this paper uses a PID controller to sample significant points according to the data flow fluctuation trend. We set the PID parameters as follows:

C_{P}

= 0.8,

C_{i}

= 0.1, and

C_{d}

= 0.1. When evaluating the solution of this article, this article uses the control variable method to evaluate the effectiveness of this method in generating real-time data streams on wearable devices for different privacy budgets, different sliding window lengths, and different data stream lengths. Each experiment was run 100 times, and the results were averaged. The privacy budget (defaults to 1 in other cases) and the sliding window length (defaults to 20 in other cases) in this article replicate the dataset to simulate different data stream lengths (Table 2).

5.6. Program Utility

5.6.1. MRE Analysis of Different Window Lengths

As shown in Figure 3, the impact of different window lengths on MRE is illustrated. Four different schemes were tested on the PAMAP and Taxi datasets with a privacy budget of 1 and data length of

20 \times 10^{4}

. As the sliding window length increases, the MRE gradually increases. This is because a longer sliding window contains more sampling points, leading to a smaller privacy budget allocated per point, which, in turn, increases the error. Therefore, the sliding window length is directly proportional to the MRE.

The PM and LBD algorithms exhibit larger errors compared to WIDS-LDP and PP-LDP across all window lengths. This is because WIDS-LDP and PP-LDP employ post-processing algorithms for optimization after perturbation, allowing them to predict and correct the perturbed data, resulting in lower errors. In the PAMAP dataset experiment, as shown in the left figure of Figure 3, the PM and LBD algorithms display significant fluctuations when the window length is between [10, 30], indicating that these algorithms are less stable across different datasets. In contrast, WIDS-LDP and PP-LDP demonstrate greater stability and are better suited for enhancing privacy protection and data availability.

5.6.2. MRE Analysis of Different Privacy Budgets

As shown in Figure 4, the impact of different privacy budgets on MRE is illustrated. Four different schemes were tested on various datasets with a sliding window length of 20 and data length of

20 \times 10^{4}

. As the privacy budget increases, the MRE decreases. This is because a higher privacy budget per window allows for a larger allocation of the budget to each sampling point, reducing the error, provided other conditions remain unchanged. Thus, the privacy budget is inversely proportional to the MRE.

In these experiments, the PM and LBD algorithms show the highest error when the privacy budget is 0.5. As the privacy budget increases, the error decreases, reaching its lowest point at a budget of 2.5. However, a very high privacy budget is not ideal, as it compromises privacy protection, which must be avoided. In the Taxi dataset, as illustrated in the right figure of Figure 4, the LBD algorithm experiences significant fluctuations when the privacy budget is 2. This fluctuation is similar to the errors observed with the PP-LDP and WIDS-LDP algorithms, suggesting that while privacy protection is effective, the error with LBD is greater compared to PP-LDP and WIDS-LDP in other scenarios. WIDS-LDP proves to be more suitable for wearable device environments after optimizing the privacy budget solution.

5.6.3. MRE Analysis of Different Data Lengths

As shown in Figure 5, this paper replicates the experimental dataset to simulate an infinite data flow scenario. Four different schemes were tested on various datasets with a sliding window length of 20 and a privacy budget of 1. As the number of data flows increases, the MRE gradually decreases. This is because, with longer data streams, previous data provides insights for processing subsequent data, resulting in reduced error over time. This characteristic aligns with the privacy protection requirements for infinite data streams from wearable devices, where the data characteristics of users are generally stable.

In these experiments, the PP-LDP and WIDS-LDP algorithms consistently show lower errors compared to PM and LBD across different data lengths. Nevertheless, as the data stream increases, the error decreases for all algorithms, indicating that all methods benefit from leveraging previous data experience.

In summary, WIDS-LDP and PP-LDP demonstrate stability across different experimental setups, indicating that both schemes are relatively robust and well suited for the privacy protection of infinite data streams. The WIDS-LDP scheme achieves a lower MRE compared to PP-LDP due to its optimized privacy budget allocation mechanism. Consequently, WIDS-LDP offers better data availability while ensuring privacy compared to existing schemes. For real-time dynamic infinite data streams from wearable devices, the WIDS-LDP scheme manages larger data volumes with progressively smaller MRE and higher data availability over time. The advantages and disadvantages of the proposed solutions are shown in Table 3.

6. Discussion

This paper primarily investigates the WIDS-LDP privacy protection method for dynamic, unlimited data streams collected by wearable devices and proposes a privacy protection framework specifically for these devices. The WIDS-LDP algorithm is implemented on both the wearable device management side and the user side, aiming to enhance data availability while safeguarding user privacy. The WIDS-LDP algorithm first identifies potential salient points using a linear fitting method. It then employs a PID controller to calculate an adaptive threshold based on previous data flows, dynamically sampling subsequent data flows and updating the threshold accordingly. This adaptive threshold ensures that data from previous points meet the criteria for a dynamic infinite data flow. Subsequently, an improved SW mechanism is used to probabilistically perturb the sampling points according to their characteristics, ensuring that user privacy is maintained. Finally, the Kalman filter mechanism is applied for post-processing optimization to prevent the inference of useful information from disturbed data points and to reduce prediction errors. Experiments conducted on two real datasets demonstrate that the WIDS-LDP scheme not only provides superior privacy protection but also enhances data availability. The WIDS-LDP framework contributes significantly to the expanding field of privacy-preserving data management by providing effective solutions tailored to dynamic data flows. Its implementation optimizes data availability while ensuring robust privacy protection, which is crucial for applications that rely on big data analytics.

The WIDS-LDP framework can be effectively applied across various domains, including health monitoring, fitness tracking, and personalized healthcare. In clinical settings, wearable devices continuously collect patient data while ensuring privacy, enabling real-time health monitoring without compromising sensitive information. In fitness applications, the WIDS-LDP framework allows users to share their performance data for analysis and improvement while maintaining personal privacy. This framework empowers users by enabling them to contribute to aggregated performance metrics without revealing their individual data, thus facilitating personalized training recommendations based on group performance trends.

Additionally, the framework offers solutions to common challenges in data security and privacy. By employing adaptive thresholding and probabilistic perturbation techniques, the WIDS-LDP algorithm preserves privacy while maintaining data utility. This approach enhances data security for users and improves data availability for analytics, allowing organizations to derive meaningful insights from large datasets without compromising individual privacy. Furthermore, implementing this framework can foster user trust, as individuals are more likely to adopt wearable technologies when they are confident that their personal data are protected.

Overall, the WIDS-LDP framework not only enhances privacy and security in data collection but also opens new avenues for data-driven decision making in health and fitness. Consequently, it encourages broader adoption of wearable technologies and supports the development of innovative health solutions that can significantly improve patient outcomes and enhance user experience.

Limitations: Despite these advances, this study has several limitations. The public dataset used for testing is relatively short and may not adequately capture the complexity of real-world data flows, potentially affecting the robustness of the results. Additionally, the algorithm may encounter errors when applied in real-world scenarios, impacting its effectiveness in certain cases. For instance, the algorithm’s assumptions about data distribution may not hold true across all operating environments, which could lead to inaccurate predictions or suboptimal performance.

Future work: To address these limitations, future research should prioritize the integration of diverse and comprehensive datasets that accurately reflect the complexity of real-world data flows. This effort could involve forming collaborations with industry partners to access proprietary datasets and employing synthetic data generation techniques to create more representative samples. By doing so, researchers can capture a broader range of variables and conditions that influence data behavior.

Furthermore, conducting rigorous cross-validation across various datasets will significantly enhance the reliability of the findings and support broader generalizations. This approach not only strengthens the robustness of the results but also ensures that the conclusions drawn are applicable to different contexts and scenarios. Ultimately, such methodological improvements will contribute to a deeper understanding of the underlying phenomena and promote more effective solutions to the challenges identified in this study.

Future work can also focus on a detailed analysis of error bounds across various application scenarios. By evaluating the performance of the WIDS-LDP algorithm in different environments, we can enhance the practical usability of data release while maintaining strong privacy protection. Such exploration is critical for adapting the framework to meet the specific needs of various applications in real-world settings.

Author Contributions

Conceptualization, S.F. and F.Z.; methodology, S.F.; software, S.F.; formal analysis, S.F.; investigation, S.F.; writing—original draft preparation, S.F.; supervision, F.Z.; project administration, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in [Github] at https://github.com/songsmallsong/Protecting-Infinite-Data-Streams-from-Wearable-Devices-with-Local-Differential-Privacy-Techniques (accessed on 8 September 2024). The datasets used in this paper are not strictly data streams, as the data points are discrete. These datasets were selected due to limitations in experimental equipment and the need for convenient verification. Despite having fixed timestamps and intervals, the data collection process is still real-time and continuous. Consequently, these datasets reflect the dynamic, continuous, and near-real-time characteristics of data streams. The choice of these smaller datasets was driven by computing resource constraints, but the methodology is designed for broader application to larger-scale data streams.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

Terminology	Abbreviation	Definition
Square wave	SW	A periodic waveform whose value rapidly switches between two fixed levels, commonly used in signal processing and test systems.
Kalman filtering	KF	An algorithm for estimating the state of a dynamic system that achieves recursive estimation of the system state by modeling measurement noise and system noise.
Mean relative error	MRE	A measure of prediction accuracy that calculates the average of the relative error between the predicted value and the actual value and is used to evaluate the performance of the model.
Linear fitting equations with least squares	LFLS	A statistical method that finds the best-fitting line by minimizing the squared difference between the observed data and the fitted model. It is widely used in data analysis and regression modeling.
LDP budget distribution	LBD	Refers to the allocation strategy of privacy budget (or noise level) in local differential privacy, aiming to balance the privacy protection and information utility of data.
Differential privacy	DP	A method of protecting personal privacy by adding random noise to query results to ensure that individual participation does not significantly affect the overall data analysis results.
Local differential privacy	LDP	An implementation of differential privacy that allows users to perturb their data locally, ensuring that the privacy of the data is protected before being transmitted to the server.

References

Babu, M.; Lautman, Z.; Lin, X.; Sobota, M.H.; Snyder, M.P. Wearable devices: Implications for precision medicine and the future of health care. Annu. Rev. Med. 2024, 75, 401–415. [Google Scholar] [CrossRef] [PubMed]
Tu, Z.X.; Liu, S.B.; Xiong, X.X. Differential privacy mean publishing of digital stream data for wearable devices. Comput. Appl. 2020, 40, 6. [Google Scholar]
Dwork, C.; Mcsherry, F.; Nissim, K. Calibrating noise to sensitivityn in private data analysis. In Proceedings of the Theory of Cryptography: Third Theory of Cryptography Conference, New York, NY, USA, 4–7 March 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 265–284. [Google Scholar]
Kasiviswanathan, S.P.; Lee, H.K.; Nissim, K. What can we learn privately? SIAM J. Comput. 2011, 40, 793–826. [Google Scholar] [CrossRef]
Yan, Y.; Chen, J.; Mahmood, A.; Qian, X.; Yan, P. LDPORR: A localized location privacy protection method based on optimized random response. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101713. [Google Scholar] [CrossRef]
Wang, Z.; Liu, W.; Pang, X. Towards pattern-aware privacy-preserving real-time data collection. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications, Virtual, 6–9 July 2020; pp. 109–118. [Google Scholar]
Benhamouda, F.; Joye, M.; Libert, B. A new framework for privacy-preserving aggregation of time-series data. ACM Trans. Inf. Syst. Secur. (TISSEC) 2016, 18, 1–21. [Google Scholar] [CrossRef]
Zheng, Y.; Lu, R.; Guan, Y. Efficient and privacy-preserving similarity range query over encrypted time series data. IEEE Trans. Dependable Secur. Comput. 2021, 19, 2501–2516. [Google Scholar] [CrossRef]
Liu, Z.; Yi, Y. Privacy-preserving collaborative analytics on medical time series data. IEEE Trans. Dependable Secur. Comput. 2020, 19, 1687–1702. [Google Scholar] [CrossRef]
Guan, Z.T.; Lv, Z.F.; Du, X.J.; Wu, L.F.; Guizani, M. Achieving data utility-privacy trade off in Internet of medical things, a machine learning approach. Future Gener. Comput. Syst. 2019, 98, 60–68. [Google Scholar] [CrossRef]
Song, H.; Shuai, Z.; Qinghua, L. PPM-HDA: Privacy-preserving and multifunctional health data aggregation with fault tolerance. IEEE Trans. Inf. Forensics Secur. 2016, 18, 1940–1955. [Google Scholar]
Saleheen, N.; Chakraborty, S.; Ali, N.; Rahman, M.M.; Hossain, S.M.; Bari, R.; Buder, E.; Srivastava, M.; Kumar, S. mSieve: Differential behavioral privacy in time series of mobile sensor data. In Proceedings of the 2016 ACM International Joint Conference, Heidelberg, Germany, 12–16 September 2016; ACM: Heidelberg, Germany, 2016; pp. 706–717. [Google Scholar]
Steil, J.; Hagestedt, I.; Huang, M.X.; Bulling, A. Privacy aware eye tracking using differential privacy. In Proceedings of the ACM. the 11th ACM Symposium, Denver, CO, USA, 20–25 June 2019; ACM: New York, NY, USA, 2019; pp. 1–9. [Google Scholar]
Bozkir, E.; Günlü, O.; Fuhl, W.; Schaefer, R.F.; Kasneci, E. Differential privacy for eye tracking with temporal correlations. PLoS ONE 2021, 16, e0255979. [Google Scholar] [CrossRef] [PubMed]
Zhang, S.Q.; Li, X.H. Differential privacy medical data publishing method based on attribute correlation. Sci. Rep. 2022, 12, 15725. [Google Scholar] [CrossRef] [PubMed]
Kim, J.W.; Jang, B.; Yoo, H. Privacy-preserving aggregation of personal health data streams. PLoS ONE 2018, 13, e0207639. [Google Scholar] [CrossRef] [PubMed]
Li, Z.B.; Wang, B.H.; Li, J.S. Local differential privacy protection for wearable device data. PLoS ONE 2022, 17, e0272766. [Google Scholar] [CrossRef] [PubMed]
Zhang, J.; Liang, X.; Zhang, Z.; He, S.; Shi, Z. Re-DPoctor: Real-time health data releasing with w-day differential privacy. In Proceedings of the IEEE.GLOBECOM 2017—2017 IEEE Global Communications Conference, Singapore, 4–8 December 2017; pp. 1–6. [Google Scholar]
Schäler, C.; Hütter, T.; Schäler, M. Benchmarking the Utility of w-Event Differential Privacy Mechanisms-When Baselines Become Mighty Competitors. Proc. VLDB Endow. 2023, 16, 1830–1842. [Google Scholar] [CrossRef]
Ding, F. Least squares parameter estimation and multi-innovation least squares methods for linear fitting problems from noisy data. J. Comput. Appl. Math. 2023, 426, 115107. [Google Scholar] [CrossRef]
Gao, W.; Zhou, S. Privacy-Preserving for Dynamic Real-Time Published Data Streams Based on Local Differential Privacy. IEEE Internet Things J. 2023, 11, 13551–13562. [Google Scholar] [CrossRef]
Li, Z.; Wang, T.; Lopuhaä-Zwakenberg, M.; Li, N.; Škoric, B. Estimating numerical distributions under local differential privacy. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 621–635. [Google Scholar]
Khodarahmi, M.; Maihami, V. A review on Kalman filter models. Arch. Comput. Methods Eng. 2023, 30, 727–747. [Google Scholar] [CrossRef]
Shanmugarasa, Y.; Chamikara MA, P.; Paik, H.; Kanhere, S.S.; Zhu, L. Local Differential Privacy for Smart Meter Data Sharing with Energy Disaggregation. In Proceedings of the 2024 20th International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT), Abu Dhabi, United Arab Emirates, 29 April–1 May 2024; pp. 1–10. [Google Scholar]
Reiss, A.; Stricker, D. Introducing new benchmarked dataset for activity monitoring. In Proceedings of the IEEE, The 16th International Symposium on Wearable Computers, ISWC 2012, Newcastle Upon Tyne, UK, 18–22 June 2012; pp. 108–109. [Google Scholar]
Available online: https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/ (accessed on 8 September 2024).
Ren, X.; Shi, L.; Yu, W. LDP-IDS: Local differential privacy for infinite data streams. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17 June 2022; pp. 1064–1077. [Google Scholar]
Wang, N.; Xiao, X.; Yang, Y.; Zhao, J.; Hui, S.C.; Shin, H.; Shin, J.; Yu, G. Collecting and analyzing multidimensional data with local differential privacy. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 638–649. [Google Scholar]

Figure 1. Usage scenario of the algorithm.

Figure 2. The framework design of WIDS-LDP.

Figure 3. The impact of different window lengths on MRE (left: PAMAP, right: Taxi).

Figure 4. The impact of different privacy budgets on MRE (left: PAMAP, right: Taxi).

Figure 5. The impact of different data lengths on MRE (left: PAMAP, right: Taxi).

Table 1. PAMAP data range table.

Tester	1	2	3	4	5	6	7	8
Size	3000	3000	3000	3000	3000	3000	3000	3000
Range	78~120	74~107	68~94	57~121	70~101	60~104	60~99	66~104

Table 2. Parameter setting table.

Parameter	$C_{P}$	$C_{i}$	$C_{d}$	$ε$	$ω$
Range	0.8	0.1	0.1	[0.5, 2.5]	[10, 50]
Default	0.8	0.1	0.1	1	20

Table 3. Solutions comparison.

Solution	Advantage	Disadvantage
PM	Supports multi-value and multi-attribute data	Not applicable for wearable devices; Unable to maintain the data flow pattern
LBD	Population division-based and data-adaptive algorithms; Two privacy budget allocation methods are more reasonable	Unable to maintain the dataflow pattern; The perturbation scheme has large errors; Not applicable for wearable devices
PP-LDP	Optimized SW data perturbation method; Maintain data flow patterns	Exponentially decreasing privacy budget allocation method
WIDS-LDP	A framework suitable for wearable devices; Maintain data flow patterns; LBD privacy budget allocation method	Only supports single-dimensional data

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhao, F.; Fan, S. Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques. Information 2024, 15, 630. https://doi.org/10.3390/info15100630

AMA Style

Zhao F, Fan S. Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques. Information. 2024; 15(10):630. https://doi.org/10.3390/info15100630

Chicago/Turabian Style

Zhao, Feng, and Song Fan. 2024. "Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques" Information 15, no. 10: 630. https://doi.org/10.3390/info15100630

APA Style

Zhao, F., & Fan, S. (2024). Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques. Information, 15(10), 630. https://doi.org/10.3390/info15100630

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Protecting Infinite Data Streams from Wearable Devices with Local Differential Privacy Techniques

Abstract

1. Introduction

2. Related Work

3. Basic Knowledge

3.1. Infinite Data Streams

3.2. Problem Statement

3.3. Local Differential Privacy

3.4. W-Event Privacy

4. Recommend Method

4.1. WIDS-LDP

4.2. Significant Point Sampling

4.2.1. First-Order Difference Method

4.2.2. LFLS

4.2.3. Dynamic Angle α

4.3. Budget Allocation and Perturbations

4.3.1. Adaptive Privacy Budget Allocation

4.3.2. Data Perturbations

4.4. Post-Processing Mechanism

4.5. Theoretical Analysis

5. Experiment

5.1. Experimental Environment

5.2. Real Dataset

5.3. Comparison Scheme

5.4. Experiment Indicatorsr

5.5. Experiment Parameters

5.6. Program Utility

5.6.1. MRE Analysis of Different Window Lengths

5.6.2. MRE Analysis of Different Privacy Budgets

5.6.3. MRE Analysis of Different Data Lengths

6. Discussion

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI