PPDC: A Privacy-Preserving Distinct Counting Scheme for Mobile Sensing

Abstract: Mobile sensing mines group information through sensing and aggregating users' data. Among major mobile sensing applications, the distinct counting problem, which aims to find the number of distinct elements in a data stream with repeated elements, is extremely important for avoiding waste of resources. Besides, the privacy protection of users is also a critical issue for aggregation security. However, it is a challenge to meet these two requirements simultaneously, since normal privacy-preserving methods would have a negative influence on the accuracy and efficiency of distinct counting. In this paper, we propose a Privacy-Preserving Distinct Counting scheme (PPDC) for mobile sensing. By integrating the basic idea of homomorphic encryption into the Flajolet-Martin (FM) sketch, PPDC allows an aggregator to conduct distinct counting over large-scale datasets without disrupting the privacy of users. Moreover, PPDC supports various forms of sensing data, including camera images, location data, etc. PPDC expands each bit of the hashing values of users' original data, so that the FM sketch is enhanced for encryption to protect users' privacy. We prove the security of PPDC under the known-plaintext model. The theoretical and experimental results show that PPDC achieves high counting accuracy and practical efficiency with scalability over large-scale datasets.


Introduction
With the rapid development of information technology and modern manufacturing, mobile devices are almost ubiquitous nowadays and have occupied an indispensable position in the daily lives of many. In particular, devices such as smartphones, which are equipped with ROMs, CPUs, and a variety of sensors such as GPS and cameras, are used not only for their traditional functions but also for sensing, data transmission, and calculation. These features make them ideal mobile carriers for researchers studying many issues, and the mobile sensing problem is one of these issues. In recent years, a considerable number of mobile sensing projects have been developed using different mobile devices, such as [1-3].
The process of mobile sensing can be described in the following steps: the sensing task publisher (or the aggregator) issues tasks to users with mobile devices; the users' mobile devices then collect sensing data and send them to the aggregator. After that, the aggregator processes all the data to draw valid conclusions. In general, the aggregator needs to collect and monitor users' data continuously, which means that the scale of the collected sensing data is considerable. The data can be in various forms, including camera images, location data, etc.
There are two essential challenges in actual mobile sensing projects. The first is whether users with mobile devices are willing to give their original sensing data to the aggregator. As the original data may contain users' private information such as physical location, consumption habits, and physical health, most users would react negatively to a mobile sensing application that lacks reliable privacy protection. The aggregation result would then be incomplete and unrepresentative. The second is how the aggregator can solve the distinct counting problem [4] when facing a huge sensing dataset with a large amount of duplicate data in various forms. If the aggregator does not have a good understanding of the cardinality of users' data, a lot of computing resources will be wasted on handling duplicate data during the whole aggregation. In addition, excessive repetitive elements in an aggregated dataset may make the characteristics of the data inconspicuous. For example, in a vehicular sensing network, users may transmit sensing information about road congestion to aggregators, and information about the same intersection from different users can be highly repetitive. Aggregators should not waste time and resources on duplicate data when they count congestion and prepare subsequent analysis, such as optimizing path selection. In this case, the aggregator should first compute cardinality statistics of the original data and then analyse the degree of traffic congestion. In other words, a solution that both protects users' privacy and solves the distinct counting problem is urgently needed.
Existing studies of the distinct counting problem in mobile sensing mainly focus on various algorithms (such as the Flajolet-Martin sketch [5] and LogLog [6]); few works have considered the privacy of users during data aggregation. Han et al. [7] propose a secure data aggregation scheme, but their security goal is to enable the traffic monitoring center to verify whether an aggregate sensing report is correct. Their notion of security concerns the integrity of the aggregate rather than the protection of users' privacy.
In this paper, we propose a scheme, Privacy-Preserving Distinct Counting scheme (PPDC for short), to solve the distinct counting problem with privacy protection of users. PPDC is based on a semi-honest model and it can complete distinct counting over large datasets with various forms of elements in the mobile sensing scenario. Through expanding each bit of the hashing values of users' original data added to the FM sketch, PPDC enhances the FM sketch to apply the bitwise XOR homomorphic encryption algorithm as an encryption method, so that users' privacy gets protected even under a known plaintext model. We conduct theoretical analysis and experiments, and the results show that our scheme achieves practical counting accuracy and efficiency.
The remainder of this paper is organized as follows. Section 2 defines system and security models and introduces several necessary preliminaries. In Section 3, we present the main idea and essential module of PPDC. Section 3 also analyzes the correctness and security of PPDC. After that, Section 4 provides the experimental results about the evaluations of accuracy rate and efficiency of PPDC. Section 5 discusses the related work. Finally, we conclude the paper in Section 6.

Problem Statements and Preliminaries
In this section, we describe in detail the system and security models for the privacy protection and distinct counting problem. Then we introduce the encryption algorithm and the aggregation sketch applied in our scheme.

System and Security Model
System model. We consider the following system model: a group of users with mobile devices provides data to a sensing task publisher, or aggregator, to carry out some sensing task. Assume that the sensing data of each user is a set of data with various forms of elements, including images, location data, and so on. The aggregator needs to find the number of distinct elements in all users' data, a big dataset composed of many sub-datasets. When transmitting sensing data, users do not reveal their original data to the aggregator. We discuss a general network model in mobile sensing, in which there is a direct communication channel between the aggregator and every mobile device user. That is to say, the aggregator and all users form a star network topology. The communication channels could be 3G/4G, WiFi, or other kinds of channels that are supported by the mobile devices and the aggregator in practical applications. Besides, each user's device is able to apply the hash operation and the bitwise XOR encryption to its sensing data and to transmit the processed data to the aggregator. The whole process of data aggregation in our scheme PPDC is described in Figure 1.
Security model. In this paper, we assume a semi-honest model. The aggregator and all users follow the data transmission and collection process described above. However, they may attempt to derive extra information about other participants' private inputs during the execution, which they should not know. Therefore, the scheme is considered secure if it guarantees that every participant learns no more information from the process than the information that this participant is entitled to know. Users should not be able to obtain each other's data values without permission. As for the aggregator, except for the encrypted data from users and the result calculated from the aggregated data, no extra knowledge about users ought to be acquired or inferred from the data he/she aggregates.

XOR Homomorphic Encryption
We choose bitwise XOR homomorphic encryption as the encryption algorithm in this paper; its main idea is very similar to that of the additively homomorphic encryption scheme proposed in [8]. A trusted third party, the authority, is needed during key generation. Let $f_m(\cdot)$ denote a function, indexed by a seed $m \in \{0,1\}^{\gamma}$, in a pseudo-random function family mapping $\alpha$-bit inputs to $\beta$-bit outputs, where $\alpha, \beta, \gamma \in \mathbb{N}$. Let $t \in \{0, \dots, 2^{v}-1\}$ denote the nonce of the data. The encryption algorithm proceeds as follows.
a. Key generation:
(1) The trusted authority uniformly and independently picks $m_1, \dots, m_n \in \{0,1\}^{\gamma}$. The authority then computes $M_a^i = m_i$ and $M_b^i = m_{(i \bmod n)+1}$ for each user $i$ ($i = 1, \dots, n$) and sends them to user $i$.
(2) For each dataset, using the nonce $t$, which differs in every transmission and on which all users are synchronized, user $i$ computes its secret key by
$$k_i = f_{M_a^i}(t) \oplus f_{M_b^i}(t). \tag{1}$$
b. Encryption: Denote by $x_i \in \{0,1\}^{l}$ a bit string. User $i$ encrypts it by computing
$$\hat{x}_i = x_i \oplus k_i. \tag{2}$$
c. Decryption: Denote by $\hat{x}_i$ a ciphertext of user $i$. User $i$ decrypts it by computing
$$x_i = \hat{x}_i \oplus k_i. \tag{3}$$
d. Aggregation: Anyone can decrypt the bitwise XOR of all users' plaintexts without any user's secret key by computing
$$\hat{x}_1 \oplus \hat{x}_2 \oplus \dots \oplus \hat{x}_n = x_1 \oplus x_2 \oplus \dots \oplus x_n. \tag{4}$$
Because keys are obtained from Equation (1), the total number of seeds used in all users' keys is even and each seed value is used exactly twice, so the bitwise XOR of all users' keys equals 0. As a result, the bitwise XOR of all users' ciphertexts is equal to the bitwise XOR of all users' plaintexts, which is Equation (4). In other words, this encryption algorithm is homomorphic with respect to bitwise XOR.
In this paper, the aggregator does not have any user's private key, so that it cannot decrypt any user's plaintext. Instead, it decrypts the bitwise XOR of all users' plaintexts and uses this information to solve the distinct counting problem. Therefore, when we talk about the aggregator's decryption operation, it refers to the decryption of the bitwise XOR of all users' plaintexts.
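The key generation, encryption, and aggregation steps above can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the scheme's specification: the pseudo-random function is instantiated with HMAC-SHA256 in counter mode, each key is taken to be the XOR of the PRF outputs under the user's two seeds, and the names `prf`, `deal_seeds`, `derive_key`, and `aggregate` are hypothetical.

```python
import hashlib
import hmac
import secrets
from functools import reduce


def prf(seed: bytes, nonce: int, nbytes: int) -> bytes:
    """Assumed PRF f_seed(t): HMAC-SHA256 in counter mode, truncated to nbytes."""
    out = b""
    counter = 0
    while len(out) < nbytes:
        msg = nonce.to_bytes(8, "big") + counter.to_bytes(4, "big")
        out += hmac.new(seed, msg, hashlib.sha256).digest()
        counter += 1
    return out[:nbytes]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def deal_seeds(n: int, seed_len: int = 16):
    """Authority: pick m_1..m_n; user i receives (m_i, m_{i+1}), wrapping around."""
    m = [secrets.token_bytes(seed_len) for _ in range(n)]
    return [(m[i], m[(i + 1) % n]) for i in range(n)]


def derive_key(seed_pair, nonce: int, nbytes: int) -> bytes:
    """User: key = f_{M_a}(t) XOR f_{M_b}(t) (assumed form of the key equation)."""
    m_a, m_b = seed_pair
    return xor_bytes(prf(m_a, nonce, nbytes), prf(m_b, nonce, nbytes))


def encrypt(plaintext: bytes, key: bytes) -> bytes:
    return xor_bytes(plaintext, key)


decrypt = encrypt  # XOR with the same key undoes the encryption


def aggregate(strings):
    """Bitwise XOR of all strings; needs no user key."""
    return reduce(xor_bytes, strings)


if __name__ == "__main__":
    n, nonce, length = 5, 42, 4
    seeds = deal_seeds(n)
    keys = [derive_key(s, nonce, length) for s in seeds]
    plaintexts = [secrets.token_bytes(length) for _ in range(n)]
    ciphertexts = [encrypt(p, k) for p, k in zip(plaintexts, keys)]
    # Every seed enters exactly two keys, so the keys cancel in the XOR.
    print(aggregate(ciphertexts) == aggregate(plaintexts))
```

Because every seed enters exactly two keys, the keys cancel only when all users contribute with the same nonce; a missing ciphertext would leave the aggregate masked.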

FM Sketch
An FM sketch is a data structure for probabilistic counting of distinct elements that was introduced in [9]. It is widely used in network applications, such as data dissemination [10] and probabilistic aggregation [11,12].
An FM sketch represents an approximation of a positive integer by a bit field $S = s_1, s_2, \dots, s_w$ of length $w$, where $w \geq 1$. The bit field is initialized to zero at all positions. To add an element $x$ to the sketch, it is hashed by a hash function $h$ with geometrically distributed positive integer output, where $P(h(x) = i) = 2^{-i}$. The entry $s_{h(x)}$ is then set to 1. With probability $2^{-w}$, we have $h(x) > w$, and no operation is performed in this case. A hash function with the necessary properties can easily be derived from a common hash function with equidistributed bit string output by using the position of the first 1-bit in the output string as the hash value.
According to [9], an approximation $C(S)$ of the number of distinct elements added to the sketch can be obtained by locating the end of the initial, uninterrupted sequence of ones:
$$Z(S) = \min\left(\{\, i \in \mathbb{N}_0 \mid i < w \wedge s_{i+1} = 0 \,\} \cup \{w\}\right), \tag{5}$$
$$C(S) = \frac{2^{Z(S)}}{\varphi}, \quad \varphi \approx 0.77351. \tag{6}$$
Since the variance of $Z(S)$ is quite significant, the approximation $C(S)$ in Equation (6) is not very accurate. To avoid this, a set of sketches is used to represent a single value instead of only one sketch. [9] proposes the respective technique, called Probabilistic Counting with Stochastic Averaging (PCSA). With PCSA, each element is first mapped to one of the sketches by an equidistributed hash function before being added there. If $d$ sketches are used, denoted by $S_1, \dots, S_d$, the estimate of the total number of distinct elements added is then calculated through
$$C(S_1, \dots, S_d) = \frac{d \cdot 2^{\frac{1}{d}\sum_{k=1}^{d} Z(S_k)}}{\varphi}. \tag{7}$$
However, [9] also points out that Equation (7) is rather inaccurate as long as the number of elements is below approximately $10 \cdot d$. According to [13], we modify Equation (7) in the following way:
$$C(S_1, \dots, S_d) = \frac{d \cdot \left(2^{\frac{1}{d}\sum_{k=1}^{d} Z(S_k)} - 2^{-\kappa \cdot \frac{1}{d}\sum_{k=1}^{d} Z(S_k)}\right)}{\varphi}, \quad \kappa \approx 1.75. \tag{8}$$
This alleviates the initial inaccuracies, while otherwise being asymptotically equivalent to Equation (7). PCSA with $d$ sketches yields a standard error of approximately $0.78/\sqrt{d}$ [9,14]. For many mobile sensing projects, it achieves sufficiently good approximations when the sizes of the datasets are reasonable.
The FM sketch can be merged to obtain the total number of distinct elements added to any of them by a simple bitwise OR. It is important to note that, by their construction, repeatedly combining the same sketches or adding already present elements again will not change the results, no matter how often or in which order these operations occur. This makes FM sketches ideally suited for the distinct counting scheme in mobile sensing.
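The following Python sketch illustrates the FM sketch operations described in this subsection: the geometric hash, insertion, merging by bitwise OR, and the basic PCSA estimate. It is a sketch under stated assumptions rather than the implementation used in the experiments: the hash is derived from SHA-256, a sketch is stored as an integer bit mask (bit $i-1$ holds $s_i$), the estimator is the plain stochastic-averaging formula with the Flajolet-Martin constant 0.77351, and all function names are hypothetical.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


def _digest_int(x, salt: str) -> int:
    """Equidistributed hash of x as a 256-bit integer (SHA-256 is an assumption)."""
    data = f"{salt}:{x!r}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def fm_position(x, w: int) -> int:
    """Geometric hash h(x): returns i in 1..w with probability 2**-i, or 0 if h(x) > w."""
    bits = _digest_int(x, "pos")
    for i in range(1, w + 1):
        if bits & 1:  # position of the first 1-bit
            return i
        bits >>= 1
    return 0


def add(sketch: int, x, w: int) -> int:
    """Set s_{h(x)} to 1; no operation when h(x) > w."""
    i = fm_position(x, w)
    return sketch | (1 << (i - 1)) if i else sketch


def merge(a: int, b: int) -> int:
    """Merging two sketches is a bitwise OR."""
    return a | b


def z(sketch: int, w: int) -> int:
    """Length of the initial, uninterrupted run of ones."""
    i = 0
    while i < w and (sketch >> i) & 1:
        i += 1
    return i


def pcsa_add(sketches: list, x, w: int) -> None:
    """Stochastic averaging: route x to one of the d sketches, then add it there."""
    k = _digest_int(x, "bucket") % len(sketches)
    sketches[k] = add(sketches[k], x, w)


def pcsa_estimate(sketches: list, w: int) -> float:
    """Basic PCSA estimate: d * 2**(mean Z) / PHI."""
    d = len(sketches)
    mean_z = sum(z(s, w) for s in sketches) / d
    return d * 2 ** mean_z / PHI


if __name__ == "__main__":
    w, d = 32, 64
    sketches = [0] * d
    for value in range(5000):
        pcsa_add(sketches, value, w)
        pcsa_add(sketches, value, w)  # duplicates leave the sketches unchanged
    print(round(pcsa_estimate(sketches, w)))
```

Storing each sketch as an integer makes the duplicate-insensitivity visible: re-adding an element or re-merging a sketch only ORs in bits that are already set.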

Privacy-Preserving Distinct Counting Computation
In this section, we describe the specific process of operating on the sensing data of users based on the FM sketch. Here, we employ a trick on the FM sketch to greatly reduce the overall computing time. Then the important part of PPDC, the operations of encryption and decryption (i.e., calculation based on ciphertexts), is presented in detail. After that, in Section 3.4, the correctness and the security of PPDC are discussed. Assume that the space of users' data is $[0, N-1]$ ($N \geq 2$), and $w = \log_2 N$.

Overview of PPDC
At a high level, PPDC works as follows. First, each user independently prepares his dataset and transforms the sensing data into a specific form on his smart device. Then, the trusted authority calculates and distributes a pair of key seeds to each user. Combining the key seeds with a nonce that is different in each aggregation process, each user encrypts his transformed sensing data for this process and sends the ciphertexts to the aggregator. Since all the data are encrypted and no key information is published, the aggregator has no way to decrypt the received data; thus, the privacy of users is protected. At the end of PPDC, the aggregator applies FM sketches to obtain the count of distinct elements in all users' sensing datasets based on the judgement of ciphertexts. The main idea and the essential part of PPDC are described below.

Main Idea
From Section 2.3, it is obvious that while dealing with the distinct counting problem, FM sketch cannot provide the protection of users' privacy during the transmission and calculation process. Therefore, we provide a method that expands each bit of the string to make the sketch suitable for the encryption operation, where the string is the calculating result of each user's dataset.
As mentioned above, there are $n$ users in total. In the FM sketch, the bit field $S = s_1, s_2, \dots, s_w$ of length $w$ is initialized to zero at all positions. Meanwhile, each user's sensing data is a dataset with various forms of elements, ranging in size from small to large. Assume that user $i$ ($i = 1, \dots, n$) has $L_i$ elements in his sensing dataset, and let the set $\{x_l^{(i)}\}$ denote these elements. To add them into the FM sketch, PPDC determines the bit field $S$ bit by bit, from the Most Significant Bit (MSB) to the Least Significant Bit (LSB). The MSB refers to the last bit of $S$, and the LSB to the first bit.
Step 1 This step is taken by users. For a user $i$ ($i = 1, \dots, n$), every element $x_l^{(i)}$ is hashed by $h$ into a $w$-bit string $r_l^{(i)}$, where $(r_l^{(i)})_j$ is the $j$th bit in the string and the probability is $P((r_l^{(i)})_j = 1) = 2^{-j}$. Then a bitwise OR operation is taken to get
$$\Lambda_i = r_1^{(i)} \vee r_2^{(i)} \vee \dots \vee r_{L_i}^{(i)}. \tag{9}$$
The string $\Lambda_i = r_1^i, \dots, r_w^i$ represents all elements in user $i$'s original dataset. According to the merging property described in Section 2.3, the sketch over all users' data is
$$S = \Lambda_1 \vee \Lambda_2 \vee \dots \vee \Lambda_n. \tag{10}$$
However, $S$ should not be calculated in this straightforward way. According to Equation (9), if the aggregator could receive $\Lambda_i$ directly, $\Lambda_i$ would reveal the original data of user $i$, especially when the size of his dataset is small. Therefore, a series of operations should be carried out on $\Lambda_i$.
Step 2 This step is also done on the user's side. The user i operates on each bit of the string Λ i in order to avoid any damage caused on the privacy. In PPDC, we design a kind of specific coding scheme for these bits. Let T(r i j ) denote the corresponding code of r i j in the coding scheme, where j = 1, ..., w.
The coding scheme is defined as follows:
$$T(r_j^i) = \begin{cases} 0^q, & \text{if } r_j^i = 0, \\ T(r_j^i) \stackrel{\$}{\leftarrow} \{0,1\}^q \setminus \{0^q\}, & \text{if } r_j^i = 1, \end{cases} \tag{11}$$
where $q \in \mathbb{N}$ is the accuracy controlling parameter and $\stackrel{\$}{\leftarrow}$ denotes sampling uniformly at random. Figure 2 shows an example of how user 1 processes his original dataset.
Step 3 The aggregator takes this step after aggregating all users' coded data. Let $G(j) = T(r_j^1) \oplus \dots \oplus T(r_j^n)$, where $\oplus$ is the bitwise XOR operation. A judgement rule, corresponding to the coding scheme (11), then determines each bit of the FM sketch $S$. We define the rule as follows:
$$s_j = \begin{cases} 0, & \text{if } G(j) = 0^q, \\ 1, & \text{otherwise}, \end{cases} \tag{12}$$
where $s_j$ is the $j$th-LSB, or $(w-j)$th-MSB, in the bit field $S$. Notice that when PPDC judges each bit in the FM sketch, it proceeds from the MSB to the LSB.
Step 4 The calculation work is done by the aggregator. Based on the FM sketch $S$, the aggregator can get a significant parameter $Z(S)$, the position of the last bit in $S$ that is 1, according to Equation (5). As mentioned in Section 2.3, the approximation of distinct counting needs several more FM sketches in which the hash functions are different. After taking Step 1 to Step 3 $d$ times and according to Equation (8), the aggregator obtains the estimated number of distinct elements.

Remark 1. In Step 2, it is worth noting that, when a $q$-bit string is sampled uniformly at random, there is a probability of $1/2^q$ that $r_j^i$ equals 1 but $T(r_j^i)$ is coded as $0^q$. Thus, in Equation (11), the coding scheme requires that if this situation happens, $r_j^i$ is recoded until $T(r_j^i)$ is not $0^q$. In that step, each bit of the string $\Lambda_i$, from MSB to LSB, is expanded into a $q$-bit string $T(r_j^i)$ by our coding scheme (11), which makes it suitable for the encryption operation.

Remark 2. Notice that, in order to reduce the computing time, we employ a trick throughout the process: PPDC determines the bits of the FM sketch $S$ from the last bit to the first bit. When the aggregator applies FM sketches, the purpose is to find the position of the last bit in $S$ that is 1 and to use it as an index. When PPDC starts searching from the last bit of $S$, this amounts to finding the first bit it encounters that is 1. This transformation means that the aggregator does not have to determine all bits in $S$; after all, what the aggregator needs is the index for calculating the number of distinct elements in the dataset rather than the whole of $S$. Through this trick, PPDC can leave out many computing steps, thus improving the efficiency.
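The coding scheme, the judgement rule, and the MSB-first scan described in Steps 2 and 3 and in the remarks can be sketched in Python as follows. Encryption is deliberately left out here so that only the coding and judging logic is visible. The representation is an assumption made for illustration: each user's string $\Lambda_i$ is an integer whose bit $j-1$ holds $r_j^i$, each code $T(r_j^i)$ is an integer below $2^q$, and the function names are hypothetical.

```python
import secrets
from functools import reduce


def encode_bit(r: int, q: int) -> int:
    """T(r): 0^q when r = 0, otherwise a uniformly random non-zero q-bit string."""
    if r == 0:
        return 0
    # Uniform over 1 .. 2^q - 1, equivalent to resampling until the code is non-zero.
    return secrets.randbelow((1 << q) - 1) + 1


def judge(codes) -> int:
    """s_j = 0 if the XOR of all users' codes is 0^q, and 1 otherwise."""
    g = reduce(lambda a, b: a ^ b, codes, 0)
    return 0 if g == 0 else 1


def highest_set_position(lambdas, w: int, q: int) -> int:
    """Scan j = w, w-1, ..., 1 and stop at the first j judged to be 1 (0 if none)."""
    for j in range(w, 0, -1):
        codes = [encode_bit((lam >> (j - 1)) & 1, q) for lam in lambdas]
        if judge(codes):
            return j  # the remaining lower positions never need to be exchanged
    return 0


if __name__ == "__main__":
    # Two users; only the first has a 1 at position 3, which is the highest set bit.
    print(highest_set_position([0b0101, 0b0011], w=8, q=16))
```

When several users hold a 1 at the same position, their random codes can cancel under XOR, so a position may be misjudged as 0 with small probability; this is the error that the parameter $q$ controls.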

Privacy-Preserving Distinct Counting Scheme
In Section 3.2, we calculated the number of distinct counting through PPDC. The specific operation towards users' data is prepared for the homomorphic encryption to protect users' privacy. In this subsection, we highlight the modules of encryption and decryption(i.e., calculation based on ciphertexts) in PPDC that allow the aggregator to solve the distinct counting problem and to avoid acquiring each user's data privacy at the same time. In our assumption, there is a trusted authority as a third party who helps users and the aggregator to establish a key system each time.
(1) Setup. The protection mechanism of PPDC is based on the bitwise XOR homomorphic encryption introduced in Section 2.2. The trusted authority privately holds $m_1, \dots, m_n \in \{0,1\}^{\gamma}$ and computes $M_a^i = m_i$ and $M_b^i = m_{(i \bmod n)+1}$ for each user $i$ ($i = 1, \dots, n$). The two seeds are then sent to the corresponding user. User $i$ performs a bitwise XOR operation on the seeds together with a nonce $t$ according to Equation (1) to obtain his own key $k_i$. Notice that the nonce $t$ used for calculating $k_i$ is different in each transmission.
(2) Encrypt. The data encryption is performed on the user's side. User $i$ regards the code string $T(r_j^i)$ for the $j$th bit of his data representation $\Lambda_i$ as the plaintext and encrypts it with the bitwise XOR homomorphic encryption algorithm to get the ciphertext
$$\hat{T}(r_j^i) = T(r_j^i) \oplus k_i, \tag{13}$$
where user $i$'s key $k_i$ is generated as introduced above. Then the ciphertext $\hat{T}(r_j^i)$ is sent to the aggregator as user $i$'s sensing data.
(3) Aggregate. The aggregator collects all $n$ users' data about the $j$th-LSB and then performs the bitwise XOR computation. Denote by $\hat{G}(j)$ the resulting bit string. According to Equation (4), it follows that
$$\hat{G}(j) = \hat{T}(r_j^1) \oplus \dots \oplus \hat{T}(r_j^n) = T(r_j^1) \oplus \dots \oplus T(r_j^n). \tag{14}$$
It is easy to see that if the $j$th-LSBs of the $n$ users are all 0, then $\hat{G}(j)$ is always a $q$-bit string of 0s. If there is any user whose data $\Lambda$ is 1 on the $j$th bit, the bitwise XOR of all reports' corresponding strings is not a $q$-bit string of 0s with a probability of $1 - 1/2^q$. However, this situation has little influence on the accuracy of PPDC, which will be proved in Section 3.4.
(4) Judge. Just like the rule mentioned above, we define the rule as follows:
$$s_j = \begin{cases} 0, & \text{if } \hat{G}(j) = 0^q, \\ 1, & \text{otherwise}, \end{cases} \tag{15}$$
where $s_j$ is the $j$th-LSB in the bit field $S$.
Figure 3 shows a detailed example of aggregation in the FM sketch using the transformation described above and the corresponding bitwise XOR computations. The formal description of our entire scheme is shown in Algorithm 1.
Remark 3. The operation of homomorphic encryption causes no loss of accuracy in PPDC. According to the property of the bitwise XOR homomorphic encryption, $\hat{G}(j) = \hat{T}(r_j^1) \oplus \dots \oplus \hat{T}(r_j^n) = T(r_j^1) \oplus \dots \oplus T(r_j^n) = G(j)$. Thus, Equation (15) is equal to Equation (12), which means that the calculated aggregation result is not influenced by the encryption and decryption operations that protect users' privacy. The final result of the distinct counting problem is then calculated by Step 4 in Section 3.2.
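One Setup, Encrypt, Aggregate, and Judge round for a single bit position can be sketched in Python as below. This is a minimal illustration under assumptions that go beyond the text: the pseudo-random function is HMAC-SHA256 truncated to $q$ bits, a key is the XOR of the PRF outputs under the user's two seeds, codes and keys are integers below $2^q$, and the names `deal_seeds`, `derive_key`, and `run_round` are hypothetical.

```python
import hashlib
import hmac
import secrets
from functools import reduce


def prf_int(seed: bytes, nonce: int, q: int) -> int:
    """Assumed PRF: HMAC-SHA256(seed, nonce) truncated to q bits (q <= 256)."""
    digest = hmac.new(seed, nonce.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - q)


def deal_seeds(n: int):
    """Setup: the authority gives user i the seed pair (m_i, m_{i+1}), wrapping around."""
    m = [secrets.token_bytes(16) for _ in range(n)]
    return [(m[i], m[(i + 1) % n]) for i in range(n)]


def derive_key(seed_pair, nonce: int, q: int) -> int:
    """User key for this nonce: XOR of the PRF outputs under both seeds (assumed form)."""
    m_a, m_b = seed_pair
    return prf_int(m_a, nonce, q) ^ prf_int(m_b, nonce, q)


def xor_all(values) -> int:
    return reduce(lambda a, b: a ^ b, values, 0)


def run_round(bits, seeds, nonce: int, q: int):
    """One round for a fixed position j, where bits[i] is user i's bit at that position.

    Returns the ciphertexts seen by the aggregator and the judged bit s_j.
    """
    # Encrypt (user side): code the bit, then mask the code with the user's key.
    codes = [0 if r == 0 else secrets.randbelow((1 << q) - 1) + 1 for r in bits]
    keys = [derive_key(s, nonce, q) for s in seeds]
    ciphertexts = [c ^ k for c, k in zip(codes, keys)]
    # Aggregate (aggregator side): the keys cancel, leaving the XOR of the codes.
    g_hat = xor_all(ciphertexts)
    assert g_hat == xor_all(codes)
    # Judge: an all-zero aggregate means the bit is 0.
    return ciphertexts, (0 if g_hat == 0 else 1)


if __name__ == "__main__":
    seeds = deal_seeds(5)
    _, s_zero = run_round([0, 0, 0, 0, 0], seeds, nonce=1, q=16)
    _, s_one = run_round([0, 1, 0, 0, 0], seeds, nonce=2, q=16)
    print(s_zero, s_one)
```

A fresh nonce is used for each round so that the same key never masks two different codes; the internal assertion checks the homomorphic property that the aggregate of ciphertexts equals the aggregate of codes.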

Remark 4. The correctness and security of PPDC are credible. According to Equation (10), we have
$$s_j = r_j^1 \vee r_j^2 \vee \dots \vee r_j^n. \tag{16}$$
In PPDC, Equation (15) is equal to Equation (16) with a probability of $1 - 1/2^q$, which means that the operations in PPDC have nearly no effect on the final result when the parameter $q$ is appropriate. We will prove it in Theorem 1. The security of PPDC will later be formally proved in Theorem 2.

Input:
{x_l^i}: User i's dataset with various elements, l ∈ [0, L_i], x_l^i ∈ [0, N − 1], i = 1, ..., n;
h: A hash function with a w-bit string output;
d: The number of FM sketches;
M_a^i and M_b^i: Two secret seeds of User i;
t ∈ [0, 2^v − 1]: A publicly known nonce;
q ∈ N: An accuracy controlling parameter.
1: for k = 1 to d do
2: for i = 1 to n do
3: User i:
end for
6: for j = w to 1 do
7: for i = 1 to n do
8: User i: T(r_j^i) ← r_j^i in Λ_i, len(T(r_j^i)) = q;

Scheme Analysis
We present an analysis of PPDC in terms of correctness and security.
Theorem 1 (Correctness). The probability that Equation (15) is equal to Equation (16) is greater than or equal to $1 - 1/2^q$. The correctness of PPDC is greater than or equal to $1 - (w/2^q)^d$.
Proof. According to the definition in Section 3.3, it is obvious that the sketch constructed by Equation (10) is the correct result of our problem. Equation (16) is one of the $w$ mutually independent parts of Equation (10) and determines the $j$th-LSB. In PPDC, Equation (15) represents the result. Equation (15) is equal to Equation (16) with a probability of $1 - 1/2^q$ in the calculation.
With a regular bitwise OR, the result bit is 0 only when all the input bits at that position are 0, that is, $0 \vee \dots \vee 0 = 0$. Otherwise, that bit should be 1. If our scheme were 100 percent accurate, $s_j = 0$ would mean that $G(j)$ is $0^q$ and that all users' $r_j^i$ are 0. However, there is a special case in which a user $y$ has $r_j^y = 1$ while $y$'s encrypted bit string equals the bitwise XOR of all other users' encrypted strings. Since the encoding function in our scheme is random and the encoding string has $2^q$ different choices, the probability that the result of our scheme is inaccurate is $1/2^q$. Therefore, we have:
$$P(\text{the } j\text{th bit is accurate}) = P(\text{all users' } j\text{th bits are } 0) \times 1 + P(\text{any user's } j\text{th bit is } 1) \times \left(1 - \frac{1}{2^q}\right) \geq 1 - \frac{1}{2^q}.$$
Since Equation (15) has to be independently calculated $w$ times to achieve the goal of Equation (10) and there are $d$ FM sketches used, it is obvious that the correctness of PPDC is greater than or equal to $1 - (w/2^q)^d$.
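To make the stated bounds concrete, the short Python sketch below simply evaluates the per-bit bound $1 - 1/2^q$ and the overall bound $1 - (w/2^q)^d$ for given parameters, and searches for the smallest $q$ that keeps the error term $(w/2^q)^d$ under a target. The function names and the example parameters ($w = 32$, $d = 4$, target $10^{-3}$) are illustrative assumptions, not values taken from the experiments.

```python
def per_bit_bound(q: int) -> float:
    """Lower bound on the probability that one judged bit is accurate."""
    return 1 - 1 / 2 ** q


def correctness_bound(w: int, q: int, d: int) -> float:
    """Stated lower bound on the correctness of PPDC: 1 - (w / 2^q)^d."""
    return 1 - (w / 2 ** q) ** d


def min_q(w: int, d: int, eps: float) -> int:
    """Smallest q such that the error term (w / 2^q)^d is at most eps."""
    q = 1
    while (w / 2 ** q) ** d > eps:
        q += 1
    return q


if __name__ == "__main__":
    w, d = 32, 4
    q = min_q(w, d, 1e-3)
    # For q = 8: (32 / 256)^4 = 0.125^4, about 2.4e-4, which is below 1e-3.
    print(q, correctness_bound(w, q, d))
```

For small $q$ the ratio $w/2^q$ exceeds 1 and the bound becomes vacuous, so the bound is only informative once $2^q$ is larger than $w$.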
In most cases, a malicious aggregator in a privacy-preserving mobile sensing scheme only has knowledge of ciphertexts. This corresponds to a ciphertext-only attack, carried out by an attacker of minimal capability. However, we analyze the security of PPDC under a stronger model, the known-plaintext model, which assumes that attackers may obtain a certain number of plaintext-ciphertext pairs through extra channels. The security of the proposed PPDC is summarized in the following theorem.
Theorem 2 (Security). For the homomorphic operations in PPDC, there is no probabilistic polynomial time (P.P.T.) adversary that can break the data confidentiality of user's data under the known plaintext model.

Proof.
We prove that no extra knowledge is revealed to the aggregator in PPDC. We consider the situation in which the aggregator can acquire the most information. To calculate the final result, all $w$ bits in $S$ have to be determined, which means that there are $w$ rounds of communication between each user and the aggregator, and in each round the aggregator gets $n$ ciphertexts from all users. Let $I = (I_1, \dots, I_w)$ denote the information the aggregator receives from all users over the $w$ rounds, where $I_j = (\hat{T}(r_j^1), \dots, \hat{T}(r_j^n))$ ($j = 1, \dots, w$) consists of the ciphertexts of the $q$-bit strings from all users that decide the $j$th bit in $S$. The aggregator calculates $\hat{G}(j)$ according to Equation (14) and then determines $s_j$ in $S$ by Equation (15).
If the result is $s_j = 1$, then $\hat{G}(j) \neq 0^q$, from which the aggregator can infer that in $I_j$ there is at least one bit string from some user whose plaintext $T$ is not $0^q$. The aggregator wants to infer $T$ from $\hat{T}$. Since the aggregator has no corresponding key, the probability that he guesses the correct plaintext is $1/2^q$. Under the known-plaintext model, the aggregator could get $a$ plaintext-ciphertext pairs and calculate $a$ keys. Then, the probability of knowing a certain user's original data rises to $1/A_n^a$. However, the fact that users' keys are different each time leads to a low probability, and the attacking time O(nq) is over P.P.T. Therefore, the aggregator cannot infer any extra knowledge about the users.
If $s_j = 0$, then either the plaintexts in $I_j$ are all $0^q$, or some bit strings from $\{0,1\}^q \setminus \{0^q\}$ happen to cancel to $0^q$ under the XOR operation. However, the aggregator cannot distinguish these two situations by calculation in P.P.T.
Moreover, because the keys are pseudo-random for each user and each time, I 1 , ..., I w are independent. As a result, there is no probabilistic polynomial time (P.P.T.) adversary that can break the data confidentiality of a user's data under known plaintext model. PPDC ensures the privacy protection of users.

Performance Evaluation
In this section, we conduct experiments to evaluate the accuracy of PPDC as well as its efficiency compared with a baseline that lacks privacy protection. In our experiments, the schemes are implemented in Python. The sensing data of users are generated randomly by the Python programs from a uniform distribution, where we use different integer values to represent users' data.

Accuracy Evaluation of PPDC
In PPDC, q is an accuracy controlling parameter: the length of the encoded bit string in Step 2 of Section 3.2. Figures 4 and 5 show the relationship between the value of q and the accuracy rate as the length w of the FM sketch changes, where the total numbers of users are n = 15,000 and n = 25,000, respectively. It can be concluded that, whatever the value of w, the accuracy rate gradually increases to nearly 100% as the value of q approaches w, which accords with the theoretical analysis above. At the same time, comparing Figures 4 and 5, we can see that the trend of PPDC's accuracy rate is not affected by the number of users but depends on the relationship between the values of q and w. As the difference between q and w shrinks, the total accuracy gets much closer to 100%. This reflects that, as long as the parameters of PPDC are set appropriately, PPDC can be applied in mobile sensing situations regardless of the scale of users.
Since there is a significant error in applying only one FM sketch, d, the number of FM sketches, must be discussed. From Figure 6, it is observed that the error rate of the estimated data decreases dramatically as the number of repetitions increases in the beginning, and then stays relatively stable after a specific threshold, such as d = 4 in this experiment. For datasets of different sizes, the threshold will be different.
We set up experiments with different numbers of users participating in the program. The number of distinct elements in each dataset is independent and irregular, since the elements that represent users' sensing data are generated entirely randomly. The values calculated by PPDC are contrasted with the true values in Figure 7. There is a difference between the two values, and as the size of the dataset grows, the overall difference tends to decrease but still fluctuates. The fluctuation is associated with the result of the FM sketch, which has a close relationship with the multiples of 2.
According to the results in Figure 7, we calculate the accuracy rate of PPDC, presented in Figure 8, to evaluate the correctness of PPDC more intuitively. The accuracy rate of PPDC rises gradually towards 100% as the size of the datasets increases, and even in the case of a small dataset it still reaches 97%. When the number of users is huge and the corresponding cardinality is large, PPDC performs much better.

Efficiency Evaluation of PPDC
After confirming the accuracy of PPDC through a set of experiments, we explore efficiency of PPDC by testing the communication cost and computing time.
(1) Communication Overhead. Table 1 shows the comparison of communication cost between the baseline method without privacy protection and PPDC. In Table 1, the total bits sent by a user, as the communication cost of a user, and the total bits received by the aggregator, as the communication cost of the aggregator, are the measured standards, as well as the computation complexity and round complexity of two schemes. The mentioned parameters include: n which is the total number of users, and the range of users' data is [0, N − 1], and w = log 2 N is the length of each user's bit string, and d is the number of FM sketches we applied. As the proof of Theorem 1 shows, 1 − (w/2 q ) d is the upper limit of the correctness of PPDC. When q is approximately equal to w and not too small, the error rate of PPDC will decrease to an acceptable level (for example, less than 0.001). Meanwhile, the communication cost of PPDC affected by q would also be reduced. Besides, PPDC can send or receive less data than the baseline method when n is not greater than N, i.e., n = O(N).
However, the total communication cost is also influenced by the round complexity, which refers to the amount of time a user has to keep communicating online. The baseline method needs only one round of communication, which is its most significant advantage. Therefore, PPDC performs better when the network connection is stable, whereas the baseline method is more suitable when the connection is unreliable.
(2) Computation Overhead. We now discuss the computing time spent during the whole process. For PPDC, the computing time includes the time for hashing, coding, and encryption for each user, and the time for the aggregator to decrypt, determine the final FM sketch S, and calculate the result. Note that 'decryption' here is the decryption of the bitwise XOR of all users' plaintexts, which guarantees the protection of each user's privacy. For comparison, Figure 9 also shows the computing time of distinct counting without privacy protection. PPDC takes more time to calculate the result, because it involves additional steps such as encoding, encrypting, decrypting, and formula calculation. However, this cost is within an acceptable range: as Figure 9 shows, the computing time grows more slowly than linearly as the dataset expands. Moreover, because the distinct counting problem has been solved, subsequent operations on the aggregated dataset avoid the resource consumption of repeated processing. Furthermore, when the dataset is huge, the probability that the index Z(S) approaches the end of the FM sketch S is high. With the optimization described in Section 3.2, the computing time decreases accordingly. Therefore, on the whole, PPDC does not waste computing time.
This shows that the efficiency of PPDC is suitable for large-scale data aggregation. We also assessed the computing time of PPDC under different factors. First, the value of q affects the computing time. Figure 10 shows how the computing time of PPDC varies with q. A larger q leads to a longer computing time, and the gap between adjacent values of q widens as the dataset grows. Therefore, an appropriate value of q is needed. Next, we evaluate the computing time spent in each step of PPDC. The results are presented in Figure 11. The encryption and decryption step, which is the most important part of preserving users' privacy in PPDC, is the most costly compared with the hashing and coding steps. This result is reasonable given the bitwise XOR operations involved in encryption and decryption. As the dataset grows, the increase in encryption and decryption time slows down, since PPDC employs the optimization on FM sketches. In addition, the curves of all steps in Figure 11 grow more slowly than linearly, which is consistent with the result in Figure 9.
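For reference, the baseline computation without privacy protection can be sketched as a plain FM sketch estimator. This is a minimal illustration of the standard Flajolet-Martin technique, not the implementation measured in Figure 9. It assumes that Z(S) denotes the index of the lowest zero bit of S, uses SHA-256 as a stand-in hash function, and averages Z(S) over d independent sketches. All function names are hypothetical.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def lowest_set_bit(value: int, width: int) -> int:
    """Index of the lowest 1-bit of value, or width - 1 if value is 0."""
    if value == 0:
        return width - 1
    return (value & -value).bit_length() - 1


def fm_bitmap(items, seed: int, width: int = 32) -> int:
    """Build one FM sketch S (as an integer bitmap) for the given items."""
    bitmap = 0
    mask = (1 << width) - 1
    for item in items:
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        h = int.from_bytes(digest[:8], "big") & mask
        bitmap |= 1 << lowest_set_bit(h, width)
    return bitmap


def lowest_zero(bitmap: int, width: int = 32) -> int:
    """Z(S): index of the lowest 0-bit of the sketch (width if none)."""
    for i in range(width):
        if not (bitmap >> i) & 1:
            return i
    return width


def fm_estimate(items, d: int = 64, width: int = 32) -> float:
    """Estimate the number of distinct items from d seeded sketches."""
    items = list(items)
    total = sum(lowest_zero(fm_bitmap(items, seed, width), width)
                for seed in range(d))
    return 2 ** (total / d) / PHI


if __name__ == "__main__":
    stream = list(range(1000)) * 3  # 1000 distinct values, each repeated
    print(fm_estimate(stream))
```

Because each item only sets a bit that depends on its hash, repeated elements leave the sketch unchanged, which is why the estimate depends on the number of distinct elements rather than on the stream length.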

Related Work
Recently, both mobile sensing applications and the distinct counting problem have been discussed from a variety of aspects [7,15–18]. However, existing works usually address either the privacy protection of users or distinct counting, without considering both at the same time, and therefore do not yield a secure and resource-saving system.

Privacy Preserving in Mobile Sensing Applications
In terms of mobile sensing applications, most works focus on methods for protecting a user's privacy or on operations over the aggregated data, e.g., [19–23].
Both [15] and [24] consider the protection of users' privacy and then seek the minimum value in the aggregated dataset. They study how an untrusted aggregator in mobile sensing can periodically obtain desired statistics over the data contributed by multiple mobile users without compromising the privacy of each user. The scheme in [15], which is based on [24], utilizes the redundancy in security to decrease the communication cost caused by users joining and leaving. The protocol traverses the entire data space to find the minimum value on the basis of summation protocols rather than bitwise XOR operations.
Xiong et al. [25] propose a scheme for mobile crowdsensing services. In this scheme, a differential privacy mechanism allows different users to add noise to their data, homomorphic encryption then protects the sensing data, and finally the ciphertext is uploaded to the mediator, who is able to obtain the collection of ciphertexts of the sensing data without actual decryption. However, the cost of transmission and evaluation is non-negligible.
Miao et al. [26] describe a lightweight framework for mobile crowd sensing systems. The framework protects each participating worker's sensory data and reliability information while introducing little overhead to the workers. It involves two non-colluding cloud platforms and adopts an additively homomorphic cryptosystem; on the workers' side, the data are perturbed with random numbers rather than directly encrypted before uploading. This method, however, places more requirements on the participating clouds, which are assumed to be non-colluding but able to communicate with each other.
In [27], Zhang and Chen propose semi-honest protocols to calculate the minimum and kth minimum values in mobile sensing systems. The data can be a time series. By using probabilistic coding schemes and a cipher system, they construct two protocols that allow homomorphic bitwise XOR computations for their problems. The homomorphic bitwise XOR algorithm ensures privacy during the whole process. However, as the number of interactions increases, the number of bits sent or received by users and the aggregator grows considerably.
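The XOR-homomorphic property that such protocols rely on can be illustrated with a one-time-pad style cipher: XOR-ing the ciphertexts yields an encryption of the XOR of the plaintexts under the XOR of the keys. The sketch below is a generic illustration under that assumption. It is not the cipher system of [27] or of PPDC, the function names are hypothetical, and how the aggregator obtains the combined key is deliberately left out.

```python
from functools import reduce
from operator import xor


def xor_encrypt(message: int, key: int) -> int:
    """One-time-pad style encryption: ciphertext = message XOR key."""
    return message ^ key


def xor_all(values) -> int:
    """Bitwise XOR of all values (0 for an empty sequence)."""
    return reduce(xor, values, 0)


def decrypt_aggregate(ciphertexts, keys) -> int:
    """Recover the XOR of all plaintexts from the ciphertexts.

    Assumption: the decryptor knows only the XOR of all keys, so it
    learns the aggregate but not any individual plaintext.
    """
    combined_key = xor_all(keys)
    return xor_all(ciphertexts) ^ combined_key


if __name__ == "__main__":
    plaintexts = [0b1010, 0b0110, 0b0001]  # illustrative user bit strings
    keys = [0b1111, 0b0101, 0b0011]        # illustrative per-user keys
    ciphertexts = [xor_encrypt(m, k) for m, k in zip(plaintexts, keys)]
    assert decrypt_aggregate(ciphertexts, keys) == xor_all(plaintexts)
```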

Distinct Counting
On the other hand, distinct counting is also of interest to many researchers. However, it is mainly discussed in the context of Vehicular Ad Hoc Networks, e.g., [5,7], rather than in a more general mobile sensing scenario. Meanwhile, many studies [5,28,29] adopt different algorithms, including the FM sketch, to solve the distinct counting problem.
In [6], Wangle et al. present a self-adaptive algorithm, Self-Adaptive LogLog, which is based on Refined LogLog and adapts automatically to cardinalities of different scales. They focus on the accuracy of the method, while the application scenario is not clear and the collection process is rarely taken into consideration.
Considine et al. [30] use FM sketches to accomplish robust in-network aggregation in sensor networks, a setting that is assumed to suffer from packet loss or node failures. They consider the coordinated collection of information towards a sink in the sensor network. However, the security problem is overlooked during the entire aggregation process. In [31], Tao et al. integrate the FM sketch with spatio-temporal indexes to answer the question: "How many objects were in region x over the time interval t?" Like [30], the work in [31] does not address the privacy protection of user data.
In [32], Zekri et al. propose an event-exchanging and data-gathering scheme based on FM sketches in vehicular networks. The sketches can be exchanged without loss of information and can be insensitive, so as to allow manipulating the same physical repository for all vehicles. Their aggregation structure reduces the time needed to actually find a parking space and increases the percentage of vehicles finding such a resource in a bounded time in congested situations.
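The reason FM sketches can be exchanged and combined freely is that the standard FM sketch is duplicate insensitive: merging two sketches with a bitwise OR gives exactly the sketch of the union of the underlying data, however often an element was reported. The following minimal sketch illustrates this general property. It is not the scheme of [32], and the SHA-256 hash and function names are assumptions made for illustration.

```python
import hashlib


def fm_bitmap(items, width: int = 32) -> int:
    """Build a single FM sketch (integer bitmap) over the given items."""
    bitmap = 0
    mask = (1 << width) - 1
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big") & mask
        # Position of the lowest 1-bit of the hash (width - 1 if h is 0).
        low = (h & -h).bit_length() - 1 if h else width - 1
        bitmap |= 1 << low
    return bitmap


def merge(*bitmaps: int) -> int:
    """Combine sketches from several sources with a bitwise OR."""
    merged = 0
    for bitmap in bitmaps:
        merged |= bitmap
    return merged


if __name__ == "__main__":
    vehicle_a = ["slot-1", "slot-2", "slot-3"]  # hypothetical observations
    vehicle_b = ["slot-3", "slot-4"]            # overlaps with vehicle_a
    combined = merge(fm_bitmap(vehicle_a), fm_bitmap(vehicle_b))
    # Merging equals sketching the union, so overlaps are not double counted.
    assert combined == fm_bitmap(vehicle_a + vehicle_b)
```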
Han et al. [7] propose a secure data aggregation scheme for Vehicular Ad Hoc Networks based on the FM sketch. They also consider security, but their goal is to enable the traffic monitoring center to verify whether an aggregate sensing report is correct. Their notion of security thus concerns the aggregated result rather than the privacy protection of users.
Remark 5. In general, few works have studied both the privacy protection of users and distinct counting for mobile sensing simultaneously. Our scheme, PPDC, which integrates the idea of homomorphic encryption into the FM sketch, provides a useful solution to this research problem. PPDC not only reduces the consumption of storage space and other resources by calculating the number of distinct elements, but is also practical because it guarantees user privacy, which makes it worth studying and applying in practice.

Conclusions
Both the privacy protection of users and the distinct counting problem on large datasets are essential issues in mobile sensing applications. In this paper, we propose a privacy-preserving distinct counting scheme, PPDC, to solve these two problems simultaneously. PPDC expands each bit of the hash values of the users' original data, so that the FM sketch is enhanced for encryption to protect user privacy. We adopt bitwise XOR as the encryption algorithm. According to the theoretical analysis, PPDC causes little damage to the accuracy of the FM sketch. Moreover, a set of experiments demonstrates that, with appropriate parameter values, PPDC achieves high counting accuracy and practical efficiency with scalability over large-scale datasets.
For future work, we aim to extend PPDC to support the screening of invalid users and wrong data, so as to make the aggregated conclusions more accurate and to facilitate subsequent analysis operations.