A Lightweight and Privacy-Friendly Data Aggregation Scheme against Abnormal Data

Abnormal electricity data, caused by electricity theft or meter failure, leads to inaccurate aggregation results. These inaccurate results not only harm the interests of users but also affect the decision-making of the power system. However, existing data aggregation schemes do not consider the impact of abnormal data, and filtering it out remains a challenge. To solve this problem, we propose a lightweight and privacy-friendly data aggregation scheme against abnormal data, in which valid data is aggregated correctly while abnormal data is filtered out during the aggregation process. Because it adopts lightweight matrix encryption, the scheme is well suited to resource-limited smart meters. Our protocol outperforms other schemes in that it filters abnormal data automatically, without additional processes, and detects the sources of abnormal data. A detailed security analysis shows that the proposed scheme protects the privacy of users’ data. In addition, extensive simulations demonstrate that the additional computation cost of filtering abnormal data is within an acceptable range, so the proposed scheme remains efficient.


Introduction
With the application of electricity in our daily life becoming increasingly extensive, more factors need to be considered in the production decisions of the cloud server [1,2], such as how to maintain a balance between supply and demand when electricity usage changes dramatically [3]. Thus, it is critical to obtain the electricity usage data of all users. In addition, the smart grid, as a key infrastructure, adds upstream information feedback to the traditional grid, which can help us collect the electricity usage data of users in various regions [4,5]. The prominent advantage of smart meters is that they ensure that electricity supply matches the demand of users within a short period, which is of great significance for the rational distribution of power resources and the reduction of economic losses [6,7]. To obtain the real-time electricity demand of users, their electricity usage data should be measured, aggregated, and analyzed through advanced metering infrastructure [8,9].
However, it is a noteworthy problem of the smart grid that the abnormal electricity data, caused by electricity theft or meter failure, can lead to inaccurate aggregation results. This not only harms the personal interests of users, but also interferes with the production decisions of the cloud center. To the best of our knowledge, none of the existing schemes consider the impact of abnormal data. In the extant schemes, the aggregation center is responsible for aggregating all the reported electricity usage data of smart meters but cannot detect whether the reported data is abnormal, let alone find the source of the abnormal data.
Therefore, it is an important challenge to filter out the abnormal data and find its source while the data is encrypted. To address this issue, we propose a lightweight and privacy-friendly data aggregation scheme against abnormal data, in which the valid data is correctly aggregated, but the abnormal data is automatically filtered out during the aggregation process. Notably, the filtration of the abnormal data does not need additional procedures, which is the highlight of this work. Besides, compared with the methods used in other schemes, the encryption method used in our scheme is more suitable for smart meters with limited computing capacity. Specifically, the main contributions of this paper are summarized as follows:
• We propose a lightweight and privacy-friendly data aggregation scheme against abnormal data by using lightweight matrix encryption. It is suitable for smart meters with limited computing power, since no time-consuming operations are involved.
• Abnormal data can be filtered out automatically without additional procedures. In addition, the source of the abnormal data can be identified in this process. Thereby, accurate aggregation results can be obtained through the proposed scheme, and abnormal meters can be identified for maintenance, even if the data is encrypted.
• Finally, a detailed security analysis is provided to prove that our scheme can fully ensure the privacy and security of users' data. Experiments and performance evaluations demonstrate that our scheme has a low computation cost and high practicality.
The rest of the paper is outlined as follows. Section 2 reviews related work. Preliminaries are provided in Section 3. Section 4 illustrates the system model and adversary model. We present the details of our scheme in Section 5, followed by its security analysis in Section 6. A performance analysis is conducted in Section 7. Finally, Section 8 concludes the paper.

Related Work
There are many data aggregation schemes that protect users' privacy in smart grids [10][11][12][13][14][15][16][17][18][19][20][21][22][23]. Homomorphic encryption has been applied in several works to achieve privacy-preserving data aggregation [10][11][12][13][14][15][16][17][18][19]. Shen et al. [10] proposed a Paillier-based data aggregation scheme against malicious data mining attacks, which can prevent the adversary from inferring a target user's electricity usage data and obtain accurate aggregated results of electricity usage data. Xue et al. [11] proposed a privacy-preserving service-outsourcing scheme for a real-time pricing demand response in a smart grid, which solves the privacy issues by modifying the Paillier cryptosystem to hold two different decryption keys and achieves the flexible enrollment and revocation of smart meters. In addition, Saleem et al. [12] proposed a scheme to resist the malfunctioning of smart meters for data aggregation based on a modified Paillier cryptosystem. Their system can resist false data injection attacks by filtering out the values inserted by external attackers. ElGamal-based algorithms have also been used to achieve secure data aggregation [13,14]. Liu et al. [14] proposed a lifted elliptic ElGamal-based privacy-preserving data aggregation scheme, in which the trusted third party is removed and the users, with some measure of trust, construct a virtual aggregation area to mask the single user's data against the denial of service attack. In order to resist quantum attacks and improve the efficiency of the algorithm, the lattice-based homomorphic approach has been applied to achieve secure data aggregation for smart grids [15,16]. Abdallah et al. [16] proposed a lattice-based privacy-preserving data aggregation scheme for a smart grid, which can further reduce the computation burden for smart appliances, because it depends on simple arithmetic operations.
In [17], a privacy-friendly data aggregation scheme is proposed by Vahedi et al. They use elliptic curve digital signature algorithms (ECDSA) in smart grids to protect users' privacy from the grid operators. Besides, to meet the higher data analysis requirements of the cloud server, multidimensional data is aggregated in some schemes [18][19][20]. Although the schemes based on homomorphic encryption can obtain accurate aggregation results, a heavy computational and communication burden will also be imposed on smart meters with limited computing power.
Masking-based schemes, another major line of work, have also been proposed to achieve secure and efficient data aggregation in smart grids [21][22][23][24]. Gope et al. [21] first proposed a lightweight and privacy-friendly masking-based spatial data aggregation scheme for secure forecasting of power demands in smart grids. Their scheme only uses lightweight cryptographic primitives, such as exclusive OR operations and hash functions; thus, it has a significantly lower computational cost compared with other approaches. The LCEDA scheme proposed by Su et al. [22] achieves an efficient update of the masking value shares to ensure forward security of individual data, dynamic enrollment, and revocation of smart meters. Moreover, Huang et al. [23] proposed a lightweight and fault-tolerant data aggregation scheme that uses a flag bit to determine which smart meters fail to upload data on time, so that correct aggregation results can be obtained even if some smart meters do not report their data. However, the existing masking-based aggregation schemes cannot screen abnormal electricity consumption data either.
In addition, accurate aggregation results can be obtained by utilizing zero-knowledge proofs [24], but heavy communication and computational burdens will also be imposed on the smart meters with limited computing power. Thus, the solution using zero-knowledge proofs is not practical.
Therefore, we propose a lightweight and privacy-friendly data aggregation scheme against abnormal data by using matrix encryption, which can effectively filter abnormal data and find out the source of abnormal data. To more intuitively show the advantages of the proposed scheme compared with other schemes, the security feature comparisons are shown in Table 1.

Table 1. Security feature comparisons. Columns: Scheme, Data Confidentiality.

Preliminaries
In this section, the preliminaries of the proposed scheme are presented, in which we describe the basic idea of filtering abnormal data.
Filtering abnormal data: Suppose that a is the data to be determined and b is the upper limit of the normal value, and both are in the range $[0, N^2 - 1]$. Then, whether the data a is abnormal can be determined as follows [25]:

1. Construct an N × N matrix containing all possible values in $[0, N^2 - 1]$, as shown in Figure 1. Each value has a row coordinate and a column coordinate in this matrix: a value written as $iN + j$ has row coordinate $i + 1$ and column coordinate $j + 1$. Accordingly, a and b can be represented as the two-dimensional coordinates $(i_a, j_a)$ and $(i_b, j_b)$, where $i_a$, $j_a$ are the row and column coordinates of a, and $i_b$, $j_b$ are the row and column coordinates of b. They can be computed from the following formulae:

$$i_a = \lfloor a/N \rfloor + 1, \qquad j_a = (a \bmod N) + 1 \tag{1}$$

$$i_b = \lfloor b/N \rfloor + 1, \qquad j_b = (b \bmod N) + 1 \tag{2}$$

Figure 1. Representation of the constructed matrix.

2. Based on $(i_a, j_a)$, we can construct three N-dimensional column vectors for a from the following components: $\mathbf{0}_{i_a}$ denotes an $i_a$-dimensional zero vector, $\mathbf{1}_{N-i_a}$ denotes an $(N - i_a)$-dimensional vector whose elements are all 1, and $\mathbf{e}_{i_a}$ denotes an N-dimensional unit vector whose $i_a$-th element is 1. In this way, the comparison between a and b is transformed into a matrix relation between X, built from the vectors of a, and a matrix Q, built from b, whose remaining elements are 0. The judgment on whether data a is abnormal can thus be transformed into the equality test $XQX^T = 1$ or $0$. Specifically, $XQX^T = 1$ is equivalent to a being less than or equal to b, where b is the upper limit of the normal value we set, which means that the data a is normal. Conversely, $XQX^T = 0$ means that a exceeds b and is therefore abnormal.
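
The coordinate decomposition in step 1 and the resulting 0/1 test can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the exact construction of the scheme: the helper names (`to_coords`, `unit`, `threshold_matrix`, `is_normal`) are hypothetical, coordinates are 0-based rather than 1-based, and the matrix built here is a plain N × N indicator matrix rather than the matrix Q described above.

```python
def to_coords(value, n):
    """Split a value in [0, n*n - 1] into 0-based (row, col) with value = row*n + col."""
    if not 0 <= value < n * n:
        raise ValueError("value out of range")
    return divmod(value, n)


def unit(index, n):
    """n-dimensional unit vector with a 1 at position index."""
    return [1 if k == index else 0 for k in range(n)]


def threshold_matrix(b, n):
    """Indicator matrix (illustrative stand-in): entry (i, j) is 1 exactly when i*n + j <= b."""
    return [[1 if i * n + j <= b else 0 for j in range(n)] for i in range(n)]


def is_normal(a, b, n):
    """Evaluate the bilinear form u^T B v, which is 1 when a <= b and 0 otherwise."""
    i_a, j_a = to_coords(a, n)
    u, v = unit(i_a, n), unit(j_a, n)
    big_b = threshold_matrix(b, n)
    return sum(u[i] * big_b[i][j] * v[j] for i in range(n) for j in range(n))
```

With N = 4 and an upper limit b = 9, `is_normal(6, 9, 4)` evaluates to 1 and `is_normal(11, 9, 4)` evaluates to 0, so the comparison reduces to a product of vectors and a matrix that depends only on b.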

System Model
In this section, we will introduce the system model and the adversary model of the proposed scheme.

System Model
The system model of our scheme is shown in Figure 2. It consists of three entities: smart meters (SM), the aggregation center (AC), and the cloud server (CS).

Adversary Model
In this scheme, we assume that:
• Users may not only try to steal electricity by compromising smart meters, but also be interested in the privacy of other users' electricity usage data. In addition, there may be cases where the meter fails and reports abnormal electricity consumption data.
• AC and CS are semi-honest. This means that the two entities will honestly execute the proposed protocol and do not tamper with the computational results, but they may attempt to learn as much as possible about individual electricity usage data. Besides, AC and CS will not collude with each other.
• Any probabilistic polynomial-time adversary can intercept the channels between SMs and AC and the channels between AC and CS to obtain the reported data.
Other security issues are beyond the scope of our scheme.

Security Goals and Functionality
On the basis of the system model and adversary model above, our system should satisfy the following security goals and functionality requirements.
• Data privacy: Because the data reported by the electricity meters is closely linked to the users' daily habits and household situations, the proposed scheme should ensure that the privacy of users' electricity usage data is not compromised by curious internal entities or by external attackers.
• Filter abnormal data: In order to prevent the abnormal electricity usage data reported by the electricity meters from affecting the accuracy of the aggregation results, the abnormal electricity usage data should be filtered out during the aggregation process.
• Trace abnormal source: The proposed scheme should track the source of abnormal data so that abnormal meters can be repaired and maintained.

The Proposed Scheme
Our scheme is mainly composed of five stages: system initialization, registration, data encryption, aggregation and filtering, and decryption. In addition, the work flow of our scheme is presented in Figure 3.

Registration
When the smart meter $SM_i$ registers with the cloud server, the cloud server generates a random number $r_i$ and a pseudo-identity $PID_i$ for it. Then, the cloud server sends $\{PID_i, r_i\}$ to it over a secure channel.

User: Data Encryption
(1) For electricity usage data $x_i$, the smart meter $SM_i$ generates two random numbers (the $\mu_{x,i}$ terms) and constructs the matrices $X_i$ and $\tilde{X}_i$ from the three vectors of $x_i$, which are constructed as in Section 3.
(2) The smart meter $SM_i$ encrypts $X_i$, $\tilde{X}_i$ into the ciphertext $\{HT_{i,1}, HT_{i,2}\}$ as follows:

$$HT_{i,1} = (x_i + r_i) X_i M_1, \qquad HT_{i,2} = M_2 \tilde{X}_i$$

(3) Finally, the smart meter $SM_i$ reports the ciphertext $\{HT_{i,1}, HT_{i,2}, PID_i\}$ to the aggregation center.

The Aggregation Center: Aggregation and Filtering
(1) The aggregation center generates the matrix $\tilde{Q}$ according to the upper limit of normal data, q. Here $\mu_{Q,1}$ and $\mu_{Q,2}$ are random numbers, and Q is a $2N \times (N + 1)$ matrix constructed as in Section 3, in which the $r_Q$ terms are random numbers. Then, the aggregation center constructs the matrix TT from the matrix $\tilde{Q}$ and the matrices $M_3$, $M_4$.
(2) The aggregation center aggregates the reported data to obtain the aggregation result R according to the following equation:

$$R = \sum_{i=1}^{n} HT_{i,1} \cdot TT \cdot HT_{i,2}$$

For abnormal data, the result of $XQX^T$ is 0; therefore, the result of $HT_{i,1} \cdot TT \cdot HT_{i,2}$ is 0. For normal data, $XQX^T = 1$, so the result of $HT_{i,1} \cdot TT \cdot HT_{i,2}$ is still $(x_i + r_i)$. In this way, the abnormal data is automatically filtered out in the process of aggregation; that is, the aggregation result R is $\sum (x_m + r_m)$, where $x_m$ represents the normal electricity usage data and $r_m$ represents its corresponding masking value. Besides, if reported data is judged to be abnormal, the aggregation center records its source, $PID_{ab}$, and sends it to the cloud server.
(3) Lastly, the aggregation center sends the aggregated result, $R = \sum (x_m + r_m)$, and the pseudo-identities, $\{PID_{ab}\}$, of the abnormal smart meters to the cloud server.
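
The way masking matrices cancel so that abnormal readings contribute 0 to the sum can be illustrated with a toy example. The sketch below rests on several assumptions that are not taken from the scheme: the data vector is a plain unit vector, the threshold matrix is a diagonal indicator, the random factors are omitted, and TT is assumed to equal $M_1^{-1} Q M_2^{-1}$ (that is, $M_3$ and $M_4$ are assumed to be the inverses of $M_1$ and $M_2$). All function names are hypothetical.

```python
import random


def identity(d):
    return [[1 if i == j else 0 for j in range(d)] for i in range(d)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def random_mask(d, rng, steps=20):
    """Return an integer matrix M and its exact inverse, built from row operations."""
    m, m_inv = identity(d), identity(d)
    for _ in range(steps):
        i, j = rng.sample(range(d), 2)
        c = rng.randint(-3, 3)
        # M <- E M with E = I + c * e_i e_j^T (row i gains c times row j)
        m[i] = [m[i][k] + c * m[j][k] for k in range(d)]
        # M_inv <- M_inv E^{-1} (column j loses c times column i)
        for row in m_inv:
            row[j] -= c * row[i]
    return m, m_inv


def encrypt(x, r, m1, m2):
    """Meter side: HT1 = (x + r) * X * M1 (row vector), HT2 = M2 * X^T (column vector)."""
    d = len(m1)
    e = [1 if k == x else 0 for k in range(d)]
    ht1 = [(x + r) * v for v in matmul([e], m1)[0]]
    ht2 = [row[0] for row in matmul(m2, [[v] for v in e])]
    return ht1, ht2


def aggregate(reports, tt):
    """Aggregation center: sum HT1 * TT * HT2 and record senders whose term is 0."""
    total, abnormal = 0, []
    for pid, (ht1, ht2) in reports.items():
        term = matmul(matmul([ht1], tt), [[v] for v in ht2])[0][0]
        if term == 0:
            abnormal.append(pid)
        total += term
    return total, abnormal


def demo(seed=1):
    rng = random.Random(seed)
    d, q = 16, 8  # readings lie in [0, 15]; q is the upper limit of normal data
    m1, m1_inv = random_mask(d, rng)
    m2, m2_inv = random_mask(d, rng)
    q_mat = [[1 if i == j and i <= q else 0 for j in range(d)] for i in range(d)]
    tt = matmul(matmul(m1_inv, q_mat), m2_inv)  # assumed form of TT
    readings = {"PID1": 3, "PID2": 7, "PID3": 12, "PID4": 5}
    masks = {pid: rng.randint(1, 1000) for pid in readings}
    reports = {pid: encrypt(x, masks[pid], m1, m2) for pid, x in readings.items()}
    total, abnormal = aggregate(reports, tt)
    return total, abnormal, masks
```

In this toy run the reading 12 exceeds q = 8, so its term is 0 and only `PID3` is recorded as abnormal, while the total equals the sum of the three masked normal readings. Positive masks are used so that a zero term can only come from an abnormal reading.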

The Cloud Server: Decryption
After receiving the aggregated result, $R = \sum (x_m + r_m)$, and the pseudo-identities, $\{PID_{ab}\}$, of the abnormal smart meters from the aggregation center, the cloud server decrypts the data to obtain the real aggregated result as in the following equation:

$$\sum x_m = R - \Big( \sum_{i=1}^{n} r_i - \sum r_{ab} \Big)$$

where $r_{ab}$ represents the masking value corresponding to a smart meter which reports abnormal data. Therefore, the cloud server can obtain the accurate aggregated result, which does not include abnormal data, and the pseudo-identities $PID_{ab}$ of abnormal meters, so that it can make appropriate production decisions and check the abnormal smart meters.
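
The unmasking step amounts to simple arithmetic on the masking values, as the following minimal sketch shows. It assumes that the cloud server stores every $r_i$ indexed by pseudo-identity; the function name `unmask` and the dictionary layout are hypothetical.

```python
def unmask(aggregate, masks, abnormal_pids):
    """Cloud server side: subtract the masks of the meters that contributed to the aggregate."""
    # Masks of normal meters = all masks minus the masks of abnormal meters.
    normal_mask_sum = sum(masks.values()) - sum(masks[pid] for pid in abnormal_pids)
    return aggregate - normal_mask_sum
```

For example, with masks `{"PID1": 40, "PID2": 17, "PID3": 88}`, normal readings 3 and 7 from the first two meters, and `PID3` flagged as abnormal, the aggregation center sends (3 + 40) + (7 + 17) = 67, and `unmask(67, masks, ["PID3"])` gives 67 − (145 − 88) = 10.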
As the range of electricity usage data expands, the constructed matrix will become larger, which greatly increases the communication cost. For example, the bit length of the report data will be at least 1000 bits when the electricity usage data reaches 1000.
To solve this problem, a mapping function $f : S \to S^*$ is proposed to map the original data to a smaller set, where $S$ and $S^*$ are the original data set and the mapped set, respectively. For any $x_i \in S$, there exists a unique $x_i^* \in S^*$ corresponding to it, with $x_i^* = [x_i / b]$, where b is determined by the filtering accuracy. By sacrificing some accuracy within an acceptable range, communication overheads can be greatly reduced.
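
A minimal sketch of such a mapping is given below. It assumes that the bracket in $x_i^* = [x_i / b]$ denotes the floor (integer part) and that readings are non-negative integers; the function name `map_reading` is hypothetical.

```python
def map_reading(x, b):
    """Map a raw reading to the smaller set: x* = floor(x / b) (floor is an assumption)."""
    if b <= 0:
        raise ValueError("b must be positive")
    if x < 0:
        raise ValueError("x must be non-negative")
    return x // b
```

With b = 10, readings in the range 0 to 999 are mapped to 0 to 99, so the constructed matrix shrinks accordingly. The price is granularity: `map_reading(990, 10)` and `map_reading(999, 10)` both give 99, so readings within the same bucket of width b can no longer be distinguished by the filter.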

Security Analysis
In this section, we present the security proof of the proposed scheme under the adversary model.
Theorem 1. (Resistance to the man-in-the-middle attack) The proposed scheme can ensure that the privacy of users' data is not compromised by external adversaries.
Proof. The confidentiality of users' electricity data $x_i$ ($i = 1, 2, \ldots, n$) and the aggregation result $\sum x_m$ is proved below.
If the PPT adversary tries to obtain $x_i$ from $\{HT_{i,1}, HT_{i,2}\}$, (s)he must know $r_i$, since $HT_{i,1} = (x_i + r_i) X_i M_1$ and $HT_{i,2} = M_2 \tilde{X}_i$. However, $r_i$ is a random number available only to registered users and the cloud server. Consequently, the external adversaries cannot infer the individual electricity data $x_i$ from $\{HT_{i,1}, HT_{i,2}\}$.
If the external adversary tries to derive $\sum x_m$ from R, (s)he needs to know the sum of random numbers $\sum r_m$, since $R = \sum (x_m + r_m)$. However, $\sum r_m$ is only available to the cloud server. Thus, adversaries cannot infer the normal total electricity usage data $\sum x_m$.
To sum up, no adversary can recover individual electricity usage data $x_i$ or the total electricity usage data $\sum x_m$ that excludes abnormal data.

Theorem 2.
Our proposed scheme can achieve the privacy of data transmitted by a smart meter.
Proof. In our scheme, the attackers of data privacy can be divided into two categories: internal attackers and external attackers. External attacks can be resisted, since an encryption algorithm is adopted in our scheme. Internal attackers are discussed in the following three cases.

1. When the internal attacker is the aggregation center, although it can obtain the encrypted users' electricity usage data, it cannot gain the users' real electricity usage data. Specifically, the aggregation center can get $\{HT_{i,1}, HT_{i,2}\}$ reported by smart meters, where $HT_{i,1} = (x_i + r_i) X_i M_1$ and $HT_{i,2} = M_2 \tilde{X}_i$. If the aggregation center tries to recover $x_i$ from $\{HT_{i,1}, HT_{i,2}\}$, it must know $r_i$. However, $r_i$ is only available to the user i and the cloud server. Therefore, the proposed scheme can resist privacy attacks on the transmitted data from the aggregation center.

2. When the internal attacker is the cloud server, although it can obtain the aggregated result of normal electricity usage data, it cannot gain the electricity usage data of a single user. Concretely, the cloud server can only obtain $\sum (x_m + r_m)$ from the aggregation center; that is, it can only obtain the aggregated result of normal electricity usage data $\sum x_m$, which is computed by $\sum (x_m + r_m) - \sum r_m$. Therefore, the proposed scheme can resist privacy attacks on the transmitted data from the cloud server.

3. When the internal attacker is a valid smart meter, although it can intercept the electricity usage data reported by other smart meters, it cannot obtain the corresponding user's true electricity usage information $x_i$, because the masking value $r_i$ is known only to the corresponding user and the cloud server. Hence, no smart meter can recover the electricity usage data of other smart meters.
To sum up, our scheme can achieve data privacy.

Theorem 3.
It is infeasible to learn users' electricity usage data information according to the reported data in different rounds.
Proof. In each round of data aggregation, the smart meter $SM_i$ updates the masking value $r_i$ as $r_i' = H_0(r_i)$. Even if the adversary gets the reported data in two different rounds, $(x_i + r_i)$ and $(x_i' + r_i')$, (s)he can only obtain $(x_i + r_i) - (x_i' + r_i')$, which does not reveal the change in electricity usage data between the two aggregation rounds, because the difference of the masking values is unknown to the adversary. Therefore, it is still infeasible to obtain information related to users' electricity usage data according to the reported data in different rounds.
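
The per-round mask update can be sketched as a hash chain. In the sketch below, SHA-256 stands in for $H_0$ and the result is reduced modulo $2^{64}$; both choices, as well as the function names `next_mask` and `round_difference`, are assumptions made for illustration only.

```python
import hashlib


def next_mask(r, modulus=2**64):
    """One update r' = H_0(r), with SHA-256 standing in for H_0 (an assumption)."""
    digest = hashlib.sha256(r.to_bytes(32, "big")).digest()
    return int.from_bytes(digest, "big") % modulus


def round_difference(x_old, x_new, r_old):
    """What an eavesdropper could compute from two rounds: (x + r) - (x' + r')."""
    r_new = next_mask(r_old)
    return (x_old + r_old) - (x_new + r_new)
```

The value returned by `round_difference` equals $(x_i - x_i') + (r_i - r_i')$, so the change in consumption stays hidden as long as the masking values remain secret.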

Performance Analysis
In this section, we evaluate the performance of our scheme and compare it with two representative and related schemes, the LCEDA scheme by Su et al. [20] and the DMDA scheme by Song et al. [25]. All of these schemes use masking values to encrypt the electricity usage data, and our scheme additionally uses matrix encryption to filter abnormal data. Hence, we primarily compare the proposed scheme with LCEDA and DMDA in terms of communication and computation costs. Table 3 lists some notations for the performance comparisons.
Table 3. Notations.

| Notation | Semantics |
| --- | --- |
| $T_a$ | Time of an addition operation in $Z_p$ |
| $T_h$ | Time of a hash operation |
| $T_s$ | Time of a subtraction operation in $Z_p$ |
| $T_{pm}$ | Time of a point multiplication operation |
| $M_m$ | Time of a multiplication operation in matrix |
| $M_a$ | Time of an addition operation in matrix |

Communication Costs
The communication costs of the LCEDA, the DMDA, and our scheme in the enrollment stage are shown in Table 4. The highest communication costs are mainly concentrated between the cloud server and the smart meters in these schemes.
In LCEDA, it costs $|Z_p| + |ID|$ communication overheads for the aggregation center to register at the cloud server. In addition, each smart meter spends $t|Z_p| + |ID|$ and $|Z_p|$ on registering at the cloud server and the aggregation center, respectively. Hence, in the enrollment stage, the number of communications in LCEDA is O(1), and the total costs are $(t + 2)|Z_p| + 2|ID|$. In DMDA, the number of communications is O(1), and the aggregation center spends $|G|$ communication overheads on registering at the cloud server to obtain the mask values, while the smart meters spend $(t + 2)|Z_p| + |G| + 2|ID|$ communication overheads on registering. Therefore, the total length of a communication message is constant in the enrollment stage of DMDA. In our scheme, the number of communications is O(1), and it costs $|M_1| + |M_2| + |Z_p| + |ID|$ communication overheads for the smart meters to register at the cloud server.
To sum up, the total length of the communication message of LCEDA in the enrollment stage is linear in t; hence, it has the highest communication costs among these schemes. Although the message lengths of both DMDA and our scheme are constant, our scheme is less efficient than DMDA when the number of communications and the message length are considered together.

Computation Costs
To evaluate performance, we conducted some experiments on a computer running Windows 10 with a 3.00 GHz Intel Core i5-8500 CPU and 8 GB memory. These experiments were run separately 50 times to obtain the mean results using the GNU Multiple Precision Arithmetic (GMP) Library and Pairing-Based Cryptography (PBC) Library.
The system initialization stage consists of two stages: the system setup stage and the enrollment stage. We set the number of users to 1000 in the implementation. The system setup stage in LCEDA, DMDA, and our scheme cost 8.74 ms, 29.9 ms, and 4.90 ms, respectively. The comparison of computation costs of LCEDA, DMDA, and our scheme in the enrollment stage is shown in Figure 4, where the number of users varies from 100 to 1000 in steps of 100. In LCEDA, the smart meters spent $(t + 1)T_{(t-1)\text{-poly}}$ on registering at the cloud server and the aggregation center without negotiating with each other. As shown in Figure 4, the computation time of LCEDA ranged from 502.2 ms to 5895.2 ms when the number of users varied from 100 to 1000. The computation costs of DMDA are $2(T_{pm} + T_h + T_a)$. In our scheme, the smart meters and the aggregation center register at the cloud server, which costs $4M_m$, and the computation time of the proposed protocol ranges from 465.3 ms to 4897.6 ms.
The data collection stage consists of three stages: the data encryption stage, the aggregation stage, and the decryption stage. The encryption times of LCEDA, DMDA, and our scheme are shown in Table 5. Each smart meter in LCEDA needed 0.001 ms to encrypt the electricity usage data, while our scheme needed 0.3 ms. The aggregation of the encrypted electricity usage data costs 28.3 ms, 28.3 ms, and 33.2 ms in LCEDA, DMDA, and our scheme, respectively, when the number of users is 1000. In LCEDA and DMDA, the cloud center needs 0.53 ms to decrypt the aggregation result, whereas our scheme only needs 0.13 ms. Finally, the times to encrypt data and to aggregate data in our scheme are shown in Figure 5 and Figure 6, respectively.
Therefore, DMDA and LCEDA have lower computation costs, $2(T_{pm} + T_h) + (n + 1)T_a + T_s$ and $(n + t)T_{(t-1)\text{-poly}} + 2(n - 1)T_m + (2n + 1)T_a + T_s$, respectively, compared to our scheme, because they neither filter abnormal electricity usage data nor support finding its source. To sum up, our scheme pays more computation costs for filtering abnormal data and finding its source, but the increase is not significant, so our scheme remains efficient.

Conclusions
In this paper, we propose a lightweight and privacy-friendly data aggregation scheme against abnormal data to solve the problem that the abnormal electricity usage data cannot be filtered out when it is encrypted. Besides, our scheme can find out the smart meters which reported the abnormal data. Compared with other complex schemes, our scheme only uses a lightweight matrix encryption, which has lower computational costs and is more suitable for smart meters with limited computing capacity. Finally, a security analysis of our proposed scheme is presented to prove that our scheme can fully protect the privacy of users' electricity usage data. In addition, the performance evaluations and experiments validate the effectiveness and practicability of our scheme. Consequently, our scheme can be implemented in smart grids to effectively filter abnormal data and find out its source.
Our scheme is not without drawbacks. We mainly focus on filtering abnormal data during aggregation and finding the source of the abnormal data, and we use lightweight matrix encryption to process real-time electricity usage data. However, as the range of electricity usage data expands, the constructed matrix becomes larger, which gradually increases the computational and communication overheads. To overcome this problem, we map the original data onto a smaller data set to reduce the size of the constructed matrix, with the mapping function determined by the filtering accuracy. By sacrificing some accuracy within an acceptable range, communication overheads can be greatly reduced. In future work, we will focus on reducing computing and communication overheads while ensuring better filtering accuracy.

Data Availability Statement: The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.