Detection of False Data Injection Attacks in Smart Grids Based on Expectation Maximization

The secure operation of smart grids is closely linked to state estimates that accurately reflect the physical characteristics of the grid. However, well-designed false data injection attacks (FDIAs) can manipulate the process of state estimation by injecting malicious data into the measurement data while bypassing the detection of the security system, ultimately causing the results of state estimation to deviate from secure values. Since FDIAs tampering with the measurement data of some buses will lead to error offset, this paper proposes an attack-detection algorithm based on statistical learning according to the different characteristic parameters of measurement error before and after tampering. In order to detect and classify false data from the measurement data, in this paper, we report the model establishment and estimation of error parameters for the tampered measurement data by combining the the k-means++ algorithm with the expectation maximization (EM) algorithm. At the same time, we located and recorded the bus that the attacker attempted to tamper with. In order to verify the feasibility of the algorithm proposed in this paper, the IEEE 5-bus standard test system and the IEEE 14-bus standard test system were used for simulation analysis. Numerical examples demonstrate that the combined use of the two algorithms can decrease the detection time to less than 0.011883 s and correctly locate the false data with a probability of more than 95%.


Introduction
The current power system is continuously monitored by an energy management system (EMS), and a supervisory control and data acquisition (SCADA) system us used to maintain normal and secure operating conditions [1]. In particular, the SCADA system in the control center uses state estimators to process the received measurements. The estimator obtains the best estimate of the system's state by filtering incorrect data. These state estimates are then transmitted to all EMS to control the proper functioning of the physical aspects of the grid, such as the power flow calculation.
The measurements collected by the SCADA system include not only measurement noise due to the limited precision of sensors and communication medium, but also errors due to various problems, such as connecting and calibrating a failed meter. To decrease the effects of noise and error, power system researchers have developed many methods to deal with the measurements during state estimation [2,3]. The basic principle of these methods is to use the redundancy of multiple measurements to identify and eliminate anomalies.
Most of the technologies used to protect grid systems are designed to ensure system reliability, such as preventing random failures. However, more and more attention has been paid to preventing malicious network attacks in the recent proposals for smart grids [4]. The operation and control of smart grids depend on the complex network space of computer, software and communication technology [5]. Since measurement components supported by smart devices, such as smart instruments and sensors, play important roles in confirming the real-time physical states of power systems, they are likely to be targets of attack. These measuring devices widely use Internet-based protocols in communication systems, which are open to external networks and lack of hardware to prevent tampering. In order to promote data sharing, enterprise networks, and even individual users, are allowed to connect to the infrastructure of power grid information [6]. Potential complex malicious attacks increase after these network interfaces are introduced into power systems [7][8][9][10]. Liu et al. [11] indicated in 2009 that a new FDIA could bypass bad data detection (BDD) in current SCADA systems and introduce any errors into state estimation without being detected. Malicious covert data injection of network buses will inevitably have a negative impact on power-system state estimation [12,13]. The injection of these malicious data that deviates state estimates away from security values can directly result in serious social and economic losses, and an attacker can utilize the FDIA to manipulate the electricity price of the electric market [14][15][16], and this attack can even result in regional power shortages [17].
Du et al. [18] proposed a method to extract network parameters from the limited data obtained by phasor measurement units (PMUs) when the network parameters are unknown and then use these parameters to build an AC attack model, finally making the state estimation deviate from the securely value. Most of the classical methods used to construct the attack model focus on tampering measurements, such as the power injected into the bus and the power flow between buses. Liu et al. [19] proposed a method to attack network parameters which reduces the number of attack measurements by coordinating the modifications of parameters and other measurements in the power system. The attack method is still applicable in cases where the topology and line impedance of the network are incomplete. Since it is unrealistic for an attack to modify network parameters directly. Liu et al. [20] proposed a more universally applicable attack model. The concrete approach is to tamper with network parameters indirectly by exploiting the vulnerabilities that exist when the network parameters are incorrectly handled.
Several directions have been taken in the research of detecting FDIAs in smart grids. Although these detection methods differ to varying degrees, they can be broadly classified into two broad categories. Detection methods can be categorized as model-based detection algorithms and data-driven detection algorithms. In response to the situation in which network parameters are attacked, [21] proposed a way to detect network parameter attacks based on the inconsistency of historical data and specified network parameters. However, such methods are no longer applicable in detecting combinatorial attacks. Methods to detect FDIAs using differences in the probability distributions of historical and current measurement data may not be applicable any longer, such as assuming the attack vector is a trapezoidal attack or that spurious data injected do not significantly deviate from the historical trend [22][23][24]. In addition, such a detection method will easy cause false detection when encountering actual events, such as sudden changes in the load or from the generator. To deal with this situation, a method was proposed in [25] to detect FDIAs using the difference in the residual probability distribution between historical measurement data and that of current measurement data. This method still maintains good detection performance when facing trapezoidal attacks and real events. Chen et al. [26] proposed a scheme to detect data before state estimation by using vector autoregression model. This scheme uses vector autoregressive model to predict and classifiers to detect, which improves the detection rate based on the autoregressive model. Saleh et al. [27] proposed a detection method to detect FDIAs that destroy the state estimation of PMUs. The phase lock value (PLV) is used to judge whether the phase changes between buses are consistent. If the phase change was no longer constant, the data for the PMU were considered to have been manipulated; otherwise, data security at PMUs was considered. The above are several model-based detection methods.
Unlike model-based detection algorithms for FDIAs, machine learning, as a datadriven technique, implies a huge dependence on historical data of the system under test. Yu et al. [28] proposed a false data injection attack detection method for AC state estimation. When FDIAs exist, their spatial and temporal data correlations may deviate from the correlations under normal conditions. By using wavelet transforms and deep neural networks to analyze the estimated states in continuous time, the proposed method can effectively detect this inconsistency. Xun et al. [29] proposed an extreme learning machine (ELM)-based one-class and one-network (OCON) framework for detecting FDIAs. In this framework, the subnetwork of the state identification layer in OCON uses the ELM algorithm to accurately classify false data and normal data. Almasabi et al. [30] proposed a new method to detect FDIAs using moving average, correlation and machine learning algorithms. The experiments showed that the proposed method is able to detect the attacked PMUs and its timing issues with a high detection rate. Most existing machinelearning-based detection methods generally assume that the labels of the training data are known, which may not be consistent with common sense. Since real-life FDIAs are generally considered as rare events, it may be challenging to obtain the identity of the compromised data. An et al. [31] proposed the use of unsupervised integrated autoencoders connected to a Gaussian mixture model (GMM) to accommodate multiple domains. Attention-based potential representation and minimum error reconstruction features are utilized in the hidden space of the integrated autoencoder. The expectation maximization (EM) algorithm is used to estimate the sample density in the GMM. When the estimated sample density exceeds the learning threshold obtained in the training phase, the sample is identified as an outlier. Since the EM algorithm has the disadvantage of being sensitive to initial values, excellent initialization parameters are required for the next iterative step of the calculation. To deal with this challenge, we are required to develop an unsupervised detection approach.
This paper proposes a detection and location method for the false data injection attacks in smart grid. FDIAs threaten the management and control of grids by tampering with the measurement data of the smart grid systems. In fact, the attacker adds an unknown deviation to the measurement data of a system to launch an FDIA. Since the presence of unknown attacks generates error bias, there are different characteristic parameters for the measurement error contained by false data and that of normal data. Therefore, we used the k-means++ algorithm and the expectation maximization (EM) algorithm to estimate the corresponding parameters of the measured data to eliminate the data affected by the FDIA, and finally achieved the purpose of attack detection. The main contributions of this paper can be summarized as follows: • Since the error models of both measurement vectors and state variables with false data have the characteristics of the Gaussian mixture model (GMM), a false data injection attack detection method based on the k-means++ and expectation maximization (EM) algorithms is proposed. • To address the fact that the k-means algorithm is sensitive to the initial clustering centers and affects the convergence efficiency, the k-means++ algorithm is proposed to determine the initial estimated parameters of the GMM in a faster iterative approach. • The k-means++ algorithm is used to preprocess the data to solve the problem of EM algorithm being sensitive to initial values. It also decreases the calculation complexity of the EM algorithm, and finally detects and locates false data rapidly according to the classification results.

System Model
For complex information processing of smart grid, it is necessary to generate corresponding mathematical model according to network topology and data of distribution network [32]. The general linear state equation of voltage and current phasors in the smart grid distribution system is as follows [33]: where y ∈ C m is the original measurement vector of voltage and current phasor; z is the noiseless measurement vector; x ∈ C n is the vector describing the system state variable; H ∈ C m×n is the network topology matrix describing the vicinity of a given working point; e ∈ C m is the measurement error produced by the sensor, where each component is modeled as an independent homodistributed and obeys a complex Gaussian random variable with a zero mean and variance of σ 2 .
Attackers use FDIAs to add attack vectors to the measurement vectors to corrupt the measurements available to the operator. The actual measurements after being attacked are where a ∈ C m is the attack vector; y a ∈ C m represents the measurement after being attacked by false data injection.
With the rapid development of synchronous phasor measurement units (PMUs), a smart grid can obtain impeccable phasor measurement values by arranging PMUs on the terminal buses [34]. Using these measurements, the system state variable x can be accurately estimated. However, due to the price factor of PMUs, the device cannot be installed on all transmission buses of the power system, and can only cooperate with other sensors to obtain system measurements. One of the attacks considered under this condition is that during the stable operation of the power system, one of the N phasor measurements in the measurement vector y is continuously attacked; that is, a component in the attack vector a is not zero. In the subsequent measurement acquisition process, we determine whether the phasor measurements are replaced with false data by K(K ≥ 1) measurement vectors. To facilitate the calculation, the obtained measurement samples are converted from complex representation to real coordinate representation, and then the actual obtained component of the ith phase measurement of the kth measurement vector y k ∈ R N×2 is represented as where y i,k ∈ R 1×2 , z i,k ∈ R 1×2 , e i,k ∈ R 1×2 . The error distribution of the secure phase measurement is represented by p (1) e (e; µ 1 , Σ 1 ), and the error distribution of the phase measurement tampered with by the attack is represented by p (2) e (e; µ 2 , Σ 2 ). In addition, the phasor measurement error distributions belong to two-dimensional Gaussian distributions with unknown parameters.
For ease of calculation, the actual obtained model for the phasor measurement sample of K measurement vectors is written as where Y ∈ R NK×2 , Z ∈ R NK×2 and E ∈ R NK×2 represent the original measurement, actual measurement and measurement error obtained from K measurements for N phase measurement units, respectively.
E = [e 1,1 , · · · , e 1,K , · · · , e N,1 , · · · , e N,K ] T Power-grid operators generally apply a likelihood ratio test to each measurement to judge whether the measurement is correct. However, there are errors in the measurement data that conform to a Gaussian distribution, and the number of false alarms increases as the number of measurements increases, making it more difficult to detect false data. In this study, we used the method of processing the results of multiple measurements as a set of data. Since interrelated measurement data are linked, the probability of false alarms can be decreased by mathematically determining the relationship between the data. However, the difficulty of this method is also in which calculation method should be used to quickly determine the relationship between the data in the group. An inappropriate method is likely to increase the workload of the detection system and decrease the detection efficiency.

Maximum Likelihood Estimation
When all measurements Y are considered as a whole, the corresponding measurement error samples E can be seen as coming from two clusters-one with MK correct phasor measurement samples and the other with (N − M)K attacked tampered phasor measurement samples. Without testing, it is impossible to determine which samples of measurements have been tampered with by FDIAs. The probability distribution of the measurement error e for each measurement y according to the assumed statistics can be represented by a Gaussian mixture model (GMM): where In this paper, we derived the distribution parameters of the measurement error by exploiting the asymptotic property of maximum likelihood estimation (MLE). Knowing about the phase measurements associated with the parameters and the actual values derived from the state variables, the maximum likelihood estimate θ for unknown parameters can be solved by maximizing the log-likelihood function globally. According to the noise model assumed in (8), the log-likelihood function with parameter vector θ=[α 1 , α 2 , µ 1 , Σ 1 , µ 2 , Σ 2 ] T can be obtained as The maximum likelihood estimateθ ML was obtained by solving arg max Since the cost function in (10) is too complex, we would like to use a method to decrease the complexity of calculating the MLE. Therefore, we introduce a complete dataset {E, γ}, where γ= γ 1,1,1 , · · · , γ 1,K,1 , · · · , γ N,1,1 , · · · , γ N,K,1 γ 1,1,2 , · · · , γ 1,K,2 , · · · , γ N,1,2 , · · · , γ N,K,2 contains 2NK random hidden variables whose values reflect which mixed component the random variable in the measurement error E belongs to. γ i,k,l is defined as follows: With unobserved data γ i,k,l , the complete data are (e i,k , γ i,k,1 , γ i,k,2 ). More specifically, if e i,k is the measurement error of the security data, then e i,k belongs to the first mixture component p (1) e (e; µ 1 , Σ 1 ) of the Gaussian mixture model, and its complete data are (e i,k , 1, 0). If e i,k is the measurement error of the false data, then e i,k belongs to the other components of the Gaussian mixture model, denoted as (e i,k , 0, 1). The log-likelihood function for complete data is To avoid ambiguity, the original log-likelihood function L I (θ; E) in (9) is referred to as the log-likelihood function for incomplete data. Clearly, the newly introduced loglikelihood function L C (θ; E, γ) for complete data is much simpler to calculate. For GMMcompliant measurements, the EM algorithm can be used to approximate MLE [35].

K-Means++ Algorithm
Since the EM algorithm has the disadvantage of being sensitive to initial values, the parameter θ needs to be initialized in order to proceed to the next iteration of the calculation. The convergence efficiency is greatly decreased by the randomly chosen initial estimated parameter θ (0) due to the information uncertainty in estimating parameter θ. At the same time, whether to get a global optimal solution is also worth considering. The k-means algorithm classifies data according to the minimum distance criterion, which is commonly used in the clustering of data streams; its advantages are simplicity and rapidity [36]. The k-means++ algorithm determines the initial estimated parameters of the Gaussian mixture model with faster iterating than the k-means algorithm. At the same time, the k-means++ algorithm decreases the sensitivity to the initial clustering center, thereby accelerating the rate of convergence.
The idea of the k-means++ algorithm can be summarized in two steps. In the first step, the only difference between k-means++ and k-means algorithms is that the k-means++ algorithm chooses initial clustering centers that are far away from each other rather than randomly. Therefore, the above characteristics allow the k-means++ algorithm to have faster calculation speed. In the second step, sample points in the dataset are assigned to cluster centers that are nearest to each other to form different clusters and recalculate cluster centers.
In this paper, the workflow of k-means++ algorithm can be summarized as three steps. The first step is to select the initial cluster center. First, a sample e is randomly selected from the data set E as the initial clustering center c 1 . Then, the Euclidean distance between each sample e i,k and the currently existing clustering center c (0) 1 is calculated and denoted by D(e i,k ). Next, the probability of each sample being selected as the next cluster center is calculated by using Finally, the second initial cluster center c The second step is to assign the dataset. Assign each sample of the dataset to the appropriate cluster center according to the principle of minimum Euclidean distance.
where (15) indicates that e i,k belongs to the c (n) l -centered clustering domain. The third step is to update the clustering centers. At the (n + 1)th iteration, the cluster centers of the dataset are recalculated based on the hidden variable γ (n+1) . The newly calculated cluster centers are then used as the center of mass of the samples belonging to that category.

EM Algorithm
The idea of EM algorithm is to estimate unknown parameters through two iterations: an expectation (E) step and a maximization (M) step. In the first step (E-step), the conditional expectation of the log-likelihood function for complete data is calculated based on the conditional probability of the hidden variable. In the second step (M-step), the conditional expectation obtained by the E-step is maximized for the desired parameters. Using the estimated parameter θ obtained with the k-means++ algorithm, we proposed the workflow of the EM algorithm for the (η + 1)th iteration thereafter.
Step 1 (E-step): The conditional expectation for defining the log-likelihood function of complete data is as follows: i,k,l can be calculated from the Bayesian rule of Equation (18).
Step 2 (M-step): The maximum of function Q θ, θ (η) is obtained from Equation (18) with θ as the vector parameter. The result of the (η + 1)th iteration is

Algorithm Implementation
The probability density function (PDF) of random variables in measurement error E is p e (e)=α 1 N (e; µ 1 , The more appropriate initial vector parameter θ (0) = α T obtained according to the k-means++ algorithm was used for the first iteration of the EM algorithm. The cost function in (17) can be simplified as In order to maximize the GMM with parameter Λ (η) (θ), we can solve where λ in (22) is a Lagrange multiplier. In (24), θ= α

Meanwhile, the solutions of the equations are all in closed form, and the result is
The above calculations are repeated until the log-likelihood function value no longer changes significantly. By rounding the final dataγ (η+1) i,k,l of the hidden variable, we obtain the complete data set {E, γ} and the vector parameter θ of the GMM.
Thus, the pseudo-algorithm of the joint use of k-means++ algorithm and EM algorithm for parameter estimation of GMM is shown in Algorithm 1.
Initialize: Iteration index n = 0 for k-means++ algorithm; the EM algorithm's iteration index η = 0; convergence tolerance is ∆; and maximum iteration number is N max itr .

K-means++ algorithm loop:
(1) A sample point is randomly selected as the initial cluster center c (0) 1 , and then the second cluster center c (0) 2 is selected according to the roulette wheel selection.
is satisfied, the k-means++ algorithm is terminated. Otherwise, set n ← n + 1 and return to (2). Get the initial estimation parameters:

Convergence Analysis
The essence of using k-means++ algorithm to calculate new clustering centers is to minimize the sum of squared error (SSE) function: As can be found from the algorithm, SSE is a rigorous coordinate descent procedure. Selecting the mean of the current clustering as the new clustering center ensures that SSE will be decreased at each iteration.
Since SSE is monotonically decreasing and has a lower bound, the optimal solution c l that converges SSE to the minimum can finally be obtained.
For any Gaussian distribution parameter vector θ (η) in the EM algorithm's parame- is easily verified via the following relationship [37,38]: Based on the monotonicity of the log-likelihood function Q θ, θ (η) for complete data and the boundedness of p(E; θ) in the EM algorithm, it can be proved that the proposed EM algorithm converges to a stationary point L * I of the log-likelihood function L I (θ; E) for incomplete data.

Complexity Analysis
In the complexity analysis, we focused on the iterative process between the k-means++ algorithm and the EM algorithm in the estimation of parameters. Since they consume more computationally, complexity was evaluated with floating point operations (FLOPs).
We define FLOPs in relation to some basic operations as follows: (1) ε add : FLOPs required for addition.
(3) ε mul : FLOPs required for multiplication. (4) ε div : FLOPs required for division. Note that the FLOPs used in actual practice may differ depending on the processor. Since both k-means++ and EM algorithms are iterative, we focused our analysis in a single iterative process. The (n + 1)th iteration of the k-means++ algorithm to reclassify the dataset according to (15) requires NK(4ε sub + 4ε pow + 2ε add + 1ε com + 2ε ass ) flops, and to update the clustering center according to (16) requires (3NK − 5)ε add + 4ε div + 1ε sub . We define FL(c) as the FLOPs required to estimate cluster center c in one iteration of the k-means++ algorithm.
Finally, the number of iterations required to achieve convergence is assumed to be N k itr or N EM itr for the k-means++ and EM algorithms, respectively. Then, the FLOPs needed to ultimately estimate the vector parameter θ are approximately

Simulation Analysis
To verify the feasibility of the proposed algorithm, the simulation in this paper was performed with IEEE 5-bus standard test system and IEEE 14-bus standard test system. The MATLAB R2018b software was used for simulation, and the related data in the MAT-POWER 7.1 power simulation package were used for routine power flow calculation. The final operating data were used as the measurement data for the power system. The at-tack vector was injected into the system first, and then the k-means++ algorithm and EM algorithm were jointly used to verify the feasibility of this detection method.

Simulation Parameters
The related data modified from the simulation of IEEE 5-bus standard test system are shown in Table 1. The other data were unchanged. We summarize the simulation parameters that were used in the simulation in Table 2, and generated simulation data based on these parameters to test the algorithm.

Simulation Results
For the 600 data points shown in Figure 1, the measurement errors of some phasors begin to shift when a meter measurement in the power system is tampered with. Figure 2 shows the initial data-clustering results processed by the k-means++ algorithm. The classification results of the GMM and data obtained after the subsequent EM algorithm are shown in Figure 3, and the images of their final classification results are basically consistent with those shown in Figure 1. Figure 4 visualizes the PDF image of the measurement error distribution of GMM, and the figure shows the error offset caused by the false data.     Figure 5 shows that the sum of squared errors of the model gradually flattens out as the number of iterations monotonically changes when using the k-means++ algorithm for simulation. Figure 6 shows that with the EM algorithm, the logarithmic likelihood function values of the model gradually flatten out as the number of iterations monotonically changes. The simulation results show that both algorithms can take little time to achieve convergence.  Value of log-likelihood The simulation result shows in Figure 7 that the detected false data come from the branches I 1−2 between measurement buses 1 and 2. There was one misdetected measurement datum each in branch I 1−5 and branch I 4−5 . For a changing number of measurement buses injected with false data, the average error change of vector parameter θ=[α 1 , α 2 , µ 1 , Σ 1 , µ 2 , Σ 2 ] T in GMM obtained by the detection method in this paper is shown in Figures 8-10. It can be seen that as the false data increase in number, the estimation errors of parameters α 2 , µ 2 and Σ 2 of this algorithm decrease continuously.   As the proportion of false data in the overall data increases, the probabilities of false data detection, missed detection and false detection by this algorithm change, as shown in Figure 11. It can be seen that the detection rate of the algorithm for false data is basically above 95%, and the detection probability can be further improved to above 99% as the amount of false data increases; thus, the probabilities of false detection and missed detection are normally below 1%. In order to further verify the rapidity of the algorithm proposed in this paper for detecting false data injection attacks, we have conducted 1000 repeated experiments. The simulated time statistic histogram and normal distribution curve obtained after 1000 repetitions of simulation experiments are shown in Figure 12. From the normal distribution curve in the graph, it can be seen that the algorithm can basically detect false data in 0.011883 s. To verify the feasibility of the proposed algorithm, it was further tested in the IEEE 14-bus standard test system. The measurement errors of active and reactive power of the bus and transmission lines and the errors after being attacked by false data injection are shown in Table 3. The validity of the method was verified by injecting false data into arbitrarily selected measurement units. One thousand sets of quantitative measurement vectors with false data were generated as experimental data according to the Monte Carlo method.
Firstly, the measurement errors were used to detect FDIAs. The measurement errors obtained by Monte Carlo method for 1000 instances of normal data were transformed into samples that conformed to the standard normal distribution model, and the measurement error data obtained are shown in Figure 13. All the data conform to the model of standard normal distribution, and the measurement errors of the sample data are not shifted. The results of the measurement error after injecting false data are shown in Figure 14. It can be seen in the figure that the FDIAs with Equation (35) as the attack vector made the degree of offset of the measurement error more significant. The results of clustering the measurement errors after the false data injection attack by the k-means++ algorithm are shown in Figure 15.  The data preprocessed using the k-means++ algorithm were further iteratively calculated using the EM algorithm. The final PDF image of the GMM of the measurement error was obtained as shown in Figure 16. The results of classifying the sample data of 1000 measurement vectors according to the fitted GMM are shown in Figure 17. From the figure, it can be seen that there is no influence of bias in the normal measurement data, so its error distribution is basically around zero. The data with error deviations were removed and classified by classifying the sample data. It is known that the power measurement data of P 3 , Q 2 , Q 3 , P 1−2 , P 2−3 , Q 2−3 , Q 1−2 and Q 4−2 in the power system were tampered with by the attacker through FDIAs. The detection of false data in the measurement data using the algorithm of this paper is shown in Figure 18. A small number of data were identified as normal data because the data in measurement units P 3 , Q 3 , P 1−2 and P 2−3 are more similar to the normal data.   Secondly, we detected FDIAs from the perspective of the results of state estimation. When not under attack, 100 sets were randomly selected from the 1000 sets of measurement data for state estimation. The errors of their state estimation results were transformed into samples that conformed to the model of standard normal distribution, and the obtained estimation errors are shown in Figure 19. All data conform to the model with a standard normal distribution, and none of the sample data are biased by the measurement errors. The results of its measurement error after injecting false data are shown in Figure 20. From the figure, it can be seen that the voltage amplitude and phase angle of the state estimate of some buses are significantly shifted. The data preprocessed by the k-means++ algorithm were further iteratively calculated using the EM algorithm, and the final PDF image of the state estimation error conforming to the GMM is shown in Figure 21. The results of classifying the sample data of 100 state variables according to the fitted GMM are shown in Figure 22. From the figure, it can be seen that the data with error deviations were removed and classified by classifying the sample data. The errors of voltage magnitude and phase angle of bus 1 and buses 4-14 are around zero, and their deviations are very small, so they basically have no impact on the power system. The results of the state estimation of bus 3 are mainly the offset of voltage amplitude, which has a mild impact on the power system. The results of the state estimation of bus 2 show large shifts in voltage magnitude and phase angle, indicating that bus 2 was the main target of the FDIAs. The detection of false data in the measurement vector using the algorithm proposed in this paper is shown in Figure 23.

Conclusions
Considering that false data injection attacks can disrupt the secure operation of smart grids, we proposed a method to detect and locate false data injection attacks in power systems using statistical learning. By combining the k-means++ algorithm with the EM algorithm, it is possible to accurately model the smart grid bus measurement data within 0.011883s. At the same time, the GMM containing the characteristic parameters of data measurement errors can be obtained. Numerical examples showed that the mathematical model obtained by this joint algorithm provides a detection probability of more than 95% for false data, and can accurately locate the measured buses that are tampered with by FDIAs.
Subsequent research can provide the best choice of GMM with different models by combining the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Silhouette Coefficient (SC), Calinski-Harbasz (CH) score and other methods, so as to build a more perfect model to improve the algorithm in this paper.