A Network Parameter Database False Data Injection Correction Physics-Based Model: A Machine Learning Synthetic Measurement-Based Approach

In real-time monitoring of the cyber-physical security of power systems, false data injection attacks on wide-area measurements are of major concern. However, the database of the network parameters is just as crucial to the state estimation process. Maintaining the accuracy of the system model is the other part of the equation, since almost all applications in power systems heavily depend on the state estimator outputs. While much effort has been devoted to false data injection attacks on measurements, little work has been reported on the broader theme of false data injection on the database of network parameters. State-of-the-art physics-based solutions correct false data injection on the network parameter database considering only the available wide-area measurements. In addition, deterministic models are used for correction. In this paper, an overdetermined physics-based parameter false data injection correction model is presented. The overdetermined model uses a parameter database correction Jacobian matrix and a Taylor series expansion approximation. The method further applies the concept of synthetic measurements, which refers to measurements that do not exist in the real-life system. A machine learning linear regression-based model for measurement prediction is integrated in the framework by deriving the weights used for synthetic measurement creation. Validation of the presented model is performed on the IEEE 118-bus system. Numerical results show that the approximation error is lower than that of the state-of-the-art, while providing robustness to the correction process. The model is easy to implement on top of the classical weighted-least-squares solution, which highlights its potential for real-life implementation.


Introduction
The use of machine learning techniques in power systems research has been an increasing trend in recent years due to the transition to the Smart Grid (SG). As utility companies move towards SG implementation, the new technology being used will provide more data to be analyzed. This is a natural fit for the integration of machine learning techniques into the power systems field. One area in particular that can take advantage of machine learning is cyber-physical security of the SG. Alongside the benefits that come with the transition to the SG, the increasing reliance on communication, automation and information technology systems adds vulnerability to cyber-security threats [1][2][3][4][5][6][7]. Cyber-attacks have already been successfully executed with major consequences. In Ukraine, a cyber-attack led to a major blackout that impacted 225,000 customers [8]. This has drawn considerable attention from academic circles and industrial practitioners and has fueled research on the cyber-security of the power grid, including solutions based on machine learning techniques.
When considering cyber-security of the SG monitoring systems, a critical process is state estimation (SE), which is the core real-time monitoring tool used by utility companies. SE analyzes measurements from throughout the system to estimate the voltage at each bus. The results of SE are then used in many applications, including bad data analysis, which can be used to detect and identify a variety of potential cyber-attacks on the SG. However, most research focuses only on false data injection (FDI) attacks on the measurements used in SE. Further, research that involves machine learning has so far focused on either creating more accurate SE results or aiding in the detection of measurement FDI attacks [9][10][11][12][13][14][15][16]. In the authors' previous works [17,18], hybrid data-driven and physics-based methods for anomaly detection on the SG are developed, but they still only deal with measurement FDI attacks. On the other hand, there is no real-time monitoring of the network parameters used in the SE process. The database of network parameters can be stealthily attacked [19,20] or can be corrupted for several reasons, such as human entry error or failure to update the parameters of replaced equipment. Inaccuracy in the database information may therefore lead system operators to blame errors in the SE results on measurement accuracy. Hence, the quality of the SE solution can be severely impacted by two main sources: measurement errors and network parameter errors. In the literature, though, network parameter FDI is covered far less often than measurement FDI, even though both sources of cyber-attacks can greatly impact the SE.
Bad data processing is an important sub-routine which mainly aims to detect, identify and correct measurement errors in SE. Unlike sensor measurements, the database of network parameters is stored at a control center and, as previously stated, is not monitored. These are ideal conditions for malicious adversaries, who might attempt to modify the network parameter database with the intent to change state estimation results. Several methods have been developed to detect, identify and correct errors pertaining to network parameters. The aim of this work, though, is to model parameter FDI correction in bad data processing. Regarding the processing of parameter errors, in [21], the author uses an augmented state vector-based approach. Similarly, refs. [2][22][23][24][25][26] all depend heavily on high measurement redundancy, since they are based on the state vector augmentation approach. Under stealthy parameter FDI, this can easily lead to observability issues in the SE. Moreover, these approaches cannot handle multiple simultaneous attacks. In previous works of the authors, parameter FDI cyber-attack correction models have been presented [19,27,28]. However, these works consider only single-parameter attacks or multiple attacks of equal magnitude. With the integration of machine learning in SG technologies, ref. [29] developed a parameter error detection technique that uses both machine learning based multivariate linear regression models and the Innovation-based SE [30]. However, the work proposed in [29] deals with detection of the parameter FDI attack only and leaves parameter FDI correction modeling for future work. Correction is a critical step in ensuring the SE produces reliable results for future measurement sets.
In the authors' previous work [31], a parameter FDI correction model is presented. However, the limitations of [31] are twofold: (1) it assumes the availability of a measurement set at the FDI location, i.e., both real power flow and reactive power flow measurements are available for parameter correction; (2) the method is deterministic, i.e., the residual is zero, which means that inaccuracy in the measurement set will lead to an inaccurate correction of parameters. These limitations naturally inspired the authors to develop a new, overdetermined correction model that does not assume that all of the necessary real measurements are always available and accurate. In order to do this, an additional measurement set, called Synthetic Measurements (SM) in this paper, is generated to increase the redundancy level and enable a new SE towards parameter correction. The linear regression prediction of [29] is used to generate SM in a similar way as in [32]. Therefore, the contributions of this work are threefold:

1. Creating synthetic measurements based on weights obtained from machine learning linear regression prediction;
2. Developing an overdetermined physics-based model for parameter FDI correction;
3. Incorporating synthetic measurements in the parameter FDI correction model.

The remainder of this paper is organized as follows. Section 2 provides theoretical background on state estimation with synthetic measurements, the machine learning (ML) model for measurement prediction and the correction model for unbalanced parameter FDI attacks. Section 3 presents the parameter FDI correction physics-based model. Test results of a case study are shown in Section 4. Finally, Section 5 presents conclusions and remarks of this work.

State Estimation Augmented with Synthetic Measurements
State estimation aims at solving a set of non-linear algebraic differentiable equations that have the following form [33]:

$$z = h(x) + e \tag{1}$$

where $z \in \mathbb{R}^m$ is the measurement vector, $x \in \mathbb{R}^N$ is the state variables vector, $h(x): \mathbb{R}^N \to \mathbb{R}^m$ is a non-linear differentiable function that relates the states to the measurements, $e$ is the measurement error vector, assumed to have zero mean, standard deviation $\sigma$ and a Gaussian probability distribution, and $N = 2n - 1$ is the number of unknown state variables, with $n$ the number of buses. Hence, in the weighted least squares state estimation (WLS SE), the approach consists of solving the following minimization problem:

$$\min_x J(x) = [z - h(x)]^T W [z - h(x)] \tag{2}$$

where $W$ is a diagonal weight matrix composed of the inverse of the squared values of the measurement standard deviations ($\sigma$), $W = \mathrm{diag}(\sigma_1^{-2}, \ldots, \sigma_m^{-2})$, and the index $J(x)$ is a norm in the measurements vector space.
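As an illustration of the WLS SE problem in (2), the following is a minimal Gauss-Newton sketch. The function name `wls_state_estimation` and the two-state toy measurement model are hypothetical and are not a power network model; they only show the structure of the iteration under the assumption of a diagonal weight matrix built from the measurement standard deviations.

```python
import numpy as np

def wls_state_estimation(z, h, jac, sigma, x0, tol=1e-8, max_iter=50):
    """Gauss-Newton iteration for min_x [z - h(x)]^T W [z - h(x)]."""
    W = np.diag(1.0 / np.asarray(sigma, dtype=float) ** 2)  # inverse variances
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        r = z - h(x)                       # measurement residual
        H = jac(x)                         # measurement Jacobian at x
        G = H.T @ W @ H                    # gain matrix
        dx = np.linalg.solve(G, H.T @ W @ r)
        x += dx
        if np.max(np.abs(dx)) < tol:
            break
    r = z - h(x)
    return x, float(r @ W @ r)             # estimate and objective J(x)

# Hypothetical toy model: 4 measurements, 2 states (illustration only).
def h_toy(x):
    return np.array([x[0], x[1], x[0] * x[1], x[0] ** 2])

def jac_toy(x):
    return np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [x[1], x[0]],
                     [2.0 * x[0], 0.0]])

x_true = np.array([1.05, 0.98])
z_toy = h_toy(x_true)                      # noise-free measurements
sigma_toy = np.full(4, 0.01)
x_hat, J = wls_state_estimation(z_toy, h_toy, jac_toy, sigma_toy, x0=np.ones(2))
```

With noise-free measurements the estimate recovers `x_true` and `J` is numerically zero; with noisy measurements, `J` is the quantity that would be compared against a detection threshold.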
The measurement model in (1) relies on the characteristics of the grid, i.e., connectivity and system parameters. If corrupted data is used, then the obtained solution will be physically incorrect, and could potentially mislead the operators who monitor the grid.
In view of the minimization problem in (2), WLS SE minimizes the error described in (1), under the assumption that the residual tends to follow a normal distribution. From the Central Limit Theorem [33], when a large number of independent random variables that follow any distribution with bounded variance are added, their properly normalized sum tends to approximate a normal distribution. Therefore, for detecting errors using classical WLS, the hypothesis test relies solely on the $\chi^2$ distribution. If the $\chi^2$ distribution does not approximate a normal distribution, then the hypothesis test will fail. In order for the $\chi^2$ distribution to approximate a normal distribution, the degrees of freedom of the measurement model need to be increased. Increasing the degrees of freedom comes with the financial cost of enlarging the measurement set through additional meters.
Ideally, one would want to have a measurement reading in every bus and line section. However, this is not realistic, considering the inherent financial cost. Instead, measurements can be created artificially at locations where no real-life measurement or historical data exist. These measurements, named synthetic measurements, were modeled in [32]. One should not confuse synthetic measurements with pseudo measurements, which are created considering available historical data [33]. The main idea of creating synthetic measurements is to approximate the residual of the measurement model to a normal distribution. In doing so, not only parameter FDI correction is enhanced, as will be presented, but also the global redundancy level of the system increases, which enhances gross error detection and identification [32].

Linear Regression Prediction Model
To select the data-driven prediction model, the performance of several options was compared, including linear and non-linear formulations. Test results indicated that a linear regression model was sufficient to yield good prediction performance. In addition, it is a simpler model with fewer hyperparameters than other formulations. Multiple linear regression models can be used to estimate SG measurements. The justification for using multiple linear regression is that it is a simple model with few parameters to train. It has been shown that the linear model can achieve satisfactory prediction performance on daily load data [29]. Historical measurement values are used as input features to train these multiple linear regression models. During training, the ML model takes historical data (measurement values from the past K days) as inputs and the current day measurements as the target. The training process returns model coefficients that can be used to generate synthetic measurement values from historical measurements. For properly scaled input features, the coefficients estimated by the multiple linear regression model also provide an easy way to understand the contribution of each input feature in estimating the target.
Let D be the number of measurement values and K the number of past days used as input features for the regression models. Thus, D linear regression models can be trained, corresponding to the D measurements. The corresponding measurements from the K past days are used as input features for each regression model:
$$y_d = \beta_{d0} + \sum_{k=1}^{K} \beta_{dk}\, x_{dk}$$

where the dependent variable $y_d \in \mathbb{R}^{D \times 1}$ is a vector that contains the d-th measurement values for the current day, the independent variable $x_{dk} \in \mathbb{R}^{D \times 1}$ is a vector that contains the d-th measurement values from the k-th past day, and $\beta_{d0}, \ldots, \beta_{dK}$ are the K + 1 regression coefficients of the d-th model. We create a weight matrix $W_{ml}$ based on the values of the regression model coefficients of each measurement. $W_{ml} \in \mathbb{R}^{D \times D}$ is used to weigh the measurements in the state estimation process and it is calculated from $\beta$ as:

$$W_{ml}(i,j) = \begin{cases} \beta_i, & i = j \\ 0, & i \neq j \end{cases}$$

where $i, j = 1, 2, 3, \ldots, D$ represents the measurement number and $\beta_i$ represents the average of the K + 1 regression coefficients of measurement i. As a greater number of days is used to create the historical data, the reliance of the model on any one day is reduced and the model develops the capability to capture the general measurement deviation patterns. This can improve the generalization ability of the model. We assume that the current day measurements do not deviate from the historical trend by a large margin and that the training data does not contain any anomalies. The historical data could be analyzed for the presence of anomalies through bad data detection techniques such as [18], so that only data free of bad data is used for ML model training.
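The training step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it swaps in NumPy ordinary least squares for the scikit-learn regression mentioned in the case study, the array layout (`past_days` with shape K × samples × D) and the function name `regression_weights` are hypothetical, and the diagonal form of $W_{ml}$ is an assumption based on the description of $\beta_i$ as the average of the K + 1 coefficients of measurement i.

```python
import numpy as np

def regression_weights(past_days, current_day):
    """Fit one linear model per measurement and build the weight matrix.

    past_days:   array (K, S, D), K past days, S samples per day, D measurements
    current_day: array (S, D), target values for the current day
    Returns the coefficients (D, K + 1) and the assumed diagonal W_ml (D, D).
    """
    K, S, D = past_days.shape
    coeffs = np.empty((D, K + 1))
    for d in range(D):
        X = past_days[:, :, d].T                    # (S, K) features for model d
        A = np.column_stack([np.ones(S), X])        # prepend intercept column
        coeffs[d] = np.linalg.lstsq(A, current_day[:, d], rcond=None)[0]
    W_ml = np.diag(coeffs.mean(axis=1))             # average of K + 1 coefficients
    return coeffs, W_ml

# Synthetic demonstration data (hypothetical values, exact linear relation).
rng = np.random.default_rng(0)
K, S, D = 3, 200, 4
past = rng.normal(1.0, 0.1, size=(K, S, D))
current = 0.2 + 0.5 * past[0] + 0.3 * past[1] + 0.1 * past[2]
coeffs, W_ml = regression_weights(past, current)
```

Because the demonstration target is an exact linear combination of the past days, every row of `coeffs` recovers the intercept 0.2 and slopes 0.5, 0.3, 0.1, so each diagonal entry of `W_ml` equals their mean, 0.275.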

Unbalanced Parameter FDI Attack Correction Model
In (1), the possibility of errors in the parameter data is not considered. If one instead considers $z = h(x, p) + e$, where $p$ is the parameter in error, this function can be expanded into a Taylor series around the database value $p_0$ [19]:

$$z = h(x, p_0) + \left.\frac{\partial h}{\partial p}\right|_{p_0} \Delta p + e \tag{5}$$

where $\Delta p$ denotes the parameter error. From (5), the parameter error can be calculated as follows:

$$\Delta p = H_{p,0}^{-1}\,[z - h(x, p_0)] \tag{6}$$

where $H_{p,0} = \left.\partial h / \partial p\right|_{p_0}$ denotes the Jacobian of the parameter. All the quantities are known, so the parameter error can be calculated by (6), which is called the relaxed model here, since it considers the measurement to be without error. With this model, the parameter error can be corrected iteratively by using the measured value of the reactive power flow on the line where the parameter attack happened. However, the system net parameter values include three components: the series conductance $g$, the series susceptance $b$ and the shunt susceptance $b^{sh}$. Since the relative weights of these three components are fixed by the network parameter database, this model can only correct the parameter error when all components are attacked by exactly the same percentage, for example 10% on $g$, 10% on $b$ and 10% on $b^{sh}$. Unbalanced FDI in parameter values, for example 30% on $g$, 20% on $b$ and 10% on $b^{sh}$, is thus not handled by this model. To address this issue, an unbalanced correction model is presented as follows [31]:

$$\begin{bmatrix} \Delta g_{km} \\ \Delta b_{km} \\ \Delta b^{sh}_{km} \end{bmatrix}^{n} = \tau^{-1} \begin{bmatrix} Z_{P_{k-m(loss)}} - h^{n}_{P_{k-m(loss)}} \\ Z_{P_{k-m}} - h^{n}_{P_{k-m}} \\ Z_{Q_{k-m}} - h^{n}_{Q_{k-m}} \end{bmatrix} \tag{7}$$

where the parameter correction Jacobian matrix $\tau$ is defined as:

$$\tau = \begin{bmatrix} |E_k - E_m|^2 & 0 & 0 \\ V_k^2 - V_k V_m \cos\theta_{km} & -V_k V_m \sin\theta_{km} & 0 \\ -V_k V_m \sin\theta_{km} & -V_k^2 + V_k V_m \cos\theta_{km} & -V_k^2 \end{bmatrix} \tag{8}$$

In (7), $n$ denotes the iteration index; $Z_{P_{k-m(loss)}}$, $Z_{P_{k-m}}$ and $Z_{Q_{k-m}}$ are the recorded measurement values of real power loss, real power flow and reactive power flow for the FDI attacked line from bus k to bus m; $h^{n}_{P_{k-m(loss)}}$, $h^{n}_{P_{k-m}}$ and $h^{n}_{Q_{k-m}}$ denote the continuous nonlinear differentiable functions of these three quantities at the n-th iteration. In (8), the parameter correction Jacobian matrix $\tau$ uses the squared magnitude of the voltage drop $|E_k - E_m|^2$, the voltage magnitudes $V_k$, $V_m$ and the phase difference $\theta_{km} = \theta_k - \theta_m$ of buses k and m to perform parameter correction.
One can see from (7) and (8) that the parameter errors of the series conductance $\Delta g_{km}$, series susceptance $\Delta b_{km}$ and shunt susceptance $\Delta b^{sh}_{km}$ are corrected by using three measurements. However, two issues may cause this process to fail: (1) the measurement set needed by this model may be incomplete; (2) the conductance error $\Delta g_{km}$ can only be estimated from the corresponding real power loss measurement, which is relatively small in a Transmission Line (TL), so an incorrectly estimated $\Delta g_{km}$ will cause a wrong estimation of the series susceptance error $\Delta b_{km}$ and the shunt susceptance error $\Delta b^{sh}_{km}$. To address these issues, a synthetic measurement enhanced parameter correction model is presented in this paper.
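The deterministic correction in (7) and (8) can be sketched as below. The sketch assumes the standard π-model flow expressions with series admittance g + jb and a shunt susceptance b_sh at each line terminal, and it holds the bus voltages fixed during the correction; both are assumptions of this illustration, and the function names and numerical values are hypothetical.

```python
import numpy as np

def flows(p, Vk, Vm, th):
    """Line flows for parameters p = (g, b, b_sh); th is theta_k - theta_m."""
    g, b, bsh = p
    c, s = np.cos(th), np.sin(th)
    P_km = g * Vk**2 - Vk * Vm * (g * c + b * s)
    P_mk = g * Vm**2 - Vk * Vm * (g * c - b * s)
    Q_km = -(b + bsh) * Vk**2 + Vk * Vm * (b * c - g * s)
    return P_km, P_mk, Q_km

def h3(p, Vk, Vm, th):
    """Model values of [real power loss, P_km, Q_km]."""
    P_km, P_mk, Q_km = flows(p, Vk, Vm, th)
    return np.array([P_km + P_mk, P_km, Q_km])

def tau(Vk, Vm, th):
    """3x3 Jacobian of [P_loss, P_km, Q_km] with respect to (g, b, b_sh)."""
    c, s = np.cos(th), np.sin(th)
    dE2 = Vk**2 + Vm**2 - 2.0 * Vk * Vm * c          # |E_k - E_m|^2
    return np.array([[dE2, 0.0, 0.0],
                     [Vk**2 - Vk * Vm * c, -Vk * Vm * s, 0.0],
                     [-Vk * Vm * s, -Vk**2 + Vk * Vm * c, -Vk**2]])

# Hypothetical line and operating point (per unit, radians).
Vk, Vm, th = 1.02, 0.99, 0.08
p_true = np.array([2.0, -8.0, 0.02])
p_attacked = p_true * np.array([0.82, 1.12, 0.94])   # unbalanced attack
z = h3(p_true, Vk, Vm, th)                           # noise-free measurements

dp = np.linalg.solve(tau(Vk, Vm, th), z - h3(p_attacked, Vk, Vm, th))
p_corrected = p_attacked + dp
```

With noise-free measurements and fixed voltages the flows are linear in (g, b, b_sh), so one step restores `p_true`. The first row of `tau` contains only the small quantity $|E_k - E_m|^2$, which illustrates why a noisy loss measurement translates into a large conductance error in this square system.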

Framework for Parameter FDI Correction
The parameter FDI correction framework is illustrated in Figure 2. Input data consists of historical measurements of the system for the past days, recorded measurements from meters, parameter data and system topology, such as connectivity status. The Machine Learning (ML) model processes measurements from past days to predict the measurements of the current day. The resulting weights obtained from the ML model are used in the stage-I WLS state estimation, which generates the SM. Meanwhile, gross error analysis is performed in the stage-II WLS state estimation. In this stage, FDI detection and identification take place [18,19]. Upon detecting and identifying parameter attacks in the network, a parameter FDI correction model is solved. Once the parameter correction is performed, the system model is updated. The process of creating SM is referred to as stage-I WLS in Figure 2. In this stage, existing measurements (bus power injection, power flow and voltage measurements) are weighted based on the weights derived from the ML models described in Section 2.2. Upon WLS SE convergence, the power flow estimates in lines with no real-life measurement recorded are used as SM. The idea here is to artificially increase the local measurement redundancy with linearly independent information, thus providing the necessary condition for the development of an overdetermined correction model for FDI cyber-attacks on the database of network parameters. Hence, for every line section, there will be power flow measurement readings from both ends of the line. These SM, in addition to the existing scan of measurements, are processed in the FDI correction model described in Section 3.2.

Overdetermined Parameter FDI Correction Model
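Consider the conjugate of the complex power flow [33]:

$$S^{*}_{k-m} = P_{k-m} - jQ_{k-m} = E^{*}_{k} I_{k-m} = y_{km} V_k e^{-j\theta_k}\left(V_k e^{j\theta_k} - V_m e^{j\theta_m}\right) + j b^{sh}_{km} V_k^2 \tag{9}$$

where $y_{km} = g_{km} + j b_{km}$. The expressions for real and reactive power flows can be obtained by identifying the corresponding coefficients of the real and the imaginary parts of (9):

$$P_{k-m} = g_{km} V_k^2 - V_k V_m \left(g_{km}\cos\theta_{km} + b_{km}\sin\theta_{km}\right) \tag{10}$$

$$Q_{k-m} = -\left(b_{km} + b^{sh}_{km}\right) V_k^2 + V_k V_m \left(b_{km}\cos\theta_{km} - g_{km}\sin\theta_{km}\right) \tag{11}$$

Through (10), one can derive the real power loss of a line:

$$P_{k-m(loss)} = P_{k-m} + P_{m-k} = g_{km}\,|E_k - E_m|^2 \tag{12}$$

Through (11), one can derive the reactive power loss of a line:

$$Q_{k-m(loss)} = Q_{k-m} + Q_{m-k} = -b^{sh}_{km}\left(V_k^2 + V_m^2\right) - b_{km}\,|E_k - E_m|^2 \tag{13}$$

Equations (10)-(13) provide a model which correlates the real power losses, reactive power losses, real power flows and reactive power flows with the system net parameters, as in (14) and (15):

$$\begin{bmatrix} P_{k-m(loss)} \\ Q_{k-m(loss)} \\ P_{k-m} \\ P_{m-k} \\ Q_{k-m} \\ Q_{m-k} \end{bmatrix} = \tau_{up} \begin{bmatrix} g_{km} \\ b_{km} \\ b^{sh}_{km} \end{bmatrix} \tag{14}$$

where $\tau_{up}$ is defined as:

$$\tau_{up} = \begin{bmatrix} |E_k - E_m|^2 & 0 & 0 \\ 0 & -|E_k - E_m|^2 & -\left(V_k^2 + V_m^2\right) \\ V_k^2 - V_k V_m \cos\theta_{km} & -V_k V_m \sin\theta_{km} & 0 \\ V_m^2 - V_k V_m \cos\theta_{km} & V_k V_m \sin\theta_{km} & 0 \\ -V_k V_m \sin\theta_{km} & -V_k^2 + V_k V_m \cos\theta_{km} & -V_k^2 \\ V_k V_m \sin\theta_{km} & -V_m^2 + V_k V_m \cos\theta_{km} & -V_m^2 \end{bmatrix} \tag{15}$$

It is important to note that in this framework, all six of the measurements in (14) are always used in the correction process, but they may not all be real measurements. Any measurements that are not available through a sensor are calculated as SM, as described earlier. For example, if $P_{k-m}$ and $Q_{k-m}$ are the only true measurements, $P_{m-k}$ and $Q_{m-k}$ will be SM. $P_{k-m(loss)}$ will then be calculated from $P_{k-m}$ and $P_{m-k}$, and $Q_{k-m(loss)}$ from $Q_{k-m}$ and $Q_{m-k}$.
By linearizing (14) through a Taylor series, considering a Newton-Raphson method at the n-th iteration:

$$\tau_{up}^{n}\, \Delta p^{n} = A^{n}, \qquad \Delta p = \begin{bmatrix} \Delta g_{km} & \Delta b_{km} & \Delta b^{sh}_{km} \end{bmatrix}^{T} \tag{16}$$

where $A$ is the residual of the set of measurements associated with the correction of line parameters, as follows:

$$A^{n} = \begin{bmatrix} Z_{P_{k-m(loss)}} - h^{n}_{P_{k-m(loss)}} \\ Z_{Q_{k-m(loss)}} - h^{n}_{Q_{k-m(loss)}} \\ Z_{P_{k-m}} - h^{n}_{P_{k-m}} \\ Z_{P_{m-k}} - h^{n}_{P_{m-k}} \\ Z_{Q_{k-m}} - h^{n}_{Q_{k-m}} \\ Z_{Q_{m-k}} - h^{n}_{Q_{m-k}} \end{bmatrix} \tag{17}$$

The expression in (15) shows that augmenting $\tau_{up}$ and the measurement set changes the model in (8) into an overdetermined system of nonlinear algebraic equations. Model (16) can be solved considering the minimization problem:

$$\min_{\Delta p} J(\Delta p) = \left[A - \tau_{up}\Delta p\right]^{T} W_p \left[A - \tau_{up}\Delta p\right] \tag{18}$$

The classical WLS SE solution for (18) is:

$$\Delta p = \left(\tau_{up}^{T} W_p\, \tau_{up}\right)^{-1} \tau_{up}^{T} W_p\, A \tag{19}$$

where $W_p$ is the weight matrix for parameter correction, in which $\sigma_r$ in each element denotes one standard deviation of the corresponding residual at each iteration:

$$W_p = \mathrm{diag}\left(\sigma_{r,1}^{-2}, \ldots, \sigma_{r,6}^{-2}\right) \tag{20}$$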
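The overdetermined correction can be sketched as a six-measurement, three-parameter WLS problem. The sketch below assumes the standard π-model flow expressions (series admittance g + jb, shunt susceptance b_sh at each terminal), a row ordering of [P_loss, Q_loss, P_km, P_mk, Q_km, Q_mk], fixed bus voltages, and user-supplied residual standard deviations for the weights; these choices, the function names and the numerical values are assumptions of this illustration rather than details confirmed by the text.

```python
import numpy as np

def tau_up(Vk, Vm, th):
    """6x3 matrix relating [P_loss, Q_loss, P_km, P_mk, Q_km, Q_mk] to (g, b, b_sh)."""
    c, s = np.cos(th), np.sin(th)
    dE2 = Vk**2 + Vm**2 - 2.0 * Vk * Vm * c           # |E_k - E_m|^2
    return np.array([
        [dE2, 0.0, 0.0],
        [0.0, -dE2, -(Vk**2 + Vm**2)],
        [Vk**2 - Vk * Vm * c, -Vk * Vm * s, 0.0],
        [Vm**2 - Vk * Vm * c, Vk * Vm * s, 0.0],
        [-Vk * Vm * s, -Vk**2 + Vk * Vm * c, -Vk**2],
        [Vk * Vm * s, -Vm**2 + Vk * Vm * c, -Vm**2],
    ])

def correct_parameters(z, p_db, Vk, Vm, th, sigma_r):
    """One WLS correction step: dp = (T^T W T)^-1 T^T W A."""
    T = tau_up(Vk, Vm, th)
    A = z - T @ p_db                                  # residual at database values
    W = np.diag(1.0 / np.asarray(sigma_r, dtype=float) ** 2)
    dp = np.linalg.solve(T.T @ W @ T, T.T @ W @ A)
    return p_db + dp

# Hypothetical line and operating point (per unit, radians).
Vk, Vm, th = 1.02, 0.99, 0.08
p_true = np.array([2.0, -8.0, 0.02])
p_db = p_true * np.array([1.13, 0.93, 1.08])          # unbalanced attack on database

# Real and synthetic measurements stacked together (noise-free here).
z = tau_up(Vk, Vm, th) @ p_true
p_corrected = correct_parameters(z, p_db, Vk, Vm, th, sigma_r=np.full(6, 0.01))
```

In this noise-free case the correction recovers `p_true` exactly. With noisy inputs, the redundancy of six equations for three unknowns lets the least-squares fit average out part of the error, which is the motivation for moving from the square system to the overdetermined one.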

Case Study
The presented model is validated using the IEEE 118-bus system. Topology and parameters of the IEEE 118-bus system are found in [34]. With the aid of MATPOWER [35], a measurement set is obtained, which consists of 712 measurements, leading to a global redundancy level (GRL) of 3.029. Gaussian noise with zero mean and known variance is added to the measurement set. In addition, a measurement dataset corresponding to eight consecutive days, each containing 21,600 samples, is generated based on a common daily load profile that contains temporal information on the changing state of a power system. This dataset is fed to the machine learning models for measurement prediction and for deriving the measurement weights. It is worth noting that different noise levels were used for generating multiple datasets. Real line power flows, reactive line power flows, bus power injections and voltage magnitudes are included in each measurement set. The machine learning algorithm was implemented and evaluated using Python libraries such as NumPy [36], SciPy [37], Matplotlib [38] and Scikit-learn [39]. The first 50% of the samples in each day are used to train the multiple linear regression models.
In the following, two different parameter FDI attack scenarios are presented. In each scenario, an attack is flagged if the objective function $J(x)$ is above the threshold value $C = \chi^2_{p,dof}$. Identification is performed by building a descending list of $CME^N$ values (based on their absolute values) [19]. In the correction step, the presented model in Section 2.1 is solved.
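The redundancy figures can be checked with a few lines of arithmetic, taking the GRL as the ratio of measurements to state variables with N = 2n − 1 states. The 358 synthetic measurements used here are the count reported in Scenario I; the variable names are hypothetical.

```python
n_buses = 118
n_states = 2 * n_buses - 1            # N = 2n - 1 = 235 state variables

n_real = 712                          # real measurements in the case study
n_synthetic = 358                     # synthetic measurements reported in Scenario I

grl_real = n_real / n_states                          # about 3.03
grl_with_sm = (n_real + n_synthetic) / n_states       # about 4.55
```

The two ratios agree with the reported increase of the GRL from 3.029 to 4.55 once the synthetic measurements are added.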
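The detection test can be sketched as a comparison of $J(x)$ with a chi-square quantile. The text does not state the probability or the degrees of freedom behind the threshold; the sketch below assumes p = 0.95 and 712 degrees of freedom (the number of measurements), a combination that happens to land close to the reported value of 775.1861. It uses the Wilson-Hilferty approximation of the quantile so that only the Python standard library is needed; the function names are hypothetical.

```python
from statistics import NormalDist

def chi2_threshold(dof, p=0.95):
    """Wilson-Hilferty approximation of the chi-square quantile at probability p."""
    z = NormalDist().inv_cdf(p)               # standard normal quantile
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * a ** 0.5) ** 3

def attack_detected(J, dof, p=0.95):
    """Flag an attack when the WLS objective exceeds the threshold."""
    return J > chi2_threshold(dof, p)

C = chi2_threshold(712)                        # assumed dof and p, see lead-in

# Objective function values reported in the two scenarios.
flags = [attack_detected(J, 712) for J in (1246.587, 684.5437, 837.6015)]
```

Under these assumptions the three reported objective values are classified as attack, no attack, and attack, matching the decisions described in the scenarios.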

Parameter Attack Scenario I
In scenario I, an unbalanced parameter FDI attack is injected into the series and shunt parameters of line 23-32 (−18% on parameter $g$, 12% on parameter $b$, −6% on parameter $b^{sh}$) on the IEEE 118-bus system.
In this case, the parameters of line 23-32 are attacked. The first process of the framework is FDI detection. Results are presented in Table 1. One can see that the objective function $J(x)$ is 1246.587, which is higher than the threshold value ($C = \chi^2 = 775.1861$); thus a cyber-attack is detected. For identification, a descending list of $CME^N$ values is built. From the resulting list, the largest absolute value of $CME^N$, which is 8.4156, is related to the reactive power injection at bus 23. In addition, one can see that the $CME^N$ values of the corresponding reactive power flow $Q_{32-23}$, real power flow $P_{23-32}$ and reactive power injection $Q_{32}$ are also above the threshold value ($\beta = 3$). This situation is characterized as a parameter attack on line 23-32 [19]. For parameter correction, the process described in Section 2.1 is implemented. The results are shown in Table 2. In the state-of-the-art parameter correction model [31] (also presented in (7)), one can clearly see that the model requires at least three different real-life measurements: two real power flows $P_{km}$, $P_{mk}$ and one reactive power flow measurement $Q_{km}$. However, there is no guarantee that the required measurements will exist. For example, in the current testing measurement configuration, only one real power flow measurement, $P_{23-32}$, and one reactive power flow measurement, $Q_{32-23}$, are assumed to exist. Therefore, the correction model in (7) is unable to perform the expected correction due to the unavailability of the power flow measurement $P_{32-23}$. To resolve this issue, in the presented framework, synthetic measurements are considered only for unavailable measurements. These synthetic measurements are generated by running WLS SE using weights obtained from machine learning linear regression prediction. For this task, no gross error detection analytics is performed. In this case, 358 SMs are generated (adding 358 SMs increases the GRL from 3.029 to 4.55).
Then, two synthetic measurements, $P_{32-23}$ and $Q_{23-32}$, are taken from the SM dataset and added to the measurement set. Parameter correction is performed by using the two existing measurements and the two synthetic measurements in the presented parameter FDI correction model (18). Results of this correction are presented in Figure 3. As one can see, the parameter correction process converges at the 26th iteration, with approximation errors for $g_{23-32}$, $b_{23-32}$ and $b^{sh}_{23-32}$ of 0.525%, 0.183% and 0.340%, respectively, which are all lower than those of the state-of-the-art model in [31]. After the parameter correction is obtained, a new state estimation process is performed, in which the objective function $J(x)$ is found to be 684.5437, which is lower than the threshold value C. Hence, no FDI attack is detected.

Parameter Attack Scenario II
In scenario II, two simultaneous attacks are injected on the IEEE 118-bus system:

1. A measurement cyber-attack of magnitude 5σ is added to the reactive power flow from bus 31 to bus 17 ($Q_{31-17}$).
2. An unbalanced parameter FDI attack is injected into the series and shunt parameters of line 47-69 (13% on parameter $g$, −7% on parameter $b$, 8% on parameter $b^{sh}$).

In this scenario, the attack is detected and the result is shown in Table 3, where the objective function is higher than the C value for this scenario ($C = \chi^2 = 775.1861$). For the identification step, a descending list of $CME^N$ values is built and shown in the same table. The largest $CME^N$ values (in absolute value) are associated with the reactive power injection $Q_{47}$, real power injection $P_{47}$, real power flow $P_{69-47}$ and reactive power flow $Q_{69-47}$. This scenario characterizes a parameter cyber-attack on line 47-69 [19]. After identification, the net system parameters are corrected using the presented model in (18). Still, only the flows $P_{69-47}$ and $Q_{69-47}$ are provided in the current measurement configuration. The lack of the real power flow measurement $P_{47-69}$ prevents the use of the state-of-the-art model in (7), since the incomplete information prevents this model from calculating the real power loss required in (7). However, with the synthetic measurements $Q_{47-69}$ and $P_{47-69}$ provided, one is able to use the presented overdetermined model in (18) to perform parameter correction. The correction converges after 16 iterations, as illustrated in Figure 4, and the corrected values and comparable results are presented in Table 4. The system net parameters $g_{47-69}$, $b_{47-69}$ and $b^{sh}_{47-69}$ have approximation errors of 0.499%, 0.241% and 0.092% after convergence. After correction, a new state estimation is performed and an objective function value of 837.6015 is obtained, which is still higher than the threshold C, as shown in Table 5. As seen, the only $CME^N$ value (in absolute value) above the threshold is that of the reactive power flow $Q_{31-17}$. Therefore, the measurement $Q_{31-17}$ is in error. The correction of measurements, as shown in the flowchart, is performed using their CNE values. The corrected measurement is shown in Table 6.
After re-running the state estimator, the $\chi^2$ value is smaller than C; thus no further FDI attack is detected.
To further evaluate the robustness of the presented model, different measurement noise levels are simulated. A combined parameter error metric, $e_p$, is used to illustrate the total parameter error after correction; $e_p$ represents the weighted norm of the sum of all parameter errors. A Monte Carlo simulation with 100 runs is performed and the average value is presented. Figure 5 shows a comparison between the state-of-the-art solution presented in (7) and the proposed model (18). One can see in Figure 5 that the error increases with the noise level when using the state-of-the-art solution. However, the proposed model provides an error below 0.07 under the different noise levels.
To further illustrate the robustness of the proposed solution under different noise levels, a comparison is presented in Figure 6. In Figure 6, the highest error $e_p$ of line 47-69 reaches 0.5 after system convergence when the noise level increases to 1 standard deviation using the state-of-the-art solution (7), while with the presented model (18) all values of $e_p$, under the different noise levels, are lower than 0.06.

Conclusions
In this paper, a physics-based model for the correction of malicious parameter FDI cyber-attacks is presented. An FDI framework is further presented to detect, identify and correct FDI attacks on measurements and on the database of network parameters. The proposed framework relies on creating synthetic measurements that are derived based on weights obtained from linear regression measurement prediction. Synthetic measurements, in addition to recorded measurements, are used in an overdetermined parameter correction model to estimate and correct network parameters. Simulation results show that the presented framework is able to obtain a parameter approximation error of less than 1% under different noise levels from 0 to 1% standard deviation, which outperforms the state-of-the-art solution. In addition to the robustness of the presented model, the framework can be easily integrated into classical WLS SE software without hard-to-design parameters, which highlights its potential for real-life implementation.

Conflicts of Interest:
The authors declare no conflict of interest.