Improving Detection of False Data Injection Attacks Using Machine Learning with Feature Selection and Oversampling

Critical infrastructures have recently been integrated with digital controls to support intelligent decision making. Although this integration provides various benefits and improvements, it also exposes the system to new cyberattacks. In particular, the injection of false data and commands into communication is one of the most common and fatal cyberattacks in critical infrastructures. Hence, in this paper, we investigate the effectiveness of machine-learning algorithms in detecting False Data Injection Attacks (FDIAs). In particular, we focus on two of the most widely used critical infrastructures, namely power systems and water treatment plants. This study focuses on tackling two key technical issues: (1) finding the set of best features under a different combination of techniques and (2) resolving the class imbalance problem using oversampling methods. We evaluate the performance of each algorithm in terms of time complexity and detection accuracy to meet the time-critical requirements of critical infrastructures. Moreover, we address the inherent skewed distribution problem and the data imbalance problem commonly found in many critical infrastructure datasets. Our results show that the considered minority oversampling techniques can improve the Area Under Curve (AUC) of GradientBoosting, AdaBoost, and kNN by 10–12%.


Introduction
Today, the umbrella term 'Industry 4.0' represents the integration of digital control, Information and Communications Technology (ICT), and intelligent decision-making into critical infrastructures. This upgrade is possible due to the amalgamation of information and industrial technologies into standard components and processes [1,2]. This shift, from the traditional system to Industry 4.0, has helped improve the overall performance and productivity of critical infrastructures that have become the fundamental building blocks of modern society. For instance, electricity distribution and usage can be optimized in smart grids. In water systems, timely data about usage and plant treatment capacity can reduce water wastage. Along with various benefits and improvements, the addition of new components into critical infrastructures presents new vulnerabilities [3][4][5]. Critical infrastructure is especially sensitive to cyberattacks. Even a low-scale attack that causes a few critical infrastructure components to malfunction can impact the whole system. For example, even a short disruption in the power grid can halt the functioning of many industries and infrastructures, from food processing plants to hospitals. The attack on the Ukrainian grid infrastructure and the recent ransomware attack on the Colonial Pipeline are some of the many alarming examples that call for improvements to the defense techniques which protect critical infrastructures [6,7].
The injection of data or commands at the source or during communication is collectively called a False Data Injection Attack (FDIA). Data injection refers to the manipulation
1. We provide a comprehensive analysis of machine-learning algorithms for FDIA detection using two representative datasets, namely the power system and water treatment datasets.
2. We determine the subset of features which can be used to achieve the best performance using different filter and wrapper approaches.
3. We mitigate performance bias in imbalanced datasets using four different oversampling methods.
The remainder of the paper is organized as follows. The related works are presented in Section 2. Section 3 gives a detailed explanation of the two critical infrastructures (power system and water treatment plant) from which the event data in the datasets were recorded. Section 4 presents the three feature selection approaches and provides a ranking of the features. The class imbalance issue and oversampling methods are discussed in Section 5. The results of training and testing and the outcome of oversampling are presented in Section 6. Lastly, Section 7 provides the conclusion and future research directions.

Related Work
In this section, we present the details of FDIA and other attacks targeting Cyber-Physical Systems (CPS) and provide a summary of existing FDIA detection methods based on machine learning. Further, the limitations and research gaps in the existing literature, which motivated the current study, are discussed.
With the fast transition of the traditional grid to the smart grid, the effective detection of FDIAs is critical to the success of the smart grid. Many FDIAs have been demonstrated in the literature. In the last five years (2015-2020), several surveys have discussed and comprehensively summarized the challenges and countermeasures regarding FDIA. The role and importance of Artificial Intelligence (AI) and big data technologies for FDIA detection were also highlighted [9,12,13,14]. The financial impact of FDIAs was demonstrated in [8]. The authors assumed an insider attack and simulated an injection attack by changing the value of a memory location of the Programmable Logic Controller (PLC). Experimental results showed that the injection attack could directly impact the electric usage billing system, generating a manipulated final bill.
Traditionally, state estimation and time-series analysis are the main methods used for FDIA detection. Recently, many AI-based approaches have been adapted to improve detection performance [15]. Class labeling and class-balanced datasets are two critical challenges for developing a machine learning-based FDIA detection system for the smart grid because of the small sample size of the FDIA class and the complexity of labeling. Maglaras et al. [16] used a One-Class Support Vector Machine (OCSVM) trained on normal events to resolve these two challenges for a Supervisory Control and Data Acquisition (SCADA)-based critical infrastructure. In addition to the challenges involved in dataset preparation, FDIA detection with minimum training and prediction time is required to handle the high rate of data generation in the smart grid. Reducing the feature vectors using Principal Component Analysis (PCA) and speeding up training using Distributed SVM have been used to meet the low computation requirements of the smart grid [17]. Further, FDIAs in the smart grid are grouped into 'direct' and 'stealth', where 'stealth FDIAs' are more challenging to detect than 'direct FDIAs'. Yan et al. [18] used supervised machine learning to build FDIA detection systems by formulating the detection as binary classification (direct and stealth). The authors also tested detection performance for balanced vs. imbalanced class distribution using the IEEE 30-bus simulation dataset. More recently, the Artificial Neural Network (ANN) has been applied for FDIA detection. Khanna et al. [4] used ANN and Extreme Learning Machine (ELM) to detect Data Injection Attacks on the consumer side of the smart grid and classified electric meters as either benign or malicious. The NYISO load data was mapped to an IEEE 14-bus system for performing simulation, experiments, and validation of results. Data generation sources in the smart grid can be grouped into cyber or physical space. Wang et al. [19] collected simulated and real-world measurements of synchronized PMUs and applied the Margin Setting Algorithm (MSA) for detection. An ensemble of Machine Learning (ML) algorithms was shown to improve the detection performance in [15]. In this direction, the performance of ensemble learning for multi-class classification was tested for a total of 37 classes, including FDIA, in [20]. The experiments were executed using a dataset containing measurements of four Phasor Measurement Units (PMUs) and network communication data to and from the firewall and IDS of the experimental power system [10]. FDIA detection is also formulated as a three-class problem, rather than a binary classification, in the literature. Panthi et al. [21] used machine-learning algorithms and the publicly available power dataset [10] to build a classifier to group events into natural, no-event, or attack classes.
A fingerprinting-based detection of stealthy cyber-attacks in water treatment plants was proposed in [22]. An IDS using a semi-supervised system for attack localization and deep neural network learning for anomaly detection was proposed in [23]. More recently, a two-level attack-detection framework using a decision tree for detection and deep learning for attribution was proposed in [24].
Based on the summary of existing literature, we observe that machine learning-based FDIA detection approaches can improve detection performance and address some of the key requirements, such as handling real-time large-scale data generation in the smart grid. Such improvement will promote machine learning models for FDIA detection in smart grids and other critical infrastructure. Our literature review also indicates that most existing research works have used the power system dataset and formulated FDIA classification as a multi-class problem. So, to explore a novel dimension, we consider FDIA classification as a binary problem and performed the necessary pre-processing on the dataset to experiment under various environments. The power system dataset [10] also contains a binary version that formulates the classification as 'Attack' and 'Normal'. In contrast, the classification problem was formulated as 'FDIA' and 'non-FDIA' in this study. Feature selection is useful and required to reduce time complexity but is seldom used with critical infrastructure datasets. Therefore, we experimented with feature selection methods and machine-learning algorithms. We aim to find the best performance of classifiers given the selected features.
As shown in Table 1, the data imbalance issue is rarely addressed. Therefore, we also performed minority class oversampling to balance the class distribution beyond identifying the imbalanced dataset's effect on detection accuracy.

Critical Infrastructure Experimental Framework
Critical infrastructure is any physical infrastructure, such as a power system, healthcare [25], or gas pipeline, that is essential to support our daily lives [26]. Hence, disturbance to these systems has a huge impact on society, the economy, and the environment [27]. Recently, the development of computer and network technologies has enabled the fast adoption of ICT in such critical infrastructures. For example, the traditional power grid is now controlled, operated, and monitored using ICT, turning the century-old power grid into a smart grid. Such integration is also evident in almost all critical infrastructures. However, cyber involvement in the physical system makes it vulnerable to various cyberattacks, such as FDIA and unauthorized access. In our study, we considered FDIA detection in power systems and water treatment plants using two very popular open datasets.

The Experimental Framework for Power System Data
Power generation, storage, and distribution tasks are continuously performed in the smart grid. Its complexity and large scale make it infeasible to experiment with the real infrastructure. Moreover, data access in the smart grid environment is highly restricted due to privacy and security concerns. To accommodate these limitations, research is often performed on a reduced scale or using simulated datasets, such as the IEEE 14/30-bus systems. In 2015, Mississippi State University and Oak Ridge National Laboratory (https://www.sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets (accessed on 8 November 2021)) built a scaled-down version of the power system and recorded a dataset with various simulated attacks in addition to normal events [10]. The experimental power system has two power generators (G1 and G2), four Intelligent Electronic Devices (IEDs) (R1 to R4), and four breakers (BR1 to BR4). Two lines were created in the power system using the pairs of breakers (BR1 and BR2; BR3 and BR4). The four IEDs, R1 to R4, were configured to open or close the four breakers, BR1 to BR4, respectively. A server in the control room controls the physical part of the power framework, and these cyber and physical parts are connected using a switch and a Power Distribution Center (PDC).

Dataset Pre-Processing
The power system environment discussed in Section 3.1 helped to create a suitable dataset for conducting machine learning-based detection experiments [10]. The complete dataset is distributed as 15 sets. The dataset comprises 37 power system events that can be grouped into three main scenarios: (1) Natural, (2) No Events, and (3) Attack Events (data injection and command injection), containing 8, 1, and 28 events, respectively. Regrouping and resampling are performed using these three types of events, and three datasets are created for binary, three-class, and multi-class classification. For the multi-class classification, each event type is considered a class; therefore, it has 37 classes. The binary and three-class datasets are distributed in CSV file format. However, the multi-class dataset is available only as an ARFF file (Attribute-Relation File Format (ARFF) is a file format created for the Waikato Environment for Knowledge Analysis (WEKA) tool, a Graphical User Interface (GUI) tool for performing machine learning tasks such as pre-processing, training, exporting models, and creating an ML pipeline).
Building a multi-class machine learning classifier is complex and resource-consuming. It also introduces a class imbalance problem into the dataset. Considering this, we reformulated the FDIA detection as a binary classification with 'FDIA' and 'Non-FDIA' classes. However, the existing dataset was unsuitable for this study, so we grouped samples based on the type of events. Before resampling, we converted the multi-class ARFF to CSV format to simplify further pre-processing, training, and testing. Filtering and merging were performed on all 15 sets to group all scenarios into two predefined classes: the 'Normal/non-FDIA' classes were 1-6, 13, 14, and 41, while the 'FDIA' classes were 7-12. Further, each Normal/non-FDIA sample was labeled as 0, and each FDIA sample was labeled as 1. The final pre-processed and resampled dataset contained a total of 32,296 samples. Figure 1 shows the class distribution. There are 22,714 samples in the 'Normal/non-FDIA' class and 9582 in the 'FDIA' class. As shown in Figure 1 and listed in Table 2, the power system dataset is an imbalanced dataset where 'Normal/non-FDIA' is the majority class. The four relay impedance features of the IEDs, such as 'R1-PA:Z', contained infinite values, which were replaced with 0 as a pre-processing step.
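The regrouping and cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code: the scenario groups are taken from the text, while the column name `marker` for the scenario identifier, the helper `relabel`, and the toy rows are assumptions made for the example.

```python
import math

# Scenario IDs as grouped in the text above.
NON_FDIA_SCENARIOS = {1, 2, 3, 4, 5, 6, 13, 14, 41}
FDIA_SCENARIOS = {7, 8, 9, 10, 11, 12}


def relabel(rows, scenario_key="marker"):
    """Keep the listed scenarios, add a binary 'label', replace infinite values with 0.

    `scenario_key` is a hypothetical column name holding the scenario ID.
    """
    cleaned = []
    for row in rows:
        scenario = int(row[scenario_key])
        if scenario in FDIA_SCENARIOS:
            label = 1
        elif scenario in NON_FDIA_SCENARIOS:
            label = 0
        else:
            continue  # scenario not used in the binary formulation
        out = {}
        for name, value in row.items():
            if name == scenario_key:
                continue
            value = float(value)  # CSV readers return strings; "inf" parses to infinity
            out[name] = 0.0 if math.isinf(value) else value
        out["label"] = label
        cleaned.append(out)
    return cleaned


# Toy rows shaped like csv.DictReader output (values are made up).
rows = [
    {"R1-PA:Z": "inf", "R1-PA1:VH": "70.4", "marker": "7"},
    {"R1-PA:Z": "12.5", "R1-PA1:VH": "71.0", "marker": "41"},
    {"R1-PA:Z": "9.1", "R1-PA1:VH": "69.8", "marker": "20"},
]
print(relabel(rows))
```

Rows whose scenario falls outside both groups are dropped rather than assigned a default class, which keeps the two labels faithful to the grouping stated in the text.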

Description of Features
There is a total of 128 features in the dataset, consisting of PMU measurements and software logs. A total of 29 measurements were recorded for each PMU, so a total of 116 features were collected from the 4 PMUs. The logs were recorded from three sources: snort, relay, and control panel. Each had 4 values, so the logs contributed a total of 12 features. Each feature was given a name based on the combination of the source of the data and the type of value. For example, PMU features begin with R#−Signal Reference, and log features start with the source of the logs, such as snort, control_panel, and relay. The # for PMU features is a number between 1 and 4 indicating the PMU number, while the postfix Signal Reference is the type of measurement. These measurements fall into two groups: first, phase angle and magnitude for voltage and current, and, second, frequency, frequency delta, appearance impedance, and appearance impedance angle for relays. Details of these features are presented and explained in the original dataset description document (http://www.ece.uah.edu/~thm0009/icsdatasets/PowerSystem_Dataset_README.pdf (accessed on 8 November 2021)).
It is important to understand the impact of false data injection on individual features. We used a data distribution approach and created an overlapped histogram for individual features. For example, a histogram for the R1-PA1:VH feature is plotted for all normal and FDI samples. We can observe from Figure 2 that the count for a specific value range is higher in the FDI samples, which indicates data injection.

Water Treatment Plant
For the same reason as for the power system, a testbed, i.e., a scaled-down version of a real water treatment plant or pipeline, is normally created to experiment and collect data. In this study, we used the dataset from one such testbed, the Secure Water Treatment (SWaT) testbed [11] (a fully operational scaled-down water treatment plant), for FDIA detection. The configuration and framework of the experimental water treatment plant are depicted in Figure 3. It has six processing stages for water treatment, labeled P1-P6. In total, the testbed has 24 sensors, 27 actuators, and 6 PLCs (one for each stage). The count for each type of sensor and actuator is listed in Table 3.
Figure 3. Framework of the SWaT testbed [11], where P1-P6 denote the six stages of processing in the plant.

Dataset Pre-Processing
In the SWaT testbed, there are PLCs, Human Machine Interfaces (HMIs), SCADA, and a Historian in a layered communication network. Data from field devices are available to SCADA via PLCs and transferred to the Historian for analysis. The dataset contains events from physical and network activities under 36 predefined attacks. The complete dataset was collected over a period of 11 days, during which the plant was running continuously for 24 h each day.
Table 3. Description of the different sensors and actuators used for data generation. Features are named as a combination of type (i.e., MV, P, FIT, etc.) and suffix (process number and device number). For example, FIT-101 can be read as the first flow meter sensor of process stage 1.

Feature Description
The collected dataset contains 449,919 physical events and 51 features, generated mainly from 24 sensors and 27 actuators. Table 3 provides the details of the sensors and actuators used in the water treatment process. The network data are packets communicated between the PLCs and SCADA. They have 18 features based on network attributes such as date, time, IPs, etc. This sub-part of the dataset is not used in this study. The dataset was collected, stored, and distributed as CSV files. The attacks on both the physical and network parts were injection-type attacks, i.e., attacks on the values of sensors or actuators.
The power and water system datasets were obtained using different field devices and operational environments. In the power system dataset, the majority of features are measurements of PMUs, whereas, in the water treatment dataset the events were collected by sensors and actuators. Differences in the data source provide varying data types: PMUs provide voltage, current phase angles, and magnitude, while sensors and actuators provide numerical or Boolean values.

Feature Selection
A feature represents a characteristic of any object. In ML, a sample is decomposed into a set of features before training and testing for tasks, such as classification, prediction, or clustering. The dimension of a feature vector can be small to large, and each feature has unequal discriminative potential. So, there is a need to select the best possible set of features without significantly impacting the model's performance. The different feature-selection approaches provide various ways to rank and select a set of features. The selection of features is performed in relation to the output variable that can be a class for classification or the predictive variable. Feature selection provides two key benefits. First, it helps to improve the model's performance in terms of accuracy, precision, and recall. Second, it reduces the computation cost (time and space) for the training, testing, and deployment of ML models. As a result of these two benefits, feature selection (as a part of feature engineering) is critical in the ML model. Based on the technique of feature ranking and selection, various feature selection methods are grouped into three main classes: the filter, wrapper, and embedded methods [28].

Filter Method
The filter method examines the dependency between the features X and the class labels Y and selects features according to the strength of their relationship with Y. The strength of the dependency is calculated using traditional statistical tests, such as ANOVA, Z-test, T-test, chi-square, and the Pearson Correlation Coefficient. Because each feature is evaluated individually, the filter method is also called univariate selection; it is fast to compute and its results are easy to interpret [28]. In this study, the ANOVA F-value was used as the statistical test; the dataset features were ranked, and the best set of features was selected. Figure 4 and the left four columns of Table 4 show the results of feature selection using the filter method for the power system and the water treatment plant (SWaT dataset), respectively. The results are reported as feature name and score for the top ten and bottom ten features of the ranked list. In the case of the power system dataset, Figure 4 shows that the magnitude-related measurements of the PMUs achieve larger scores and are top-ranked by the filter approach. Based on the values of the magnitude features in the dataset, we can observe that the larger values influence the statistical test. In contrast, angle-based features fall into a smaller value range (negative to a small positive value), receive low scores from the statistical test, and are ranked lower.
In the SWaT dataset, the top three features are FIT401, FIT504, and FIT503, all of which are flow control sensors placed in the crucial stages, i.e., the 4th and 5th stages of the 6-stage process. Similarly, other top features also have critical roles and are found in later stages of the plant process. From Table 4, we can observe that the two bottom features are P601 and P60. These are two actuators placed in the last stage. Interestingly, these two were not implemented in SWaT, and the feature selection correctly ranked them last. Other bottom features, P401, P404, and P502, are actuators. These were implemented as backups and, for this reason, are not involved in attack events.
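To make the ranking step concrete, the following sketch computes the one-way ANOVA F-value of each feature against the class labels and sorts the features by score. It is written from scratch with the standard library for illustration and is not the authors' implementation; scikit-learn exposes the same statistic as `f_classif`, commonly combined with `SelectKBest`. The function names `anova_f` and `rank_features`, the feature names, and the values are made up for the example.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature against the class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    ss_between = 0.0  # variation of the class means around the grand mean
    ss_within = 0.0   # variation of samples around their own class mean
    for group in groups.values():
        group_mean = mean(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in group)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(columns, labels):
    """Return (feature name, F-value) pairs sorted from best to worst."""
    scores = {name: anova_f(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: a magnitude-like feature that separates the classes and an
# angle-like feature that does not (names and values are illustrative).
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "magnitude": [10.0, 11.0, 9.0, 20.0, 21.0, 19.0],
    "angle": [0.1, -0.2, 0.3, 0.2, -0.1, 0.0],
}
ranking = rank_features(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A Top-N feature subset is then simply the first N entries of the ranked list.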

Wrapper Method
In contrast to the filter method, the wrapper method ranks features with respect to a particular learning algorithm. So, the best-selected feature set works well with that machine-learning algorithm, and the feature set differs when the selection is made using another algorithm. Unlike the filter method, the wrapper feature selection process is costly in terms of time and space. Most wrapper methods use greedy search, which is not optimal, and suffer from false starts (wrongly choosing the first best feature) [28]. Figure 5 and the right four columns of Table 4 show the best-selected feature sets and their importance under the wrapper method for the power system dataset and the water treatment plant dataset (SWaT dataset), respectively.
In the case of the power system dataset, as shown in Figure 5, the features based on magnitude and angle are top-ranked in a nearly equal proportion, i.e., 6 and 4, respectively. So, the feature rank list differs from that of the filter method, in which magnitude-related features dominated. This study used a tree-based classifier as the wrapper method. In this approach, features are selected based on their impact on classification accuracy, rather than on the number of features. Similar behavior can be observed in the control log features listed at the bottom of the feature list. These features are Boolean and sparse, and their contribution to the classification is negligible, i.e., they have an importance value of zero.
In the SWaT dataset, FIT401 and FIT504 are ranked as top features. The other top features, i.e., P501 and PIT502, are the actuator for a pump and the sensor for a pressure meter, respectively. The bottom features are similar to those from the filter method, which supports the ranking of these features.

Embedded Method
The embedded method combines the techniques of the filter and wrapper approaches. The purpose of the combination is to take advantage of both approaches, namely the speed of the filter method and the performance of the wrapper method. From the implementation perspective, feature selection becomes part of training in the embedded method. The algorithm starts training with the seed feature set (i.e., all features) and recursively selects a set of best features for the next round of training based on the importance of the features in the trained model [28]. The retraining continues until a predefined termination condition is met, e.g., one based on the algorithm's convergence criteria or the expected performance. The commonly used embedded methods are LASSO and RIDGE regression.

Imbalance Dataset: Issue and Solution
In a supervised ML case, if the number of training samples for each class is not approximately equal, then the given dataset is considered imbalanced [29][30][31]. If the dataset is imbalanced, the training is highly influenced by the majority class (i.e., the class with the largest number of samples). Hence, the trained model lacks generalization in the real world and misclassifies the minority class.
The imbalance issue is more prominent in cases such as this study, where the task is to detect a rare event, i.e., an anomaly, maliciousness, an attack, etc., while normal events make up the majority of the dataset. Both of our datasets are imbalanced, as can be verified from Table 2 and the bar plots in Figure 1. The power system dataset has an imbalance ratio of 1:2.5, while the water treatment plant dataset has an imbalance ratio of 1:7.23, meaning that the samples for normal events are 70% and 88% of the total samples, respectively. Imbalanced datasets are a major issue and create a bottleneck in machine learning, so many methods have been proposed to address the problem of training with imbalanced datasets. These techniques mainly work on two principles: oversampling and under-sampling. Oversampling increases the number of samples in the minority class, while under-sampling reduces the number of samples in the majority class. Under-sampling goes against the basic principle of machine learning, which generally aims to obtain more samples to achieve better performance. So, under-sampling is suitable only when the dataset has a very large number of samples for the majority class, and removing samples will have little or no impact on training. In this study, we adopted the oversampling technique, given the limited number of samples, and focused on increasing the number of samples of the minority class.
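The relation between an imbalance ratio and the majority-class share quoted above is simple arithmetic, sketched below. The class counts for the power system dataset and the 1:7.23 ratio for the water treatment dataset come from the text; the helper names are hypothetical.

```python
def majority_share_from_counts(majority, minority):
    """Fraction of all samples that belong to the majority class."""
    return majority / (majority + minority)


def majority_share_from_ratio(ratio):
    """Majority share for an imbalance ratio written as 1:ratio (minority:majority)."""
    return ratio / (1 + ratio)


power = majority_share_from_counts(22714, 9582)  # power system class counts
swat = majority_share_from_ratio(7.23)           # water treatment ratio 1:7.23
print(f"power system: {power:.1%}, SWaT: {swat:.1%}")
# prints "power system: 70.3%, SWaT: 87.8%"
```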

Synthetic Minority Oversampling Technique (SMOTE)
SMOTE is a minority-class oversampling method that creates synthetic examples rather than duplicating existing ones. The synthetic examples are created by operating in the data (feature) space and are therefore largely independent of any particular application domain. For each minority class sample, new examples are generated along the line segments joining it to its k nearest minority class neighbors, with the neighbors chosen at random according to the amount of oversampling required [30].
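The interpolation idea behind SMOTE can be illustrated with a short, self-contained sketch. It is a simplified version for intuition only: it picks the base sample at random instead of iterating over every minority sample, uses a brute-force neighbor search, and the function name `smote` and its parameters are assumptions rather than the API of any library or the setup used in this study.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority neighbours (simplified SMOTE sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=6, k=2)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.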

Borderline-SMOTE
Borderline-SMOTE is also a minority oversampling method. There are two variants [32]: borderline-SMOTE1 and borderline-SMOTE2. Both methods oversample only those minority samples that lie on the borderline of class separation. The algorithm first finds the borderline samples of the minority class and then generates synthetic examples from them. It is assumed that borderline samples of the minority class are more prone to misclassification than samples far from the classification line.
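The first step, finding the borderline minority samples, can be sketched as follows. The rule used here follows the usual description of Borderline-SMOTE: a minority sample is treated as borderline when at least half, but not all, of its m nearest neighbors in the whole training set belong to the majority class. The function name, the parameter `m`, and the one-dimensional toy data are assumptions for illustration.

```python
import math


def borderline_minority(samples, labels, m=5, minority_label=1):
    """Return indices of minority samples whose neighbourhood is dominated,
    but not fully occupied, by majority samples."""
    borderline = []
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != minority_label:
            continue
        # m nearest neighbours of x among all other training samples
        others = [j for j in range(len(samples)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, samples[j]))[:m]
        n_majority = sum(labels[j] != minority_label for j in nearest)
        # all-majority neighbourhoods are treated as noise and skipped;
        # mostly-minority neighbourhoods are considered safe
        if len(nearest) / 2 <= n_majority < len(nearest):
            borderline.append(i)
    return borderline


# Toy 1-D data: majority samples on the left, minority samples on the right.
samples = [[0.0], [1.0], [2.0], [3.0], [3.4], [5.0], [5.5], [6.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(borderline_minority(samples, labels, m=3))  # prints [4]
```

Only the minority sample at 3.4, which sits next to the majority region, is selected; synthetic examples would then be generated from such samples only.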

Borderline Oversampling
Borderline Oversampling, like the methods above, creates synthetic samples around the classification line. A Support Vector Machine (SVM) can be used to create the classification line and select boundary samples for oversampling [33]. First, the SVM model is trained on the complete dataset. Then, the trained model is used to identify the borderline, and new synthetic samples for the minority class are generated around it. The new samples are generated from nearest neighbors using either interpolation or extrapolation, depending on the density of majority class instances around the borderline. This method differs from SMOTE in how the nearest neighbors are chosen: SMOTE chooses among them randomly, while this method uses the first k nearest neighbors.

Adaptive Synthetic (ADASYN) Sampling
Adaptive Synthetic (ADASYN) sampling weights the oversampling of minority class samples according to their level of difficulty in learning [34]. The method claims to improve learning in two respects: first, by reducing the bias induced by class imbalance, and second, by adaptively shifting the classification decision boundary toward the difficult examples.
Apart from the aforementioned oversampling methods, new methods have been proposed in the recent literature for improving model performance with an imbalanced dataset. Elyan et al. [35] proposed class decomposition-based SMOTE (CDSMOTE). The proposed method improves performance by taking two actions: first, reducing the dominance of the majority class by applying class decomposition, and second, increasing the representation of the minority class by oversampling. Moreover, a two-step hybridization of minority oversampling (SMOTE) and a novel data cleaning method (Weighted Edited Nearest Neighbor rule, or WENN) was proposed in [36]. Fajardo et al. [37] applied deep conditional generative models to learn the distribution of the minority classes and then generated synthetic samples to resolve the class imbalance in the dataset and improve the model's performance. Similarly, Bellinger et al. [38] proposed a new training approach for a deep learning model (CNN) which mixes three techniques (batch resampling, instance mixing, and soft labels) to create a robust model from a long-tailed or imbalanced dataset. Krawczyk et al. [39] studied the issues of imbalanced datasets for multi-class classification. The authors proposed a two-step under-sampling approach: in the first step, a one-class SVM is trained for each class, and in the second step, an evolutionary under-sampling approach is applied to each learned classifier. By using under-sampling on the set of support vectors instead of on the original dataset, the authors claimed significant computational and performance improvements.
All the methods mentioned above for handling class imbalances in learning are suitable for single-model-based learning algorithms. They can be extended to suit ensemble-based learning algorithms [39]. SVM is a good choice for dealing with imbalanced datasets [33].
All these oversampling methods were tested on the power system dataset, and the results are presented in Section 6.4 along with explanations.

Experiments and Results
This section provides details of the various experiments conducted to analyze the performance of ML algorithms with feature selection and the improvement of minority class detection for imbalanced datasets. Figure 6 illustrates the steps, structures, and components of the conducted experiments. These experiments were designed to test and validate different hypotheses, for example, comparing performance for the train-test split vs. cross-validation, identifying the impact of the imbalanced dataset on performance, testing oversampling techniques to improve performance, and measuring the classifiers' performance on different subsets of features (ALL, Top10, Top20, Top30, Top40, and Top50). All experiments were evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. The Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are additionally used to show performance.
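For reference, the four threshold-based metrics named above can be computed directly from the confusion counts, as in the following sketch. The function name and the toy labels are illustrative assumptions; the toy example is chosen to show how accuracy can look acceptable on an imbalanced test set while recall on the minority (FDIA) class is poor.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced test set: 3 FDIA samples (1) and 7 normal samples (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here the accuracy is 0.7 even though only one of the three FDIA samples is detected, which is why precision, recall, and F1-score are reported alongside accuracy.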

Experimental System
The experimental system ran Ubuntu OS with a Python development environment. The Python environment was prepared with the required machine-learning modules and frameworks, such as Pandas, NumPy, matplotlib, CSV, and Scikit-learn [40].

Machine Learning Algorithms
In our work, ML algorithms were chosen based on their working principles. We tried to keep a diverse set of algorithms for better understanding and performance comparison. For example, Naive Bayes (NB) works on conditional probability, while kNN applies a distance function to associate a data point with a group or cluster [41]. On this basis, we initially selected nine algorithms and later dropped the Bagging (SVC) and XBoost algorithms from further experiments because of their relatively long training times. The selected algorithms were Decision Tree (DT), Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbour (kNN), Naive Bayes (NB), Regression, Bagging, and Boosting. Training and testing with such a diverse set of algorithms helped us to identify suitable features and algorithms. All the algorithms were run with the default parameters of the scikit-learn framework; parameter settings are explicitly mentioned wherever a default value was changed. Some critical parameters for the best-performing model, i.e., random forest, are the number of trees: 100; split criterion: Gini; and the minimum number of samples required to split: 2. Hyper-parameter tuning finds the best values for an algorithm's parameters from a search space. This study did not perform hyper-parameter tuning; however, it is a possible area for future work.

Training and Testing
The power system and SWaT datasets were divided into training and testing sets. Each algorithm was trained on the training set, while the performance of the model was evaluated on the testing set. Percentage split and cross-validation are the two main methods for splitting a dataset into training and testing sets. The percentage split simply divides the original samples into two sets according to the given percentages for training and testing. Cross-validation, in contrast, divides the original samples into N folds containing equal numbers of samples. We used a 70/30 ratio for the percentage split and 10 folds (i.e., N = 10) for cross-validation. As cross-validation is an iterative process, the algorithm's performance was taken as the mean over N rounds of training and testing. In each round, N − 1 folds were used for training, and the remaining fold was used for testing. Training and testing over multiple folds exposes the model to more diverse partitions of the dataset, so cross-validation provides more robust training, and the trained model generalizes well to unseen samples [42].
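The two splitting schemes described above can be sketched in a few lines of pure Python. This is only an illustration under simplifying assumptions (random shuffling, no stratification by class); the function names and the fixed seed are hypothetical and are not taken from the experimental code of this study.

```python
import random

def percentage_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them once into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]

def k_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

def cross_validated_score(evaluate, n_samples, n_folds=10):
    """Mean of evaluate(train, test) over all N rounds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_samples, n_folds)]
    return sum(scores) / len(scores)

train_idx, test_idx = percentage_split(100)      # 70/30 split
splits = list(k_fold_splits(100, n_folds=10))    # ten rounds, 90/10 each
```

Here `evaluate` stands for any user-supplied function that trains a model on the training indices and returns a score on the test indices.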

Percentage Split (70-30)
As mentioned earlier, the percentage split method divides the samples into two sets according to the required percentages for training and testing. We used 70% for training and the remaining 30% for testing. This method selects the training samples from the dataset at random. Training on such a split provides only an approximate model, because randomly selected training samples do not necessarily represent the actual data distribution. As such, the trained model can suffer from over-fitting, i.e., it performs poorly on unseen samples. We trained and tested all nine algorithms on the training and testing sets obtained from a percentage split (70-30%) to measure the training time and the approximate performance of FDI classification. Tables 5 and 6 show the precision, recall, F1-score, and accuracy of all algorithms for the power system and SWaT datasets, respectively. Except for accuracy, the metrics are reported for both classes (Normal and FDI). From Table 5, we can observe that Random Forest, with an accuracy of 92%, is the best-performing ensemble algorithm, while Decision Tree, with an accuracy of 85%, is the best-performing single model. The accuracy value is biased towards the majority class, and the model suffers on the minority class. This is evident from the precision, recall, and F1-score values in Table 5 for both classes: the performance is reduced by about 6-10% for the best-performing classifiers. In the case of the SWaT dataset, all the classifiers have an accuracy above 95%. The precision, recall, and F1-score for the normal class are in line with the accuracy but are lower for the FDI class for many classifiers. From Table 6, we can observe that kNN, DT, and RF show a perfect 100% score for all metrics, which indicates over-fitting. The over-fitting can be attributed to the water treatment dataset having few features and many samples.
Therefore, kNN, DT, and RF can memorize the class distribution of the training data. We investigated these three algorithms further with cross-validation, and the results are presented in Section 6.3.2.
In critical infrastructure, decisions must be made quickly, so the machine-learning model needs a low prediction time. Time is therefore one of the key performance metrics in this study, and we measured the time each algorithm takes for training and testing. The training time helps identify a suitable algorithm for the power system or other critical infrastructure. It also matters because data is generated quickly and models often require retraining, so a model with a lower training time is more suitable. Table 7 presents the training and testing times (in seconds) of all nine algorithms. As mentioned earlier, Naive Bayes (NB) is a fast probabilistic algorithm because it uses prior probability values to calculate the posterior. These probability values can be calculated in advance, so NB trains faster than the other algorithms. However, it relies on the critical assumption that the attributes are conditionally independent. Bagging is an ensemble method that creates multiple base models on subsets of the dataset, which are created using random sampling. Table 7 shows that NB has the lowest training time of all algorithms, while Bagging (with SVC) has the highest. This highest training time is the accumulated time taken for subset generation, training of multiple models, and testing.
The previous section shows the detection performance and training time of the algorithms for the training and testing sets created using a percentage split. With this initial estimate, the algorithms were further trained and tested with 10-fold cross-validation to assess the trained models' generalization capacity.
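Training and testing times such as those reported in Table 7 can be measured by wrapping the fit and predict calls with a wall-clock timer. The sketch below shows one way to do this; it is not the measurement code of this study. `MajorityClassifier` is a hypothetical stand-in model, and `time_fit_predict` assumes only that the model exposes `fit` and `predict` methods in the scikit-learn style.

```python
import time

class MajorityClassifier:
    """Hypothetical stand-in model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def time_fit_predict(model, X_train, y_train, X_test):
    """Return (training seconds, testing seconds, predictions) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds, predictions

train_s, test_s, preds = time_fit_predict(
    MajorityClassifier(), [[0], [1], [2]], ["Normal", "Normal", "FDI"], [[5], [6]]
)
```

Repeating the measurement several times and averaging would reduce the effect of system noise on short-running models.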
We compared the detection accuracy of all algorithms under the percentage split and cross-validation on the power-system dataset. We made two key observations. First, five algorithms (DT, SVC, kNN, GB, and RF) achieved lower accuracy in the 10-fold cross-validation than in the percentage split. Second, the four other algorithms (Adaboost, Bagging (ensemble), LR, and NB) were minimally affected, i.e., their accuracy either decreased by a smaller margin or remained constant. Based on these two observations, we can conclude that the former five algorithms over-fit inherently, while the latter four algorithms have built-in mechanisms to overcome over-fitting during training. Considering this outcome, for the SWaT dataset, we performed training with 10-fold cross-validation for the three most over-fitted classifiers, i.e., kNN, DT, and RF. Figure 7 shows the AUC of these classifiers for the 10-fold training. Although four folds have an AUC value of 1.0 for RF, the mean AUC of the three classifiers dropped from 1.0 to 0.79, 0.75, and 0.81 for kNN, DT, and RF, respectively.
The previous section presented the outcome of the filter and wrapper methods. One further key objective of this study is to test the performance of all algorithms on the selected sets of features. For this, we experimented only on the power-system dataset. A total of ten datasets were created to train and test the different machine-learning algorithms using sets of the selected top features, five sets each from the filter and wrapper methods. The experimental results are shown as Top10, Top20, Top30, Top40, and Top50 for the feature selection sets, while ALL represents all features. These 11 datasets were used to train and test all algorithms with 10-fold cross-validation. Figure 8a shows the results for the filter method sets. The detection accuracy of Adaboost, SVC, and GB decreased, while the detection accuracy of DT, kNN, and LR increased.
The performance of RF and NB was not significantly affected by the change in the number of features. Similarly, five datasets were created using the feature ranking of the wrapper method, and all algorithms were trained and tested on these five sets with 10-fold cross-validation. Figure 8b shows the performance of all algorithms. Unlike the filter method case, there was no change in the performance of Logistic Regression, while the performance of Naive Bayes decreased significantly. Under both selection methods, kNN showed a similar pattern, i.e., accuracy increased with the number of selected features. However, there was no clear pattern in the performance change of Adaboost, RF, GB, DT, and SVC. The accuracy of RF and DT decreased and was lowest with the top thirty features, while there were no significant changes in accuracy with the other sets of features (Top10, Top20, Top40, and Top50). The accuracy of SVC and Adaboost did not seem related to the number of features.
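The construction of the Top10 to Top50 feature sets from a feature ranking can be sketched as follows. The scoring dictionary, the feature names, and the helper functions are hypothetical; the sketch assumes only that a filter or wrapper method has already assigned one relevance score per feature, with higher scores meaning more relevant features.

```python
def top_k_features(scores, k):
    """Names of the k highest-scoring features (scores: {feature name: score})."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

def build_feature_sets(scores, sizes=(10, 20, 30, 40, 50)):
    """Map 'TopK' labels, plus 'ALL', to lists of feature names."""
    sets = {f"Top{k}": top_k_features(scores, k) for k in sizes}
    sets["ALL"] = list(scores)
    return sets

def project(rows, feature_names, selected):
    """Keep only the selected columns of each row, in the selected order."""
    positions = [feature_names.index(name) for name in selected]
    return [[row[p] for p in positions] for row in rows]

# Hypothetical scores for three features; real rankings come from the selection method.
scores = {"f_a": 0.1, "f_b": 0.9, "f_c": 0.5}
feature_sets = build_feature_sets(scores, sizes=(1, 2))
reduced = project([[1, 2, 3], [4, 5, 6]], ["f_a", "f_b", "f_c"], feature_sets["Top2"])
```

Each resulting dataset can then be passed through the same 10-fold cross-validation so that the accuracy values are comparable across feature sets.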

Imbalance Dataset and Impact
As discussed in Section 5, with an imbalanced dataset, ML models suffer performance degradation while making predictions about minority classes. This is because a model learns mainly from the majority class or is over-fitted to the majority class. Accuracy is the most-used metric for measuring the performance of machine-learning algorithms, but it is not suitable for imbalanced datasets [30]. Values from Tables 5 and 6 verify the performance degradation of the model with more robust metrics such as precision, recall, and F1-score.
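A small worked example shows why accuracy alone is misleading on an imbalanced dataset. The class counts below are hypothetical and are not taken from the power system or SWaT datasets; the helper functions are a minimal sketch of the standard metric definitions. A classifier that labels every sample as Normal reaches 95% accuracy while detecting no attack at all.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1-score, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # A metric is reported as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: 950 Normal samples and 50 FDI samples.
y_true = ["Normal"] * 950 + ["FDI"] * 50
# A degenerate classifier that always predicts the majority class.
y_pred = ["Normal"] * 1000

overall = accuracy(y_true, y_pred)                      # 0.95
fdi = class_metrics(y_true, y_pred, positive="FDI")     # all three metrics are 0.0
normal = class_metrics(y_true, y_pred, positive="Normal")
```

Reporting the metrics per class, as in Tables 5 and 6, exposes this failure, whereas the overall accuracy hides it.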
The two main approaches for handling an imbalanced dataset are oversampling the minority class and under-sampling the majority class. While oversampling is suitable for most use cases, under-sampling is suitable only when the majority class has many samples and the minority class also has enough samples to represent the nature of its distribution. In this study, we adopted the oversampling approach, and the minority class was oversampled using four different sampling techniques. SMOTE is the main technique for oversampling, and the other three, i.e., Borderline-SMOTE, Borderline-SMOTE with SVM, and Adaptive Synthetic Sampling, are variants of SMOTE. As explained in Section 5, in borderline-SMOTE, only the borderline samples of the minority class are used for oversampling. In the original borderline-SMOTE algorithm [32], kNN is used for sample selection, while in the modified version (borderline-SMOTE with SVM [33]), SVM is used for sample selection. To understand and highlight the impact of the imbalanced dataset, we used AUC as the performance metric. We first trained all ML algorithms on the imbalanced dataset and then, after applying each oversampling technique, trained the algorithms again. Figure 9 depicts the algorithms' AUC values with the imbalanced dataset (Figure 9a) and after balancing the dataset with SMOTE (Figure 9b), i.e., after oversampling the minority (FDI) class until both classes have an equal number of samples. From Figure 9, it is evident that imbalanced datasets have varying impacts on different types of ML algorithms. This is expected, because each algorithm performs training differently. Further, this observation can be broken down into two parts. First, some algorithms are not affected by the ratio of samples in each class, whether their performance is high (DT and RF) or low (NB and LR), so oversampling also fails to change their performance.
Second, some algorithms (GradientBoost, AdaBoost, and kNN) are strongly affected by the imbalanced dataset, and so their performance improves after oversampling. Table 8 shows the AUC values of the different classifiers for the imbalanced dataset and after applying the four selected oversampling methods. In Table 8, borderline-SMOTE and SVM-based borderline-SMOTE are abbreviated as BSMOTE and BSMOTE-SVM, respectively. From Table 8, we can observe that all oversampling techniques improve the AUC values of almost all classifiers. The magnitude of the improvement depends on the type of algorithm used. As mentioned previously, the best improvement, i.e., 10-12%, was observed for the GradientBoost and AdaBoost algorithms, while kNN improved by 6-8% with the different oversampling techniques. DT and RF are considered robust against imbalanced datasets, but even these algorithms achieved a 2-3% performance improvement after oversampling.
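The balancing step, in which the minority (FDI) class is oversampled until both classes have the same number of samples, can be sketched with the basic SMOTE interpolation rule. The code below is a simplified pure-Python illustration for a binary dataset and not the implementation used in the experiments; the function name, the toy data, and the fixed seed are hypothetical. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random

def smote_balance(X, y, minority_label, k=5, seed=0):
    """Add synthetic minority samples until both classes of a binary dataset are equal in size."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == minority_label]
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    # Majority count minus minority count (binary case).
    n_needed = max(len(X) - 2 * len(minority), 0)

    synthetic = []
    for _ in range(n_needed):
        base = rng.randrange(len(minority))
        # k nearest minority neighbours of the chosen base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != base),
            key=lambda j: math.dist(minority[base], minority[j]),
        )[:k]
        partner = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to partner
        synthetic.append(
            tuple(b + gap * (p - b) for b, p in zip(minority[base], partner))
        )
    return list(X) + synthetic, list(y) + [minority_label] * len(synthetic)

# Hypothetical toy data: six Normal samples and two FDI samples.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (10, 10), (12, 14)]
y = ["Normal"] * 6 + ["FDI"] * 2
X_bal, y_bal = smote_balance(X, y, "FDI", k=1)
```

To avoid leaking synthetic samples into the evaluation, oversampling of this kind is normally applied to the training portion only.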
Figure 9. ROC for classifiers with the imbalanced and the SMOTE-balanced power system dataset: (a) ROC with imbalanced dataset; (b) ROC after SMOTE.
In this study, we used robust performance metrics such as precision, recall, and F1-score calculated using a confusion matrix. In addition, to visualize the model's performance, we selected the best-performing model on both the imbalanced and balanced datasets and plotted the confusion matrix. Figure 10 shows the results for both cases. As shown in Figure 10a,b, the detection performance for the attack class improved with the balanced dataset but decreased for the normal class. However, improved attack detection is critical and required for critical infrastructures.
Figure 10. Confusion matrix for the best model (RF) on the imbalanced and the SMOTE-balanced power system dataset: (a) imbalanced dataset; (b) balanced dataset (SMOTE).

Comparison with Previous Works
In this section, we compare the performance of our best model with those described in the existing literature. Jingyu Wang et al. [43] used a deep autoencoder to detect data manipulation attacks in power systems. Adhikari et al. [44] combined Non-Nested Generalized Exemplars (NNGE) and the STate Extraction Method (STEM) for cyber-attack event detection. Defu Wang et al. [20] divided the features by PMU and then used an ensemble approach to combine the results of five classifiers (four classifiers trained on the data of the four individual PMUs and a fifth trained on the combined features). Table 9 compares the performance of the best models of earlier studies with that of this study. The performance of this study is reported as an AUC value. AUC is a robust metric that represents the total area under the ROC curve, which captures the trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). A higher AUC value indicates better classification performance of the model. This study achieved an AUC value of 0.984, which is better than the results reported in the existing literature.
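The AUC can be computed without drawing the ROC curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. The sketch below uses this pairwise formulation; the function name and the example scores are illustrative only and are unrelated to the results in Table 9.

```python
def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The pairwise form is quadratic in the number of samples, so a rank-based computation is preferable for large datasets; it is used here only because it makes the meaning of the metric explicit.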

Conclusions and Future Scope
We examined and improved the performance of machine-learning algorithms for detecting FDIAs in critical infrastructure by determining the best features and mitigating the imbalanced dataset problem. The performance improvement was tested and validated through various experiments. These experiments included feature selection methods, oversampling techniques, and training and testing ML algorithms on two popular datasets related to power systems and water treatment plants. Our results show that the performance of the algorithms varies significantly depending on the feature selection method and the number of features. For example, the performance of NB is unaffected by the number of features selected with the filter method, while it decreases with the features selected by the wrapper method. We also found that the selection methods rank features differently. We found that RF is generally suitable for building an FDIA detector based on the trade-off between detection performance and training time. Additionally, model training with 10-fold cross-validation is suitable because it highlights over-fitting issues. Moreover, we analyzed the impact of the imbalanced dataset and applied minority oversampling techniques to improve detection performance.
New sampling techniques based on deep learning and hybrid sampling approaches have been proposed in the literature [35][36][37][38][39]. Future studies can explore these recent techniques with the power system and other critical infrastructure datasets. The binary classification formulation in this study can be further divided and reformulated as a multi-class classification problem for training machine-learning algorithms. Moreover, consideration of the space and computation requirements of critical infrastructures can motivate new research objectives.

Conflicts of Interest:
The authors declare no conflicts of interest.
Sample Availability: The modified version of the power system dataset is available on GitHub for comparison; we suggest citing the original authors' work when using the dataset. The SWaT dataset is only available on request to the original author.