1. Introduction
Supervisory control and data acquisition (SCADA) systems are used in a variety of industries, such as manufacturing, energy, and transportation, to monitor and control industrial processes. They consist of a central computer or server that communicates with a network of devices, such as sensors and control systems, to collect data and control various processes [1]. For example, in the energy industry, a SCADA system might be used to monitor and control the flow of electricity through a power grid. Sensors within the grid would collect data on the flow of electricity, and the SCADA system would use that data to adjust the grid in real time to ensure that the electricity is distributed efficiently and safely.
In recent years, there has been an increase in the integration of IoT in SCADA systems. An IoT platform connects billions of devices, including PLCs, actuators, and intelligent electronic devices (IEDs) of industrial control systems (ICS) [1]. SCADA systems benefit greatly from IoT in automation, improved monitoring, and data collection. The Internet of Things platform has enabled modern industry to efficiently monitor and control physical systems (different hardware and machinery), resulting in intelligent data collection and processing and highly effective business management. Using IoT in SCADA, however, requires breaking the traditional isolation of SCADA networks, as these devices interact via the public internet. This advancement enhances the performance of SCADA networks but also introduces new vulnerabilities and opportunities for cyberattacks.
Security researchers in industry and academia have created many types of intrusion detection systems (IDSs) to improve the security of SCADA networks. Some are classic rule-based IDSs, while others are anomaly detection solutions built with machine learning (ML) and deep learning (DL) algorithms. A good SCADA dataset for training and evaluating ML/DL models is required to construct an efficient and intelligent IDS.
When we examined SCADA intrusion detection datasets closely, we noticed data imbalances caused by the large number of normal traffic samples compared to only a few anomalies. This is because normal operations produce far more data than cyberattacks. This is also highlighted in [2], in which the authors analyzed the CICIDS2017 dataset. When utilizing machine learning to detect attacks as anomalies, we must address data imbalance to develop a classifier that accurately distinguishes between normal and abnormal traffic. A classifier will have a difficult time detecting anomalies without overfitting when these dataset imbalances are not considered [3]. Historically, there have been few reliable and publicly available datasets. Those that are available have been criticized for being out of date, lacking sufficient labeling, and containing flaws that do not exist in real-world applications [4].
This is due to the following factors: (a) SCADA is concerned with critical infrastructure and industrial control systems. As a result, information regarding how these systems work would be inaccessible. (b) Due to privacy concerns, companies and governments do not share SCADA datasets. These systems hold sensitive and secret data. (c) Such data could be a gold mine for hostile individuals, companies, or state actors. We focus on this problem in this study and demonstrate the impact of data imbalances on intrusion detection using machine learning techniques. The motivation is that there is a critical need to design effective IDS models for the following reasons:
Cyberattacks against SCADA networks and industrial control systems are on the rise;
Increased risks and vulnerabilities are associated with IoT integration in SCADA systems (network leaks);
To identify an effective route for building ML/DL IDS models in the absence of a reliable dataset.
For these reasons, we consider the Morris power dataset, which comprises power grid substation network traffic, and the CICIDS2017 dataset. We also examine CNN-LSTM algorithms and the handling of data imbalances for improving anomaly detection. We raise the following research question: Is there a difference in the performance of a machine learning model when the data is balanced versus unbalanced, and if so, what is the difference in metrics such as accuracy, precision, recall, and F1-score?
This paper is organized as follows: the next section discusses the relevant research based on the research questions. In Section 3, we explain the methodology employed in this research, which includes a review of various techniques for dealing with imbalanced datasets. Section 4 contains the experiment results and the answers to the research questions. In Section 5, we offer recommendations and suggestions for overcoming the highlighted difficulties. Finally, we conclude the paper.
2. Related Works
In the existing literature, security researchers have widely studied the topic of dataset imbalance. This section discusses the related work that deals with imbalanced datasets in IDS models for SCADA systems. The publications discussed were chosen based on the following criteria: (a) they were published between 2018 and 2022, (b) they are review or research papers published in journals, and (c) they contain the terms “IDS” and “imbalanced dataset.” These articles were gathered from ScienceDirect, Wiley Online Library, and Google Scholar. Before examining the related works, some concepts, such as SCADA systems, intrusion detection systems, data imbalance, undersampling, and oversampling, must be defined.
SCADA systems are control systems comprised of two major components: field devices, such as remote terminal units (RTUs) and programmable logic controllers (PLCs), and a human-machine interface (HMI). These systems monitor and control assets across a large geographic area as well as automate and control industrial operations. In modern SCADA systems, IoT devices collect data, and the data are transmitted via the public internet. As a result, intrusion detection systems (IDSs) were created to secure such systems [5].
Intrusion detection systems (IDSs) are classified as either network-based or host-based. They operate in three modes: rule-based, anomaly detection, and hybrid. Machine learning (ML) and deep learning (DL) models are at the heart of these anomaly detection systems. A good dataset is required for training such models, which can be acquired from an entire SCADA system or a simulated system utilizing testbeds. One of the difficulties that security researchers face while constructing an ML/DL model is data imbalance.
Data imbalance occurs when one class has a higher percentage than another [6]. This is a problem because it introduces bias into the results of ML/DL models: the model tends to classify any input as belonging to the majority class. Published works address this problem with two primary solutions: undersampling and oversampling. Undersampling is a technique for lowering the proportion of the majority class. Oversampling, on the other hand, increases the minority class's proportion by randomly replicating its samples. The sketch below illustrates both; the following paragraphs then review works that employed these two approaches to resolve dataset imbalance.
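As a concrete illustration of these two strategies (not drawn from the cited works), the sketch below uses the imbalanced-learn library on synthetic data; the class names and the 95:5 ratio are assumptions chosen for demonstration.

```python
# Both strategies via imbalanced-learn; the data here are synthetic.
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 95% "normal" (0) and 5% "attack" (1).
X, y = make_classification(n_samples=10_000, weights=[0.95], random_state=42)
print(Counter(y))  # approximately {0: 9500, 1: 500}

# Undersampling: discard majority samples until the classes are balanced.
X_u, y_u = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_u))  # both classes reduced to the minority count

# Oversampling: randomly duplicate minority samples instead.
X_o, y_o = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_o))  # both classes raised to the majority count
```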
Real network traffic data from a SCADA system contains a substantial amount of regular traffic and a minor amount of irregular traffic, which is a classic imbalanced data classification challenge. Other datasets are created by simulations in which attack scenarios are manually injected into the system, resulting in more attack samples than normal ones, as in the case of the BoT-IoT dataset [7]. Although the prediction accuracy of the majority classes improves when the total error is minimized in this scenario, the prediction accuracy of minority classes is typically poorer.
Random undersampling (RUS) and random oversampling (ROS) are two popular sampling approaches. In network intrusion detection, the imbalance ratio (IR) of the traffic data is extremely high. When the RUS method is employed, crucial information may be lost [8]. Using only the ROS strategy, on the other hand, makes the classifier learn from a large amount of duplicated information, resulting in overfitting [9].
To address the imbalanced dataset problem in KDD99, [10] employed a random forest (RF) with clustering-based undersampling. The model has two detection levels: the first classifies a data instance as an attack or normal, and the second determines the attack type based on the output of the first. Because the attack types are themselves imbalanced, the undersampling technique is applied at this stage. To undersample the majority class, [11] applied clustering with instance selection. The technique was then experimentally evaluated using affinity propagation and k-means algorithms, with three alternative instance selection algorithms (IB3, DROP3, and GA) independently coupled for performance comparisons. The authors in [12] used clustering-based undersampling to create an IDS with NSL-KDD and UNSW-NB15; the k-means algorithm was used to construct new clusters for the majority class.
The authors in [13] employed the adaptive synthetic oversampling method (ADASYN) to address dataset imbalance. The detection approach in this study is a hybrid of a sparse autoencoder and a random forest. To overcome the imbalance problem, [14] coupled SMOTE with edited nearest neighbors (SMOTE-ENN) at the preprocessing stage. The data were then transformed into images for feature extraction with a CNN.
In [15], the authors utilized a different technique that balances the data depending on the proportions of the classes in the sample, employing a combination of undersampling and oversampling strategies. A CNN served as the foundation of the IDS model, and SVM and random forest methods were used for the classification.
3. Methodology
This section provides the steps needed to answer the research questions and to achieve the goal of this paper, which is to understand the impact of dataset imbalance on the development of IDSs in SCADA systems. This section is organized as follows: firstly, training the CNN-LSTM with imbalanced datasets; secondly, training the model with balanced training data; thirdly, a brief description of the datasets used in this paper; and finally, the experiment environment and settings.
3.1. CNN-LSTM with Imbalanced Datasets
Figure 1 shows the flow of this experiment. The unbalanced SCADA datasets were normalized using the MinMaxScaler, with 70% of the data used for training and 30% for testing. This experiment consists of three steps, which are as follows:
Dataset preprocessing. In this step, categorical features are converted to numerical features. The data values are then normalized between 0 and 1;
Training and testing. The CNN-LSTM model is developed, and the best parameters for training the dataset are chosen;
Evaluation stage. The model’s performance is evaluated using metrics, such as accuracy, recall, and F1-score.
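Since this section does not fix the network architecture, the following is a minimal sketch of a CNN-LSTM binary classifier in Keras; the layer sizes, kernel width, dropout rate, and optimizer are illustrative assumptions rather than the paper's exact settings.

```python
# A minimal CNN-LSTM binary classifier in Keras (illustrative settings).
from tensorflow.keras import layers, models

def build_cnn_lstm(n_features: int) -> models.Model:
    model = models.Sequential([
        # Each record is treated as a 1D sequence with one channel.
        layers.Input(shape=(n_features, 1)),
        # Convolution extracts local patterns across neighboring features.
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # The LSTM captures longer-range dependencies in the sequence.
        layers.LSTM(64),
        layers.Dropout(0.3),
        # Single sigmoid unit: 0 = Normal, 1 = Attack.
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_lstm(n_features=93)  # 93 features after cleaning (Section 3.3)
```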
For the preprocessing stage, the Morris power dataset comprises only one categorical feature, the “marker” (Normal, Attack). Using binary encoding, this attribute was converted into numerical data: Normal became “0”, while Attack became “1”. In the same way, the multiclass labels in the CICIDS2017 dataset were converted into Normal and Attack. Then, using the min–max function, all the values were normalized between 0 and 1, as shown in Equation (1):
x_norm = (x - x_min) / (x_max - x_min)  (1)
Any instance that contained missing values was removed, as well as any feature with the same value for more than 80% of all records.
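A hedged sketch of these preprocessing steps in Pandas and scikit-learn follows; the file path is a placeholder, and the 80% threshold and 70/30 split follow the description above.

```python
# Preprocessing sketch: binary encoding, cleaning, 70/30 split, min-max scaling.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("morris_power.csv")  # placeholder path

# Binary-encode the categorical marker: Normal -> 0, Attack -> 1.
df["marker"] = (df["marker"] == "Attack").astype(int)

# Remove instances with missing values and features whose most frequent
# value covers more than 80% of all records.
df = df.dropna()
keep = [c for c in df.columns
        if c == "marker" or df[c].value_counts(normalize=True).iloc[0] <= 0.80]
df = df[keep]

X, y = df.drop(columns="marker"), df["marker"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Min-max normalization to [0, 1] (Equation (1)), fitted on training data only.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```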
3.2. CNN-LSTM with Balanced Datasets
Figure 2 illustrates the flowchart of balancing the datasets and training the model. In Experiments 2, 3, and 4, the dataset was divided based on its majority and minority classes. To balance the dataset, the majority class was undersampled and the minority class was oversampled. In undersampling, there are three techniques:
Methods that select records to keep, such as near-miss undersampling and the condensed nearest neighbor rule;
Methods that select instances to delete, including Tomek links and the edited nearest neighbors rule;
Combinations of both techniques; one-sided selection and the neighborhood cleaning rule are examples of this approach (see the sketch after this list).
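All of these undersampling techniques are available in the imbalanced-learn library; the sketch below is an illustrative assumption (the paper does not name its tooling for this step) applied to the training split from the preprocessing sketch above.

```python
# The three categories of undersampling methods, via imbalanced-learn.
from imblearn.under_sampling import (
    NearMiss,                   # keeps selected records
    CondensedNearestNeighbour,  # keeps selected records
    TomekLinks,                 # deletes selected records
    EditedNearestNeighbours,    # deletes selected records
    OneSidedSelection,          # combination of both
    NeighbourhoodCleaningRule,  # combination of both
)

samplers = {
    "Near miss": NearMiss(),
    "Condensed NN rule": CondensedNearestNeighbour(random_state=42),
    "Tomek links": TomekLinks(),
    "Edited NN rule": EditedNearestNeighbours(),
    "One-sided selection": OneSidedSelection(random_state=42),
    "Neighborhood cleaning": NeighbourhoodCleaningRule(),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(f"{name}: {len(y_res)} samples remain")
```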
In oversampling, there are five techniques (see the sketch after this list):
Random oversampling;
Synthetic minority oversampling technique (SMOTE);
Borderline–SMOTE;
Borderline oversampling with SVM;
Adaptive synthetic sampling (ADASYN).
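The five oversampling techniques likewise map onto imbalanced-learn classes, as sketched below under the same assumptions as the undersampling example.

```python
# The five oversampling techniques, via imbalanced-learn.
from imblearn.over_sampling import (
    RandomOverSampler, SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN,
)

for sampler in (RandomOverSampler(random_state=42),
                SMOTE(random_state=42),
                BorderlineSMOTE(random_state=42),
                SVMSMOTE(random_state=42),  # borderline oversampling with SVM
                ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    print(type(sampler).__name__, len(y_res))
```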
3.3. Datasets Description
The Morris power dataset is imbalanced, as shown in Figure 3, which depicts the class distribution of the dataset. To understand the dataset, principal component analysis (PCA) was used to visualize the data. From the visualization, we can determine whether a clear pattern can be seen and, if so, which ML/DL model is most appropriate. The output of the PCA can provide a clear picture of the overall dataset. Although PCA reduces the dimensionality of the dataset and hence loses some information, it simply groups together features that are strongly correlated. An explained variance of approximately 70% is enough to understand the overall dataset; we do not need to see every street to know the city.
The original dataset contains 129 features. First, the data were cleaned by removing columns that had low variance; there were 36 columns with the same value across 80% of the rows, which would not contribute to the model performance. The SCADA dataset was also checked for missing values. Using PCA, the remaining 93 features were reduced to 6 principal components. The first two principal components are visualized in Figure 4, which shows that there is no distinct class boundary; therefore, a more complicated model, such as a neural network, is needed to define these boundaries. The first principal component explains around 27% of the variance, and the sixth explains only 3.47%. Together, these components explain around 71% of the variance, but we do not need to see 100% of the data to see a pattern.
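A sketch of this analysis with scikit-learn follows, assuming the cleaned feature matrix X and labels y from the preprocessing step; the printed ratios should approximate, not reproduce, the percentages reported above.

```python
# Variance filtering, PCA to 6 components, and a 2D view of the data.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X)  # X: 93 cleaned features

pca = PCA(n_components=6)
X_pca = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)        # first component near 0.27
print(pca.explained_variance_ratio_.sum())  # roughly 0.71 in total

# Scatter of the first two principal components, colored by class
# (the analogue of Figure 4).
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=2, alpha=0.4)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```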
The Canadian Institute for Cybersecurity collected and assembled the CICIDS2017 dataset with the help of the B-Profile system at the end of 2017 [16]. The dataset contains 2,830,473 network traffic samples, with benign traffic accounting for 80.30 percent and attack traffic accounting for 19.70 percent. The categories include the most prevalent attacks, such as DoS, DDoS, Botnet, PortScan, and web attacks. The dataset collects 84 features from the generated network traffic, with the multiclass label being the last column. Furthermore, compared to publicly available datasets from 1998 to 2016, this dataset fully satisfies the 11 performance evaluation criteria. The CICIDS2017 dataset is divided similarly to the UNSW-NB15 dataset.
Figure 5 depicts the CICIDS2017 data distribution for each class.
3.4. Experiment Settings
Four experiments were conducted to determine the effect of dataset imbalance. In the first, the CNN-LSTM detected intrusions using the imbalanced data. In the second experiment, the data were balanced using undersampling only. In the third experiment, the model was trained with data balanced using an oversampling approach. A hybrid balancing technique was used in the fourth experiment: undersampling the majority class and oversampling the minority class. In each case, the CNN-LSTM model was then used to detect anomalies in the dataset. The sketch below outlines these four configurations.
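In the outline, build_cnn_lstm is the model sketch from Section 3.1, and the specific samplers chosen for Experiments 2-4 are illustrative stand-ins rather than the exact techniques used in the paper.

```python
# The four experiment configurations; samplers are illustrative choices.
import numpy as np
from imblearn.combine import SMOTEENN  # oversample, then clean with ENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

experiments = {
    "1: imbalanced": None,
    "2: undersampling only": RandomUnderSampler(random_state=42),
    "3: oversampling only": SMOTE(random_state=42),
    "4: hybrid": SMOTEENN(random_state=42),
}

for name, sampler in experiments.items():
    if sampler is None:
        X_tr, y_tr = X_train, np.asarray(y_train)
    else:
        X_tr, y_tr = sampler.fit_resample(X_train, y_train)
        y_tr = np.asarray(y_tr)
    model = build_cnn_lstm(n_features=X_tr.shape[1])
    model.fit(X_tr[..., None], y_tr, epochs=10, batch_size=256, verbose=0)
    loss, acc = model.evaluate(X_test[..., None], np.asarray(y_test), verbose=0)
    print(f"{name}: accuracy = {acc:.4f}")
```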
Google Colab was used for these experiments because it is easy to use and provides GPU access to speed up the model's training. Each experiment was conducted several times, and average values were reported. The deep learning model was built with the TensorFlow, Pandas, and Keras frameworks. The measures used to assess the performance of these experiments are described next.
The evaluation metrics used in these experiments are briefly discussed here. All experiments evaluate the model based on accuracy (ACC), recall, precision, and F1-score. For binary and multiclass classification problems, accuracy is the most common performance metric; an IDS accuracy rate measures how accurately it detects normal or abnormal network traffic [13]. The true positive rate (TPR), also called recall or sensitivity, is the ratio of correctly predicted network anomalies to the total number of network anomalies. Precision indicates the proportion of cases classified as positive by the classifier that are truly positive. The F1-score is the weighted harmonic mean of precision and recall, which is particularly informative for imbalanced classification problems.
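These four metrics can be computed directly with scikit-learn, as sketched below; y_test and model follow the earlier sketches, and the 0.5 decision threshold on the sigmoid output is an assumption.

```python
# Accuracy, precision, recall (TPR), and F1-score with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_prob = model.predict(X_test[..., None]).ravel()
y_pred = (y_prob >= 0.5).astype(int)  # assumed decision threshold

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))     # TP / (TP + FN), the TPR
print("F1-score :", f1_score(y_test, y_pred))         # harmonic mean of P and R
```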