A Study on Performance Metrics for Anomaly Detection Based on Industrial Control System Operation Data

: Recently, OT (operational technology) networks of industrial control systems have been combined with IT networks. Therefore, OT networks have inherited the vulnerabilities and attack paths existing in IT networks. Consequently, attacks on industrial control systems are increasing, and research on technologies combined with artiﬁcial intelligence for detecting attacks is active. Current research focuses on detecting attacks and improving the detection accuracy. Few studies exist on metrics that interpret anomaly detection results. Different analysis metrics are required depending on the characteristics of the industrial control system data used for anomaly detection and the type of attack they contain. We focused on the fact that industrial control system data are time series data. The accuracy and F1-score are used as metrics for interpreting anomaly detection results. However, these metrics are not suitable for evaluating anomaly detection in time series data. Because it is not possible to accurately determine the start and end of an attack, range-based performance metrics must be used. Therefore, in this study, when evaluating anomaly detection performed on time series data, we propose a range-based performance metric with an improved algorithm. The previously studied range-based performance metric time-series aware precision and recall (TaPR) evaluated all attacks equally. In this study, improved performance metrics were studied by deriving ambiguous instances according to the characteristics of each attack and redeﬁning the algorithm of the TaPR metric. This study provides accurate assessments when performing anomaly detection on time series data and allows predictions to be evaluated based on the characteristics of the attack.


Introduction
Recently, attacks targeting industrial control systems have been increasing. This is because OT networks, which were more closed off in the past, have been combined with IT networks, and the number of devices communicating with the outside has increased [1,2]. Therefore, OT networks have inherited the vulnerabilities and attack paths and vectors that exist in IT networks. Notably, because various industrial control systems comprise national infrastructure, the scale of damage caused by an attack is significantly large. Therefore, research on protection and detection methods is being actively conducted. In the past, systems such as firewalls, IDSs, and IPSs were introduced to protect and detect external attacks. However, recently, attention has shifted toward solutions involving artificial intelligence, a core element of the Fourth Industrial Revolution [3,4]. Most studies focus on using machine learning, which uses deep learning to train testbed-based data to improve accuracy and detect more anomalies. However, research on how to interpret the results of anomaly detection is limited. An appropriate interpretation must be made according to the characteristics of the data applied to the anomaly detection and the type of attack they contain.
The data extracted from industrial control systems are characteristically time series data. To perform and evaluate anomaly detection on time series data, non-traditional performance metrics are needed. In many cases, standard point-based metrics are used when evaluating detections. However, standard point-based metrics cannot accurately evaluate anomaly detection on time series data. Standard point-based metrics give higher scores to detect longer anomalies. However, it is important to detect various attacks in industrial control systems. Therefore, point-based metrics are not appropriate because they give higher scores to the method of detecting multiple attacks and the method of detecting long attacks, whichever method that may be. In addition, abnormal behavior that occurs in industrial control systems often progresses slowly over a long period of time. Therefore, it is difficult to clearly grasp the start and end points of an attack. It is necessary to evaluate the result of anomaly detection in consideration of its characteristics.
The purpose of our research is to evaluate the performance of anomaly detection and suggest directions for further improving the performance metrics, using performance metrics that characterize the data. We overwrote the ambiguous section of the time-series aware precision and recall (TaPR) metric [5], which is a range-based performance metric. TaPR consists of a detection score and a partial score. At this time, the partial score is given in consideration of the feature that it is difficult to grasp the start and end points of the attack. We looked at this ambiguous section of the algorithm and figured out that we could apply the same percentage to all anomalous behaviors to derive the ambiguous section. The attack duration section is from the moment the attacker starts attacking the system to the moment it stops. However, the sections in which the system is actually affected are different.
Assuming the system was attacked by an attacker while it was operating normally, the physical operating values will begin to change over time. Then, the moment the operator sets the threshold value, the real affected section begins. The real affected section is not the moment when the attacker finishes the attack, but the moment when the physical action value reaches below the threshold value. Therefore, this section is not directly affected by the attacker, but it means that the attack remains and affects the system. The duration of the attack and the real affected section are different for each attack. We would like to focus on this part and define ambiguous sections according to the characteristics of the attack.
In summary, our contributions include the following: • We create training models and perform anomaly detection using operation datasets derived from real industrial control systems; • We propose a range-based metric suitable for evaluating anomaly detection against time series data; • We redefine the algorithm for the ambiguous section of TaPR in consideration of the characteristics of the attack.
This paper is structured as follows. In Section 2, we discuss industrial control systems and the data in the networks and operating systems of the industrial control system datasets that have been published thus far. Section 2 also describes the standard and time-seriesbased performance metrics typically used to describe the proposed anomaly detection assessment of performance metrics. After that, we analyze the operational data to determine which performance index to use for the anomaly detection research. Section 3 proposes an improved algorithm for the time-series aware precision and recall (TaPR) metric, a time-series-based performance metric. We redefine the algorithm of the ambiguous section of the TaPR metric using the upper control line (UCL) and lower control line (LCL), which are concepts of statistical process control (SPC). In Section 4, we perform and analyze anomaly detection in our operational database using deep learning. Additionally, the redefined algorithm proposed in Section 3 is applied to the case study. Finally, Section 5 draws conclusions and presents future research directions. The safety zone is where the safety instrumentation system, responsi operation of the industrial control system, is located. The industrial zone is eration technology) and includes the cell/area zone containing levels 0, 1, an level 3, which serves as a control center that integrates and manages multi Level 0 is the process area, consisting of I/O devices that recognize and ma ical processes, including sensors that measure temperature, humidity, pre actuators that control equipment according to the commands from the con is the basic control area, primarily composed of a manufacturing process c acting with level 0 devices and operated by a PLC (programmable logic c VFD (variable frequency drive). Level 2 is the supervision and control area erator HMI (human-machine interface) and alarm systems that supervise a erations are operated [9].
Level 3 is the field operation management area comprising an applicat integrates the monitoring and control of physical on-site processes, an e The safety zone is where the safety instrumentation system, responsible for the safe operation of the industrial control system, is located. The industrial zone is called OT (operation technology) and includes the cell/area zone containing levels 0, 1, and 2, as well as level 3, which serves as a control center that integrates and manages multiple cells/areas. Level 0 is the process area, consisting of I/O devices that recognize and manipulate physical processes, including sensors that measure temperature, humidity, pressure, etc., and actuators that control equipment according to the commands from the controller. Level 1 is the basic control area, primarily composed of a manufacturing process controller interacting with level 0 devices and operated by a PLC (programmable logic controller) and VFD (variable frequency drive). Level 2 is the supervision and control area, where an operator HMI (human-machine interface) and alarm systems that supervise and control operations are operated [9].
Level 3 is the field operation management area comprising an application server that integrates the monitoring and control of physical on-site processes, an engineer workstation Electronics 2022, 11, 1213 4 of 21 that configures and controls the overall hardware of the industrial control system, an HMI, a domain controller, and a historian server that primarily uses a real-time database. This level is an essential zone because there are applications, devices, and controllers crucial for monitoring and controlling the industrial control system. An antivirus or patch management server operates in this area. The enterprise zone includes levels 4 and 5 and refers to an environment where the production management system, manufacturing execution system (MES), process monitoring, general IT system application, enterprise resource planning (ERP) system, and e-mail server are installed [9].
Industrial control systems introduced and operated in the national infrastructure are rapidly becoming intelligent and digitized. In addition, the increase in connectivity be-tween heterogeneous devices in various environments owing to the Fourth Industrial Revolution triggers the expansion of the attack surface of the existing IT environment security threats to the OT environment, leading to more diverse security threats than before [10].
According to the international standard IEC 62443, a standard that defines regulations and requirements exists. IEC 62443-1-1 describes seven functional requirements (FR), such as the definition of the control system capability security level and component requirements [11]. This study focused on the timely event response and resource availability among the seven requirements of IEC 62443-4-2 [12]. The two requirements focus on a control system's safe monitoring and operation. Therefore, monitoring the operation data in the industrial control system in real time and performing anomaly detection are necessary.
In the past, behavior-based IDSs/IPSs were used, but recently, anomaly detection has also begun to use neural network algorithms.

Datasets
The data extracted from industrial control systems can be divided into network traffic and operating system datasets. Table 1 shows the operating system datasets, where five types were investigated. The operating system conducts a scenario-based attack through the attack point. SWaT Datasets iTrust 41 2015 [17,18] HAI Datasets NSR 50 2020 The first dataset is from the ICS Cyber Attack Datasets [13], consisting of datasets for five processes: power systems, gas pipelines, gas pipelines and water storage tanks, improved gas pipelines, and energy management systems. The second dataset is BATADAL [14], a dataset from a contest aimed at generating an attack detection algorithm, based on a water distribution system, and consisting of three datasets, two for training data and one for validation. Training data #1 are normal data collected for one year, and training data #2 are data collected for 6 months and consist of normal data and attacked data. Data for verification are unlabeled data collected for 4 months. The third dataset is WADI [15], which was generated from multiple testbeds extended from the SWaT testbed. The drainage testbed simulates water demand and consumption over time. It collects steady-state data for 14 days and contains attacks for 2 days.
The fourth dataset is SWaT [16], which was collected from the testbed of a water treatment system. It contains the physical devices that make up each stage of the water treatment system, and eight versions through seven renewals exist thus far. The fifth dataset is HAI [17], which comprises data collected from a water treatment process system. It consists of three physical systems, namely, GE's turbine, Emerson's boiler, and FESTO's water treatment system, and a dSPACE HIL simulator. The variables measured and controlled by the control system were collected at 1 s intervals at 78 points. Currently, versions 20.07 [18] and 21.03 exist, and the renewed 21.03 version adds attack and collection points.

Performance Metrics for Anomaly Detection
Performance metrics for anomaly detection evaluation can be divided into point-and range-based methods, as shown in Figure 2. In point-based methods, only abnormal and normal answer labels are correct, and points are given if the answer is the same. In contrast, range-based methods not only divide into abnormal and normal labels but also assign different scores according to performance metrics. This allows the evaluator to assign different weights to give different scores. It consists of three physical systems, namely, GE's turbine, Emerson's boiler, and FESTO's water treatment system, and a dSPACE HIL simulator. The variables measured and controlled by the control system were collected at 1 s intervals at 78 points. Currently, versions 20.07 [18] and 21.03 exist, and the renewed 21.03 version adds attack and collection points.

Performance Metrics for Anomaly Detection
Performance metrics for anomaly detection evaluation can be divided into point-and range-based methods, as shown in Figure 2. In point-based methods, only abnormal and normal answer labels are correct, and points are given if the answer is the same. In contrast, range-based methods not only divide into abnormal and normal labels but also assign different scores according to performance metrics. This allows the evaluator to assign different weights to give different scores.

Standard Metric
The standard metric refers to a point-based performance index and a function constructed using items of a confusion matrix. The function uses accuracy and often considers precision and recall as well. There are the F1-score using precision and recall, the ROC curve (receiver operating characteristic curve) using true and false positive rates, and the AUC (area under the curve) indicating the bottom area of the ROC curve.

Range-Based Recall and Precision
Anomaly detection is defined as a range-based anomaly caused by an atypical event; therefore, using a point-based performance evaluation index is inappropriate for this [19]. Therefore, range-based recall (RR) and precision (RP), which are range-based performance indexes, were used. Recall receives a score if the abnormality is sufficiently identified and a penalty if not. However, because time series data are expressible as a range, recall is an inappropriate performance evaluation index.

TaPR
TaPR was proposed as a performance index suitable for evaluating anomaly detection methods in time series data with time-series aware precision and recall [5]. TaPR consists of a detection score indicating the number of anomalies detected and a subscore indicating how accurately each anomaly was detected. TaPR presents two problems in the

Standard Metric
The standard metric refers to a point-based performance index and a function constructed using items of a confusion matrix. The function uses accuracy and often considers precision and recall as well. There are the F1-score using precision and recall, the ROC curve (receiver operating characteristic curve) using true and false positive rates, and the AUC (area under the curve) indicating the bottom area of the ROC curve.

Range-Based Recall and Precision
Anomaly detection is defined as a range-based anomaly caused by an atypical event; therefore, using a point-based performance evaluation index is inappropriate for this [19]. Therefore, range-based recall (RR) and precision (RP), which are range-based performance indexes, were used. Recall receives a score if the abnormality is sufficiently identified and a penalty if not. However, because time series data are expressible as a range, recall is an inappropriate performance evaluation index.

TaPR
TaPR was proposed as a performance index suitable for evaluating anomaly detection methods in time series data with time-series aware precision and recall [5]. TaPR consists of a detection score indicating the number of anomalies detected and a subscore indicating how accurately each anomaly was detected. TaPR presents two problems in the existing performance metrics: a method that detects only long anomalies receives a high score; the ambiguity of whether the value derived from the process of returning to the normal value within the physical process is affected by the attack being regarded as normal or abnormal. Figure 3 presents the problem of giving a high score to a method that detects only long anomalies. Method 1 performed anomaly detection with predictions 1 and 2, and method 2 with prediction 3. The actual anomaly occurred in anomalies 1 and 2 in time units 2 to 8 and 9 to 13 on the time axis. Table 2 shows the precision and recall for methods 1 and 2.
existing performance metrics: a method that detects only long anomalies receives a high score; the ambiguity of whether the value derived from the process of returning to the normal value within the physical process is affected by the attack being regarded as normal or abnormal. Figure 3 presents the problem of giving a high score to a method that detects only long anomalies. Method 1 performed anomaly detection with predictions 1 and 2, and method 2 with prediction 3. The actual anomaly occurred in anomalies 1 and 2 in time units 2 to 8 and 9 to 13 on the time axis. Table 2 shows the precision and recall for methods 1 and 2. Method 1 detected all anomalies but had lower precision and recall values than method 2, which detected only one long anomaly. If abnormality 2 is an attack that can have a fatal effect on the internals of the industrial control system, method 1 outperforms method 2 in abnormality detection. We describe the second problem, the ambiguity, in reference to Figure 4. In Figure 4, the X-axis represents the time, and the Y-axis represents the physical operating values of internal system devices. During normal operation, an attacker carries out an attack, and consequently, the physical operating value exceeds the threshold and reaches a value determined as abnormal. Once the attack ends, section (a) outlines the ambiguous section while returning to the normal value. Section (a) is the process of returning to normal because it is not directly affected by the attacker, but the influence remains. In addition, half of section (a) exceeds the threshold and may be regarded as abnormal, and the other half may be determined as normal because it exceeds the threshold.   Method 1 detected all anomalies but had lower precision and recall values than method 2, which detected only one long anomaly. If abnormality 2 is an attack that can have a fatal effect on the internals of the industrial control system, method 1 outperforms method 2 in abnormality detection.
We describe the second problem, the ambiguity, in reference to Figure 4. In Figure 4, the X-axis represents the time, and the Y-axis represents the physical operating values of internal system devices. During normal operation, an attacker carries out an attack, and consequently, the physical operating value exceeds the threshold and reaches a value determined as abnormal. Once the attack ends, section (a) outlines the ambiguous section while returning to the normal value. Section (a) is the process of returning to normal because it is not directly affected by the attacker, but the influence remains. In addition, half of section (a) exceeds the threshold and may be regarded as abnormal, and the other half may be determined as normal because it exceeds the threshold. mal or abnormal. Figure 3 presents the problem of giving a high score to a method that detects only long anomalies. Method 1 performed anomaly detection with predictions 1 and 2, and method 2 with prediction 3. The actual anomaly occurred in anomalies 1 and 2 in time units 2 to 8 and 9 to 13 on the time axis. Table 2 shows the precision and recall for methods 1 and 2. Method 1 detected all anomalies but had lower precision and recall values than method 2, which detected only one long anomaly. If abnormality 2 is an attack that can have a fatal effect on the internals of the industrial control system, method 1 outperforms method 2 in abnormality detection. Table 2. Precision and recall for methods 1 and 2.

Method
Precision We describe the second problem, the ambiguity, in reference to Figure 4. In Figure 4, the X-axis represents the time, and the Y-axis represents the physical operating values of internal system devices. During normal operation, an attacker carries out an attack, and consequently, the physical operating value exceeds the threshold and reaches a value determined as abnormal. Once the attack ends, section (a) outlines the ambiguous section while returning to the normal value. Section (a) is the process of returning to normal because it is not directly affected by the attacker, but the influence remains. In addition, half of section (a) exceeds the threshold and may be regarded as abnormal, and the other half may be determined as normal because it exceeds the threshold.  The performance metric derived from these two problems is TaPR. TaPR consists of a detection score and a partial score. The detection score indicates the number of detected anomalies, and the partial score indicates how accurately each anomaly is detected. TaPR emphasizes the importance of the number of anomalies detected through the sum of these two scores. In addition, although the value of 'normal' is shown among the sections owing to external influences, this section (a) being 'abnormal' is highly likely; thus, if this section is predicted as 'abnormal', the subsequent scoring is given. TaP and TaR are calculated as the sum of the detection score (TaP_d, TaR_d) and the portion score (TaP_p, TaR_p). The ambiguous instances used in the portion score are denoted as a and the number of ambiguous instances is denoted as δ [5] in Table 3. Table 3. Factors and formulas constituting the performance metrics TaP and TaR.

Factor
Description Formula The number of anomalies ρ The suspicious instances The number of predictions a The ambiguous instances The number of ambiguous instances -A A set of ambiguous instances A = a 1 , a 2 , · · · , a n

Requirements for Performance Metrics
To apply the anomaly detection evaluation method to time series operation data, the following conditions are required for performance metrics: 1.
The highest score is not given to the method of detecting long-length anomalies.

2.
Scores are given to cases that detect various anomalies.

3.
Scores are given to cases that detect proactive signs.

4.
When an external influence such as an attack is received, the threshold value that separates normal and abnormal is exceeded, and the external influence ends; the value showing abnormality has a score for the case in which it returns to normal.

5.
A detection score is given according to the characteristics of the attack.
Supplementing the explanation of No. 4 are cases in which the external influence is applied but the boundary value is not exceeded yet, and the external influence is over, but the threshold value is not exceeded. Determining this section only as abnormal and normal is difficult. Table 4 summarizes papers that performed anomaly detection on the ICS operation dataset and evaluated performance metrics. It is indicated whether the performance metrics used in the papers are point-based or range-based. The authors of [20] conducted an anomaly detection study on the SWaT dataset. SWaT's attack types consist of single-stage single-point, single-stage multipoint, multistage single-point, and multistage multipoint attacks. Among them, the problem was that there was no research to detect multi-step attacks, and the attacked variables could not be identified. The model used for anomaly detection was the genetic algorithm (GA). To improve detection performance, exponentially weighted moving average smoothing, the mean p-powered error measure, individual error weights for each tag value, and disjoint prediction windows were adjusted. Then, the NAB performance metric was used to evaluate the prediction of anomaly detection. The authors of [21] proposed a deep learning-based anomaly detection method using the Seq-to-Seq model. The study was conducted on the SWaT dataset, and by using the Seq-to-Seq model, it can be said that the encoding-decoding method is suitable for time series operational data. Anomaly detection prediction consists of three steps. In the first stage, the model predicts the sensor and actuator values, and in the second stage, the difference between the predicted data and the actual data is calculated. In the last step, if the difference calculated in step 2 is greater than the set value, it is regarded as abnormal, and a warning is sent. After that, cases in which false negative attacks were not detected and false positives are analyzed. The authors of [22] performed anomaly detection on the SWaT dataset through five machine learning algorithms. SVM, random forest, and KNN were used as supervised learning algorithms, and one-class SVM (OCSVM) and an autoencoder (AE) were used as unsupervised learning algorithms. As an anomaly detection performance evaluation index, values derived from the accuracy and F1-score were compared. On average, supervised learning using data labeling performed better than unsupervised learning, and random forest showed the best performance among the supervised learning algorithms. The authors of [23] studied an intrusion detection mechanism using physical operational data. The method that can be integrated into the network intrusion detection system (NIDS) and the ICS system security were improved. Using operational data, it is easy to detect anomalies in the section without interfering with the communication of each level section. Anomalies were detected using three machine learning models for the HAI 20.03 dataset. The data imbalance problem was solved using the Synthetic Minority Over-sampling Technique (SMOTE). The stratified shuffle split (SSS) method was used for training dataset separation. As a machine learning model, KNN, a decision tree, and a random forest algorithm were used to build the model. Random forest showed the best performance using precision, recall, F1-score, and AUC as performance metrics. The authors of [24] performed anomaly detection based on a stacked autoencoder (SAE) and deep support vector data description (deep SVDD). HAI version 20.03 was used as operational data. Data preprocessing was performed through data scaling, and a model was created through the SAE and deep SVDD. Single event anomaly detection was performed by comparing the accuracies by setting the threshold with the highest accuracy. Point-based  Table 4 shows the results of whether the performance metrics presented in Figure 2 satisfy the relevant items. The point-based performance metrics do not satisfy all the conditions. In addition, a study evaluating anomaly detection based on point-based performance metrics is presented. There were papers that studied the range-based performance metrics themselves, but there was no study of anomaly detection using them as performance metrics. Among the range-based performance metrics, RR and RP do not satisfy conditions 3 and 5, and TaPR does not satisfy condition 5. Therefore, in Section 3, we propose performance metrics that satisfy all the conditions by improving the TaPR metric. Moreover, the existing research performed anomaly detection targeting industrial control system time series data but evaluated it using performance metrics in a point-based method. In this paper, the results are derived from the performance metrics of the point-based method and the range-based method. We then compare the results to discuss which method is more suitable for evaluating anomaly detection targeting time series data.

Proposed Performance Metrics
In this section, the method of deriving performance metrics of the existing TaPR is improved. As a direct definition was unspecified in this study, the ambiguous section of the detection score (factor α ) can be defined by looking at the python code of TaPR . . . For example, an actual attack occurring in the time interval from 10 to 100 is assumed. The size of the section where the attack occurred is then 90. Because start_id adds 1 to the last point of the attack time, it becomes 101. The end_id is multiplied by the ratio (delta, 0 ≤ delta ≤ 1), multiplied by the size of the section where the attack occurred, and then start_id is added. Therefore, if the delta is 1.0, the ambiguous section becomes the ambiguous section from 101 to 191 immediately after the attack.

Statistical Process Control (SPC)
Because the duration of the actual attack and the section affected by each attack are different, there is a limit to applying only one ratio to all attacks. Therefore, the following method is proposed. Because the attack duration section is defined in the test dataset, it is not defined separately. The actual affected section must be defined. First, the upper and lower limits are set based on the threshold. Here, the concept of process control [25,26] is introduced. The industrial control system dataset used in this experiment comprises data collected from several processes for which process management, which is statistical data, was determined to be appropriate for detecting anomalies. The process control selects quality characteristics that indicate the state of the process and obtains information about the process. The process is maintained in control by acting on the cause of the abnormality found. A graph in which one centerline and two control limit lines are set to manage the dispersion of quality is called a control chart. The control limit line is divided into the specification limit, which is a standard for judging the quantity and defects of products already produced, and the control limit, which is a management procedure that instructs improvement measures for the process to eliminate the cause of abnormalities. The threshold that separates abnormality from normality is defined as the central line (CL) of the control chart. The upper control line (UCL) and lower control line (LCL) are set by taking the section width at one time (1σ) of the standard deviation of both sides based on the CL.

Defining the Real Affected Section
The starting point of the affected section is where the size of the mean error, which is the blue line in the graph, exceeds the lower limit value. Next, if the magnitude of the average error falls below the lower limit again and lasts for 1 min, the affected section ends. The size of the ambiguous instance is the absolute value of the duration of the attack minus the section affected. Therefore, the start point of the ambiguous instance is obtained by adding 1 to the last point of the attack duration section, and the end point is obtained by adding the start point and the size of the ambiguous instances. Algorithm 1 for finding ambiguous instances are summarized.
Initialize Central Line (CL) to threshold During the attack, the algorithm executes the following in Figure 5. First, the magnitude of the mean error and the magnitude of the LCL are compared. If the magnitude of the mean error is larger than that of the LCL, the starting point is appended to the RI, which is the affected section. Next, if the magnitude of the average error is smaller than the LCL, the time point is stored in RI_end_time, which temporarily stores the end time point. Next, if the magnitude of the average error for 1 min is smaller than the LCL, RI_end_time is stored as the end time. However, if the size of the average error becomes larger than the LCL again, the end point must be set again. Because ambiguous instances must be calculated and stored as absolute values, the sizes of AD and RA are compared and stored in AI. The start point is set by adding 1 to the last point of the attack time, and end_id is set by adding AI to start_id. During the attack, the algorithm executes the following in Figure 5. First, the magnitude of the mean error and the magnitude of the LCL are compared. If the magnitude of the mean error is larger than that of the LCL, the starting point is appended to the RI, which is the affected section. Next, if the magnitude of the average error is smaller than the LCL, the time point is stored in RI_end_time, which temporarily stores the end time point. Next, if the magnitude of the average error for 1 min is smaller than the LCL, RI_end_time is stored as the end time. However, if the size of the average error becomes larger than the LCL again, the end point must be set again. Because ambiguous instances must be calculated and stored as absolute values, the sizes of AD and RA are compared and stored in AI. The start point is set by adding 1 to the last point of the attack time, and end_id is set by adding AI to start_id.  This is a learning data selection process for anomaly detection based on industrial control system operation data using artificial intelligence. In the industrial control system operation datasets, SWaT and HAI, the water treatment system operation data collected in Section 2.2 were compared. When comparing data points and attack points between them, SWaT had 60 and 41, and HAI had 78 and 50, respectively. From the diversity of data points and attack points, we determined that HAI had more suitable data for experimentation. The experiment was more suitable for making a case because it had various attack points, while experimenting with SWaT was difficult because there was no labeling for anomalies in recent data.

Choosing a Deep Learning Model
Because the data are time series, selecting a model suitable for time series is necessary. Representative recurrent neural network (RNN), long short-term memory (LSTM), and gated recurrent unit (GRU) algorithms were compared. The RNN [27] receives several input values in chronological order as current inputs and calculates them. However, if the length of the training data is long owing to the structure of the RNN, transferring the previous information to the present is difficult, resulting in a gradient loss problem (long-term dependency). LSTM is an algorithm that solves the gradient loss problem of RNNs. It has a structure in which each unit in the middle layer of a neural network is replaced with a memory unit called an LSTM block. LSTM [28] consists of four gates and one cell state. The gate layer includes a forget gate layer, an input gate layer, the current state, and an output gate layer, and the cell includes the cell state. The cell state indicates the long-term memory. GRU [29] is a simplified version of the cells constituting the LSTM. Compared to LSTM, the structure is simple, and the calculation speed is fast because there are fewer parameters to learn. GRU has two gates: update and reset gates. The update gate determines how much of the previous memory is remembered, and the reset gate decides whether the new input is merged with the old memory.
The performance metric comparison in Table 5 was performed by changing only the model in the HAI 2.0 data and HAI 2.0 baseline code provided by the HAICon2020 Industrial Control System Security Threat Detection AI Contest [17]. Table 5 shows the best indicator of the GRU's performance. Studies have demonstrated that LSTMs show better performances as the amount of data increases. However, without an accurate comparison index, this was directly applied to the data to be tested, and the GRU showing the best performance was selected as the model to be applied to the experiment; PyTorch was used for implementation.

Training Data Preprocessing
The HAI version 21.03 data used as training data consisted of three training data and were provided as a CSV file. Each file is in the form shown in Table 6. Column 1 is the time at which the data were collected, and columns 2 to 80 are the physical values collected from field devices P1, P2, P3, and P4. Column 81 indicates whether an attack event occurred (1) or not (0), and columns 82 to 84 indicate at which stage of the process the attack occurred with attack_P1, attack_P2, and attack_P3, respectively. Scaling data using normalization; 3.
Eliminating unnecessary data points with feature graphs.
The first step purifies data by removing outliers and missing values. An outlier is a value with a larger deviation than the normal value it belongs to and is detected using methods such as normal distribution, likelihood, nearest neighbor, density, and clustering. Thereafter, outliers are replaced with the upper or lower limits, the standard deviation of the mean, the absolute mean deviation, and extreme percentiles. The training data of HAI 21.03 are the data collected when operated normally without an attack, all attack labels are 0, and all data points are within the range of the minimum and maximum values. Therefore, there are no outliers. Missing values indicate absent values and are expressed as NA, which must be processed during preprocessing because they cause distortions later when using the model to forecast. Therefore, the missing values are processed by the deletion, substitution, and insertion of the predicted values. The HAI 21.03 training data do not have missing values.
The second step scales the data using normalization, as shown in Figure 6. Data scaling requires preprocessing because if each data range is too wide or small, the data may converge or diverge to 0 during the learning process. For example, P1_B4022 is data that represent temperature with a range from 0 to 40, and P1_FT03 is data that represent pressure with a range from 0 to 2500. Because each range of data differs, either standardization or normalization is used for scaling. Standardization involves distributing data on either side of 0 as a value indicating the distance between the data and the mean. In contrast, normalization in-volves transforming the data range from a minimum of 0 to a maximum of 1 to reduce the effect of the relative size of the data. The minimum and maximum physical data values (columns 2 to 80) were derived from the training data, and normalization was performed.
ing. Thereafter, outliers are replaced with the upper or lower limits, the standard deviation of the mean, the absolute mean deviation, and extreme percentiles. The training data of HAI 21.03 are the data collected when operated normally without an attack, all attack labels are 0, and all data points are within the range of the minimum and maximum values. Therefore, there are no outliers. Missing values indicate absent values and are expressed as NA, which must be processed during preprocessing because they cause distortions later when using the model to forecast. Therefore, the missing values are processed by the deletion, substitution, and insertion of the predicted values. The HAI 21.03 training data do not have missing values.
The second step scales the data using normalization, as shown in Figure 6. Data scaling requires preprocessing because if each data range is too wide or small, the data may converge or diverge to 0 during the learning process. For example, P1_B4022 is data that represent temperature with a range from 0 to 40, and P1_FT03 is data that represent pressure with a range from 0 to 2500. Because each range of data differs, either standardization or normalization is used for scaling. Standardization involves distributing data on either side of 0 as a value indicating the distance between the data and the mean. In contrast, normalization in-volves transforming the data range from a minimum of 0 to a maximum of 1 to reduce the effect of the relative size of the data. The minimum and maximum physical data values (columns 2 to 80) were derived from the training data, and normalization was performed.  The third step removes unnecessary data points using the feature graph. After normalizing the data points of the training data and converting them to values between 0 and 1, we used the graph to study the characteristics of the data points. Figure 7 shows the feature graph of data point P1_B3005. Because the values are dis-tributed between 0 and 1 and change over time, they represent the characteristics well. In contrast, the feature graph of data point P1_PCV02D in Figure 8 does not change its value over time. Thus, it is unnecessary data for anomaly detection. Data points with these characteristics were removed. The list of the 22 removed data points includes the following: P1_PCV02D, P1_PP01AD, P1_PP01AR, P1_PP01BD, P1_PP01BR, P1_PP02D, P1_PP02R, P1_STSP, P2_ASD, P2_AutoGO, P2_Emerg, P2_MSD, P2_ManualGO, P2_OnOff, P2_RTR, P2_TripEx, P2_VTR01, P2_VTR02, P2_VTR03, P2_VTR04, P3_LH, and P3_LL.

Model Generation
In the model generation process, the model was created considering the following factors: 1. Sliding window; 2. Number of hidden layers; 3. Hidden layer size.
The sliding window [30] is a method of recognizing data according to the characteristics of time series data. The method receives data and remembers the pattern for that window, and it involves input and output values. The input is the value at the beginning of the window, and the output is the value at the end of the window. As a result of the experiment considering the size of the training data for the sliding window input, the size between 75 and 90 was suitable for the experiment. Therefore, it was set at intervals of 5. Table 7 shows the corresponding sliding window input and output values from which the model was created.

Model Generation
In the model generation process, the model was created considering the following factors: 1. Sliding window; 2. Number of hidden layers; 3. Hidden layer size.
The sliding window [30] is a method of recognizing data according to the characteristics of time series data. The method receives data and remembers the pattern for that window, and it involves input and output values. The input is the value at the beginning of the window, and the output is the value at the end of the window. As a result of the experiment considering the size of the training data for the sliding window input, the size between 75 and 90 was suitable for the experiment. Therefore, it was set at intervals of 5. Table 7 shows the corresponding sliding window input and output values from which the model was created.

Model Generation
In the model generation process, the model was created considering the following factors:
Number of hidden layers; 3.
Hidden layer size.
The sliding window [30] is a method of recognizing data according to the characteristics of time series data. The method receives data and remembers the pattern for that window, and it involves input and output values. The input is the value at the beginning of the window, and the output is the value at the end of the window. As a result of the experiment considering the size of the training data for the sliding window input, the size between 75 and 90 was suitable for the experiment. Therefore, it was set at intervals of 5. Table 7 shows the corresponding sliding window input and output values from which the model was created. The following is the number of hidden layers and their sizes. The stacked GRU used is a multi-layer neural network and consists of input, hidden, and output layers. Each layer comprises nodes, and the number of hidden layers determines how many exist. The model was created using the stacked GRU algorithm selected earlier as the deep learning model. The L1 loss function was used, and AdamW [31] was used as the optimization function. The AdamW function separates the weight decay step of the existing Adam algorithm. During training, the epoch was set to 64, and the model's parameter with the best loss value was stored. A rectified linear unit (ReLU) was added as an activation function to solve the gradient vanishing problem. In addition, the PyTorch library was used for implementation. Table 8 shows if the learning proceeded according to the size of the sliding window input and output.

Anomaly Detection
The test data were formatted as shown in Table 6, and the size was 402,005 rows × 84 columns. In addition, the preprocessing of the test data was performed at the same stage as the training data. The data do not have missing values. The 22 data points were removed, and the data were normalized in the scaling step. This is the process of predicting the results of a test dataset with a model created using a training dataset. In this process, a threshold is used. To determine whether an attack occurs, the average of the calculated differences by the entire field is determined, and then the abnormality is predicted using the threshold value.
The blue line in Figure 9 indicates the size of the average error. If the blue line exceeds the red line, the average error is determined to be abnormal (an attack). Abnormality is indicated by 1, and normality by 0. The predicted label is stored in LABELS, and the correct label of the test set is extracted as ATTACK_LABELS. In this case, the LABELS and ATTACK_LABELS sizes do not match. Because the model learned with the sliding window method, the first part cannot be predicted as much as the sliding window input value. In addition, because the three learning datasets were combined into one set of data, 'time' does not continue between the file changes and is thus unpredictable, and the number of predicted labels, LABELS, is less than ATTACK_LABELS. To solve this, we applied a function that fills in the blanks with the normal 0 and remaining values with the judgment of the model.
As shown in Table 9, the F1-score, TaP, and TaR performance metrics according to the boundary values were derived. The number of detected anomalies was given the highest priority, and subsequently, the highest performance metrics were listed as C1-2, C2-2, C3-1, and C4-2. Among them, C1-2, which had the highest number of detected anomalies and performance metrics, was applied in the analysis of four anomaly detection experiments and evaluation methods. A total of 50 ATTACK_LABELS in the test dataset were labeled, and of these, 36 anomalies were detected as abnormal (attacks) by C1-2: . Average of differences produced by the entire field (blue)/threshold (red). The blue line is the average of the differences calculated by the entire field to determine if there is an attack. The red line is a value arbitrarily specified by the experimenter, and if the blue line exceeds the red line, that point is predicted to be an attack.
As shown in Table 9, the F1-score, TaP, and TaR performance metrics according to the boundary values were derived. The number of detected anomalies was given the highest priority, and subsequently, the highest performance metrics were listed as C1-2, C2-2, C3-1, and C4-2. Among them, C1-2, which had the highest number of detected anomalies and performance metrics, was applied in the analysis of four anomaly detection experiments and evaluation methods. A total of 50 ATTACK_LABELS in the test dataset were labeled, and of these, 36 anomalies were detected as abnormal (attacks) by C1-2:   . Average of differences produced by the entire field (blue)/threshold (red). The blue line is the average of the differences calculated by the entire field to determine if there is an attack. The red line is a value arbitrarily specified by the experimenter, and if the blue line exceeds the red line, that point is predicted to be an attack.

Analysis of Performance Metrics
For the performance evaluation, a section was set in the test dataset. Among the 36 anomalies detected by C1-2, attacks ['1', '10', '16', '27', '33'] were selected. Those five attacks are shown in Figures 10-14.       Table 10 compares the anomaly detection performance of five case intervals selected in the interval setting for performance evaluation. The point-based standard metrics accuracy and F1-score and the range-based standard metrics TaP and TaR were used as performance metrics. Five attacks scored an accuracy of 0.88 or higher. If only this metric is checked, we can determine that all attacks were accurately detected. However, referring to the F1score, which was always lower than the accuracy, as it is a combination of recall and precision, a score between TaP and TaR was derived. When evaluating anomaly detection in time series data by deriving various performance metric results, using point-based performance metrics may lead to an inaccurate evaluation.
As evident in Figure 15, 71,955~72,000 was undetected, 72,000~72,045 was detected, 72,045~72,055 was undetected, 72,055~72,140 was undetected, 72,140~72,160 was detected, and 72,160~72,235 was undetected. From the performance index for this section, the accuracy was 0.883, which is unsuitable as an index for performing anomaly detection. In contrast, the TaP and TaR scores considering the time series were 0.534 and 0.843. For the ambiguous section, the actual duration of the attack was section 71,955~72,235, and it can be inferred that the starting point of the affected section began at 71,980. Comparison of Performance Metrics by Attack Table 10 compares the anomaly detection performance of five case intervals selected in the interval setting for performance evaluation. The point-based standard metrics accuracy and F1-score and the range-based standard metrics TaP and TaR were used as performance metrics. Five attacks scored an accuracy of 0.88 or higher. If only this metric is checked, we can determine that all attacks were accurately detected. However, referring to the F1-score, which was always lower than the accuracy, as it is a combination of recall and precision, a score between TaP and TaR was derived. When evaluating anomaly detection in time series data by deriving various performance metric results, using point-based performance metrics may lead to an inaccurate evaluation.
As evident in Figure 15, 71,955~72,000 was undetected, 72,000~72,045 was detected, 72,045~72,055 was undetected, 72,055~72,140 was undetected, 72,140~72,160 was detected, and 72,160~72,235 was undetected. From the performance index for this section, the accuracy was 0.883, which is unsuitable as an index for performing anomaly detection. In contrast, the TaP and TaR scores considering the time series were 0.534 and 0.843. For the ambiguous section, the actual duration of the attack was section 71,955~72,235, and it can be inferred that the starting point of the affected section began at 71,980.
Through this analysis, while an attack was predicted, there is a section in which an attack was detected and a section in which an attack was not detected. Additionally, an attack (red line) occurred, but the blue line does not cross the threshold value, which confirms that the attack is slowly affecting the inside of the system. Therefore, to evaluate the anomaly detection of time series data, time series features must be considered. Through this analysis, while an attack was predicted, there is a section in which an attack was detected and a section in which an attack was not detected. Additionally, an attack (red line) occurred, but the blue line does not cross the threshold value, which confirms that the attack is slowly affecting the inside of the system. Therefore, to evaluate the anomaly detection of time series data, time series features must be considered. Table 11 shows that TaP_d and TaR_d were 1.000 because an attack was detected. TaP_p was derived by considering the ambiguous instance value after calculating the case in which the actual anomaly and prediction overlap. TaR_p was derived by calculating the case in which the anomaly and prediction overlap and then dividing by the number of instances. As such, using TaP and TaR as performance metrics is more suitable for anomaly detection analysis than the accuracy or F1-score.

Application of the Proposed Algorithm
In this section, we define ambiguous intervals by applying the overridden algorithm proposed in Section 3. Figure 16 shows the size of the attack duration section, the section affected, and the ambiguous section of the proposed technique. We set upper and lower lines based on the threshold value, which is the central line. The attack duration section does not need to be calculated as it is a fact defined in the test dataset. Therefore, we need to define the real affected section. Depending on the algorithm, we need to set the start and end points of the real affected section. In the process of setting the end point of the real affected section, the green point is stored in RI_end_time but does not last for 1 min; thus, the subsequent red point is stored in RI_end_time. In addition, because the end point of the attack duration section occurred after the end point of the real affected section, the absolute value of the value obtained by subtracting the end point of the real affected section from the end point of the attack duration section is ambiguous. To derive the TaPR scores using the size of the newly defined ambiguous interval, the interval is substituted into the partial scores for calculating TaR and TaP. Therefore, when calculating S(a',p), the improved score of TaPR is derived by substituting the size of the newly defined ambiguous section. Derived as performance metrics, these values can be said to have improved  Table 11 shows that TaP_d and TaR_d were 1.000 because an attack was detected. TaP_p was derived by considering the ambiguous instance value after calculating the case in which the actual anomaly and prediction overlap. TaR_p was derived by calculating the case in which the anomaly and prediction overlap and then dividing by the number of instances. As such, using TaP and TaR as performance metrics is more suitable for anomaly detection analysis than the accuracy or F1-score.

Application of the Proposed Algorithm
In this section, we define ambiguous intervals by applying the overridden algorithm proposed in Section 3. Figure 16 shows the size of the attack duration section, the section affected, and the ambiguous section of the proposed technique. We set upper and lower lines based on the threshold value, which is the central line. The attack duration section does not need to be calculated as it is a fact defined in the test dataset. Therefore, we need to define the real affected section. Depending on the algorithm, we need to set the start and end points of the real affected section. In the process of setting the end point of the real affected section, the green point is stored in RI_end_time but does not last for 1 min; thus, the subsequent red point is stored in RI_end_time. In addition, because the end point of the attack duration section occurred after the end point of the real affected section, the absolute value of the value obtained by subtracting the end point of the real affected section from the end point of the attack duration section is ambiguous. To derive the TaPR scores using the size of the newly defined ambiguous interval, the interval is substituted into the partial scores for calculating TaR and TaP. Therefore, when calculating S(a',p), the improved score of TaPR is derived by substituting the size of the newly defined ambiguous section. Derived as performance metrics, these values can be said to have improved the existing performance metrics because the size of the ambiguous section suitable for the attack was defined.

Conclusions
Cybersecurity threats to industrial control systems that control and manage national infrastructure are on the rise. Therefore, the ability to detect and interpret anomalies in industrial control systems is important. To proactively detect and respond to threats, we need to use a real test-based operating system dataset. Furthermore, it is also important to analyze the results of predicting attacks. Industrial control systems mainly have the characteristics of the time series. However, existing studies used a point-based performance evaluation method that does not consider these time series characteristics. Therefore, we used a real testbed-based operational dataset to perform anomaly detection and derived results using various performance metric methods.
We then redefined the ambiguous section algorithm for TaPR, a range-based performance metric. Traditional TaPR partial scores were calculated at the same rate without considering the characteristics of the attack. It is difficult to conduct an accurate evaluation without considering the characteristics of the attack. Further, it is difficult to pinpoint when attacks on industrial control systems begin and end. Additionally, it is necessary to consider the time when the inside of the system returns to normal even after the attack is over. Therefore, we used the SPC upper and lower limit concepts to reflect the characteristics of these attacks. Using these concepts, we redefined the algorithm of the ambiguous section, and we could derive an evaluation by reflecting the characteristics of the attack. The proposed algorithm allows for more accurate evaluation when performing anomaly detection targeting industrial control system-based time series datasets. However, we need to quantitatively analyze whether the proposed performance indicators are superior to the existing performance metrics. This is because it is difficult to quantitatively express whether the interpretation was accurate, rather than simply determining that the anomaly detection was not successful.
Future research will apply the upper and lower limits applied to the proposed algorithm to anomaly detection. Existing studies only used thresholds to determine anomalies and normality. Each process has its own characteristics after normalizing various process values. Therefore, there is a limit to judging abnormal and normal with only one value. We plan to perform anomaly detection using the upper and lower limits of the SPC applied to the proposed algorithm together with the threshold value.

Conclusions
Cybersecurity threats to industrial control systems that control and manage national infrastructure are on the rise. Therefore, the ability to detect and interpret anomalies in industrial control systems is important. To proactively detect and respond to threats, we need to use a real test-based operating system dataset. Furthermore, it is also important to analyze the results of predicting attacks. Industrial control systems mainly have the characteristics of the time series. However, existing studies used a point-based performance evaluation method that does not consider these time series characteristics. Therefore, we used a real testbed-based operational dataset to perform anomaly detection and derived results using various performance metric methods.
We then redefined the ambiguous section algorithm for TaPR, a range-based performance metric. Traditional TaPR partial scores were calculated at the same rate without considering the characteristics of the attack. It is difficult to conduct an accurate evaluation without considering the characteristics of the attack. Further, it is difficult to pinpoint when attacks on industrial control systems begin and end. Additionally, it is necessary to consider the time when the inside of the system returns to normal even after the attack is over. Therefore, we used the SPC upper and lower limit concepts to reflect the characteristics of these attacks. Using these concepts, we redefined the algorithm of the ambiguous section, and we could derive an evaluation by reflecting the characteristics of the attack. The proposed algorithm allows for more accurate evaluation when performing anomaly detection targeting industrial control system-based time series datasets. However, we need to quantitatively analyze whether the proposed performance indicators are superior to the existing performance metrics. This is because it is difficult to quantitatively express whether the interpretation was accurate, rather than simply determining that the anomaly detection was not successful.
Future research will apply the upper and lower limits applied to the proposed algorithm to anomaly detection. Existing studies only used thresholds to determine anomalies and normality. Each process has its own characteristics after normalizing various process values. Therefore, there is a limit to judging abnormal and normal with only one value. We plan to perform anomaly detection using the upper and lower limits of the SPC applied to the proposed algorithm together with the threshold value.