A Resilient Cyber-Physical Demand Forecasting System for Critical Infrastructures against Stealthy False Data Injection Attacks

: The safe and efﬁcient function of critical national infrastructure (CNI) relies on the accurate demand forecast. Cyber-physical system-based demand forecasting systems (CDFS), typically found in CNI (such as energy, water, and transport), are highly vulnerable to being compromised under false data injection attacks (FDIAs). The problem is that the majority of existing CDFS employ anomaly-based intrusion detection systems (AIDS) to combat FDIAs. Since the distribution of demand time series keeps changing naturally with time, AIDS treat a major change in the distribution as an attack, but this approach is not effective against colluding FDIAs. To overcome this problem, we propose a novel resilient CDFS called PRDFS (Proposed Resilient Demand Forecasting System). The primary novelty of PRDFS is that it uses signature-based intrusion detection systems (SIDS) that automatically generate attack signatures through the game-theoretic approach for the early detection of malicious nodes. We simulate the performance of PRDFS under colluding FDIA on High Performance Computing (HPC). The simulation results show that the demand forecasting resilience of PRDFS never goes below 80%, regardless of the percentage of malicious nodes. In contrast, the resilience of the benchmark system decreases sharply from about 99% to less than 30%, over the simulation period as the percentage of malicious nodes increases.


Introduction
Our critical national infrastructure (CNI) systems like those driving power generation and distribution, water treatment, electricity production, and other platforms are being increasingly digitized and controlled by the Internet of Things (IoT) [1]. The task of forecasting demand for CNI is fundamental and crucially important to run CNI smoothly and make sound operational decisions [2]. Demand specifies the number of goods and services that will be used or consumed in an economy at a given time. Cyber physical system (CPS)-based demand forecasting systems (CDFS) are ubiquitous in CNI [2] where sensor nodes are deployed in the environment to monitor the current demand for the services of CNI (e.g., traffic and electricity consumption) [3][4][5][6][7]. The sensors transmit their measurements at a regular interval to a base station through one or more intermediate nodes depending on the distance between the sensor and the base station. In the base station, the measurements of all sensors are aggregated using aggregate operators like average and sum. The forecasting model stored in the base station uses the values of the aggregate demand time series to develop an estimate of expected demand during a specified future period. The demand forecast is transmitted to the application layer to adjust the supply of the services of CNI such as electricity, water, gas, transportation, etc., to the forecasted demand. Inaccurate demand forecasts can create an imbalance of demand and supply The purpose of the second subsystem (IDS) of the resilient CDFS is to mitigate the adverse effects of attacks on demand forecasting. We organise this section as follows. Important concepts are defined in Section 2.1. The major challenges in demand forecasting and how the existing resilient CDFS address these challenges are discussed in Section 2.2. An overview of existing attack detection and mitigation algorithms used in IDS is discussed in Section 2.3.

Definitions
This section aims to provide an operational definition of "an attack on CDFS" and define three core concepts (trust, trustworthiness, and reputation) that are frequently used in this paper.
An attack on a system is defined as "a threat action that undesirably alters system operation by adversely modifying system functions or data" [24]. For this study, an attack on CDFS is defined as systematically deviating the point forecasts and interval forecasts for k-step ahead aggregate demands from the actual values to a certain direction (upwards or downwards) over time by adversely modifying sensor data. The success of the attack is measured on the following two indicators: (1) the higher the deviations of the single-point demand forecast from the actual demand for a given period, the more successful the attack, and (2) the ultimate success of the attack is achieved when the actual demand falls outside of the forecasted bounds.
Although the terms 'trust', 'trustworthiness', and 'reputation' are widely used in the literature, there remains debate about the definitions of all three [11,13,22,23,[25][26][27][28][29]. Different studies have different definitions of these terms. In this study, we adopt the following definitions for trust, trustworthiness, and reputation. We use 'trust' and 'trustworthiness' interchangeably. If a node needs to evaluate the trust score of another node, the evaluating node is called the subject node, and the evaluated node is called the object node [20]. 'Trust' or 'trustworthiness' of a node is the subjective probability assigned to an object node by a subject node at time step t based on various factors (e.g., the object node's long-term reputation, the similarity of its measurements with the measurements generated by other nodes received at the same time step etc.) that the measurement received from the object node is accurate [27,30]. The trustworthiness of data provenance is the probability assigned by a subject node to the provenance that the measurement received from the data provenance is accurate. Data provenance refers to the metadata that captures the complete history of a piece of data from origin to destination. The data provenance of measurement consists of two basic components [31]. The first component is the route of the measurement. A route refers to the path a measurement travels on a network. The route includes every node that handles the measurement between its source and destination. The second component of the data provenance is the list of actions performed by participating nodes. The trustworthiness of a measurement value is the probability that the measurement is accurate [11]. The reputation of an object node is the overall probability estimated from the trust scores assigned to the object node by all subject nodes over time [26,30].

Challenges in Demand Forecasting
Demand forecasting is a mature area of research [32]. Nonetheless, for accurate predictions of demand, many challenges remain to be addressed [32][33][34]. One of the major challenges is the curse of dimensionality. In demand forecasting, the predictor variables are lagged variables. The lagged variables are highly correlated. Consequently, even a moderate number of predictor variables may trigger the curse of dimensionality, and demand forecasting models often suffer from overfitting. Most of the existing CDFS do not take special measures to protect themselves against the curse [9,17]. They train a single network (e.g., LSTM) with a set of lagged aggregate demand time series values to forecast the out-of-sample data.
Another major challenge in demand forecasting is the presence of non-stationarity in demand time series [32][33][34]. In a non-stationary time series, the frequency content changes in time; all frequencies do not always exist. There are currently three main approaches for dealing with time-varying frequency components [35]. The first approach is to apply the differencing operator on-demand time series and remove these patterns from the time series. However, the main difficulty of the differencing approach is that it requires exact knowledge about the locations of all frequency components in the time series, which is by no means easy to capture.
The second approach to deal with the time-varying frequency components is to enter a large number of non-seasonal lagged variables in a neural network (e.g., LSTM) during the training so that it automatically selects an optimal number of seasonal and non-seasonal lagged variables and models time-varying frequency components. Most of the existing resilient CDFS adopt this mechanism [9,17]. This approach leads to a large number of lagged input variables in the model which might exacerbate the curse of dimensionality.
The third approach to deal with the time-varying frequency components is to use a hybrid of wavelet transform and neural networks [35,36]. This approach has been drawing increasingly more attention in recent years. This approach uses wavelet techniques to split the original time series into multiple component time series. A neural network (e.g., LSTM) is trained with the non-seasonal lagged variables generated from the component time series to forecast the future values of the original demand curve. Currently, more comparative studies are required to confirm the superiority of one approach over the other.
Perhaps the biggest challenge in developing a demand forecasting model is that the older values of the demand time series are less useful than the newer ones as training data and become obsolete quickly, since the distribution of demand time series keeps changing with time [30]. Several studies suggest ensuring that the forecasting model adjusts rapidly to the changing distribution. Based on these studies, the higher penalty should be assigned to in-sample forecast errors on more recent demand observations during the training phase. However, this approach poses many practical problems. The key problem is how to weigh the values of demand time series appropriately in the presence of time-varying frequency components. An inadequate weighting scheme may cause more harm than good, as it may destroy the distribution of the original time series. Hence, the common practice is to keep adapting the forecasting model to the new observations as the sensors gather them without applying an arbitrary weighting scheme [9].

Overview of Existing Defense Strategies against Stealthy False Data Injection Attacks (FDIAs)
Existing defense systems to deal with stealthy false data injection attacks (FDIAs) can be grouped into two broad categories: anomaly-based intrusion detection systems (AIDS) and provenance-based trust management systems (TMS) [9,13,22,31,37,38].
The goal of AIDS is to detect malicious nodes and exclude their readings from the aggregation of sensor measurements while estimating the most trustworthy aggregate measurement at a given time step. When AIDS is used to deal with FDIA, the simple average is usually used to aggregate the measurements of the normal sensors. Existing AIDS can be broadly split into two categories: confidence interval-based AIDS [9,22,37] and clustering-based AIDS [38]. Both types of AIDS are based on the hard clustering concept that classifies a sensor node as either malicious (1) or normal (0). Confidence interval-based AIDS divides the search space into only two clusters (one for each class: malicious and normal). On the contrary, clustering-based AIDS allow multiple clusters for each class; this type of AIDS performs better on search space with local patterns.
AIDS uses unsupervised hard clustering algorithms to perform clustering of the search space to classify the sensor nodes as malicious and normal. Different search spaces have been explored for detecting stealthy malicious nodes, including measurement space, forecasting residual space, Spatio-temporal correlation space, the skewness space of residuals in the forecasting demand model, and reputation space [9,22,37,38]. The major limitation of AIDS is that they use a set of arbitrary rules to cluster the search space and to label the clusters as malicious and normal. Past studies suggest that AIDS suffer from low malicious node detection accuracy and a high false positive rate (FPR) [9].
The goal of TMS, in contrast to AIDS, is not to classify the sensor nodes as malicious and normal [13,31]. Instead, they carry out unsupervised fuzzy clustering in the sensor measurement space to assign a reputation score to each sensor to control its influence on the sensor measurement aggregation process. During the aggregation process, TMS assigns a trustworthiness score to each measurement at time step t based on the data similarity and provenance similarity among measurements, and the trustworthiness of its data provenance, which is usually set to the reputation score of the least reputed node in the provenance. TMS then determines the most trustworthy measurement at time step t using the weighted average method. Most of the existing TMS use linear models to update the reputation scores of nodes based on the trustworthiness scores of their measurements received at time step t. However, the linear models may not be able to accurately estimate the reputation scores of nodes in the face of stealthy FDIA.

Proposed Resilient Demand Forecasting Systems
We name our proposed resilient demand forecasting system PRDFS (Proposed Resilient Demand Forecasting System). PRDFS consists of five modules: (1) Table 1 shows the definitions of the terms used in this section.
The first layer of PRDFS is only for graphical representation ( Figure 1). Each node of this layer represents a data provenance (DPV). In our simulation study, for simplicity purposes, each DPV contains only one sensor node. The DPVs transmit streams of measurements of demand −−−→ MV(t) to the following two units: TMS and DG.

TMS:
The flowchart for TMS is presented in Figure 2. TMS produces the following three outputs, which the other modules use as inputs.  The data similarity between mv i (t) and mv j (t) estimated using MAAPE PS i,j (t) Provenance similarity between the DPVs of mv j (t) and mv i (t): PS i,j (t) = nc/n 1 mv i (t) One step ahead forecast of the conditional mean of mv i (t) The residual series for the one step ahead forecasts of ith DPVs measurements containing both the past and current residuals The residual series for the one step ahead forecasts of ith DPVs measurements containing the past residuals The observed conditional variance of mv i (t): The ensemble of CNNs forecasting mv i (t) The ensemble of CNNs forecasting σ 2 The measurements of all DPVs at t : The most trustworthy aggregate demand measurement at time step t based on The time series of aggregate demand −−−→

Term Definition
The most trustworthy aggregate demand measurement at time step t − − → D(t) The time series of aggregate demand (based on − −−− → MV(t)) including D(t) The k-step-ahead forecasted conditional mean aggregate demand based on − − → D(t) The k-step-ahead forecasted value of the conditional variance of D(t + k)

Ensemble mean
The ensemble of CNNs forecasting D(t + k)

Ensemble var
The ensemble of CNNs forecasting V(t + k) gmv i (t) Measurement generated for the ith DPV at time step t by DG contains the measurements generated by DG for all sensors at t: The most trustworthy aggregate demand measurement at time step t based on The time series of aggregate demand (based on The k-step-ahead forecasted conditional mean aggregate demand based on The position of the qth particle at time step t and iteration I.
The most trustworthy aggregate demand measurement at time step t based on The time series of aggregate demand (based on −−−−−→ pos q (t, I)) including D q (t) The k-step-ahead forecasted conditional mean aggregate demand based on −−−−−→ Vel q (t, I) −−−−−→ Vel q (t, I) is the velocity of the qth particle. If all DPVs were equally trustworthy, a simple average would be appropriate to aggregate the measurements of all DPVs. However, since the measurements of all DPVs are not always equally trustworthy due to adversarial attacks and mechanical issues, a weighted average is more appropriate than a simple average for the derivation of aggregate demand (D(t)). TMS first determines the trustworthiness weights of the measurements based on the data similarity, provenance sim-ilarity, past measurements, and the predicted trustworthiness of the corresponding DPVs at a given time step using the ensembles of Convolutional Neural Networks (CNNs) [39] and Generalized Regression Neural Networks (GRNNs) [40], and then performs the weighted average. TMS sends the aggregate demand time series − − → D(t) to DFS. Output 2: At the end of each time step, TMS assesses the trust scores [trs 1 (t),...,trs N (t)] for all DPVs based on the distances between their measurements and the most trustworthy ag-  is the summation of the forecasts of its all members. The overall forecast (V(t + k)) of Ensemble var is the summation of the absolute values of the forecasts of all its member CNNs. The point forecast (i.e., the most expected demand) of the DFS module at t+k is D(t + k). The interval forecast of DFS at t+k: The lower bound of the forecast distribution: The upper bound of the forecast distribution:

DG:
The flowchart for DG is provided in Figure 4. The DG learns the joint probability distribution of the individual time series of each DPV and generates an infinite stream to EHS to facilitate the generation of attack signatures.

EHS:
The flowchart for EHS is given in Figure 5. The task of EHS is to generate all possible attack signatures for the training of SIDS. To do so, the EHS uses the Particle Swarm Optimization (PSO) algorithm [42] with a game theoretic approach [43]. It plays repeated games with the other modules of PRDFS. In these games, EHS plays as an attacker while TMS and SIDS are teamed up to play as a defender of DFS. The attacker and defender play under the following rules: (1) in each game, the attacker has a fixed number of malicious nodes at the beginning. (2) If the defender detects a malicious node, it becomes a normal node. From the next step onwards, the attacker cannot modify its measurements.  To generate optimal attack signatures, the PSO (employed by EHS) follows the following steps.
Step 1. Initialize the positions and velocities of ten particles. The position −−−−−→ pos q (t,I) = [gmv 1 (t). . . ,gmv N (t)] of each particle (q) at time step t and iteration I is a vector containing the measurements of all DPVs at time step t, wherein the measurements of normal DPVs are the same as the generated measurements by the DG and the measurements of the malicious DPVs are randomly initialized. The velocity − −−−− → Vel q (t,I) = [Vel 1 (t),. . . ,Vel N (t)] of each particle (q) at time step t and iteration I is a vector that has the same length as the position vector. Each element of the velocity vector contains the direction and speed at which the measurement of the corresponding DPV is updated from the one iteration to the next during the PSO search. In each particle, the velocities of normal DPVs are set to zero and regarded as constant. In addition, each particle has a state vector −−−−→ ST q (t,I) containing the state of each DPV. Initially, the state vectors of all particles are the same.
Step 2. The fitness of each particle's position fitness −−−−−→ pos q (t,I) is determined using the following method: to TMS. TMS estimates the most trustworthy aggregated measurement of demand D g (t) and sends it to DFS to forecast the k-step ahead demand TMS estimates the most trustworthy aggregated measurement of demand D q (t) and sends it to DFS to forecast the k-step ahead demand Else if D q (t + k)< D g (t + k) when the objective is shifting the demand upward or D q (t + k)> D g (t + k) when the objective is shifting the demand downward, set fitness −−−−−→ pos q (t,I) to zero. Else estimate the fitness using the following formula: Step 3. Find the personal best position of each particle at the current Ith iteration from the positions it has visited before and set the personal best position −−−→ pbest q of the qth particle to it. If it is the first iteration, set the −−−→ pbest q to its current position. We denote the fitness of −−−→ pbest q by fitness −−−→ pbest q . Find the global best position of all particles −−→ gbest from the positions they have visited before. We denote the fitness of −−→ gbest by Step 4. Update the velocity of each malicious DPV in qth particle using the following equation: where rand(1,1) is a random number between (0,1). C 1 and C 2 are learning factors. We set the learning factor to 2 (C 1 = C 2 = 2).
Step 5. Update each particle's position using the following equation: Repeat steps 2-5 until the stopping criterion is satisfied.
Step 6. −−→ gbest contains the optimal measurements for the malicious sensors.
is an optimal attack signature, wherein − → ST is a vector containing the state of each DPV in −−→ gbest. EHS generates attack signatures through repeated games and keeps updating the attack signature database by adding new signatures.

SIDS:
The flowchart for SIDS is provided in Figure 6. The task of SIDS is to classify DPVs into one of the two groups: malicious (1) and normal (0). It uses an ensemble of GRNNs to perform the classification task. Each member GRNN is trained to classify one DPV as either normal or malicious. Target outputs vary between member GRNNs. During training, each GRNN's target outputs are the states (malicious state or normal state) of the corresponding DPV at a given time step. However, the input vectors of all member GRNNs are the same. During training, the raw input is −−→ gbest. During testing, the raw input is the vector of measurements transmitted by the DPVs. SIDS sends the raw input to TMS. TMS updates TS and SE accordingly. SIDS estimates the time series of the skewness coefficients of residuals at each time step by moving a sliding window over the residual series in SE. This yields the matrix SK containing the skewness series of each DPV. The input vector contains 100 consecutive most recent lagged variables from TS and SK. It is worth noting that when malicious DPVs are detected, they get converted to normal DPVs at the very time step, so that from the next time on, they provide the actual measurements. However, there are no ways to recover the actual measurements of the recovered DPVs at time step t when they are first detected as having compromised. Hence, SIDS cannot improve the performance of estimating the most trustworthy aggregate demand (D(t)) at time step t. By detecting and recovering the malicious DPVs at time step t, SIDS ensures the better performance of TMS in estimating the most trustworthy aggregate demand at the later time steps. In PRDFS, TMS does not exclude the measurements obtained from the malicious DPVs while estimating D(t). This is because there is a chance of the detected malicious DPVs being in benign modes at time step t, that means they have reported the actual measurements. Even if the detected malicious DPVs are in malicious modes at time step t, the distance between the malicious measurements may not be far away from the actual measurements if the attack is at the initial stage. Hence, to avoid losing information from malicious DPVs by discarding their readings, TMS adjusts the weights of the measurements to control their influence on the estimation of D(t). Thus, by

Research Methodology
We compare PRDFS with a benchmark resilient demand forecasting system. No existing resilient CDFS have all the features that have been demonstrated to be effective for resilient demand forecasting systems. Hence, we implement a benchmark resilient CDFS by combining the promising solutions proposed in previous studies that complement each other and strengthen the performance of the resilient demand forecasting system. We call this benchmark algorithm BRDFS. The remaining section is organized as follows: the evaluation methodology for PRDFS and BRDFS is discussed in Section 4.1. The performance metrics against which the performance of PRDFS and BRDFS is measured are discussed in Section 4.2. A brief description of BRDFS is presented in Section 4.3. PRDFS and BRDFS are evaluated on synthetic data in the presence of on-off colluding FDIA. The details of the adversary model that generated stealthy FDIA attacks are presented in Section 4.4. The synthetic data generation process is described in Section 4.5.

Evaluation Methodology for PRDFS and BRDFS
To compare PRDFS and BRDFS, we simulate two CPS. For simplicity purposes, each CPS consists of two layers of nodes. The first layer contains 14 sensors that monitor demand continuously. The second layer contains a base station. All sensor nodes are directly connected to the base station. They transmit their measurements to the base station at regular intervals. PRDFS resides in the base station of one CPS, and BRDFS resides in the base station of the other CPS. PRDFS and BRDFS are implemented in MATLAB and their performance is tested on High Performance Computing (HPC). PRDFS and BRDFS both play multiple games with the adversary model under different percentages of malicious nodes. The important rules of the games are stated below: (1) at the beginning of each game, the attacker compromises a fixed number of malicious nodes. (2) If the defender detects a malicious node, it becomes a normal node. From the next step onwards, the attacker cannot modify its measurements. (3) An attacker cannot compromise a new or recovered sensor node. (4) The duration of each game is 1000-time steps. (5) The adversary model has partial knowledge. At the start of the game, the adversary model gets the most recent 500 measurements of all malicious sensors, whereas PRDF and BRDFS get the most recent 1000 measurements of all sensors. (6) At each time step, the data generator (described in Section 4.5) generates the measurements of sensors at that time step. The adversary model modifies the measurements of malicious sensors. All measurements are then transmitted to the base station. PRDFS and BRDFS estimate the most trustworthy measurement based on the noisy sensor measurements, append the most trustworthy measurement to the aggregate demand time series, retrain the forecasting models on the updated aggregate demand time series, and forecast one-step ahead of demand using the updated forecasting model, and classify each sensor node into one of the two types: malicious and normal. The adversary model also adapts itself based on the new measurements of malicious sensors.

Performance Measures
PRDFS and BRDFS are compared in terms of two measures. The first performance measure is the accuracy of detecting malicious nodes. The second performance measure is the resilience of the demand forecasting system in the face of ongoing attacks. These  Table 2). We use accuracy rate and false positive rate (FPR) as the performance criteria metric based on the above performance metric shown in Table 2. The accuracy rate and FPR are measured using the following formulae. The resilience of PRDFS and BRDFS are measured using the following metric for resilience [44]: where P n and P present the out-of-sample one step ahead demand forecasting accuracy with and without attacks, respectively. The predictive accuracy is measured in both point estimates and confidence intervals. In the single point estimate, the predicted output is the most expected value of demand. We use the mean arctangent absolute percentage error (MAAPE) to measure the accuracy of single point forecasts [45]. In the MAAPE approach, the accuracy is defined as follows: where A t and F t denote the actual and forecasted values at data point t, respectively. N is the number of data points.
In the interval estimate, the predicted outputs are the upper and lower bounds of forecasts. In this approach, if the actual demand lies between the forecasted upper and lower bounds, the forecast is correct. Interval estimates are made for the 99.7% confidence level. The 99.7% limits lie three standard deviations below and three above the mean.

Description of the Benchmark Algorithm (BRDFS)
BRDFS consists of three modules: a demand forecasting system (DFS), a trust management system (TMS), and a malicious node detection system. The details of these systems are given below.

Demand Forecasting System (DFS) of BRDFS
Most of the existing resilient demand forecasting systems use a single Long-Term Short-Term Memory (LSTM) for forecasting the point estimate of the demand [9,17]. For comparison purposes, the DFS of BRDFS consists of two LSTMs for forecasting the interval estimate of the demand. LSTM1 is trained to forecast the conditional mean of demand and LSTM2 is trained to forecast the conditional variance of demand. LSTM1 is trained on the original aggregate demand time series and LSTM2 is trained on the squared residuals of LSTM1. Most of the existing resilient demand forecasting systems do not perform differencing or wavelet decomposition on the demand time series to deal with non-stationarity. Hence, BRDFS does not apply these measures as well.

Trust Management System (TMS) of BRDFS
BRDFS uses a popular TMS proposed in [31] to estimate the most trustworthy measurement at time step t based on the measurements from all available sensors. This TMS is implemented as follows. Normalize all measurements at time step t using the min-max normalization. Find the Euclidean distance between each pair of measurements. Estimate the similarity DS i,j between the measurements mv i (t) and mv j (t) of sensors i and j using the following equation: DS i,j = 1 − Euclidean distance between mv i (t) and mv j (t) (12) Estimate the provenance similarity PS i,j between mv i (t) and mv j (t) which is the ratio of the total number of common nodes in both provenances to the total number of nodes in the provenance with the longer length. Adjust DS i,j to PS i,j using the following pseudocode: if DS i,j ≥ 0.6 then DS i,j = DS i,j + 1 − PS i,j

else DS i,j = DS i,j − PS i,j end
Since in our simulation, we assume that all sensor nodes are directly linked to the based station, PS i,j is always zero. Find the average similarity DS j of each measurement mv j (t) to all other measurements.

DS j =
∑ n−1 i = 1 DS i,j n−1 (13) where n represents the total number of measurements. Find the measurement mv j (t) with the highest DS j as the mean represents the total number of measurements (M) measurement (i.e., the most trustworthy measurement).
where s n (t−1) is current reputation score of the sensor node made the measurement. c d is a constant to be specified by the user. We set c d to 0.5.
Set s d (t) as the intermediate reputation score ŝ n (t) of the corresponding sensor.
Estimate the final reputation score S n (t) of the sensor node at the end of time step t using the following equation and we set the value of c d to 0.5.

Malicious Node Detection System of BRDFS
BRDFS uses a skewness detector proposed in [37] and a correlation detector CAD proposed in [38] to detect the compromised nodes. If a node is found compromised by the skewness detector and/or CAD, the node is considered as compromised. The details of the above anomaly detectors are given below.

Skewness Detector
This skewness detector is proposed in [37]. The idea is, if a sensor is involved in a collusion attack, the distribution of forecasting residuals of the measurements from the sensor (that usually tend to be normal) should be skewed to the left or right.
The training phase of the skewness detector includes the following steps.
Step 1: Fit a forecasting model to the time series of the sensor's readings.
Step 2: Find the residuals of the forecasting model. Step 3: Estimate the rolling skewness coefficients of the residual series using the rolling window of N time steps. We set N to 30. This gives a series of skewness coefficients.
Step 4: Estimate the unconditional mean (m) and standard deviation (s) of the skewness series. Then compute the acceptable range (−δ = m − 3s, and + δ = m + 3s) for the skewness of the forecasting error distribution.
During the stealthy attack distribution, the following steps are followed.
Step 1: Take the new sequence (r 0 ) of forecasting residuals over the last N time steps.
Step 2: Generate a sequence (r rand ) of normal random number with µ and σ 2 estimated in the training phase. The length of r rand is 0.05N.
Step 3: Replace a small proportion of entries in r 0 with r rand and generate a new sequence r test for testing. Step 4: Compute the skewness coefficient (SC) of r test . If SC falls outside the acceptable range (−δ,+δ), raise the alarm that the sensor is participating in the collusion attack.

Correlation Detector 'CAD'
The correlation detector, called CAD, is proposed in [38] to detect compromised nodes. CAD includes two sub-schemes: temporal collusion detection scheme and spatial collusion detection scheme. If both sub-schemes find a node anomalous, the node is regarded as compromised.
Temporal collusion detection sub-scheme performs the following steps to detect the anomalous nodes.
Step 1: Estimate the correlation between each node's measurements at time step t and at time step t + 1.
Step 2: Estimate a predefined confidence interval for temporal correlations based on the bootstrap distribution. If at a given time window, the temporal correlation falls outside the confidence interval, the node is considered a malicious node.
The spatial collusion detection sub-scheme performs the following steps.
Step 1: Compute the Pearson's correlation coefficient between the measurements of each pair of sensors over time window t (a time window consists of multiple time steps). This gives a correlation matrix which is called the correlation map. Each vector of the matrix contains the pairwise correlation coefficients between the measurements of one particular sensor (i) and that of every other sensor over time window t-this vector is the location of sensor i in the location map at time window t.
Step 2: Estimate the global centroid which is the vector in the matrix that has the smallest distance from all vectors in the matrix.
Step 3: Perform k-means clustering to partition the vectors into k clusters.
Step 4: If a vector's distance from the global centroid is greater than 0.5, and if the vector's distance from the centroid of its closest cluster is greater than 0.3, the sensor represented by the vector is classified as compromised.

Adversary Model
Let there be N total number of sensors in a complete sensor network. The attacker can only compromise k out of N sensors.
Step 1, randomly choose k sensors to be compromised: generate a random number for each sensor. Sort the sensors based on the corresponding random numbers in ascending order. Select the first k sensors for compromising. Step 2, record each compromised sensor's untampered readings over the next 500-time steps. Build and train a pair of convolutional neural networks (CNN) for each compromised sensor. The first CNN predicts the conditional mean of the individual time series at time step t-the inputs of this network are the past values of the time series, and the desired outputs are one-step-ahead values of the time series. The second network predicts the conditional variance of the conditional mean at time step t-the inputs of this network are the past squared residuals of the first CNN, and the desired outputs are one-step-ahead squared residuals from the first CNN.
Step 3, to estimate the modified readings for the compromised sensors, construct a 95% confidence interval for the reading of each compromised sensor (l) at time step t based on the corresponding conditional mean m l (t) and conditional variance v 2 Step 4, Generate k random numbers in the range [0,1], one for each compromised sensor. If the random number is less than 0.5, the corresponding compromised sensor (l) uses the following formula to calculate its tampered measurement (mv l ) at time step t: mv l =m l (t) +θ l (t) × r (16) where r is the normal random number with mean 0 and standard deviation 1. Otherwise, it reports the upper or lower bound of the 99.7% confidence interval as the reading of the compromised sensor at time step t to shift the trend upward or downwards, respectively.
Step 5, if the defender fails to identify the lth compromised sensor at time step t, retrain the corresponding CNN pair with the reported measurement at time step t.
Step 6, repeat steps 3-5 to estimate the readings for the compromised sensors at each time step.

Data Generator
We build a data generator that produces data for an infinite time, so that we can evaluate the performance of the proposed forecasting system. In non-stationary time series, various local frequencies exist at different resolution levels. We use wavelets to generate synthetic non-stationary time series. The mother wavelet and the resolution levels (n) along with the values of approximation coefficients at the highest level and detail coefficients at each level must be specified to generate a time series. We complete the following steps to generate a time series for each sensor.
Step 1: Choose a wavelet and a set of scales (or level of resolution) (say 1 to n) where the most valuable frequencies will be assigned. We use Daubechies mother wavelet of order 2 (db2). The number of resolution levels (n) is chosen to be three. Step 2: We define a separate sinusoidal function for the generation of the approximation coefficients at level 3 and the detail coefficients at each resolution level. The higher the resolution level, the smaller the number of coefficients, due to subsampling. To generate a signal of length 10,000 with 'db2' as mother wavelet and three levels of resolution, we need to specify 1252 level-3 approximation coefficients, 1252 level-3 detail coefficients, 2502 level-2 detail coefficients, and 5001 level-1 detail coefficients. sin(B(t + C)) + D + σr (17) where w j,t is the (approximation or detail) coefficient at scale j and time point t. A = amplitude; C = phase shift; D = vertical shift; B = 2πf; t=time step; σ = standard deviation; r = Gaussian noise. The values of function parameters (A,f,C,D,σ) are arbitrarily assigned.
Step 3: Reconstruct the signal based on the selected mother wavelet and resolution level using the MATLAB function 'waverec'. We perform augmented Dickey-Fuller tests on the time series and its first and second order derivatives to check whether the time series is stationary or not. The test confirms that the time series and its derivatives are non-stationary.
Step 4: At each time step, estimate the mean (µ) of all measurements of demand obtained from 14 sensors to get the aggregate demand time series.

Results and Discussion
The performances of PRDFS and BRDFS were comparatively tested on 11 games with an adversary model. Each game was conducted with different percentages of malicious nodes. The duration of each game was 1000-time steps. The simulation results provide interesting insight into the comparative efficacy of PRDFS and BRDFS.

The Comparative Performance of PRDFS and BRDFS in Terms of Demand Forecasting Resilience to Stealth Attacks Major Findings
• PRDFS exhibits an almost perfect mean resilience value of~0.99 over the time frame of the game (i.e., 1000-time steps) regardless of the percentage of the nodes compromised by the adversary (Figure 7). This is mainly because it detects and recovers all compromised nodes at the very early stages of games (within first 30-time steps) ( Figure 10).

•
During the first 40-time steps of games while the attack is still occurring or just ended, the mean demand forecasting resilience of PRDFS slightly decreases but it never goes below 0.8 (its resilience ranges from 0.82 to 1), regardless of the percentage of the malicious nodes ( Figure 7). In contrast, the mean demand forecasting resilience sharply decreases in BRDFS over the simulation period (i.e., 1000 time-steps) as the percentage of malicious nodes increases (Figure 7).

•
In PRDFS, the interval demand forecast accuracy demonstrates greater resilience than the point demand forecast accuracy, while the opposite trend is observed in BRDFS (Figure 7). In BRDFS, the interval demand forecast accuracy has weaker resilience compared to the point demand forecast accuracy and the trend becomes more pronounced as the percentage of malicious nodes increases. This is mainly because BRDFS fails to detect most malicious nodes ( Figure 11). All colluding malicious nodes report similar measurements and thereby artificially reduce the uncertainty around the mean estimate, which results in low interval demand forecast accuracy and yields a very high FPR in detecting malicious nodes. We attribute the high resilience of PRDFS in the face of an attack to the combined effect of the strong performances from its modules: SIDS, TMS, and DFS. Sections 5.2-5.4 discuss the comparative performances of the individual subsystems.

Major Findings:
• PRDFS maintains a high malicious node detection accuracy of around 95% (ranging from 97% to 89%) while still maintaining a low FPR, ranging from 3-9%, with the increasing percentage of malicious nodes (Figures 8 and 9). In contrast, the accuracy of BRDFS sharply falls with the rising FPR as the percentage of malicious nodes increases (Figures 8 and 9). • PRDFS successfully detects all malicious nodes within the first 30-time steps regardless of the percentage of the malicious nodes, whereas BRDFS fails to detect any malicious node after 15-time steps (Figure 10). It appears that BRDFS has a shorter time window for detecting malicious nodes than PRDFS.
• BRDFS manages to correctly detect malicious nodes when the percentage of malicious nodes is low (Figures 10 and 11). BRDFS fails to detect any malicious node when the percentage of malicious nodes is 30% or over ( Figure 11).   Figure 9 shows the false positive rates (FPR) for PRDFS and BRDFS when detecting malicious nodes. The X-axis represents the percentages of sensor nodes that were malicious in the simulation trial. The Y-axis represents the false positive rates (FPR). The blue line shows the FPR of PRDFS under different percentages of malicious nodes whereas the orange line shows the FPR of BRDFS under different percentages of malicious nodes. The FPR in PRDFS never exceeds 9%. The FPR of BRDFS rises sharply with the percentage of malicious nodes. The FPR of BRDFS varies from less than 10% to around 75%.

Further Observations
We attribute the above observations to the following characteristics of PRDFS and BRDFS: • PRDFS uses a supervised intrusion detection system (SIDS) for detecting malicious nodes. SIDS learns from the input and output data sets that are contamination-free since they are generated through the game theoretic approach. It employs a supervised learning algorithm (GRNN) to distinguish between malicious and normal nodes. It can detect malicious nodes even when all nodes are malicious because it compares the present behaviors of the nodes in respect to dynamic spatiotemporal patterns of trustworthiness and the skewness of residual distribution with previous correct behaviors.
• BRDFS uses an unsupervised anomaly-based intrusion detection system (AIDS) for detecting malicious nodes.
AIDS learn from unlabeled real data that are contaminated with many undetected malicious nodes. Hence, its learning is not very effective. AIDS learn under the assumption that statistically rare or atypical behaviors are abnormal and are possible signs of maliciousness, whereas typical or common behaviors are normal. Hence, AIDS can only detect malicious nodes when the percentage of malicious nodes is low, and the malicious nodes are operating for a short period of time where it is not possible to inject a significant number of malicious instances in the data set. However, stealthy malicious nodes are highly difficult to detect at very early stage. The skewness detector and the temporal collusion detection scheme of CAD uses confidence interval-based anomaly detection. If a node's behavioral traits fall outside the confidence interval, the node is detected as malicious. However, such a strategy is ineffective against collusion attacks. In collusion attacks, the adversary takes control of multiple sensors and changes the readings of these sensors to the desired upward or downward direction in such a way that the malicious nodes always lie within the confidence intervals of correlation coefficients and skewness coefficients. One of the major challenges in time series forecasting is that the distribution of the time series changes over time. Hence, the forecasting model needs to be adapted to the most recent data-the adversaries take advantage of this situation. Returning the upper or lower extreme of the acceptable range as the readings of the compromised sensors, they can gradually move the forecasting model in the wrong direction without getting caught. Since the forecasting model is evolving towards the false data, over time, the non-compromised sensors are marked as compromised one after another and will lose all influence over the model's evolution, thereby accelerating the compromise rate of the entire forecasting system. The spatial collusion detector of the CAD in BRDFS uses unsupervised clustering to separate malicious nodes. These clustering algorithms are based on arbitrary rules. Thus, it is natural that its accuracy is not as good as PRDFS.

The Comparative Performance of PRDFS and BRDFS in Estimating the Most Trustworthy
Measurement of the Current Aggregate Demand − − → D(t)

Major Findings
• The demand forecast resilience is directly correlated to the accuracy in estimating current aggregate demand − − → D(t) that acts as the input signal while forecasting the demand time series (Figures 7 and 12).

•
Since PRDFS detects and recovers all compromised nodes at the very early stages of games (within first 30-time steps) (Figure 10), the mean accuracy of estimating − − → D(t) in PRDFS is almost constant to around 98% over the time frame of the game (i.e., 1000-time steps) regardless of the percentage of the nodes compromised by the adversary (Figure 12).

•
During the first 40-time steps of games, while the attack is still occurring or just ended, the mean − − → D(t) estimation accuracy in PRDFS decreases from around 98% to 83% as the percentage of malicious nodes increases. In contrast, the mean − − → D(t) estimation accuracy in BRDFS falls sharply over the simulation period (i.e., 1000 time-steps) as the percentage of malicious nodes increases.  Figure 12 depicts the prediction accuracy of current aggregate demands in PDRFS and BRDFS under the varying percentage of malicious sensor nodes. The accuracy is estimated based on the difference between the most trustworthy measurement of the current aggregate demand predicted by the TMS and the actual current aggregate demand (which is the simple average of the actual current measurements of the sensors). The X-axis represents the percentages of sensor nodes that were malicious in the simulation trial. The Y-axis represents the accuracy. The accuracy is estimated based on MAAPE. The green and blue lines plot the mean accuracy of PRDFS in respect to the mean current aggregate demand over 1000-time steps and over the first 40-time steps, respectively, under different percentages of malicious nodes. The red line plots the mean accuracy of BRDFS in terms of the mean current aggregate demand over 1000-time steps under different percentages of malicious nodes. The green line is almost flat to X-axis, indicating that the mean accuracy of mean current aggregate demand over 1000-time steps predicted by (the TMS of) PRDFS is not sensitive to the percentage of malicious nodes in the system. The blue line drops gradually as the percentage of almost flat to X-axis, indicating that the mean accuracy of mean current aggregate demand over 1000-time steps predicted by (the TMS of) PRDFS is not sensitive to the percentage of malicious nodes in the system. The blue line drops gradually as the percentage of malicious nodes increases, indicating that the mean accuracy of mean current aggregate demand over the first 40-time steps predicted by PRDFS is mildly sensitive to the percentage of malicious nodes in the system. The red line falls at an accelerated rate indicating that the mean accuracy of mean current aggregate demand over 1000-time steps predicted by BRDFS is highly sensitive to the percentage of malicious nodes in the system.

Further Observations
The following strategy differences between the TMS of PRDFS and that of BRDFS may explain their striking differences in − − → D(t) estimation accuracy in the face of an attack: • The TMS of PRDFS considers the following three factors in determining the trustworthiness of a sensor node at a time distant t: (1) the trustworthiness of the node, (2) the similarity of the sensor's reading with the other sensors' readings at the same time step, and (3) the similarity of the sensor's measurement at time step t with its past measurements, but in contrast, the TMS of BRDFS uses only the first two factors.

•
The TMS of BRDFS updates the reputation score of a node at the end of a time step t after determining the most trustworthy measurement at that time step. In the presence of on-off attacks, the normal behavior of a node at the last few time steps does not guarantee normal behavior from the node at the next time step. For this reason, the TMS of PRDFS predicts the trust score of a node at the beginning of a time step t (before determining the most trustworthy measurement at that time step) using GRNN, which is a powerful pattern recognition algorithm capable of learning local patterns.

•
In BRDFS, the reputation score of a node at time step t is a linear function of its long-term reputation and its intermediate reputation score computed at time step t. In contrast, the TMS of PRDFS uses a powerful fuzzy clustering-based non-linear algorithm (GRNN) to predict the trust score of a node at time step t based on its past trustworthiness scores.

•
The TMS of BRDFS sets the reputation score of a node at time step t to the estimated expected (i.e., mean) reputation score of the node at time step t. In stealthy on-off attacks, where at each time step the malicious nodes randomly choose whether to act maliciously or not, the variance of their reputation is an important indicator of their conditions. Hence, the point estimate of the expected reputation score is not very reliable. To overcome this problem, the TMS of PRDFS sets the trust score of a node to the lower bound of the predicted distribution of the plausible trust scores of the node at time step t • PRDFS outperforms BRDFS in terms of demand forecasting accuracy by 15% or more ( Figure 13). Figure 13. Comparison of demand forecasting performance between PRDFS and BRDFS when no attack is present. Figure 13 shows the demand forecasting accuracy of PRDFS and BRDFS when no attack takes place. The Y-axis of the figure represents the accuracy. The first two bars from the left are for PRDFS and the second two bars are for BRDFS. Blue bars represent the point forecasting accuracy, whereas the orange bars represent the interval forecasting accuracy. The figure shows that PRDFS outperforms BRDFS in both point and interval forecasting accuracy.

Possible Explanations
We attribute the finding to the robustness of PRDFS against non-stationarity and high dimensionality. To deal with the non-stationarity in the demand time series, the DFS of PRDFS applies wavelet multiresolution analysis on the raw demand time series and decomposes the original input time series into multiple component time series with a simpler structure. Each component time series contains the information content of the original demand time series at a particular level or scale. Our study suggests that the wavelet analysis eliminates the requirement of selecting optimal seasonal lagged predictor variables (i.e., non-consecutive lagged variables) from the component time series for the demand forecasting model, which are highly difficult to identify. In the process, it also reduces the dimensionality of the feature space.

•
The curse of dimensionality hits particularly hard on time series forecasting models of high or even moderate dimensional input data, since they use lagged variables as predictor variables and the lagged variables are highly correlated with each other. Hence, to reduce the dimensionality further, the DFS of PRDFS uses a hierarchical ensemble model where each member CNN is trained with only a small number of non-seasonal lagged variables. Ensemble members are trained sequentially. The first ensemble member is trained on the original demand time series. Each of the remaining ensemble members is trained on the residual series of the previous trained ensemble member. Additionally, the DFS of PRDFS uses CNNs as ensemble members. Our study suggests that CNN is more robust to the high dimensionality compared to other existing learning algorithms.

•
The DFS of BRDFS does not have any such effective strategy to deal with nonstationarity and high dimensionality. It just uses a single LSTM regression network to forecast the demand.

Conclusions
In this paper, we propose a novel resilient cyber-physical demand forecasting system (CDFS) called PRDFS that is resilient to false data injection attacks (FDIAs) since it can distinguish a natural change in the dynamics of demand over time, and a change due to the adversary's attack. To achieve this capability, it uses a signature-based intrusion detection system (SIDS) of malicious nodes and normal nodes. SIDS is trained on labeled training data set using a supervised learning algorithm (GRNN). To automatically generate the training dataset for SIDS with all possible attack paths (defined in the spaces of assessed node trustworthiness and the skewness of residuals of forecasting models for individual sensor time series), an ethical hacking system (EHS) is run. EHS plays repeated attackerdefender games with the defense system (SIDS and TMS) of PRDFS. The training data of SIDS are continuously automatically generated by these games. This is the main novelty of PRDFS.
PRDFS uses a trust management system (TMS) to determine the most trustworthy measurement at time step t from noisy sensor measurements for the aggregate demand time series. It considers the following information while determining the most trustworthy measurement at time step t: spatial data similarity, temporal data similarity, and the minimum expected trustworthiness of the corresponding sensor node at time step t. TMS employs GRNNs to predict the minimum expected trust score of each sensor based on the past trustworthiness scores of its measurements. The demand forecasting system (DFS) of PRDFS uses the lagged values of aggregate demand time series to make out-of-sample forecasts. Our study suggests that among all existing approaches for dealing with the non-stationarity problem, the wavelet multiresolution decomposition approach is the most effective. Our DFS decomposes the time series into multiple component time series using wavelet transform. To avoid the curse of dimensionality, the proposed DFS uses a hierarchical ensemble of convolutional neural networks (CNNs) where each ensemble member is trained with a relatively small number of non-seasonal lag values of the component time series. Each ensemble member is trained sequentially. The first ensemble member is trained on the component time series derived from the original demand time series. Each of the remaining members is trained on the component time series derived from the residual series of the previous ensemble member. The number of ensemble members depends on how many members are required to make the final residual series of the ensemble a white noise series. This is another novelty of the proposed PRDFS.
For comparison purposes, we implement a hybrid resilient demand forecasting system that incorporates the best modules of existing demand forecasting systems. We call this resilient demand forecasting system BRDFS. We compare the performance of PRDFS and BRDFS against stealthy FDIAs. The data streams of all sensors are generated synthetically. Both were tested in 11 simulation setups with different percentages of malicious nodes. The performance of our algorithm is highly encouraging. It detects all malicious nodes within the first 30-time steps and it has low false positive rates (FPRs), never exceeding 9% regardless of the percentage of malicious nodes. Its demand forecasting accuracy never drops below 80% in the face of attacks.

Conflicts of Interest:
The authors declare no conflict of interest.