## 3. Proposed Method

The proposed solution uses a deep Neural Network (NN) architecture called an autoencoder. Autoencoders are a type of unsupervised learning algorithm in which the neural network learns to reconstruct its input after compressing the data, i.e., reducing its dimensionality. In practice, the autoencoder generates a reduced representation of the data and then tries to reconstruct, from that reduced set, a representation that is as close as possible to the input. The difference between the two is the reconstruction error. Autoencoders have many applications in the field of image processing, but their scope can be much wider. In the mathematical theory of artificial neural networks, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of ${R}^{n}$ [22]. The idea used in this paper is that, after training the autoencoder with physical measures, it will be able to learn the correlations among the measures. Our approach follows the reconstruction-based novelty detection paradigm [3], in which a model is trained to reconstruct normal data with low error. If the input is abnormal, the reconstruction error will be higher. In this way, the magnitude of the error, together with a properly chosen threshold, is used to classify new data.

The formalized process is the following. We collect all the available measures (also called features) at a discrete sampling time so as to build a vector that we call the state vector (1). The aim is to detect an anomaly by static analysis of this vector alone. The entire dataset, composed of the state vectors collected over a period of time, which will be used to train the algorithm, is called ${X}^{TR}$, while ${X}^{TEST}$ is another collection of state vectors used in the test phase. Therefore, for a given dataset, each row contains the measures collected at the same time and each column contains the measures of the same type over time. See Equation (2), where X is either ${X}^{TR}$ or ${X}^{TEST}$. Sampling times go from ${t}_{1}$ to ${t}_{T}$ in this example case.
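Purely as an illustrative sketch (not the authors' code), the matrix in Equation (2) can be built by stacking the sampled state vectors row by row; the measure values below are hypothetical:

```python
import numpy as np

def build_dataset(samples):
    """Stack state vectors x(t_1)..x(t_T) into a T-by-n matrix X.

    Each row holds the n measures collected at the same sampling
    instant; each column tracks one measure type over time.
    """
    return np.vstack([np.asarray(x, dtype=float) for x in samples])

# Hypothetical example: T = 4 sampling instants, n = 3 features
# (e.g., an RMS voltage, an RMS current, a frequency).
samples = [[230.1, 5.2, 49.98],
           [229.8, 5.1, 50.01],
           [230.4, 5.3, 49.99],
           [230.0, 5.2, 50.00]]
X_tr = build_dataset(samples)
```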

#### 3.1. Preprocessing

Each measure varies over time in different ranges and ways. In order to compare the measures, we must compensate for these differences in a preprocessing phase. Considering the i-th column of ${X}^{TR}$, we compute the mean ${\overline{X}}_{i}^{TR}$ as in Equation (3) and the standard deviation ${\sigma}_{i}^{TR}$ as in Equation (4), for the training dataset. Then we normalize each single measure of the training dataset as in Equation (5), thus obtaining the matrix in Equation (6).
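A minimal sketch of this preprocessing step, assuming the standard z-score form implied by Equations (3)-(6); the toy matrix is hypothetical:

```python
import numpy as np

def fit_normalizer(X_tr):
    """Column-wise mean and standard deviation of the training set
    (Equations (3) and (4))."""
    return X_tr.mean(axis=0), X_tr.std(axis=0)

def normalize(X, mu, sigma):
    """Z-score each measure (Equation (5)). The training statistics
    are reused unchanged for the test set later on."""
    return (X - mu) / sigma

# Hypothetical 3-sample, 2-feature training matrix.
X_tr = np.array([[1.0, 10.0],
                 [3.0, 30.0],
                 [5.0, 50.0]])
mu, sigma = fit_normalizer(X_tr)
X_tr_norm = normalize(X_tr, mu, sigma)
```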

#### 3.2. Training Phase

Each state vector ${x}^{TR}(t)=\{{x}_{1}^{TR}(t),{x}_{2}^{TR}(t),\cdots ,{x}_{n}^{TR}(t)\}$ is sent to the autoencoder as input and is reconstructed as output. We call ${\tilde{x}}^{TR}(t)=\{{\tilde{x}}_{1}^{TR}(t),{\tilde{x}}_{2}^{TR}(t),\cdots ,{\tilde{x}}_{n}^{TR}(t)\}$ the vector reconstructed by the autoencoder starting from the input ${x}^{TR}(t)$. We define the reconstruction error as in Equation (7).

During the training phase the autoencoder is trained to reconstruct its input so as to minimize the error in Equation (7). Details about this step are reported in the remainder of the paper.
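The paper's implementation uses Keras (see Section 4). Purely to keep this sketch self-contained, the same idea is illustrated with a tiny NumPy autoencoder: one sigmoid hidden layer, a linear output layer, and full-batch gradient descent on the mean squared reconstruction error (the role played by Equation (7)). All sizes and data here are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyAutoencoder:
    """Single-hidden-layer autoencoder: sigmoid hidden units, linear
    output, trained by full-batch gradient descent on the MSE."""

    def __init__(self, n_in, n_hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.b2 = np.zeros(n_in)
        self.lr = lr

    def forward(self, X):
        self.H = sigmoid(X @ self.W1 + self.b1)   # compressed representation
        return self.H @ self.W2 + self.b2          # reconstruction

    def errors(self, X):
        """Per-sample mean squared reconstruction error."""
        return ((self.forward(X) - X) ** 2).mean(axis=1)

    def train(self, X, epochs=300):
        n = len(X)
        for _ in range(epochs):
            X_hat = self.forward(X)
            d_out = X_hat - X                      # dLoss/d(output), up to a constant
            d_hid = (d_out @ self.W2.T) * self.H * (1.0 - self.H)
            self.W2 -= self.lr * self.H.T @ d_out / n
            self.b2 -= self.lr * d_out.mean(axis=0)
            self.W1 -= self.lr * X.T @ d_hid / n
            self.b1 -= self.lr * d_hid.mean(axis=0)

# Synthetic "normal behavior": two correlated features, z-scored.
rng = np.random.default_rng(1)
t = rng.normal(0.0, 1.0, 200)
X = np.column_stack([t, 0.8 * t + rng.normal(0.0, 0.2, 200)])
X = (X - X.mean(axis=0)) / X.std(axis=0)

ae = TinyAutoencoder(n_in=2, n_hidden=1)
err_before = ae.errors(X).mean()
ae.train(X)
err_after = ae.errors(X).mean()
```

Because the hidden layer is narrower than the input, the network is forced to exploit the correlation between the two features, which is exactly the property the detection method relies on.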

The next step is to decide a threshold in order to build a classifier to be used in the test phase. For this purpose we compute the mean ${\overline{e}}^{TR}$ and standard deviation ${\sigma}_{e}^{TR}$ of the error on the training dataset, as in Equations (8) and (9). Finally, we set the threshold as in Equation (10), where h is an empirically defined constant.

This structure for the threshold stems from the hypothesis of a normal distribution of the error. If the probability density function of the error is Gaussian, Equation (10) allows a defined percentage of the training vectors to be classified as normal. For example, if h is set to 3, $99.73\%$ of the training vectors will be classified as normal. This choice will be analyzed and discussed in the Results section.
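A sketch of the threshold rule, assuming Equation (10) has the mean-plus-h-standard-deviations form described above; the training errors below are synthetic:

```python
import numpy as np

def fit_threshold(errors_tr, h):
    """Threshold E = mean(e_TR) + h * std(e_TR), as in Equations
    (8)-(10); h is the empirically chosen constant."""
    errors_tr = np.asarray(errors_tr, dtype=float)
    return errors_tr.mean() + h * errors_tr.std()

# With (hypothetical) Gaussian-like training errors and h = 3,
# almost all training vectors fall below the threshold.
rng = np.random.default_rng(0)
errors_tr = rng.normal(0.05, 0.01, 100_000)
E = fit_threshold(errors_tr, h=3.0)
frac_normal = (errors_tr <= E).mean()
```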

#### 3.3. Test Phase

During the test phase, each test state vector at a generic instant ${t}_{k}$ is normalized by using the mean and standard deviation of the training dataset, as shown in Equation (11). Then the vector (12) is sent to the autoencoder. The reconstructed vector is given in Equation (13). The error between input and output is computed as in Equation (14). The classification is made by comparing the error with the threshold E in Equation (10), as indicated in Equation (15).
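The resulting classifier reduces to a single comparison; a minimal sketch in the spirit of Equation (15), with hypothetical threshold and error values:

```python
import numpy as np

def classify(errors_test, E):
    """Flag a test vector as anomalous (True) when its reconstruction
    error exceeds the threshold E; normal (False) otherwise."""
    return np.asarray(errors_test, dtype=float) > E

# Hypothetical reconstruction errors for three test vectors.
flags = classify([0.04, 0.30, 0.07], E=0.10)
```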

To summarize, the classifier is built in two phases. During the training phase (Figure 1) the autoencoder learns to reconstruct vectors representing the normal behavior in an optimal way by setting the parameters of the neural network. In this paper, this is done through gradient descent, an iterative learning algorithm that uses several hyperparameters to update a model. At each iteration the model should improve as the internal model parameters are updated. Two hyperparameters that will be used in the performance evaluation are the batch size and the number of epochs. The batch size controls the number of training samples (i.e., the number of rows in the matrix $(6)$) used before the model's internal parameters are updated. In practice, at the end of each batch, the input is compared with the output and the error is computed. Starting from the error, the update algorithm moves the parameters down along the error gradient. The training dataset can be divided into one or more batches, and this number is one of the parameters affecting the system performance analyzed in the Results section. The number of epochs defines the number of times that the learning algorithm will use the entire training dataset; an epoch is composed of one or more batches. The number of epochs should be large enough that the learning algorithm can run until the error has been sufficiently minimized. The number of epochs is the second hyperparameter studied in the Results. After the training phase, a threshold E is set.
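The batch/epoch mechanics described above can be sketched as follows (a generic shuffled mini-batch iterator, not the authors' code; the toy dataset is hypothetical):

```python
import numpy as np

def minibatches(X, batch_size, epochs, seed=0):
    """Yield the training set in shuffled batches. The model's
    parameters would be updated once per batch, and every epoch
    passes the entire dataset through the learner once."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            yield X[order[start:start + batch_size]]

# 5 samples, batch size 2 -> 3 parameter updates per epoch,
# so 10 epochs give 30 updates in total.
X = np.arange(10.0).reshape(5, 2)
batches = list(minibatches(X, batch_size=2, epochs=10))
```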

In the test phase (Figure 2) the state vectors are sent to the autoencoder, and the reconstruction error is used to classify the vectors.

## 4. Materials and Methods

The use case of this paper is a typical small-scale photovoltaic system connected to the grid at the distribution level. From an electrical point of view it is composed of solar panels, a DC-DC boost electronic converter controlled by the maximum power point tracking (MPPT) algorithm, a DC link and a current source inverter. The inverter also acts as a server, sending information to and receiving commands from a control unit, which can be represented by a SCADA in the case of a microgrid or by a distribution management system in the case of a distribution grid. The system is also equipped with different sensors that collect heterogeneous types of measurements. We categorize all the collectible information into five groups:

- Alternating Current (AC) side electrical information: active and reactive power, voltages (Root Mean Square, RMS), currents (RMS), frequencies, total harmonic distortion (THD);
- Direct Current (DC) side electrical information: voltages and currents;
- PV information: voltage, current, temperature of the cells;
- Environmental information: irradiance, temperature of the air;
- Electronic information: maximum power point, DC/DC converter duty cycle.

As said, we would like to detect anomalies by observing the collected measures alone. In order to validate the proposed approach, we set up a simulation environment. The physical behavior of the photovoltaic system is simulated using MATLAB/Simulink software, as shown in Figure 3.

The scheme takes into account the heat exchange between the panels and the environment (in red in the left part of Figure 3), starting from the external data of solar irradiance and air temperature. This is simulated by radiative heat transfer coming from the sun, a convective heat exchange with the environment (considered to be an ideal temperature source), and the thermal inertia of the photovoltaic panels. The legend for this part, taken directly from Simulink, is reported in Figure 4.

Then we implemented an electromagnetic electrical model, the portion between the PV Array and the Inverter in Figure 3, starting from the blocks already present in the software. The grid is modeled by a small low-voltage portion with some loads and a transformer connected to the medium-voltage distribution. We extract 22 features composing vector (1), which represent the physical parameters that are measured in many commercial solutions, especially in microgrids. The list of these features is reported in Table 1. The green boxes in Figure 3 are the points where we extract the measures listed in Table 1 from the system. Concerning the boxes that indicate points where multiple measures are extracted: "PV_Meas" refers to ${X}_{4}$ and ${X}_{5}$, i.e., ${V}_{pv}$ and ${I}_{pv}$; the "AC measures" block is the point where we extract the measures from ${X}_{9}$ to ${X}_{22}$; ${X}_{8}$ is measured within the MPPT converter. The model is run under different working conditions in order to create a large dataset.

We simulated three types of faults/cyber-attacks that have different impacts on the measures:

In the first case, we use the remote-control capabilities of the inverter: the action implies a lower active power injection into the grid compared to the power actually available given the environmental conditions. In the second case, we consider a typical fault that can happen in a solar panel, which affects the performance of the panel itself. In the third case, we modify only one feature per sample, changing its value by a percentage ranging from 25% to 50% of the original value, in order to create an unfeasible state vector. For instance, the injected power is modified while the voltage and current measures are left unchanged. A bad data injection attack can be dangerous for DERs because it can induce a wrong decision in a remote-control system.
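A hypothetical sketch of the third scenario, assuming the perturbation acts multiplicatively on the chosen feature (the exact mechanism is not specified in the text; the state vectors below are made up):

```python
import numpy as np

def inject_bad_data(X, rows, seed=0):
    """For each selected sample, scale exactly one randomly chosen
    feature by +/- 25-50% of its original value, leaving all other
    measures unchanged, so the state vector becomes physically
    inconsistent (e.g., power no longer matching voltage * current)."""
    rng = np.random.default_rng(seed)
    X_atk = X.copy()
    for r in rows:
        j = int(rng.integers(X.shape[1]))   # feature to corrupt
        delta = rng.uniform(0.25, 0.50)     # magnitude of the change
        sign = rng.choice([-1.0, 1.0])
        X_atk[r, j] = X[r, j] * (1.0 + sign * delta)
    return X_atk

# Two hypothetical state vectors; attack only the first one.
X = np.array([[230.0, 5.2, 49.99],
              [229.5, 5.1, 50.01]])
X_atk = inject_bad_data(X, rows=[0])
```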

The classification is performed offline on the test dataset. The autoencoder, implemented in Python using the Keras library [23], is a multilevel neural network (NN) architecture that uses input and output layers composed of the same number of neurons and a series of hidden layers whose dimension is strictly lower than the dimension of the input and output layers. Neurons are fully connected, which means that each single neuron acts as an input for all the neurons of the following layer. The activation function of each neuron is a sigmoid. During the training phase, all the parameters of the neurons (i.e., the weights of the connections) are set by using the gradient descent technique. The impact on the performance of the following two hyperparameters, whose meaning is explained in Section 3.2, is evaluated in the following:

- the batch size;
- the number of epochs.

The variation of these hyperparameters can strongly influence the model learned by the NN.

## 5. Results

#### 5.1. General Description

We extracted a training dataset composed of 7200 vectors, corresponding to about 30 days of operation under different weather conditions, and three test datasets composed of about 800 vectors each, corresponding to some hours of operation (because of a smaller sampling time with respect to the training set) in normal and abnormal conditions. Tests were made in order to evaluate the influence of different elements so as to improve the efficiency of the detection method. We focused on the choice of the threshold E in Equation (10), on the neural network architecture, and on the hyperparameters used to train the NN.

#### 5.2. Threshold E

As indicated in Equation (10), we set the threshold E as the mean error plus h times the standard deviation of the error computed on the training dataset. As explained in Section 3, the choice is theoretically fully justified if the probability density function of the error is Gaussian because, in this case, the area below the curve between $({\overline{e}}^{TR}-{\sigma}_{e}^{TR})$ and $({\overline{e}}^{TR}+{\sigma}_{e}^{TR})$ is always equal to 0.6827, between $({\overline{e}}^{TR}-2{\sigma}_{e}^{TR})$ and $({\overline{e}}^{TR}+2{\sigma}_{e}^{TR})$ equal to 0.9545, between $({\overline{e}}^{TR}-3{\sigma}_{e}^{TR})$ and $({\overline{e}}^{TR}+3{\sigma}_{e}^{TR})$ equal to 0.9973, and so on, allowing h to control the probability that the random error variable ${e}^{TR}$ falls below $({\overline{e}}^{TR}+h{\sigma}_{e}^{TR})$. Supposing that the distribution of ${e}^{TEST}$ is the same as that of ${e}^{TR}$, h would allow a defined percentage of vectors to be classified as normal. Fixing the structure of the autoencoder with a single hidden layer of dimension 15, a low number of epochs (10) and a high batch size (256), Figure 5 reports the histogram of the normalized errors, the curve that approximates the related probability density function, and the corresponding normal curve with the same mean and standard deviation (red line). The shapes remain qualitatively similar when changing the hidden layer dimension, the number of epochs and the batch size.

We can notice that, even if the distribution differs from the Gaussian, the proposed structure for the choice of the threshold can still be applied, although the selection of the h value must be heuristic: h can still select a percentage of vectors as normal, even if more coarsely than in the Gaussian case. In Figure 5 two values of the threshold are reported; the choice of $h=5$ is intended to keep the number of false positives small. To analyze the effect on the performance numerically, we varied h from 3 to 7, as reported in Table 2. The motivation is that below $h=3$ and above $h=7$ the results show that the performance decreases significantly. The acronyms TN, FN, FP and TP stand, respectively, for True and False Negative, and False and True Positive. Accuracy is the proportion of correct results.
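Accuracy, the proportion of correct classifications, follows directly from the four confusion counts; the values below are made up purely for illustration:

```python
def accuracy(tn, fn, fp, tp):
    """Accuracy as the proportion of correct results among all
    classified samples: (TP + TN) / (TN + FN + FP + TP)."""
    return (tp + tn) / (tn + fn + fp + tp)

# Hypothetical confusion counts for one test dataset of 800 vectors.
acc = accuracy(tn=760, fn=10, fp=10, tp=20)  # 780 correct out of 800
```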

These preliminary results highlight important aspects: the best choice of threshold depends on the specific case, because some physical anomalies produce higher errors that are more easily separable. For example, anomalies that involve the modification of a high number of measures cause a higher reconstruction error. Therefore, setting a high value for the threshold E (through a higher h) reduces the number of false positives. On the contrary, anomalies that cause only a small number of measures to change produce a lower reconstruction error; consequently, the threshold E should be kept lower (through a lower h), even if this choice raises the number of false positives. If we want to avoid losing generality, we must fix a threshold that is a compromise. In this first phase, we set $h=5$.

#### 5.3. Autoencoder Architecture

Subsequently, fixing $h=5$, the impact of the NN architecture on the accuracy has been investigated. We tried different combinations of depth and number of neurons per hidden layer. For example, considering Table 3, “22-15-22” refers to an architecture composed of a single hidden layer whose dimension is 15, with input and output layers each composed of 22 neurons, corresponding to the number of features in Table 1, while “22-18-15-18-22” refers to an architecture composed of 3 hidden layers whose dimensions are 18, 15 and 18, respectively. The layers are fully connected, which means that each single neuron of a layer acts as an input for each neuron of the following layer.
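As an aside, this architecture notation can be parsed mechanically, and the parameter count of the corresponding fully connected network follows from the layer sizes (an illustrative helper, not the authors' code):

```python
def parse_architecture(spec):
    """Turn a string such as '22-18-15-18-22' into a list of layer
    sizes; an autoencoder's input and output layers must match."""
    sizes = [int(s) for s in spec.split("-")]
    if sizes[0] != sizes[-1]:
        raise ValueError("input and output layer sizes must match")
    return sizes

def count_parameters(sizes):
    """Trainable parameters of a fully connected network: each pair of
    adjacent layers contributes n_in * n_out weights plus n_out biases."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

sizes = parse_architecture("22-15-22")
n_params = count_parameters(sizes)  # 22*15 + 15 + 15*22 + 22 = 697
```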

The number of neurons in the most compressed layer impacts the accuracy more than the number of hidden layers. In the presented results, in some cases the depth of the NN impacts negatively on the accuracy, but we can say, in general, that the depth does not bring significant changes in the performance. A possible explanation is that the dimension of the smallest hidden layer is the element that most affects the reduced representation of the system and, consequently, the capability to learn the correlations between measures. If this dimension is too large, the autoencoder just reproduces its input; on the contrary, if the dimension is too small, the NN loses important information.

#### 5.4. Training Parameters

Finally, we investigated the gradient descent hyperparameters, focusing on batch size and epochs. After the previous phases, we fixed the threshold to $h=5$ and the NN architecture to 22-18-15-18-22. We compared the results obtained with different learning hyperparameters. The first test focuses on the batch size, fixing the number of epochs to 10. Results are reported in Table 4. The second test refers to the number of epochs. We fixed the batch size to 32 and evaluated the impact on accuracy by varying the epochs (Table 5).

The hyperparameters have a relevant impact on the accuracy for anomalies caused by Short Circuited Cells and Bad Data Injection. The more times the training samples are passed to the autoencoder during the training phase, the more accurate the model fitted by the NN. The identification of anomalies that produce lower errors benefits from a model that is better fitted around the training samples.

## 6. Discussion and Future Research Issues

The previous section proposed a performance evaluation of the proposed algorithm for detecting different anomalies. One of the main problems is to adapt the algorithm to perform well under different conditions. Setting specific parameters for each anomaly leads to significantly better results; however, if the algorithm performs well only on specific anomalies, then it defeats its purpose. The previous section allows a compromise to be reached. Concerning future research, we intend to elaborate on the error distribution so as to find alternatives for the computation of the threshold E. A deeper investigation of the NN architecture could bring better results. This paper analyzed separately the impact of the threshold, of the architecture, and of the gradient descent hyperparameters. Actually, these three elements are coupled: for example, the choice of the best threshold cannot be considered a problem totally separate from the other two elements. In order to find a truly optimal solution, we must consider all these elements at the same time. We could proceed by a brute-force strategy, trying all the possible combinations in discrete steps. Nevertheless, this would result in a huge number of combinations to evaluate. A better solution would be to implement a multi-parametric optimization during the design of the NN.

Many papers in the literature use supervised ML algorithms to detect anomalies. Supervised algorithms usually perform well in classification problems, but these methods cannot be exploited without labeled datasets. One idea would be to compare our approach with anomaly detection algorithms based on One-Class Support Vector Machines. A cyber-physical anomaly detector may be a complementary tool to traditional Intrusion Detection Systems based on network traffic analysis. Many industrial protocols send measures in plain text to a centralized control system; an element that scans the network traffic, and that can also perform deep packet inspection, could act together with the proposed scheme to detect possible anomalies.