A Multitiered Solution for Anomaly Detection in Edge Computing for Smart Meters

In systems connected to smart grids, smart meters with fast and efficient responses are very helpful in detecting anomalies in real time. However, transmitting data at a frequency of one minute or less is impractical with today's technology because of bottlenecks in the communication network and storage media. Because mitigation cannot be performed in real time, we propose prediction techniques using a Deep Neural Network (DNN), Support Vector Regression (SVR), and k-Nearest Neighbors (KNN). In addition, the prediction timestep is chosen per day and wrapped in sliding windows, and clustering with Kmeans, as well as the intersection of Kmeans and HDBSCAN, is also evaluated. The predictive task is to determine whether anomalies in electricity usage will occur in the next few weeks. The aim is to give the user time to check their usage and, on the utility side, to decide whether a sufficient supply must be prepared. We also propose a latency reduction to counter the higher latency of a traditional centralized system by adding an Edge Meter Data Management System (Edge-MDMS) layer and a Cloud-MDMS layer for model inference and training, respectively. Based on experiments on a Raspberry Pi, the best solution is the DNN, which has the shortest latency (1.25 ms) and a 159 kB persistent file size at 128 timesteps.


Introduction
Based on the predicted growth in the application of smart meters, it is estimated that, in 2024, there will be 1.2 billion smart meters in the world, with a cumulative expenditure of USD 145.8 billion [1]. Utilities prefer smart meters because of their ability to provide information more quickly and to optimize supply and demand more accurately. The smart meter in a smart grid system is also very suitable for integrating various energy sources. Traditionally, data from smart meters in the smart grid system are sent to cloud servers, which have the ability to process large amounts of data and support big data analysis. However, this model has some disadvantages, such as difficulty in handling a large number of smart meters and high latency due to great distances, network loads, and an additional layer to process the data. Therefore, data collected by smart meters are generally sent to Edge computing (EC) as an intermediate server between the smart meters and the cloud server. An EC device is a computing device that makes smart meters in the physical layer transparently connected to the computing world. Figure 1 shows a general edge/fog computing architecture that can be used in a smart grid. EC is located near the applications, sensors, or data sources, but it may be far from the cloud. This is needed in order to reduce data exchange, shorten latency and response times, improve processing speed, and reduce pressure on the network as compared to connecting through the cloud [2]. Data from smart meters

Related Works
This section explains previous work on the four problems that are the focus of this study, namely, model reduction, techniques for expanding labels, expanding data points, and comparing anomaly detection methods in machine learning.

Model Reduction
Data from a smart meter are time series data. Ideally, a model would be specially designed for each meter ID. However, this would make the number of models too large to handle, given the number of smart meters. From these time series data, the behavior of similar customers can be grouped based on electricity consumption. Thus, clustering methods are needed, as long as their assumptions are valid. Many clustering models have been used, with two of the most common being centroid models and density models. Centroid models may be based on fuzzy c-means [16] or Kmeans [17], whereas density models may be a combination of principal component analysis (PCA) and density-based spatial clustering of applications with noise (DBSCAN) [18].
Kmeans clustering is often used because its time and space complexity are O(nkl) and O(k + n), respectively, where n is the amount of data, k is the number of clusters, and l is the number of iterations needed for the algorithm to converge. The weakness of Kmeans is that the number of clusters must be determined from the beginning, and the initialization point can cause the solution to fall into local minima [19]. On the other hand, DBSCAN focuses on clusters of the same density [20], and it has O(n log n) performance if the R*-tree acceleration is applied. Moreover, the memory requirement without a distance matrix is only O(n). The advantage of this algorithm is that it can separate noise, so that more similar cluster members can be obtained.
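To illustrate the centroid approach and its O(nkl) cost, a minimal Kmeans iteration can be sketched as follows. This is a toy NumPy sketch, not the implementation used in this work; the deterministic initialization and the two synthetic consumption profiles are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Kmeans: l iterations of O(n*k) distance evaluations -> O(n k l)."""
    centers = X[:k].copy()  # simple deterministic initialization (an assumption)
    for _ in range(iters):
        # assign each point to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each center as the mean of its members (skip empty clusters)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# two well-separated toy consumption profiles
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(3.0, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```

A density method such as DBSCAN would instead grow clusters from dense neighborhoods and set aside low-density points as noise, which is the property exploited later in Kmeans*HDBSCAN.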
Anomaly detection processing with machine learning is very dependent on iterations. Processing too much data at once is very burdensome for computation with high-order complexity. Therefore, clustering is very important to limit the amount of relevant data before it is fed to machine learning. Kmeans and DBSCAN differ in their focus on partition area and density level. Therefore, we propose a combination of these two clustering methods to obtain more similar members, using an intersection technique that we call Kmeans*HDBSCAN.

Imbalanced Dataset
In supervised learning, the ideal condition for training is when the number of labels from each class is the same, so that each class has the same chance in the learning process. However, in the case of anomalies, the anomaly labels are always a minority compared to the normal, majority condition. Two commonly used techniques to overcome this are oversampling the minority and undersampling the majority. Given the number of samples in imbalanced cases, oversampling is generally used, such as the synthetic minority over-sampling technique (SMOTE) algorithm [21] and its derivatives, such as adaptive sampling (ADASYN) [22]. However, Batista et al. [15] concluded that, for three minority/majority ratios (1/99, 5/95, and 10/90), five strategies, including SMOTE and ADASYN, did not yield good results.
The idea of SMOTE and other oversampling techniques is based on creating a cloud of samples between two real data points [23]. If one real data point happens to be on the wrong side, many high-bias samples tend to be generated. Thus, preprocessing must be handled before oversampling with this method. In the proposed method, however, we increase the labels instead of the feature data. The data contained in the smart meter are time series data, and anomaly labeling based on energy consumption is also a time series. To predict anomalous events, pre-labeling is applied along the specified prediction range. For example, the desired outcome may be the ability to predict seven days ahead. Suppose the dataset contains an anomaly on day 100; the anomaly labels are then expanded from day 94 to day 100. In this way, the number of anomaly labels increases according to the predetermined prediction range.
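The look-ahead label expansion described above can be sketched as follows. This is an illustrative NumPy sketch; the function and array names are assumptions, and the example reproduces the day-100/LA-7 case from the text.

```python
import numpy as np

def expand_labels(labels, la_len):
    """Extend each anomaly label (la_len - 1) days backward (look-ahead labeling)."""
    out = labels.copy()
    for t in np.flatnonzero(labels):
        out[max(0, t - la_len + 1):t + 1] = 1
    return out

# one anomaly on day 100 (0-based index 99); LA-7 expands days 94..100
labels = np.zeros(365, dtype=int)
labels[99] = 1
la7 = expand_labels(labels, 7)
```

Only labels change; the feature data stay untouched, which is what distinguishes this from SMOTE-style oversampling.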

Data Samples
A survey paper [24] states that a load profile that can predict weekly, daily, and hourly load consumption, covering the Very Short-Term and Short-Term Load Forecasting groups, is highly needed. To predict future events, previous observation points are required. For example, predicting a load with a time-of-day model generally takes from 24 h to 7 × 24 h of history, depending on forecasting expectations. This means that 24 to 168 data points are required per period. In daily cases, data points from the previous 24 to 168 days are used.
However, load profiles from smart meter technology have not been implemented in all countries, so experiments still depend on open datasets. The time span of a smart meter open dataset is often limited to one year, or 365 days, as in [25]. This dataset consists of 200 smart meter IDs over one year. This is not enough to predict a whole year on a daily basis, since part of the dataset is used for observation points. To overcome this, we propose a clustering technique: all the smart meters incorporated in a cluster are assumed to originate from the same smart meter. In this way, the amount of data per smart meter increases significantly.

Comparison of Anomaly Detection in Smart Grid
Various techniques for detecting anomalies in smart grids have been studied and discussed, such as KNN [26], Support Vector Machine (SVM) [27], and LSTM [28]. These algorithms are also discussed in the area of the Internet of Things at the network edge [29]. For EC use, the latency time parameter becomes an important measure. Therefore, a comparison of these three algorithms is observed in the following experiments. The model determination, training objectives, and the effect of the amount of data on model weight size, training time, and inference time are shown in Table 1. For applications with EC, inference time is very important, because it determines latency time. For this reason, we need a comparison of the latency times of the three models, obtained from the testing process performed on the EC device. Figure 2 shows the anomaly detection system architecture. The system is divided into six levels: Smart meters, Neighborhood Area Network (NAN), Concentrator, Wide Area Network (WAN) [8], Edge-Meter Data Management System (Edge-MDMS), and Cloud-MDMS. At the lowest level, the smart meters in one region periodically send data to the Concentrator via the NAN. Within the Concentrator, data are redirected through the WAN to the corresponding EC that will process them. The contribution of this paper is an anomaly detector of electrical energy consumption that enriches the Edge and Cloud levels, namely the Edge-MDMS and Cloud-MDMS. The Edge-MDMS has five modules that are used to store data locally and send them to the Cloud-MDMS, load models sent by the Cloud-MDMS, retrieve the inference data from local storage, process inferences, and send alarm signals when anomalies occur. This alarm signal is also sent to the Cloud as an early warning of anomalies. Periodically, the data contained in the Cloud-MDMS level storage are retrained to obtain a model that is in accordance with the latest conditions. New models that are generated periodically are sent back to the Edge-MDMS level.
Figure 3 shows a sequence diagram of how the modules interact with each other to form the ecosystem of the anomaly detection system. The smart meter is a critical part and should be maintained carefully. It contains not only precision instrumentation, but must also send data and receive control on behalf of the MDMS securely and privately. Recently, cyber-security has become prominent, so every smart meter must be equipped with sufficiently strong security facilities [35]. Apart from that, there are other matters that need to be addressed, such as tamper/theft detection, dynamic pricing, power outage detection, etc.

Anomaly Detection System
With so many functions to handle but limited computational capability over a service lifetime of up to a dozen years, the pre-processing on the meter should be as simple and secure as possible. In general, processing large amounts of data to extract features requires an upgradable and scalable machine, and in this case it is simpler to do this in the Edge than to replace all installed smart meters. Figure 4 displays the flowchart of the Anomaly Detection System. The pink, green, and blue colors represent the location of the functions in the smart meter device, edge, and cloud, respectively. Inference on three different models, Without-Clustering, Kmeans, and Kmeans*HDBSCAN, can be performed by setting the corresponding Mode value. If a clustering technique is used, the identification number of the smart meter determines the location of the cluster model.

Dataset
The dataset from [25] was obtained from 200 households in the Midwest region of the United States. This dataset contains clean data with a time stamp every 30 min from 1 January 2012 to 31 December 2012. In this experiment, the time stamps are accumulated per day to accommodate daily consumption data and then normalized. An anomaly label is statistically determined as any value exceeding the mean plus three standard deviations [36,37]. Figure 5 shows the histogram of the anomalous frequency. A total of 597 anomaly labels were obtained, scattered among 150 smart meters, whereas the other 50 smart meters have no anomaly label. Among the 150 smart meters, 24, 34, and one smart meter have one, two, and ten anomalous events, respectively. The 597 anomaly labels are only around 1.1% of the data, which places the dataset in the highly imbalanced domain, since this is close to 1%. By convention, an imbalanced class ratio lies between 1:4 and 1:100 [38]. Figure 6 shows consecutive label data formed based on the statistical calculations, LA-7, and LA-14. These labels have only two values, 0 and 1, representing normal and abnormal conditions, respectively. It is clear that expanding the labels with LA reduces the imbalance ratio between anomaly and normal labels. The two axes represent the day number, from 1 to 365, and the smart meter number. The months from June to August, equivalent to indices 180 to 240, are the summer season in the northern hemisphere. Electricity consumption normally increases in this period, mostly for cooling. Figure 7 shows how the anomaly labels are expanded. The anomaly labels are originally located in sparse areas. The labels are extended by seven days in order to compensate for minority samples before the events happen (LA-7). The original anomaly labels reside at t = 22, 32, 36. For example, the anomaly label at t = 32 (light orange) is expanded from t = 26 to t = 32.
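The mean-plus-three-standard-deviations labeling rule can be sketched as follows. This is an illustrative NumPy sketch with synthetic consumption values; the data and names are assumptions, not the paper's dataset.

```python
import numpy as np

def label_anomalies(consumption):
    """Label a day as anomalous when it exceeds mean + 3 * standard deviation."""
    threshold = consumption.mean() + 3 * consumption.std()
    return (consumption > threshold).astype(int)

rng = np.random.default_rng(42)
daily = rng.normal(10.0, 1.0, 365)   # synthetic daily consumption for one meter
daily[180] = 25.0                    # an extreme summer day (toy outlier)
anomaly = label_anomalies(daily)
```

With a roughly normal consumption profile, only extreme days cross the threshold, which is why the resulting label set is so small (about 1.1% in the real dataset).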
After expanding the anomalous labels, the third to fifth rows show how the anomaly labels are OR-ed to yield the New Label, as seen in the last row.
In Figure 8, sliding-window data are generated by combining sequences of data values, in this case each timestep is 14 days, with a label (light blue) from the next timestamp. Rows SW-1 up to SW-4 show how four sliding-window samples are generated. When LA-X is varied over 7, 14, 21, 28, and 35, the number of anomalous labels increases by 6.2%, 10.9%, 14.9%, 18.2%, and 21.3%, respectively. This percentage represents the random sampling probability without prediction techniques. Three types of datasets are prepared: Without-Clustering, Kmeans, and Kmeans*HDBSCAN. These datasets will be trained and tested with a combination of Convolution1D and LSTM [39] provided in the main Deep Neural Network (DNN) frameworks, KNN [15,29], and SVR [31]. The performance goal is indicated by higher ROC-AUC and PR-AUC and by the processing time for each prediction step. The preparation of the three types of datasets is arranged before serialization. After these three types of datasets are formed, serialization is carried out, as explained in Section 3.3.
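The sliding-window construction above can be sketched as follows. This is a simplified NumPy sketch; the 14-step window follows the Figure 8 example, while the toy series and names are assumptions.

```python
import numpy as np

def sliding_windows(series, labels, sw_len):
    """Pair each window of sw_len values with the label of the next timestamp."""
    X, y = [], []
    for start in range(len(series) - sw_len):
        X.append(series[start:start + sw_len])
        y.append(labels[start + sw_len])  # label taken just after the window
    return np.array(X), np.array(y)

series = np.arange(20, dtype=float)   # toy daily consumption sequence
labels = np.zeros(20, dtype=int)
labels[17] = 1                        # one (already expanded) anomaly label
X, y = sliding_windows(series, labels, sw_len=14)
```

Each row of X is one SW sample, and y carries the look-ahead anomaly flag, so the model learns to raise the alarm before the original event.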

Without Clustering
This time series dataset needs to be separated into training and testing sets, set at 80% and 20%, respectively. The 365 samples per meter are too few for the training process, even though a sliding-window technique was used. To overcome this problem, it is assumed that all smart meters from the resident households and their neighbors are essentially homogeneous, so each smart meter's dataset is stacked after the previous one. With this assumption, the entire serialized dataset is 73,000 × 1 samples, which is sufficient to be separated into training and testing datasets. Henceforth, this will be referred to as the Without-Clustering type.
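The stacking described here amounts to a simple reshape. This is an illustrative NumPy sketch; the (365, 200) shape follows the dataset description, and the toy values are assumptions.

```python
import numpy as np

# toy stand-in for the normalized dataset: 365 days x 200 smart meters
data = np.arange(365 * 200, dtype=float).reshape(365, 200)

# stack each smart meter's year of data after the previous one: (365, 200) -> (73000, 1)
serialized = data.T.reshape(-1, 1)
# the first 365 rows belong to meter 0, the next 365 to meter 1, and so on
```

The same stacking is applied to the labels, so sample i of the serialized data keeps its original label.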

Kmeans Clustering
Assumptions of homogeneity are not necessarily true, but they can still be corrected by grouping the dataset according to the closest cluster. One of the most widely used clustering methods is Kmeans. A total of 200 smart meters were clustered with Kmeans for two to 10 clusters and tested with the Silhouette Coefficient (SC) [40]. SC gives a weight close to one if the resulting cluster has members that are close to the center of the cluster, and negative one if the members of the cluster are closer to other clusters. Silhouette values range from -1 to +1, where +1 indicates that the object matches its own cluster and does not match neighboring clusters. From this test, only two clusters are recommended, with an SC value of 0.301, indicating that many points match their cluster but some do not. Algorithm 1 explains how the SC is determined after separating the data into two to NMAX+2 clusters. After the clusters are determined, the data need to be split into the appropriate clusters. This is done in Algorithm 2, which combines similar smart meter IDs into one sub-dataset.
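The SC computation behind this selection can be sketched as follows. This is a toy NumPy implementation of the silhouette coefficient for given cluster labels, not the library routine used in the experiments, and the two tight groups are illustrative.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: (b - a) / max(a, b) per point, averaged over all points."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = []
    for i in range(n):
        same = (labels == labels[i])
        a = d[i, same & (np.arange(n) != i)].mean()   # mean intra-cluster distance
        b = min(d[i, labels == c].mean()              # mean distance to nearest other cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# two tight, well-separated groups should score close to +1
X = np.vstack([np.full((10, 2), 0.0), np.full((10, 2), 5.0)])
X += np.random.RandomState(0).normal(0, 0.05, X.shape)
good = silhouette(X, np.array([0] * 10 + [1] * 10))
```

A score such as 0.301 for the real data sits well below this ideal, consistent with the observation that some meters do not match their assigned cluster.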

Kmeans*HDBSCAN Clustering
We combine Kmeans and HDBSCAN. Kmeans is basically a partitioning method that assumes the data are collected in spherical clusters of equal size and density, densest at the center of the cluster, and free of noise or outliers. However, this real dataset has noise that should be removed before the learning process. Therefore, we apply HDBSCAN, which has the capability to identify noise as well as cluster the data [41].
In Algorithm 3, the Kmeans*HDBSCAN process is carried out. In this process, all of the clusters from HDBSCAN are intersected with all of the clusters from Kmeans. Because the clusters are mutually exclusive, it is not possible for one ID to occupy two sub-clusters of Kmeans*HDBSCAN. Figure 9 illustrates Kmeans producing two clusters, C0 and C1, while HDBSCAN also produces two clusters, H0 and H1, plus a noise cluster. In this case, C0H0 has smart meters ID-1 and ID-3, whereas the noise cluster members are ID-7, ID-9, and ID-10.
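The intersection step of Algorithm 3 can be sketched with plain set operations. This is an illustrative sketch: IDs 1, 3, 7, 9, and 10 follow the Figure 9 example, while the remaining memberships are assumptions made up to complete it.

```python
# Kmeans partitions (every ID belongs to exactly one cluster)
C0 = {1, 3, 5}
C1 = {2, 4, 6, 7, 8, 9, 10}

# HDBSCAN clusters plus the noise group it separates out
H0 = {1, 2, 3, 4}
H1 = {5, 6, 8}
noise = {7, 9, 10}

# intersect every Kmeans cluster with every HDBSCAN cluster
subclusters = {(kc, hc): kmembers & hmembers
               for kc, kmembers in {"C0": C0, "C1": C1}.items()
               for hc, hmembers in {"H0": H0, "H1": H1}.items()}

c0h0 = subclusters[("C0", "H0")]   # the Figure 9 example: IDs 1 and 3
```

Because Kmeans clusters are disjoint and HDBSCAN clusters are disjoint, every ID lands in at most one sub-cluster, and noise IDs are dropped entirely.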

Platform
Transmission delay, latency in data processing, and deadlines are the main factors in the design of systems with smart meters. A transmission delay that is too large requires a fast edge/fog server and cloud to reduce latency, so that deadline targets can be met. A previous paper discusses the effect of the average delay ratio between fog and cloud, with the number of fog servers being four times that of the cloud and the fog processing power being 10% of the cloud's [42]. The variable is the fog-to-cloud delay ratio, which is changed from 1% to 85%. In Figure 10, this delay ratio is Delay-2 to Delay-3.
Delay-x is a combination of various kinds of delay, such as transmission, propagation, queueing, and router/nodal processing delays. A 1% delay ratio means that the smart meter device is very close to the edge/fog server and the cloud is relatively very far away. In this experiment, a total latency of 100 s was found to be the cut-off point, when the average delay ratio was 85% and the data request was 5000 packets. An average request data size greater than this intersection indicates that the edge/fog latency is close to that of the cloud and very far from that of the smart meter device. In this experiment, the Latency-1 of the cloud was set at 10 ms. Therefore, in this study, the measurement of Latency-1 and Latency-2 is needed as parameters for deploying this system in an actual application; this is discussed in Section 4.2.5. The latency measurement is performed during the testing process. The average time difference before and after inference is defined as latency. Two latencies are sought, Latency-1 and Latency-2, which are located on the PC server and the edge, respectively. Algorithm 4 shows how the measurement process is applied. The PC server and the edge, represented by a Raspberry Pi, both run in stand-alone mode to reduce interference from other processes.
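The averaged before/after timing can be sketched as follows. This is a minimal sketch of the measurement idea; `infer` is a placeholder standing in for the loaded model's prediction call, not the actual DNN/KNN/SVR model.

```python
import time

def measure_latency(infer, samples, repeats=100):
    """Average wall-clock time of one inference (time difference before/after)."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in samples:
            infer(x)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(samples))  # seconds per inference

# placeholder inference: a trivial threshold "model" (an assumption)
infer = lambda x: int(x > 0.5)
latency = measure_latency(infer, [0.1, 0.7, 0.9], repeats=10)
```

Running the same measurement on both the PC server and the Raspberry Pi gives Latency-1 and Latency-2, respectively; averaging over many repeats smooths out scheduler jitter.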

Methods
The experimental steps are shown in Figure 11 and explained as follows. The first step is to load the original dataset. The original dataset is a collection of clean data from 200 smart meter IDs, with 30-min time stamps over one year. This dataset is rearranged by accumulating it from 30 min to one day in order to conform with the day as the prediction domain. The dataset is then normalized to the maximum value of each smart meter. The dataset is shaped into 365 days by 200 IDs (365, 200). The second process is to create labels based on the statistical definition mean + 3 × standard deviation. The size and shape of these labels are the same as those of the dataset. This step is continued with the serialization process, which changes the shape from (365, 200) to (73,000, 1) by stacking the data and labels of each smart meter. This first step has two other variations, namely serialization with clustering, which clusters all of the data into two clusters, and serialization with Kmeans*HDBSCAN. The difference between with and without clustering is that the selectivity of the smart meter data to be trained and tested becomes narrower. Clustering all of the data into two clusters splits the 200 smart meters into two clusters. The clustering method applied is Kmeans with k = 2. The resulting clusters have 123 and 77 smart meters.
The second step is to prepare the dataset so that it can be processed by machine learning, in this case Conv1D-LSTM, KNN, and SVR. The format requested by LSTM is (samples, time, feature). The only feature used here is electricity usage. Time represents the number of time stamps, or timesteps, used to predict the output. Samples states the number of data samples that are already in time series format. Time is composed of sliding-window features. In this experiment, the feature or data width of the sliding windows used is 32, 64, and 128. The width of the sliding window label is set at 7, 14, 21, 28, and 35, representing one- to five-week predictions. The purpose of this sliding window is to label the anomaly several timesteps before the predicted anomaly. The second goal is to increase the anomaly labels, so that the imbalance rate is reduced. The algorithm used to generate labels and sliding windows is described in Algorithm 5. This algorithm has been used to read sequence data and convert it to standard LSTM data and an output vector [43]. The difference with this algorithm lies in line number 9, where the output will be labeled 1 as long as there are anomalies in the range:

    // Creating dataset with sliding windows and new look-ahead labels
    1 SlidingWindows(train_ori, label, sw_len, LA_len);
    2 ptr_start_in ← 0;
    3 for none in range(len(train_ori)): do
    4 ptr_end_in ← ptr_start_in + sw_len;

The third step is to find good hyperparameters for the DNN. The simplified DNN model used to find the hyperparameters is presented in Algorithm 6. The first trial is to find the number of epochs that yields relatively good training performance. The other parameters tested are the number of LSTM units/tensors (32, 64, 256), which represents the output dimension, and the Convolution stack depth (3, 4, 5), from which good features based on the Maxpooling can be found.
The best candidate model is chosen based on the best performance. Subsequently, this candidate model was tested again with stratified cross-validation, k = 5, which divides the anomaly labels proportionally into 80% for training and 20% for testing. With cross-validation at k = 5, training was conducted five times, and the ROC-AUC results were averaged to represent the experiment. Furthermore, the best model is tested again for variations of the activation function in the last layer (Sigmoid, ReLU). The performance expected for EC is an accurate prediction, seen from the highest ROC value, and the shortest prediction time.
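The stratified split can be sketched as follows. This is a simplified pure-Python sketch of stratified k-fold index assignment, not the library call actually used; the 90/10 toy label list is an assumption.

```python
def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal this class round-robin across folds
    return folds

# 90 normal labels and 10 anomalies -> each fold gets 18 normal + 2 anomalies
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
```

Holding out one fold at a time gives the 80%/20% train/test split while preserving the anomaly ratio, which matters for such an imbalanced label set.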
The two other sub-steps apply KNN and SVR to the same dataset to find the ROC-AUC of each algorithm. The KNN experiment was done by choosing the best k by manual selection after the ROC and PR were calculated and summed to yield a single value. Here, k was searched through the sequence 20, 50, 35, 28, 43, 39, 37, 41, and 100. The SVR's main parameters treated as variables are C and ε. Parameter C is varied from 1 to 4 and ε from 0.1 to 0.4. After training, all of the model weights are saved into files to select the best results as model comparators to the DNN.
The fourth and fifth steps repeat step three, but with the dataset serialized with Kmeans and with Kmeans*HDBSCAN, respectively. The parameters chosen in step three are used again here.
The last step is loading the models into a Raspberry Pi to find the processing time. Three persistent models, from DNN, KNN, and SVR, are selected and transferred to the Raspberry Pi. Each model was run, and the processing time during inference was observed. These results show the performance of each algorithm in the Edge.

Experimental Environment
The experimental environments are shown in Table 2, which lists the hardware and software used in these experiments. A standalone PC server is connected to the Raspberry Pi in a dedicated LAN. The Raspberry Pi is chosen as the Edge device mainly because of its library support and compatibility with ML operations and the Tensorflow framework.

Results
The experimental results are explained in three subsections, namely Without Clustering, Kmeans, and Kmeans*HDBSCAN, as follows.

Without Clustering Experiment
At the beginning, the input data are processed in the Convolution and Maxpooling layers. Both layers are utilized to find the maximum keypoints among the data points. Convolution increases the number of parameters, and Maxpooling decreases it. Maxpooling 1 × 3 reduces the number of calculations to only 1/3 of the output parameters. These parameters can be calculated manually or by running the Keras function model.summary().
The output parameters are then fed into the LSTM. One of the LSTM hyperparameters is the Unit. The LSTM unit number determines the output dimension, or Tensor, whose size scales linearly inside the LSTM layer. For two LSTM layers, the output parameters from layer one are multiplied by the unit number in layer two. This causes the number of variable tensors to be processed to be high if the parameters of the previous layer are large.
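The parameter growth described above can be checked with simple arithmetic using the standard Keras parameter-count formulas. The layer sizes below are illustrative assumptions, not the exact configuration of the paper's models.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Trainable parameters of a Conv1D layer (kernel weights + biases)."""
    return kernel_size * in_channels * filters + filters

def lstm_params(input_dim, units):
    """Trainable parameters of an LSTM layer: 4 gates of (input + recurrent + bias)."""
    return 4 * (units * (input_dim + units) + units)

def maxpool_len(n, pool=3):
    """Maxpooling 1x3 keeps only one of every three timesteps."""
    return n // pool

# a single-feature LSTM with 64 units vs. stacking a second 64-unit layer:
one_layer = lstm_params(input_dim=1, units=64)
two_layer = one_layer + lstm_params(input_dim=64, units=64)
```

The second layer's recurrent block sees the full 64-dimensional output of the first, so stacking roughly triples the parameter count here, matching the observation that 2L configurations demand much more CPU.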
In Figure 12, the horizontal axis represents the experiments, varying the unit numbers and activation functions. Each axis point has six experiments, three each for one and two LSTM layers, represented by bar and line graphs, respectively. For example, 32Tanh shows three experiments with a single-layer LSTM with convolution up to 16, 32, and 64 layers, symbolized by the dashed-1L16, dotted-1L32, and light-dotted-1L64 bars, respectively. When 1L64 and 2L64 are compared, they look similar, but the ordinate time scale of 2L is 1.6 times that of 1L. This experiment confirms that 2L requires more CPU power, which may not be suitable for running in EC. Regarding the convolution layers, 1L64 is faster than 1L32 and 1L16. This experiment also shows that the longer layer fills many points with zeros, since the maximum values are preferred, and calculations with zeros are faster. Hence, 1L64 is faster than 1L32 or 1L16. The next step is to find the best ROC-AUC and PR-AUC results. Table 3 shows the ROC-AUC, PR-AUC, and the sum of ROC-AUC and PR-AUC results. The highest experimental ROC-AUC is yielded by N64-CV64 with the Sigmoid activation function, which obtains a score of 0.885. N64-CV64 is the abbreviation of an LSTM with 64 units, or tensors, and five Convolution layers (4, 8, 16, 32, 64). The average training times per step of model N64-CV64 for Tanh, Sigmoid, and ReLU are 371 µs, 400 µs, and 442 µs, respectively.
The best PR can be obtained using N64-Cv64-ReLU, which has a PR-AUC of 0.595. The best sum of ROC-AUC and PR-AUC is found using N32-Cv64-Sigmoid with 1.465, which is the best result in this table. If inference speed is important in the Edge, model N32-Cv64-Tanh with a score of 0.852 is the best choice. For this reason, N32-Cv64-Tanh is chosen for subsequent experiments. The number of imbalanced labels is expanded by adding labels before the original anomaly label, based on the assumption that the system must be able to warn of the anomaly at previous timestamps. In this experiment, changing the label to an anomaly was done 7, 14, 21, 28, and 35 days before the anomaly occurred. This change increases the number of labels without changing the input data. Table 4 shows the test results of a one-layer LSTM with timesteps 32, 64, and 128. Normally, increasing the timesteps increases the performance, because the network is given more data for its estimation. Table 4 confirms this hypothesis for ROC-AUC and PR-AUC. A smaller timestep is faster, because less memory is included in the calculation. However, in the next experiment, we want to compare performance with the other models; that is why timestep 128 is preferred here. Intuitively, a longer LA increases the anomaly detection probability. This implies increasing PR-AUC performance, since True Positives increase, as shown in Table 4. The difference between D200-064 and D200-128 is not very significant, but D200-064 requires only half of the sample size, so the amount of time used to predict is also shorter. However, if the memory and processing time are still sufficient to fulfill the specification, D200-128, which performs slightly better, should be chosen. The above experimental results were compared with SVR and KNN. SVR has two hyper-parameters, C and ε. A limited grid search was chosen in order to find the parameters that give good results.
Table 5 is obtained from the results of experiments with parameters C = 1 to 4 and ε = 0.1 to 0.4. The dataset is the same as in the DNN experiment without clustering. Most SVR results perform slightly better than the DNN. However, the SVR training time per sample, between 5.81 ms and 32.02 ms, is far longer than that of the DNN. The smallest running time per sample occurs when C = 1 and ε = 0.2. This result confirms that the smallest C is computationally less expensive than a higher C [34]. In the KNN experiments, k is not swept from 2 to 50, as in [44], because this takes a lot of time. Here, manual selection, as described in Section 3.3, similar to binary search, was chosen to find the optimal k. The result is shown in Figure 13. The ROC-AUC and PR-AUC are the upper and lower curves, respectively. From this graph, the optimal k is found at k = 37. The k in KNN is selected as the value of k that produces the best ROC-AUC and PR-AUC from the testing results when run in Without-Clustering mode.
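The binary-search-like manual selection of k can be sketched as follows. This is an illustrative sketch under a strong assumption: the score function here is a toy unimodal stand-in for the summed ROC-AUC and PR-AUC, not the paper's actual metric, and it is constructed to peak at k = 37 purely for demonstration.

```python
def pick_k(score, lo=2, hi=100):
    """Bisection-style search for the k that maximizes a unimodal score(k)."""
    while hi - lo > 2:
        mid = (lo + hi) // 2
        if score(mid + 1) > score(mid):
            lo = mid + 1   # the peak lies to the right of mid
        else:
            hi = mid       # the peak is at mid or to its left
    return max(range(lo, hi + 1), key=score)

# toy unimodal score peaking at k = 37 (an assumption for the sketch)
score = lambda k: -(k - 37) ** 2
best_k = pick_k(score)
```

Each probe trains and evaluates one KNN model, so this style of search needs only a handful of evaluations instead of sweeping every k from 2 to 50.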

Kmeans Experiment
Clustering at the beginning of the process can be used to group households that have similar profiles. Table 6 is obtained from the Kmeans experiments with k = 2. The smart meters are divided into two clusters, C0 and C1, with 77 and 123 smart meter IDs, respectively. There is a significant difference between C0 and C1: C0 shows better performance than C1 because it has more anomalous labels. The original C0 has 438 anomalous points, while C1 only has 159. Among all data in clusters C0 and C1, the anomalous points are 1.6% and 0.4%, respectively.
In the case of DNN C0-128 compared with D200-128 for all LA steps, the ROC-AUC and PR-AUC scores increase by 10.2% and 48.4%, respectively. This increase raises true positives and significantly decreases false positives and false negatives. If the number of smart meters is quite large, it should be possible for C1 members to be moved to the right cluster, as long as it has enough anomalous labels. For another experimental perspective, measuring consumption patterns with Dynamic Time Warping has been proposed, as compared to using Kmeans [14]. The benefit of such clustering is mainly to reduce the number of clusters, because small shifts of electrical consumption should not be considered a different cluster. In the next subsection, we show the results of our proposed Kmeans*HDBSCAN experiment, which has the ability to increase the performance of the dominant cluster.
The results of the clustering experiments trained with the DNN, SVR, and KNN methods are shown in Tables 6-8. All three tables show poor performance on C1, so those results cannot be used in this clustering group. For the three methods, the average PR-AUC of C0-128 for DNN, SVR, and KNN is 0.837, 0.853, and 0.859, respectively. The differences are small, with KNN being the best.

Kmeans*HDBSCAN Experiment
In the clustering experiment, the Kmeans dataset was separated into two clusters, C0 with 77 and C1 with 123 members. Kmeans assumes that the data are not contaminated with noise, so an algorithm is needed that can detect noisy data and group them separately. Density-based algorithms, such as DBSCAN and HDBSCAN, can be utilized. In this experiment, HDBSCAN was chosen because it offers better performance than its predecessor.
The HDBSCAN clustering results consist of a noise group and two clusters, H0 and H1, with H0 = 80, H1 = 57, and Noise = 63 members. The next step intersects C0 and C1 with H0 and H1 to form four new sets. Table 9 shows the intersection members of C0H0, C0H1, C1H0, and C1H1 as 74, 0, 57, and 63, respectively. As seen in Figure 14, a uniform pattern appears when all members of C0H0 are plotted.
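The intersection step can be illustrated with plain set operations: meter IDs from each Kmeans cluster are intersected with the HDBSCAN clusters (noise excluded), yielding the four candidate sets. The ID ranges below are invented for illustration, so the resulting counts are not the paper's Table 9 values.

```python
# Sketch of the Kmeans*HDBSCAN intersection: k-means cluster memberships
# crossed with HDBSCAN cluster memberships. Meter IDs are illustrative.
C0 = set(range(0, 77))      # 77 members from k-means
C1 = set(range(77, 200))    # 123 members from k-means
H0 = set(range(3, 83))      # 80 members from HDBSCAN
H1 = set(range(120, 177))   # 57 members from HDBSCAN
# IDs outside H0 and H1 are HDBSCAN noise and are dropped

intersections = {
    "C0H0": C0 & H0,
    "C0H1": C0 & H1,
    "C1H0": C1 & H0,
    "C1H1": C1 & H1,
}
for name, members in sorted(intersections.items()):
    print(name, len(members))
```

A model is then trained per non-empty intersection, with the dominant set (here C0H0) carrying most of the anomalous labels.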
Kmeans*HDBSCAN C0H0, as shown in Table 10, when compared to Kmeans C0 at 128 timesteps over all LA, increases the average performance of DNN, SVR, and KNN by 2.9%, 0.7%, and 3.3%, respectively. KNN, which previously averaged 0.859, increases to 0.888 with Kmeans*HDBSCAN, while DNN and SVR increase from 0.837 to 0.861 and from 0.853 to 0.860, respectively. Figure 15 compares DNN, SVR, and KNN, each with the Kmeans and Kmeans*HDBSCAN types; Kmeans*HDBSCAN C0H0 outperforms Kmeans C0 at 128 timesteps. Figure 16 shows that, in this anomaly detection task, KNN performs better than DNN and SVR, whose performances are similar. The results of KNN C-128 with LA-24 are generated from the power trendline equation shown in Equation (1). The PR-AUC value for LA = 24 is 0.918, i.e., an imperfection of 0.082. In [14], the best 24-hour prediction was obtained from the smallest DTW error of 0.182. These results indicate that the proposed method is potentially better.
The SMOTE algorithm is also tested for timesteps = 128 and LA-14. Unlike the other algorithms, SMOTE expands the minority sample to the same size as the majority sample, so that the total number of samples is twice the majority sample. With a balanced sample, the test samples are expected to fall within the expanded region. The test results are shown in Table 10 under C-128-SM. SMOTE does not increase the ROC-AUC or PR-AUC of SVR, DNN, or KNN; its performance in this experiment is slightly lower than that of Kmeans*HDBSCAN. Another disadvantage of SMOTE is that the number of training samples increases: for DNN and SVR, training takes longer, while inference takes longer for SVR and KNN. The SVR and KNN persistence files also grow to 4.9 MB and 35.6 MB from the previous 3.5 MB and 23.8 MB, respectively. The distribution of the forecast error from Kmeans*HDBSCAN with DNN, timesteps = 128, LA-14, and 2577 test samples is shown in Figure 17. More than 90% of the prediction errors reside in the first left and right bins near 0.00. Every sample is compared with label-0 or label-1, so the minimum and maximum errors are -1.00 and 1.00. This confirms the high True Positive and True Negative scores.
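The SMOTE idea used in the C-128-SM run can be sketched as follows: each synthetic minority point is an interpolation between a minority sample and one of its minority-class nearest neighbors, until both classes have equal size. In practice a library such as imbalanced-learn would be used; this minimal NumPy version, with invented class sizes, only illustrates the mechanism.

```python
# Sketch of SMOTE-style oversampling: interpolate between a minority
# sample and one of its k nearest minority neighbors. Not the paper's code.
import numpy as np

def smote_upsample(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to sample i
        nbrs = np.argsort(d)[1:k + 1]                 # k nearest, excluding self
        j = rng.choice(nbrs)
        lam = rng.random()                            # interpolation factor
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)

rng = np.random.default_rng(1)
X_major = rng.normal(0.0, 1.0, (500, 8))   # majority (normal) samples
X_minor = rng.normal(3.0, 1.0, (25, 8))    # minority (anomalous) samples

synthetic = smote_upsample(X_minor, len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, synthetic])
print(X_balanced.shape)  # total is twice the majority class: (1000, 8)
```

The doubled training set is exactly why the SVR and KNN persistence files grow as reported above.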

Performance Evaluation for All Clusters and Dominant Cluster
The performance of the system is tested and compared over all clusters and over the dominant cluster only. Testing the entire cluster set is done by summing the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts of each cluster model, calculated from the experiment with timesteps 128 and LA-14. From these, the true positive rate (TPR), false positive rate (FPR), precision, recall, and F1 metrics are computed. The dominant cluster is treated in the same way, but only one cluster is measured. Tables 11 and 12 present all the performances of the machine learning methods applied here. The F1-score over all clusters is always higher than the XGBOOST baseline [45] in the Without-Clustering type. Applying Kmeans*HDBSCAN for all machine types in this experiment always results in an F1 score slightly lower than Kmeans, but always higher than the Without-Clustering type. Recall (TPR) for DNN, SVR, and KNN with Kmeans*HDBSCAN is 0.018, 0.033, and 0.029 better than with Kmeans, respectively, due to a decrease in ΣFN or an increase in ΣTP.
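The all-clusters evaluation above amounts to summing per-cluster confusion counts and deriving the metrics from the totals. The counts below are invented placeholders, not values from Tables 11 and 12; only the aggregation and formulas reflect the procedure.

```python
# Sketch of the all-clusters evaluation: sum TP/FP/TN/FN across cluster
# models, then compute TPR, FPR, precision, and F1. Counts are illustrative.
per_cluster = [
    {"TP": 700, "FP": 60, "TN": 9000, "FN": 90},   # e.g. a dominant cluster
    {"TP": 120, "FP": 40, "TN": 4000, "FN": 80},   # e.g. a remaining cluster
]

tot = {k: sum(c[k] for c in per_cluster) for k in ("TP", "FP", "TN", "FN")}
tpr = tot["TP"] / (tot["TP"] + tot["FN"])          # recall
fpr = tot["FP"] / (tot["FP"] + tot["TN"])
precision = tot["TP"] / (tot["TP"] + tot["FP"])
f1 = 2 * precision * tpr / (precision + tpr)
print(round(tpr, 3), round(fpr, 3), round(precision, 3), round(f1, 3))
```

The dominant-cluster evaluation is the same computation restricted to a single entry of `per_cluster`.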
In the dominant cluster, the F1-score of Kmeans*HDBSCAN is 0.007, 0.016, and 0.010 higher than Kmeans for the DNN, SVR, and KNN machines, respectively. The dominant cluster is the cluster that has more anomalies than the others. Of the 1052-1053 actual anomalies in the testing dataset, 749-769 fall in the dominant cluster. Table 12 shows that the F1 scores of Kmeans*HDBSCAN C0H0 are always better than those of Kmeans C0 for all machine-learning methods in these experiments. This reflects that the combination of Kmeans and HDBSCAN produces better performance than Kmeans alone in the cluster with a higher anomaly probability.

Running on Edge Machine
As is usual in machine learning, training requires an iterative process and therefore takes longer than inference. A high number of parameters affects the inference speed. The SVR model is obtained after an iterative calculation that searches for w (the weight vector) separating the dataset with a hyperplane. The number of parameters in w grows asymptotically with the amount of data X, based on [33], which cites the work of [32] (i.e., wᵀX). After training, not all data points remain important, only those that become support vectors.
Comparing the sizes of the DNN, SVR, and KNN models, the order from smallest is DNN, SVR, KNN. KNN stores all training data without prior processing, so its training is the fastest, but it takes somewhat longer to load on the Raspberry Pi. DNN, on the other hand, starts from a heuristic architectural design: the model is selected and fixed from the beginning, so the number of parameters can be predicted in advance and does not depend on the data size, as it does for SVR and KNN. Table 13 and Figure 18 display the test results in tabular and graphical form. Testing is done on the server and the edge to measure Latency-1 (Figure 18a) and Latency-2 (Figure 18b), respectively, and to measure the persistent file size (Figure 18c) of the machine-learning implementation. DNN testing on the PC server is faster than SVR and KNN because DNN is supported by an NVIDIA GPU, while SVR and KNN, built on the sklearn library [34], are only optimized for the host processor; sklearn does not yet provide a version that can access the GPU at run time on PCs and servers. Latency-1 of DNN, SVR, and KNN with Kmeans*HDBSCAN on the PC server is 0.23 ms, 0.47 ms, and 2.90 ms, respectively. On the edge with Kmeans*HDBSCAN, DNN infers 3.8 and 19.6 times faster than SVR and KNN, respectively; that is, Latency-2 of DNN, SVR, and KNN on the Raspberry Pi is 1.24 ms, 4.76 ms, and 24.34 ms, respectively. DNN is also always faster than SVR and KNN in the Kmeans clustering case, although the inference speed of SVR and KNN improves greatly in clustering mode. This is consistent with the theory that the SVR and KNN models depend on the amount of training data, while the DNN inference time is relatively fixed because the model does not depend on the amount of trained data.
Clustering techniques reduce the amount of training data, so the cost of computing support vectors and of searching for samples around a query point decreases. Even with this reduction for SVR and KNN, DNN is still far faster. Figure 18c also shows the model file sizes, with the DNN model the smallest, followed by SVR and KNN. The DNN model is stored in .h5 and .json files, while SVR and KNN use .joblib. In Without-Clustering mode, the DNN model is 69 and 618 times smaller than SVR and KNN, respectively, because it is a pre-designed model tuned over several tests with parameters and options as variables. Even after clustering, the SVR and KNN persistence files remain much larger than the DNN. A small persistence file speeds up replacing the model sent by the cloud when the old model is no longer suitable. DNN is therefore far better than SVR or KNN in terms of latency and model file size. Each inference on the Raspberry Pi takes around 1.25 ms, which, with a communication delay of about 10 ms, allows processing of 89 smart meter samples per second. For a larger sensor network with a timestep of 30 min and a transmission duty cycle of 1/100, transmission can effectively be done in 30 × 60 s × 1/100, or 18 s. Within these 18 s, at 89 inferences per second, this simple edge can process 1600 smart meters at a time. If five retransmissions are allowed, then a maximum of 1600/5 = 320 smart meters can be served per 30 min. Our dataset contains 200 smart meter identifiers, so this Raspberry Pi is sufficient for the specification. Thus, even though the latency at the edge is five times higher than on the PC server, the specification can still be fulfilled.
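The capacity arithmetic above can be checked with a short back-of-the-envelope calculation using the figures quoted in the text (1.25 ms inference, ~10 ms communication delay, 30-minute timestep, 1/100 duty cycle, five retransmissions).

```python
# Back-of-the-envelope check of the edge capacity figures from the text.
inference_ms = 1.25
comm_ms = 10.0
per_sample_s = (inference_ms + comm_ms) / 1000.0
inferences_per_s = 1.0 / per_sample_s            # ~89 samples/second

window_s = 30 * 60 * (1 / 100)                   # 18 s effective window
meters_per_window = inferences_per_s * window_s  # ~1600 meters per window
meters_with_retx = meters_per_window / 5         # 5 retransmissions allowed

print(round(inferences_per_s), round(meters_per_window), round(meters_with_retx))
```

The result, about 320 meters per 30-minute period, comfortably covers the 200 meters in the dataset.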
In Table 13, DNN and KNN show more linear results than SVR. SVR Without-Clustering processing took 1.38 ms on the PC but 33.11 ms on the Raspberry Pi. A possible explanation is that the SVR library (sklearn) optimizes the solution based on the underlying hardware architecture; the i7 architecture is more complex than the Arm-based architecture, which can make processing much faster. In addition, the complexity of SVR, O(n_features × n²_samples), is largely determined by the number of samples. Accordingly, the latency does not scale linearly, being a product of the number of samples and the architecture model.

Conclusions
There are four conclusions that can be drawn from the experiments carried out. First, we have proposed an Edge deployment model to overcome the excessive latency of the centralized model. The edge receives data sent from the smart meter with a client-server model via Ethernet as a simulation of data transmission. In the inference time per sample and persistent file size tests, DNN outperformed SVR and KNN for all of the Without-Clustering, Kmeans, and Kmeans*HDBSCAN experiments. Second, the PR-AUC performance of anomaly detection for the Without-Clustering, Kmeans, and Kmeans*HDBSCAN types showed that KNN was always slightly better, followed by SVR and DNN. Third, the multitiered solution applied in each experiment behaves intuitively: extending the sliding-window timesteps and widening the look-ahead label always improve PR-AUC performance. Finally, a comparison test on the dominant clusters shows that the dominant clusters of Kmeans*HDBSCAN are always slightly better than those of Kmeans alone.

Acknowledgments:
The authors obtained support from Computer Science and Information Engineering, National Chung Cheng University, Taiwan.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: