# Analysis of Machine Learning Algorithms for Anomaly Detection on Edge Devices


## Abstract


## 1. Introduction

- Detailed analysis of five ML algorithms (logistic regression, support vector machine, decision tree, random forest, and artificial neural network) to determine anomaly detection performance on traffic traces between different IoT nodes communicating over the common DS2OS middleware.
- Proposal of two general and intuitive approaches for reducing the size of an imbalanced training dataset while keeping comparable classification results: randomly under-sampling the majority class (‘NL’), and under-sampling each class with clustering to select the most representative observation samples.
- Evaluation of ML algorithm training times on the Raspberry Pi 4, comparing small randomly selected imbalanced datasets with the new reduced balanced datasets, as well as examination of memory usage with regard to suitability for resource-constrained edge devices.

## 2. Related Work

#### 2.1. Edge Computing

#### 2.2. Machine Learning

#### 2.2.1. ML Algorithms

- Logistic regression (LR) is a linear model for classification [26]. In the scikit-learn implementation used, regularization is applied by default; the classifier was instantiated as `LogisticRegression(class_weight='balanced', max_iter=10000, n_jobs=-1)`. The solver for the optimization problem is lbfgs [27], which by default minimizes the cross-entropy loss in the multiclass case. The parameter `class_weight` was set to balanced mode, which uses the values of the output $y$ to automatically adjust weights inversely proportional to class frequencies in the input data. The parameter `max_iter` sets the maximum number of iterations for the solver to converge and was raised from the default value of 100 to 10,000 to prevent the solver from failing to converge. The parameter `n_jobs` sets the number of CPU cores that can be used in a multiclass problem with a one-vs.-rest (OvR) scheme and was set to -1 (use all available processors) for all runs, although it had no effect in this case because the cross-entropy loss was used for the multiclass problem.
- Support vector machine (SVM) is a supervised learning model used for classification and regression [28]. Scikit-learn’s C-Support Vector Classification implementation is based on libsvm [29] and was instantiated as `SVC(class_weight='balanced')`. By default, it uses a radial basis function (RBF) kernel and l2 regularization with a strength of 1.0. Multiclass support is handled with a one-vs.-one scheme. The parameter `class_weight` was set to balanced mode, which uses the values of $y$ to automatically adjust weights inversely proportional to class frequencies in the input data.
- Decision tree (DT) is a non-parametric supervised learning method for classification [30]. In the scikit-learn implementation used, it is defined as `DecisionTreeClassifier()` [31]; the default criterion for measuring the quality of a split is Gini impurity, a measure of how often a randomly chosen element from the set would be incorrectly labeled. No parameters were set outside of their default values.
- Random forest (RF) is an ensemble method that combines the predictions of several base estimators to improve the robustness of the overall estimator [32]. Each tree in the ensemble is built from a sample drawn with replacement from the training set. In the function call `RandomForestClassifier(n_estimators=100, n_jobs=-1)`, the scikit-learn default of 100 trees is used, with Gini impurity as the default measure of split quality. The whole dataset is used to build each tree. The parameter `n_jobs` was set to -1 to use all available CPU cores for parallelizing the fit and predict methods over the trees.
- Artificial neural network (ANN) is a circuit of connected neurons, each of which delivers an output based on its inputs and a predefined activation function [33]. The Keras library with a TensorFlow backend was used for the ANN model, with 11 nodes on the input layer, 32 nodes on a hidden layer with the ReLU (rectified linear unit) activation function, and 8 output nodes with the softmax activation function to normalize the outputs. The selected optimizer was Adam, the loss function was sparse categorical cross-entropy, and the number of epochs was set to ten.
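As a concrete illustration, the four scikit-learn configurations above can be sketched as follows; this is a minimal sketch of the stated parameters, not the authors' exact code (the Keras ANN is only summarized in a comment).

```python
# Sketch of the classifier configurations described above (scikit-learn).
# The Keras ANN (11-32-8 layers, ReLU/softmax, Adam, 10 epochs) is omitted here.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    # balanced class weights counteract the dominant 'NL' class
    "LR": LogisticRegression(class_weight="balanced", max_iter=10000, n_jobs=-1),
    # scikit-learn defaults: RBF kernel, C=1.0, one-vs.-one multiclass scheme
    "SVM": SVC(class_weight="balanced"),
    # default split criterion: Gini impurity
    "DT": DecisionTreeClassifier(),
    # 100 trees; n_jobs=-1 parallelizes fit/predict over all available cores
    "RF": RandomForestClassifier(n_estimators=100, n_jobs=-1),
}
```

Each model exposes the same `fit`/`predict` interface, which is what makes the head-to-head comparison in Section 3 straightforward.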

#### 2.2.2. Evaluation Metrics

- Accuracy determines how many predictions the classifier got right out of all predictions (Equation (1)). It is defined as the sum of true positives (TP) and true negatives (TN) divided by the sum of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):$$\mathrm{Accuracy}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}$$While higher is better when all classes contain an approximately equal number of samples, relying on accuracy alone in imbalanced datasets often hides misclassification of the minority classes;
- Precision is the fraction of relevant instances among the retrieved instances (Equation (2)). It is defined as the number of true positive (TP) results divided by the sum of true positive (TP) and false positive (FP) results;$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
- Recall is the fraction of the total amount of relevant instances that were actually retrieved (Equation (3)). It is defined as the number of true positive (TP) results divided by the sum of true positive (TP) and false negative (FN) results;$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
- F1 score is the harmonic mean of precision and recall (Equation (4)). The highest possible value of F1 is 1, indicating perfect precision and recall, and the lowest possible value is 0, reached if either the precision or the recall is zero;$$\mathrm{F1\ score}=\frac{2\left(\mathrm{Precision}\cdot \mathrm{Recall}\right)}{\mathrm{Precision}+\mathrm{Recall}}$$
- Confusion matrix is a specific table layout that visualizes the performance of an algorithm, typically a supervised learning algorithm. In the Python implementation, each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class (Figure 1), making it easy to see all falsely classified samples. The more samples lie on the diagonal of the matrix, the better the model.
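All of these metrics are available in scikit-learn's metrics module [34]. A small worked example on hypothetical binary labels (TP = 3, TN = 3, FP = 1, FN = 1 for the positive class 1):

```python
# Metrics from Equations (1)-(4) on hypothetical binary labels.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # (TP+TN)/(TP+TN+FP+FN) = 6/8 = 0.75
print(precision_score(y_true, y_pred))   # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of the two = 0.75
print(confusion_matrix(y_true, y_pred))  # rows: actual class, cols: predicted
```

On this toy data all four metrics coincide; on the imbalanced DS2OS classes they diverge, which is exactly why precision, recall, and F1 are reported alongside accuracy.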

## 3. Results

- Imbalanced training datasets (Di)—randomly selected samples from the training set;
- Balanced datasets (DRi)—all anomalous classes and randomly selected samples from class ‘NL’;
- Balanced datasets (DCi)—selected clusters of representative samples from all classes.
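The first balancing approach (datasets DRi) can be sketched as follows, assuming the data sits in a pandas frame with a class-label column; the column name `label` and the helper name are illustrative, not from the paper, while ‘NL’ is the normal class named in the text.

```python
# Sketch of building a balanced subset DRi: keep every anomalous sample and
# randomly under-sample the majority class 'NL' to a fraction of its size.
import pandas as pd

def reduce_majority(df, label_col="label", majority="NL", frac=0.05, seed=42):
    """Keep all minority-class rows; keep only `frac` of the majority class."""
    kept_majority = df[df[label_col] == majority].sample(frac=frac,
                                                         random_state=seed)
    minority = df[df[label_col] != majority]  # all anomalous classes, untouched
    return pd.concat([minority, kept_majority]).reset_index(drop=True)
```

For example, `reduce_majority(df, frac=0.05)` would correspond to dataset DR5 (5% of class ‘NL’ plus all anomalous samples).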

#### 3.1. Dataset

- Removal of corrupted data and unreadable field values;
- Change of NaN values in column ‘NodeType’ to Malicious;
- Replacement of all non-numeric values in column ‘value’ with numeric representations; all missing values in the same column filled with 0;
- Removal of the ‘timestamp’ column from the dataset, as it is irrelevant;
- Use of label encoding on all columns except column ‘value’.
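The steps above might look like the following with pandas; the column names ‘NodeType’, ‘value’, and ‘timestamp’ come from the text, while the toy frame and the concrete value mappings are illustrative assumptions.

```python
# Minimal sketch of the preprocessing steps above (pandas + scikit-learn).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "NodeType": ["/light", None, "/thermostat"],  # NaN -> Malicious
    "value": ["true", None, "21.5"],              # mixed boolean/numeric readings
    "timestamp": [1, 2, 3],                       # irrelevant, dropped below
})

df["NodeType"] = df["NodeType"].fillna("Malicious")
# replace non-numeric readings with numeric codes, fill missing values with 0
df["value"] = pd.to_numeric(df["value"].replace({"true": 1, "false": 0}),
                            errors="coerce").fillna(0)
df = df.drop(columns=["timestamp"])
# label-encode every column except the numeric 'value' column
for col in df.columns.drop("value"):
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```

After these steps, every column is numeric and the frame can be fed directly to the classifiers from Section 2.2.1.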

#### 3.1.1. Imbalanced Subsets

#### 3.1.2. Random Selection of Class ‘NL’

#### 3.1.3. Subsets of Clusters Data

The algorithm takes three inputs: the original dataset ($D_{old}$), the threshold that determines the minimum cluster size (t), and the number of representative observations we want to extract (n). First, the number of classes present in the dataset is determined and used for iterating over each class (c). Second, the observations for class c are extracted ($X_c$) from the entire dataset and input into the DBSCAN clustering algorithm [36]. Third, for each resulting cluster whose relative size exceeds the threshold t, we extract its points ($X_p$) and calculate their centroid with Equation (5):

$$q=\frac{1}{\left|X_p\right|}\sum_{x\in X_p}x$$

Fourth, we compute the distance of every cluster point $x$ to the centroid $d_p(x)$ with Equation (6):

$$d_p\left(x\right)=\Vert x-q\Vert$$

Finally, we determine how many of the points closest to the centroid to add to the new dataset ($D_{new}$), based on the parameter n and the size of the cluster. If, for example, the current cluster contains 70% of all points from class c, we extract $0.7n$ observations from it, so large clusters provide proportionally more points than small clusters. We repeat these steps first for the remaining clusters and then for the remaining classes. If there are fewer points in a cluster than the number allowed, then we simply add all of its points to $D_{new}$. This property is especially useful when dealing with imbalanced datasets, where this approach reduces the number of majority-class observations by selecting only the most representative ones, while keeping all observations of the minority classes. The time complexity of this approach is $O\left(mn\log n\right)$, where $m$ denotes the number of classes and $n\log n$ the DBSCAN clustering. This could be improved with more efficient incremental density-based clustering approaches with time complexity $O\left(nm\right)$, such as DBSCAN++ [37]. On the other hand, directly clustering all classes at once and then determining the most representative observations could also prove beneficial; however, handling multi-class clusters could be challenging.

**Algorithm 1:** Dataset reduction using clustering.

```
Input:  D_old, t, n
Output: D_new

Function DatasetReduction(D_old, t, n):
    D_new ← [ ]
    for c in findClasses(D_old) do
        X_c ← extractClassPoints(D_old, c)
        db  ← DBSCAN.fit(X_c)
        for l in db.clusters() do
            m ← size(l) / size(X_c)
            if m > t then
                X_p  ← db.extractClusterPoints(l)
                q    ← createCentroid(X_p)
                dist ← distances(q, X_p)
                D_new ← addClosestNPoints(D_new, dist, X_p, m·n)
            end
        end
    end
end
```
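A runnable Python sketch of Algorithm 1 on NumPy arrays might look as follows; the function and parameter names are illustrative (not the authors' code), DBSCAN noise points (labeled −1) are skipped, and each sufficiently large cluster contributes a share of n proportional to its size, as described above.

```python
# Sketch of Algorithm 1: per-class DBSCAN, then keep the points closest to
# each large-enough cluster's centroid (Equations (5) and (6)).
import numpy as np
from sklearn.cluster import DBSCAN

def dataset_reduction(X, y, t=0.05, n=100, eps=0.5, min_samples=5):
    kept_X, kept_y = [], []
    for c in np.unique(y):                      # iterate over each class c
        Xc = X[y == c]                          # extract class observations
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Xc)
        for l in np.unique(labels):
            if l == -1:                         # skip DBSCAN noise points
                continue
            Xp = Xc[labels == l]                # points of cluster l
            m = len(Xp) / len(Xc)               # relative cluster size
            if m > t:
                q = Xp.mean(axis=0)             # centroid, Equation (5)
                dist = np.linalg.norm(Xp - q, axis=1)   # distances, Equation (6)
                k = min(len(Xp), max(1, round(m * n)))  # proportional share of n
                closest = Xp[np.argsort(dist)[:k]]      # points nearest centroid
                kept_X.append(closest)
                kept_y.append(np.full(len(closest), c))
    return np.vstack(kept_X), np.concatenate(kept_y)
```

The `min(len(Xp), ...)` clause implements the rule that a cluster smaller than its allowed share simply contributes all of its points.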

#### 3.2. Evaluation of Imbalanced Training Datasets

#### 3.2.1. Classification Results

#### 3.2.2. Confusion Matrix (D20)

#### 3.3. Evaluation of Balanced Training Datasets with Reduced Class ‘NL’

#### 3.3.1. Classification Results

#### 3.3.2. Confusion Matrix (DR5)

#### 3.4. Evaluation of Balanced Datasets Determined with Clustering

#### 3.4.1. Classification Results

#### 3.4.2. Confusion Matrix (DC5)

#### 3.5. Comparison of ML Algorithms

#### 3.6. Edge Computing Results on Raspberry Pi 4

#### 3.6.1. Training Time

#### 3.6.2. Memory Usage

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Yousefpour, A.; Fung, C.; Nguyen, T.; Kadiyala, K.; Jalali, F.; Niakanlahiji, A.; Kong, J.; Jue, J.P. All one needs to know about fog computing and related edge computing paradigms: A complete survey. *J. Syst. Archit.* **2019**, *98*, 289–330.
2. Merenda, M.; Porcaro, C.; Iero, D. Edge Machine Learning for AI-Enabled IoT Devices: A Review. *Sensors* **2020**, *20*, 2533.
3. Premsankar, G.; Francesco, M.D.; Taleb, T. Edge Computing for the Internet of Things. *IEEE Internet Things J.* **2018**, *5*, 1275–1284.
4. Chen, J.; Ran, X. Deep Learning With Edge Computing: A Review. *Proc. IEEE* **2019**, *107*, 1655–1674.
5. Kozik, R.; Choras, M.; Ficco, M.; Palmieri, F. A scalable distributed machine learning approach for attack detection in edge computing environments. *J. Parallel Distrib. Comput.* **2018**, *119*, 18–26.
6. Poornima, I.G.A.; Paramasivan, B. Anomaly detection in wireless sensor network using machine learning algorithm. *Comput. Commun.* **2020**, *151*, 331–337.
7. Hasan, M.; Islam, M.M.; Zarif, M.I.I.; Hashem, M.M.A. Attack and anomaly detection in IoT sensors in IoT sites using machine learning approaches. *Internet Things* **2019**, *7*, 100059.
8. Elsayed, M.S.; Le-Khac, N.A.; Dev, S.; Jurcut, A.D. Network Anomaly Detection Using LSTM Based Auto-encoder. In Proceedings of the 16th ACM Symposium on QoS and Security for Wireless and Mobile Networks, Alicante, Spain, 16–20 November 2020.
9. Pang, G.; Shen, C.; Cao, L.; Hengel, A. Deep Learning for Anomaly Detection: A Review. *ACM Comput. Surv.* **2021**, *54*, 1–38.
10. Churcher, A.; Ullah, R.; Ahmad, J.; Rehman, S.; Masood, F.; Gogate, M.; Alqahtani, F.; Nour, B.; Buchanan, W.J. An Experimental Analysis of Attack Classification Using Machine Learning in IoT Networks. *Sensors* **2021**, *21*, 446.
11. Kim, J.M.; Cho, W.C.; Kim, D. Anomaly Detection of Environmental Sensor Data. In Proceedings of the 2020 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Korea, 21–23 October 2020.
12. Janjua, Z.H.; Vecchio, M.; Antonini, M.; Antonelli, F. IRESE: An intelligent rare-event detection system using unsupervised learning on the IoT edge. *Eng. Appl. Artif. Intell.* **2019**, *84*, 41–50.
13. Sajjad, M.; Nasir, M.; Muhammad, K.; Khan, S.; Jan, Z.; Sangaiah, A.K.; Elhoseny, M.; Baik, S.W. Raspberry Pi assisted face recognition framework for enhanced law-enforcement services in smart cities. *Future Gener. Comput. Syst.* **2017**, *108*, 995–1007.
14. Anandhalli, M.; Baligar, V.P. A novel approach in real-time vehicle detection and tracking using Raspberry Pi. *Alex. Eng. J.* **2017**, *57*, 1597–1607.
15. Xu, R.; Nikouei, S.Y.; Chen, Y.; Polunchenko, A.; Song, S.; Deng, C.; Faughan, T.R. Real-Time Human Objects Tracking for Smart Surveillance at the Edge. In Proceedings of the IEEE International Conference on Communications (ICC), Kansas City, MO, USA, 20–24 May 2018; pp. 20–24.
16. Komninos, A.; Simou, I.; Gkorgkolis, N.; Garofalakis, J. Performance of Raspberry Pi microclusters for Edge Machine Learning in Tourism. In Proceedings of the Poster and Workshop Sessions of AmI-2019, the 2019 European Conference on Ambient Intelligence, Rome, Italy, 4 November 2019.
17. Kamaraj, K.; Dezfouli, B.; Liu, Y. Edge Mining on IoT Devices Using Anomaly Detection. In Proceedings of the APSIPA Annual Summit and Conference 2019, Lanzhou, China, 18–21 November 2019.
18. Verma, A.; Goyal, A.; Kumara, S.; Kurfess, T. Edge-cloud computing performance benchmarking for IoT based machinery vibration monitoring. *Manuf. Lett.* **2021**, *27*, 39–41.
19. Marquez-Sanchez, S.; Campero-Jurado, I.; Robles-Camarillo, D.; Rodriguez, S.; Corchado-Rodriguez, J.M. BeSafe B2.0 Smart Multisensory Platform for Safety in Workplaces. *Sensors* **2021**, *21*, 3371.
20. Liu, C.; Su, X.; Li, C. Edge Computing for Data Anomaly Detection of Multi-Sensors in Underground Mining. *Electronics* **2021**, *10*, 302.
21. Patel, K.K.; Patel, S.M. Internet of things-IOT: Definition, characteristics, architecture, enabling technologies, application & future challenges. *Int. J. Eng. Comput. Sci.* **2016**, *6*, 6122–6131.
22. Zantalis, F.; Koulouras, G.; Karabetsos, S.; Kandris, D. A Review of Machine Learning and IoT in Smart Transportation. *Future Internet* **2019**, *11*, 94.
23. Serkani, E.; Gharaee, H.; Mohammadzadeh, N. Anomaly Detection Using SVMs as Classifier and Decision Tree for Optimizing Feature Vectors. *Int. J. Inf. Secur.* **2019**, *11*, 159–171.
24. Ergen, T.; Kozat, S.S. A Novel Distributed Anomaly Detection Algorithm Based on Support Vector Machines. *Digit. Signal Process.* **2020**, *99*, 102657.
25. Keras. Available online: https://keras.io/ (accessed on 20 November 2020).
26. Linear Models (Logistic Regression). Available online: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression (accessed on 20 November 2020).
27. Logistic Regression. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression (accessed on 20 November 2020).
28. Support Vector Machines. Available online: https://scikit-learn.org/stable/modules/svm.html#svm (accessed on 20 November 2020).
29. SVM-libsvm. Available online: https://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf (accessed on 20 November 2020).
30. Decision Trees. Available online: https://scikit-learn.org/stable/modules/tree.html (accessed on 20 November 2020).
31. DT Function. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html (accessed on 20 November 2020).
32. Forests of Randomized Trees. Available online: https://scikit-learn.org/stable/modules/ensemble.html#forest (accessed on 20 November 2020).
33. Neural Network Models. Available online: https://scikit-learn.org/stable/modules/neural_networks_supervised.html (accessed on 20 November 2020).
34. Metrics and Scoring: Quantifying the Quality of Predictions. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html (accessed on 20 November 2020).
35. DS2OS Traffic Traces. Available online: https://www.kaggle.com/francoisxa/ds2ostraffictraces (accessed on 20 November 2020).
36. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
37. Jang, J.; Jiang, H. DBSCAN++: Towards fast and scalable density clustering. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3019–3029.
38. Raspberry Pi 4. Available online: https://www.raspberrypi.org/products/raspberry-pi-4-model-b/specifications (accessed on 15 January 2021).
39. Resource. Available online: https://docs.python.org/3/library/resource.html (accessed on 18 June 2021).

**Figure 2.** Determination of randomly selected training dataset (80%) and test dataset (20%) from the original dataset (DS2OS), and imbalanced datasets D1, D2, …, D80 from the training dataset (D100).

**Figure 4.** Diagram of DBSCAN clustering procedure with value three for the minimum neighboring points threshold. DBSCAN discriminates between core (blue), non-core (green), and noise (orange) points.

**Figure 5.** Determination of balanced datasets DCi with selection of representative samples using the clustering method.

**Figure 6.** Evaluation results on test dataset for imbalanced training datasets Di using: (**a**) LR (coincidence of accuracy and recall); (**b**) SVM (coincidence of accuracy and recall).

**Figure 7.** Evaluation results for imbalanced training datasets Di using: (**a**) DT (coincidence of accuracy, recall, and precision only for datasets larger than D1); (**b**) RF (coincidence of accuracy, recall, and precision only for datasets larger than D1).

**Figure 8.** Evaluation results for imbalanced training datasets Di using ANN (coincidence of accuracy and recall).

**Figure 9.** Accuracy measure on imbalanced training datasets Di: (**a**) training accuracy (coincidence of RF and DT); (**b**) test accuracy (coincidence of RF and DT).

**Figure 10.** Confusion matrix on test dataset for training dataset D20: (**a**) LR; (**b**) SVM; (**c**) DT; (**d**) RF; (**e**) ANN.

**Figure 11.** Accuracy measure for ML algorithms on balanced training datasets DRi: (**a**) training accuracy (coincidence of RF, SVM, and DT); (**b**) test accuracy (coincidence of RF and DT).

**Figure 12.** Confusion matrix on test dataset for training dataset DR5: (**a**) LR; (**b**) SVM; (**c**) DT; (**d**) RF; (**e**) ANN.

**Figure 13.** Accuracy measure for ML algorithms on balanced training datasets DCi: (**a**) training accuracy (coincidence of RF and DT); (**b**) test accuracy.

**Figure 14.** Confusion matrix on test dataset for training dataset DC5: (**a**) LR; (**b**) SVM; (**c**) DT; (**d**) RF; (**e**) ANN.

**Figure 16.** F1 score on test dataset for: (**a**) balanced datasets DRi with reduced class ‘NL’ (coincidence of RF and DT); (**b**) balanced datasets DCi with clustering.

**Figure 20.** F1 score on test dataset for training datasets Di, DRi, and DCi: (**a**) ML algorithms for most of the datasets, without LR_DRi, DT_DCi, and RF_DCi, which are not suitable for small datasets (coincidence of LR_Di, DT_Di, DT_DRi, RF_Di, RF_DRi, and ANN_Di); (**b**) SVM, DT, and RF for smaller datasets up to 20% of the training dataset (coincidences of: DT_DRi and RF_DRi; DT_Di and RF_Di).

**Figure 21.** Raspberry Pi 4 training time for: (**a**) Di, DRi, DCi (coincidences of: RF_Di and RF_DRi; DT_Di, DT_DRi, and DT_DCi; LR_Di, LR_DRi, and LR_DCi); (**b**) LR and SVM for Di, DRi, DCi (coincidence of LR_Di, LR_DRi, and LR_DCi).

**Figure 22.** Raspberry Pi 4 memory usage for datasets Di, DRi, DCi, with coincidence of the LR and DT algorithms at a RAM usage of 360 MB and coincidence of the RF algorithm for datasets larger than D1, DR1, and DC1 at up to 390 MB.

**Table 1.** Random distribution of the DS2OS dataset into training/test datasets and subsets of the training dataset (D1, D2, D5, D10, D15, D20, D40, D60, D80, D100).

| Dataset | Anomalous Data | Normal Data | Total |
|---|---|---|---|
| Original dataset (DS2OS) | 10,017 | 347,924 | 357,941 |
| Training dataset (80%) | 8088 | 278,264 | 286,352 |
| Test dataset (20%) | 1929 | 69,660 | 71,589 |
| D1 (1%) | 102 | 2761 | 2863 |
| D2 (2%) | 179 | 5548 | 5727 |
| D5 (5%) | 410 | 13,908 | 14,318 |
| D10 (10%) | 842 | 27,793 | 28,635 |
| D15 (15%) | 1220 | 41,732 | 42,952 |
| D20 (20%) | 1612 | 55,658 | 57,270 |
| D40 (40%) | 3224 | 111,316 | 114,540 |
| D60 (60%) | 4831 | 166,980 | 171,811 |
| D80 (80%) | 6456 | 222,625 | 229,081 |
| D100 (Training dataset) | 8088 | 278,264 | 286,352 |

**Table 2.** Subsets of classes with anomalous data and a percentage of samples from class ‘NL’ with normal data.

| Dataset | Anomalous Data | Normal Data | Total |
|---|---|---|---|
| DR01 (0.1%) | 8088 | 278 | 8366 |
| DR02 (0.2%) | 8088 | 557 | 8645 |
| DR05 (0.5%) | 8088 | 1391 | 9479 |
| DR1 (1%) | 8088 | 2783 | 10,871 |
| DR2 (2%) | 8088 | 5565 | 13,653 |
| DR5 (5%) | 8088 | 13,913 | 22,001 |
| DR10 (10%) | 8088 | 27,826 | 35,914 |
| DR15 (15%) | 8088 | 41,740 | 49,828 |
| DR20 (20%) | 8088 | 55,653 | 63,741 |

| Dataset | Anomalous Data | Normal Data | Total |
|---|---|---|---|
| DC01 | 1770 | 256 | 2026 |
| DC02 | 3061 | 535 | 3596 |
| DC05 | 4834 | 1395 | 6229 |
| DC1 | 6266 | 2815 | 9081 |
| DC2 | 8088 | 5656 | 13,744 |
| DC5 | 8088 | 14,177 | 22,265 |
| DC10 | 8088 | 28,379 | 36,467 |
| DC15 | 8088 | 42,580 | 50,668 |
| DC20 | 8088 | 56,784 | 64,872 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Huč, A.; Šalej, J.; Trebar, M. Analysis of Machine Learning Algorithms for Anomaly Detection on Edge Devices. *Sensors* **2021**, *21*, 4946.
https://doi.org/10.3390/s21144946
