# MEOD: Memory-Efficient Outlier Detection on Streaming Data


## Abstract


## 1. Introduction

- Distance-based outlier detection: The distances between data points are calculated, and outliers are the points whose distances are much larger than the average distance [1,2]. Compared with statistical- and density-based techniques, this technique is simpler to use, and unlike the statistical-based approach, it requires no prior assumptions about the data distribution.
- Density-based outlier detection: This approach compares the density of an object with the density of its neighboring objects. Outlier points have a lower density than their neighbors. The local outlier factor (LOF) [3] and the local correlation integral (LOCI) [4] are popular density-based approaches provided by most machine learning libraries. Some methods, for example DBSCAN [5], use a clustering technique for outlier detection, treating the outliers as a by-product of clustering: clustering is applied first, and then the points far away from the centroids are identified as outliers.
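As a quick illustration of density-based detection, the following sketch uses scikit-learn's off-the-shelf `LocalOutlierFactor` (a standard LOF implementation, not the MEOD method of this paper); the data and parameter values are invented:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(100, 2))        # one dense Gaussian cluster
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [7.0, -8.0]])
X = np.vstack([inliers, outliers])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                          # -1 = outlier, 1 = inlier
print(labels[-3:])                                   # the three injected far points
```

Points in sparse regions receive a much lower local density than their neighbors and are labeled −1.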

This paper proposes MEOD, a memory-efficient outlier detection technique with the following characteristics:

- It is a local outlier detection technique over streaming data based on the local correlation integral (LOCI) technique.
- It works with limited memory resources using a data summarization mechanism.
- It uses a swarm intelligence technique to find the optimal value for radius r for LOCI calculations.
- To improve the efficiency of the algorithm, MEOD finds an outlier factor value for only candidate points rather than the whole dataset.

## 2. Preliminaries

#### 2.1. Local Correlation Integral

The local correlation integral (LOCI) technique computes for each point a multi-granularity deviation factor (MDEF) and a normalized deviation factor (σ_MDEF) at radius r. Using these factors, the outliers can be identified. The MDEF is calculated as:

$$MDEF({p}_{i},r,\alpha )=1-\frac{n({p}_{i},\alpha r)}{\widehat{n}({p}_{i},r,\alpha )}$$

where p_i is part of a set of objects, P = {p_1, …, p_i, …, p_N}, n(p_i, αr) is the number of objects in the αr-neighborhood of p_i, and n̂(p_i, r, α) is the average of the counts n(p, αr) over all objects p present in the r-neighborhood of p_i. It is calculated as:

$$\widehat{n}({p}_{i},r,\alpha )=\frac{{\displaystyle \sum _{p\in N({p}_{i},r)}n(p,\alpha r)}}{n({p}_{i},r)}$$

The normalized deviation factor is the standard deviation of the counts n(p, αr) in the r-neighborhood of p_i, normalized by n̂(p_i, r, α), and can be calculated as:

$${\sigma }_{MDEF}({p}_{i},r,\alpha )=\frac{{\sigma }_{\widehat{n}}({p}_{i},r,\alpha )}{\widehat{n}({p}_{i},r,\alpha )}$$

A point p_i is flagged as an outlier if:

$$MDEF({p}_{i},r,\alpha )>{k}_{\sigma }\cdot {\sigma }_{MDEF}({p}_{i},r,\alpha )$$

where k_σ is a constant value set as 3.
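The LOCI test described in this section can be sketched in a brute-force way as follows; this is a hypothetical O(N²) illustration with invented radius and α values, not the paper's optimized implementation:

```python
import numpy as np

def loci_outliers(X, r, alpha=0.5, k_sigma=3.0):
    """Boolean mask of LOCI outliers for points X (N x D), brute force."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n_alpha = (dist <= alpha * r).sum(axis=1)      # n(p_i, alpha*r), self included
    flags = np.zeros(len(X), dtype=bool)
    for i in range(len(X)):
        neigh = np.flatnonzero(dist[i] <= r)       # r-neighborhood of p_i
        counts = n_alpha[neigh]                    # n(p, alpha*r) for each neighbor p
        n_hat = counts.mean()                      # average neighborhood count
        mdef = 1.0 - n_alpha[i] / n_hat
        sigma_mdef = counts.std() / n_hat          # normalized deviation factor
        flags[i] = mdef > k_sigma * sigma_mdef     # LOCI outlier test, k_sigma = 3
    return flags

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.4, size=(60, 2)), [[6.0, 6.0]]])
flags = loci_outliers(X, r=10.0, alpha=0.5)
print(flags[-1])   # the isolated point is flagged
```

The pairwise distance matrix makes the neighborhood counts trivial to obtain but costs O(N²) memory, which is exactly the kind of cost MEOD's summarization avoids on streams.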

#### 2.2. Particle Swarm Optimization

The position X_i and velocity V_i of each particle i are updated at each iteration using the following formulas:

$${V}_{i}(t+1)={V}_{i}(t)+{\theta }_{1}{c}_{1}\left({X}_{i}^{best}-{X}_{i}(t)\right)+{\theta }_{2}{c}_{2}\left({X}_{i}^{gbest}-{X}_{i}(t)\right)$$

$${X}_{i}(t+1)={X}_{i}(t)+{V}_{i}(t+1)$$

where θ_1 and θ_2 are acceleration coefficients; c_1 and c_2 are random numbers in the range [0,1]; X_i^best is the best position of particle i; and X_i^gbest is the global best position found among the neighborhood particles of i.
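The PSO update rule can be sketched on a toy one-dimensional objective (an invented quadratic standing in for a radius-quality function). The inertia weight `w = 0.7` is an added assumption not stated in this section, and the acceleration value 2.02 mirrors the implementation details:

```python
import numpy as np

def pso_minimize(f, lo, hi, n_particles=20, iters=100,
                 w=0.7, theta1=2.02, theta2=2.02, seed=0):
    """Minimize f over [lo, hi] with a basic global-best PSO."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n_particles)              # positions X_i
    v = np.zeros(n_particles)                         # velocities V_i
    pbest, pbest_val = x.copy(), np.array([f(xi) for xi in x])
    gbest = pbest[pbest_val.argmin()]                 # global best position
    for _ in range(iters):
        c1, c2 = rng.random(n_particles), rng.random(n_particles)
        v = w * v + theta1 * c1 * (pbest - x) + theta2 * c2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(xi) for xi in x])
        better = vals < pbest_val                     # update personal bests
        pbest[better], pbest_val[better] = x[better], vals[better]
        gbest = pbest[pbest_val.argmin()]
    return gbest

# Toy stand-in for a radius-quality objective: the best radius would be r = 3
r_opt = pso_minimize(lambda r: (r - 3.0) ** 2, 0.0, 10.0)
print(round(float(r_opt), 3))
```

The swarm converges to the minimizer of the objective, which is how MEOD removes the user-defined radius dependency.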

#### 2.3. Memory-Efficient Approach

- Summarization: Due to memory constraints, it is not feasible to preserve all the data points in the stream. However, if the previous data points are simply deleted, new events cannot be distinguished from past ones, which reduces the accuracy of local outlier evaluation since no history of the data is considered. In the summarization phase, a summary of the previous data points is therefore preserved. For every window slot, b/2 points are processed and summarized using a clustering technique. A large number f of clusters is set, and the cluster centers are preserved as summary information together with their multi-granularity deviation factor (MDEF) and normalized deviation factor (σ_MDEF) values at radius r. The remaining points are deleted, and then the next slot is processed.
- Merging: The cluster centers generated in the previous sliding window i−1 and the clusters generated in the current sliding window i are merged in this phase, and a single set of cluster centers is preserved. The centers from sliding windows i−1 and i are merged using a weighted c-means clustering algorithm, where each point has a weight that reflects its importance in the clustering process. Since each point in this process is itself a cluster center, the count of its cluster members is assigned as its weight. After the clustering process, the cluster centers are updated as the weighted mean of their members:$${z}_{j}=\frac{{\displaystyle \sum _{{x}_{i}\in {X}_{j},{w}_{i}\in {W}_{j}}{w}_{i}{x}_{i}}}{{\displaystyle \sum _{{w}_{i}\in {W}_{j}}{w}_{i}}}$$
- Revised insertion: In the revised insertion phase, processing uses the b/2 points in the current sliding window together with the summarized points preserved in memory. The outlier status of a point is determined using its MDEF values. If a point lies within a radius of previously summarized points, it is not considered an outlier, so there is no need to calculate the outlier factor of such points.
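The weighted center update z_j used in the merging phase can be sketched as follows; the member points and weights are invented values:

```python
import numpy as np

def merge_center(points, weights):
    """z_j = sum(w_i * x_i) / sum(w_i) over the members of cluster j."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (weights[:, None] * points).sum(axis=0) / weights.sum()

# Two preserved centers with member counts 30 and 10 merged into one cluster:
z = merge_center([[0.0, 0.0], [4.0, 4.0]], [30, 10])
print(z)   # pulled toward the heavier center
```

A center summarizing many stream points dominates the merged position, so the summary stays faithful to the denser history.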

## 3. Proposed Outlier Detection Technique

- Revised insertion: For each sliding window, the outliers are detected using the LOCI technique, finding the outlier based on the density of neighboring points defined using the radius r. The radius value can be user-defined, but in order to remove such dependency, the r-value is automatically found using an optimization function applying the PSO-based approach to find the optimal value of the radius r.
- Summarization: The data representative points are extracted using a k-means clustering algorithm. The cluster centroids, treated as representative points for the rest of the cluster points, are preserved as summary information with their multi-granularity deviation factor (MDEF) and normalized deviation factor (σ_MDEF) values at radius r.
- Merging: The summary generated in the current sliding window is merged with the previous sliding window summary using the weighted c-means algorithm.

In the revised insertion step, the MDEF and σ_MDEF values are calculated only for candidate points that are not within a radius of previously preserved points (steps 16 to 19). The outlier points are detected in step 20 using Equation (5).

In the summarization step, the b/2 points in the sliding window are partitioned by the k-means algorithm into clusters C_i with V_i centroids. The calculation is completed in each sliding window, and the current sliding window result is merged with the previous sliding window result. For merging, the MDEF(V_i, r, α) and σ_MDEF(V_i, r, α) values are calculated for each centroid V_i in step 5 using the following formulas:

$$MDEF({V}_{i},r,\alpha )=\frac{{\displaystyle \sum _{p\in {C}_{i}}MDEF(p,r,\alpha )}}{|{C}_{i}|}$$

$${\sigma }_{MDEF}({V}_{i},r,\alpha )=\frac{{\displaystyle \sum _{p\in {C}_{i}}{\sigma }_{MDEF}(p,r,\alpha )}}{|{C}_{i}|}$$

where |C_i| represents the number of points in cluster C_i, i.e., each centroid carries the average of the MDEF and σ_MDEF values of all its cluster points. The centroids are preserved, and the b/2 points are removed from memory in step 6. The weighted c-means is applied in step 7, and the centroid points are updated again with MDEF and σ_MDEF values in step 8. The previous centroids are deleted, and the summary is updated in step 9, preserving in memory as history only the important reference points using the memory-efficiency approach.
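The candidate-point filter used in revised insertion can be sketched as follows: a new point that lies within radius r of a previously summarized centroid is treated as a non-outlier, and the MDEF computation is skipped for it. The function and variable names here are illustrative, not the paper's identifiers:

```python
import numpy as np

def candidate_points(new_points, centroids, r):
    """Indices of new points NOT covered by any preserved centroid."""
    new_points = np.asarray(new_points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    dist = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2)
    covered = (dist <= r).any(axis=1)        # within radius of some summary point
    return np.flatnonzero(~covered)          # only these need MDEF / sigma_MDEF

cands = candidate_points([[0.1, 0.2], [9.0, 9.0]],
                         [[0.0, 0.0], [2.0, 2.0]], r=1.0)
print(cands)   # only the far point remains a candidate
```

Restricting the outlier-factor computation to this candidate set is what keeps the per-window cost low.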

## 4. Implementation Details

The acceleration coefficients θ_1 and θ_2 required for the velocity calculation are set to 2.02.

The data attributes are normalized using min–max normalization:

$${f}_{i}^{\prime }=\frac{{f}_{i}-{f}_{i}^{min}}{{f}_{i}^{max}-{f}_{i}^{min}}$$

where f_i is the attribute value, and f_i^max and f_i^min are the maximum and minimum values of attribute f_i.
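The per-attribute min–max normalization can be sketched as follows, with invented sample data:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each attribute (column) of X into [0, 1]."""
    X = np.asarray(X, dtype=float)
    f_min, f_max = X.min(axis=0), X.max(axis=0)   # per-attribute extremes
    return (X - f_min) / (f_max - f_min)

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
Xn = min_max_normalize(X)
print(Xn)
```

Normalizing attributes to a common range keeps the distance computations from being dominated by large-valued attributes.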

## 5. Results

#### 5.1. Accuracy Analysis

#### 5.2. Time and Memory Analysis

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Radovanovic, M.; Nanopoulos, A.; Ivanovic, M. Reverse nearest neighbors in unsupervised distance-based outlier detection. *IEEE Trans. Knowl. Data Eng.* **2015**, 27, 1369–1382.
2. Zhang, K.; Hutter, M.; Jin, H. A new local distance-based outlier detection approach for scattered real-world data. In *Advances in Knowledge Discovery and Data Mining (PAKDD 2009), Lecture Notes in Computer Science*; Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5476.
3. Breunig, M.; Kriegel, H.-P.; Ng, R.; Sander, J. LOF: Identifying density-based local outliers. *Proc. ACM SIGMOD Int. Conf. Manag. Data (SIGMOD'00)* **2000**, 29, 93–104.
4. Papadimitriou, S.; Gibbons, P.B.; Faloutsos, C. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th International Conference on Data Engineering, Bangalore, India, 5–8 March 2003; pp. 315–326.
5. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96); AAAI Press, Association for Computing Machinery: New York, NY, USA, 1996; Volume 96, pp. 226–231.
6. Chen, F.; Lu, C.-T.; Boedihardjo, A.P. GLS-SOD: A generalized local statistical approach for spatial outlier detection. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2010; pp. 1069–1078.
7. Liu, X.; Lu, C.-T.; Chen, F. Spatial outlier detection: Random walk based approaches. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS'10); Association for Computing Machinery: New York, NY, USA, 2010; pp. 370–379.
8. Kriegel, H.-P.; Kröger, P.; Schubert, E.; Zimek, A. LoOP: Local outlier probabilities. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM'09); Association for Computing Machinery: New York, NY, USA, 2009; pp. 1649–1652.
9. Mukhopadhyay, A.; Maulik, U.; Bandyopadhyay, S.; Coello, C.A.C. A survey of multiobjective evolutionary algorithms for data mining: Part I. *IEEE Trans. Evol. Comput.* **2014**, 18, 4–19.
10. Misinem; Bakar, A.A.; Hamdan, A.R.; Nazri, M.Z.A. A rough set outlier detection based on particle swarm optimization. In Proceedings of the 10th International Conference on Intelligent Systems Design and Applications, Cairo, Egypt, 29 November–1 December 2010; pp. 1021–1025.
11. Alam, S.; Dobbie, G.; Koh, Y.S.; Riddle, P. Web bots detection using particle swarm optimization based clustering. In Proceedings of the 2014 IEEE Congress on Evolutionary Computation (CEC), Beijing, China, 6–11 July 2014; pp. 2955–2962.
12. Mohemmed, A.W.; Zhang, M.; Will, B. Particle swarm optimisation for outlier detection. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (GECCO'10); Association for Computing Machinery: New York, NY, USA, 2010; pp. 83–84.
13. Hashmi, A.; Doja, M.; Ahmad, T. An optimized density-based algorithm for anomaly detection in high dimensional datasets. *Scalable Comput. Pract. Exp.* **2018**, 19, 69–77.
14. Knorr, E.M.; Ng, R.T.; Tucakov, V. Distance-based outliers: Algorithms and applications. *VLDB J.* **2000**, 8, 237–253.
15. Karale, A. Outlier detection methods and the challenges for their implementation with streaming data. *J. Mob. Multimed.* **2020**, 16, 351–388.
16. Pokrajac, D.; Lazarevic, A.; Latecki, L.J. Incremental local outlier detection for data streams. In Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining, Honolulu, HI, USA, 1 March–5 April 2007; pp. 504–515.
17. Salehi, M.; Leckie, C.; Bezdek, J.C.; Vaithianathan, T.; Zhang, X. Fast memory efficient local outlier detection in data streams. *IEEE Trans. Knowl. Data Eng.* **2016**, 28, 3246–3260.
18. Karale, A.; Lazarova, M.; Koleva, P.; Poulkov, V. A hybrid PSO-MiLOF approach for outlier detection in streaming data. In Proceedings of the 43rd International Conference on Telecommunications and Signal Processing (TSP), Milan, Italy, 7–9 July 2020; pp. 474–479.
19. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 8 March 2021).
20. Kaggle Wine Dataset. Available online: https://www.kaggle.com/rishidamarla/2d-3d-pca-t-sne-and-umap-on-wine-dataset (accessed on 8 March 2021).

**Figure 4.** Effect of parameter K on the outlier detection process: (**a**) K = 5; (**b**) K = 10; (**c**) K = 15; (**d**) K = 20; (**e**) using LOCI.

**Figure 6.** Comparison of the execution time and the memory requirements of memory-efficient incremental local outlier factor (MiLOF) and MEOD for several UCI datasets: (**a**) execution time; (**b**) memory required.

**Figure 7.** Comparison of the execution time and the memory requirements of MiLOF and MEOD for outlier detection in the synthetic dataset with different sliding window sizes: (**a**) execution time; (**b**) memory requirement.

| Sr. No. | Dataset | Data Points (n) | Dimensions (D) |
|---|---|---|---|
| 1 | UCI Vowel (Vl) | 1040 | 10 |
| 2 | UCI Glass | 214 | 10 |
| 3 | UCI Pendigit (Pt) | 3600 | 16 |
| 4 | IBRL | 3000 | 2 |
| 5 | Kaggle Wine | 177 | 2 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Karale, A.; Lazarova, M.; Koleva, P.; Poulkov, V.
MEOD: Memory-Efficient Outlier Detection on Streaming Data. *Symmetry* **2021**, *13*, 458.
https://doi.org/10.3390/sym13030458
