# A Mixed Clustering Approach for Real-Time Anomaly Detection


## Abstract


## 1. Introduction

- Secondly, the similarity measure of a pair of clusters is expressed in an equivalent manner and a merge function is defined in terms of it.
- Finally, a two-phase method for anomaly detection is proposed.

## 2. Related Works

## 3. Problem Definitions

**Definition 1.** Let $D$ be a dataset described by $d$ categorical attributes $A_1, A_2, \dots, A_d$. The domain $\mathrm{dom}(A_i) = \{a_{i1}, a_{i2}, \dots, a_{im}\}$, $i = 1, 2, \dots, d$, comprises the finite, unordered possible values that can be taken by attribute $A_i$, such that for any $a, b \in \mathrm{dom}(A_i)$, either $a = b$ or $a \neq b$. Any data instance $x_i$ is a vector $(x_{i1}, x_{i2}, \dots, x_{id})'$, where $x_{ip} \in \mathrm{dom}(A_p)$, $p = 1, 2, \dots, d$. The distance $d(x_i, C_j)$ between data instance $x_i$ and cluster $C_j$, $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, k$, as proposed in [6,14], is given by

$$d(x_i, C_j) = \sum_{p=1}^{d} w_p \, d(x_{ip}, C_j) \tag{1}$$

Here $w_p$, the weight factor associated with each attribute, describes the importance of the attribute, which controls the contribution of the attribute–cluster distance to the instance–cluster distance. The attribute–cluster distance between $x_{ip}$ and $C_j$ is proposed as follows:

$$d(x_{ip}, C_j) = 1 - \frac{\left|\{x \in C_j : x_p = x_{ip}\}\right|}{\left|C_j\right|} \tag{2}$$

Thus $d(x_{ip}, C_j) \in [0, 1]$: $d(x_{ip}, C_j) = 0$ only if every data instance in $C_j$ takes the value $x_{ip}$ on $A_p$, and $d(x_{ip}, C_j) = 1$ only if no data instance in $C_j$ takes that value. Combining Equations (1) and (2), the distance becomes

$$d(x_i, C_j) = \sum_{p=1}^{d} w_p \left(1 - \frac{\left|\{x \in C_j : x_p = x_{ip}\}\right|}{\left|C_j\right|}\right) \tag{3}$$

with $d(x_i, C_j) \in [0, 1]$, $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, k$.
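The frequency-based categorical distance can be sketched in a few lines of Python. This is an illustrative reading of Equations (2) and (3), not the authors' implementation; the protocol/service example data are made up.

```python
from collections import Counter

def attr_cluster_distance(value, cluster_column):
    """Distance between a categorical value and a cluster along one
    attribute: 1 minus the relative frequency of the value in the
    cluster (0 when every member matches, 1 when none does)."""
    counts = Counter(cluster_column)
    return 1.0 - counts[value] / len(cluster_column)

def instance_cluster_distance(instance, cluster_rows, weights):
    """Weighted sum of per-attribute distances, as in Equation (3);
    `cluster_rows` is a list of categorical tuples, `weights` sum to 1."""
    return sum(
        w * attr_cluster_distance(v, [row[p] for row in cluster_rows])
        for p, (v, w) in enumerate(zip(instance, weights))
    )

cluster = [("tcp", "http"), ("tcp", "ftp"), ("udp", "http")]
d = instance_cluster_distance(("tcp", "http"), cluster, [0.5, 0.5])
# "tcp" and "http" each appear in 2 of 3 rows, so d = 0.5*(1/3) + 0.5*(1/3) = 1/3
```

Note that an instance whose values never occur in the cluster attains the maximum distance of 1, matching the bound stated above.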

**Definition 2.** The importance $I_A$ of an attribute is quantified by the entropy metric as follows:

$$H_{A_p} = -\sum_{t=1}^{m_p} P(a_{pt}) \log P(a_{pt}), \qquad P(a_{pt}) = \frac{\left|\{x \in D : x_p = a_{pt}\}\right|}{\left|D\right|} \tag{4}$$

Attribute $A_p$'s ($p \in \{1, 2, \dots, d\}$) importance can then be evaluated by the formula

$$I(A_p) = H_{A_p} \tag{5}$$

where $a_{pt} \in \mathrm{dom}(A_p)$, $m_p$ is $A_p$'s total number of possible values, and $D$ is the whole dataset. From Equation (5) it is concluded that an attribute's importance is directly proportional to the number of distinct values of the categorical attribute. In practice, however, an attribute with immensely diverse values contributes little to the cluster structure. Hence, Equation (5) can further be modified as

$$I(A_p) = \frac{H_{A_p}}{m_p} \tag{6}$$

Initially, the weights are taken to be equal, $w_p = 1/d$, with $p = 1, 2, \dots, d$. Consequently, the instance (object)–cluster distance in Equation (3) can be modified by replacing the uniform weights with the normalized importance values,

$$w_p = \frac{I(A_p)}{\sum_{q=1}^{d} I(A_q)} \tag{7}$$
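The entropy-based weighting can be sketched as follows. The division by the number of distinct values and the final normalization are assumptions of this reading of Equation (5) and its modification; the dataset is illustrative.

```python
import math
from collections import Counter

def attribute_entropy(column):
    """Empirical entropy of one categorical attribute over the dataset D."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def attribute_weights(dataset):
    """Importance taken as entropy divided by the number of distinct
    values (one reading of the modified formula), normalized to sum to 1."""
    columns = list(zip(*dataset))
    importance = [attribute_entropy(col) / len(set(col)) for col in columns]
    total = sum(importance)
    if total == 0.0:                 # every attribute constant: fall back to w_p = 1/d
        return [1.0 / len(columns)] * len(columns)
    return [v / total for v in importance]

data = [("tcp", "http"), ("udp", "http"), ("tcp", "ftp"), ("tcp", "http")]
w = attribute_weights(data)
# both columns have the same 3-vs-1 value profile, so the weights are equal
```

The penalty by the number of distinct values reflects the remark above: a highly diverse attribute, whose raw entropy is large, should not dominate the clustering.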

**Definition 3.** Let $x_i = (x_{i1}, x_{i2}, \dots, x_{in})$ denote the numeric attributes of a data instance $x_i$. The distance $d(x_i, C_j)$ between $x_i$, $i = 1, 2, \dots, n$, and cluster $C_j$, $j = 1, 2, \dots, k$, is defined as the Euclidean distance to the cluster centroid,

$$d(x_i, C_j) = \lVert x_i - c_j \rVert \tag{9}$$

where $c_j$ is the centroid of cluster $C_j$ and the numeric attributes are scaled so that $d(x_i, C_j) \in [0, 1]$.

**Definition 4.** Let $x_i = [x_i^c, x_i^n]$ be a data instance with categorical part $x_i^c = (x_{i1}^c, x_{i2}^c, \dots, x_{id_c}^c)$ and numeric part $x_i^n = (x_{i1}^n, x_{i2}^n, \dots, x_{id_n}^n)$, where $d_c + d_n = d$. Using Equations (1) and (9), the distance $d_1(x_i, C_j)$ between the data instance $x_i$ and cluster $C_j$ is defined [10,37] as follows:

$$d_1(x_i, C_j) = \sum_{p=1}^{d_c} w_p \, d(x_{ip}^c, C_j) + w_{d_c+1} \lVert x_i^n - c_j^n \rVert \tag{10}$$

where $c_j^n$ is the centroid of cluster $C_j$ restricted to the numeric attributes. Expanding the categorical part with Equation (2), $d(x_i, C_j)$ can be rewritten as follows:

$$d(x_i, C_j) = \sum_{p=1}^{d_c} w_p \left(1 - \frac{\left|\{x \in C_j : x_p = x_{ip}^c\}\right|}{\left|C_j\right|}\right) + w_{d_c+1} \lVert x_i^n - c_j^n \rVert \tag{11}$$

with $d(x_i, C_j) \in [0, 1]$; if $x_i \in C_j$, then $d(x_i, C_j) = 0$. In Equation (11), the numeric attributes are included as a whole in the Euclidean distance; hence, they can be treated as one indivisible component to which only one weight is assigned. Thus, there are $d_c + 1$ attribute weights in total, and their sum equals 1. Under such settings, the attribute weights can be taken as

$$w_p = \frac{1}{d_c + 1}, \qquad p = 1, 2, \dots, d_c + 1 \tag{12}$$
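A compact sketch of the mixed distance in the spirit of Equation (11). The assumptions here are mine: numeric attributes pre-scaled to [0, 1], and the Euclidean term divided by the square root of the numeric dimension so that it also stays in [0, 1]; names and data are illustrative.

```python
import math

def mixed_distance(inst_cat, inst_num, cluster_cat_rows, centroid_num, weights):
    """Mixed instance-cluster distance: weighted frequency-based
    categorical distances plus one Euclidean term for the numeric part,
    treated as a single indivisible component. `weights` holds d_c
    categorical weights plus one numeric weight, summing to 1."""
    d_c = len(inst_cat)
    size = len(cluster_cat_rows)
    cat_part = sum(
        weights[p] * (1.0 - sum(row[p] == inst_cat[p] for row in cluster_cat_rows) / size)
        for p in range(d_c)
    )
    # dividing by sqrt(d_n) keeps the Euclidean term inside [0, 1]
    num_part = math.dist(inst_num, centroid_num) / math.sqrt(len(inst_num))
    return cat_part + weights[d_c] * num_part

cluster_cat = [("tcp",), ("tcp",)]
d0 = mixed_distance(("tcp",), (0.2, 0.4), cluster_cat, (0.2, 0.4), [0.5, 0.5])
# the instance coincides with the cluster profile, so d0 is 0.0
```

With both weights at 0.5, a categorical mismatch alone contributes 0.5, consistent with each component being bounded by its weight.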

**Definition 5.** Let $C_i$ and $C_j$, $\{i, j = 1, 2, \dots, k$ and $i \neq j\}$, be two clusters obtained in the partitioning phase, and let $c_i$ and $c_j$ be their centroids; then the similarity measure [14] $S(C_i, C_j)$ between $C_i$ and $C_j$ is expressed as

$$S(C_i, C_j) = \frac{S_n(C_i, C_j) + 1 - S_c(C_i, C_j)}{2} \tag{13}$$

where

- $S_n(C_i, C_j)$ = the similarity in the numeric attributes, computed from the centroids $c_i$ and $c_j$;
- $S_c(C_i, C_j)$ = the similarity of $C_i$ and $C_j$ on the categorical attributes [14].

Since $S_n(C_i, C_j) \in [0, 1]$ and $S_c(C_i, C_j) \in [0, 1]$, it follows that $S(C_i, C_j) \in [0, 1]$. For identical cluster pairs, $S(C_i, C_j) = 0$, and $S(C_i, C_j) = 1$ for completely dissimilar pairs.
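The combined measure can be sketched as below. The specific forms chosen for the two components are stand-ins of mine (normalized centroid distance for the numeric part, per-attribute frequency overlap for the categorical part); [14] defines the exact forms.

```python
import math
from collections import Counter

def numeric_dissimilarity(c_i, c_j):
    # S_n: Euclidean distance between centroids, scaled by sqrt(d_n)
    # so it stays in [0, 1] for [0, 1]-valued attributes (assumed form).
    return math.dist(c_i, c_j) / math.sqrt(len(c_i))

def categorical_similarity(rows_i, rows_j, d_c):
    # S_c: average overlap of per-attribute value frequencies (assumed form).
    sim = 0.0
    for p in range(d_c):
        f_i = Counter(r[p] for r in rows_i)
        f_j = Counter(r[p] for r in rows_j)
        shared = set(f_i) & set(f_j)
        sim += sum(min(f_i[v] / len(rows_i), f_j[v] / len(rows_j)) for v in shared)
    return sim / d_c

def cluster_similarity(c_i, c_j, rows_i, rows_j, d_c):
    """S(C_i, C_j) = (S_n + 1 - S_c)/2: 0 for identical pairs,
    1 for completely dissimilar pairs."""
    return (numeric_dissimilarity(c_i, c_j)
            + 1.0 - categorical_similarity(rows_i, rows_j, d_c)) / 2.0

rows = [("tcp",), ("udp",)]
s_same = cluster_similarity((0.3, 0.7), (0.3, 0.7), rows, rows, 1)  # identical pair -> 0.0
```

Whatever concrete $S_n$ and $S_c$ are used, the averaging in Equation (13) keeps the combined value in [0, 1], which is what the merge threshold in Definition 13 relies on.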

**Definition 6.** A fuzzy set $A$ in $X$ is characterized by a membership function $\mu_A(x) \in [0, 1]$, $x \in X$, where $\mu_A(x)$ represents the grade of membership of $x$ in $A$ (see e.g., [43]).

**Definition 7.**

**Definition 8.** A fuzzy set $A$ of the real line $R$ for which there exists $x_0 \in R$ such that $\mu_A(x_0) = 1$, and whose membership function $\mu_A(x)$ is piecewise continuous, is called a fuzzy number.

**Definition 9.** A fuzzy set $A$ of the real line $R$ is called a fuzzy interval if $\mu_A(x_0) = 1$ for all $x_0$ in some closed interval $[a, b]$, and $\mu_A(x)$ is piecewise continuous.

**Definition 10.** The support of a fuzzy set $A$ in $X$ is the crisp set containing every element of $X$ with $\mu_A(x) > 0$, whereas the core of $A$ in $X$ is the crisp set containing every element of $X$ with membership grade 1 in $A$ (see e.g., [43]). Obviously, $\mathrm{core}[t_1, t_2] = [t_1, t_2]$, since a closed interval $[t_1, t_2]$ is an equi-fuzzy interval with membership 1 (see e.g., [41]).

**Definition 11.** The superimposition of two fuzzy sets $A_1$ and $A_2$, written $A_1(S)A_2$, is defined [41] as

$$A_1(S)A_2 = (A_1 - A_2)^{(1/2)} \,(+)\, (A_1 \cap A_2)^{(1)} \,(+)\, (A_2 - A_1)^{(1/2)} \tag{17}$$

where $(A_1 - A_2)^{(1/2)}$ and $(A_2 - A_1)^{(1/2)}$ are fuzzy sets having fixed membership $1/2$, and $(+)$ denotes the union of disjoint sets. To elaborate, let $A_1 = [x_1, y_1]$ and $A_2 = [x_2, y_2]$ be two real intervals such that $A_1 \cap A_2 \neq \phi$, so that a superimposed portion is obtained. In the superimposition of two intervals, the contribution of each interval to the overlapped region is $1/2$, so from Equation (17) we obtain

$$[x_1, y_1](S)[x_2, y_2] = [x_{(1)}, x_{(2)}]^{(1/2)} \,(+)\, [x_{(2)}, y_{(1)}]^{(1)} \,(+)\, (y_{(1)}, y_{(2)}]^{(1/2)} \tag{18}$$

where $x_{(1)} = \min(x_1, x_2)$, $x_{(2)} = \max(x_1, x_2)$, $y_{(1)} = \min(y_1, y_2)$, and $y_{(2)} = \max(y_1, y_2)$.

Similarly, for three real intervals $[x_1, y_1]$, $[x_2, y_2]$, and $[x_3, y_3]$ with ${\bigcap}_{i=1}^{3}\left[{x}_{i},{y}_{i}\right] \neq \phi$, the resulting superimposed interval is

$$[x_1, y_1](S)[x_2, y_2](S)[x_3, y_3] = [x_{(1)}, x_{(2)}]^{(1/3)} \,(+)\, [x_{(2)}, x_{(3)}]^{(2/3)} \,(+)\, [x_{(3)}, y_{(1)}]^{(1)} \,(+)\, [y_{(1)}, y_{(2)}]^{(2/3)} \,(+)\, [y_{(2)}, y_{(3)}]^{(1/3)} \tag{19}$$

where $\{x_{(i)}; i = 1, 2, 3\}$ is found from $\{x_i; i = 1, 2, 3\}$ by arranging the values in ascending order of magnitude, and $\{y_{(i)}; i = 1, 2, 3\}$ is found from $\{y_i; i = 1, 2, 3\}$ in the same fashion.

In general, let $[x_i, y_i]$, $i = 1, 2, \dots, n$, be $n$ real intervals such that $\underset{i=1}{\overset{n}{\cap}}\left[{x}_{i},{y}_{i}\right] \neq \phi$. Generalizing (19), we obtain

$$[x_1, y_1](S)[x_2, y_2](S)\cdots(S)[x_n, y_n] = [x_{(1)}, x_{(2)}]^{(1/n)} \,(+)\, [x_{(2)}, x_{(3)}]^{(2/n)} \,(+)\cdots(+)\, [x_{(r)}, x_{(r+1)}]^{(r/n)} \,(+)\cdots(+)\, [x_{(n)}, y_{(1)}]^{(1)} \,(+)\, [y_{(1)}, y_{(2)}]^{((n-1)/n)} \,(+)\cdots(+)\, [y_{(n-r)}, y_{(n-r+1)}]^{(r/n)} \,(+)\cdots(+)\, [y_{(n-2)}, y_{(n-1)}]^{(2/n)} \,(+)\, [y_{(n-1)}, y_{(n)}]^{(1/n)} \tag{20}$$

where the sequence $\{x_{(i)}\}$ is formed from $\{x_i\}$ in ascending order of magnitude for $i = 1, 2, \dots, n$, and similarly $\{y_{(i)}\}$ is formed from $\{y_i\}$ in ascending order of magnitude [41]. Here, we observe that the membership functions are the combination of the empirical probability distribution function of the left endpoints and the complementary empirical probability distribution function of the right endpoints, and are given by

$$\mu(t) = \begin{cases} \dfrac{\left|\{i : x_i \le t\}\right|}{n}, & x_{(1)} \le t \le y_{(1)} \\[4pt] \dfrac{\left|\{i : y_i \ge t\}\right|}{n}, & y_{(1)} < t \le y_{(n)} \\[4pt] 0, & \text{otherwise} \end{cases} \tag{21}$$
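The membership function of the generalized superimposition above is just an empirical distribution function on the way up and a complementary one on the way down; a minimal sketch, with intervals given as (left, right) tuples:

```python
def superimposition_membership(intervals, t):
    """Membership at t of the superimposition of n real intervals:
    the empirical distribution function of the left endpoints on the
    rising slope, 1 on the common core, and the complementary
    distribution function of the right endpoints on the falling slope."""
    n = len(intervals)
    xs = sorted(x for x, _ in intervals)
    ys = sorted(y for _, y in intervals)
    if t < xs[0] or t > ys[-1]:
        return 0.0
    if t <= ys[0]:
        return sum(x <= t for x in xs) / n   # rising slope, reaches 1 at x_(n)
    return sum(y >= t for y in ys) / n       # falling slope, down to 1/n

ivs = [(1, 5), (2, 6), (3, 7)]
# membership climbs 1/3 -> 2/3 -> 1 over [1, 3], is 1 on [3, 5], then falls
```

On the interval $[x_{(r)}, x_{(r+1)})$ exactly $r$ left endpoints have been passed, so the function returns $r/n$, matching the exponents in the superimposition expression.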

**Definition 12.** Let

$$A = [x_{(1)}, x_{(2)}]^{(1/m)} \,(+)\, [x_{(2)}, x_{(3)}]^{(2/m)} \,(+)\cdots(+)\, [x_{(r)}, x_{(r+1)}]^{(r/m)} \,(+)\cdots(+)\, [x_{(m)}, y_{(1)}]^{(1)} \,(+)\, [y_{(1)}, y_{(2)}]^{((m-1)/m)} \,(+)\cdots(+)\, [y_{(m-r)}, y_{(m-r+1)}]^{(r/m)} \,(+)\cdots(+)\, [y_{(m-2)}, y_{(m-1)}]^{(2/m)} \,(+)\, [y_{(m-1)}, y_{(m)}]^{(1/m)}$$

be the superimposition of $m$ intervals, and let

$$B = [x'_{(1)}, x'_{(2)}]^{(1/n)} \,(+)\, [x'_{(2)}, x'_{(3)}]^{(2/n)} \,(+)\cdots(+)\, [x'_{(r)}, x'_{(r+1)}]^{(r/n)} \,(+)\cdots(+)\, [x'_{(n)}, y'_{(1)}]^{(1)} \,(+)\, [y'_{(1)}, y'_{(2)}]^{((n-1)/n)} \,(+)\cdots(+)\, [y'_{(n-r)}, y'_{(n-r+1)}]^{(r/n)} \,(+)\cdots(+)\, [y'_{(n-2)}, y'_{(n-1)}]^{(2/n)} \,(+)\, [y'_{(n-1)}, y'_{(n)}]^{(1/n)}$$

be the superimposition of $n$ intervals; then $A(S)B$ is the superimposition of $(m + n)$ intervals and is given by

$$A(S)B = [x_{((1))}, x_{((2))}]^{(1/(m+n))} \,(+)\, [x_{((2))}, x_{((3))}]^{(2/(m+n))} \,(+)\cdots(+)\, [x_{((m))}, x_{((m+1))}]^{(m/(m+n))} \,(+)\cdots(+)\, [x_{((m+n))}, y_{((1))}]^{(1)} \,(+)\, [y_{((1))}, y_{((2))}]^{((m+n-1)/(m+n))} \,(+)\cdots(+)\, [y_{((m+n-r))}, y_{((m+n-r+1))}]^{(r/(m+n))} \,(+)\cdots(+)\, [y_{((m+n-2))}, y_{((m+n-1))}]^{(2/(m+n))} \,(+)\, [y_{((m+n-1))}, y_{((m+n))}]^{(1/(m+n))} \tag{23}$$

where $\{x_{((1))}, x_{((2))}, \dots, x_{((m))}, x_{((m+1))}, \dots, x_{((m+n))}\}$ is the sequence formed from $x_{(1)}, x_{(2)}, \dots, x_{(m)}, x'_{(1)}, x'_{(2)}, \dots, x'_{(n)}$ in ascending order of magnitude, and $\{y_{((1))}, y_{((2))}, \dots, y_{((m))}, y_{((m+1))}, \dots, y_{((m+n))}\}$ is the sequence formed from $y_{(1)}, y_{(2)}, \dots, y_{(m)}, y'_{(1)}, y'_{(2)}, \dots, y'_{(n)}$ in ascending order of magnitude. From (23), we obtain the membership function as

$$\mu(t) = \begin{cases} \dfrac{\left|\{i : x_{((i))} \le t\}\right|}{m+n}, & x_{((1))} \le t \le y_{((1))} \\[4pt] \dfrac{\left|\{i : y_{((i))} \ge t\}\right|}{m+n}, & y_{((1))} < t \le y_{((m+n))} \\[4pt] 0, & \text{otherwise} \end{cases} \tag{24}$$
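Operationally, a superimposed interval is fully described by its sorted left and right boundary arrays, so superimposing two superimposed lifetimes amounts to merging two pairs of sorted arrays (this is also how the complexity analysis in Section 5 treats it). A sketch under that representation:

```python
from heapq import merge

def superimpose(lefts_a, rights_a, lefts_b, rights_b):
    """Superimpose two already-superimposed lifetimes: each is given by
    its sorted left- and right-boundary arrays; the result is their
    sorted merge, and the membership-1 core is
    [largest left boundary, smallest right boundary]."""
    lefts = list(merge(lefts_a, lefts_b))
    rights = list(merge(rights_a, rights_b))
    core = (lefts[-1], rights[0])
    return lefts, rights, core

lefts, rights, core = superimpose([1, 2], [5, 6], [3], [7])
# lefts == [1, 2, 3], rights == [5, 6, 7], core == (3, 5)
```

`heapq.merge` consumes pre-sorted inputs in linear time, which is exactly the sorted-array insertion step counted in the time-complexity analysis.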

**Definition 13.** Let $[t_i, t_i']$ and $[t_j, t_j']$ be the lifetimes of $C_i$ and $C_j$, $\{i, j = 1, 2, \dots, n\}$, respectively, such that $[t_i, t_i'] \cap [t_j, t_j'] \neq \phi$; then the merge() function [14] is defined as $C = \mathrm{merge}(C_i, C_j) = C_i \cup C_j$ if and only if $S(C_i, C_j) \leq \sigma$, a pre-defined threshold, where $C$ is the cluster obtained by merging $C_i$ and $C_j$. It is to be mentioned here that $C$ will be associated with the superimposed interval $[t_i, t_i'](S)[t_j, t_j']$ as its lifetime. To merge clusters whose lifetimes are already superimposed time intervals, we compute the intersection of the cores of the superimposed time intervals. If it is found to be non-empty, the clusters are merged and the corresponding superimposed time intervals are again superimposed to obtain a new superimposed time interval.

## 4. Proposed Algorithm

In phase one, the algorithm computes the distance of each incoming data instance from every cluster $C_j$, $j = 1, 2, 3, \dots, k$, and puts it in the cluster with the minimum distance value. It is to be mentioned here that the weights of the categorical attributes are taken to be the same. Consequently, the frequency of the categorical values, the centroid of the numeric values, and the lifetime of the corresponding cluster are updated. For convenient updating of the cluster centroid and the categorical attribute values, two auxiliary matrices are maintained for each cluster, one for storing the frequencies and the other for storing the centroid vectors. Then, the weights of the categorical attributes are computed. Phase one continues until no re-assignment occurs. At the end of phase one, each of the k clusters has an associated lifetime.

In phase two, the merge function (defined in Section 3) is applied to the output of phase one as follows. Each pair of clusters from the k clusters is merged into a bigger cluster if their similarity value is within a specified threshold and their lifetimes overlap. The overlapping lifetimes are then kept in compact form as a superimposed time interval, which in turn produces a fuzzy time interval. At any intermediate stage of merging, the intersection of the cores of the lifetimes of the clusters to be merged must be checked; if they intersect, the clusters are merged and the two superimposed intervals are superimposed, producing a new superimposed interval with a reconstructed membership function (Definition 12 in Section 3). In the first iteration of merging, the lifetimes are closed intervals, so their intersection is computed and, if found to be non-empty, they are simply superimposed using the formula for interval superimposition (Definition 11 in Section 3). In subsequent iterations, merging is associated with the superimposition of the superimposed intervals produced by the previous iteration, based on the non-empty intersection of their cores. For storing the boundaries of the time intervals to be superimposed, two sorted arrays are used, one for the left boundaries and the other for the right boundaries. The process continues until no merging is possible or a particular level becomes empty. Algorithm 1 is described using the pseudo-code and the flowchart (Figure 1) given below.

**Algorithm 1: Mixed Clustering Algorithm**

```text
Step 1:  Given an online d-dimensional dataset with categorical and numeric attributes.
Step 2:  Select the number k to decide the number of clusters.
Step 3:  Take the first k data instances and assign them as the k cluster centroids,
         with their time-stamps as the start times of the clusters' lifetimes.
Step 4:  Assign each incoming data instance to the closest centroid, using equal
         weights for the categorical attributes.
Step 5:  Update the two auxiliary matrices maintained for storing the frequency of
         each categorical value occurring in the cluster and the mean vector of the
         numeric parts of all data instances belonging to the cluster.
Step 6:  Extend or update the lifetime of the cluster using the time-stamp of the
         current data instance inserted into the cluster.
Step 7:  Compute the weights of the categorical attributes.
Step 8:  if (no re-assignment occurred)
             go to Step 9;
         else
             re-assign each data instance to the new closest centroid of each cluster;
             go to Step 5.
Step 9:  for each possible pair of clusters (C_i, C_j) with lifetimes as superimposed
         intervals S[t_i] and S[t_j], respectively:
             if (core(S[t_i]) ∩ core(S[t_j]) is empty) continue;
             else if (sim(C_i, C_j) <= sigma) {
                 merge(C_i, C_j);
                 superimpose(S[t_i], S[t_j]);
             }
Step 10: Output the clusters.
```
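Steps 4 to 6 of phase one can be sketched as follows for the numeric part only; the categorical distance and entropy weights from Section 3 would slot into the distance computation, and all names and data here are illustrative.

```python
import math

def phase_one_assign(stream, centroids):
    """Phase-one sketch (Steps 4-6): assign each timestamped instance to
    its nearest centroid and extend that cluster's lifetime. Numeric-only
    toy version of the full mixed-attribute assignment."""
    clusters = {c: {"members": [], "lifetime": None} for c in range(len(centroids))}
    for t, x in stream:                      # (timestamp, numeric vector)
        j = min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
        info = clusters[j]
        info["members"].append(x)
        lo, hi = info["lifetime"] or (t, t)
        info["lifetime"] = (min(lo, t), max(hi, t))   # Step 6: extend lifetime
    return clusters

stream = [(1, (0.1, 0.1)), (2, (0.9, 0.9)), (3, (0.2, 0.0))]
out = phase_one_assign(stream, [(0.0, 0.0), (1.0, 1.0)])
# cluster 0 receives the instances at t = 1 and t = 3, so its lifetime is (1, 3)
```

Phase two would then feed the resulting lifetimes into the core-intersection and similarity checks of Step 9.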

## 5. Time Complexity

Let $d_n$ = the number of numeric attributes and $d_c$ = the number of categorical attributes, so that $d = d_n + d_c + 1$ is the total number of attributes, where the extra attribute is the time attribute. The computational cost of calculating the centroids is $O(n + n k d_n)$. The computational cost of Steps 1, 2, and 3 is $O(n k d_n)$. Step 4 takes $O(kn + nk)$ time: for each centroid, the distance of the data instance has to be computed and the minimum distance chosen to assign the data instance to that cluster, which needs $O(nk)$ time over the $k$ clusters; since each data instance is associated with a time stamp, another $O(nk)$ time is needed. The cost of updating the two matrices along with the lifetimes of the clusters, i.e., the cost of Step 5, is $O(3k)$. Moreover, the computational cost of updating the weights of the categorical attributes is $O(a n k d_c)$, where $a$ = the average number of possible values that a categorical attribute can take. Thus, the total computational cost of Steps 4 and 5 is $O(3k + 2nk + a n k d_c) = O(a n k d_c)$. If $i$ is the number of iterations, the total computational cost of phase one is $O(i(n k d_n + a n k d_c)) = O(i a n k d_c)$.

In phase two, the clusters obtained during phase one are merged based on the similarity measure and the non-empty intersection of the lifetimes, or of the cores of the lifetimes. Merging two clusters with their lifetimes requires $O(n_1 n_2)$ time, where $n_1$ and $n_2$ are the sizes of the two clusters to be merged. Additionally, merging is associated with the superimposition of the clusters' lifetimes. Let $m_1$ be the number of intervals superimposed in the lifetime of one cluster and $m_2$ the number of intervals superimposed in the lifetime of the other, with their cores having a non-empty intersection. The intersection of the cores requires $O(1)$ time. If $m_1$ superimposed intervals are to be superimposed on $m_2$, with $m_1 \le n$ and $m_2 \le n$, the boundaries of the $m_1$ intervals are to be inserted into those of the $m_2$ intervals as two sorted arrays. This is essentially merging four sorted arrays into two, one for the left endpoints and one for the right endpoints. Searching in the four sorted arrays requires $O(\log m_1 + \log m_2 + \log m_1 + \log m_2) = O(4 \log n) = O(\log n)$ time, as $m_1 = O(n)$ and $m_2 = O(n)$; insertion into the sorted arrays requires $O(m_1 + m_2 + m_1 + m_2) = O(4n) = O(n)$ time. If $t$ is the number of iterations in phase two, the total cost is $O(t(\log n + n)) = O(k \log n + kn)$, as $t \le k$.

The total cost of all the steps is therefore $O(i a n k d_c + k \log n + kn)$. Since $k$ is constant, $d_c \le n$, $a \le n$, and $i \le k \le n$ so that $i$ is also constant, the overall complexity of the algorithm is $O(n \cdot n \cdot n) = O(n^3)$; that is, the algorithm runs in cubic time.

## 6. Experimental Analysis and Results

## 7. Conclusions, Limitations and Lines for Future Works

#### 7.1. Conclusions

- It supplies a set of clusters, where each cluster has an associated fuzzy time interval describing its period or lifetime.
- One of the most challenging issues in the k-means algorithm is the selection of k. The proposed algorithm alleviates this problem: as clusters are merged, their number reduces to a stable set of output clusters.
- Obviously, the number of output clusters is less than or equal to the number at the beginning.
- A data instance, or group of data instances, that does not belong to any of the clusters, belongs to a sparse cluster, or falls within a lifetime with a very low membership value is treated as an anomaly.
- An experimental study with a synthetic dataset and a real-world dataset is conducted, and a comparative analysis is carried out against several clustering-based anomaly detection algorithms, namely k-means [5], PCM (partitioning clustering with merging) [14], ACA (agglomerative clustering algorithm) [39], IF (Isolation Forest) [39], and OnCAD [31]. Our algorithm is found to outperform the others in terms of accuracy, specificity, sensitivity, number of anomalies found, number of clusters generated, execution time, and the stability of the output clusters.
- The algorithm is found to run in cubic time.

#### 7.2. Limitations and Lines for Future Works

- An efficient algorithm can be designed to find real-time anomalies in high-dimensional, heterogeneous data with continuous attributes.
- An efficient algorithm can be designed to find anomalies in temporal interval datasets.
- An approach other than partitioning and hierarchical clustering, viz., a density-based approach, can be employed for real-time anomaly detection.

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

- Pamula, R.; Deka, J.K.; Nandi, S. An Outlier Detection Method Based on Clustering. In Proceedings of the 2011 Second International Conference on Emerging Applications of Information Technology, Kolkata, India, 19–20 February 2011; pp. 253–256.
- Agrawal, S.; Agrawal, J. Survey on Anomaly Detection on Data Mining Techniques. Procedia Comput. Sci. **2015**, 60, 708–713.
- Zaki, M.J.; Wong, L. Data Mining Techniques; WSPC-2003; Lecture Notes Series; Computer Science: Singapore, 2003; Available online: http://www.cs.rpi.edu/~zaki/PaperDir/PGKD04.pdf (accessed on 12 March 2022).
- Soni, D. Understanding the Different Types of Machine Learning. Towards Data Science, 2019. Available online: https://towardsdatascience.com/understanding-the-different-types-of-machine-learning-models-9c47350bb68a (accessed on 15 March 2022).
- Hartigan, J.A. Clustering Algorithms; John Wiley & Sons: Hoboken, NJ, USA, 1975.
- Cheng, Y.-M.; Jia, H. A Unified Metric for Categorical and Numeric Attributes in Data Clustering; Hong Kong University Technical Report; Springer: Berlin/Heidelberg, Germany, 2011; Available online: https://www.comp.hkbu.edu.hk/tech-report (accessed on 1 April 2018).
- Mazarbhuiya, F.A.; Abulaish, M. Clustering Periodic Patterns using Fuzzy Statistical Parameters. Int. J. Innov. Comput. Inf. Control **2012**, 8, 2113–2124.
- Gil-Garcia, R.; Badia-Contelles, J.M.; Pons-Porrata, A. Dynamic Hierarchical Compact Clustering Algorithm. In Progress in Pattern Recognition, Image Analysis and Applications; Sanfeliu, A., Cortés, M.L., Eds.; CIARP 2005, LNCS 3775; Springer: Berlin/Heidelberg, Germany; pp. 302–310.
- Hammouda, K.M.; Kamel, M.S. Efficient phrase-based document indexing for web document clustering. IEEE Trans. Knowl. Data Eng. **2004**, 16, 1279–1296.
- Mahdy, A.M.S. A numerical method for solving the nonlinear equations of Emden-Fowler models. J. Ocean. Eng. Sci. 2022; in press.
- Mahdy, A.M.S. Stability, existence, and uniqueness for solving fractional glioblastoma multiforme using a Caputo–Fabrizio derivative. Math. Methods Appl. Sci. 2023; Early View.
- Mazarbhuiya, F.A.; AlZahrani, M.Y.; Georgieva, L. Anomaly Detection Using Agglomerative Hierarchical Clustering Algorithm; ICISA 2018; Lecture Notes in Electrical Engineering (LNEE); Springer: Hong Kong, 2019; Volume 514, pp. 475–484.
- Linquan, X.; Wang, W.; Liping, C.; Guangxue, Y. An Anomaly Detection Method Based on Fuzzy C-means Clustering Algorithm. In Proceedings of the Second International Symposium on Networking and Network Security, Jinggangshan, China, 2–4 April 2010; pp. 089–092.
- Mazarbhuiya, F.A.; AlZahrani, M.Y.; Mahanta, A.K. Detecting Anomaly Using Partitioning Clustering with Merging. ICIC Express Lett. **2020**, 14, 951–960.
- Retting, L.; Khayati, M.; Cudre-Mauroux, P.; Piorkowski, M. Online anomaly detection over Big Data streams. In Proceedings of the 2015 IEEE International Conference on Big Data, Santa Clara, CA, USA, 29 October–1 November 2015.
- Alguliyev, R.; Aliguliyev, R.; Sukhostat, L. Anomaly Detection in Big Data based on Clustering. Stat. Optim. Inf. Comput. **2017**, 5, 325–340.
- Hahsler, M.; Piekenbrock, M.; Doran, D. dbscan: Fast Density-based Clustering with R. J. Stat. Softw. **2019**, 91, 1–30.
- Song, H.; Jiang, Z.; Men, A.; Yang, B. A Hybrid Semi-Supervised Anomaly Detection Model for High Dimensional Data. Comput. Intell. Neurosci. **2017**, 2017, 8501683.
- Alghawli, A.S. Complex methods detect anomalies in real time based on time series analysis. Alex. Eng. J. **2022**, 61, 549–561.
- Yang, Y.; Zhang, K.; Wu, C.; Niu, X.; Yang, Y. Building an Effective Intrusion Detection System Using the Modified Density Peak Clustering Algorithm and Deep Belief Networks. Appl. Sci. **2019**, 9, 238.
- Kim, B.; Alawami, M.A.; Kim, E.; Oh, S.; Park, J.; Kim, H. A Comparative Study of Time Series Anomaly Detection Models for Industrial Control Systems. Sensors **2023**, 23, 1310.
- Mazarbhuiya, F.A. Detecting Anomaly using Neighborhood Rough Set based Classification Approach. ICIC Express Lett. **2023**, 17, 73–80.
- Younas, M.Z. Anomaly Detection using Data Mining Techniques: A Review. Int. J. Res. Appl. Sci. Eng. Technol. **2020**, 8, 568–574.
- Thudumu, S.; Branch, P.; Jin, J.; Singh, J. A comprehensive survey of anomaly detection techniques for high dimensional big data. J. Big Data **2020**, 7, 42.
- Habeeb, R.A.A.; Nasauddin, F.; Gani, A.; Hashem, I.A.T.; Ahmed, E.; Imran, M. Real-time big data processing for anomaly detection: A Survey. Int. J. Inf. Manag. **2019**, 45, 289–307.
- Wang, B.; Hua, Q.; Zhang, H.; Tan, X.; Nan, Y.; Chen, R.; Shu, X. Research on anomaly detection and real-time reliability evaluation with the log of cloud platform. Alex. Eng. J. **2022**, 61, 7183–7193.
- Halstead, B.; Koh, Y.S.; Riddle, P.; Pechenizkiy, M.; Bifet, A. Combining Diverse Meta-Features to Accurately Identify Recurring Concept Drift in Data Streams. ACM Trans. Knowl. Discov. Data **2023**.
- Li, X.; Han, J. Mining approximate top-k subspace anomalies in multi-dimensional time-series data. In Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, 23–27 September 2007; pp. 447–458.
- Gupta, M.; Gao, J.; Aggrawal, C.C.; Jain, J. Outlier detection for temporal data: A survey. IEEE Trans. Knowl. Data Eng. **2014**, 25, 2250–2267.
- Zhao, Z.; Birke, R.; Han, R.; Robu, B.; Bouchenak, S.; Ben Mokhtar, S.; Chen, L.Y. RAD: On-line Anomaly Detection for Highly Unreliable Data. arXiv **2019**, arXiv:1911.04383.
- Chenaghlou, M.; Moshtaghi, M.; Lekhie, C.; Salahi, M. Online Clustering for Evolving Data Streams with Online Anomaly Detection. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, 3–6 June 2018; pp. 508–521.
- Firoozjaei, M.D.; Mahmoudyar, N.; Baseri, Y.; Ghorbani, A.A. An evaluation framework for industrial control system cyber incidents. Int. J. Crit. Infrastruct. Prot. **2022**, 36, 100487.
- Chen, Q.; Zhou, M.; Cai, Z.; Su, S. Compliance Checking Based Detection of Insider Threat in Industrial Control System of Power Utilities. In Proceedings of the 2022 7th Asia Conference on Power and Electrical Engineering (ACPEE), Hangzhou, China, 15–17 April 2022; pp. 1142–1147.
- Zhao, Z.; Mehrotra, K.G.; Mohan, C.K. Online Anomaly Detection Using Random Forest. In Recent Trends and Future Technology in Applied Intelligence; Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M., Eds.; IEA/AIE 2018; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018.
- Izakian, H.; Pedrycz, W. Anomaly detection in time series data using fuzzy c-means clustering. In Proceedings of the 2013 Joint IFSA World Congress and NAFIPS Annual Meeting, Edmonton, AB, Canada, 24–28 June 2013.
- Decker, L.; Leite, D.; Giommi, L.; Bonakorsi, D. Real-time anomaly detection in data centers for log-based predictive maintenance using fuzzy-rule based approach. arXiv **2020**, arXiv:2004.13527v1.
- Masdari, M.; Khezri, H. Towards fuzzy anomaly detection-based security: A comprehensive review. Fuzzy Optim. Decis. Mak. **2020**, 20, 1–49.
- de Campos Souza, P.V.; Guimarães, A.J.; Rezende, T.S.; Silva Araujo, V.J.; Araujo, V.S. Detection of Anomalies in Large-Scale Cyberattacks Using Fuzzy Neural Networks. AI **2020**, 1, 92–116. Available online: https://www.mdpi.com/2673-2688/1/1/5 (accessed on 1 April 2022).
- Habeeb, R.A.A.; Nasauddin, F.; Gani, A.; Hashem, I.A.T.; Amanullah, A.M.E.; Imran, M. Clustering-based real-time anomaly detection—A breakthrough in big data technologies. Trans. Emerg. Telecommun. Technol. **2022**, 33, e3647.
- Mahanta, A.K.; Mazarbhuiya, F.A.; Baruah, H.K. Finding Calendar-based Periodic Patterns. Pattern Recognit. Lett. **2008**, 29, 1274–1284.
- Mazarbhuiya, F.A.; Mahanta, A.K.; Baruah, H.K. The Solution of Fuzzy Equation A+X=B Using the Method of Superimposition. Appl. Math. **2011**, 2, 1039–1045.
- Loeve, M. Probability Theory; Springer: New York, NY, USA, 1977.
- Klir, J.; Yuan, B. Fuzzy Sets and Logic Theory and Application; Prentice Hall Pvt. Ltd.: Englewood Cliffs, NJ, USA, 2002.
- KDD Cup'99 Data. Available online: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 15 January 2020).
- Kitsune Network Attack Dataset. Available online: https://github.com/ymirsky/Kitsune-py (accessed on 12 December 2021).

| Dataset | Dataset Characteristics | Attribute Characteristics | No. of Instances | No. of Attributes |
|---|---|---|---|---|
| KDDCUP'99 Network Anomaly | Multivariate | Numeric, categorical, temporal | 4,898,431 | 41 |
| Kitsune Network Attack | Multivariate, sequential, time-series | Real, temporal | 27,170,754 | 115 |

| | k-Means | PCM [14] | ACA | IF Model | OnCAD | Proposed Method |
|---|---|---|---|---|---|---|
| Accuracy | 95% | 86% | 82% | 84% | 97% | 98% |
| Sensitivity | 0% | 35% | 68% | 72% | 93% | 95% |
| Specificity | 100% | 98% | 95% | 97% | 98% | 100% |

| | k-Means | PCM [14] | ACA | IF Model | OnCAD | Proposed Method |
|---|---|---|---|---|---|---|
| Accuracy | 86% | 76% | 72% | 74% | 84% | 90% |
| Sensitivity | 0% | 30% | 57% | 60% | 83% | 85% |
| Specificity | 100% | 87% | 89% | 96% | 97% | 100% |

**No. of Clusters Obtained (for k = 12 Initially)**

| Dataset Size | k-Means | PCM [14] | ACA | IF Model | OnCAD | Proposed Method |
|---|---|---|---|---|---|---|
| 100,000 | 12 | 7 | 4 | 12 | 12 | 3 |
| 200,000 | 12 | 7 | 5 | 12 | 12 | 5 |
| 300,000 | 12 | 7 | 8 | 12 | 12 | 7 |
| 400,000 | 12 | 7 | 10 | 12 | 12 | 8 |
| 500,000 | 12 | 7 | 7 | 12 | 12 | 8 |

**No. of Clusters Obtained (for k = 15 Initially)**

| Dataset Size | k-Means | PCM [14] | ACA | IF Model | OnCAD | Proposed Method |
|---|---|---|---|---|---|---|
| 100,000 | 15 | 10 | 6 | 15 | 15 | 6 |
| 200,000 | 15 | 10 | 8 | 15 | 15 | 9 |
| 300,000 | 15 | 10 | 9 | 15 | 15 | 11 |
| 400,000 | 15 | 10 | 13 | 15 | 15 | 12 |
| 500,000 | 15 | 10 | 11 | 15 | 15 | 12 |

**Anomalies**

| Dataset Size | k-Means | PCM [14] | ACA | IF Model | OnCAD | Proposed Method |
|---|---|---|---|---|---|---|
| 100,000 | 4750 | 4300 | 4100 | 4200 | 4850 | 4900 |
| 200,000 | 9500 | 8600 | 4200 | 8400 | 9700 | 9800 |
| 300,000 | 14,250 | 12,900 | 12,300 | 12,600 | 14,550 | 14,700 |
| 400,000 | 19,000 | 17,200 | 16,400 | 16,800 | 19,400 | 19,600 |
| 500,000 | 23,750 | 21,500 | 20,500 | 21,000 | 24,250 | 24,500 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mazarbhuiya, F.A.; Shenify, M.
A Mixed Clustering Approach for Real-Time Anomaly Detection. *Appl. Sci.* **2023**, *13*, 4151.
https://doi.org/10.3390/app13074151
