Multi-Level Clustering-Based Outlier’s Detection (MCOD) Using Self-Organizing Maps
Abstract
1. Introduction
2. Related Work and Background
2.1. Distance-Based Outlier Detection
2.2. Distribution-Based Outlier Detection
2.3. Density-Based Outlier Detection
2.4. Deviation-Based Outlier Detection
2.5. Angle-Based Outlier Detection
2.6. Deep Learning-Based Outlier Detection
2.7. Clustering-Based Outlier Detection
3. Multi-Level Clustering-Based Outlier Detection
3.1. SOM Clustering
3.2. The MCOD (Ai-SOM) Outlier’s Detection Algorithm
Algorithm 1. Multi-Level Clustering-Based Outlier’s Detection (MCOD) (Ai-SOM)
Input: Dataset X of n records and d dimensions, clustering algorithm Ai, number of clusters k, learning rate η, alpha α
Output: Top % outliers
Begin
Step 1: // Apply Ai on X to obtain k clusters
    Cluster_j ← Ai (X, k), where j = 1, 2, 3, …, k
Step 2: // Initialize a vector W of size k with the cluster means
    for j ← 1, 2, 3, …, k do
        w_j = mean (Cluster_j)
    end for
Step 3: // Reshape W to a 2D matrix that matches the shape of the SOM
    for x ← 1, 2, 3, …, sqrt(k) do
        for y ← 1, 2, 3, …, sqrt(k) do
            z = (x − 1)∙sqrt(k) + y
            W_{x,y} = w_z
        end for
    end for
Step 4: // Apply SOM and update W using η to obtain the updated k clusters
    UpdatedCluster_j ← SOM (Cluster_j, W, k, η), where j = 1, 2, 3, …, k
Step 5: // Compute the ORF for each object x in the updated k clusters, given α
Step 6: // Select the top % of data points with the highest ORF values as outliers
End
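Steps 2 to 4 of Algorithm 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the first-level labels from Ai are already available, that k is a perfect square, and that the SOM uses a Gaussian neighbourhood with a single pass over the data. The function names (`cluster_means`, `reshape_to_grid`, `som_epoch`, `assign`) and the toy data are hypothetical. Step 5 is omitted because the ORF formula is not given in this section.

```python
import math

def cluster_means(X, labels, k):
    """Step 2: w_j = mean of the records assigned to cluster j (0-based j)."""
    d = len(X[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for row, j in zip(X, labels):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += row[t]
    # Assumption: every cluster is non-empty.
    return [[s / counts[j] for s in sums[j]] for j in range(k)]

def reshape_to_grid(w, k):
    """Step 3: W[x][y] = w[z]; 0-based form of z = (x-1)*sqrt(k) + y."""
    side = math.isqrt(k)
    if side * side != k:
        raise ValueError("k must be a perfect square to fill a square grid")
    return [[w[x * side + y] for y in range(side)] for x in range(side)]

def best_matching_unit(row, W):
    """Grid cell whose weight vector is closest (squared Euclidean) to row."""
    side = len(W)
    cells = ((x, y) for x in range(side) for y in range(side))
    return min(cells, key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(row, W[c[0]][c[1]])))

def som_epoch(X, W, eta, sigma=0.5):
    """Step 4 (one pass): pull the BMU and its grid neighbours toward each record."""
    side = len(W)
    for row in X:
        bx, by = best_matching_unit(row, W)
        for x in range(side):
            for y in range(side):
                # Gaussian neighbourhood on the grid distance to the BMU.
                h = math.exp(-((x - bx) ** 2 + (y - by) ** 2) / (2 * sigma ** 2))
                W[x][y] = [wv + eta * h * (a - wv) for wv, a in zip(W[x][y], row)]
    return W

def assign(X, W):
    """Updated cluster of each record = flattened index of its BMU."""
    side = len(W)
    return [bx * side + by for bx, by in (best_matching_unit(r, W) for r in X)]

# Toy data: four well-separated pairs with first-level labels 0..3.
X = [[0, 0], [0, 1], [10, 10], [10, 11], [0, 10], [1, 10], [10, 0], [11, 0]]
labels = [0, 0, 1, 1, 2, 2, 3, 3]
W = reshape_to_grid(cluster_means(X, labels, 4), 4)
som_epoch(X, W, eta=0.5)
updated = assign(X, W)
```

Seeding the SOM weights with first-level cluster means, rather than random values, is what makes the scheme multi-level: the SOM refines an existing partition instead of starting from scratch.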
4. Experiment Analysis
4.1. Datasets
4.1.1. Artificial Datasets
4.1.2. Biomedical Datasets
4.1.3. Credit Card Datasets
4.2. Adopted Outliers Detection Algorithms
4.3. Evaluation Criteria
4.4. Individual Clustering Results
4.5. Experiment 1: Artificial Datasets
4.6. Experiment 2: Medical Datasets with True Outliers
4.7. Experiment 3: Real Credit Card Datasets
5. Conclusions and Future Directions
Author Contributions
Funding
Conflicts of Interest
Dataset | n (records) | k (clusters) | d (dimensions) |
---|---|---|---|
HBK | 75 | 4 | 4 |
Wood | 20 | 4 | 6 |
Cardio | 2126 | 100 | 25 |
BC | 699 | 100 | 9 |
RBC | 13,731 | 100 | 15 |
European Credit | 284,807 | 100 | 29 |
Algorithm | Parameters |
---|---|
KM | k = 4 or 100, MaxITER = 300 |
BKM | k = 4 or 100, Maximum number of trials to select cluster to bisect = 20 |
PAM | k = 4 or 100, MaxITER = 20 |
FCM | k = 4 or 100, MaxITER = 100 |
SOM | Map size = 2 × 2 or 10 × 10, Sigma = 0.5, Learning rate = 0.5, Neighbourhood function = Gaussian |
LOF | Number of neighbours = 10 |
CBLOF | Alpha = 0.9, Beta = 5, Add weight to outlier score calculation = True |
Dataset | KM | BKM | FCM | PAM | SOM |
---|---|---|---|---|---|
HBK | 0.8757 | −0.2003 | 0.3973 | 0.3251 | 0.9210 |
Wood | 0.4364 | −0.0836 | 0.4360 | 0.2097 | 0.4364 |
Cardio | 0.2498 | −0.3784 | 0.1544 | 0.1347 | 0.2344 |
BC | 0.2227 | −0.6473 | 0.0297 | 0.1953 | 0.3430 |
RBC | 0.8928 | 0.3414 | 0.6982 | 0.7399 | 0.8719 |
European Credit | 0.1841 | −0.1803 | 0.0796 | 0.1425 | 0.1344 |
Dataset | LOF | FindCBLOF (KM) | FindCBLOF (BKM) | FindCBLOF (PAM) | FindCBLOF (FCM) | FindCBLOF (SOM) |
---|---|---|---|---|---|---|
HBK | 3 | 5 | 11 | |||
Wood | 4 | 1 | 3 |
Dataset | MCOD (KM-SOM) | MCOD (BKM-SOM) | MCOD (PAM-SOM) | MCOD (FCM-SOM) |
---|---|---|---|---|
HBK | 14 | 12 | 13 | 13 |
Wood | 4 | 3 | 3 | 4 |
TopRatio | LOF | FindCBLOF (KM) | FindCBLOF (BKM) | FindCBLOF (FCM) | FindCBLOF (PAM) | FindCBLOF (SOM) |
---|---|---|---|---|---|---|
10% | 31 | 71 | 23 | 68 | 72 | 44 |
15% | 50 | 98 | 35 | 87 | 90 | 60 |
20% | 75 | 124 | 53 | 114 | 112 | 83 |
25% | 85 | 158 | 76 | 132 | 123 | 108 |
30% | 92 | 171 | 99 | 162 | 155 | 134 |
TopRatio | LOF | FindCBLOF (KM) | FindCBLOF (BKM) | FindCBLOF (FCM) | FindCBLOF (PAM) | FindCBLOF (SOM) |
---|---|---|---|---|---|---|
10% | 11 | 45 | 19 | 28 | 32 | 56 |
15% | 24 | 52 | 23 | 39 | 63 | 74 |
20% | 45 | 69 | 36 | 50 | 89 | 99 |
25% | 59 | 81 | 44 | 68 | 97 | 129 |
30% | 68 | 93 | 50 | 76 | 131 | 159 |
TopRatio | MCOD (KM-SOM) | MCOD (BKM-SOM) | MCOD (PAM-SOM) | MCOD (FCM-SOM) |
---|---|---|---|---|
10% | 82 | 85 | 82 | 79 |
15% | 123 | 123 | 108 | 110 |
20% | 155 | 144 | 137 | 136 |
25% | 185 | 169 | 162 | 164 |
30% | 214 | 197 | 182 | 184 |
TopRatio | MCOD (KM-SOM) | MCOD (BKM-SOM) | MCOD (PAM-SOM) | MCOD (FCM-SOM) |
---|---|---|---|---|
10% | 63 | 33 | 47 | 42 |
15% | 85 | 55 | 74 | 68 |
20% | 116 | 83 | 97 | 97 |
25% | 144 | 105 | 126 | 118 |
30% | 173 | 128 | 154 | 145 |
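The TopRatio rows in the tables above correspond to keeping a fixed percentage of the highest-scoring points, as in Step 6 of Algorithm 1. A minimal sketch of that cut follows. The function name `top_percent_indices`, the round-up rule for fractional counts, and the example scores are assumptions made for illustration, not details taken from the paper.

```python
def top_percent_indices(scores, percent):
    """Return the indices of the top `percent`% of scores, highest score first.

    Assumption: a fractional count is rounded up (integer ceiling).
    """
    n_top = (percent * len(scores) + 99) // 100
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_top]

# Hypothetical outlier scores for five points; keep the top 40%.
scores = [0.1, 0.9, 0.4, 0.8, 0.2]
flagged = top_percent_indices(scores, 40)
```

Taking the percentage as an integer keeps the cut-off count exact and avoids floating-point rounding when the ratio is multiplied by the dataset size.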
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, M.; Kashef, R.; Ibrahim, A. Multi-Level Clustering-Based Outlier’s Detection (MCOD) Using Self-Organizing Maps. Big Data Cogn. Comput. 2020, 4, 24. https://doi.org/10.3390/bdcc4040024