ConAnomaly: Content-Based Anomaly Detection for System Logs
Abstract
1. Introduction
- We use the part of speech of each word as a preliminary filter on the log content, which avoids spending computing resources on uninformative words. To the best of our knowledge, our work is the first to use part-of-speech information to weight features.
- This study offers new insight into handling unseen log templates and reduces the dependence on existing log parsers.
- We propose ConAnomaly, which incorporates the semantic information in log messages into sequential log anomaly detection and improves detection performance in our experiments.
2. Related Work
2.1. Log Parsing
2.2. Feature Extraction
2.3. Anomaly Detection
2.4. Limitation of Previous Models
- Existing log-based anomaly detection systems are effective, but most of them depend on existing log parsing tools. If no such tool suits the current log data set, the model may perform poorly. Moreover, these systems cannot handle unknown log events or templates. DeepLog, for example, preprocesses log files with Spell, an unsupervised streaming parser that parses incoming log entries online based on the idea of the longest common subsequence (LCS). Its input for classification is a window $w$ of the $h$ most recent log keys, that is, $w = \{m_{t-h}, \ldots, m_{t-2}, m_{t-1}\}$, where each $m_i$ is the log key from the log entry $e_i$. However, if an undefined log instance is printed in a real-time environment, there is a risk that the model will crash or make incorrect predictions.
- Logs, as unstructured data, have two characteristics: first, there is a temporal relationship between logs, which reflects the workflow; second, each log carries its own semantics. However, most available tools exploit only the first characteristic in the anomaly detection stage. For example, LogCluster clusters log sequences according to how similar the sequences themselves are, without using the meaning of the log text.
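The windowed input and the unseen-key risk described in the first limitation can be illustrated with a minimal sketch. This is not DeepLog's implementation: the helper names `sliding_windows`, `build_vocab`, and `encode`, the key strings, and the window size are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def sliding_windows(keys: Sequence[str], h: int) -> Iterator[Tuple[List[str], str]]:
    """Yield (window, next_key) pairs; each window holds the h most recent keys."""
    for t in range(h, len(keys)):
        yield list(keys[t - h:t]), keys[t]


def build_vocab(keys: Sequence[str]) -> Dict[str, int]:
    """Map every log key seen in training to an integer index."""
    return {key: idx for idx, key in enumerate(sorted(set(keys)))}


def encode(window: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Encode a window as indices; raises KeyError for a key unseen in training."""
    return [vocab[key] for key in window]


# Toy training sequence of log keys (hypothetical values, not from any dataset).
train = ["k5", "k5", "k7", "k9", "k5", "k7"]
pairs = list(sliding_windows(train, h=3))
vocab = build_vocab(train)
encoded = encode(pairs[0][0], vocab)

# A key that never appeared in training cannot be encoded by a key-indexed model.
try:
    encode(["k5", "k42", "k7"], vocab)
    crashed = False
except KeyError:
    crashed = True
```

In this sketch the lookup fails outright for `k42`; a model that silently mapped it to some default index would instead risk an incorrect prediction, which is the other failure mode mentioned above.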
3. Design of ConAnomaly
3.1. Overview
3.2. Log2vec
3.3. Log Anomaly Detection
3.3.1. LSTM
3.3.2. FC
3.3.3. Softmax
4. Experiment
4.1. Experiment Setting
4.1.1. Datasets
- BGL: There are 4,747,963 logs in the BGL dataset, which were collected from a BlueGene/L supercomputer system at Lawrence Livermore National Labs. Each BGL log was manually labeled as either normal or anomalous, and 348,469 logs were anomalous.
- HDFS: The HDFS dataset consists of 11,175,629 logs collected from more than 200 Amazon EC2 nodes that run Hadoop-based jobs. Program execution in the HDFS system usually involves a block of logs. On this basis, the logs are grouped into 575,061 blocks, of which 16,838 were labeled as anomalous by experts. Unlike BGL data, HDFS logs have identifiers recorded for each job execution.
Dataset | Time Span | # of Logs | # of Anomalies |
---|---|---|---|
HDFS | 38.7 h | 11,175,629 | 16,838 (blocks) |
BGL | 7 months | 4,747,963 | 348,469 (logs) |
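As a quick worked check of how imbalanced the two datasets are, the anomaly ratios can be computed from the counts given above. The snippet is only an arithmetic sketch; the dictionary layout is an assumption, and the HDFS ratio is computed per block (575,061 blocks) because its anomalies are labeled at block level.

```python
# Counts copied from the dataset descriptions and table above.
datasets = {
    "HDFS": {"units": 575_061, "anomalies": 16_838},    # units are blocks
    "BGL": {"units": 4_747_963, "anomalies": 348_469},  # units are individual logs
}

# Fraction of labeled units that are anomalous in each dataset.
ratios = {name: d["anomalies"] / d["units"] for name, d in datasets.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%} anomalous")
```

Both ratios are well below 10% (roughly 2.9% of HDFS blocks and 7.3% of BGL logs), so anomalous samples are a small minority in either dataset.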
4.1.2. Baselines
- LogCluster: LogCluster clusters logs to ease log-based problem identification. It also uses a knowledge base to check whether a log sequence has occurred before.
- DeepLog: DeepLog is a deep neural network model that uses Long Short-Term Memory (LSTM) to model a system log as a natural language sequence.
- LogAnomaly: LogAnomaly is a framework that models a log stream as a natural language sequence. It can detect sequential and quantitative log anomalies simultaneously, which no earlier work had done.
- LogRobust: LogRobust extracts semantic information of log events and represents them as semantic vectors. It utilizes an attention-based Bi-LSTM model to detect anomalous log sequences.
- HitAnomaly: HitAnomaly is a log-based anomaly detection model that uses a hierarchical transformer structure to model both log template sequences and parameter values.
4.1.3. Implementation
4.1.4. Evaluation Metrics
4.2. Evaluation on BGL Dataset
4.3. Evaluation on HDFS Dataset
4.3.1. Experiment Result
4.3.2. Analysis of ConAnomaly
4.3.3. Experiment Based on the Unseen Logs
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
Abbreviations
log2vec | log sequence encoder |
LSTM | Long Short Term Memory Network |
PCA | Principal component analysis |
word2vec | Word-to-vector |
BGL | BlueGene/L |
HDFS | Hadoop Distributed File System |
LCS | Longest common subsequence |
dLCE | Distributional lexical-contrast embedding model |
template2vec | template-to-vector |
NLP | Natural language processing |
SMOTE | Synthetic Minority Oversampling Technique |
The Name of the Part of Speech Tag | Meaning |
---|---|
‘CC’ | coordinating conjunction |
‘TO’ | ‘to’ |
‘IN’ | preposition/subordinating conjunction |
‘MD’ | modal (could, will) |
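A part-of-speech filter over these tags can be sketched as follows. Two things are assumptions here: that the four tags in the table are the ones to be dropped (the surviving text does not state this explicitly), and that tokens arrive already tagged, for example by an external POS tagger. The name `filter_by_pos` and the sample message are hypothetical.

```python
from typing import List, Tuple

# Tags from the table above; treating them as the tags to drop is an assumption.
FILTERED_TAGS = {"CC", "TO", "IN", "MD"}


def filter_by_pos(tagged: List[Tuple[str, str]]) -> List[str]:
    """Keep only the tokens whose part-of-speech tag is not in FILTERED_TAGS."""
    return [word for word, tag in tagged if tag not in FILTERED_TAGS]


# Hand-tagged toy message (Penn Treebank tags); a real pipeline would run a tagger.
tagged_log = [
    ("failed", "VBD"), ("to", "TO"), ("connect", "VB"),
    ("and", "CC"), ("retry", "VB"), ("on", "IN"), ("node", "NN"),
]
kept = filter_by_pos(tagged_log)
```

Filtering before vectorization means fewer tokens reach the embedding step, which is where the saving in computation would come from.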
Sequence ID | Log Sequence Based on the Fixed Window |
---|---|
s1 | 22 12 22 12 22 12 22 22 12 22 12 22 12 22 12 22 12 12 12 22 |
s2 | 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 |
s3 | 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 |
s4 | 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 |
s5 | 189 189 189 189 189 189 189 189 189 189 189 189 189 189 189 189 189 189 189 189 |
s6 | 201 149 201 149 201 149 201 149 201 149 201 149 201 149 201 149 201 149 201 149 |
s7 | 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 |
s8 | 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 168 |
s9 | 305 305 305 305 305 305 305 305 305 305 305 305 305 305 305 305 305 305 305 305 |
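Each row in the table above contains 20 template IDs. A fixed-window split of this kind can be sketched as below; the function name `fixed_windows`, the decision to drop an incomplete trailing window, and the toy stream are assumptions rather than details taken from the paper.

```python
from typing import List, Sequence


def fixed_windows(event_ids: Sequence[int], size: int = 20) -> List[List[int]]:
    """Split a stream of template IDs into consecutive, non-overlapping windows.

    A trailing remainder shorter than `size` is dropped (an assumption).
    """
    full = len(event_ids) - len(event_ids) % size
    return [list(event_ids[i:i + size]) for i in range(0, full, size)]


# Toy stream: 20 alternating IDs, 20 repeated IDs, then 5 leftover IDs.
stream = [201, 149] * 10 + [168] * 20 + [305] * 5
windows = fixed_windows(stream)
```

Unlike a sliding window, consecutive fixed windows share no entries, so a stream of n events yields at most n // 20 sequences.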
The Number in Table 3 | The Log Template It Represents |
---|---|
12 | (.*) microseconds spent in the rbs signal handler during (.*) calls. (.*) microseconds was the maximum time for a single instance of a correctable ddr. |
22 | (.*) total interrupts. (.*) critical input interrupts. (.*) microseconds total spent on critical input interrupts |
149 | external input interrupt (.*) (.*) (.*) tree receiver (.*) in resynch mode |
168 | gister: machine state register: machine state register: machine state register: machine state register: machine state register: |
189 | interrupt threshold...0 |
201 | Lustre mount FAILED:(.*):point /p/gb1 |
305 | program interrupt: unimplemented operation..0 |
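The `(.*)` placeholders in these templates behave like regular-expression wildcards. The following sketch shows one way a raw message could be mapped back to a template ID; it is an illustration only, not the parser used in the paper, and the helper names and the sample message are hypothetical.

```python
import re
from typing import Dict, Optional

WILDCARD = "(.*)"


def template_to_regex(template: str) -> "re.Pattern[str]":
    """Escape the literal text and turn each (.*) placeholder into a capture group."""
    literal_parts = [re.escape(part) for part in template.split(WILDCARD)]
    return re.compile("^" + "(.*)".join(literal_parts) + "$")


def match_template(message: str, templates: Dict[int, str]) -> Optional[int]:
    """Return the ID of the first matching template, or None for an unseen message."""
    for template_id, template in templates.items():
        if template_to_regex(template).match(message):
            return template_id
    return None


# Two templates copied from the table above.
templates = {
    201: "Lustre mount FAILED:(.*):point /p/gb1",
    189: "interrupt threshold...0",
}
# Hypothetical message; the node name is invented for illustration.
hit = match_template("Lustre mount FAILED:node-17:point /p/gb1", templates)
miss = match_template("kernel panic", templates)
```

Escaping the literal parts matters because templates such as `interrupt threshold...0` contain characters that are regex metacharacters. A message that matches no template returns `None`, which corresponds to the unseen-template case discussed in Section 2.4.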
Percentage of Data | Divided by Timestamp | Divided Randomly |
---|---|---|
0.1 | 4232 | 4097 |
0.2 | 8601 | 6275 |
0.3 | 12,344 | 7873 |
0.4 | 13,500 | 9451 |
0.5 | 13,835 | 10,660 |
0.6 | 14,503 | 11,855 |
0.7 | 14,959 | 12,938 |
0.8 | 15,212 | 13,959 |
0.9 | 15,547 | 14,846 |
1 | 15,802 | 15,802 |
| | 1% | 10% | 20% | 50% |
---|---|---|---|---|
# in training | 991 | 4201 | 6296 | 10,778 |
# in testing | 15,719 | 14,878 | 13,896 | 10,549 |
# unseen in training | 14,811 | 11,601 | 9506 | 5024 |
F1-score | 0.95 | 0.97 | 0.98 | 0.98 |
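The three count rows in this table are related by simple set operations. The sketch below shows how such counts could be derived for a given train/test split; what exactly is being counted is not stated in the surviving text, so the use of tuples to stand for log sequences, the function name `unseen_stats`, and the toy data are assumptions.

```python
from typing import Hashable, Iterable, Tuple


def unseen_stats(train: Iterable[Hashable], test: Iterable[Hashable]) -> Tuple[int, int, int]:
    """Return (# distinct in training, # distinct in testing, # in testing but not in training)."""
    train_set, test_set = set(train), set(test)
    return len(train_set), len(test_set), len(test_set - train_set)


# Toy data: tuples stand in for log sequences (hypothetical values).
train_items = [(22, 12), (168, 168), (22, 12)]
test_items = [(168, 168), (189, 189), (201, 149)]
n_train, n_test, n_unseen = unseen_stats(train_items, test_items)
```

Read this way, the 1% column says that 14,811 of the 15,719 items in testing never appeared in training, yet the reported F1-score is still 0.95.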
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lv, D.; Luktarhan, N.; Chen, Y. ConAnomaly: Content-Based Anomaly Detection for System Logs. Sensors 2021, 21, 6125. https://doi.org/10.3390/s21186125