# Anomaly Detection in Log Files Using Selected Natural Language Processing Methods

## Abstract

## 1. Introduction

## 2. Related Work

## 3. Materials and Methods

#### 3.1. Parsing Log Data

**Listing 1.** `CE sym 16, at 0x0456cd40, mask 0x04`
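To illustrate what parsing does to a message such as the one in Listing 1, the following is a minimal sketch that separates the constant part of a log message (the template) from its variable parameters. It is a simplified regular-expression stand-in and not the Drain3 algorithm; the `<*>` wildcard token, the masking pattern, and the function names are assumptions made for this example only.

```python
import re

# Hypothetical masking rule: hexadecimal literals and plain integers are
# treated as variable parameters. The hex alternative comes first so that
# "0x04" is consumed as one token.
PARAMETER_PATTERN = re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+\b")
WILDCARD = "<*>"  # assumed placeholder token for a masked parameter


def to_template(message: str) -> str:
    """Replace every variable-looking token with the wildcard."""
    return PARAMETER_PATTERN.sub(WILDCARD, message)


def extract_parameters(message: str) -> list:
    """Return the tokens that to_template() would mask, in order."""
    return PARAMETER_PATTERN.findall(message)


if __name__ == "__main__":
    sample = "CE sym 16, at 0x0456cd40, mask 0x04"
    print(to_template(sample))         # CE sym <*>, at <*>, mask <*>
    print(extract_parameters(sample))  # ['16', '0x0456cd40', '0x04']
```

A tree-based parser such as Drain3 learns templates from the data instead of relying on fixed patterns, so this sketch only conveys the shape of the output.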

**Listing 2.**

#### 3.2. Vectorization of Text Data

#### 3.3. Classification Algorithms for Anomaly Detection

#### 3.4. Overview of the Anomaly Detection Process

## 4. Experiments

#### 4.1. Dataset

**Listing 3.**

#### 4.2. Experimental Environment

- Modified Drain3 (Version 0.9.9) [37] log-parser module;
- Re [41]—regular expression library;
- NumPy (Version 1.21.5) [42]—module for scientific computing;
- Pandas (Version 1.4.1) [43]—Python Data Analysis Library;
- fastText module (Version 0.9.2) [44], providing an implementation of the feature extraction model;
- Scikit-learn (Version 1.0.2) [45], providing an implementation of the classification algorithms.

- GNB—with default configuration;
- DT—with Gini impurity as a measure of the quality of a split;
- RF—with 100 trees in the forest and the Gini impurity as a measure of the quality of a split;
- AB—with 100 estimators, SAMME.R real boosting algorithm, and learning rate of 1.0;
- XB—with logistic regression for binary classification, negative log-likelihood evaluation metrics for validation data, and configured weight scaling as the sum of negative instances divided by the sum of positive instances;
- MLP—with 100 neurons in the hidden layer, the Adam optimizer, and rectified linear unit (ReLU) as the activation function.
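The configurations listed above could be expressed with Scikit-learn roughly as follows. This is a sketch under stated assumptions, not the authors' code: the `random_state` argument and the helper names are introduced here for illustration, and the `algorithm` argument of AdaBoost is left at the library default because SAMME.R was the default in Scikit-learn 1.0.2 while newer releases have deprecated that option. XB is not instantiated, so that the example depends only on Scikit-learn; with the xgboost package the described settings would correspond to `objective="binary:logistic"`, `eval_metric="logloss"`, and a `scale_pos_weight` computed as in the helper below.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(random_state=0):
    """Instantiate the Scikit-learn classifiers with the listed settings.

    random_state is an assumption added for reproducibility.
    """
    return {
        "GNB": GaussianNB(),  # default configuration
        "DT": DecisionTreeClassifier(criterion="gini", random_state=random_state),
        "RF": RandomForestClassifier(
            n_estimators=100, criterion="gini", random_state=random_state
        ),
        # algorithm is left at the installed version's default (see lead-in).
        "AB": AdaBoostClassifier(
            n_estimators=100, learning_rate=1.0, random_state=random_state
        ),
        "MLP": MLPClassifier(
            hidden_layer_sizes=(100,),
            activation="relu",
            solver="adam",
            random_state=random_state,
        ),
    }


def scale_pos_weight(labels):
    """Weight scaling for XB: negative instances divided by positive ones.

    Labels are assumed to be 1 for anomalous and 0 for normal entries.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("at least one positive instance is required")
    return negatives / positives
```

For example, `scale_pos_weight([0, 0, 0, 1])` returns `3.0`, which up-weights the rare anomalous class during boosting.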

#### 4.3. Evaluation Metrics

- True positives (TP)—number of correctly predicted malicious log entries;
- False positives (FP)—number of normal log entries classified as anomalies (false alarms);
- True negatives (TN)—number of correctly predicted normal log entries;
- False negatives (FN)—number of malicious log entries classified as normal log entries.

- Precision, which indicates what percentage of all positive predictions were actually malicious log entries: $$Precision=\frac{TP}{TP+FP}$$
- Recall, which is a fraction of true positives among all malicious log entries: $$Recall=\frac{TP}{TP+FN}$$
- F1-score, which considers both precision and recall and is their harmonic mean: $$F1=2\cdot\frac{Precision\cdot Recall}{Precision+Recall}$$
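The three metrics can be computed directly from the confusion-matrix counts defined above. The sketch below uses plain Python; the counts in the usage example are hypothetical and do not come from the experiments, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); 0.0 when no positive predictions were made."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); 0.0 when there are no malicious entries."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    tp, fp, fn = 90, 10, 30
    p = precision(tp, fp)  # 0.9
    r = recall(tp, fn)     # 0.75
    print(p, r, round(f1_score(p, r), 4))
```

With these counts, 10 false alarms lower precision to 0.9 and 30 missed anomalies lower recall to 0.75; the F1-score lies between the two and closer to the smaller value.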

## 5. Results

#### 5.1. Feature Extraction Results

#### 5.2. Impact of Sequence Length

#### 5.3. Anomaly Detection Results for Optimal Sequence Length

## 6. Discussion

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Xu, W.; Huang, L.; Fox, A.; Patterson, D.; Jordan, M.I. Detecting Large-Scale System Problems by Mining Console Logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles; Association for Computing Machinery: New York, NY, USA, 2009; pp. 117–132.
2. Oliner, A.; Ganapathi, A.; Xu, W. Advances and Challenges in Log Analysis. Commun. ACM **2012**, 55, 55–61.
3. Svacina, J.; Raffety, J.; Woodahl, C.; Stone, B.; Cerny, T.; Bures, M.; Shin, D.; Frajtak, K.; Tisnovsky, P. On Vulnerability and Security Log Analysis: A Systematic Literature Review on Recent Trends. In Proceedings of the International Conference on Research in Adaptive and Convergent Systems; Association for Computing Machinery: New York, NY, USA, 2020; pp. 175–180.
4. He, S.; He, P.; Chen, Z.; Yang, T.; Su, Y.; Lyu, M.R. A Survey on Automated Log Analysis for Reliability Engineering. ACM Comput. Surv. **2021**, 54, 1–37.
5. Müller, A.; Münz, G.; Carle, G. Collecting router information for error diagnosis and troubleshooting in home networks. In Proceedings of the 2011 IEEE 36th Conference on Local Computer Networks, Bonn, Germany, 4–7 October 2011; pp. 764–769.
6. Brandao, A.; Georgieva, P. Log Files Analysis For Network Intrusion Detection. In Proceedings of the 2020 IEEE 10th International Conference on Intelligent Systems (IS), Varna, Bulgaria, 28–30 August 2020; pp. 328–333.
7. He, S.; Zhu, J.; He, P.; Lyu, M.R. Experience Report: System Log Analysis for Anomaly Detection. In Proceedings of the 2016 IEEE 27th International Symposium on Software Reliability Engineering (ISSRE), Ottawa, ON, Canada, 23–27 October 2016; pp. 207–218.
8. Savitha, K.S.; Ms, V. Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies. Int. J. Adv. Comput. Sci. Appl. **2014**, 5, 137–142.
9. Wang, J.; Tang, Y.; He, S.; Zhao, C.; Sharma, P.; Alfarraj, O.; Tolba, A. LogEvent2vec: LogEvent-to-Vector Based Anomaly Detection for Large-Scale Logs in Internet of Things. Sensors **2020**, 20, 2451.
10. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly Detection: A Survey. ACM Comput. Surv. **2009**, 41, 1–58.
11. Grace, L.K.J.; Maheswari, V.; Nagamalai, D. Web Log Data Analysis and Mining. In Advanced Computing; Meghanathan, N., Kaushik, B.K., Nagamalai, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 459–469.
12. Breier, J.; Branišová, J. Anomaly Detection from Log Files Using Data Mining Techniques. In Information Science and Applications; Kim, K.J., Ed.; Springer: Berlin/Heidelberg, Germany, 2015; pp. 449–457.
13. Zhang, S.; Zhang, Y.; Chen, Y.; Dong, H.; Qu, X.; Song, L.; Liu, Y.; Meng, W.; Luo, Z.; Bu, J.; et al. PreFix: Switch Failure Prediction in Datacenter Networks. ACM Sigmetrics Perform. Eval. Rev. **2018**, 2, 1–29.
14. Khatuya, S.; Ganguly, N.; Basak, J.; Bharde, M.; Mitra, B. ADELE: Anomaly Detection from Event Log Empiricism. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM 2018), Honolulu, HI, USA, 16–19 April 2018; pp. 2114–2122.
15. Bertero, C.; Roy, M.; Sauvanaud, C.; Tredan, G. Experience Report: Log Mining Using Natural Language Processing and Application to Anomaly Detection. In Proceedings of the 2017 IEEE 28th International Symposium on Software Reliability Engineering (ISSRE), Toulouse, France, 23–26 October 2017.
16. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv **2013**, arXiv:1301.3781.
17. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
18. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. **2017**, 5, 135–146.
19. Meng, W.; Liu, Y.; Huang, Y.; Zhang, S.; Zaiter, F.; Chen, B.; Pei, D. A Semantic-aware Representation Framework for Online Log Analysis. In Proceedings of the 2020 29th International Conference on Computer Communications and Networks (ICCCN), Honolulu, HI, USA, 3–6 August 2020; pp. 1–7.
20. Li, K.L.; Huang, H.K.; Tian, S.F.; Xu, W. Improving one-class SVM for anomaly detection. In Proceedings of the International Conference on Machine Learning and Cybernetics (ICLMC), Xi’an, China, 5 November 2003; Volume 5, pp. 3077–3081.
21. Zhang, W.; Chen, L. Web Log Anomaly Detection Based on Isolated Forest Algorithm. In Proceedings of the IEEE 14th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Dalian, China, 14–16 November 2019; pp. 755–759.
22. Henriques, J.; Caldeira, F.; Cruz, T.; Simões, P. Combining K-Means and XGBoost Models for Anomaly Detection Using Log Datasets. Electronics **2020**, 9, 1164.
23. Ying, S.; Wang, B.; Wang, L.; Li, Q.; Zhao, Y.; Shang, J.; Huang, H.; Cheng, G.; Yang, Z.; Geng, J. An Improved KNN-Based Efficient Log Anomaly Detection Method with Automatically Labeled Samples. ACM Trans. Knowl. Discov. Data **2021**, 15, 1–22.
24. Du, M.; Li, F.; Zheng, G.; Srikumar, V. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017.
25. Chen, Z.; Liu, J.; Gu, W.; Su, Y.; Lyu, M.R. Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection. arXiv **2021**, arXiv:2107.05908.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
27. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv **2015**, arXiv:1508.01991.
28. Chen, Y.; Luktarhan, N.; Lv, D. LogLS: Research on System Log Anomaly Detection Method Based on Dual LSTM. Symmetry **2022**, 14, 454.
29. Guo, H.; Yuan, S.; Wu, X. LogBERT: Log Anomaly Detection via BERT. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv **2018**, arXiv:1810.04805.
31. Le, V.H.; Zhang, H. Log-based Anomaly Detection without Log Parsing. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia, 15–19 November 2021; pp. 492–504.
32. Duan, X.; Ying, S.; Yuan, W.; Cheng, H.; Yin, X. QLLog: A log anomaly detection method based on Q-learning algorithm. Inf. Process. Manag. **2021**, 58, 102540.
33. Chen, R.; Zhang, S.; Li, D.; Zhang, Y.; Guo, F.; Meng, W.; Pei, D.; Zhang, Y.; Chen, X.; Liu, Y. LogTransfer: Cross-System Log Anomaly Detection for Software Systems with Transfer Learning. In Proceedings of the IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), Coimbra, Portugal, 12–15 October 2020; pp. 37–47.
34. Yadav, R.B.; Kumar, P.S.; Dhavale, S.V. A Survey on Log Anomaly Detection using Deep Learning. In Proceedings of the 8th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 4–5 June 2020; pp. 1215–1220.
35. He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An Online Log Parsing Approach with Fixed Depth Tree. In Proceedings of the IEEE International Conference on Web Services (ICWS), Honolulu, HI, USA, 25–30 June 2017.
36. Usenix. The HPC4 Data. Available online: https://www.usenix.org/cfdr-data#hpc4 (accessed on 20 February 2022).
37. IBM. Drain3. Available online: https://github.com/IBM/Drain3 (accessed on 10 January 2022).
38. Kim, E. Optimize Computational Efficiency of Skip-Gram with Negative Sampling. Available online: https://aegis4048.github.io/optimize_computational_efficiency_of_skip-gram_with_negative_sampling (accessed on 13 February 2022).
39. Rong, X. word2vec Parameter Learning Explained. arXiv **2016**, arXiv:1411.2738.
40. Oliner, A.; Stearley, J. What Supercomputers Say: A Study of Five System Logs. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), Edinburgh, UK, 25–28 June 2007; pp. 575–584.
41. Python Software Foundation. Re. Available online: https://docs.python.org/3/library/re.html (accessed on 15 January 2022).
42. Open-Source Python Library. Numpy. Available online: https://numpy.org/about/ (accessed on 15 January 2022).
43. McKinney, W. Pandas. Available online: https://pandas.pydata.org/ (accessed on 15 January 2022).
44. Facebook. fastText. Available online: https://fasttext.cc/ (accessed on 19 January 2022).
45. Cournapeau, D. Scikit-Learn. Available online: https://scikit-learn.org/ (accessed on 20 January 2022).
46. Rathore, M. Comparison of FastText and Word2Vec. Available online: https://markroxor.github.io/gensim/static/notebooks/Word2Vec_FastText_Comparison.html (accessed on 8 January 2022).
47. He, S.; Zhu, J.; He, P.; Lyu, M.R. Loghub: A Large Collection of System Log Datasets towards Automated Log Analytics. arXiv **2020**, arXiv:2008.06448.

**Figure 2.** CBOW model architecture, based on [39].

Sequence Length | No. of Sequences, Training Set (80%) | No. of Sequences, Testing Set (20%) |
---|---|---|
3 | 1,266,123 | 316,531 |
5 | 759,674 | 189,918 |
10 | 379,837 | 94,959 |
15 | 253,224 | 63,306 |
20 | 189,918 | 47,480 |
30 | 126,612 | 31,653 |
50 | 75,967 | 18,992 |
100 | 37,983 | 9,496 |
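The counts in the table above are consistent with cutting the log into non-overlapping sequences of fixed length and assigning 20% of them to the testing set. The sketch below reproduces that arithmetic under assumptions that are not stated in this section: the total of 4,747,963 log entries is inferred from the table totals (any value from 4,747,962 to 4,747,964 gives the same counts), incomplete trailing sequences are dropped, and the testing-set size is rounded to the nearest integer. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

TOTAL_ENTRIES = 4_747_963  # assumption inferred from the table totals
TEST_FRACTION = 0.2        # 80%/20% split, as in the table header


def make_sequences(entries: Sequence, length: int) -> List[list]:
    """Cut entries into non-overlapping sequences; drop the incomplete tail."""
    n_full = len(entries) // length
    return [list(entries[i * length:(i + 1) * length]) for i in range(n_full)]


def split_counts(total_entries: int, length: int,
                 test_fraction: float = TEST_FRACTION) -> Tuple[int, int]:
    """Return (training, testing) sequence counts for a sequence length."""
    n_sequences = total_entries // length
    n_test = round(n_sequences * test_fraction)  # assumed rounding rule
    return n_sequences - n_test, n_test


if __name__ == "__main__":
    for length in (3, 5, 10, 15, 20, 30, 50, 100):
        n_train, n_test = split_counts(TOTAL_ENTRIES, length)
        print(f"{length:>3} | {n_train:>9,} | {n_test:>7,}")
```

Shorter sequences yield more training samples from the same log, which is one practical factor to keep in mind when comparing results across sequence lengths.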

Classifier | PR AUC | ROC AUC | Precision | Recall | F1-Score |
---|---|---|---|---|---|
GNB | 0.7366 | 0.9682 | 0.4226 | 0.9472 | 0.5805 |
DT | 0.9833 | 0.9910 | 0.9637 | 0.9879 | 0.9757 |
XB | 0.9937 | 0.9937 | 0.9549 | 0.9974 | 0.9757 |
RF | 0.9969 | 0.9933 | 0.9986 | 0.9890 | 0.9937 |
AB | 0.9926 | 0.9939 | 0.9587 | 0.9859 | 0.9721 |
MLP | 0.9981 | 0.9945 | 0.9942 | 0.9942 | 0.9942 |

**Table 3.** Comparison of evaluation metrics with LogEvent2Vec (W denotes the window length for which the best results were obtained).

Classifier | F1-Score, LogEvent2Vec | F1-Score, Our Approach | AUC, LogEvent2Vec | AUC, Our Approach |
---|---|---|---|---|
GNB | 0.778 (W = 5000) | 0.612 (W = 10) | 0.929 (W = 5000) | 0.973 (W = 10) |
RF | 0.886 (W = 5000) | 0.994 (W = 5) | 0.959 (W = 5000) | 0.993 (W = 5) |
MLP | 0.829 (W = 5000) | 0.994 (W = 5) | 0.911 (W = 5000) | 0.994 (W = 5) |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ryciak, P.; Wasielewska, K.; Janicki, A.
Anomaly Detection in Log Files Using Selected Natural Language Processing Methods. *Appl. Sci.* **2022**, *12*, 5089.
https://doi.org/10.3390/app12105089