Anomaly Detection in Log Files Using Selected Natural Language Processing Methods

Ryciak, Piotr; Wasielewska, Katarzyna; Janicki, Artur

doi:10.3390/app12105089


by Piotr Ryciak *, Katarzyna Wasielewska and Artur Janicki

Faculty of Electronics and Information Technology, Warsaw University of Technology, Nowowiejska 15/19, 00-665 Warsaw, Poland

* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(10), 5089; https://doi.org/10.3390/app12105089

Submission received: 21 April 2022 / Revised: 16 May 2022 / Accepted: 17 May 2022 / Published: 18 May 2022

(This article belongs to the Special Issue Soft Computing Application to Engineering Design)


Abstract

In this article, we address the problem of detecting anomalies in system log files. Computer systems generate huge numbers of events, which are recorded in event log files. While most entries report normal actions, an unusual entry may indicate a failure or a malware infection. A human operator may easily miss such an entry; therefore, anomaly detection methods are used for this purpose. In our work, we used an approach known from the natural language processing (NLP) domain that operates on so-called embeddings, that is, vector representations of words or phrases. We describe an improved version of the LogEvent2Vec algorithm, proposed in 2020. In contrast to the original version, we propose significantly shortening the analysis window, which both increased the accuracy of anomaly detection and made further analysis of suspicious sequences much easier. We experimented with various binary classifiers, such as decision trees and multilayer perceptrons (MLPs), on the Blue Gene/L dataset. We showed that selecting an optimal classifier (in this case, an MLP) and a short log sequence gave very good results. The improved version of the algorithm yielded a best F1-score of 0.997, compared to 0.886 for the original version of the algorithm.

Keywords: log analysis; natural language processing; anomaly detection; malware; word embeddings; fastText
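
To illustrate the idea of representing a short log sequence by embeddings, the following minimal Python sketch splits a stream of log events into short windows and averages the event vectors in each window. It is an illustration under stated assumptions, not the implementation evaluated in the article: the event identifiers, the two-dimensional toy embeddings, the window length of 2, the non-overlapping windows, and the use of a plain mean are all hypothetical choices made for readability. In practice the embeddings would be learned from the logs, and the resulting window vectors would be passed to a binary classifier such as an MLP.

```python
from typing import Dict, List, Sequence

# Hypothetical toy embeddings: one small vector per log event template.
# In a real setting these would be learned from the log data.
EMBEDDINGS: Dict[str, List[float]] = {
    "E1": [1.0, 0.0],
    "E2": [0.0, 1.0],
    "E3": [4.0, 4.0],  # stands in for a rare, unusual event
}


def split_into_windows(events: Sequence[str], window: int) -> List[List[str]]:
    """Split the event stream into consecutive, non-overlapping windows."""
    return [list(events[i:i + window]) for i in range(0, len(events), window)]


def window_vector(events: Sequence[str],
                  embeddings: Dict[str, List[float]]) -> List[float]:
    """Represent a window as the element-wise mean of its event embeddings."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for event in events:
        for k, value in enumerate(embeddings[event]):
            total[k] += value
    return [value / len(events) for value in total]


# Assumed example stream of parsed log events (identifiers are made up).
log_events = ["E1", "E2", "E1", "E3", "E2", "E2"]
windows = split_into_windows(log_events, window=2)
features = [window_vector(w, EMBEDDINGS) for w in windows]
```

With a short window, each feature vector covers only a few log entries, so a window flagged by the classifier points the operator to a small set of lines to inspect.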

