LogEvent2vec: LogEvent-to-Vector Based Anomaly Detection for Large-Scale Logs in Internet of Things

Log anomaly detection is an efficient method to manage modern large-scale Internet of Things (IoT) systems. More and more works apply natural language processing (NLP) methods, in particular word2vec, to log feature extraction. Word2vec can extract the relevance between words and vectorize the words. However, the computational cost of training word2vec is high. Anomalies in logs depend not only on an individual log message but also on the log message sequence. Therefore, the word vectors from word2vec cannot be used directly; they need to be transformed into vectors of log events and further into vectors of log sequences. To reduce the computational cost and avoid multiple transformations, in this paper we propose an offline feature extraction model, named LogEvent2vec, which takes log events as input of word2vec to extract the relevance between log events and vectorize log events directly. LogEvent2vec can work with any coordinate transformation method and anomaly detection model. After getting the log event vectors, we transform them into log sequence vectors by bary or tf-idf, and three kinds of supervised models (Random Forests, Naive Bayes, and Neural Networks) are trained to detect anomalies. We have conducted extensive experiments on a real public log dataset from BlueGene/L (BGL). The experimental results demonstrate that LogEvent2vec can reduce computational time by a factor of 30 and improve accuracy compared with word2vec. LogEvent2vec with bary and Random Forest achieves the best F1-score, and LogEvent2vec with tf-idf and Naive Bayes needs the least computational time.


Introduction
Internet of Things (IoT) [1,2] has provided the possibility of easily deploying tiny, cheap, available, and durable devices, which are able to collect various data in real time, with continuous supply [3][4][5][6][7]. IoT devices are vulnerable and usually deployed in harsh and extreme natural environments, thus solutions that can improve monitoring services and the security of IoT devices are needed [8][9][10]. Most smart objects can accumulate log data obtained through sensors during operation. The logs record the states and events of the devices and systems, thus providing a valuable source of information which can be exploited both for research and industrial purposes. The reason is that a large amount of log data stored in such devices can be analyzed to observe user behavior patterns or detect errors in the system. Based on log analysis, better IoT solutions can be developed or updated and presented to the user [11]. Therefore, logs are one of the most valuable data sources for device management, root cause analysis, and IoT solutions updating. Log analysis plays an important role in IoT system management to ensure the reliability of IoT services [12]. Log anomaly detection is a part of log analysis that analyzes the log messages to detect the anomalous state caused by sensor hardware failure, energy exhaustion, or the environment [13].
Logs are semi-structured textual data. An important task is anomaly detection in logs [14], which differs from classification and detection in computer vision [15][16][17][18], digital time series [19][20][21][22][23], and graph data [24]. The traditional ways of dealing with anomalies in logs are very inefficient: operators manually check the system log with regular expression matching or keyword searching (for example, "failure" or "kill") based on their domain knowledge. However, this kind of manual anomaly detection is not applicable to large-scale systems.
Many existing works propose schemes to process the logs automatically. Log messages are free-form texts and semi-structured data which should turn into structured data for further analysis. Log parsing [25][26][27] extracts the structured or constant part from log messages. The constant part is named by the log template or log event. For example, a log message is "CE sym 2, at 0x0b85eee0, mask 0x05". The log event of the log message is "CE sym < * >, at < * >, mask < * >".
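As a minimal illustration of what log parsing produces (a sketch only, not the Drain algorithm itself), the variable tokens of the example message can be replaced by the "< * >" wildcard with two hypothetical regular expressions:

```python
import re

# Hypothetical sketch (not a real log parser): derive a template by replacing
# variable tokens (hex addresses, integers) with the "< * >" wildcard.
def to_template(message: str) -> str:
    t = re.sub(r"0x[0-9a-fA-F]+", "< * >", message)  # hex parameters
    t = re.sub(r"\b\d+\b", "< * >", t)               # decimal parameters
    return t

print(to_template("CE sym 2, at 0x0b85eee0, mask 0x05"))
# CE sym < * >, at < * >, mask < * >
```

Real parsers such as Drain cluster messages before templating, but the output format is the same constant-plus-wildcard log event.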
Although log events are structured, they are still text data. Most machine learning models for anomaly detection are not able to handle text data directly. Therefore, to extract features of the log event or derive a digital representation of it is a core step. According to the feature extraction results, several machine learning models are used for anomaly detection, such as Regression, Random Forest, Clustering, Principal Component Analysis (PCA), and Independent Component Analysis (ICA) [28]. At first, many statistical features of log event [29,30] are extracted, such as sequence, frequency, surge, seasonality, event ratio, mean inter-arrival time, mean inter-arrival distance, severity spread, and time-interval spread.
More and more works apply natural language processing (NLP) methods to log event vectorization, such as bag-of-words [31], term frequency-inverse document frequency (tf-idf) [32,33], and word2vec [34,35]. Most of these works are based on words, while anomalies in logs mostly depend on the log message sequence. Meng et al. [32] form the log event vector from the frequency and weights of words; the log event vector is then transformed into the log sequence vector as the input of the anomaly detection model. The transformation from word vector to log event vector or log sequence vector is called coordinate transformation. However, the frequency and weight of words ignore the relevance between words. Bertero et al. [34] detect anomalies based on the word vectors from word2vec [36], which is an efficient method to extract the relevance between words. The word vector is converted into the log event vector, and then the log event vector is converted into the log sequence vector before anomaly detection. However, the computational cost of training word2vec is high, and the word vectors need to be transformed twice.
As systems become increasingly complex, there is a large amount of log data. The number of words in each log message ranges from 10 to 10^2. Processing words directly is not suitable for large-scale log anomaly detection. Therefore, He et al. [31] propose to count the occurrence number of log events to obtain log sequence vectors directly, making the coordinate transformation unnecessary. In addition, the number of log events is far less than the number of words. Since the length of the vector is based on the number of words or log events, the dimension of the vector is shortened, which further reduces the computational cost. However, the frequency of log events ignores the relevance between log events. Therefore, to extract the relevance between log events, reduce the computational cost, and avoid multiple transformations, we investigate the log anomaly detection problem by word2vec with log events as input. The main contributions can be summarized as follows:
• We propose an offline low-cost feature extraction model, named LogEvent2vec, which first takes log events as input of the word2vec model to vectorize log events directly. The relevance between log events can be extracted by word2vec. Only one coordinate transformation is necessary to get the log sequence vector from the log event vector, which decreases the number of coordinate transformations. Training on log events is more efficient because the number of log events is less than that of words, which reduces the computational cost.
• LogEvent2vec can work with any coordinate transformation method and anomaly detection model. After getting the log event vector, the log event vector is transformed into the log sequence vector by bary or tf-idf. Three kinds of supervised models (Random Forests, Naive Bayes, and Neural Networks) are trained to detect anomalies.
• We have conducted extensive experiments on a real public log dataset from BlueGene/L (BGL).
The experimental results demonstrate that our proposed LogEvent2vec can reduce computational time by a factor of 30 and improve the accuracy of anomaly detection, compared with word2vec.
• Among different coordinate transformation methods and anomaly detection models, LogEvent2vec with bary and Random Forest achieves the best F1-score, and LogEvent2vec with tf-idf and Naive Bayes needs the least computational time. Tf-idf is weaker than bary in terms of accuracy, but it significantly reduces the computational time.
The rest of the paper is organized as follows. We introduce the related work in Section 2, and present the general framework of log anomaly detection and the formulation of our work in Section 3. We further provide an overview of our scheme, the log parsing, feature extraction, and anomaly detection model in Section 4. Finally, we evaluate the performance of the proposed algorithms through extensive experiments in Section 5 and conclude the work in Section 6.

Related Work
According to the framework of log anomaly detection in Section 3, log anomaly detection consists of several important steps. We review the related works for each step.

Log Parsing
Log parsing extracts the log template or log event from the raw log; a log template records an event occurring in the execution of a system. FT-tree [25] identifies the longest combination of frequently occurring words as a log template. He et al. [26] design and implement a parallel log parser (namely POP) on top of Spark, a large-scale data processing platform; the raw log is divided into constant and variable parts, and identical log events are combined into the same cluster by hierarchical clustering. He et al. also propose an online log parsing method, namely Drain [27], which uses a fixed-depth parse tree to accelerate parsing. He et al. [37] provide tools and benchmarks for automated log parsing.

Feature Extraction
Extracting the feature of logs is the basis of anomaly detection. Zhang et al. [29] propose Prefix to extract four features (sequence, frequency, surge, seasonality) from the log sequence and form a feature matrix. Khatuya et al. [30] select features from system logs, including event count, event ratio, mean inter-arrival time, mean inter-arrival distance, severity spread, and time-interval spread, and transform the log events into score matrix. Liu et al. [38] extract 10 features and compress to two features.
In addition, NLP methods have started to attract researchers' interest for vectorizing the log event, such as bag-of-words [39], tf-idf [40], and word2vec.
He et al. [31] count the occurrence number of each log event to form the event count vector for each log sequence, whose basic idea draws from bag-of-words. Meng et al. [32] propose LogClass which combines a word representation method, named tf-idf, with the Positive-unlabeled (PU) learning model to construct device-agnostic vocabulary with partial labels. Lin et al. [33] propose an approach named LogCluster which turns each log sequence into a vector by Inverse Document Frequency (IDF) and Contrast-based Event Weighting.
Bertero et al. [34] consider logs as regular text and first apply a word embedding technique based on Google's word2vec algorithm, in which logfiles' words are mapped to a high dimensional metric space. Then, the coordinate of the word is transformed into the log event vector, and the coordinate of the log event vector is transformed into the log sequence vector. Meng et al. [35] propose LogAnomaly, a framework to model a log stream as a natural language sequence. They propose a novel, simple feature extraction method, template2vec, to extract the semantic information hidden in log templates by a distributional lexical-contrast embedding model (dLCE) [41]. The word vector is transformed to the log event vector, which is fed into the long short-term memory (LSTM) detection model.
According to the type of anomaly detection model, the word vectors from word2vec need to be formed into the log event vector or the log sequence vector. For example, the log event vector is enough for LSTM [35], while the log sequence vector is needed for Random Forest or Naive Bayes [34]. Table 1 summarizes the NLP methods for log feature extraction. To avoid multiple transformations, the objects of the NLP methods change from words to log events; therefore, this paper handles log events directly.
Ridge regression is used to estimate the abnormal score from the features [30], and the total weight vector obtained by ridge regression expresses the relative importance of different features. Random Forest is used for anomaly detection based on the feature matrix in Prefix [29]. LogClass [32] classifies anomalies in device logs by Random Forest.
LogCluster [33] clusters the logs to ease log-based problem identification, which utilizes a knowledge base to check if the log sequences occurred before. Liu et al. [38] make use of a mixed attribute clustering method k-prototype, which transforms data from 10 features to a new data set to reduce feature dimensions. Then, k-Nearest Neighbor (k-NN) classifier is used to identify the real abnormalities in the new data set, which greatly reduces the calculation scale and time. Loglens [42] is a real-time log analysis system, which clusters log events by similarity measure.
A comparison among six state-of-the-art log-based anomaly detection methods is presented in [31], including three supervised methods (Logistic Regression, Decision Tree, and Support Vector Machine (SVM)) and three unsupervised methods (LogCluster, PCA, Invariant Mining), and an open-source toolkit allowing ease of reuse.
In addition, deep learning methods [43] are applied in log anomaly detection [35]. Deeplog [44] uses LSTM to model a certain type of log key sequence of logs, automatically learns the normal mode from the normal log data, and then judges system exceptions. Refs [45,46] analyze the application of various LSTM models in anomaly detection, such as bidirectional LSTM, stacked LSTM, etc.
In this paper, we show that our feature extraction algorithm can work well with various anomaly detection methods.

General Framework and System Model
In this section, we introduce the general framework of log anomaly detection and the formulation of our work. The general framework of log anomaly detection consists of three steps: log parsing, feature extraction, and anomaly detection, as shown in Figure 1. Table 2 summarizes the notations and definitions used in this paper.  Table 2. List of notations.

L: the log data
N: the number of lines in the log data
E: the set of log events
M: the number of log events
LSE: the set of log sequences
W: the window size, which decides the length of a log sequence
T: the vector space
l_i: the ith log message
p(·): the mapping function of log parsing
p(l_i): the log event of log message l_i
lse_i: the ith log sequence, that is, (p(l_{iW+1}), p(l_{iW+2}), ..., p(l_{iW+W}))
v(e): the vector of log event e
f(lse_i): the vector of log sequence lse_i, that is, f(v(p(l_{iW+1})), v(p(l_{iW+2})), ..., v(p(l_{iW+W})))
y_i: the label of log sequence lse_i

Log Parsing
Logs are semi-structured. A log message can be divided into two parts: a constant part and a variable part (some specific parameters). A log event is the template (constant part) of a log message. To turn semi-structured raw logs into structured data, log parsing extracts a set of templates to record events that occur during the execution of a system. In this paper, we do not distinguish between the log template and the log event.
The log data from a system are denoted by L. The log data contain N lines of log messages. The ith log message is denoted by l i ∈ L, 1 ≤ i ≤ N. Every log message is generated by an application of the system to report an event. Every log message consists of a list of words, similar to a sentence.
The log parsing [27] is used to remove all specific parameters from log messages and extract all the log events. The set of log events is denoted by E, in which the number of log events is M. In this way, each log message is mapped into a log event. Log parsing can be represented by the mapping function p, so the log event of the log message l_i can be described as

p(l_i) ∈ E. (1)

Then, the log data are divided into various chunks, each of which is a log sequence. We assume that a fixed window is used; the window size, denoted by W, decides the length of the log sequences. There are N/W log sequences, and the set of log sequences is denoted by LSE. The ith sequence consists of the W log messages l_{iW+1}, l_{iW+2}, ..., l_{iW+W}. Each log message in a log sequence can be mapped into a log event [47]. As a result, a log sequence can be treated as a list of log events. The log sequence lse_i is denoted by

lse_i = (p(l_{iW+1}), p(l_{iW+2}), ..., p(l_{iW+W})). (2)
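The mapping and fixed-window chunking described above can be sketched as follows; the `parse` mapping and event IDs are illustrative stand-ins for the real mapping function p:

```python
# Sketch: map parsed messages to log events and chunk them into fixed windows
# of size W, so each log sequence becomes a list of log events.
def to_sequences(messages, parse, W):
    events = [parse(m) for m in messages]
    # the i-th sequence holds the events of W consecutive log messages
    return [events[i:i + W] for i in range(0, len(events), W)]

# hypothetical parse function: raw message -> event ID
parse = {"a": "E1", "b": "E2", "c": "E3"}.get
seqs = to_sequences(["a", "b", "c", "c", "a", "b"], parse, W=3)
print(seqs)  # [['E1', 'E2', 'E3'], ['E3', 'E1', 'E2']]
```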

Feature Extraction
Although log events are structured, they still consist of text. Therefore, the log event should be numerically encoded for further anomaly detection. Text of log events can be encoded by NLP models. The list of logs are divided into various chunks, which are log sequences. A feature vector is generated to represent a log sequence.
Word2vec [36] is used to extract features of log events. Generally speaking, word2vec maps words of a text corpus into a Euclidean space. In the Euclidean space, relevant words are close, while irrelevant words are far away.
In our case, we use word2vec to map the log events of log sequences into a Euclidean space. The input of word2vec is a list of log events instead of a list of words. Thus, every log event gets a coordinate, denoted by v(e), e ∈ E, in a vector space T. After mapping each log event, a log sequence can be represented by a function of all its log events' coordinates, which means that each log sequence is also mapped into the vector space. The mappings of log events and log sequences can be represented as two functions: v : E → T and f : LSE → T. According to the definition of a log sequence in Equation (2), the log events of log sequence lse_i are p(l_{iW+1}), p(l_{iW+2}), ..., p(l_{iW+W}). The coordinate of the log event related to the log message l_j is denoted by v(p(l_j)).
Therefore, the coordinates of these log events are v(p(l_{iW+1})), v(p(l_{iW+2})), ..., v(p(l_{iW+W})). By the above procedure, the coordinate of the log sequence depends on all its log events' coordinates. The log sequence lse_i is assigned a coordinate by

f(lse_i) = f(v(p(l_{iW+1})), v(p(l_{iW+2})), ..., v(p(l_{iW+W}))). (3)

Anomaly Detection
All feature vectors of log sequence are the samples, which are trained for machine learning or deep learning models to detect anomaly. Then, the trained model predicts whether a new log sequence is anomalous or not.
A binary classifier c is trained on the coordinates {f(lse_i) | lse_i ∈ LSE} ⊂ T. Such a classifier can be treated as an ideal separation function c : T → [0, 1]. The classifier determines whether a log sequence lse_i is anomalous (label y_i = 1 denotes an anomalous log sequence and y_i = 0 a normal one). When an anomalous event occurs, the log message at that time is labeled anomalous. If any anomalous log message belongs to a log sequence, the whole sequence is labeled as an anomaly; otherwise, the log sequence is normal, i.e., all log messages in it are normal. If a log sequence contains log events that do not occur elsewhere, those events are simply ignored.
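The labeling rule can be sketched as follows, assuming one binary label per log message:

```python
# Sketch of the labeling rule: a log sequence is anomalous (y_i = 1) as soon
# as any of its W messages carries an anomaly label, otherwise normal (y_i = 0).
def label_sequences(message_labels, W):
    return [int(any(message_labels[i:i + W]))
            for i in range(0, len(message_labels), W)]

print(label_sequences([0, 0, 1, 0, 0, 0], W=3))  # [1, 0]
```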

Methodology
The overview of LogEvent-to-vector based log anomaly detection is shown in Figure 2. The first block shows nine raw logs from the BGL dataset. The second block is the log parsing step, which extracts five log events from the raw logs by Drain; each log is mapped into a log event. The third block is the feature extraction step: logs are divided into log sequences by a fixed window, each log event vector is obtained by LogEvent2vec, which takes the log event as the processing object, and the log sequence vector is calculated from all log event vectors in the sequence according to bary or tf-idf. The fourth block is anomaly detection; the anomalies are marked by the red line. Three kinds of supervised models (Random Forests, Naive Bayes, and Neural Networks) are trained to detect anomalies. The detailed process of each step is described below.

Log Parsing

After parsing by Drain [27], the log event is shown in the last row of Table 3. The semi-structured raw log message is converted into structured information: the variable part in the log message is replaced by a wildcard, and the constant part remains unchanged. Each log message has a unique log event and event template. The event template of the third log message is "CE sym < * >, at < * >, mask < * >", with log event E3, as shown in the second block of Figure 2. Similarly, we get five log events E1-E5 in the second block from the nine raw log messages. Each raw log message is mapped into a log event; for example, the first log message is mapped into log event E1, and the second into log event E2.

Feature Extraction
LogEvent2vec takes the log event as input of the word2vec model, and then transforms the log event vector to the log sequence vector. Because the number of log events is far less than the number of words, LogEvent2vec reduces the training cost. In addition, only one coordinate transformation is necessary to get the log sequence vector from the log event vector.

LogEvent2vec: Log Event Training Via Word2vec
Word2vec maps words to vectors and has two model variants, namely the continuous skip-gram model (skip-gram) and the continuous bag-of-words model (cbow) [48]. The training input of the cbow model is the context word vectors of a target word, and the output is the word vector of the target word. Skip-gram is the opposite: the input is the word vector of a target word, and the output is the context word vectors. The cbow model is used in this paper.
The cbow model consists of three layers [49]: input layer, hidden layer, and output layer, as shown in Figure 3. For example, if the corpus is "I drink coffee every day", we can get the embedding of "coffee" from the other four words "I", "drink", "every", and "day", which are taken as input. Similarly, we can get the embeddings of all words.
LogEvent2vec takes the log event as the input of word2vec to get the embedding of each log event in the vector space T, whose dimension is dim(T). If the target is log event p(l_{iW+j}), the remaining log events p(l_{iW+1}), ..., p(l_{iW+j-1}), p(l_{iW+j+1}), ..., p(l_{iW+W}) in the log sequence lse_i are taken as input, as shown in Figure 3. For example, we assume that the fixed window size is 3, as shown in the third block of Figure 2. The nine log messages are divided into three sequences (lse_1, lse_2, lse_3), which are [E1, E2, E3], [E3, E4, E4], and [E5, E3, E1], respectively. LogEvent2vec takes log events E1 and E3 in the first sequence as the input and log event E2 as the output. Similarly, log events E3 and E4 in the second sequence are taken as the input while the target is log event E4, and log events E5 and E1 in the third sequence are taken as the input while the target is log event E3.
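The construction of cbow (context, target) training pairs from a log sequence can be sketched as follows; every other event in the sequence serves as the context of the target event, matching the E1/E3 → E2 example above:

```python
# Sketch: build cbow (context -> target) training pairs per log sequence,
# where all other events in the sequence form the context of the target event.
def cbow_pairs(sequence):
    pairs = []
    for j, target in enumerate(sequence):
        context = sequence[:j] + sequence[j + 1:]
        pairs.append((context, target))
    return pairs

print(cbow_pairs(["E1", "E2", "E3"]))
# [(['E2', 'E3'], 'E1'), (['E1', 'E3'], 'E2'), (['E1', 'E2'], 'E3')]
```

In a full implementation these pairs would be fed to a word2vec trainer with the log events treated as vocabulary items.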
In detail, a one-hot vector with |E| dimensions is used to represent each log event. There are W - 1 one-hot vectors in the input layer, and the output layer is the one-hot vector of the target log event. The hidden layer's dimension is dim(T). After training the model, we can get the embedding of a log event by multiplying its one-hot vector with the weight matrix W_M ∈ R^(|E|×dim(T)). Assuming that the dimension is set to 5, each of the log events E1-E5 obtains a 5-dimensional embedding vector.
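The one-hot lookup can be sketched with a placeholder weight matrix (random values standing in for trained weights):

```python
import numpy as np

# Sketch: retrieving an event embedding as one_hot(e) @ W_M, where W_M is the
# |E| x dim(T) input weight matrix learned by the cbow model. The values here
# are random placeholders, not trained weights.
rng = np.random.default_rng(0)
num_events, dim_T = 5, 5          # |E| = 5 events E1-E5, dim(T) = 5
W_M = rng.standard_normal((num_events, dim_T))

def embedding(event_index):
    one_hot = np.zeros(num_events)
    one_hot[event_index] = 1.0
    # multiplying by the one-hot vector selects row `event_index` of W_M
    return one_hot @ W_M

print(embedding(2))  # embedding of the third log event (E3)
```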

From Log Event Vector to Log Sequence Vector
All log event vectors in space T are produced by LogEvent2vec. To get the log sequence vector in space T, we transform the log event vectors into the log sequence vector by bary or tf-idf:
• Bary defines the vector of a log sequence as the average of all its log event vectors:

f(lse_i) = (1/W) Σ_{j=1}^{W} v(p(l_{iW+j})). (4)

• Tf-idf defines the vector of a log sequence as the weighted average of all its log event vectors. The weight depends on the frequency of log events: a rare log event has a higher weight than a frequent one.
According to bary, the vector of the first log sequence lse_1 is the average position of the vectors of E1, E2, and E3. After transformation, we obtain all log sequence vectors, which form a matrix of size N/W × dim(T).
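A small numerical sketch of the two transformations (illustrative 2-dimensional vectors; the tf-idf weights are hypothetical per-event weights, not computed from the BGL corpus):

```python
import numpy as np

# Sketch of the two coordinate transformations over a sequence's event
# vectors (rows of `vecs`).
def bary(vecs):
    # Equation (4): plain average of the event vectors
    return vecs.mean(axis=0)

def tfidf_vector(vecs, weights):
    # weighted average; rarer events get larger weights
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * vecs).sum(axis=0) / w.sum()

vecs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(bary(vecs))                           # [3. 4.]
print(tfidf_vector(vecs, [1.0, 1.0, 2.0]))  # [3.5 4.5], third event weighs more
```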

Anomaly Detection
Anomaly detection can be treated as a binary classification problem, for which many classifiers are available. In this paper, we use three supervised algorithms to detect anomalies: Random Forests, Naive Bayes, and Neural Networks. The log sequence matrix is the input of the anomaly detection model.
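A minimal sketch of training the three scikit-learn detectors on a log sequence matrix (the data here are synthetic placeholders, not the BGL features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the log sequence matrix: 40 sequences, dim(T) = 20,
# with binary anomaly labels derived from the first feature for the sketch.
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 20))
y = (X[:, 0] > 0).astype(int)

models = {
    "RF": RandomForestClassifier(n_estimators=10, random_state=0),
    "NB": GaussianNB(),
    "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)                  # train on the sequence vectors
    print(name, model.predict(X[:5]))  # predicted anomaly labels
```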

Datasets
To evaluate the performance of our proposed algorithms, we use the BGL dataset from the BlueGene/L supercomputer system at Lawrence Livermore National Labs (LLNL) [50]. Table 4 shows the basic information of the BGL dataset. There are 4,747,963 log messages and 348,460 anomalous log messages in the BGL dataset.

Experimental Setup
All experiments are run on Baidu AI Studio (Beijing, China), which provides a server with an 8-core Intel(R) Xeon(R) Gold 6148 CPU, an NVIDIA Tesla V100 GPU with 16 GB of video memory, and 32 GB of RAM.
After Drain [27] log parsing, we obtain 376 log events. By default, according to [34], the window size of the fixed windows is set to 5000 and the dimension of the vector space dim(T) is set to 20, meaning that the length of each log sequence is 5000. After dividing, there are 943 log sequences. We randomly choose 90% of the log sequences as the training data and the remaining 10% as the testing data. All results are averaged over five runs.
We compare our feature extraction scheme with the method in [34] under different coordinate transformations and anomaly detection models. Both feature extraction schemes can use two kinds of coordinate transformation, bary and tf-idf, and three kinds of supervised methods: Random Forests, Naive Bayes, and Neural Networks. The two feature extraction schemes are described as follows:
• Word [34]: it takes words as input of the word2vec model after removing the non-alphanumeric characters. After getting the word vectors, it performs coordinate transformation twice to get the log file vector.
• LogEvent: our approach takes log events as input of the word2vec model after log parsing. After getting the log event vectors, it performs coordinate transformation once to get the log sequence vector.
As shown in Table 5, we have 12 kinds of schemes. We use three combined characters to represent a scheme. For example, "W-b-NB" means the method in [34] with two bary coordinate transformations and the Naive Bayes anomaly detection model, and "LE-t-NN" means our approach with a tf-idf coordinate transformation and the Neural Networks anomaly detection model. The implementations of tf-idf, Random Forests, Naive Bayes, and Neural Networks are from the scikit-learn (http://scikit-learn.org/) standard library.

Steps: Models
Word2vec input unit: Word / Log event
Coordinate transformation: Bary / Tf-idf
Anomaly detection model: Random Forests / Naive Bayes / Neural Networks

F1-score, Area Under Curve (AUC), and computational time are used to evaluate the anomaly detection methods. F1-score is an index used in statistics to measure the accuracy of a binary classification model; it takes both the precision and the recall of the model into account and can be regarded as the harmonic average of precision and recall:

F1 = 2 × Precision × Recall / (Precision + Recall). (5)

It has its best value at 1 and its worst at 0. AUC is an evaluation index measuring the quality of a binary classification model, which indicates the probability that a positive example is ranked before a negative example. The computational time includes the time of feature extraction, the time of training the anomaly detection model, the time of issuing all predictions on the test set, and the total time. The computational time of feature extraction consists of training word2vec and the coordinate transformations. The total time spans from word2vec training to the anomaly detection model, excluding preprocessing, because the preprocessing differs between our scheme and [34].
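The harmonic-mean form of Equation (5) can be sketched as:

```python
# Sketch: F1 as the harmonic mean of precision and recall (Equation (5)).
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.9))  # ≈ 0.847
```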

Impact of Anomaly Detection Model
To investigate the effect of different anomaly detection models, we analyze the F1-score, AUC, and computational time of Random Forests, Naive Bayes, and Neural Networks, while other parameters are set to the default values and the coordinate transformation method is bary.
The results are shown in Figure 4. In the aspect of the detection model, we can see that the anomaly detection performance of W-b-RF (F1-score = 0.83, AUC = 0.96) and W-b-NN (F1-score = 0.80, AUC = 0.94) is far higher than that of W-b-NB (F1-score = 0.72, AUC = 0.89). In the case of log events as input, the detection performance of LE-b-RF (F1-score = 0.88, AUC = 0.97) and LE-b-NN (F1-score = 0.89, AUC = 0.95) is better than that of LE-b-NB (F1-score = 0.78, AUC = 0.94). Thus, Random Forest and Neural Network perform better as classifiers than Naive Bayes. The reason is that Random Forest is an ensemble of decision trees, in which each independent decision tree processes the samples and predicts an output label. Neural Networks are fully connected: neurons are grouped by layers, each layer processes the data and delivers it to the next, and the last layer is responsible for the prediction. Both detection models therefore consider the relevance between features, whereas the Naive Bayes algorithm assumes that the features are independent. Since there is a certain relevance between the log events in a log sequence, Random Forest and Neural Network outperform Naive Bayes.
In the aspect of the input, the results of anomaly detection using log events as input are better than those using words as input; for example, the AUC score of LE-b-RF is 0.97. The reason is that the representation of the log sequence vector is more accurate in LogEvent2vec. Inputting words to word2vec [34] requires two coordinate transformations to turn word vectors into log sequence vectors, which introduces some bias into the representation of the log sequence vector and affects the final anomaly detection results. Our scheme needs only one coordinate transformation to get the log sequence vector, so the log sequence vector is represented more accurately. LogEvent2vec reduces the number of coordinate transformations and clearly improves the F1-score. The results confirm the rationality of LogEvent2vec.

Figures 5 and 6 show the computational time for the three classifiers to detect anomalies. Figure 5a shows the time of feature extraction, from training word2vec to coordinate transformation, where training word2vec consumes the majority of the time. The time required for feature extraction is highest for Random Forest and lowest for Naive Bayes. The number of words in the BGL dataset is 1,405,168, while the number of log events is only 376, so the training time of word2vec on log events is far less than on words. The experimental results show that LogEvent2vec takes much less time to train word2vec (32.97 s versus 955.33 s). Figures 5b and 6a show the time needed to train the classifier with 848 log sequences and to issue predictions for 95 log sequences; LogEvent2vec and word2vec consume the same time in training the classifier and issuing predictions on the test set. Figure 6b shows the total time from training word2vec to completing anomaly detection. The total time is mainly determined by the time of feature extraction: the less time word2vec training takes, the less the total time. Overall, the experiments show that LogEvent2vec shortens the total time by a factor of 30 compared with word2vec.

Impact of Coordinate Transformation
We analyze the performance of bary and tf-idf, with the other parameters set to their default values. The results are shown in Figure 5a. The computational time of feature extraction in W-b-RF is 959.31 s, while that in W-t-RF is only 347.67 s; tf-idf thus reduces the time consumption of feature extraction. Figures 5b and 6a show the time consumed by training the anomaly detection model and issuing all predictions on the test set; different coordinate transformations have little impact on these times. Figure 6b shows the total time of anomaly detection with tf-idf coordinate transformation, which is still mainly determined by the computational time of feature extraction. Therefore, tf-idf is weaker than bary in terms of accuracy, but it can significantly reduce the computational time.

Impact of the Dimension
To investigate the effect of the dimension of feature space, the number of dimensions is set from 5 to 500 while other parameters are set to the default values.
As shown in Tables 6 and 7, LE-b-NN has the best performance among all schemes when the dimension ranges from 5 to 50, while LE-b-RF performs best when the dimension ranges from 100 to 500.
Although the AUC score of W-b-RF is the highest at 200 dimensions, its F1-score (0.81) is lower than that of LE-b-RF (0.84). The difference in AUC score between LE-b-RF and W-b-RF is 0.008, while the difference in F1-score is 0.03; therefore, LE-b-RF performs better than W-b-RF at 200 dimensions. Generally speaking, using log events as input of the word2vec model yields better anomaly detection than using words as input. The final experimental results also confirm that LogEvent2vec is more effective than word2vec.
Tables 8-11 depict the time of feature extraction, the time of training the anomaly detection model, the time of issuing all predictions on the test set, and the total time with bary coordinate transformation, respectively. No matter which anomaly detection model and feature extraction scheme are used, the time of feature extraction and of training the anomaly detection model increases as the dimension increases; therefore, the total time also increases with the dimension. However, the dimension has little impact on the time of issuing predictions on the test set. Table 9. Computational time of the anomaly detection with different dimensions.

Conclusions
We propose LogEvent2vec, an offline feature extraction approach that takes the log event as the input of word2vec to extract the relevance between log events, and reduce the time of training and coordinate transformation. LogEvent2vec can work with any coordinate transformation methods and anomaly detection models. The experimental results demonstrate that our approach is effective and outperforms the state-of-the-art work. Compared with Neural Network and Naive Bayes model, the performance of Random Forest as classifier working with LogEvent2vec is better. Different coordinate transformation methods (bary and tf-idf) have less influence on the accuracy of anomaly detection, but tf-idf can significantly reduce the computational time. LogEvent2vec working with LSTM is our future work.