Deep Learning-Based Log Parsing for Monitoring Industrial ICT Systems

: For rapidly developing smart manufacturing, Industrial ICT Systems (IICTSs) have become critical to safe and reliable production, and effective monitoring of complex IICTSs in practice is necessary but challenging. Since such monitoring data are generally organized as semi-structured logs, log parsing, the fundamental premise of advanced log analysis, has to be comprehensively addressed. Because of unrealistic assumptions, high maintenance costs, and the incapability of distinguishing homologous logs, existing log parsing methods cannot simultaneously fulfill the requirements of complex IICTSs. Focusing on these issues, we present LogParser, a deep learning-based framework for both online and offline parsing of IICTS logs. For performance evaluation, we conduct extensive experiments based on monitoring log sets from 18 different real-world systems. The results demonstrate that LogParser achieves at least a 14.5% higher parsing accuracy than the state-of-the-art methods.


Introduction
The recent development of information and communication technologies (ICT) has promoted the progression of the industrial revolution from the digital stage to the intelligent stage (i.e., Industry 4.0 [1]). Industrial ICT systems (IICTSs) have been demonstrated to be a critical component of smart manufacturing, as their failures cause catastrophic consequences, such as manufacturing accidents and financial disasters [2]. Drawing increasing interest from both industry and academia, effective monitoring holds great significance in protecting IICTSs from varied and intense physical and cyber threats in practice [3].
Generally, the IICTS monitor aims to detect anomalies and predict threats by collecting and analyzing operational information about system components [3]. Such monitoring data are usually formed as logs, to which data mining methods are commonly applied for knowledge extraction, i.e., log mining. Conventional log mining approaches have three basic steps, i.e., log parsing, matrix generation, and data mining [4]. As the essential premise of effective mining, log parsing identifies system-specified events according to the corresponding logs. An example of log parsing is shown in Figure 1; the corresponding event labels, such as super user action, informative report, event report, etc., are generated for the subsequent data mining tasks.
Existing log parsing methods usually depend on conventional data mining techniques, such as clustering and iterative partitioning [4], and a few methods leverage program source code, which is hardly available in practice [5]. However, parsing logs from complex IICTSs has the following requirements, which cannot be satisfied simultaneously by the current solutions:

•
Generalization: The parser should be applicable to all logs with different textual properties. Conventional log parsing methods are generally based on the assumption that words with high occurrence frequencies in logs imply corresponding types of events, which, however, may not hold for IICTS logs. For example, according to our evaluation results, such an assumption does not hold on the publicly available Web Log dataset [6]. A parser with no log textual property requirement is necessary for IICTSs.

•
Low human labor cost: The parser should cost less human labor while processing multi-sourced logs. Conventional log parsing methods require a set of parameters tuned for logs from each log source (i.e., a system component/device), which is quite expensive considering the massively different log sources in practical IICTSs, especially when the system needs to be continuously maintained. For example, logs collected from the ThingWorx platform [7] contain 87 components' logs. It is cost-effective to use a parser specifically designed for such a multi-sourced log scenario, one that requires only affordable human labor.

•
Transferability: The parser should be able to satisfy various task-specified parsing requirements on logs (from the same source or different sources) with similar structures (i.e., homologous logs). Conventional log parsing methods are basically driven by log textual properties, and no information regarding successive mining tasks is considered. In fact, for log mining, the parsing results of similar logs in different mining tasks are obviously different, especially in IICTSs with multiple services. It is of great significance to construct a parser that is capable of adjusting to different tasks and producing task-specified parsing results on homologous logs.

Treating all logs from an IICTS as an aggregation of multiple runtime event sequences, we present LogParser, a deep learning (DL)-based log parsing framework for both online and offline log analysis. Our contributions can be briefly summarized as follows:

1.
We construct both offline and online DL-based parsers with no specific textual property requirement for IICTS logs. Our approach can handle different task-specified log parsing requirements on both heterogeneous and homologous logs.

2.
LogParser unifies log parsing and matrix generation for consistent event labeling across log mining tasks without task-specific preprocessing.

3.
LogParser (code and data are available at https://github.com/jas0n-bot/LogParser, accessed on 4 August 2020) achieves higher parsing accuracies than the state-of-the-art methods on 18 real-world log sets. Specifically, our online parser has a negligible accuracy loss compared to our offline parser and outperforms the existing methods on heterogeneous/homologous multi-source log sets.
The rest of the article is organized as follows. The related works are presented in Section 2. Section 3 provides the preliminaries of our work. Section 4 describes our methodology. Our DL-based log parsing approach is explicitly presented in Section 5. Section 6 demonstrates our evaluation methodology and results. We discuss parameter settings in Section 7. Section 8 concludes the article.

Related Work
Log parsing is a fundamental issue in the operation of ICT systems in practice. Previous studies on log parsing have applied various clustering algorithms based on different criteria, such as word frequency or feature words. However, these methods often suffer from low accuracy, scalability, or robustness. Heuristic algorithms have also been proposed to address some of these issues, but they require manual tuning and domain knowledge. Recently, machine learning techniques have been explored for log parsing, which can potentially overcome some of the drawbacks of traditional methods.
One of the earliest efforts is the Simple Logfile Clustering Tool (SLCT) [8], which set the tone for successive clustering-based methods. It uses the frequency of fixed words (tokens) in the log text as the basis for clustering. LogSig [9] uses word pairs as the features of a log record and then aggregates the records into groups; the log template of each group is generated based on the common pairs. LKE [10] takes the strategy that two clusters of log records are simply merged if their distance is below a particular threshold, which suffers from an obvious accuracy loss on complex log sets.
Other than the conventional clustering-based methods, many log parsers use heuristic algorithms to analyze logs. IPLoM [11] heuristically parses logs by record length, which, however, requires careful preprocessing, since inappropriate preprocessing may cause incorrect splitting. Drain [12] enhances IPLoM by maintaining a trie (prefix tree) for faster searching, which, however, does not address the problem caused by the length-based parsing in IPLoM. POP [13] aims to solve this issue by merging groups in the last step, which improves the parsing recall and the overall parsing accuracy.
The natural language processing (NLP)-based log parser [14] uses latent Dirichlet allocation (LDA) to perform the classification after the tokenization, semantic processing, vectorization, and model compression steps. LogDTL [15] constructs a deep transfer neural network model for log template generation, which applies the transfer learning technique for training data augmentation purposes. NuLog [16] employs masked-language modeling to vectorize input tokens with random masking. A two-layer neural network encoder processes the vectorized input, and a Softmax linear layer produces a result matrix that maps to event templates.
However, unlike LogParser, none of the methods above manages to address the issue of parsing logs from multiple sources, especially from homologous data sources with different parsing tasks.

Preliminary
For ease of understanding, we briefly introduce some basic concepts in terms of DNN design and text data processing used in the rest of the article.

DNN Layers
Word2vec [17] acts as the word embedding layer in a DNN. It is a two-layer neural network that learns the contextual relationship of a certain word in a corpus of text. Specifically, we use the continuous bag-of-words (CBOW) model, which uses the surrounding context words to predict the current word.
Dropout [18] is a simple and effective filtering layer that reduces redundant information, which can improve a DNN's performance in terms of generalization. A typical usage is to apply a dropout layer before a dense input layer.
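As a toy illustration (a minimal sketch in plain Python, not the paper's implementation), "inverted" dropout zeroes each input with probability equal to the dropout rate and rescales the survivors so the expected activation magnitude is unchanged:

```python
import random

def dropout(inputs, rate=0.5, rng=None):
    """Inverted dropout: zero each input with probability `rate` and
    scale the survivors by 1/(1-rate) to preserve the expected sum."""
    rng = rng or random.Random(0)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in inputs]

vec = [1.0] * 10
dropped = dropout(vec, rate=0.3)
# Each element is either 0.0 (dropped) or 1/0.7 (kept and rescaled).
```

At inference time, a dropout layer written this way is simply skipped, since the rescaling already matches training-time expectations.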
The long short-term memory (LSTM) [19] network is a kind of recurrent neural network (RNN) that can retain information over many time steps. It has been successfully applied to fields such as language translation, speech recognition, and chatbots.

Batch Learning and Online Learning
In the field of machine learning, batch learning, the actual learning method used in offline learning, takes in the entire training set and generates a model after a certain number of training iterations. In each iteration, all samples in the training set are used. The model is evaluated according to its performance on a separate testing set. Online learning, however, does not require access to the entire training set. It continuously consumes samples from the input stream, and the model is iteratively optimized based on each incoming sample. The model is evaluated according to its performance on all consumed samples.
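The contrast between the two regimes can be sketched on a trivial one-parameter least-squares problem (a toy example of ours, unrelated to the paper's models): batch learning sweeps the whole training set each iteration, while online learning applies one update per incoming sample.

```python
# Fitting y = w * x with squared loss; the true weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def batch_learn(data, epochs=200, lr=0.01):
    """Batch learning: every iteration uses the entire training set."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def online_learn(stream, lr=0.05):
    """Online learning: one update per sample; the set is never stored."""
    w = 0.0
    for x, y in stream:
        w -= lr * 2 * (w * x - y) * x
    return w

w_batch = batch_learn(data)
w_online = online_learn(data * 100)  # a stream replaying the samples
```

Both estimates converge to the true weight here; the practical difference is that the online learner never needs the full training set in memory.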

Methodology
Real-world log parsing tasks require both batch and stream processing modes, where the former parses historical logs, and the latter parses continuously generated logs.Considering this, we design a deep learning-based framework comprising both offline and online parsers for batch and stream processing, respectively.To achieve the three aforementioned requirements of IICTS log parsing, we have the following specific designs:

•
Log parsing consistency: For generalization, the event labels teach our parsers the parsing standards directly, without task-specific data preprocessing. Additionally, the online parser needs to follow the same parsing standards as the offline parser. Our framework, as shown in Figure 2, ensures this by periodically importing the offline parser models into the online parser.

•
Quality of log record representations: To lower the human labor cost of our method (mainly for labeling log records), we significantly reduce the number of labeled records required for model training by enhancing the quality of record representations. In particular, we use word2vec [17] to vectorize the log records based on a neural network trained on the corpus of all log text.

•
Supervised learning by event labels: Our method leverages event labels directly to train our neural network model, which enables it to adjust itself to different parsing criteria. In this way, our method achieves the transferability needed for various log parsing tasks.

Since log parsing influences all subsequent log analyses and mining tasks, log parsing accuracy should be considered the key metric. In particular, we have the following designs to improve the accuracy of offline and online log parsing:

•
Accuracy of offline parsing: To attain high offline parsing accuracy, we design an LSTM-based offline parser that exploits the natural contextual relevance of log messages, which aids in recognizing the essential word representations. Our method leverages vectorized log records, which are word representations in a unified semantic space.

•
Adaptability of online parsing: For stream processing, to achieve high parsing accuracy, online parsing needs to cope with the challenge of concept drift, i.e., dynamic changes in the patterns of the data stream over time. To address this challenge, we propose an online parser that incorporates BiLSTM [20] and attention mechanisms [21] to enhance its adaptability to new data patterns.

The Deep Learning-Based Log Parsing Method
In this section, we explicitly present our deep learning-based log parsing architecture, LogParser.

Overview
The general architecture of LogParser is shown in Figure 2, and its workflow can be described as follows. Firstly, all logs from the monitored IICTS are aggregated into a stream pipe, where each incoming log is replicated and stored in a historical log database. Then, the offline parser, which is described in detail in Section 5.3, completes the parsing task by outputting a matrix representing all processed logs and the corresponding event labels. Such results are used by successive mining tasks, such as bottleneck analysis [22], daily check analysis [23], etc. Cooperating with the offline parser, the online parser, which is described in detail in Section 5.4, periodically loads the latest offline model and directly consumes logs from the stream pipe. It refines the model through online learning, which adaptively learns different log events. Similar to the offline parser, the online parser generates a vector representing the processed log and a label of the corresponding event type. Such results are used by online analyses, such as anomaly detection [24], resource scheduling [25], etc.

Word Representation
In LogParser, we use word2vec [17] to vectorize input logs. Specifically, given a log record composed of a sequence of T words x = {x_1, x_2, . . . , x_T}, each word x_t (1 ≤ t ≤ T) is converted into a real-valued n_w2v-dimensional vector e_t according to the embedding matrix W_w2v ∈ ℝ^(|V| × n_w2v), where V is the full vocabulary and W_w2v is learned by a word2vec neural network. We use the full log dataset to train our word2vec model, where the CBOW model is selected due to its faster training speed.
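The lookup step can be illustrated with a toy stand-in for the learned embedding matrix (the vocabulary, dimensions, and OOV handling below are our illustrative assumptions, not the paper's trained model):

```python
# Toy stand-in for the learned word2vec embedding matrix W_w2v:
# each vocabulary word maps to an n_w2v-dimensional real vector.
n_w2v = 4
W_w2v = {
    "connection": [0.1, -0.3, 0.2, 0.5],
    "from":       [0.0,  0.1, -0.1, 0.2],
    "closed":     [0.4,  0.2, 0.0, -0.1],
}
UNK = [0.0] * n_w2v  # fallback vector for out-of-vocabulary tokens

def vectorize(record):
    """Map a log record x = (x_1, ..., x_T) to vectors (e_1, ..., e_T)."""
    return [W_w2v.get(tok, UNK) for tok in record.lower().split()]

e = vectorize("Connection from 10.0.0.1 closed")
# One n_w2v-dimensional vector per token; the IP address falls back to UNK.
```

In the real pipeline, variable fields such as IP addresses would rarely clear word2vec's min_count threshold, so mapping them to a shared fallback vector is a common choice.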

Offline Parser
As shown in Figure 3, our offline parser is a specifically designed recurrent neural network (RNN) with three major components:

•
A word embedding layer that transforms x to e;

•
An LSTM layer that extracts features from log records;

•
A decision layer that identifies the log event type.
Additionally, to enhance the generalization capability, we integrate two dropout layers into the offline parser, i.e., one between the word embedding layer and the LSTM layer and another between the LSTM layer and the decision layer.
Fundamentally, the core layer of our offline parser can be formalized as follows:

h_t = H(W_xh x_t + W_hh h_(t−1) + b_h),
o_t = W_ho h_t + b_o,

where x is the input log record, i.e., the word sequence; h is the hidden vector sequence; o is the output sequence of the LSTM layer; the W terms denote the weight matrices; H is the hidden layer function; and the b terms denote the bias vectors.

Specifically, our hidden layer function H is designed as a standard LSTM, which is built upon LSTM cells with four major components, i.e., the input gate i, forget gate f, output gate o, and memory cell c. The explicit cell implementation is as follows:

i_t = σ(W_xi x_t + W_hi h_(t−1) + b_i),
f_t = σ(W_xf x_t + W_hf h_(t−1) + b_f),
c_t = f_t ⊙ c_(t−1) + i_t ⊙ tanh(W_xc x_t + W_hc h_(t−1) + b_c),
o_t = σ(W_xo x_t + W_ho h_(t−1) + b_o),
h_t = o_t ⊙ tanh(c_t),

where σ is the sigmoid function and ⊙ denotes element-wise multiplication.
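For intuition, one step of a standard LSTM cell can be sketched in plain Python with scalar weights (a simplification of ours for readability; real layers use weight matrices over n-dimensional vectors):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM cell update; `p` holds the weights and biases."""
    i = sigmoid(p["W_xi"] * x + p["W_hi"] * h_prev + p["b_i"])  # input gate
    f = sigmoid(p["W_xf"] * x + p["W_hf"] * h_prev + p["b_f"])  # forget gate
    c = f * c_prev + i * math.tanh(p["W_xc"] * x + p["W_hc"] * h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] * x + p["W_ho"] * h_prev + p["b_o"])  # output gate
    h = o * math.tanh(c)  # hidden state exposed to the next layer
    return h, c

# Uniform toy weights, purely for demonstration.
p = {k: 0.5 for k in ["W_xi", "W_hi", "b_i", "W_xf", "W_hf", "b_f",
                      "W_xc", "W_hc", "b_c", "W_xo", "W_ho", "b_o"]}
h, c = lstm_step(1.0, 0.0, 0.0, p)
```

The forget gate f scales the previous memory c_prev, which is what lets the cell carry context across many tokens of a log record.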
Then, the LSTM layer's output o is transformed into o′ by the dropout layer. Finally, the decision layer uses the Softmax function to calculate the probability distribution over all log event types y, as well as the predicted label ŷ:

y = Softmax(W_y o′ + b_y),  ŷ = argmax_k y_k.

One thing that should be noted is that, unlike other kinds of data for mining, a single log record contains a limited number of words. Therefore, we believe that our offline parser can be effectively trained on a small training set, which is further verified and discussed in Section 6.
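The decision step amounts to a Softmax over per-event-type scores followed by an argmax; a minimal sketch (the example scores are invented):

```python
import math

def softmax(scores):
    """Numerically stable Softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]   # one score per log event type
y = softmax(scores)        # probability distribution y over event types
y_hat = y.index(max(y))    # predicted event label, i.e., the argmax
```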

Online Parser
Since our online parser focuses on directly consuming streaming log records (i.e., one at a time), it has to learn as quickly as possible. To address this issue, we use an attention-based BiLSTM to construct our online parser.
Specifically, as shown in Figure 4, our online parser has four major components, i.e., a word embedding layer, a BiLSTM layer, an attention layer, and a decision layer. Two dropout layers are placed before the BiLSTM and decision layers, respectively. The explicit component descriptions are as follows.

To start, the input log record x is converted into e through the word embedding and dropout layers. Then, let →h and ←h denote the forward and backward hidden layers, respectively. The BiLSTM layer in our online parser can be formally described as follows:

→h_t = LSTM(e_t, →h_(t−1)),
←h_t = LSTM(e_t, ←h_(t+1)),
h_t = →h_t ⊕ ←h_t,

where ⊕ denotes the element-wise sum operation. Then, an attention model (or mechanism) is added between the BiLSTM and decision layers. Specifically, our attention model consumes h and adjusts the weights of the BiLSTM output as follows:

u_t = tanh(W_a h_t + b_a),
α_t = exp(u_tᵀ u_w) / Σ_j exp(u_jᵀ u_w),
o = Σ_t α_t h_t.

Finally, the decision layer takes the re-weighted and filtered output o from the attention model to determine the log event type ŷ (similar to our offline parser).
One thing that should be noted is that, since the training of a BiLSTM is quite time-consuming, we train our online parser only on falsely predicted records to reduce the training time.
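The mistake-driven update loop can be sketched as follows (the model class and its methods are toy stand-ins of ours, not the paper's BiLSTM):

```python
def online_parse(stream, model):
    """Consume (record, label) pairs one at a time: predict first,
    then update the model only when the prediction was wrong."""
    correct = 0
    for record, label in stream:
        pred = model.predict(record)
        if pred == label:
            correct += 1
        else:
            model.update(record, label)  # train on the mistake only
    return correct

class FirstTokenModel:
    """Stand-in model: remembers the last corrected label per first token."""
    def __init__(self):
        self.rules = {}
    def predict(self, record):
        return self.rules.get(record.split()[0])
    def update(self, record, label):
        self.rules[record.split()[0]] = label

stream = [("error disk full", "E1"), ("error disk full", "E1"),
          ("login ok", "E2"), ("login ok", "E2")]
model = FirstTokenModel()
hits = online_parse(stream, model)
# The first occurrence of each event type is mispredicted and triggers
# an update; the repeats are then predicted correctly.
```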

Evaluation and Results
In this section, we evaluate the performance of LogParser. We first explain our experimental methodology and then present the corresponding evaluation results of our offline and online parsers, respectively.

Experimental Methodology
Selected Datasets: For a comprehensive performance evaluation, we selected 18 different log sets collected from different practical systems, as shown in Table 1, where 'Log Size' indicates the number of log records and 'Events' indicates the number of log event types. Specifically, HDFS [5] is generated in a private cloud environment using a benchmark workload. BGL [26] is a supercomputer log set collected at Lawrence Livermore National Labs (LLNL) in Livermore, CA. Spark is collected from our three-node cluster server that hosts our lab's calculation tasks. Apache is an open-source Apache server access log set. UofS [27] is a trace set containing all HTTP requests to the WWW server at the University of Saskatchewan within seven months. Jul95 [27] is a trace set containing all HTTP requests to the WWW server at NASA Kennedy Space Center in Florida within two months. Nginx is an open-source Nginx server access log set. Openstack [28] is generated in CloudLab. Security Log [29] contains connection logs collected and used by Security Data Analysis Labs. Thunderbird [26] is a supercomputer log set collected from Sandia National Labs (SNL) in Albuquerque, NM. Big Brother [30] is a diagnostic log set from the IEEE VAST Challenge 2013. Web Log [6] is a publicly available web log set. ThingWorx [7] contains system runtime logs collected from the ThingWorx platform, an Industrial Internet of Things (IIoT) platform. 4SICS-151020, 4SICS-151021, and 4SICS-151022 [31] are network traffic data captured from industrial network equipment by the industrial cyber security conference 4SICS. EMSD includes 30 days of Energy Management System logs (https://sites.google.com/a/uah.edu/tommy-morris-uah/ics-data-sets, accessed on 4 August 2020). IoT Sentinel [32] contains the traffic emitted during the setup of 31 smart home IoT devices.

Comparisons:
We compared the performance of LogParser with that of multiple log parsing methods, i.e., IPLoM [11], Drain [12], POP [13], and NuLog [16]. All comparatives are implemented in Python. For POP, we used the single-node version. All parameters were first set as recommended and then tuned according to the actual performance.
Evaluation Metric: We used the F1-score, i.e., the harmonic mean of precision and recall, to evaluate the log parsing performance. Specifically, for a certain type of event, precision is defined as the number of correctly predicted records divided by the number of all records identified as such a type of event, and recall is defined as the number of correctly predicted records divided by the number of all records actually labeled as such a type of event (i.e., the ground truth).
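These definitions translate directly into code; a short sketch with invented example labels:

```python
def f1_per_event(pred, truth, event):
    """Precision, recall, and F1 for one event type, as defined above."""
    tp = sum(1 for p, t in zip(pred, truth) if p == event and t == event)
    predicted = sum(1 for p in pred if p == event)   # identified as `event`
    actual = sum(1 for t in truth if t == event)     # ground-truth `event`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

truth = ["E1", "E1", "E2", "E2", "E2"]
pred  = ["E1", "E2", "E2", "E2", "E1"]
p, r, f1 = f1_per_event(pred, truth, "E2")
# For E2: 2 true positives out of 3 predictions and 3 ground-truth records.
```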

Performance of the Offline Parser
In this set of experiments, we trained our offline parser and all comparatives for three rounds on all 18 selected datasets and conducted log parsing for event identification based on the best-performing model from each method. Unless otherwise specified, the training set drawn from each dataset contained 200 randomly selected log records labeled with each type of event. If the number of records of a specific event type was less than 200, all records were used.

General Performance
Table 2 shows the log parsing accuracy of IPLoM, Drain, POP, NuLog, and our offline parser on all selected datasets. It is obvious that our method significantly outperforms all comparatives on all datasets. Furthermore, we notice that the conventional methods induce quite a low performance in certain scenarios, e.g., IPLoM on Jul95. We believe that the reason for this is two-fold: First, such methods inherently require domain knowledge for accuracy improvement, while only basic data preprocessing was conducted in our evaluations. Second, the classification is not conducted based on the structural features of the log text.
According to Table 2, on Nginx, when the classification criterion is 'request type' (i.e., Nginx-4 events), all methods manage to achieve accurate classification. However, when the criterion is changed to 'protocol version' (i.e., Nginx-2 events), unlike our offline parser, none of the comparatives demonstrates an acceptable performance. NuLog, which adopts machine learning techniques, achieved a higher log parsing accuracy than the other conventional methods. However, it still does not reach an acceptable performance (less than 0.70 precision) on some datasets, i.e., Nginx-2, Nginx-20, Thunderbird, Web Log, and ThingWorx. We believe that this is due to the lack of task-specific data preprocessing: the self-supervised learning approach could not automatically adapt to the mining task requirements. Such results indicate that our offline parser manages to support user-defined classification of log events that cannot be achieved by existing methods.
We then further analyzed the results on ThingWorx, which implied that there are two major reasons for the low accuracy of the comparatives: First, the comparatives can barely recognize certain types of events, e.g., the type five event 'anonymous error message' cannot be recognized by IPLoM (i.e., with 0.0 precision). Second, all comparatives induce low recalls except for POP and NuLog, whose recalls are still significantly lower than that of our method. Such results indicate that, unlike our offline parser, the existing methods cannot effectively differentiate events in certain scenarios.

Impact of Log Set Size
We also investigated the impact of log set size on the performance of both our offline parser and all comparatives. Specifically, for each dataset, we randomly selected subsets of different sizes from the entire log set (i.e., 200 records labeled with each type of event) for performance evaluation. For illustration, the results on HDFS and Spark are listed in Table 3. As we can see in Figure 5, when the log set size increases, our offline parser induces a negligible accuracy loss (or even enhances the accuracy, i.e., by 4% on Spark), and NuLog performs the same and even achieves a better accuracy (i.e., 1.00 accuracy) on Spark (4703). We believe that overfitting caused our method to have a 5% lower accuracy than NuLog on Spark (4703). Conversely, all conventional comparatives suffer from an obvious accuracy loss (e.g., POP, the best-performing comparative, still induces 5% and 8% accuracy losses on HDFS and Spark, respectively). The reason, we believe, is that certain unexpected words appear relatively more frequently in large log sets, which hinders the comparatives from identifying the correct log event type.

Performance of Multi-Source Log Parsing
In this set of experiments, we compared the performance of LogParser to those of all comparatives on logs from multiple sources. To achieve this, we randomly selected subsets of log records with sizes from 100 k to 300 k from different log sets in Table 1 and constructed a heterogeneous multi-source log set (see Table 4) and a homologous multi-source log set (see Table 5), respectively. Specifically, heterogeneous represents logs generated by different components, and homologous represents logs generated by similar components. The experimental results are demonstrated in Figures 6 and 7, respectively. According to Figure 6, we can see that, as the number of heterogeneous log sources increases, our offline parser has a negligible accuracy loss (i.e., no more than 5%), and NuLog has a slight accuracy loss (i.e., no more than 10%), while the accuracy of all conventional comparatives drops significantly (i.e., from 29% to 32%). The reason, we believe, is that integrating heterogeneous log records from new sources into the original log set drastically changes the original pattern of word distribution. NuLog is able to learn the new pattern through its self-supervised mechanism. However, it is still less effective than our method, which learns from labels. All conventional comparatives, however, rely on a single set of parameters to identify the event type of heterogeneous logs from multiple sources, which is difficult to achieve and downgrades the performance significantly.
According to Figure 7, the results on the homologous multi-source log set are similar to those on the heterogeneous set, and our offline parser (with an accuracy loss of no more than 8%) outperforms all comparatives (with accuracy losses from 31% to 99%) even more obviously. The reason, we believe, is that all comparatives cannot differentiate log records with similar textual structures but different event identification criteria, despite the fact that we intentionally added the source index to enhance the dissimilarity between log records during the evaluation.

Performance of the Online Parser
As described in Section 5.4, our online parser periodically loads the latest offline parser and conducts online parsing of logs from incoming streams directly.

General Performance
According to our design, before the online parser is updated, each incoming log record may imply a new type of event, i.e., the training set of the loaded offline parser contains no log record implying the same event type as such an incoming record. We use r to represent the ratio of log records implying new types of events to the number of all records from the stream within a certain period of time. Theoretically, a higher r should have a more obvious impact on the online parsing accuracy, and an ideal online log parser should be able to prevent the accuracy from downgrading when r increases. Therefore, in this set of experiments, we evaluated the performance of our online parser on log streams with different settings of r.
As streams, the log sets used in the evaluation of our offline parser (see Table 1) were fed into our online parser, our offline parser, Drain, and NuLog, respectively, for comparison. Specifically, we fed complete log sets to our offline parser as the baseline and to Drain and NuLog as comparatives. Furthermore, log sets with r ∈ {0.2, 0.4, 0.6, 0.8, 1.0} were fed to our online parser to study the impact of r (for example, r = 0.2 represents that log records implying 20% of all event types were not selected for the training set of the offline parser). Figure 8 shows the online parsing results on the HDFS and Spark log sets (due to the page limitation, we only present these results for illustration), which demonstrates the F1-scores of all methods when different numbers of log records (up to 10^7) were fed. According to Figure 8, it is obvious that our online parser outperforms Drain in all scenarios (i.e., by 2% to 15% in terms of accuracy), except when r > 0.6 and 4703 records were fed on the Spark log set. NuLog, on the other hand, performs evenly with our online parser, since it was already trained on the whole dataset in advance. In cases of such small dataset sizes with limited trained log event types, the learning cost leads to a reduction in the final accuracy.
For our online parser, as we expected, a higher r induces a more obvious accuracy loss. However, the parsing accuracy increases as the number of fed log records increases. When more than 5 million records are fed, a 1.0 F1-score is achieved, even when r = 1.0 (i.e., all incoming logs carry new event types). Meanwhile, for all settings of r, the accuracy difference between our online and offline parsers keeps shrinking and becomes negligible (i.e., at most 4%) once 50 k records have been fed.
In certain cases, e.g., when 4703 records with r = 0.2 or more than 5 million records with r ∈ {0.2, 0.4, 0.6, 0.8, 1.0} were fed, our online parser achieves a higher (up to 2%) accuracy than our offline parser. We believe that this comes from the difference between the offline and online evaluation procedures: the offline parser needs to find a global template to identify all log events, while the online parser only needs to focus on the recent log records.

Performance of Multi-Source Log Parsing
In this set of experiments, we investigated the performance of our method on parsing multi-source log records. Specifically, we used the same log sets shown in Tables 4 and 5 to construct our heterogeneous and homologous multi-source log streams, respectively.
Online Parsing of Heterogeneous Logs: We first evaluated our online parser's capability to adapt to five different structural patterns of the heterogeneous log stream, as follows:

1.
Full length, fixed order: all log records from each source were fed sequentially;

2.
Short length, fixed order: 400 log records from each source were fed sequentially in an iterative manner;

3.
Medium length, fixed order: 5000 log records from each source were fed sequentially in an iterative manner;

4.
Medium length, random order: 5000 log records from a random source were fed in an iterative manner;

5.
Fully random: each log record fed was from a random source.
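For illustration, the chunked and fully random feeding patterns above might be generated as follows (a hypothetical sketch of ours with toy record lists; the actual streams use the log sets in Table 4):

```python
import random

def fixed_order_stream(sources, chunk):
    """Patterns 2-4: feed `chunk` records per source, iterating over the
    sources in a fixed order until every source is exhausted."""
    queues = [list(s) for s in sources]  # copy; do not mutate the inputs
    out = []
    while any(queues):
        for q in queues:
            out.extend(q[:chunk])
            del q[:chunk]
    return out

def fully_random_stream(sources, seed=0):
    """Pattern 5: every record comes from a random source."""
    pool = [rec for s in sources for rec in s]
    random.Random(seed).shuffle(pool)
    return pool

sources = [[f"A{i}" for i in range(5)], [f"B{i}" for i in range(3)]]
mixed = fixed_order_stream(sources, chunk=2)
# Source A and source B alternate in fixed-size chunks until both run out.
```

Pattern 4 would differ only in picking the next source at random instead of cycling in a fixed order.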
Figure 9 demonstrates the results of our online parser on streams containing heterogeneous log records from different numbers of sources. It is obvious that our online parser performs well on streams with all structural patterns except the 'short length, fixed order' one (i.e., the parsing accuracy is significantly downgraded on streams with six and seven sources when r ≥ 0.8). Compared to the results under other settings, we believe that the input stream structure has a significant impact on the performance of our online parser, and it is quite challenging to achieve accurate parsing of streams with continuously shifting log patterns (i.e., sets of log records of relatively limited size from different sources). In this case, the learning rate of the online parser might require explicit fine-tuning to guarantee the parsing accuracy. However, since our online parser performs well (in terms of both accuracy and stability) in the 'fully random' scenario, potential structural modifications can be conducted on log streams in practice to guarantee the effectiveness of our solution.
Online Parsing of Homologous Logs: Similar to the evaluation of our offline parser, we also evaluated the performance of our online parser on homologous multi-source log streams. The results are shown in Figure 10, which demonstrates that our online parser obviously outperforms Drain and NuLog (by up to 75%/32% in terms of accuracy). Moreover, our online parser performs evenly with our offline parser.

Discussion
In this section, we conduct an empirical study on the parameter settings of LogParser for a comprehensive discussion.

Parameter Tuning of Word2vec
During our experiments, we observed that the parameter tuning of word2vec was extremely important to the performance of LogParser. In general, n_w2v determines the size of the vector e_t, i.e., the translation of a word x_t, and words appearing fewer than min_count times are ignored. According to our experimental results, a high n_w2v leads to model overfitting and unstable performance, while a low n_w2v causes excessive information loss during the translation from x_t to e_t, which severely downgrades the final accuracy.
Realizing the properties above, for LogParser, we set n_w2v = 30; this relatively low value does not affect the translation effectiveness, since log records normally have a smaller vocabulary than general texts. The setting of min_count, however, is non-trivial: it acts as a filter eliminating information that is non-essential for later training. For LogParser, we set min_count = 3.
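The effect of min_count can be sketched as a corpus-level token filter (a toy illustration of ours; the real pruning happens inside word2vec's vocabulary builder):

```python
from collections import Counter

def filter_rare_tokens(records, min_count=3):
    """Drop tokens appearing fewer than `min_count` times in the whole
    corpus, mirroring word2vec's min_count vocabulary pruning."""
    counts = Counter(tok for rec in records for tok in rec.split())
    return [[t for t in rec.split() if counts[t] >= min_count]
            for rec in records]

logs = ["conn open", "conn close", "conn open", "oddball open"]
kept = filter_rare_tokens(logs, min_count=3)
# 'close' and 'oddball' each appear once and are pruned;
# 'conn' and 'open' each appear three times and survive.
```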

Dropout Rate
The dropout rate is commonly recommended to be 0.5, i.e., the dropout layer randomly ignores half of its inputs. However, according to our experimental results, a 0.3 dropout rate slightly enhances the performance. We believe the reason is that a dropout layer is applied before the LSTM/BiLSTM layer, and dropping too much information this early can downgrade the final accuracy to a certain extent.

Conclusions
In this article, we present LogParser, a deep learning-based log parsing framework for both online and offline IICTS monitoring. Specifically, LogParser requires no specific textual properties of IICTS logs and is designed to simultaneously parse multi-source logs with different task-specified criteria. The results of extensive experiments based on 18 real-world log sets demonstrate that LogParser obviously outperforms state-of-the-art offline and online log parsers in terms of parsing accuracy (i.e., 59.3%/45.1%/40.3%/14.5% higher on average). Furthermore, LogParser achieves indisputable advantages over the existing methods in heterogeneous/homologous multi-source log parsing scenarios (i.e., with up to 46%/92% higher parsing accuracy).

Figure 4. Deep learning-based log parsing framework: online architecture.

Figure 5. Offline parsing accuracy on datasets of different sizes.

Figure 8. Online parsing accuracy on datasets of different sizes.

Figure 9. Online parsing accuracy on heterogeneous multi-source datasets (legend from left to right): (1) full length from each source; (2) fixed length (400) from every single source with fixed order; (3) fixed length (5000) from every single source with fixed order; (4) fixed length (5000) from every single source with random order; (5) fully random.

Figure 10. Online parsing accuracy on homologous multi-source datasets.

Table 1. Summary of evaluation datasets.

Table 2. Offline parsing results on all evaluation datasets.

Table 3. Offline parsing results on subsets of different sizes drawn from the HDFS and Spark datasets.