Article

DLogParser: An Efficient Dynamic Log Parser with Multiple Grouping Criteria

1 School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China
2 Zhengzhou Information Science and Technology Institute, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 811; https://doi.org/10.3390/app16020811
Submission received: 19 November 2025 / Revised: 2 January 2026 / Accepted: 9 January 2026 / Published: 13 January 2026

Abstract

Most existing log parsers are static: when applied to heterogeneous logs, their accuracy fluctuates significantly. To overcome this issue, this paper proposes a dynamic log parser named DLogParser. The core idea of DLogParser is to select different parsing policies based on log features. DLogParser first parses a small batch of sample log messages, analyzes log characteristics from the parsing results, and determines an appropriate parsing policy for the current logs. It then parses all remaining logs according to the determined policy. To support dynamic parsing policies, DLogParser incorporates five grouping criteria for log features (length, punctuation, first token, last token, and key token) and establishes seven rules for parsing policy generation. We evaluated DLogParser on public datasets from LogHub. The experimental results demonstrate that, compared to 11 existing log parsers, DLogParser achieves an accuracy of 90.3% with an acceptable performance loss.

1. Introduction

Log parsing plays a critical role in log analysis [1]. During system runtime, software systems typically output system status, environmental information, and other operational details to log files, which serve as essential data sources for diagnosing software failures and identifying their root causes. However, due to the semi-structured nature of raw log data, direct analysis of such data introduces significant challenges. Consequently, preprocessing of raw logs is required to extract and tag critical information while filtering redundant content, thereby transforming these raw logs into structured data for downstream analytical tasks such as anomaly detection [2,3], bug localization [4], and root-cause diagnosis [5,6].
Direct approaches to log parsing rely on analyzing the source code responsible for log generation [7,8]. Such methods are referred to as code-driven log parsers. However, since software source code is not always accessible, code-driven solutions lack universal applicability. Thus, most existing log parsers adopt data-driven approaches. These parsers take raw logs as input and extract structured information such as log events and log parameters from the unstructured or semi-structured log data.
Data-driven log parsers are broadly categorized into four types: frequent pattern mining-based parsers [9,10,11], clustering-based parsers [12,13,14], heuristics-based parsers [15,16,17], and large language model-based parsers [18,19,20]. These parsers generally rely on specific textual features, analyzing pattern consistency and contextual relationships within raw logs to infer structured information. For instance, clustering-based parsers operate under the hypothesis that raw logs derived from the same log event exhibit consistent textual similarity and parameter positional distributions. This hypothesis enables log message clustering through computational similarity metrics such as edit distance or TF-IDF (Term Frequency-Inverse Document Frequency) similarity.
Each parser exhibits inherent limitations, rendering it suitable only for specific log types. For instance, frequent pattern mining-based parsers rely on high-frequency words to identify log events. However, in practice, critical events may manifest as low-frequency anomalies. Drain [17] initially groups log messages by length and subsequently subdivides them by the first word. However, the first word is not always discriminative, leading to cases where multiple log templates share identical first words. Consequently, the last word serves as another candidate grouping criterion [21], while nDrain [22] attempts to extract representative keywords for feature construction. Nevertheless, these improved static parsers still cannot accommodate all types of logs.
These log parsers often exhibit significant differences in accuracy when processing different logs [23,24]. From our perspective, this problem stems from the heterogeneity of logs. Due to the absence of unified logging standards, different software systems adopt divergent file formats and content representation schemes, resulting in substantial variations in log structures and semantics. We argue that log parsers employing fixed pipelines are inherently limited in their ability to adapt to the dynamic variations in log structures and semantics. A straightforward approach is to select an appropriate log parser based on the structural and semantic characteristics of target logs. However, this is not always feasible. Not all users possess the necessary technical expertise in logs and log parsers, and there is no guarantee that users will be able to obtain such professional information.
Thus, in this paper, we present a dynamic log parser called DLogParser. DLogParser employs a three-step parsing methodology comprising sampling parsing, policy generation, and full parsing. In the sampling parsing step, DLogParser leverages Drain to parse a small log sample, and analyzes the parsing results to identify log features. Based on these features, DLogParser determines the optimal parsing policy in the policy generation step. In the full parsing step, DLogParser parses the remaining logs using the policy determined in the previous step. Importantly, DLogParser dynamically adjusts its parsing policy in response to different logs. This is the core distinction between our approach and prior static log parsers that adopt a fixed parsing pipeline.
To support dynamic policy generation, DLogParser employs five grouping criteria, including length, punctuation, first token, last token, and key token. Based on log features, DLogParser selects one or more of these grouping criteria to quickly group log messages. To evaluate the efficacy of different grouping criteria, we introduce the concept of discriminability. A higher discriminability value indicates that the corresponding criterion can group log messages more accurately. Based on discriminability, we formulate seven rules to govern the order of grouping criteria usage. The introduction of additional grouping criteria could further optimize DLogParser, such as Part-of-Speech Tagging [25]. However, this paper only focuses on the aforementioned five grouping criteria.
To validate the effectiveness of our proposed method, we implemented DLogParser and conducted experiments on publicly accessible datasets from LogHub [26]. DLogParser was evaluated against 11 state-of-the-art log parsers as baselines, including AEL, IPLoM, LogCluster, and other representative parsers. Experimental results indicate that DLogParser achieves the highest parsing accuracy, 90.3%, among all evaluated non-LLM-based baselines. While LLM-based log parsers are partially superior to DLogParser, DLogParser does not incur the costs associated with LLM access. To evaluate the time overhead incurred by processes such as sampling parsing, we conducted a time overhead comparison experiment. The experiment demonstrates that DLogParser’s time overhead is comparable to that of mainstream baselines and thus remains within an acceptable range for practical deployment.
The remainder of this paper is structured as follows. Section 2 presents observations from our analysis of static parsers. It demonstrates that fixed parsing processes cannot adapt to heterogeneous logs, thereby justifying the need for a dynamic parser. Section 3 outlines the methodology of DLogParser, detailing the design and implementation of the proposed parser. Section 4 presents the experimental evaluation of DLogParser using log datasets from LogHub. Section 5 discusses related work, while Section 6 provides further discussion. Finally, Section 7 concludes this paper.

2. Observation

Log parsing converts unstructured or semi-structured log messages into structured information (e.g., log templates), providing standardized data for downstream analysis such as anomaly detection. As shown in Figure 1, raw log messages contain elements such as time and level which are of limited utility for subsequent processing, as well as unstructured data like log content. A log parser analyzes the characteristics of log messages, groups similar ones together, and treats invariant components as constants and variant components as variables to generate log templates. As illustrated by msg1 and msg2 in Figure 1, the analysis reveals that their log content structures are similar, with both messages containing constants such as “Kind” and “Service”. By replacing variables with *, template1 is obtained. Similarly, template2 can be derived from msg3 and msg4.
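The derivation described above can be sketched in a few lines of Python. This is an illustration of the general constant/variable split, not DLogParser’s implementation, and the example messages are hypothetical, loosely modeled on the “Kind”/“Service” constants in Figure 1:

```python
def derive_template(messages):
    """Merge tokenized log messages of equal token count into one template:
    positions whose token is identical across all messages stay constant,
    while varying positions become the wildcard '*'."""
    token_lists = [m.split() for m in messages]
    template = []
    for position_tokens in zip(*token_lists):
        if len(set(position_tokens)) == 1:
            template.append(position_tokens[0])  # invariant -> constant
        else:
            template.append("*")                 # variant -> variable
    return " ".join(template)

# hypothetical messages sharing one template
msg1 = "Kind of service requested: chunkserver"
msg2 = "Kind of service requested: master"
print(derive_template([msg1, msg2]))  # Kind of service requested: *
```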
To improve log parsing accuracy, log parsers should group log messages into designated groups as precisely as possible based on their inherent features. Most existing parsers are static, designed to handle only logs with predefined feature profiles. Consequently, when incoming log messages exhibit different features, they may be misclassified into irrelevant groups. This misgrouping further impairs subsequent similarity assessments and ultimately degrades overall parsing accuracy. A more adaptive solution would involve tailoring parsing policies to the specific features of log messages, thereby enabling more accurate grouping of messages into target groups. The core research question addressed in this work is how to dynamically adjust parsing policies to accommodate varying log characteristics.
We now discuss how to determine this parsing policy from log templates. Both our experiments and prior work demonstrate that Drain is versatile and achieves high parsing accuracy. Drain first groups log messages by their length, then uses the first token for further grouping. After grouping the messages into several groups, it derives log templates via inter-log similarity assessment. We analyze the generalizability of Drain from the log template perspective, using the log template set generated from LogHub’s public log datasets.
Table 1 presents statistics on template grouping conflicts across seven log datasets (where “conflict” refers to overlaps in template categories generated by different grouping criteria). These log datasets cover both high and low parsing accuracy categories, as illustrated by Drain achieving 100% accuracy on HDFS logs versus only 68.6% accuracy on Linux logs. The column “Length” denotes the number of distinct length categories among log templates. When this number equals the total number of log templates, it confirms that length-based grouping alone suffices for correct grouping. The column “First token” indicates the number of template groups sharing the same first word. The column “Total templates” denotes the total quantity of unique log templates for each dataset, and the column “Accuracy” denotes Drain’s parsing accuracy on the corresponding log dataset.
Analysis of Table 1 reveals that the choice of grouping criteria has a critical impact on parsing accuracy. For Linux logs, Drain achieves notably low accuracy. As shown in the table, approximately 48.3% of Linux log templates share the same first word, indicating the limited effectiveness of first-token-based grouping. Similarly, the 441 Linux log templates map to only 18 distinct lengths, further demonstrating the insufficiency of length-based grouping alone. Since Linux log messages cannot be accurately grouped via either first-token or length criteria, high parsing accuracy cannot be guaranteed. In contrast, Apache logs contain 30 unique templates with 18 distinct lengths, and only 12 templates across the dataset share identical first words. Under these favorable grouping conditions, Drain achieves 100% parsing accuracy on Apache logs.
Based on these observations, we draw two conclusions regarding how to enhance log parsing accuracy. (1) Selecting appropriate grouping criteria helps achieve accurate grouping of log messages. Drain achieves parsing accuracy below 80% for Linux, OpenSSH, and Mac logs, a shortcoming partly due to first-word conflicts affecting over 50% of their log templates. Adopting alternative grouping criteria, such as last tokens as grouping features [21], can potentially improve parsing accuracy. (2) Cooperation among multiple grouping criteria is essential. We find that no single grouping criterion achieves perfect grouping accuracy, necessitating the combination of multiple criteria to achieve ideal results.

3. Methodology

The workflow of DLogParser includes three core steps: sampling parsing, policy generation, and full parsing, as shown in Figure 2. In the sampling parsing step, DLogParser initially uses two grouping criteria (length and first word) and parses a small subset of log messages to generate partial log templates. Sampling parsing aims to identify and characterize the inherent features of the target log dataset, such as the distribution of log template lengths. In the policy generation step, DLogParser analyzes the log templates obtained from sampling parsing to determine the optimal grouping criteria for the current log set and the order of application of these criteria. In other words, DLogParser’s parsing policy is dynamically generated based on the inherent features of the target logs. This inherent adaptability is the core rationale for classifying DLogParser as a dynamic parser. In the full parsing step, DLogParser parses all remaining logs using the generated policy. Notably, DLogParser also incorporates a log preprocessing module to filter out non-informative components from raw logs (e.g., IP addresses and timestamp suffixes). This module, similar to preprocessing components in existing parsers, is not elaborated on further in this paper. We elaborate on the three core steps separately in subsequent subsections.

3.1. Step 1 Sampling Parsing

The purpose of sampling parsing is to identify and characterize the features of the target log dataset in order to determine the subsequent parsing policy. Since the optimal policy for a given log dataset cannot be predetermined during this phase, we employ a conventional static grouping approach: DLogParser first groups log messages by message length, then performs further grouping based on the first token, and finally completes initial log template derivation via similarity determination. Alternatively, other grouping criteria (e.g., length and last token) may be adopted for sampling parsing in scenarios where first-token grouping leads to severe category conflicts. However, to the best of our knowledge, the combination of length and first token is the optimal choice, as it achieves superior results in initial template derivation for the majority of real-world log datasets while maintaining computational efficiency and accurate template discrimination.
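A minimal sketch of this two-level grouping (length, then first token) is shown below; the log lines are hypothetical, HDFS-style examples, and the function name is our own:

```python
from collections import defaultdict

def group_by_length_then_first_token(messages):
    """Two-level grouping used in sampling parsing: bucket messages by
    token count, then subdivide each bucket by its first token."""
    groups = defaultdict(list)
    for msg in messages:
        tokens = msg.split()
        groups[(len(tokens), tokens[0])].append(msg)
    return groups

logs = [
    "Received block blk_123 of size 67108864",
    "Received block blk_456 of size 67108864",
    "Deleting block blk_789 file /data/blk_789",
]
groups = group_by_length_then_first_token(logs)
# the two "Received ..." messages share the group key (6, "Received")
```

Within each resulting group, similarity determination then separates constants from variables to derive the initial templates.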
Due to the inherent incompleteness of sampling parsing, its output results are not fully accurate. Sampling parsing only processes a small subset of log messages, and the obtained partial log templates are neither exhaustive (i.e., they do not cover all templates in the full dataset) nor error-free. Consequently, the log feature characterization derived from sampling parsing only provides an approximate profile of the target dataset. We argue that this approximate characterization is nevertheless sufficient for DLogParser to determine the subsequent parsing policy. This process inherently involves a trade-off between characterization accuracy and computational cost. Increasing the sample size can effectively mitigate sample-induced parsing bias, while an overly large sample size will incur excessive overall time overhead.
In this paper, we limit the sample size to either 2000 log messages or 5% of the total log messages. The reasons are mainly as follows. First, the number of log templates generally ranges from tens to hundreds. To ensure the accuracy of sampling parsing, the number of log samples should be at least 2000. Second, we analyzed the template distribution in five log datasets (each with 100,000 log messages), as shown in Table 2. For example, parsing 5% of HDFS logs identifies 85.7% of the log templates. However, exceptions exist. For Mac logs, for instance, the number of discovered templates increases nearly linearly with the parsed sample size. Insufficient sampling introduces significant errors in discriminability calculation, which adversely affects parsing policy formulation and, consequently, parsing accuracy. Although increasing the sampling ratio is an intuitive solution, it would require parsing a large portion of logs using the initial static method. Therefore, the chosen sample size represents a practical compromise.

3.2. Step 2 Policy Generation

The purpose of policy generation is to provide an adaptive parsing policy for log parsing. This involves two core dimensions: the combination of grouping criteria and the order of criteria application. The former requires selecting suitable sets of grouping criteria from the candidates, while the latter determines the application order of criteria within each selected combination. Since a single grouping criterion typically cannot achieve comprehensive log parsing, collaboration among multiple grouping criteria is essential. However, an excessive number of grouping criteria rarely improves parsing accuracy and substantially degrades parsing performance. Consequently, we restrict the number of grouping criteria per combination to no more than three. Determining the application order of grouping criteria involves balancing the computational overhead against the accuracy contribution of each criterion.
The candidate grouping criteria include five types: length, punctuation, first token, last token, and key token. These criteria are classified into two categories: structural criteria and content criteria. Length is determined by the total token count of log messages, while punctuation is defined by the set of punctuation symbols a log message contains. Both belong to structural criteria, as they group log messages based on their structural characteristics. In contrast, the first token corresponds to the initial word of a log message, the last token to its final word, and the key token refers to the extraction of representative key terms from log messages. These three criteria fall under content criteria, as they facilitate log grouping through the analysis of message content. Although additional grouping criteria could be integrated, this paper focuses on the aforementioned five types.
Compared with directly applicable grouping criteria such as length, punctuation, first token, and last token, the key token grouping criterion incurs additional computational overhead. To implement this criterion, this paper employs the TF-IDF algorithm to identify the tokens that best represent the core features of log templates from all derived log templates. We treat all derived log templates as a corpus, consider each log template as a text sample, compute the TF-IDF score of each token within each sample, and finally select the token with the highest TF-IDF score as the key token of that log template. If DLogParser employs key tokens as the grouping criterion, it is necessary to maintain a key token list to support subsequent grouping. Since the computation involves only a limited set of log templates, the additional computational overhead of the key token criterion is minimal (approximately 0.02 s) and does not increase with the size of the log dataset.
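A self-contained sketch of this key-token selection follows. The template strings are hypothetical; note that the wildcard ‘*’ tends to appear in many templates and therefore receives a low IDF score, so it is rarely selected (a real implementation might also exclude it explicitly):

```python
import math
from collections import Counter

def key_tokens(templates):
    """For each template, pick the token with the highest TF-IDF score,
    treating the template set as the corpus and each template as a document."""
    docs = [t.split() for t in templates]
    n_docs = len(docs)
    # document frequency of each token across the template corpus
    df = Counter(tok for doc in docs for tok in set(doc))
    keys = []
    for doc in docs:
        tf = Counter(doc)
        best = max(
            doc,
            key=lambda tok: (tf[tok] / len(doc)) * math.log(n_docs / df[tok]),
        )
        keys.append(best)
    return keys

templates = [
    "Received block * of size *",
    "Deleting block * file *",
    "Verification succeeded for *",
]
print(key_tokens(templates))  # ['Received', 'Deleting', 'Verification']
```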
To quantify a grouping criterion’s applicability in log parsing, we propose discriminability as a metric for measuring this property, as formally defined in Definition 1.
Definition 1 
(Discriminability). The ability of a grouping criterion to correctly group log messages.
Based on the above formulation, the discriminability of a grouping criterion x is denoted as dis(x). Mathematically, it is defined as follows:

dis(x) = n / m    (1)

where m represents the number of unique log templates contained in the log dataset, and n denotes the number of correctly assigned groups resulting from partitioning log messages via grouping criterion x.
For practical log parsing scenarios, dis(x) ranges from 0 to 1, with a higher value indicating a greater capacity of the criterion to accurately classify log messages into their correct groups. When dis(x) equals 1, log messages can be perfectly categorized into their correct groups based solely on the current grouping criterion. For example, if a log dataset contains 8 log templates and grouping its messages by length yields 4 correctly assigned groups (each containing one or more log templates), the discriminability is 4/8 = 0.5.
Intuitively, discriminability should be evaluated using log messages, as this ensures its accuracy. Take log message length as an example: when log messages contain variable-length dynamic fields, template-based analysis becomes increasingly unreliable, as log messages derived from identical log templates exhibit length variability. Similarly, key tokens can become misaligned when log messages include key tokens absent from the templates, resulting in template-based methods failing to capture critical semantic features. These limitations underscore notable risks when relying solely on log templates to evaluate the discriminability of key token-based grouping criteria.
To balance performance and accuracy, we adhere to our approach of evaluating discriminability based on log templates. Empirical results show that while log messages may reach up to millions of log entries, the number of corresponding log templates is typically only in the hundreds. Calculating discriminability directly from raw log messages would incur prohibitive computational overhead in comparison with template-based methods. In addition, despite the continuous generation of new log messages, the log template count remains relatively stable. As a result, the computational overhead associated with discriminability metrics is largely insensitive to the volume of log messages. In this way, we can adopt a straightforward method to calculate discriminability. For example, assuming 8 log templates with 4 distinct lengths in total, the discriminability is 0.5.
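Under this template-based method, the calculation reduces to counting distinct feature values among the templates, matching the worked example above (4 distinct lengths over 8 templates gives 0.5). A sketch, with hypothetical templates and two illustrative feature extractors (not DLogParser’s actual code):

```python
def discriminability(templates, feature):
    """Template-based approximation of dis(x): the number of distinct
    feature values across the templates divided by the template count."""
    values = {feature(t) for t in templates}
    return len(values) / len(templates)

# illustrative feature extractors for two of the five criteria
length = lambda t: len(t.split())      # structural: token count
first_token = lambda t: t.split()[0]   # content: first token

templates = [
    "Received block * of size *",
    "Deleting block * file *",
    "Verification succeeded for *",
    "Served block * to *",
]
print(discriminability(templates, length))       # 3 distinct lengths / 4 = 0.75
print(discriminability(templates, first_token))  # 4 distinct tokens / 4 = 1.0
```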
To verify whether a single grouping criterion or a combination of grouping criteria is applicable to log parsing, we introduce the discriminability threshold as defined in Definition 2. For ease of explanation, we use DisThr to denote this threshold. Intuitively, the single grouping criterion with the highest discriminability should be prioritized for log parsing. However, in practice, achieving ideal parsing accuracy with just one grouping criterion is difficult. Yet, using an excessive number of grouping criteria does not necessarily enhance log parsing accuracy, because misclassification is inherently inevitable during grouping, and an overabundance of grouping criteria amplifies these errors. This, in turn, causes log messages to be assigned to incorrect groups and compromises the overall accuracy of log parsing. Properly setting DisThr ensures that the number of grouping criteria used for log parsing is reasonably selected.
Definition 2 
(Discriminability threshold). The threshold that a single grouping criterion or a combination of grouping criteria must satisfy for use in log parsing.
In our opinion, a reasonable range for DisThr is 0.5–0.7, for the following reasons. First, most grouping criteria do not exhibit high discriminability across different log datasets. We sampled and analyzed the discriminability of the five grouping criteria, and the results indicate that for most log datasets, their discriminability values are below 0.6. Thus, setting DisThr to 0.7, for instance, avoids relying on a single grouping criterion for log parsing. Of course, if a specific grouping criterion achieves sufficiently high discriminability, it is feasible to employ only this criterion; in fact, researchers have been consistently seeking such high-discriminability grouping criteria. Second, DisThr should not be excessively high. For example, setting DisThr above 0.9 may lead to a failure to achieve satisfactory parsing accuracy even when multiple grouping criteria are employed. As noted earlier, more grouping criteria are not necessarily better for log parsing.
The sequence for using grouping criteria is primarily governed by the following considerations. First is the performance overhead associated with grouping log messages. Low-overhead criteria are recommended for the initial grouping stage to rapidly reduce the log dataset size, thereby lowering the overall performance overhead of log parsing. Second is discriminability—high-discriminability criteria should be employed first to mitigate the cascading effects of misgrouping. However, there is no strict standard for this prioritization. For instance, the length of log messages is often used as the initial criterion despite its generally low discriminability. Thus, empirical evidence plays a crucial role in determining the optimal sequence.
Based on the preceding discussion, seven rules are formulated to determine the log parsing policy. The first four serve to select the combination of grouping criteria, while the last three determine the sequence for applying these criteria.
Rule 1: In a single parsing, at most three grouping criteria are used.
Using an excessive number of grouping criteria does not yield substantial improvements. Since grouping accuracy cannot be fully guaranteed, excessive reliance on additional criteria also introduces more misgroupings and incurs extra performance overhead. Therefore, the combination of grouping criteria is limited to a maximum of three, as established in Rule 1.
Rule 2: Prioritize criteria with high discriminability when selecting the combination of grouping criteria.
To ensure log grouping efficiency, prioritize grouping criteria with high discriminability. This ensures that log messages are grouped as accurately as possible, thereby mitigating the further propagation of errors caused by misgroupings. However, this rule is not the only one for determining the application order of grouping criteria; it must be combined with other rules to finalize such an order.
Rule 3: The use of a single grouping criterion is permitted if its discriminability exceeds DisThr.
Ideally, only one grouping criterion is needed to accurately group logs. In such cases, the grouping can be accomplished with minimal performance overhead. However, if its discriminability is too low, there may be a large number of misgroupings, which further affect subsequent parsing. Therefore, if only one grouping criterion is used, its discriminability must exceed DisThr. Based on our experience, it is difficult for a single grouping criterion to achieve the desired parsing accuracy.
Rule 4: If a combination of grouping criteria contains two or more criteria, it shall include at least one structural criterion and one content criterion.
When a single criterion fails to meet the DisThr, we augment the combination with additional criteria to enhance its discriminability. Thus, to achieve more accurate log parsing, log features should be extracted from multiple dimensions. Structural criteria and content criteria capture distinct log information related to structural characteristics and content attributes respectively. Thus, such a combination of two or more criteria must incorporate both types of criteria.
Rule 5: If the discriminability of the combination of two grouping criteria is still below DisThr, adding another grouping criterion to it shall be allowed.
There are two methods to calculate the discriminability of the combination of two grouping criteria. The first method recalculates the discriminability of such a combination based on the obtained log templates, following Equation (1). However, this method incurs additional computational overhead. The second method employs an approximate calculation method. Given that the discriminabilities of criterion 1 and criterion 2 are p and q respectively, the joint discriminability is defined as s = 1 − (1 − p)(1 − q). However, in some cases, the grouping criteria are not fully independent. For instance, a key token may be the first or last token of a log message, and log messages with a high number of punctuation marks tend to be longer. If two grouping criteria are correlated, the approximate result will be higher than the actual discriminability. Fortunately, the correlation between grouping criteria usually exists among criteria of the same type, and there is almost no correlation between structural and content grouping criteria. Considering these factors, in this paper, this approximation approach is adopted to balance precision and computational efficiency.
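The approximate method can be stated in one line; the values below are illustrative, not measured:

```python
def joint_discriminability(p, q):
    """Independence-based approximation of the joint discriminability of
    two criteria: s = 1 - (1 - p)(1 - q). Overestimates the true value
    when the two criteria are correlated."""
    return 1 - (1 - p) * (1 - q)

# e.g. combining a structural criterion (dis = 0.5) with a content
# criterion (dis = 0.6):
s = joint_discriminability(0.5, 0.6)  # 0.8
```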
Rule 6: If a combination of grouping criteria contains both structural and content criteria, structural criteria shall be applied first during log parsing.
The prioritization of structural criteria stems from two key factors. First, structural criteria typically incur lower computational overhead than content criteria. More significantly, they mitigate misgrouping errors caused by irregular log outputs. For instance, when programmers inconsistently use verb tenses, relying solely on lexical content may misclassify log messages that should belong to the same template. Structural criteria can inherently avoid such ambiguities.
Rule 7: When multiple grouping criteria of the same type (e.g., all structural or content criteria) coexist in a combination of grouping criteria, the criteria with the highest discriminability shall take precedence in log parsing.
When a combination of grouping criteria contains three criteria, two of them may belong to the same category. In such scenarios, structural criteria shall still be applied first. If the combination contains two structural criteria, the one with the higher discriminability shall be prioritized. Similarly, when two content criteria are present, the content criterion with the lower discriminability shall be applied last among the three criteria.
Based on the above discussion, we propose a log parsing policy generation algorithm as described in Algorithm 1. The prerequisite for this algorithm is that the discriminability of all candidate criteria has been calculated. First, if the grouping criterion with the maximum discriminability among all candidates exceeds DisThr, we select this criterion for log parsing (i.e., Rule 3). If not, we sort structural and content grouping criteria separately by discriminability, and select the criterion with the highest discriminability from each category (i.e., Rules 2, 4, and 6). If the discriminability of the combination of these two criteria still does not exceed DisThr, the criterion with the highest discriminability from all remaining candidate criteria is added to the existing combination (i.e., Rules 5, 6, and 7). Considering the balance between performance and accuracy, the policy is limited to a maximum of three grouping criteria (i.e., Rule 1).
Algorithm 1 Log Policy Generation Algorithm. Source: author’s contribution.
Require: candidate criteria C
Ensure: policy p
 1: tx = SearchMaxDis(C) // Select the criterion with the maximum discriminability.
 2: if GetDis(tx) > DisThr then
 3:     p = {tx}
 4: else
 5:     st[] = SortStrTokenByDis(C) // Sort the structural criteria by discriminability.
 6:     ct[] = SortConTokenByDis(C) // Sort the content criteria by discriminability.
 7:     if GetDis(st[1], ct[1]) > DisThr then
 8:         p = {st[1], ct[1]}
 9:     else
10:         C = C − {st[1], ct[1]} // Remove the selected grouping criteria.
11:         ty = SearchMaxDis(C)
12:         if IsStrToken(ty) then
13:             p = {st[1], ty, ct[1]}
14:         else
15:             p = {st[1], ct[1], ty}
16:         end if
17:     end if
18: end if
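Algorithm 1 can be rendered as the following Python sketch. The discriminability lookup `dis` (including the combined value for a criterion pair) and the `DIS_THR` constant are assumed inputs; the paper does not fix their representation here:

```python
DIS_THR = 0.9  # hypothetical threshold; the paper does not fix DisThr's value here

def generate_policy(candidates, dis, is_structural):
    """Sketch of Algorithm 1 (a policy holds at most three grouping criteria).

    candidates    -- criterion names, e.g. ["length", "punctuation", ...]
    dis           -- discriminability of each criterion, and of the pair
                     (best structural, best content) under a tuple key
    is_structural -- predicate: True for structural criteria
    """
    tx = max(candidates, key=lambda c: dis[c])
    if dis[tx] > DIS_THR:                         # Rule 3: one criterion suffices
        return [tx]
    st = sorted([c for c in candidates if is_structural(c)],
                key=lambda c: dis[c], reverse=True)
    ct = sorted([c for c in candidates if not is_structural(c)],
                key=lambda c: dis[c], reverse=True)
    pair = (st[0], ct[0])                         # Rules 2, 4, 6: best of each category
    if dis[pair] > DIS_THR:
        return list(pair)
    rest = [c for c in candidates if c not in pair]
    ty = max(rest, key=lambda c: dis[c])          # Rules 5-7: add a third criterion
    # Structural criteria precede content criteria in the final ordering.
    return [st[0], ty, ct[0]] if is_structural(ty) else [st[0], ct[0], ty]
```

The sketch assumes that both categories are non-empty, which holds for the five criteria used in this paper.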
All log parsing policies generated by Algorithm 1 in this paper contain two or three grouping criteria, e.g., {length, first token}, {length, key token}, and {punctuation, first token}. None of the five grouping criteria employed in this paper can accomplish log parsing on its own with satisfactory accuracy; using two or three criteria together improves parsing accuracy. However, our algorithm does not always yield the optimal solution: we found that the policy {length, punctuation} also achieves satisfactory accuracy, and making full use of prior expert experience to optimize the parsing policy is a viable alternative. During policy selection, we use log templates as the data source supporting policy generation. Regardless of the grouping criteria a policy adopts, the discriminability of all candidate criteria must be computed, which involves operations such as TF-IDF-based keyword enhancement. Although these operations incur computational overhead, the time cost is very limited because the number of log templates is small (generally no more than 1000).
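As a concrete illustration of the TF-IDF-based keyword step mentioned above, the sketch below scores template tokens with a standard tf-idf weighting and picks each template's highest-scoring token as its key token. This is our own minimal rendition, not DLogParser's exact formula:

```python
import math
from collections import Counter

def key_tokens(templates):
    """For each template, return its highest tf-idf token (illustrative formula)."""
    docs = [t.split() for t in templates]
    n = len(docs)
    # Document frequency: in how many templates each token appears.
    df = Counter(tok for d in docs for tok in set(d))
    keys = []
    for d in docs:
        tf = Counter(d)
        # Term frequency times inverse document frequency.
        score = {tok: (tf[tok] / len(d)) * math.log(n / df[tok]) for tok in tf}
        keys.append(max(score, key=score.get))
    return keys
```

Tokens shared by every template (such as the wildcard) receive an idf of zero, so the distinctive token of each template is naturally selected.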

3.3. Step 3 Full Parsing

Full parsing consists of two steps: log message grouping and similarity determination. In the grouping step, DLogParser divides log messages into distinct groups according to the dynamically generated grouping policy; unlike existing works, we adopt a dynamic parsing tree to achieve this. In the similarity determination step, log templates are generated from the grouped log messages. For this step we reuse the method of Drain [17], an existing static parser, so below we focus on the grouping process based on the dynamic parsing tree.
The dynamic parsing tree is closely associated with parsing policy. Except for the root node and leaf nodes, the intermediate layers of the dynamic parsing tree vary depending on the parsing policy. If the parsing policy includes three grouping criteria, the parsing tree has three intermediate layers. In each intermediate layer, every node holds a value for grouping log messages. The leaf nodes of the dynamic parsing tree are the log groups. As shown in Figure 3, if the adopted parsing policy is {punctuation, key token}, the dynamic parsing tree has two intermediate layers, referred to as the punctuation-layer and key token-layer respectively. Nodes in the punctuation-layer record the punctuation mark sequence, while nodes in the key token-layer record the key token.
Log message grouping is accomplished using the dynamic parsing tree. We still use Figure 3 as an example to illustrate this process. If the log message to be parsed is denoted as msgx and the parsing policy is {punctuation, key token}, this means grouping is first performed by the punctuation marks of msgx, followed by grouping based on the key token. To group by punctuation, we first extract the complete punctuation mark sequence of msgx and match it with the nodes in the punctuation-layer of the dynamic parsing tree. If the punctuation mark sequence of msgx is “:!”, it exactly matches the 2nd node in the punctuation-layer. This scenario is referred to as a “hit”. Otherwise, it is called a “miss”. Similarly, for grouping by the key token, if msgx contains the key token “token2”, this indicates another hit. Since the dynamic parsing tree employs only two grouping criteria, the grouping process is completed, and msgx is assigned to Group y as shown in the figure. In other words, log message grouping starts from the root of the dynamic parsing tree, sequentially performs a variable number of matching operations, and ultimately terminates at a leaf node.
Misses during the grouping process typically trigger updates to the dynamic parsing tree. In the initial state, the dynamic parsing tree only contains a root node. When the first log message is grouped, a miss is inevitable since no intermediate nodes exist. At this point, a new node is generated for the dynamic parsing tree. For example, if the current grouping criterion being checked is length, a node is created with the length value of the current log message as its node value (let us assume this value is y). In the grouping of subsequent log messages, if there is a log message with length y, this node will be hit. As log message grouping is repeated, the dynamic parsing tree continuously generates new nodes. However, as the number of grouped log messages increases, the number of misses decreases. This is because the dynamic parsing tree has already identified most log messages by this point—only unfamiliar log messages will trigger another update to the dynamic parsing tree.
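The grouping and update behavior described in the last two paragraphs can be sketched with nested dictionaries, where a miss simply materializes the missing node. The policy and the feature extractors below are illustrative assumptions, not DLogParser's exact ones:

```python
import string

class DynamicParsingTree:
    """Minimal sketch: one intermediate layer per grouping criterion,
    leaves are log groups (lists of messages)."""

    def __init__(self, policy, extractors):
        self.policy = policy            # ordered grouping criteria
        self.extractors = extractors    # criterion name -> feature function
        self.root = {}                  # nested dicts, one level per criterion

    def group(self, msg):
        """Walk the tree layer by layer; a miss creates the missing node."""
        node = self.root
        for criterion in self.policy[:-1]:
            value = self.extractors[criterion](msg)
            node = node.setdefault(value, {})   # miss -> new intermediate node
        leaf_key = self.extractors[self.policy[-1]](msg)
        group = node.setdefault(leaf_key, [])   # miss -> new leaf (log group)
        group.append(msg)
        return group

extractors = {
    "punctuation": lambda m: "".join(c for c in m if c in string.punctuation),
    "length": lambda m: len(m.split()),
}
tree = DynamicParsingTree(["punctuation", "length"], extractors)
tree.group("Received block blk_1 of size 67108864 from /10.250.19.102")
tree.group("Received block blk_2 of size 67108864 from /10.250.10.6")
```

Both sample messages share the punctuation signature and token length, so the second lookup hits the nodes created by the first and both land in the same group.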
Similarity determination is also required to complete the parsing process. Ideally, log messages within each group should belong to the same log template. In this case, by regarding the identical parts of these log messages as constants and the non-identical parts as variables (with the latter replaced by *), we can derive the log template and thus complete the parsing. Unfortunately, it is difficult to fully ensure that all log messages within a group are identical. To address this, we adopt the similarity determination method employed by Drain [17] to finalize the parsing. The similarity simSeq between two log messages seq_1 and seq_2 is calculated as shown in the following equation.
simSeq = \frac{\sum_{i=1}^{n} equ(seq_1(i), seq_2(i))}{n}
Here, seq(i) denotes the i-th token of the log message seq, n is the number of tokens in the log message, and equ is defined as follows.
equ(t_1, t_2) = \begin{cases} 1 & \text{if } t_1 = t_2 \\ 0 & \text{otherwise} \end{cases}
Here, t_1 and t_2 are two tokens. Using the above method, we can obtain the similarity between two log messages. As long as the similarity is greater than a preset threshold, the two log messages are considered similar.
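A minimal implementation of the similarity computation, assuming both messages have already been tokenized and grouped by length so that n is well defined:

```python
def seq_similarity(seq1, seq2):
    """simSeq: the fraction of positions whose tokens are equal (equ = 1)."""
    n = len(seq1)
    return sum(t1 == t2 for t1, t2 in zip(seq1, seq2)) / n

# "Send" and "block" match, the block IDs differ: simSeq = 2/3
sim = seq_similarity("Send block blk_1".split(), "Send block blk_2".split())
```

The mismatched position would then be replaced by * when the template is derived.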

4. Evaluation

We evaluate DLogParser from three aspects: accuracy, robustness, and efficiency, and the test dataset is provided by LogHub [26].

4.1. Accuracy

Table 3 presents the accuracy statistics of various log parsers across different log types. To ensure a consistent number of log messages across all log types, we selected the first 50,000 log messages from each of 14 log types in LogHub as the test dataset. The baseline parsers include AEL [15], IPLoM [16], LogCluster [11], LogMine [12], LogSig [14], LFA [27], Spell [28], Drain [17], nDrain+ [22], LILAC [19], and AdaParser [29]; LILAC and AdaParser are LLM-based parsers that leverage large language models (LLMs) for log template generation. Among the non-LLM-based parsers, DLogParser achieves the highest accuracy on five log types (including Hadoop), while Drain attains optimal accuracy on another six (e.g., HDFS). Furthermore, DLogParser achieves an average parsing accuracy of 0.903, the highest among the non-LLM-based counterparts. The core driver of this improvement is DLogParser’s dynamic parsing policy, which selects grouping criteria tailored to log characteristics. Although DLogParser does not outperform the other parsers on every log type, its dynamic mechanism selects the most suitable grouping policy even for complex logs, yielding a notable improvement in overall average parsing accuracy.
As shown in the table, AdaParser [29] achieves the highest parsing accuracy of 96.9%, outperforming DLogParser. However, this improvement comes at a cost: LILAC and AdaParser leverage LLMs for log template generation and therefore incur both the communication overhead and the economic cost of LLM access. In our experiments, when parsing Thunderbird logs, LILAC and AdaParser accessed LLMs 582 and 561 times, respectively; reducing LLM access frequency is currently a key research focus for LLM-based parsers. Additionally, LLM hallucination can lead to inaccurate log templates. To correct erroneous templates generated by LLMs, AdaParser incorporates a template corrector, which is a critical reason why its parsing accuracy exceeds that of LILAC; however, constructing such a corrector requires manual intervention. Although DLogParser’s average parsing accuracy is lower than AdaParser’s, it incurs neither the costs of LLM access nor the manual development of a template corrector. These are the key advantages of DLogParser.
Table 3 shows that DLogParser’s accuracy in parsing Proxifier, OpenStack, and Mac remains unsatisfactory, with none exceeding 0.8. There are two main reasons. First, sampling parsing fails to accurately capture log features. Taking Mac logs as an example, 1909 log templates can be generated from the test dataset, while those obtained through sampling parsing account for only 8.43%. As a result, sampling parsing can only formulate policies from a small subset of log templates, which may not suit the subsequent log messages, ultimately reducing parsing accuracy. Second, due to the diversity and complexity of log features, even with five grouping criteria DLogParser cannot fully adapt to all scenarios. For these three log types (Proxifier, OpenStack, and Mac), the adopted policy is {length, first token, key token}. Theoretically, key token-based grouping should provide more accurate grouping, but the experimental results do not support this: since the log templates generated via sampling parsing are limited, the key token sequences derived from them lack comprehensiveness and prove ineffective for subsequent parsing. To verify this conjecture, we constructed key token sequences using all OpenStack log templates as the corpus and performed key token-based grouping, which improved the parsing accuracy to 87.8%.

4.2. Robustness

We evaluate the robustness of each log parser when parsing diverse logs. Figure 4 presents the distribution of parsing accuracy for ten log parsers across fourteen distinct logs as a box plot, where each box visualizes five statistical measures: the minimum, the first quartile, the median, the third quartile, and the maximum. As shown in Figure 4, DLogParser achieves the best median accuracy of 0.954 and exhibits the smallest accuracy variability. These results indicate that DLogParser is highly robust and can adapt to most heterogeneous logs.
We attribute DLogParser’s superior robustness to the dynamic nature of its parsing policies. Unlike static parsers such as Drain, which apply a single parsing policy to all log parsing tasks, DLogParser employs six distinct policies across the fourteen log types. For instance, it uses {length, first token} to parse Zookeeper logs, {length, first token, key token} for Hadoop, and {length, key token, first token} for OpenSSH. The dynamic policy is not always optimal: during HDFS parsing, DLogParser selects key tokens yet achieves lower accuracy than Drain. Overall, however, the dynamic mechanism enables adaptive policy selection aligned with log characteristics and thus yields enhanced robustness. Drain, nDrain+, AEL, and IPLoM also exhibit good robustness, with median accuracies all exceeding 0.8. Drain, nDrain+, and AEL are heuristic-based parsers, while IPLoM is a clustering-based parser; each achieves the best parsing accuracy on certain log types. This highlights how difficult it is to parse all logs perfectly with a single method, which is precisely why DLogParser adopts a dynamic parsing policy to accommodate a wider range of logs.

4.3. Efficiency

Since DLogParser introduces additional processes, namely sampling parsing and policy generation, which incur extra performance overhead, we conducted performance tests on DLogParser. The experimental platform was equipped with an Intel(R) Core(TM) i5-14600KF CPU, an NVIDIA GeForce RTX 5060 Ti GPU, and 32 GB of RAM. For comparison we selected the Spark and HDFS logs. To ensure reliability, the reported results are the average of three independent test runs.
Figure 5 presents the efficiency results. We tested four quantities of log messages (50,000, 100,000, 150,000, and 200,000). The figure shows the time overhead of Drain, AEL, IPLoM, LogCluster, and DLogParser, measured in seconds. DLogParser takes approximately 10.63 s to process 100,000 HDFS logs. Its overhead varies because dynamic policy-based parsing introduces varying costs: when the policy requires only one grouping criterion, the overhead drops significantly, whereas when DLogParser determines that multiple grouping criteria must work together to improve accuracy, it trades some performance for that accuracy. The time consumed by policy generation accounts for less than 0.1% of the total overhead. The time overhead of LILAC and AdaParser is closely correlated with LLM access frequency; for instance, parsing 50,000 log messages takes LILAC 813 s and AdaParser 1654 s, significantly more than the non-LLM-based parsers.

5. Related Work

The work related to this paper primarily concerns data-driven parsers, which fall into four categories: clustering-based parsers, frequent pattern mining-based parsers, heuristic-based parsers, and large language model (LLM)-based parsers.
Frequent pattern mining-based parsers distinguish constant and variable components in logs through word occurrence frequency. Given that variables frequently change while constants remain fixed, high-frequency words are identified as constants. These methods typically require a full log traversal to calculate word frequencies, mark high-frequency words as constants, and determine variables by combining positional information with frequency thresholds. Subsequently, log messages are clustered according to constants, variables, and their positional relationships, ultimately generating log templates. SLCT [9], FP-tree [10], and LogCluster [11] are all frequent pattern mining-based parsers. The fundamental limitation of such parsers lies in the fact that constants may not necessarily exhibit high frequency. When a system event itself is uncommon, for instance a sporadic system error, the corresponding log messages occur with extremely low frequency. Consequently, these parsers cannot correctly process such rare log messages.
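The frequency-based labeling shared by these parsers can be illustrated in a few lines (the threshold and the message set are hypothetical, and real parsers additionally exploit token positions):

```python
from collections import Counter

def mark_constants(messages, threshold):
    """Tokens appearing in at least `threshold` messages are treated as
    constants; all other tokens are abstracted to the variable marker <*>."""
    freq = Counter(tok for m in messages for tok in set(m.split()))
    templates = []
    for m in messages:
        templates.append(" ".join(
            tok if freq[tok] >= threshold else "<*>" for tok in m.split()))
    return templates

msgs = ["Open file a.txt", "Open file b.txt", "Open file c.txt"]
# "Open" and "file" recur across messages -> constants; file names -> variables
```

A rare event's constants fall below the threshold and are wrongly abstracted, which is exactly the limitation noted above.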
Clustering-based log parsers operate under the fundamental premise that log messages originating from identical log printing code inherently exhibit similarity. By grouping multiple log messages according to principles of similarity, log templates can be derived. Different parsers emerge due to varying similarity criteria and assessment methodologies [12,14]. For instance, LKE [13] employs textual edit distance to measure log message similarity. LogMine [12] utilizes hierarchical clustering techniques for similarity assessment. A critical challenge confronting these methods is that intrinsically similar log messages may exhibit divergence under specific circumstances. For example, when evaluating similarity using position-based tokens, the presence of variable-length variables can induce errors in similarity determination.
Heuristic-based parsers establish corresponding heuristic rules based on the inherent characteristics of logs to accomplish log parsing. Representative heuristic parsers include AEL [15], IPLoM [16], Drain [17], and LogPunk [30]. Drain postulates that log messages of the same category should exhibit identical length and identical initial tokens; consequently, it employs a fixed-depth parsing tree for log parsing, where clustering operations are performed after classification by length and initial token. LogPunk contends that punctuation marks within the same log message category demonstrate similarity, combining length and punctuation tokens to achieve log grouping. The methodology proposed in this paper also belongs to this category. However, a distinction exists in that the aforementioned methods constitute static parsers whereas our approach represents a dynamic parser.
LLM-based log parsers represent an emerging category of log parsing techniques developed alongside advancements in large language models. Traditional log parsers have consistently failed to achieve entirely satisfactory parsing accuracy, leading researchers to incorporate LLMs for accuracy enhancement [18,31,32]. Leveraging the natural language comprehension capabilities of LLMs enables substantial improvements in log parsing accuracy through semantic analysis [32,33]. However, the deployment costs of LLMs and the communication overhead of accessing them render fully LLM-dependent log parsing impractical. To address this, LILAC [19], HELP [20], LibreLog [34], and LogBatcher [35] minimize reliance on LLMs by performing the necessary grouping or clustering of log messages before invoking the LLM. In other words, LLM-based log parsers are a fusion of early-stage log parsers and large language models.

6. Discussion

In comparison with static log parsers, the primary advantage of DLogParser lies in its dynamic nature, which enables it to select appropriate parsing policies based on log characteristics. The dynamicity of DLogParser is primarily manifested in the selection of grouping criteria, as it does not aim to design entirely new grouping criteria but instead fully leverages multiple existing ones. For instance, Drain [17] adopts length and the first token as grouping criteria, while LogPunk [30] utilizes length and punctuation marks as grouping criteria. However, these static parsers all employ fixed parsing policies, making it difficult for them to adapt to diverse log types. In contrast, DLogParser dynamically adjusts the grouping criteria underpinning its parsing policies based on log characteristics. Our experimental results confirm this, as DLogParser employs six distinct parsing policies to successfully parse fourteen types of logs.
As a dynamic log parser, DLogParser possesses excellent average parsing accuracy and robustness. Different logs exhibit distinct characteristics. For instance, if the first tokens of the log templates of a certain log type differ significantly, adopting the first token as the grouping criterion achieves favorable parsing results. In contrast, if the first tokens are largely identical but the templates contain diverse keywords, using the key token as the grouping criterion yields high-quality results. By selecting appropriate grouping criteria based on log characteristics, DLogParser can effectively adapt to variations in log characteristics. Overall, DLogParser achieves better average parsing accuracy and median accuracy than static parsers such as Drain [17] and nDrain+ [22], as our experimental results further confirm.
In our view, DLogParser is better suited for online log parsing. A key challenge in optimizing DLogParser lies in understanding log features well enough to select appropriate grouping criteria. Although DLogParser incorporates a sampling parsing approach, this method struggles to capture the complete features of the logs; as observed in our experiments, sampling parsing particularly fails to capture the features of Mac logs accurately. Consequently, parsing policies derived from sampling results cannot fully adapt to the actual parsing scenario. When applied to online log parsing, DLogParser can systematically analyze log features after accumulating and processing a sufficient number of logs, thereby selecting policies better tailored to the actual log characteristics. Our experimental results demonstrate that when DLogParser parses OpenStack logs online, the parsing accuracy exceeds 87%, compared to its current accuracy of only 73.4%.
Currently, large language models have been gradually applied in the field of log parsing [18,32]. However, this does not mean that core methods adopted by traditional log parsers, such as log grouping and clustering, will be abandoned. LLM-based log parsers need to bear the communication overhead and economic costs incurred by accessing LLMs. Therefore, although such parsers rely on LLMs to complete core parsing tasks, they still strive to minimize the frequency of LLM calls [19,20,34]. Their core approach can be summarized as follows: first cluster and group similar log messages, then provide samples and related logs within the group to the LLM, requesting it to generate corresponding log templates. In our view, DLogParser can provide more optimal grouping results for LLM-based log parsers, and DLogParser can complement LLM-based parsers. We consider this as our future work.

7. Conclusions

This paper proposes a dynamic log parser, DLogParser. DLogParser acquires partial log templates through sampling parsing, analyzes their characteristics, selects suitable grouping criteria, and thus dynamically adjusts its parsing policy. We implemented a prototype of DLogParser and validated its effectiveness using logs provided by LogHub as the test dataset. Experimental results demonstrate that, compared to the nine non-LLM-based baseline parsers, DLogParser achieves the highest average parsing accuracy with acceptable overhead.

Author Contributions

Conceptualization, H.Z. and J.Y.; methodology, H.Z. and C.W.; software, C.W. and J.Y.; validation, Y.Z.; formal analysis, J.Y.; investigation, H.Z.; resources, Y.W.; writing—original draft preparation, H.Z.; writing—review and editing, J.Y. and Y.W.; visualization, C.W.; supervision, Y.Z.; project administration, H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (61902427) and the Henan Provincial Key Scientific and Technological Research Project (252102210104).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available. The download address of the dataset is: https://github.com/logpai/loghub, accessed on 8 January 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, S.; He, P.; Chen, Z.; Yang, T.; Su, Y.; Lyu, M. A Survey on Automated Log Analysis for Reliability Engineering. ACM Comput. Surv. 2021, 54, 1–37. [Google Scholar] [CrossRef]
  2. Le, V.-H.; Zhang, H.Y. Log-based Anomaly Detection with Deep Learning: How Far Are We? In Proceedings of the IEEE/ACM International Conference on Software Engineering, Pittsburgh, PA, USA, 25–27 May 2022; pp. 1356–1367. [Google Scholar]
  3. Yang, L.; Chen, J.J.; Wang, Z.; Wang, W.J.; Jiang, J.J.; Dong, X.Y. Semi-Supervised Log-Based Anomaly Detection via Probabilistic Label Estimation. In Proceedings of the IEEE/ACM International Conference on Software Engineering, Madrid, Spain, 22–30 May 2021; pp. 1448–1460. [Google Scholar]
  4. Chen, A.R.; Chen, T.-H.; Wang, S.W. Pathidea: Improving Information Retrieval-Based Bug Localization by Re-Constructing Execution Paths Using Logs. IEEE Trans. Softw. Eng. 2022, 48, 2905–2919. [Google Scholar] [CrossRef]
  5. Chuah, E.; Kuo, S.-H.; Hiew, P.; Tjhi, W.-C.; Lee, G.; Hammond, J. Diagnosing the root-causes of failures from cluster log files. In Proceedings of the International Conference on High Performance Computing, Goa, India, 19–22 December 2010; pp. 1–10. [Google Scholar]
  6. Notaro, P.; Haeri, S.; Cardoso, J.; Gerndt, M. LogRule: Efficient Structured Log Mining for Root Cause Analysis. IEEE Trans. Netw. Serv. Manag. 2023, 20, 4231–4243. [Google Scholar] [CrossRef]
  7. Bushong, V.; Sanders, R.; Curtis, J.; Du, M.; Cerny, T.; Frajtak, K.; Bures, M.; Tisnovsky, P.; Shin, D.W. On Matching Log Analysis to Source Code: A Systematic Mapping Study. In Proceedings of the International Conference on Research in Adaptive and Convergent Systems, Gwangju, Republic of Korea, 13–16 October 2020; pp. 181–187. [Google Scholar]
  8. Shang, W.Y. Bridging the divide between software developers and operators using logs. In Proceedings of the International Conference on Software Engineering, Zurich, Switzerland, 2–9 June 2012; pp. 1583–1586. [Google Scholar]
  9. Vaarandi, R. A data clustering algorithm for mining patterns from event logs. In Proceedings of the IEEE Workshop on IP Operations and Management, Kansas City, MO, USA, 1–3 October 2003; pp. 119–126. [Google Scholar]
  10. Han, J.; Pei, J.; Yin, Y.; Mao, R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 2004, 8, 53–87. [Google Scholar] [CrossRef]
  11. Vaarandi, R.; Pihelgas, M. LogCluster—A data clustering and pattern mining algorithm for event logs. In Proceedings of the International Conference on Network and Service Management, Barcelona, Spain, 9–13 November 2015; pp. 1–7. [Google Scholar]
  12. Hamooni, H.; Debnath, B.; Xu, J.W.; Zhang, H.; Jiang, G.F.; Mueen, A. LogMine: Fast Pattern Recognition for Log Analytics. In Proceedings of the ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 1573–1582. [Google Scholar]
  13. Fu, Q.; Lou, J.-G.; Wang, Y.; Li, J. Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis. In Proceedings of the IEEE International Conference on Data Mining, Miami, FL, USA, 6–9 December 2009; pp. 149–158. [Google Scholar]
  14. Tang, L.; Li, T.; Perng, C.-S. LogSig: Generating system events from raw textual logs. In Proceedings of the ACM international conference on Information and knowledge management, Glasgow, UK, 24–28 October 2011; pp. 785–794. [Google Scholar]
  15. Jiang, Z.M.; Hassan, A.E.; Flora, P.; Hamann, G. Abstracting Execution Logs to Execution Events for Enterprise Applications. In Proceedings of the International Conference on Quality Software, Oxford, UK, 12–13 August 2008; pp. 181–186. [Google Scholar]
  16. Makanju, A.A.O.; Zincir-Heywood, A.N.; Milios, E.E. Clustering event logs using iterative partitioning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 1255–1264. [Google Scholar]
  17. He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An Online Log Parsing Approach with Fixed Depth Tree. In Proceedings of the IEEE International Conference on Web Services, Honolulu, HI, USA, 25–30 June 2017; pp. 33–40. [Google Scholar]
  18. Le, V.-H.; Zhang, H. Log Parsing with Prompt-based Few-shot Learning. In Proceedings of the IEEE/ACM International Conference on Software Engineering, Australia, 14–20 May 2023; pp. 2438–2449. [Google Scholar]
  19. Jiang, Z.; Liu, J.; Chen, Z.; Li, Y.; Huang, J.; Huo, Y.; He, P.; Gu, J.; Lyu, M. LILAC: Log Parsing using LLMs with Adaptive Parsing Cache. Proc. ACM Softw. Eng. 2024, 1, 137–160. [Google Scholar] [CrossRef]
  20. Xu, A.; Gau, A. HELP: Hierarchical Embeddings-based Log Parsing. arXiv 2024. [Google Scholar] [CrossRef]
  21. He, P.J.; Zhu, J.M.; Xu, P.C.; Zheng, Z.B.; Lyu, M.R. A Directed Acyclic Graph Approach to Online Log Parsing. arXiv 2018. [Google Scholar] [CrossRef]
  22. Yuan, J.H.; Zhou, H.W.; Wang, C.; Guan, B. nDrain: A Robust Log Template Mining Algorithm. In Proceedings of the International Conference on Computer and Communications, Chengdu, China, 13–16 December 2024; pp. 332–336. [Google Scholar]
  23. Zhu, J.M.; He, S.L.; Liu, J.Y.; He, P.J.; Xie, Q.; Zheng, Z.B. Tools and Benchmarks for Automated Log Parsing. In Proceedings of the IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, Montreal, QC, Canada, 25–31 May 2019; pp. 121–130. [Google Scholar]
  24. He, P.J.; Zhu, J.M.; He, S.L.; Li, J.; Lyu, M.R. An Evaluation Study on Log Parsing and Its Use in Log Mining. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks, Toulouse, France, 28 June–1 July 2016; pp. 654–661. [Google Scholar]
  25. Jiang, J.; Fu, Y.; Xu, J. PosParser: A Heuristic Online Log Parsing Method Based on Part-of-Speech Tagging. IEEE Trans. Big Data 2025, 11, 1334–1345. [Google Scholar] [CrossRef]
  26. Zhu, J.M.; He, S.L.; He, P.J.; Liu, J.Y.; Lyu, M.R. Loghub: A Large Collection of System Log Datasets for AI-driven Log Analytics. In Proceedings of the IEEE International Symposium on Software Reliability Engineering, Florence, Italy, 9–12 October 2023; pp. 355–366.
  27. Nagappan, M.; Vouk, M.A. Abstracting Log Lines to Log Event Types for Mining Software System Logs. In Proceedings of the IEEE Working Conference on Mining Software Repositories, Cape Town, South Africa, 2–3 May 2010; pp. 114–117.
  28. Du, M.; Li, F.F. Spell: Streaming Parsing of System Event Logs. In Proceedings of the IEEE International Conference on Data Mining, Barcelona, Spain, 12–15 December 2016; pp. 859–864.
  29. Wu, Y.; Yu, S.; Li, Y. Log Parsing Using LLMs with Self-Generated In-Context Learning and Self-Correction. arXiv 2025, arXiv:2406.03376v2.
  30. Zhang, S.J.; Gang, W. Efficient Online Log Parsing with Log Punctuations Signature. Appl. Sci. 2021, 11, 11974.
  31. Jiang, Z.H.; Liu, J.Y.; Huang, J.J.; Li, Y.C.; Huo, Y.T.; Gu, J.Z.; Chen, Z.B.; Zhu, J.M.; Lyu, M.R. A Large-Scale Evaluation for Log Parsing Techniques: How Far Are We? In Proceedings of the ACM SIGSOFT International Symposium on Software Testing and Analysis, New York, NY, USA, 16–20 September 2024; pp. 223–234.
  32. Ma, Z.; Chen, A.R.; Kim, D.J.; Chen, T.-H.P.; Wang, S. LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing. In Proceedings of the IEEE/ACM International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1209–1221.
  33. Xu, J.J.L.; Yang, R.C.; Huo, Y.T.; Zhang, C.Y.; He, P.J. DivLog: Log Parsing with Prompt Enhanced In-Context Learning. In Proceedings of the IEEE/ACM International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 2457–2468.
  34. Ma, Z.Y.; Kim, D.J.; Chen, T.-H.P. LibreLog: Accurate and Efficient Unsupervised Log Parsing Using Open-Source Large Language Models. In Proceedings of the IEEE/ACM International Conference on Software Engineering, Ottawa, ON, Canada, 26 April–6 May 2025; pp. 924–936.
  35. Xiao, Y.; Le, V.-H.; Zhang, H.Y. Demonstration-Free: Towards More Practical Log Parsing with Large Language Models. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, Sacramento, CA, USA, 27 October–1 November 2024; pp. 153–165.
Figure 1. An example of log parsing. Source: author’s contribution.
Figure 2. Workflow of DLogParser. Source: author’s contribution.
Figure 3. An example of dynamic parsing tree. Source: author’s contribution.
Figure 4. Accuracy distribution of log parsers across different logs. Source: author’s contribution.
Figure 5. Parsing time of log parsers on different logs. Source: author’s contribution.
Table 1. Statistics of log templates across 7 log types. Source: author’s contribution.

| Log       | Length | First Token | Total Templates | Accuracy |
|-----------|--------|-------------|-----------------|----------|
| HDFS      | 10     | 10          | 14              | 100%     |
| Hadoop    | 26     | 96          | 261             | 92.3%    |
| Zookeeper | 30     | 17          | 77              | 99.5%    |
| Apache    | 18     | 12          | 30              | 100%     |
| Linux     | 18     | 199         | 441             | 68.6%    |
| OpenSSH   | 12     | 17          | 26              | 73.5%    |
| Mac       | 41     | 221         | 1909            | 76.5%    |
Table 2. Distribution of log templates across log messages. Source: author’s contribution.

| Log     | 5%    | 10%   | 20%   | 30%   | 50%   |
|---------|-------|-------|-------|-------|-------|
| HDFS    | 85.7% | 85.7% | 92.8% | 92.8% | 92.8% |
| Hadoop  | 52.4% | 69.7% | 75.4% | 79.6% | 92.7% |
| Spark   | 71.3% | 72.7% | 75.7% | 83.1% | 97.8% |
| Mac     | 8.4%  | 15.6% | 28.8% | 41.8% | 55.9% |
| OpenSSH | 80.7% | 84.6% | 84.6% | 84.6% | 84.6% |
Table 3. Parsing accuracy of different log parsers. Source: author’s contribution.

| Dataset     | AEL   | IPLoM | LogCluster | LogMine | LogSig | LFA   | Spell | Drain | nDrain+ | LILAC | AdaParser | DLogParser |
|-------------|-------|-------|------------|---------|--------|-------|-------|-------|---------|-------|-----------|------------|
| HDFS        | 0.999 | 0.991 | 0.463      | 0.748   | 0.508  | 0.780 | 0.991 | 1.000 | 0.999   | 1.000 | 1.000     | 0.999      |
| Hadoop      | 0.842 | 0.919 | 0.512      | 0.848   | 0.285  | 0.673 | 0.455 | 0.923 | 0.927   | 0.875 | 0.990     | 0.955      |
| Spark       | 0.548 | 0.056 | 0.019      | 0.009   | 0.105  | 0.049 | 0.541 | 0.921 | 0.921   | 0.992 | 0.996     | 0.996      |
| ZooKeeper   | 0.991 | 0.995 | 0.726      | 0.679   | 0.783  | 0.844 | 0.990 | 0.995 | 0.989   | 1.000 | 1.000     | 0.995      |
| BGL         | 0.997 | 0.997 | 0.983      | 0.880   | 0.232  | 0.991 | 0.974 | 0.999 | 0.995   | 0.998 | 0.999     | 0.998      |
| HPC         | 0.404 | 0.391 | 0.060      | 0.047   | 0.382  | 0.160 | 0.211 | 0.958 | 0.957   | 1.000 | 1.000     | 0.957      |
| Thunderbird | 0.860 | 0.739 | 0.544      | 0.846   | 0.756  | 0.682 | 0.583 | 0.922 | 0.906   | 0.910 | 0.953     | 0.917      |
| Linux       | 0.916 | 0.808 | 0.595      | 0.736   | 0.107  | 0.224 | 0.622 | 0.686 | 0.805   | 0.652 | 0.801     | 0.826      |
| HealthApp   | 0.731 | 0.974 | 0.738      | 0.545   | 0.092  | 0.753 | 0.657 | 0.861 | 0.851   | 1.000 | 0.990     | 0.860      |
| Apache      | 1.000 | 0.992 | 0.536      | 1.000   | 0.731  | 0.802 | 1.000 | 1.000 | 1.000   | 0.996 | 0.999     | 1.000      |
| Proxifier   | 0.973 | 0.800 | 0.661      | 0.503   | 0.494  | 0.351 | 0.521 | 0.692 | 0.773   | 0.521 | 0.946     | 0.692      |
| OpenSSH     | 0.439 | 0.392 | 0.345      | 0.341   | 0.441  | 0.328 | 0.444 | 0.735 | 0.711   | 0.732 | 0.999     | 0.952      |
| OpenStack   | 0.746 | 0.341 | 0.698      | 0.745   | 0.839  | 0.200 | 0.765 | 0.734 | 0.812   | 0.491 | 1.000     | 0.734      |
| Mac         | 0.794 | 0.627 | 0.462      | 0.858   | 0.518  | 0.566 | 0.758 | 0.765 | 0.805   | 0.777 | 0.891     | 0.760      |
| Average     | 0.802 | 0.715 | 0.524      | 0.628   | 0.448  | 0.528 | 0.679 | 0.870 | 0.889   | 0.853 | 0.969     | 0.903      |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yuan, J.; Wang, C.; Zhou, H.; Zhang, Y.; Wang, Y. DLogParser: An Efficient Dynamic Log Parser with Multiple Grouping Criteria. Appl. Sci. 2026, 16, 811. https://doi.org/10.3390/app16020811