Article

BERT-LogAnom: Enhancing Log Anomaly Detection with Gated Residual BiLSTM and Dynamic Thresholding

1 School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
2 China Electronics Engineering Design Institute Co., Ltd., Beijing 100840, China
3 NSFOCUS Technologies Group Co., Ltd., Beijing 100089, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2026, 15(4), 806; https://doi.org/10.3390/electronics15040806
Submission received: 7 January 2026 / Revised: 2 February 2026 / Accepted: 4 February 2026 / Published: 13 February 2026
(This article belongs to the Section Computer Science & Engineering)

Abstract

As modern software systems continue to grow in scale and structural complexity, log anomaly detection has become an essential component of system monitoring and fault diagnosis. However, existing approaches often struggle to adequately capture sequential dependencies in log data and to remain robust under distributional changes. To mitigate these issues, this paper presents BERT-LogAnom, an unsupervised framework for log anomaly detection that combines contextual representation learning, sequential modeling, and adaptive decision mechanisms. Specifically, a BERT-based encoder is employed to learn global contextual semantics from log sequences, while a gated residual bidirectional Long Short-Term Memory (GR-BiLSTM) network is introduced to model bidirectional temporal dependencies without disrupting the learned contextual information. To characterize normal system behavior from unlabeled logs, two self-supervised objectives—masked log key prediction and volume hypersphere minimization—are jointly optimized during training. Furthermore, a Dynamic Thresholding Prediction Module (DTPM) is incorporated to adjust anomaly decision boundaries in response to short-term statistical fluctuations and longer-term distribution drift. Experiments conducted on three public benchmark datasets (HDFS, BGL, and Thunderbird) show that BERT-LogAnom achieves consistently superior performance compared with representative baseline methods across precision, recall, and F1-score. Additional ablation studies further confirm the contribution of each major component in the proposed framework.

1. Introduction

Ensuring high availability and reliability is essential for large-scale software-intensive systems, such as distributed clusters, cloud platforms, and high-performance computing environments [1,2]. As system scale and complexity increase, anomalies become inevitable during long-term operation [3,4]. Even seemingly minor anomalies may propagate across components, leading to performance degradation, data integrity issues, cascading failures, or large-scale service disruptions. Consequently, effective log anomaly detection has become a critical capability for maintaining system stability, reliability, and operational efficiency in modern computing infrastructures [5].
System logs constitute one of the most widely available and informative sources of system telemetry, recording detailed runtime states, execution traces, and operational events [6]. Engineers routinely rely on log data to monitor system behavior, identify abnormal conditions, and diagnose root causes of failures. However, the massive volume, high dimensionality, and complex temporal dependencies of log data render manual inspection impractical and error-prone, especially in large-scale environments [7]. These challenges have driven extensive research on automated log-based anomaly detection. Nevertheless, the task remains difficult: anomalies often manifest as subtle deviations in event ordering or execution stages rather than isolated abnormal events, and real-world systems are inherently non-stationary due to workload fluctuations, configuration changes, and software evolution. As a result, effective log anomaly detection methods must not only capture rich contextual semantics and sequential dependencies, but also remain robust under continuously evolving data distributions.
In recent years, deep learning has been widely explored for end-to-end log representation learning. RNN-based [8] approaches, such as DeepLog [9] and its successor LogAnomaly [10], model log sequences using LSTM [11] to predict the next event or combine frequency features for anomaly detection. These models demonstrate strong performance in capturing short-range sequential dependencies. Nevertheless, the recursive nature of RNNs limits their ability to capture long-range dependencies, making them less effective in handling complex call chains or multi-stage state transitions. Transformer-based architectures, by contrast, leverage parallel attention mechanisms to capture global context and have shown superior results in log anomaly detection. Representative works include LogBERT [12], LogRobust [13], LogGPT [14], and LogFormer [15], which achieve state-of-the-art performance on multiple benchmark datasets. Despite these advances, Transformer-based approaches still face two main limitations: (i) the attention mechanism lacks explicit encoding of sequential constraints, reducing their ability to fully exploit staged or progressive dependencies inherent in logs; and (ii) anomaly detection decisions are typically made using static thresholds or top-k strategies, which are highly sensitive to distribution shifts caused by system upgrades or workload variations, leading to false positives or missed detections.
Existing BERT-based [16] methods, such as LogBERT [12], leverage Transformer’s powerful semantic modeling capacity by incorporating self-supervised tasks such as Masked Log Key Prediction (MLKP) and Volume of Hypersphere Minimization (VHM). While effective in capturing local semantic context, they remain limited in modeling explicit long-range sequential dependencies. Furthermore, their reliance on static thresholds makes them vulnerable to performance degradation in dynamic environments, where log distributions often drift over time.
To address these challenges, we propose BERT-LogAnom, an unsupervised log anomaly detection framework that couples contextual representation learning with explicit sequential modeling and adaptive decision making. Specifically, we introduce a Gated Residual BiLSTM (GR-BiLSTM) module to enhance long-range sequential dependency [17] modeling on top of contextual representations, and a Dynamic Threshold Prediction Module (DTPM) to adapt anomaly decision boundaries under distribution shifts. Together, these designs improve robustness and practical deployability in dynamic system environments.
The main contributions are summarized as follows:
(1) We propose GR-BiLSTM to complement contextual representations with explicit sequential dynamics, improving detection of order-/phase-related anomalies.
(2) We propose DTPM to adapt decision thresholds to evolving score distributions, mitigating the sensitivity of static thresholding.
(3) We conduct comprehensive experiments on multiple public log datasets and show that BERT-LogAnom consistently improves precision, recall, and F1-score over strong baselines.
While recent studies have explored combining BERT-based representations with sequential models, most existing approaches primarily adopt generic architectural stacking without explicitly tailoring the integration to the characteristics of log anomaly detection. In contrast, our work is motivated by two log-specific challenges: anomalies that manifest as deviations in execution order or staged behaviors, and the instability of fixed decision thresholds under evolving system conditions. Accordingly, we design a Gated Residual BiLSTM (GR-BiLSTM) module to complement contextual representations with order-sensitive sequential dynamics while preserving semantic information, and further couple it with a self-supervised dynamic thresholding mechanism. This task-driven integration enables the proposed framework to jointly address sequential anomaly patterns and non-stationary decision boundaries, distinguishing it from existing BERT-based log anomaly detection methods.
In summary, this study combines the contextual modeling strength of Transformer architectures with enhanced sequential modeling and dynamic decision-making strategies, providing a more effective and practical solution for log anomaly detection in real-world large-scale systems.

2. Related Work

This section reviews representative studies on log-based anomaly detection, including RNN/LSTM-based sequence modeling, Transformer/BERT-based contextual representation learning, and adaptive thresholding strategies under distribution shifts.

2.1. Log Anomaly Detection Overview

Log data record runtime events and system states and serve as a critical source for automated monitoring and troubleshooting [18,19,20]. Early approaches typically relied on handcrafted rules, keyword matching, or simple statistical tests, which often suffer from limited scalability and poor adaptability to evolving log patterns. Recent research has shifted toward learning-based methods, especially deep models, which enable automatic representation learning and sequential pattern modeling for more robust anomaly detection.

2.2. RNN/LSTM-Based Methods

Recurrent neural networks (RNNs) [8] and their variants, including Long Short-Term Memory (LSTM) networks [11], have been extensively explored in the context of log anomaly detection because of their natural suitability for modeling sequential data. Du et al. (2017) introduced DeepLog [9], which uses an LSTM to learn normal log sequences and detect anomalies by identifying deviations between predicted and observed events. Building on this idea, Meng et al. (2019) proposed LogAnomaly [10], which integrates frequency features into sequence prediction, capturing richer local context and improving sensitivity to abnormal events. Wang et al. (2020) introduced LogEvent2vec [21], which strengthens semantic relationships among log events through embedding-based representations, resulting in more discriminative anomaly detection. Despite these advances, RNN/LSTM-based methods are generally more effective for relatively short sequences. Their recurrent architectures are prone to vanishing or exploding gradients, which constrains their ability to model long-range dependencies. As a result, such methods often struggle to detect complex anomalies that involve long interaction chains or multi-stage system state transitions.
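The prediction-based decision rule used by DeepLog-style detectors can be illustrated with a minimal sketch. It assumes that a sequence model has already produced a probability for each candidate next event; the function name `is_anomalous_event`, the event keys, and the probabilities are hypothetical and serve only to show the top-k check.

```python
from typing import Dict


def is_anomalous_event(predicted_probs: Dict[str, float],
                       observed: str,
                       top_k: int = 2) -> bool:
    """Flag the observed event if it is not among the top_k predicted events."""
    # Rank candidate events from most to least likely.
    ranked = sorted(predicted_probs, key=predicted_probs.get, reverse=True)
    return observed not in ranked[:top_k]


# Hypothetical next-event distribution produced by a trained sequence model.
probs = {"E1": 0.70, "E2": 0.20, "E3": 0.07, "E4": 0.03}
print(is_anomalous_event(probs, "E2"))  # observed event is within the top 2
print(is_anomalous_event(probs, "E4"))  # observed event is outside the top 2
```

The value of `top_k` is fixed in advance in this sketch, which is one reason such rules can be sensitive to changes in the event distribution.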

2.3. Transformer-Based Methods

The Transformer architecture [22], driven by its self-attention mechanism, has gradually been adopted in log anomaly detection as a means to address the difficulty of RNN-based models in capturing long-term dependencies. Guo et al. (2021) proposed LogBERT [12], which represents the first attempt to apply Transformers to log anomaly detection by introducing Masked Log Key Prediction (MLKP) and Volume of Hypersphere Minimization (VHM) as self-supervised objectives for learning global contextual semantics. Zhang et al. (2019) presented LogRobust [13], which enhances the attention mechanism and integrates more robust self-supervised tasks to improve performance under noisy log data. Qi et al. (2023) developed LogGPT [14], where generative pre-training is leveraged to enhance model generalization and enable the detection of previously unseen anomalies. More recently, Guo et al. (2024) proposed LogFormer [15], which incorporates event types and attributes to capture fine-grained anomaly features more effectively. Despite these advances, Transformer-based methods still lack explicit modeling of sequential constraints and state transitions, limiting their ability to detect anomalies characterized by strict order or phase-dependent behaviors.

2.4. BERT-Based Methods

Following the success of BERT [16,23] in natural language processing, BERT-based methods have gained attention in log anomaly detection [7]. By leveraging bidirectional self-attention [22], BERT captures contextual semantics effectively and overcomes the limitations of RNN/LSTM in long-range dependency modeling. LogBERT applies masked language modeling (MLM) on log sequences, learning normal patterns through self-supervision and detecting anomalies by identifying deviations in prediction. LAnoBERT [24] builds upon LogBERT by eliminating the need for log parsers. It applies lightweight preprocessing via regular expressions and relies on MLM loss and prediction probabilities during inference, reducing information loss caused by traditional parsing. LogFiT [25] introduces a fine-tuning strategy for pre-trained BERT, adapting it specifically to log data distributions and enhancing anomaly detection performance. LogLLM [26] combines BERT with large-scale generative models (e.g., LLaMA), aligning embeddings between the two to enhance semantic understanding and cross-domain generalization. These approaches significantly improve detection accuracy and robustness, especially in capturing global context. Nevertheless, existing BERT-based methods still struggle to model explicit sequential anomalies, motivating the need to combine BERT with traditional sequential modeling modules such as BiLSTM.
Despite their effectiveness in modeling global contextual semantics, existing BERT-based methods remain limited in capturing anomalies arising from deviations in sequential execution patterns, such as event-order violations or staged behaviors. Similar limitations have been observed in other domains, where semantic representations alone are insufficient for reliable anomaly detection. For instance, IMMENSE [27] demonstrates that integrating semantic features with relational and contextual information improves the detection of malicious users in social networks, while Lograph [28] shows that modeling semantic associations and structural dependencies via log-entity graphs enhances anomaly detection in complex log data. These studies indicate that semantic modeling benefits from complementary contextual or structural representations, motivating the integration of contextual representation learning with explicit sequential modeling in log anomaly detection.

2.5. Dynamic Thresholding Methods

Most log anomaly detection methods rely on static thresholds, which are sensitive to data distribution shifts caused by workload variations or software updates. To improve adaptability, early studies explored statistical and smoothing-based threshold adjustment. For instance, Ahmad and Purdy proposed a real-time anomaly detection framework for streaming data using statistical modeling and moving averages [29]. Exponentially Weighted Moving Average (EWMA) techniques have also been widely adopted to capture short-term fluctuations efficiently [30]. However, these approaches primarily address local variations and have limited capability in handling long-term distribution drift.
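As an illustration of smoothing-based threshold adjustment, the following minimal sketch tracks an exponentially weighted mean and variance of anomaly scores and flags a score that exceeds the mean by `k` standard deviations. It is a generic EWMA rule, not the DTPM proposed in this paper; the function name and the parameters `alpha`, `k`, and `warmup` are assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def ewma_threshold_flags(scores: Sequence[float],
                         alpha: float = 0.1,
                         k: float = 3.0,
                         warmup: int = 5) -> List[Tuple[bool, float]]:
    """Return (is_anomaly, threshold) for each score using EWMA statistics."""
    mean = None
    var = 0.0
    results = []
    for i, s in enumerate(scores):
        if mean is None:
            # The first score only initializes the running mean.
            mean = s
            results.append((False, math.inf))
            continue
        threshold = mean + k * math.sqrt(var)
        # No alarms are raised during the warm-up period.
        flag = i >= warmup and s > threshold
        results.append((flag, threshold))
        if not flag:
            # Update the statistics only with scores regarded as normal,
            # so that a detected anomaly does not inflate the threshold.
            diff = s - mean
            mean += alpha * diff
            var = (1 - alpha) * (var + alpha * diff * diff)
    return results


scores = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0]
flags = [flag for flag, _ in ewma_threshold_flags(scores)]
print(flags)
```

Because the statistics decay with rate `alpha`, the threshold follows short-term fluctuations, but a slow and sustained drift is absorbed into the running mean, which reflects the limitation noted above.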
Another representative line of work is based on extreme value theory (EVT), such as the Peaks-Over-Threshold framework and its streaming variant SPOT, which adapts thresholds by modeling the tail distribution of anomaly scores [31]. While EVT-based methods provide principled threshold selection, they often rely on distributional assumptions and can be sensitive to parameter settings when applied to complex and evolving log data. Overall, existing dynamic thresholding methods highlight the importance of adaptive decision boundaries but remain constrained in robustness and flexibility under sustained distribution shifts.

2.6. Summary

In summary, despite strong progress in contextual representation learning for log anomaly detection, challenges remain in (i) explicitly capturing sequential/order-constrained anomalies and (ii) making robust decisions under distribution shifts, motivating our framework design.

3. Preliminaries

3.1. System Logs and Sequence Construction

System logs are indispensable artifacts that record the runtime behavior and internal states of large-scale software systems. A system log consists of a chronological list of log messages, each generated by the logging framework to capture critical runtime events. As shown in Figure 1, a raw log message typically contains two parts: the header and the content. The header includes structural metadata such as timestamp, verbosity level (e.g., INFO/WARN/ERROR), and the originating component [32]. The content can be further divided into a constant part, which reflects the log template (keywords), and a variable part, which carries dynamic runtime information such as identifiers, parameters, or memory addresses. In this work, we focus primarily on the content of log messages, as it provides the semantic patterns essential for anomaly detection.
For effective anomaly detection, raw log messages are usually grouped into log sequences, which reflect execution flows or operational snapshots of the system. Two common sequence construction methods are widely used [33]. The first is session window partitioning, which groups log messages according to session identifiers (e.g., request ID, block ID). This method yields coherent log sequences for each session, as illustrated in Figure 2, where HDFS [34] logs are grouped by their block_id. The second method is fixed/sliding window partitioning, which groups log messages into subsequences of fixed size, either by the number of log entries or a specified time span. As shown in Figure 3, BGL [35] logs are partitioned into windows of size 2 with a step size of 2, capturing snapshots of the system’s runtime behavior.
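The two partitioning strategies can be sketched as follows. The helper names `session_windows` and `sliding_windows` and the toy event keys are hypothetical; the sliding-window call uses a size of 2 and a step of 2, matching the BGL illustration described above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def session_windows(entries: Sequence[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (session_id, event) pairs into one ordered sequence per session."""
    sessions = defaultdict(list)
    for session_id, event in entries:
        sessions[session_id].append(event)
    return dict(sessions)


def sliding_windows(events: Sequence[str], size: int, step: int) -> List[List[str]]:
    """Cut an event stream into windows of `size` entries, advancing by `step`."""
    return [list(events[i:i + size])
            for i in range(0, len(events) - size + 1, step)]


# Session windows: interleaved entries grouped by a block identifier.
entries = [("blk_1", "E1"), ("blk_2", "E1"), ("blk_1", "E2"),
           ("blk_2", "E3"), ("blk_1", "E3")]
print(session_windows(entries))

# Fixed windows: size 2 with step 2 gives non-overlapping snapshots.
print(sliding_windows(["E1", "E2", "E3", "E4", "E5", "E6"], size=2, step=2))
```

Choosing a step smaller than the window size yields overlapping windows, while a step equal to the size yields fixed, non-overlapping windows.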
The objective of log-based anomaly detection is to identify anomalous log sequences that deviate from normal execution patterns, thereby enabling timely recognition of potential issues in system operation. These basic definitions and sequence construction strategies form the foundation of subsequent preprocessing and modeling steps.

3.2. Fundamentals of Log Parsing

System logs are a fundamental data source for the operation and maintenance of modern information systems and distributed platforms. Raw log records typically contain multiple fields, such as timestamps, log levels, service components, and unstructured textual messages. Due to the diversity of log formats and the verbose nature of log messages, directly modeling raw logs often introduces noise and unnecessary complexity. Therefore, log parsing is commonly adopted as a preprocessing step in automated log anomaly detection.
Log parsing aims to transform semi-structured logs into structured representations by extracting event templates and corresponding parameters, thereby reducing feature dimensionality and facilitating downstream modeling. In this work, we adopt Drain [36], a widely used and efficient log parsing method, to generate event templates. Drain clusters semantically similar log messages and assigns each template a unique identifier without relying on hand-crafted rules or domain-specific knowledge, making it suitable for diverse log sources.
After parsing, each log message is mapped to an event template identifier, so that a log sequence becomes an ordered sequence of template identifiers. This effectively converts raw textual logs into structured symbolic sequences and provides compact, noise-reduced inputs for subsequent sequence modeling, BERT-based representation learning, and anomaly detection.

3.3. Fundamentals of Sequential Modeling

Log events are inherently sequential and temporal. Under normal system operation, log event sequences typically follow consistent business logic or state machine patterns, while anomalies or failures tend to cause sudden deviations from these patterns. Therefore, accurately modeling the temporal dependencies between log events is critical for effective anomaly detection.
Early sequence modeling approaches, such as n-gram models, relied on fixed-size windows to estimate the probability of an event conditioned on a small number of preceding events. However, their expressive power is limited and they cannot capture long-range dependencies. Recurrent neural networks (RNNs) and their variants, including long short-term memory (LSTM) and gated recurrent units (GRU), have been widely used for sequence modeling tasks because their recurrent states can store and propagate historical information. In particular, LSTM mitigates the vanishing gradient problem through its gating mechanism, thus enabling the capture of longer-term dependencies. LSTM-based models such as DeepLog and LogAnomaly have achieved good results in log anomaly detection, but their global modeling capabilities on very long sequences are limited by the computational and information bottlenecks inherent in the recurrent structure. In addition, practical challenges such as anomaly sparsity and complex contextual correlations in log sequences remain open problems in sequence modeling and motivate the design of more advanced models.
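To make the fixed-window limitation concrete, the following minimal sketch fits a bigram model (an n-gram model with n = 2) on normal sequences and estimates the probability of the next event given only the single preceding event. The function names and the toy event keys are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Sequence


def fit_bigram(sequences: Iterable[Sequence[str]]) -> DefaultDict[str, Counter]:
    """Count how often each event follows each preceding event."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def next_event_prob(counts: DefaultDict[str, Counter], prev: str, nxt: str) -> float:
    """Estimate P(next event | previous event) from the transition counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0


normal = [["open", "write", "close"],
          ["open", "write", "write", "close"]]
model = fit_bigram(normal)
print(next_event_prob(model, "open", "write"))   # transition seen in every sequence
print(next_event_prob(model, "write", "close"))  # two of three transitions from "write"
print(next_event_prob(model, "open", "close"))   # transition never observed
```

Because the estimate depends only on the immediately preceding event, the model cannot tell whether a `close` was preceded by a matching `open` many steps earlier, which is the kind of long-range dependency that recurrent and attention-based models aim to capture.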

3.4. Fundamentals of BERT

BERT (Bidirectional Encoder Representations from Transformers) represents a major breakthrough in the field of natural language processing. Built on the Transformer architecture, BERT stacks multiple self-attention layers, which supports efficient parallel computation while capturing global contextual relationships in the input sequence. Compared with traditional models based on recurrent neural networks (RNNs), BERT has significant advantages in modeling long-distance dependencies.
The core innovation of BERT lies in its pre-training objectives and bidirectional modeling capabilities. Through tasks such as Masked Language Modeling (MLM) [37] and Next Sentence Prediction (NSP), BERT can learn rich semantic representations and contextual dependencies. In log sequence modeling, BERT maps log event sequences to high-dimensional vector representations. Its multi-head self-attention mechanism dynamically adjusts the weights between events, thereby capturing abnormal events together with their context.
In log anomaly detection, BERT can serve as the fundamental encoder for sequence modeling, extracting contextual features from log template sequences. Its flexible structure adapts well to the complex semantics of logs and provides high-quality feature representations for subsequent modules (such as sequence enhancement and anomaly-sensitive gating mechanisms). Recent work such as LogBERT and LogFormer has successfully applied this approach to log anomaly detection, further supporting its global modeling capability and generalization performance in this field.

4. Methodology

4.1. Problem Definition

Log anomaly detection aims to automatically identify log sequences that may reflect system failures, security threats, or abnormal behaviors from the massive streams of continuously generated system logs. Let the collection of logs produced up to time $T$ be denoted as $L = \{l_1, l_2, \ldots, l_T\}$, where each log entry $l_i$ typically contains a timestamp, log level, service module, and event content. After structured parsing, the logs are mapped into a template sequence $S = \{s_1, s_2, \ldots, s_T\}$, where $s_i$ represents the template identifier of the corresponding log entry.
In this work, we focus on a sliding-window-based setting for log anomaly detection. Given a template window $X = \{s_{t-n+1}, s_{t-n+2}, \ldots, s_t\}$ of length $n$, the objective is to determine whether the current window contains anomalous events. The detection model is expected to learn the patterns of normal log sequences and output either an anomaly score or a binary label $y \in \{0, 1\}$ for each window $X$, where $y = 1$ indicates anomaly and $y = 0$ indicates normal.
This task involves several inherent challenges. First, log events often follow complex patterns and contextual structures, whereas anomalous behaviors are typically diverse and occur infrequently, which limits the effectiveness of static, rule-based detection methods. Second, log data are prone to concept drift, as system upgrades, configuration adjustments, or changes in workload can gradually modify what should be considered normal behavior. Under such conditions, anomaly detection models are required not only to capture rich sequential dependencies in log data, but also to demonstrate sufficient generalization ability and adaptability to evolving system environments.

4.2. Log Data Preprocessing and Representation

In practical operational and security analysis settings, raw logs are typically produced in a semi-structured textual form, with substantial variation in structure, field definitions, and event descriptions across different systems. Modeling features directly from such heterogeneous data tends to introduce considerable noise and results in high-dimensional representations, which in turn increase the risk of overfitting and weaken model robustness. Consequently, log preprocessing and the construction of structured representations are necessary steps to ensure the reliability and generalization capability of subsequent modeling stages.
Log parsing. In this study, we employ the Drain parser to transform raw logs into structured event templates. Drain is an efficient and automated log template extraction algorithm that incrementally constructs a hierarchical tree and clusters messages into templates without requiring manual rules. Each raw log entry is mapped to a unique template identifier (ID), thereby converting complex and unstructured text into a low-dimensional, structured sequence of events. Formally, the parsed log stream can be expressed as
$$S = \{s_1, s_2, \ldots, s_T\}, \quad s_i \in V$$
where $V$ denotes the set of all templates. Prior to parsing, canonicalization is applied to replace volatile fields such as IP addresses, process identifiers, or numeric literals with placeholders, which reduces template fragmentation while retaining semantic information.
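The canonicalization step and the mapping from messages to template identifiers can be sketched as follows. This is a regex-based simplification for illustration only and is not the Drain algorithm; the placeholder names, the masking rules, and the `TemplateVocab` class are assumptions.

```python
import re

# Hypothetical masking rules, applied in order; real deployments tune these per system.
_RULES = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4 with optional port
    (re.compile(r"\bblk_-?\d+\b"), "<BLK>"),                        # HDFS-style block IDs
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),                   # memory addresses
    (re.compile(r"\b\d+\b"), "<NUM>"),                              # remaining numeric literals
]


def canonicalize(message: str) -> str:
    """Replace volatile fields with placeholders so that messages share a template."""
    for pattern, placeholder in _RULES:
        message = pattern.sub(placeholder, message)
    return message


class TemplateVocab:
    """Assign a stable integer ID to each distinct canonicalized message."""

    def __init__(self) -> None:
        self.ids = {}

    def encode(self, message: str) -> int:
        template = canonicalize(message)
        # A new template receives the next unused ID; known templates keep theirs.
        return self.ids.setdefault(template, len(self.ids))


vocab = TemplateVocab()
messages = [
    "Received block blk_-1608999687919862906 of size 91178 from 10.250.14.224",
    "Received block blk_7503483334202473044 of size 233217 from 10.251.215.16",
    "Deleting block blk_123 file /mnt/hadoop/data",
]
template_ids = [vocab.encode(m) for m in messages]
print(canonicalize(messages[0]))
print(template_ids)
```

The first two messages differ only in volatile fields and therefore receive the same identifier, which illustrates how masking reduces template fragmentation.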
Windowing. To adapt log sequences for deep learning models, the template stream is segmented into fixed-length subsequences using a sliding window. At time step $t$, a window of length $n$ is defined as
$$X_t = \{s_{t-n+1}, s_{t-n+2}, \ldots, s_t\}$$
If the sequence is shorter than $n$, it is padded with a special token [PAD]; if it exceeds the limit, it is truncated to maintain consistency across inputs. Windows are constructed within the same service session to prevent unrelated log sources from being mixed.
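A minimal sketch of the padding and truncation rule is given below. The text does not state on which side padding is applied or which end is truncated, so the choices here (pad on the right, keep the most recent $n$ events) are assumptions, as is the helper name `fit_window`.

```python
from typing import List, Sequence

PAD = "[PAD]"


def fit_window(events: Sequence[str], n: int, pad_token: str = PAD) -> List[str]:
    """Return exactly n tokens: truncate long sequences, pad short ones."""
    if len(events) >= n:
        # Assumption: keep the n most recent events, consistent with a window ending at s_t.
        return list(events[-n:])
    # Assumption: padding tokens are appended on the right.
    return list(events) + [pad_token] * (n - len(events))


print(fit_window(["E1", "E2"], n=4))
print(fit_window(["E1", "E2", "E3", "E4", "E5"], n=3))
```

In practice, padded positions are usually excluded from attention and from the loss through a mask, so that the padding token does not influence the learned representation.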
Representation. Each template ID is embedded into a dense vector
$$e(s_i) \in \mathbb{R}^d$$
capturing semantic relationships among templates. To preserve the temporal characteristics of logs, a positional encoding $p_k$ is added to distinguish order dependencies. Optionally, when timestamps are available, we incorporate a simple time-gap encoding $g(\Delta t_k)$ to reflect inter-event intervals, which can be informative for anomaly detection. The final input representation at position $k$ is given by
$$z_k = e(s_{t-n+k}) + p_k + g(\Delta t_k)$$
where $\Delta t_k = 0$ if timing information is unavailable or if the event is the first in the window.
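The composition of the input representation can be sketched in plain Python. The text does not specify the form of the positional encoding or of the time-gap encoding, so this example assumes a sinusoidal positional encoding and a hypothetical gap encoding $g(\Delta t) = \log(1 + \Delta t)\, w$ with a weight vector $w$; the embedding table and all names are illustrative.

```python
import math
from typing import Dict, List, Optional, Sequence


def sinusoidal_position(k: int, d: int) -> List[float]:
    """Sinusoidal encoding of position k (assumed form, not stated in the text)."""
    return [math.sin(k / 10000 ** (j / d)) if j % 2 == 0
            else math.cos(k / 10000 ** ((j - 1) / d))
            for j in range(d)]


def build_inputs(template_ids: Sequence[int],
                 timestamps: Optional[Sequence[float]],
                 embedding: Dict[int, List[float]],
                 gap_weight: List[float]) -> List[List[float]]:
    """Compute z_k = e(s_k) + p_k + g(dt_k) for every position in a window."""
    d = len(gap_weight)
    inputs = []
    for k, sid in enumerate(template_ids):
        # dt_k = 0 for the first event or when no timing information exists.
        gap = 0.0 if k == 0 or timestamps is None else timestamps[k] - timestamps[k - 1]
        scale = math.log1p(max(gap, 0.0))  # hypothetical g: log-scaled gap times a weight vector
        pos = sinusoidal_position(k + 1, d)  # positions are numbered 1..n
        inputs.append([embedding[sid][j] + pos[j] + scale * gap_weight[j]
                       for j in range(d)])
    return inputs


# Toy embedding table with d = 4 (values are arbitrary).
embedding = {0: [0.0] * 4, 1: [1.0] * 4}
gap_weight = [0.5] * 4
z = build_inputs([0, 1], [10.0, 10.0 + math.e - 1], embedding, gap_weight)
print(z[0])
print(z[1])
```

With this choice of $g$, a zero gap contributes a zero vector, so the representation reduces to the sum of the template embedding and the positional encoding when timestamps are unavailable.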
The resulting preprocessed and embedded event sequences are then fed into the contextual encoding module (e.g., BERT), providing clean and structured input that facilitates the capture of both global and local dependencies in log sequences.

4.3. Model Architecture

Figure 4 illustrates the overall architecture of the proposed method. The framework consists of several key modules, which transform raw system logs into structured representations through pipeline processing, extract context and sequence features, and finally achieve robust anomaly detection.
First, raw text logs are processed through a Drain log parser, which automatically extracts and clusters log event templates to form a standardized set of template identifiers. This step effectively converts heterogeneous unstructured logs into structured event template sequences, thereby reducing noise and improving subsequent modeling efficiency.
Then, the template ID sequence is mapped to dense vectors by an embedding layer and augmented with positional encodings to preserve event order. These representations form the input tokens for the BERT-based contextual representation module, a Transformer encoder. Through stacked Transformer blocks with multi-head self-attention, the BERT encoder captures both the global semantic information of log events and long-distance contextual correlations across events.
To further enhance the model's ability to capture temporal dependencies, we stack a gated residual bidirectional LSTM (GR-BiLSTM) sequence enhancement module on the output of the Transformer encoder. This module explicitly models forward and backward dependencies in log sequences. The residual connection ensures that the global context representation provided by BERT is preserved, while GR-BiLSTM enhances the sensitivity of the model to sequential patterns. The anomaly-sensitive gating mechanism dynamically adjusts the model's attention to key log events and improves its ability to emphasize abnormal patterns in the sequence.
An optional attention pooling layer then aggregates sequence features into a compact representation, so that the model can focus on the log events that are most indicative of anomalies.
For the anomaly decision step, we introduce the two-layer self-supervised Dynamic Threshold Prediction Module (DTPM). The fast layer captures short-term fluctuations using an exponentially weighted moving average (EWMA), and the slow layer uses a lightweight gated neural network to handle long-term calibration. Working together, the two layers dynamically adjust the anomaly detection threshold to counter data distribution shifts caused by system updates or workload changes.
In summary, the proposed architecture integrates log parsing, embedding and positional encoding, BERT-based context modeling, GR-BiLSTM sequence enhancement, attention pooling, and a two-layer self-supervised dynamic threshold mechanism. This unified design combines deep semantic understanding, explicit sequence awareness, and adaptive anomaly decisions for complex log stream analysis.

4.4. Contextual Feature Encoding

In log anomaly detection, contextual dependencies across events are critical. After Drain parsing, each template ID is mapped to an embedding and combined with positional encodings to form the input tokens. These tokens are fed into a BERT-based encoder, where multi-head self-attention models global interactions among events and produces contextual representations for subsequent modules. The encoded sequence is then passed to GR-BiLSTM to further incorporate explicit sequential constraints.
Specifically, the model first processes the log template sequences, which have been parsed by the Drain log parser, into embedded representations. Each log event template is transformed into a dense vector representation via an embedding layer, denoted as $e_i \in \mathbb{R}^d$, where $d$ represents the embedding space dimension. To preserve the sequential order of log events, we add positional encodings $p_i \in \mathbb{R}^d$ to each event template. These positional encodings allow the model to capture the relative positions of events within the sequence. The final representation fed into the BERT model is:
$$x_i = e_i + p_i$$
where $i = 1, 2, \ldots, n$ represents the index of each event in the sequence, and $n$ is the sequence length.
The input data is then passed to the BERT encoder, where the multi-head self-attention mechanism and feed-forward network (FFN), stacked in multiple layers of Transformer blocks, enable the model to perform global contextual modeling of the log sequence. The self-attention mechanism allows the model to compute the relationships between positions within the sequence, ensuring that each log event’s representation depends not only on local information but also captures long-range dependencies. The core calculation of the attention mechanism is given by:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where $Q$ is the query vector, $K$ is the key vector, $V$ is the value vector, and $d_k$ is the dimension of the key vector. Multiple attention heads are computed in parallel to produce diverse contextual representations, which are then concatenated and linearly transformed to obtain the final global context representation for each event, denoted as $h_i \in \mathbb{R}^d$, where $h_i$ represents the feature vector of the $i$-th log event in the context.
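The attention formula above can be sketched for a single head in plain Python. This is only a didactic version under simplifying assumptions (no learned projections, no multi-head concatenation, no masking); the function names are chosen for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Two identical keys receive equal weight, so the output is the mean of the values.
result = attention(Q=[[1.0, 1.0]], K=[[1.0, 0.0], [1.0, 0.0]], V=[[0.0, 2.0], [4.0, 6.0]])
```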
Through multiple layers of self-attention, BERT can thoroughly explore the complex contextual relationships within the log sequence. In each Transformer layer, besides the self-attention module, a feed-forward network (FFN) is used to apply nonlinear transformations to the features. The output features of each layer are processed with residual connections and layer normalization, ensuring effective information flow and mitigating issues such as gradient vanishing.
In this manner, the BERT encoder not only preserves the local semantic information of the log events but also captures long-range temporal and semantic dependencies across the entire sequence, greatly enhancing the expressive power of log sequence modeling. The feature sequence $H = \{h_1, h_2, \ldots, h_n\}$, after BERT encoding, serves as input to the subsequent sequence enhancement module, providing rich contextual information and high-dimensional features for further anomaly detection tasks.

4.5. Sequential Enhancement Module

Although the BERT model excels at capturing the global contextual semantics of log sequences, the inherent characteristics of the Transformer architecture limit its ability to explicitly model sequential dependencies and temporal structures. Log data often exhibit strong sequential dependencies, especially in real-world anomaly scenarios where abnormal events are frequently associated with specific patterns of event sequences. To address this limitation and enhance the model’s ability to capture explicit sequential relationships, we propose a novel Gated Residual Bidirectional LSTM (GR-BiLSTM) sequential enhancement module. This module complements the contextual representations learned by the BERT encoder and improves the accuracy of anomaly detection.
The core design of the GR-BiLSTM module integrates bidirectional LSTM networks, residual connections, and a gating mechanism. It effectively captures the sequential dependencies among log events while dynamically adjusting the importance of different events. Concretely, the contextual feature sequence generated by the BERT encoder is denoted as $H = \{h_1, h_2, \ldots, h_n\}$, where each vector $h_i \in \mathbb{R}^d$ encodes rich semantic information of an event but lacks explicit ordering constraints. The GR-BiLSTM first applies bidirectional LSTMs to model sequential dependencies from both forward and backward directions, producing forward and backward hidden states:
$$\overrightarrow{h}_i = \mathrm{LSTM}_f(z_i, \overrightarrow{h}_{i-1}), \quad \overleftarrow{h}_i = \mathrm{LSTM}_b(z_i, \overleftarrow{h}_{i+1}),$$
where $\mathrm{LSTM}_f(\cdot)$ and $\mathrm{LSTM}_b(\cdot)$ denote the forward and backward LSTM units, respectively, $z_i$ is the input representation at position $i$, and $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ represent the forward and backward hidden states.
The final BiLSTM output is the concatenation of these two representations:
$$\hat{h}_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$$
To preserve the global contextual semantics extracted by BERT while avoiding information loss during sequential modeling, a residual connection is employed. Specifically, the BiLSTM output is combined with the original BERT output to form the enhanced sequential feature representation:
$$z_i = h_i + \hat{h}_i$$
This residual mechanism ensures that the global semantics captured by BERT are retained, while the explicit sequential dependencies learned by BiLSTM are incorporated, thereby enhancing the robustness of sequence modeling.
To further improve the model’s sensitivity to abnormal events, we introduce a gating mechanism within the GR-BiLSTM module. This mechanism dynamically adjusts the importance of each log event in the feature space, assigning greater weights to anomaly-sensitive events. Formally, a gating weight g i is computed as:
$$g_i = \sigma(W_g z_i + b_g)$$
where $W_g$ and $b_g$ denote the weight matrix and bias vector of the gating layer, respectively, and $\sigma$ represents the sigmoid activation function. The final gated feature representation is obtained as:
$$\tilde{z}_i = g_i \cdot z_i$$
Through this gating mechanism, the model automatically emphasizes events that are more likely related to anomalies, thereby improving its detection sensitivity and effectiveness.
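The residual fusion and gating steps can be sketched for a single position as follows. The sketch takes the BiLSTM output as given rather than computing it, and it assumes that the BiLSTM output has the same dimension as the BERT feature and that the gate is a vector applied element-wise; neither detail is stated in the text. The function name `gated_residual` is hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_residual(h, h_hat, W_g, b_g):
    """Combine a BERT feature h with a BiLSTM feature h_hat for one position.

    z = h + h_hat                 (residual connection)
    g = sigmoid(W_g z + b_g)      (gate, assumed here to be a vector)
    returns g * z element-wise    (gated feature)
    """
    z = [a + b for a, b in zip(h, h_hat)]
    g = [sigmoid(sum(w * zj for w, zj in zip(row, z)) + b)
         for row, b in zip(W_g, b_g)]
    return [gi * zi for gi, zi in zip(g, z)]

# With zero gate parameters every gate value is 0.5, so the output is half of h + h_hat.
zeros = [[0.0, 0.0], [0.0, 0.0]]
gated = gated_residual(h=[1.0, 2.0], h_hat=[3.0, 4.0], W_g=zeros, b_g=[0.0, 0.0])
```

In a trained model the gate parameters would be learned, so positions whose fused features activate the gate strongly would pass through with less attenuation.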
In summary, the GR-BiLSTM sequential enhancement module integrates bidirectional LSTMs, residual connections, and a gating mechanism to effectively overcome the limitations of Transformer-based models in explicit sequence modeling. By strengthening the representation of sequential dependencies and adaptively highlighting critical anomalous events, this module provides richer and more discriminative features for downstream loss function optimization and anomaly decision-making.

4.6. Loss Function Design

In this study, we introduce two key self-supervised loss functions—Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM)—to optimize the model’s learning capability on unlabeled log data. By combining these two objectives, the model is able to capture contextual dependencies and structural patterns of log sequences during training, thereby improving anomaly detection performance and robustness.
The MLKP loss is designed to enable the model to learn contextual relationships within log sequences through a self-supervised prediction task inspired by the Masked Language Model (MLM) used in BERT. Specifically, within a log template sequence $S = \{s_1, s_2, \ldots, s_T\}$, a subset of template positions is randomly selected for masking. These selected tokens $s_i$ are replaced by a special symbol [MASK], and the model is required to predict the original token based on the surrounding context $S_{\mathrm{mask}}$. The optimization objective minimizes the discrepancy between the predicted and the true templates using cross-entropy loss:
$$\mathcal{L}_{\mathrm{MLKP}} = -\sum_{i \in M} \log P(s_i \mid S_{\mathrm{mask}})$$
where $M$ denotes the set of masked positions, and $P(s_i \mid S_{\mathrm{mask}})$ is the predicted probability of recovering the original template $s_i$. By training on this objective, the model learns to capture global contextual dependencies across the log sequence, strengthening its capability to detect anomalies that disrupt normal event patterns.
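The masking step and the MLKP loss can be sketched as below. The masking ratio, the `MASK_ID` value, and the uniform distribution used in place of real model predictions are all assumptions for illustration; the text does not specify them.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] symbol

def mask_sequence(template_ids, ratio, rng):
    """Replace a random subset of positions with MASK_ID; return (masked, positions)."""
    k = max(1, int(ratio * len(template_ids)))
    positions = sorted(rng.sample(range(len(template_ids)), k))
    masked = list(template_ids)
    for i in positions:
        masked[i] = MASK_ID
    return masked, positions

def mlkp_loss(probs, template_ids, positions):
    """L_MLKP = -sum over masked positions i of log P(s_i | S_mask).

    probs[i][k] is the predicted probability of template k at position i.
    """
    return -sum(math.log(probs[i][template_ids[i]]) for i in positions)

rng = random.Random(0)
sequence = [2, 0, 1, 2, 1, 0]
masked, positions = mask_sequence(sequence, ratio=0.5, rng=rng)
# Stand-in for model output: a uniform distribution over 3 templates at every position.
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in sequence]
loss = mlkp_loss(uniform, sequence, positions)
```

With uniform predictions each masked position contributes $\log 3$ to the loss; a model that has learned normal patterns should assign higher probability to the true template and therefore yield a smaller value.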
The VHM loss is designed to optimize the distribution of normal log representations in feature space, ensuring they remain compact and well-clustered. The goal is to minimize the feature volume occupied by normal logs, thereby enhancing the separation between normal and anomalous patterns. Suppose the model generates feature representations of normal logs as $H = \{h_1, h_2, \ldots, h_N\}$, where each $h_i \in \mathbb{R}^d$ is the feature vector of the $i$-th log. Let the centroid of these features be
$$c = \frac{1}{N} \sum_{i=1}^{N} h_i$$
The VHM loss is then defined as:
$$\mathcal{L}_{\mathrm{VHM}} = \frac{1}{N} \sum_{i=1}^{N} \| h_i - c \|^2$$
where $\| h_i - c \|^2$ denotes the squared Euclidean distance between a log feature $h_i$ and the cluster centroid $c$. By minimizing this loss, the model enforces normal log features to be concentrated in a small hyperspherical region, increasing the discriminability of anomalies that lie outside this region.
During training, MLKP and VHM are jointly optimized to balance contextual learning and distribution compactness. The overall training objective is expressed as a weighted sum of the two loss functions:
$$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{MLKP}} \mathcal{L}_{\mathrm{MLKP}} + \lambda_{\mathrm{VHM}} \mathcal{L}_{\mathrm{VHM}}$$
where $\lambda_{\mathrm{MLKP}}$ and $\lambda_{\mathrm{VHM}}$ are hyperparameters that control the relative contributions of the two objectives. Through this joint optimization, the model simultaneously learns contextual semantics and structural distribution patterns of log sequences, enabling effective discrimination between normal and anomalous logs without requiring labeled anomaly data.
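A small numeric sketch of the centroid, the VHM loss, and the weighted total is given below. The default weights 1.0 and 0.2 are the values reported in the experimental setup; the toy feature vectors and the MLKP value passed in are placeholders, and the function names are chosen for this example.

```python
def centroid(features):
    """c = (1/N) * sum of h_i, computed per dimension."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def vhm_loss(features):
    """L_VHM = (1/N) * sum of squared Euclidean distances ||h_i - c||^2."""
    c = centroid(features)
    return sum(sum((a - b) ** 2 for a, b in zip(h, c)) for h in features) / len(features)

def total_loss(l_mlkp, l_vhm, lambda_mlkp=1.0, lambda_vhm=0.2):
    """L_total = lambda_MLKP * L_MLKP + lambda_VHM * L_VHM."""
    return lambda_mlkp * l_mlkp + lambda_vhm * l_vhm

toy_features = [[0.0, 0.0], [2.0, 0.0]]   # two placeholder feature vectors
l_vhm = vhm_loss(toy_features)            # centroid is [1, 0], each distance^2 is 1
l_total = total_loss(l_mlkp=3.0, l_vhm=l_vhm)
```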
In summary, the MLKP task enhances the model’s understanding of log sequence context, while the VHM task improves feature compactness for normal logs in latent space. Together, they provide a strong self-supervised learning framework that supports robust anomaly detection under real-world conditions.

4.7. Dynamic Thresholding Prediction Module

A central issue in log anomaly detection lies in deciding whether a given log sequence should be regarded as normal or anomalous. Many existing methods make this decision by applying fixed thresholds to anomaly scores or reconstruction errors. In practice, however, such static thresholds are highly sensitive to shifts in log data distributions, which commonly arise from system upgrades, workload variations, or changes in the runtime environment. Under these conditions, fixed decision boundaries often lead to performance degradation, manifested as increased false positives or false negatives. To address this problem, this work introduces a Dynamic Thresholding Prediction Module (DTPM), which adjusts anomaly decision boundaries in an adaptive manner according to the statistical characteristics of the incoming log data.
The core idea of DTPM is to combine short-term statistical adaptation with long-term drift calibration, thereby providing a robust and self-supervised anomaly decision mechanism. Specifically, the anomaly score of a log sequence $X_t$ is denoted as $s_t$. A naive static decision rule would classify the sequence as anomalous if $s_t > \tau$, where $\tau$ is a fixed threshold. Instead, our module dynamically estimates the threshold based on data statistics and adaptive modeling.
First, a fast adaptation layer based on Exponentially Weighted Moving Average (EWMA) is employed to track short-term variations in the anomaly score distribution. The adaptive threshold estimate at time t is defined as:
$$\tau_t^{\mathrm{fast}} = \alpha s_{t-1} + (1 - \alpha) \tau_{t-1}^{\mathrm{fast}}$$
where $\alpha \in (0, 1)$ is a smoothing parameter controlling the influence of recent scores. This component ensures that the threshold can quickly respond to sudden shifts caused by temporary load fluctuations or bursty workloads.
Second, a slow calibration layer is introduced to handle long-term distribution drift. We design a lightweight gated neural network that takes historical anomaly score statistics as input and outputs a calibration factor $\delta_t$. The final threshold adjustment is then given by:
$$\tau_t^{\mathrm{slow}} = \tau_t^{\mathrm{fast}} + \delta_t$$
where $\delta_t$ adaptively corrects for systematic shifts in the score distribution, ensuring that the model maintains stable performance even in evolving environments. The gating mechanism inside the neural network regulates the influence of different statistical features, allowing the system to emphasize anomaly-sensitive signals.
The final decision rule of the dynamic thresholding module is:
$$y_t = \begin{cases} 1, & s_t > \tau_t^{\mathrm{slow}} \\ 0, & \text{otherwise} \end{cases}$$
where $y_t = 1$ denotes that the sequence is classified as anomalous, and $y_t = 0$ denotes a normal sequence.
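The two threshold layers and the decision rule can be sketched as a streaming loop. The EWMA update follows the equation for the fast layer with the smoothing factor 0.2 reported in the experimental setup. The slow calibration network is not specified in enough detail to reproduce, so this sketch substitutes a user-supplied `calibrate` callable (here a constant margin); that substitution, the initial threshold `tau_init`, and the function name are assumptions.

```python
def dtpm_decisions(scores, alpha=0.2, tau_init=0.0, calibrate=None):
    """Return (labels, thresholds) for a stream of anomaly scores.

    tau_fast_t = alpha * s_{t-1} + (1 - alpha) * tau_fast_{t-1}
    tau_slow_t = tau_fast_t + delta_t
    y_t = 1 if s_t > tau_slow_t else 0
    """
    if calibrate is None:
        calibrate = lambda history: 0.0   # no long-term correction
    tau_fast = tau_init
    history, labels, thresholds = [], [], []
    for t, s_t in enumerate(scores):
        if t > 0:
            # The fast layer only sees scores observed before time t.
            tau_fast = alpha * scores[t - 1] + (1 - alpha) * tau_fast
        delta_t = calibrate(history)      # stand-in for the gated calibration network
        tau_slow = tau_fast + delta_t
        labels.append(1 if s_t > tau_slow else 0)
        thresholds.append(tau_slow)
        history.append(s_t)
    return labels, thresholds

stream = [1.0, 1.0, 1.0, 5.0, 1.0]        # placeholder scores with one spike
labels, thresholds = dtpm_decisions(stream, tau_init=1.0, calibrate=lambda h: 0.5)
```

In this toy stream only the spike at index 3 exceeds its threshold, and the threshold at the next step rises because the spike enters the moving average.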
By combining short-term EWMA-based adaptation with long-term gated calibration, the DTPM effectively mitigates the limitations of static thresholds. It enables the model to self-adapt to concept drift in log distributions without requiring labeled anomaly data, thereby improving both robustness and generalization. Furthermore, the lightweight design ensures that threshold updates can be computed efficiently, making the module suitable for real-time log stream processing in large-scale industrial systems.

4.8. Anomaly Decision and Output

After feature extraction and representation learning by the contextual encoder and the sequential enhancement module, the model produces an anomaly score for each log sequence. The final step is to decide, using the threshold supplied by the dynamic thresholding module, whether a given sequence should be classified as normal or anomalous.
In the proposed framework, two complementary self-supervised tasks—Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM)—are used to generate reliable anomaly scores. The MLKP task enables the model to learn semantic and contextual dependencies within log sequences. If the model fails to correctly predict masked log keys, this reflects a deviation from learned normal patterns. Consequently, the prediction loss of MLKP is used as an anomaly score, with larger loss values indicating a higher likelihood of abnormality.
At the same time, the VHM task encourages the representations of normal log sequences to remain compact in the latent space. By reducing intra-class variance, normal samples are constrained to lie within a hyperspherical region centered at a centroid. Sequences whose representations fall outside this region lie at larger distances from the centroid, and these distances are also treated as anomaly scores. Together, these two tasks form a robust self-supervised framework, where MLKP captures semantic deviations and VHM focuses on structural deviations.
Formally, given a log sequence $X_t$, the overall anomaly score $s_t$ is defined as a weighted combination of the MLKP and VHM components:
$$s_t = \beta \cdot \mathcal{L}_{\mathrm{MLKP}}(X_t) + (1 - \beta) \cdot \mathcal{L}_{\mathrm{VHM}}(X_t)$$
where $0 \le \beta \le 1$ is a balancing coefficient. This combined score integrates semantic prediction difficulty and spatial deviation, offering a comprehensive measurement of anomaly likelihood.
The anomaly score $s_t$ is then passed to the Dynamic Thresholding Prediction Module (DTPM), which adaptively determines the threshold $\tau_t$ according to both short-term fluctuations and long-term drift in log distributions. The final classification decision is made as:
$$y_t = \begin{cases} 1, & s_t > \tau_t \\ 0, & \text{otherwise} \end{cases}$$
where $y_t = 1$ indicates that the log sequence $X_t$ is anomalous, and $y_t = 0$ indicates normal behavior.
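The score combination and final decision amount to a convex mix followed by a comparison, as in the short sketch below. The value of $\beta$ is not reported in the text, so the default of 0.5 is only a placeholder, and the function names are hypothetical.

```python
def anomaly_score(mlkp_value, vhm_value, beta=0.5):
    """s_t = beta * L_MLKP(X_t) + (1 - beta) * L_VHM(X_t), with 0 <= beta <= 1."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return beta * mlkp_value + (1.0 - beta) * vhm_value

def classify(score, threshold):
    """y_t = 1 (anomalous) if the score strictly exceeds the threshold, else 0."""
    return 1 if score > threshold else 0

s_t = anomaly_score(mlkp_value=2.0, vhm_value=4.0, beta=0.25)
y_t = classify(s_t, threshold=3.0)
```

Since the two components may live on different numeric scales, some form of normalization before mixing may be needed in practice; the text does not say whether one is applied.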
Finally, anomaly detection results can be reported at the sequence level or further aggregated to the system level, depending on specific application requirements. In real-time monitoring scenarios, detected anomalies are immediately flagged and can be integrated with alerting or visualization systems to support rapid diagnosis. In offline analysis settings, anomaly labels can be leveraged for subsequent root cause analysis, trend identification, or system optimization.
By integrating contextual encoding, sequential enhancement, dual self-supervised objectives, and adaptive thresholding, the proposed framework provides a reliable and interpretable anomaly decision process. This design enables robust performance under dynamic log distributions while preserving the efficiency needed for deployment in large-scale real-world systems.

5. Experiments

5.1. Experimental Setup

This section introduces the experimental setup used to evaluate the effectiveness of the proposed method. Three publicly available datasets commonly used in log anomaly detection research are selected, namely HDFS, BGL, and Thunderbird. These datasets span different application domains and exhibit diverse log formats, providing a comprehensive benchmark for evaluating model performance in a range of real-world scenarios. The following subsections describe the datasets, preprocessing steps, experimental environment, parameter configurations, and baseline methods in detail.
Datasets. We evaluate the proposed method on three widely used benchmark datasets for log anomaly detection: HDFS, BGL, and Thunderbird. The overall statistics of the datasets are summarized in Table 1.
Hadoop Distributed File System (HDFS) [34]. The HDFS dataset is generated by executing Hadoop-based MapReduce jobs on Amazon EC2 clusters and is manually labeled using handcrafted rules to identify anomalous events. It contains 11,172,157 raw log messages, among which 284,818 are labeled as anomalous. For HDFS, log messages are grouped into log sequences based on the session (block) identifiers associated with each log entry. This session-based grouping reflects the execution semantics of distributed jobs. The resulting log sequences have an average length of 19 log events. All normal log sequences in the training split are used for model training.
BlueGene/L Supercomputer System (BGL) [35]. The BGL dataset is collected from a BlueGene/L supercomputer system at Lawrence Livermore National Laboratory (LLNL). Log messages are annotated using alert category tags, where alert messages are treated as anomalous events. The dataset consists of 4,747,963 log messages, including 348,460 anomalous entries. For BGL, log sequences are constructed using a time-based sliding window of 5 min, which captures temporal correlations among system events. The average length of the resulting log sequences is approximately 562 log events. All normal sequences in the training split are used for unsupervised learning.
Thunderbird [35]. Thunderbird is a large-scale log dataset collected from a supercomputer system. To ensure computational feasibility while preserving the temporal characteristics of the data, we use a subset consisting of the first 20,000,000 log messages from the original dataset, among which 758,562 messages are labeled as anomalous. Log sequences are generated using a time-based sliding window of 1 min, resulting in sequences with an average length of approximately 326 log events. Similar to the other datasets, all normal log sequences in the training split are used to learn normal system behavior.
Data Split Strategy. For each dataset, log sequences are divided into training, validation, and testing sets following the experimental protocol commonly adopted in unsupervised log anomaly detection.
The training set consists exclusively of normal log sequences, which are used to learn the patterns of normal system behavior. In line with established experimental settings, the number of normal log sequences used for training is on the order of several thousand for each dataset. A small subset of the training data is further reserved as a validation set for hyperparameter tuning, such as threshold selection.
The testing set contains both normal and anomalous log sequences and is used solely for performance evaluation. Anomaly labels are not used during training or validation.
Sequence construction. For HDFS, raw log messages are grouped into log sequences according to the session (block) identifiers associated with each log entry. For BGL and Thunderbird, log sequences are constructed using time-based sliding windows of 5 min and 1 min, respectively, following standard practice in log anomaly detection.
The resulting sequences are directly used as the model inputs. For sequences shorter than the maximum input length supported by the model, padding is applied. Sequences exceeding the maximum length are truncated to preserve the most recent log events.
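The sequence construction steps can be sketched as follows. The tuple layouts, the `PAD_ID` value (which assumes real template ids start at 1), and the choice of non-overlapping windows when no step is given are assumptions, since the text states the window length but not the stride.

```python
from collections import defaultdict

PAD_ID = 0  # hypothetical padding id; assumes real template ids start at 1

def group_by_session(events):
    """Group (session_id, template_id) pairs into one sequence per session."""
    sessions = defaultdict(list)
    for session_id, template_id in events:
        sessions[session_id].append(template_id)
    return dict(sessions)

def time_windows(events, window_seconds, step_seconds=None):
    """Split time-sorted (timestamp, template_id) pairs into windows.

    With step_seconds=None the windows do not overlap (an assumption).
    """
    if not events:
        return []
    step = step_seconds or window_seconds
    start, end = events[0][0], events[-1][0]
    windows = []
    while start <= end:
        window = [tid for ts, tid in events if start <= ts < start + window_seconds]
        if window:
            windows.append(window)
        start += step
    return windows

def fit_length(sequence, max_len):
    """Pad short sequences; truncate long ones, keeping the most recent events."""
    if len(sequence) > max_len:
        return sequence[-max_len:]
    return sequence + [PAD_ID] * (max_len - len(sequence))

sessions = group_by_session([("blk_1", 5), ("blk_2", 7), ("blk_1", 9)])
windows = time_windows([(0, 1), (10, 2), (70, 3), (130, 4)], window_seconds=60)
```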
Experimental platform. All experiments were conducted on a server equipped with an NVIDIA RTX 3090 GPU (24 GB memory), an Intel Xeon Silver 4210 CPU, and 128 GB of RAM, running Ubuntu 20.04. The implementation was based on PyTorch 1.12 and Python 3.8. To ensure reproducibility and stability, all software dependencies were strictly version-controlled. Log parsing was carried out using the Drain tool, which efficiently converts unstructured log text into structured event template identifiers for model training and evaluation.
Model parameter settings. The BERT encoder followed the standard configuration with 12 Transformer layers, a hidden size of 768, and 12 attention heads. For the Gated Residual BiLSTM (GR-BiLSTM) module, we employed two BiLSTM layers with 128 hidden units each, balancing performance and efficiency. In the Dynamic Thresholding Prediction Module (DTPM), the fast adaptation layer was implemented using the Exponentially Weighted Moving Average (EWMA) method with a smoothing factor of 0.2, while the slow calibration layer was realized as a gated neural network with 64 hidden units. The model was optimized using the Adam optimizer with a learning rate of $1 \times 10^{-4}$ and a batch size of 128. In the total loss function, the weights for the Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM) tasks were set to 1.0 and 0.2, respectively, ensuring a balanced contribution.
Baseline methods. We compared our model against several representative baseline methods:
PCA: Principal Component Analysis reduces dimensionality and extracts dominant features for anomaly detection.
DeepLog: An LSTM-based method that models sequential dependencies among log events and detects anomalies via prediction error.
LogAnomaly: An unsupervised method that detects sequential and quantitative anomalies from log event sequences.
LogBERT: A BERT-based model that captures contextual semantics of logs through self-supervised learning.
LAnoBERT: An extension of LogBERT that integrates LSTM to enhance sequential dependency modeling.
LogGPT: A generative pre-trained Transformer (GPT)-based model that leverages generative tasks for semantic representation and anomaly detection.
LogFiT: A fine-tuning based approach that adapts pre-trained language models such as BERT for log anomaly detection.

5.2. Evaluation Metrics

To comprehensively assess the performance of the proposed method on log anomaly detection tasks, three commonly used evaluation metrics are adopted: Precision, Recall, and F1-score. These metrics are well suited to anomaly detection scenarios, where the data distribution is often highly imbalanced.
Precision reflects the proportion of correctly identified anomalies among all instances predicted as anomalous. A higher precision indicates a lower false alarm rate, meaning that a larger fraction of the detected anomalies correspond to true anomalous events. It is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where TP (True Positives) denotes the number of correctly predicted anomalous log sequences, and FP (False Positives) denotes the number of normal log sequences incorrectly predicted as anomalous.
Recall evaluates the model’s ability to identify true anomalies from all actual anomalies. A higher recall indicates that the model can detect a greater portion of anomalous logs, though it may come at the cost of higher false alarms. It is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where FN (False Negatives) denotes the number of anomalous log sequences that were mistakenly classified as normal.
F1-score is the harmonic mean of precision and recall, providing a balanced evaluation when there is a trade-off between the two metrics. It is particularly useful in imbalanced settings, as it ensures that neither precision nor recall is neglected. The F1-score is defined as:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
By jointly considering these three metrics, we obtain a comprehensive evaluation of the model’s performance in log anomaly detection. This allows us to assess not only the accuracy of anomaly identification but also the model’s robustness in detecting rare yet critical anomalies under imbalanced data conditions.
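For concreteness, the three metrics can be computed from sequence-level labels as in the sketch below. Returning 0.0 when a denominator is zero is a convention chosen for this example, and the labels are toy values rather than experimental data.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute Precision, Recall, and F1 with 1 = anomalous and 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 3 true anomalies, 2 detected, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```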

5.3. Experimental Results and Analysis

This section reports the experimental results of the proposed BERT-LogAnom model on three benchmark datasets and compares its performance with several state-of-the-art baseline methods. Precision, Recall, and F1-score are used as evaluation metrics to systematically evaluate the effectiveness of each approach in log anomaly detection. Overall, the results indicate that BERT-LogAnom consistently outperforms existing methods, especially in its ability to capture complex anomaly patterns in log data.
As shown in Table 2, BERT-LogAnom consistently outperforms baseline methods across all three datasets. On HDFS, BERT-LogAnom achieves an F1-score of 89.98%, surpassing LogBERT (85.56%) and LAnoBERT (85.80%). This improvement demonstrates the effectiveness of the GR-BiLSTM and dynamic thresholding modules in capturing both contextual and sequential dependencies. On BGL, our model obtains an F1-score of 89.84%, which is higher than LogBERT (88.39%) and LogFiT (86.89%), confirming its capability to handle complex and large-scale system logs. On the most challenging Thunderbird dataset, BERT-LogAnom achieves an F1-score of 94.39%, outperforming LogBERT (92.68%), LAnoBERT (92.89%), and LogFiT (92.81%). These results highlight the robustness of our model in heterogeneous and user-driven log environments.
In terms of precision, BERT-LogAnom reaches 96.17% on Thunderbird, the highest among all methods, significantly reducing false positives. At the same time, the recall of 92.68% indicates that the model successfully identifies the majority of anomalies. By contrast, while LogAnomaly achieves competitive recall on Thunderbird (98.74%), its precision drops to 85.69%, leading to a less favorable F1-score. This further confirms the balanced performance of our model across precision and recall.
Among the baseline models, LogBERT and LAnoBERT perform well in capturing contextual semantics, but they struggle with sequential modeling and adaptive thresholding, leading to slightly lower scores compared with our approach. LogGPT and LogFiT also deliver competitive results by leveraging generative and fine-tuning strategies, respectively. Nevertheless, BERT-LogAnom achieves superior performance by combining BERT contextual representation, GR-BiLSTM sequential enhancement, and DTPM adaptive thresholding, leading to consistent improvements across datasets.

5.4. Statistical Robustness and Stability Analysis

Deep learning–based log anomaly detection models are inherently influenced by stochastic factors such as random parameter initialization and mini-batch optimization. To ensure that the reported results reflect the intrinsic performance characteristics of the proposed method rather than incidental randomness, we further analyze the robustness and stability of the experimental outcomes.
For each dataset, the proposed model is trained and evaluated five times with different random seeds, while keeping the dataset splits, model architecture, and all hyperparameters unchanged. For each run, Precision, Recall, and F1-score are computed on the test set. We report the mean and standard deviation of these evaluation metrics across the five runs to characterize performance stability.
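The aggregation over runs can be sketched with the standard library. The F1 values below are placeholders, not results from the paper, and the use of the sample standard deviation (`statistics.stdev`) is an assumption because the text does not say which estimator is reported.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric across runs."""
    return statistics.mean(values), statistics.stdev(values)

# Placeholder F1-scores from five hypothetical runs with different seeds.
f1_runs = [0.90, 0.92, 0.88, 0.91, 0.89]
mean_f1, std_f1 = summarize_runs(f1_runs)
```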
As shown in Table 3, the proposed method demonstrates stable performance on all three datasets, with relatively small standard deviations across evaluation metrics. This indicates that the observed performance improvements are consistent across different random initializations and confirms the robustness of the proposed approach.
To ensure reproducibility, dataset splits and hyperparameters are fixed across experiments; randomness is introduced only through parameter initialization and optimization, which is explicitly analyzed in the stability evaluation (five independent runs).
In summary, the experimental results validate that BERT-LogAnom provides a more robust, precise, and adaptive solution to log anomaly detection compared with existing approaches, particularly in large-scale and heterogeneous environments.

5.5. Temporal Concept Drift Evaluation

To evaluate the effectiveness of the proposed Dynamic Thresholding Prediction Module (DTPM) under temporal distribution shifts, we conduct an additional time-aware evaluation using a chronological data split. For each dataset, log sequences are ordered by timestamp, where normal sequences from an earlier time period are used for training, and sequences from a later period containing both normal and anomalous samples are used for testing. No information from the testing period is used during training or threshold estimation.
We compare the proposed method with and without DTPM to isolate the effect of dynamic thresholding under temporal concept drift. As shown in Table 4, the fixed-threshold variant suffers performance degradation due to distribution mismatch, while the model equipped with DTPM consistently achieves higher F1-scores across all datasets. This demonstrates that DTPM effectively adapts the anomaly decision threshold to evolving score distributions and improves robustness in non-stationary environments.
Following the chronological evaluation protocol, the model is trained exclusively on normal log sequences from an earlier time period, while evaluation is performed on sequences from a subsequent time period that contain both normal and anomalous samples. For the fixed-threshold setting, a single global threshold is estimated from the training data and applied uniformly during testing. In contrast, the proposed DTPM dynamically predicts anomaly thresholds for each sequence based on contextual information, enabling adaptive decision making under temporal distribution shifts.
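The chronological protocol can be sketched as a sort followed by a cut. The record layout (`start_time`, `label`), the split fraction, and the function name are assumptions for illustration; the paper does not report the exact cut point.

```python
def chronological_split(sequences, train_fraction=0.5):
    """Split sequences by time: earlier normal ones for training, later ones for testing.

    Each sequence is a dict with 'start_time' and 'label' (0 normal, 1 anomalous).
    """
    ordered = sorted(sequences, key=lambda s: s["start_time"])
    cut = int(len(ordered) * train_fraction)
    # Training uses only normal sequences from the earlier period.
    train = [s for s in ordered[:cut] if s["label"] == 0]
    # Testing uses everything from the later period, normal and anomalous.
    test = ordered[cut:]
    return train, test

records = [
    {"start_time": 30, "label": 1},
    {"start_time": 10, "label": 0},
    {"start_time": 40, "label": 0},
    {"start_time": 20, "label": 1},
]
train, test = chronological_split(records)
```

Because the cut is made on sorted timestamps, every training sequence precedes every test sequence, so no information from the testing period can reach training or threshold estimation.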

5.6. Ablation Studies and Module Effectiveness

To rigorously assess the contribution of the two core components in BERT-LogAnom—the Gated Residual Bidirectional LSTM (GR-BiLSTM) sequential enhancement and the Dynamic Thresholding Prediction Module (DTPM)—we conduct ablation studies by selectively removing each module (and both) while keeping all other settings identical to the full model. Performance is evaluated on HDFS, BGL, and Thunderbird using Precision, Recall, and F1-score.
Effect of removing GR-BiLSTM. Table 5 reports the results when the GR-BiLSTM module is removed. Across all three datasets, performance drops are observed, with the largest decline on HDFS (F1 from 89.98% to 86.61%). This confirms that explicitly modeling bidirectional sequential dependencies, coupled with residual fusion and gating, is crucial for capturing order-sensitive anomalies beyond what contextual encoding alone can provide.
Effect of removing DTPM. Table 6 shows the impact of removing the dynamic thresholding module. The performance consistently decreases, especially on HDFS and BGL (F1 drops of ~3 and ~2.8 points, respectively). The reduction is smaller on Thunderbird (F1 from 94.39% to 93.44%), suggesting that while contextual and sequential modeling already provides strong separation, adaptive thresholding further stabilizes decisions under distributional shifts and load variability.
Effect of removing both modules. Table 7 summarizes the results when both GR-BiLSTM and DTPM are removed. The compounded degradation is evident, with F1 decreasing to 86.16% on HDFS and 85.62% on BGL. This demonstrates the complementarity of the two components: GR-BiLSTM strengthens explicit sequence modeling and anomaly-sensitive emphasis, while DTPM enhances robustness to data drift via adaptive decision boundaries.
The ablation results establish that both modules materially contribute to the final performance, and their joint effect is greater than either alone. In particular, the GR-BiLSTM contributes larger gains on datasets with stronger order dependencies (e.g., HDFS), while DTPM yields notable benefits where distribution drift and non-stationarity are more pronounced (e.g., BGL). On Thunderbird, where semantic context is already highly informative, both modules still provide measurable improvements, with DTPM further reducing decision instability.

5.7. Parameter Sensitivity Analysis

To evaluate the sensitivity of BERT-LogAnom to key hyperparameters, we conducted experiments by varying important parameters and observing their impact on model performance. The parameters examined include the number of hidden units and LSTM layers in the Gated Residual Bidirectional LSTM (GR-BiLSTM) module, the smoothing factor in the Dynamic Thresholding Prediction Module (DTPM), and the weight ratio between the Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM) tasks.
Figure 5a shows the impact of varying the hidden units in the GR-BiLSTM module. As the number of hidden units increases from 32 to 128, both Precision and Recall improve significantly, indicating that richer sequential representations benefit anomaly recognition. Beyond 128 units, however, performance slightly decreases, suggesting that excessive capacity may introduce redundant information and overfitting. Figure 5b presents the effect of the smoothing factor ($\alpha$) in the Dynamic Thresholding Prediction Module (DTPM). The optimal performance occurs around $\alpha = 0.2$, where the model achieves the best balance between rapid adaptation to short-term fluctuations and stability under long-term drift. Too small a value results in delayed response, whereas too large a value causes instability. Figure 5c analyzes the effect of the MLKP loss weight ($\lambda_{\mathrm{MLKP}}$) on detection accuracy. Increasing $\lambda_{\mathrm{MLKP}}$ up to 1.2 enhances the model’s discriminative ability by strengthening contextual learning, but further increments lead to marginal declines as the VHM loss becomes underweighted. Overall, these results confirm that BERT-LogAnom maintains stable performance across a wide range of hyperparameters, demonstrating good robustness and generalization.

5.8. Running Efficiency Analysis

This section evaluates the running efficiency of BERT-LogAnom on the HDFS dataset. We compare its training time and inference speed with several baseline models: PCA, DeepLog, LogAnomaly, LogBERT, LAnoBERT, LogGPT, and LogFiT.
Table 8 provides a comparison of training time and inference speed for each model on the HDFS dataset. The results show that BERT-LogAnom achieves competitive training time despite its complex architecture, with 11,000 s per epoch. While more sophisticated models like LogGPT and LogFiT require longer training times (up to 18,000 s and 16,000 s, respectively), BERT-LogAnom maintains a balance between performance and efficiency.
Although BERT-LogAnom is built upon the BERT architecture, which is typically associated with high computational cost, it still achieves efficient processing performance. Its training time (11,000 s per epoch) is comparable to, though slightly higher than, that of other deep learning-based approaches of similar complexity, such as LogBERT (10,800 s) and LAnoBERT (10,500 s), while its inference time of 0.90 s per log is lower than both (1.20 s and 1.00 s, respectively). This inference time suggests that BERT-LogAnom can support real-time log stream processing, making it suitable for large-scale anomaly detection applications.
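The comparison above can be restated as ratios relative to BERT-LogAnom. The following is a minimal sketch that uses only the values reported in Table 8 for the Transformer-based models; the dictionary `TABLE8` and the function `relative_cost` are illustrative names, not part of the original implementation.

```python
# (training time in s/epoch, inference time in s/log) on HDFS, copied from Table 8.
TABLE8 = {
    "LogBERT": (10_800, 1.20),
    "LAnoBERT": (10_500, 1.00),
    "LogGPT": (18_000, 1.40),
    "LogFiT": (16_000, 1.10),
    "BERT-LogAnom": (11_000, 0.90),
}


def relative_cost(model, reference="BERT-LogAnom"):
    """Return (training, inference) time of `model` as a ratio of the reference model's time."""
    train, infer = TABLE8[model]
    ref_train, ref_infer = TABLE8[reference]
    return round(train / ref_train, 2), round(infer / ref_infer, 2)


for name in TABLE8:
    print(name, relative_cost(name))
```

A ratio below 1 means the model is faster than BERT-LogAnom on that measure; for instance, LogBERT trains in about 0.98 times the time per epoch but needs about 1.33 times the inference time per log.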

6. Conclusions

In this paper, we proposed BERT-LogAnom, an unsupervised log anomaly detection framework that integrates BERT-based contextual representation learning with explicit sequential modeling and adaptive thresholding. The framework is designed to address key challenges in log anomaly detection, including order-sensitive anomaly patterns and performance degradation under distribution shifts.
Experimental results on multiple benchmark datasets demonstrate that the proposed approach consistently improves detection performance over strong baselines. In particular, the introduction of the GR-BiLSTM module enhances the modeling of sequential execution patterns, while the statistical robustness analysis confirms that the performance gains are stable across multiple runs. Furthermore, the temporal concept drift evaluation shows that the Dynamic Thresholding Prediction Module (DTPM) provides better adaptability than fixed-threshold strategies under non-stationary conditions.
Overall, the results indicate that BERT-LogAnom achieves a favorable balance between accuracy, robustness, and practical deployability. Future work will explore extending the framework to broader system environments and more complex dependency structures in large-scale log data.

Author Contributions

Conceptualization, W.W. and X.L.; methodology, S.A. and W.W.; software, S.A.; validation, Z.S., J.C. and R.Q.; formal analysis, X.L. and S.A.; investigation, S.A. and Y.D.; resources, W.W.; data curation, Z.S. and J.C.; writing—original draft preparation, S.A. and X.L.; writing—review and editing, W.W. and R.Q.; visualization, X.L.; supervision, W.W.; project administration, W.W.; funding acquisition, W.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Special Fund for High-Quality Development of the Ministry of Industry and Information Technology (Grant No. TC220A04A-182); the National Natural Science Foundation of China (Grant No. 62271045); the Key Program for International Science and Technology Cooperation Projects of China (Grant No. 2022YFE0112300); the Fundamental Research Funds for the Central Universities (Grants No. FRF-IDRY-23-027 and FRF-KST-25-008); the Young Teaching Backbone Talent Program of the University of Science and Technology Beijing (Grant No. JXGG202509); and the CCF-NSFOCUS “Kunpeng” Research Fund (Grant No. CCF-NSFOCUS 202417). The author is also supported by the 2024 Xiaomi Young Scholar Program.

Data Availability Statement

The data analyzed in this study were obtained from the loghub program, which is publicly available at https://github.com/logpai/loghub (accessed on 6 January 2026).

Acknowledgments

The authors would like to thank Jiaming Yu (RDFZ Xishan School, Beijing 100193, China) for his assistance in preliminary experiments and for the helpful discussions that contributed to the development of this work.

Conflicts of Interest

Authors Jingmei Chen and Zhan Shu were employed by the company NSFOCUS Technologies Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kazemzadeh, R.S.; Jacobsen, H.A. Reliable and highly available distributed publish/subscribe service. In Proceedings of the 2009 28th IEEE International Symposium on Reliable Distributed Systems, Niagara Falls, NY, USA, 27–30 September 2009. [Google Scholar] [CrossRef]
  2. Bauer, E.; Adams, R. Reliability and Availability of Cloud Computing; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar] [CrossRef]
  3. Le, V.H.; Zhang, H. Log-based anomaly detection without log parsing. In Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia, 15–19 November 2021. [Google Scholar] [CrossRef]
  4. Guan, W.; Cao, J.; Zhao, H.; Gu, Y.; Qian, S. Survey and benchmark of anomaly detection in business processes. IEEE Trans. Knowl. Data Eng. 2024, 37, 493–512. [Google Scholar] [CrossRef]
  5. Zhang, S.; Ji, Y.; Luan, J.; Nie, X.; Cheng, Z.; Ma, M.; Sun, Y.; Pei, D. End-to-end AutoML for unsupervised log anomaly detection. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE), Sacramento, CA, USA, 27 October–1 November 2024. [Google Scholar] [CrossRef]
  6. Le, V.H.; Zhang, H. Log-based anomaly detection with deep learning: How far are we? In Proceedings of the 44th International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA, 18–26 May 2022. [Google Scholar] [CrossRef]
  7. Landauer, M.; Onder, S.; Skopik, F.; Wurzenberger, M. Deep learning for anomaly detection in log data: A survey. Mach. Learn. Appl. 2023, 12, 100470. [Google Scholar] [CrossRef]
  8. Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  9. Du, M.; Li, F.; Zheng, G.; Srikumar, V. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Dallas, TX, USA, 30 October–3 November 2017. [Google Scholar] [CrossRef]
  10. Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, Y. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019. [Google Scholar] [CrossRef]
  11. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  12. Guo, H.; Yuan, S.; Wu, X. LogBERT: Log anomaly detection via BERT. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021. [Google Scholar] [CrossRef]
  13. Zhang, X.; Xu, Y.; Lin, Q.; Chen, B.; Zhang, H.; Lou, J.; Zhang, D. Robust log-based anomaly detection on unstable log data. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Tallinn, Estonia, 26–30 August 2019. [Google Scholar] [CrossRef]
  14. Qi, J.; Huang, S.; Luan, Z.; Yang, S.; Fung, C.J.; Yang, H.; Qian, D.; Shang, J.; Xiao, Z.; Wu, Z. LogGPT: Exploring ChatGPT for log-based anomaly detection. In Proceedings of the IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Melbourne, Australia, 17–21 December 2023. [Google Scholar] [CrossRef]
  15. Guo, H.; Yang, J.; Liu, J.; Zhang, D.; Lou, J.; Zhang, D. LogFormer: A pre-train and tuning pipeline for log anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20–27 February 2024. [Google Scholar] [CrossRef]
  16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018. arXiv:1810.04805. [Google Scholar]
  17. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The performance of LSTM and BiLSTM in forecasting time series. In Proceedings of the IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019. [Google Scholar] [CrossRef]
  18. Houston, A.I.; Sumida, B.H. Learning rules, matching and frequency dependence. J. Theor. Biol. 1987, 126, 289–308. [Google Scholar] [CrossRef]
  19. Liu, F.; Yu, C.; Meng, W.; Chowdhury, A. Effective keyword search in relational databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 27–29 June 2006. [Google Scholar] [CrossRef]
  20. Hristidis, V.; Papakonstantinou, Y. Discover: Keyword search in relational databases. In Proceedings of the International Conference on Very Large Databases (VLDB), Hong Kong, China, 20–23 August 2002; Available online: https://www.sciencedirect.com/science/chapter/edited-volume/abs/pii/B9781558608696500652 (accessed on 13 December 2025).
  21. Wang, J.; Tang, Y.; He, S.; Zhao, J.; Pei, D.; Yang, H. LogEvent2vec: LogEvent-to-vector based anomaly detection for large-scale logs in internet of things. Sensors 2020, 20, 2451. [Google Scholar] [CrossRef] [PubMed]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  23. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019. arXiv:1907.11692. [Google Scholar]
  24. Lee, Y.; Kim, J.; Kang, P. LanoBERT: System log anomaly detection based on BERT masked language model. Appl. Soft Comput. 2023, 146, 110689. [Google Scholar] [CrossRef]
  25. Almodovar, C.; Sabrina, F.; Karimi, S.; Azad, S. LogFiT: Log anomaly detection using fine-tuned language models. IEEE Trans. Netw. Serv. Manag. 2024, 21, 1715–1723. [Google Scholar] [CrossRef]
  26. Guan, W.; Cao, J.; Qian, S.; Gu, Y.; Zhao, H. LogLLM: Log-based anomaly detection using large language models. arXiv 2024. arXiv:2411.08561. [Google Scholar]
  27. Benedetti, F.; Pellicani, A.; Pio, G.; Ceci, M.; Malerba, D. IMMENSE: Inductive multi-perspective user classification in social networks. Online Soc. Netw. Media 2025, 50, 100335. [Google Scholar] [CrossRef]
  28. Chu, G.; Wang, J.; Qi, Q.; He, S.; Zhang, D.; Lou, J. Anomaly detection on interleaved log data with semantic association mining on log-entity graph. IEEE Trans. Softw. Eng. 2025, 51, 581–594. [Google Scholar] [CrossRef]
  29. Ahmad, S.; Purdy, S. Real-time anomaly detection for streaming analytics. arXiv 2016. arXiv:1607.02480. [Google Scholar]
  30. Lucas, J.M.; Saccucci, M.S. Exponentially weighted moving average control schemes: Properties and enhancements. Technometrics 1990, 32, 1–12. [Google Scholar] [CrossRef]
  31. Siffer, A.; Fouque, P.A.; Termier, A.; Largouët, C. Anomaly detection in streams with extreme value theory. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Halifax, NS, Canada, 13–17 August 2017. [Google Scholar] [CrossRef]
  32. Zhu, J.; He, S.; Liu, J.; He, P.; Xie, Q.; Zheng, Z.; Lyu, M.R. Tools and benchmarks for automated log parsing. In Proceedings of the IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 25–31 May 2019. [Google Scholar] [CrossRef]
  33. He, P.; Zhu, J.; He, S.; Li, J.; Lyu, M.R. An evaluation study on log parsing and its use in log mining. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Toulouse, France, 28 June–1 July 2016. [Google Scholar] [CrossRef]
  34. Xu, W.; Huang, L.; Fox, A.; Patterson, D.; Jordan, M.I. Online system problem detection by mining patterns of console logs. In Proceedings of the IEEE International Conference on Data Mining (ICDM), Miami, FL, USA, 6–9 December 2009. [Google Scholar] [CrossRef]
  35. Oliner, A.; Stearley, J. What supercomputers say: A study of five system logs. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Edinburgh, UK, 25–28 June 2007. [Google Scholar] [CrossRef]
  36. He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An online log parsing approach with fixed depth tree. In Proceedings of the IEEE International Conference on Web Services (ICWS), Honolulu, HI, USA, 25–30 June 2017. [Google Scholar] [CrossRef]
  37. Salazar, J.; Liang, D.; Nguyen, T.Q.; Kirchhoff, K. Masked language model scoring. arXiv 2019. arXiv:1910.14659. [Google Scholar]
Figure 1. An example of a system log.
Figure 2. Session window.
Figure 3. Fixed window.
Figure 4. Overall Model Architecture.
Figure 5. Performance variation of BERT-LogAnom under different hyperparameter settings: (a) Impact of the number of hidden units in the GR-BiLSTM module on detection performance. (b) Effect of the smoothing factor (α) in the dynamic thresholding module on detection performance. (c) Influence of the MLKP weight (λ_MLKP) on detection performance.
Table 1. Statistics of evaluation datasets.
| Dataset | Raw Log Messages | Anomalous Messages | Avg. Sequence Length | Training Data | Testing Data |
|---|---|---|---|---|---|
| HDFS | 11,172,157 | 284,818 | 19 | Normal sequences only | Normal + Anomalous |
| BGL | 4,747,963 | 348,460 | 562 | Normal sequences only | Normal + Anomalous |
| Thunderbird | 20,000,000 | 758,562 | 326 | Normal sequences only | Normal + Anomalous |
Table 2. Experimental results on HDFS, BGL and Thunderbird datasets (all values in %).
| Method | HDFS Precision | HDFS Recall | HDFS F1-Score | BGL Precision | BGL Recall | BGL F1-Score | Thunderbird Precision | Thunderbird Recall | Thunderbird F1-Score |
|---|---|---|---|---|---|---|---|---|---|
| PCA | 7.51 | 100.00 | 13.97 | 10.42 | 97.49 | 18.83 | 46.28 | 98.99 | 63.07 |
| DeepLog | 85.89 | 72.46 | 78.61 | 84.73 | 79.26 | 81.90 | 86.63 | 84.72 | 85.66 |
| LogAnomaly | 91.57 | 45.73 | 61.00 | 74.87 | 78.29 | 76.54 | 85.69 | 98.74 | 91.75 |
| LogBERT | 89.16 | 82.24 | 85.56 | 88.51 | 89.27 | 88.39 | 94.14 | 91.27 | 92.68 |
| LAnoBERT | 89.24 | 82.61 | 85.80 | 89.14 | 88.73 | 88.78 | 93.17 | 92.62 | 92.89 |
| LogGPT | 89.64 | 85.72 | 87.64 | 87.95 | 86.19 | 87.06 | 92.76 | 91.28 | 91.84 |
| LogFiT | 90.27 | 87.42 | 88.82 | 89.71 | 84.25 | 86.89 | 93.89 | 91.75 | 92.81 |
| BERT-LogAnom | 91.75 | 88.26 | 89.98 | 91.42 | 88.32 | 89.84 | 96.17 | 92.68 | 94.39 |
Table 3. Performance stability across multiple independent runs (mean ± std).
| Dataset | Precision (Mean ± Std) | Recall (Mean ± Std) | F1-Score (Mean ± Std) |
|---|---|---|---|
| HDFS | 92.03 ± 0.31 | 88.12 ± 0.21 | 90.03 ± 0.26 |
| BGL | 91.46 ± 0.52 | 88.27 ± 0.63 | 89.83 ± 0.58 |
| Thunderbird | 96.14 ± 0.24 | 92.66 ± 0.18 | 94.47 ± 0.11 |

Note. Results are averaged over five independent runs with different random seeds. All dataset splits and hyperparameters are kept identical across runs.
Table 4. Performance under temporal concept drift (chronological split).
| Dataset | Fixed Threshold (F1-Score) | DTPM (F1-Score) |
|---|---|---|
| HDFS | 88.41 | 90.04 |
| BGL | 87.95 | 89.86 |
| Thunderbird | 91.48 | 94.36 |
Table 5. Effect of removing GR-BiLSTM.
| Dataset | Model Variant | Precision | Recall | F1-Score |
|---|---|---|---|---|
| HDFS | BERT-LogAnom (full) | 91.75 | 88.26 | 89.98 |
| HDFS | w/o GR-BiLSTM | 89.14 | 84.21 | 86.61 |
| BGL | BERT-LogAnom (full) | 91.42 | 88.32 | 89.84 |
| BGL | w/o GR-BiLSTM | 88.95 | 85.43 | 87.15 |
| Thunderbird | BERT-LogAnom (full) | 96.17 | 92.68 | 94.39 |
| Thunderbird | w/o GR-BiLSTM | 92.61 | 91.55 | 92.08 |
Table 6. Effect of removing DTPM.
| Dataset | Model Variant | Precision | Recall | F1-Score |
|---|---|---|---|---|
| HDFS | BERT-LogAnom (full) | 91.75 | 88.26 | 89.98 |
| HDFS | w/o DTPM | 88.49 | 85.19 | 86.81 |
| BGL | BERT-LogAnom (full) | 91.42 | 88.32 | 89.84 |
| BGL | w/o DTPM | 87.95 | 86.23 | 87.08 |
| Thunderbird | BERT-LogAnom (full) | 96.17 | 92.68 | 94.39 |
| Thunderbird | w/o DTPM | 95.78 | 91.22 | 93.44 |
Table 7. Effect of removing GR-BiLSTM and DTPM together.
| Dataset | Model Variant | Precision | Recall | F1-Score |
|---|---|---|---|---|
| HDFS | BERT-LogAnom (full) | 91.75 | 88.26 | 89.98 |
| HDFS | w/o GR-BiLSTM & DTPM | 88.31 | 84.11 | 86.16 |
| BGL | BERT-LogAnom (full) | 91.42 | 88.32 | 89.84 |
| BGL | w/o GR-BiLSTM & DTPM | 87.05 | 84.23 | 85.62 |
| Thunderbird | BERT-LogAnom (full) | 96.17 | 92.68 | 94.39 |
| Thunderbird | w/o GR-BiLSTM & DTPM | 93.18 | 90.44 | 91.79 |
Table 8. Running Efficiency Comparison on HDFS Dataset (in seconds).
| Model | Training Time (s/Epoch) | Inference Time (s/log) |
|---|---|---|
| PCA | 3 | 0.02 |
| DeepLog | 1500 | 0.45 |
| LogAnomaly | 900 | 0.40 |
| LogBERT | 10,800 | 1.20 |
| LAnoBERT | 10,500 | 1.00 |
| LogGPT | 18,000 | 1.40 |
| LogFiT | 16,000 | 1.10 |
| BERT-LogAnom | 11,000 | 0.90 |
Share and Cite

MDPI and ACS Style

Lu, X.; An, S.; Chen, J.; Shu, Z.; Wang, W.; Qi, R.; Diao, Y. BERT-LogAnom: Enhancing Log Anomaly Detection with Gated Residual BiLSTM and Dynamic Thresholding. Electronics 2026, 15, 806. https://doi.org/10.3390/electronics15040806
