1. Introduction
Ensuring high availability and reliability is essential for large-scale software-intensive systems, such as distributed clusters, cloud platforms, and high-performance computing environments [1,2]. As system scale and complexity increase, anomalies become inevitable during long-term operation, and even seemingly minor ones may propagate across components, leading to performance degradation, data integrity issues, or cascading, large-scale service disruptions [3,4]. Consequently, effective log anomaly detection has become a critical capability for maintaining system stability, reliability, and operational efficiency in modern computing infrastructures [5].
System logs constitute one of the most widely available and informative sources of system telemetry, recording detailed runtime states, execution traces, and operational events [6]. Engineers routinely rely on log data to monitor system behavior, identify abnormal conditions, and diagnose root causes of failures. However, the massive volume, high dimensionality, and complex temporal dependencies of log data render manual inspection impractical and error-prone, especially in large-scale environments [7]. These challenges have driven extensive research on automated log-based anomaly detection. Nevertheless, the task remains difficult: anomalies often manifest as subtle deviations in event ordering or execution stages rather than isolated abnormal events, and real-world systems are inherently non-stationary due to workload fluctuations, configuration changes, and software evolution. As a result, effective log anomaly detection methods must not only capture rich contextual semantics and sequential dependencies, but also remain robust under continuously evolving data distributions.
In recent years, deep learning has been widely explored for end-to-end log representation learning. RNN-based [8] approaches, such as DeepLog [9] and its successor LogAnomaly [10], model log sequences using LSTM [11] to predict the next event or combine frequency features for anomaly detection. These models demonstrate strong performance in capturing short-range sequential dependencies. Nevertheless, the recursive nature of RNNs limits their ability to capture long-range dependencies, making them less effective in handling complex call chains or multi-stage state transitions. Transformer-based architectures, by contrast, leverage parallel attention mechanisms to capture global context and have shown superior results in log anomaly detection. Representative works include LogBERT [12], LogRobust [13], LogGPT [14], and LogFormer [15], which achieve state-of-the-art performance on multiple benchmark datasets. Despite these advances, Transformer-based approaches still face two main limitations: (i) the attention mechanism lacks explicit encoding of sequential constraints, reducing their ability to fully exploit staged or progressive dependencies inherent in logs; and (ii) anomaly detection decisions are typically made using static thresholds or top-k strategies, which are highly sensitive to distribution shifts caused by system upgrades or workload variations, leading to false positives or missed detections.
Existing BERT-based [16] methods, such as LogBERT [12], leverage Transformer’s powerful semantic modeling capacity by incorporating self-supervised tasks such as Masked Log Key Prediction (MLKP) and Volume of Hypersphere Minimization (VHM). While effective in capturing local semantic context, they remain limited in modeling explicit long-range sequential dependencies. Furthermore, their reliance on static thresholds makes them vulnerable to performance degradation in dynamic environments, where log distributions often drift over time.
To address these challenges, we propose BERT-LogAnom, an unsupervised log anomaly detection framework that couples contextual representation learning with explicit sequential modeling and adaptive decision making. Specifically, we introduce a Gated Residual BiLSTM (GR-BiLSTM) module to enhance long-range sequential dependency [17] modeling on top of contextual representations, and a Dynamic Threshold Prediction Module (DTPM) to adapt anomaly decision boundaries under distribution shifts. Together, these designs improve robustness and practical deployability in dynamic system environments.
The main contributions are summarized as follows:
- (1) We propose GR-BiLSTM to complement contextual representations with explicit sequential dynamics, improving detection of order-/phase-related anomalies.
- (2) We propose DTPM to adapt decision thresholds to evolving score distributions, mitigating the sensitivity of static thresholding.
- (3) We conduct comprehensive experiments on multiple public log datasets and show that BERT-LogAnom consistently improves precision, recall, and F1-score over strong baselines.
While recent studies have explored combining BERT-based representations with sequential models, most existing approaches primarily adopt generic architectural stacking without explicitly tailoring the integration to the characteristics of log anomaly detection. In contrast, our work is motivated by two log-specific challenges: anomalies that manifest as deviations in execution order or staged behaviors, and the instability of fixed decision thresholds under evolving system conditions. Accordingly, we design a Gated Residual BiLSTM (GR-BiLSTM) module to complement contextual representations with order-sensitive sequential dynamics while preserving semantic information, and further couple it with a self-supervised dynamic thresholding mechanism. This task-driven integration enables the proposed framework to jointly address sequential anomaly patterns and non-stationary decision boundaries, distinguishing it from existing BERT-based log anomaly detection methods.
In summary, this study combines the contextual modeling strength of Transformer architectures with enhanced sequential modeling and dynamic decision-making strategies, providing a more effective and practical solution for log anomaly detection in real-world large-scale systems.
3. Preliminaries
3.1. System Logs and Sequence Construction
System logs are indispensable artifacts that record the runtime behavior and internal states of large-scale software systems. A system log consists of a chronological list of log messages, each generated by the logging framework to capture critical runtime events. As shown in Figure 1, a raw log message typically contains two parts: the header and the content. The header includes structural metadata such as timestamp, verbosity level (e.g., INFO/WARN/ERROR), and the originating component [32]. The content can be further divided into a constant part, which reflects the log template (keywords), and a variable part, which carries dynamic runtime information such as identifiers, parameters, or memory addresses. In this work, we focus primarily on the content of log messages, as it provides the semantic patterns essential for anomaly detection.
For effective anomaly detection, raw log messages are usually grouped into log sequences, which reflect execution flows or operational snapshots of the system. Two common sequence construction methods are widely used [33]. The first is session window partitioning, which groups log messages according to session identifiers (e.g., request ID, block ID). This method yields coherent log sequences for each session, as illustrated in Figure 2, where HDFS [34] logs are grouped by their block_id. The second method is fixed/sliding window partitioning, which groups log messages into subsequences of fixed size, either by the number of log entries or a specified time span. As shown in Figure 3, BGL [35] logs are partitioned into windows of size 2 with a step size of 2, capturing snapshots of the system’s runtime behavior.
The objective of log-based anomaly detection is to identify anomalous log sequences that deviate from normal execution patterns, thereby enabling timely recognition of potential issues in system operation. These basic definitions and sequence construction strategies form the foundation of subsequent preprocessing and modeling steps.
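The two partitioning strategies described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used in this work: the function names, the `(session_id, event)` tuple layout, and the event labels such as `E5` are hypothetical, and the sliding-window variant counts entries rather than time spans and keeps only full windows.

```python
from collections import defaultdict


def session_windows(entries):
    """Group (session_id, event) pairs by session, preserving arrival order."""
    sessions = defaultdict(list)
    for session_id, event in entries:
        sessions[session_id].append(event)
    return dict(sessions)


def sliding_windows(events, size, step):
    """Cut a list of events into full windows of `size` entries, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [events[i:i + size] for i in range(0, len(events) - size + 1, step)]


# Session windows: HDFS-style entries keyed by a block identifier (illustrative values).
hdfs_entries = [
    ("blk_1", "E5"), ("blk_2", "E5"), ("blk_1", "E22"),
    ("blk_2", "E11"), ("blk_1", "E9"),
]
by_block = session_windows(hdfs_entries)

# Fixed windows of size 2 with step 2, as in the BGL illustration.
bgl_events = ["E1", "E2", "E3", "E4", "E5", "E6"]
windows = sliding_windows(bgl_events, size=2, step=2)
```

With `step` smaller than `size` the windows overlap, which gives the sliding rather than the fixed variant.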
3.2. Fundamentals of Log Parsing
System logs are a fundamental data source for the operation and maintenance of modern information systems and distributed platforms. Raw log records typically contain multiple fields, such as timestamps, log levels, service components, and unstructured textual messages. Due to the diversity of log formats and the verbose nature of log messages, directly modeling raw logs often introduces noise and unnecessary complexity. Therefore, log parsing is commonly adopted as a preprocessing step in automated log anomaly detection.
Log parsing aims to transform semi-structured logs into structured representations by extracting event templates and corresponding parameters, thereby reducing feature dimensionality and facilitating downstream modeling. In this work, we adopt Drain [36], a widely used and efficient log parsing method, to generate event templates. Drain clusters semantically similar log messages and assigns each template a unique identifier without relying on hand-crafted rules or domain-specific knowledge, making it suitable for diverse log sources.
After parsing, each log message is mapped to its event template identifier, so that a log stream becomes an ordered sequence of template identifiers, effectively converting raw textual logs into structured symbolic sequences. This representation provides compact and noise-reduced inputs for subsequent sequence modeling, BERT-based representation learning, and anomaly detection.
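To make the mapping from raw messages to template identifiers concrete, the sketch below masks variable fields with regular expressions and numbers the resulting templates in order of first appearance. It is deliberately not Drain: Drain learns templates with a fixed-depth parse tree, whereas the masking rules, function names, and sample messages here are hypothetical stand-ins chosen only to show the input and output of the parsing step.

```python
import re

# Hypothetical masking rules for illustration; Drain derives templates differently.
_VARIABLE_PATTERNS = [
    (re.compile(r"blk_-?\d+"), "<*>"),                                   # block IDs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<*>"),        # IPv4[:port]
    (re.compile(r"\b\d+\b"), "<*>"),                                     # bare numbers
]


def to_template(message):
    """Replace variable fields with a placeholder, leaving the constant part."""
    for pattern, placeholder in _VARIABLE_PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


def assign_template_ids(messages):
    """Map each message to a template ID, numbering templates by first appearance."""
    template_ids = {}
    sequence = []
    for message in messages:
        template = to_template(message)
        if template not in template_ids:
            template_ids[template] = len(template_ids) + 1
        sequence.append(template_ids[template])
    return sequence, template_ids


messages = [
    "Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010",
    "Receiving block blk_7503483334202473044 src: /10.251.215.16:55695 dest: /10.251.215.16:50010",
    "PacketResponder 1 for block blk_-1608999687919862906 terminating",
]
sequence, template_ids = assign_template_ids(messages)
```

The first two messages differ only in their variable parts and therefore share one template ID, while the third receives a new one.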
3.3. Fundamentals of Sequential Modeling
Log events inherently display sequential and temporal attributes. Under normal system operation, log event sequences typically follow consistent business logic or state machine patterns, while anomalies or failures typically cause sudden deviations from these sequences. Therefore, accurately modeling the temporal dependencies between log events is critical for effective anomaly detection.
Early sequence modeling approaches, such as n-gram models, relied on fixed-size windows to estimate the probability of the next event given the preceding events. However, their expressive power is limited and they cannot capture long-range dependencies. Recurrent neural networks (RNNs) and their variants, including long short-term memory (LSTM) and gated recurrent units (GRU), have been widely used for sequence modeling tasks because their recurrent states can store and propagate historical information. In particular, LSTM mitigates the vanishing-gradient problem through its gating mechanism, enabling the capture of longer-term dependencies. LSTM-based models such as DeepLog and LogAnomaly have achieved strong results in log anomaly detection, but their global modeling capabilities on very long sequences are limited by the computational and information bottlenecks inherent in the recurrent structure. In addition, practical challenges such as anomaly sparsity and complex contextual correlations in log sequences remain open problems in sequence modeling and inspire the design of more advanced models.
3.4. Fundamentals of BERT
BERT (Bidirectional Encoder Representations from Transformers) represents a major breakthrough in the field of natural language processing. Built on the Transformer architecture, BERT stacks multiple self-attention layers, which supports efficient parallel computation while capturing the global contextual relationships in the input sequence. Compared with traditional models based on recurrent neural networks (RNNs), BERT has significant advantages in modeling long-distance dependencies.
The core innovation of BERT lies in its pre-training objectives and bidirectional modeling capabilities. Through tasks such as Masked Language Modeling (MLM) [37] and Next Sentence Prediction (NSP), BERT can learn rich semantic representations and context dependencies. In log sequence modeling, BERT maps log event sequences to high-dimensional vector representations. Its multi-head self-attention mechanism can dynamically adjust the weights between events, thereby effectively capturing abnormal events and their context.
In log anomaly detection, BERT can serve as the fundamental encoder for sequence modeling, extracting context features from log template sequences. Its flexible structure adapts well to the complex semantics of logs and provides high-quality feature representations for subsequent modules (such as sequence enhancement and anomaly-sensitive gating mechanisms). Recent studies such as LogBERT and LogFormer have successfully applied this family of encoders to log anomaly detection, further verifying its global modeling capabilities and strong generalization performance in this field.
4. Methodology
4.1. Problem Definition
Log anomaly detection aims to automatically identify log sequences that may reflect system failures, security threats, or abnormal behaviors from the massive streams of continuously generated system logs. Let the collection of logs produced up to time $t$ be denoted as $\mathcal{L}_t = \{l_1, l_2, \dots, l_t\}$, where each log entry $l_i$ typically contains a timestamp, log level, service module, and event content. After structured parsing, the logs are mapped into a template sequence $S = (k_1, k_2, \dots, k_t)$, where $k_i$ represents the template identifier of the corresponding log entry $l_i$.
In this work, we focus on a sliding-window-based setting for log anomaly detection. Given a template window $W_t = (k_{t-w+1}, \dots, k_t)$ of length $w$, the objective is to determine whether the current window contains anomalous events. The detection model is expected to learn the patterns of normal log sequences and output either an anomaly score $s_t$ or a binary label $y_t \in \{0, 1\}$ for each window $W_t$, where $y_t = 1$ indicates anomaly and $y_t = 0$ indicates normal.
This task involves several inherent challenges. First, log events often follow complex patterns and contextual structures, whereas anomalous behaviors are typically diverse and occur infrequently, which limits the effectiveness of static, rule-based detection methods. Second, log data are prone to concept drift, as system upgrades, configuration adjustments, or changes in workload can gradually modify what should be considered normal behavior. Under such conditions, anomaly detection models are required not only to capture rich sequential dependencies in log data, but also to demonstrate sufficient generalization ability and adaptability to evolving system environments.
4.2. Log Data Preprocessing and Representation
In practical operational and security analysis settings, raw logs are typically produced in a semi-structured textual form, with substantial variation in structure, field definitions, and event descriptions across different systems. Modeling features directly from such heterogeneous data tends to introduce considerable noise and results in high-dimensional representations, which in turn increase the risk of overfitting and weaken model robustness. Consequently, log preprocessing and the construction of structured representations are necessary steps to ensure the reliability and generalization capability of subsequent modeling stages.
Log parsing. In this study, we employ the Drain parser to transform raw logs into structured event templates. Drain is an efficient and automated log template extraction algorithm that incrementally constructs a hierarchical tree and clusters messages into templates without requiring manual rules. Each raw log entry is mapped to the identifier (ID) of its template, thereby converting complex and unstructured text into a low-dimensional, structured sequence of events. Formally, the parsed log stream can be expressed as
$$S = (k_1, k_2, \dots, k_t), \quad k_i \in \mathcal{T},$$
where $\mathcal{T}$ denotes the set of all templates. Prior to parsing, canonicalization is applied to replace volatile fields such as IP addresses, process identifiers, or numeric literals with placeholders, which reduces template fragmentation while retaining semantic information.
Windowing. To adapt log sequences for deep learning models, the template stream is segmented into fixed-length subsequences using a sliding window. At time step $t$, a window of length $w$ is defined as
$$W_t = (k_{t-w+1}, k_{t-w+2}, \dots, k_t).$$
If the sequence is shorter than $w$, it is padded with a special token $\texttt{[PAD]}$; if it exceeds the limit, it is truncated to maintain consistency across inputs. Windows are constructed within the same service session to prevent unrelated log sources from being mixed.
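The windowing rule can be illustrated with a short Python sketch. The padding ID `PAD = 0`, the choice to pad on the left, and the function name are assumptions made for this example; the text above fixes only that short sequences are padded and long ones truncated to a common length.

```python
PAD = 0  # hypothetical ID reserved for the padding token


def fixed_length_windows(template_ids, length, step=1):
    """Return windows of exactly `length` IDs, each ending at a successive position.

    Windows near the start of a session hold fewer than `length` events and are
    left-padded with PAD; later windows keep only the most recent `length` events.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    windows = []
    for end in range(1, len(template_ids) + 1, step):
        window = template_ids[max(0, end - length):end]
        window = [PAD] * (length - len(window)) + window
        windows.append(window)
    return windows


# One session of template IDs, windowed with length 3.
session = [3, 7, 7, 2]
padded = fixed_length_windows(session, length=3)
```

Calling the function once per session keeps events from unrelated sessions out of the same window.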
Representation. Each template ID $k_i$ is embedded into a dense vector $\mathbf{e}_i \in \mathbb{R}^d$ capturing semantic relationships among templates. To preserve the temporal characteristics of logs, a positional encoding $\mathbf{p}_i \in \mathbb{R}^d$ is added to distinguish order dependencies. Optionally, when timestamps are available, we incorporate a simple time-gap encoding $\boldsymbol{\delta}_i \in \mathbb{R}^d$ to reflect inter-event intervals, which can be informative for anomaly detection. The final input representation at position $i$ is given by
$$\mathbf{x}_i = \mathbf{e}_i + \mathbf{p}_i + \boldsymbol{\delta}_i,$$
where $\boldsymbol{\delta}_i = \mathbf{0}$ if timing information is unavailable or if the event is the first in the window.
The resulting preprocessed and embedded event sequences are then fed into the contextual encoding module (e.g., BERT), providing clean and structured input that facilitates the capture of both global and local dependencies in log sequences.
4.3. Model Architecture
Figure 4 illustrates the overall architecture of the proposed method. The framework consists of several key modules, which transform raw system logs into structured representations through pipeline processing, extract context and sequence features, and finally achieve robust anomaly detection.
First, raw text logs are processed through a Drain log parser, which automatically extracts and clusters log event templates to form a standardized set of template identifiers. This step effectively converts heterogeneous unstructured logs into structured event template sequences, thereby reducing noise and improving subsequent modeling efficiency.
Then, the template ID sequence is mapped into dense vectors by an embedding layer and augmented with positional encodings to maintain the event order. These representations form the input tokens for the Transformer encoder, a BERT-based context representation module. Using multi-layer Transformer blocks with multi-head self-attention, the BERT encoder captures both the global semantic information of log events and long-distance contextual correlations across events.
To further enhance the model’s ability to capture temporal dependence, we stack a gated residual bidirectional LSTM (GR-BiLSTM) sequence enhancement module on the output of the Transformer encoder. This module explicitly models forward and backward dependencies in log sequences. The residual connection ensures that the global context representation provided by BERT is preserved, while GR-BiLSTM enhances the sensitivity of the model to sequential patterns. In addition, an anomaly-sensitive gating mechanism dynamically adjusts the weight given to key log events, improving the model’s ability to emphasize abnormal patterns in the sequence.
An optional attention pooling layer then aggregates sequence features into compact representations, so that the model can focus on the log events that are most indicative of anomalies.
A two-layer self-supervised Dynamic Threshold Prediction Module (DTPM) is introduced at the anomaly decision step. The fast layer captures short-term fluctuations using an exponentially weighted moving average (EWMA), and the slow layer uses a lightweight gated neural network to handle long-term calibration. Together, the two layers allow the model to adjust the anomaly detection threshold dynamically and to resist the data distribution shift caused by system updates or workload changes.
In summary, the proposed architecture integrates log parsing, embedding and positional encoding, BERT-based context modeling, GR-BiLSTM sequence enhancement, attention pooling, and a two-layer self-supervised dynamic threshold mechanism. This unified design combines deep semantic understanding, explicit sequence awareness, and robust anomaly decision making for complex log stream analysis.
4.4. Contextual Feature Encoding
In log anomaly detection, contextual dependencies across events are critical. After Drain parsing, each template ID is mapped to an embedding and combined with positional encodings to form the input tokens. These tokens are fed into a BERT-based encoder, where multi-head self-attention models global interactions among events and produces contextual representations for subsequent modules. The encoded sequence is then passed to GR-BiLSTM to further incorporate explicit sequential constraints.
Specifically, the model first converts the log template sequences produced by the Drain parser into embedded representations. Each log event template is transformed into a dense vector representation via an embedding layer, denoted as $\mathbf{e}_i \in \mathbb{R}^d$, where $d$ represents the embedding space dimension. To preserve the sequential order of log events, we add positional encodings $\mathbf{p}_i \in \mathbb{R}^d$ to each event template. These positional encodings allow the model to capture the relative positions of events within the sequence. The final representation fed into the BERT model is:
$$\mathbf{x}_i = \mathbf{e}_i + \mathbf{p}_i, \quad i = 1, \dots, w,$$
where $i$ represents the index of each event in the sequence, and $w$ is the sequence length.
The input data is then passed to the BERT encoder, where the multi-head self-attention mechanism and feed-forward network (FFN), stacked in multiple layers of Transformer blocks, enable the model to perform global contextual modeling of the log sequence. The self-attention mechanism allows the model to compute the relationships between positions within the sequence, ensuring that each log event’s representation depends not only on local information but also captures long-range dependencies. The core calculation of the attention mechanism is given by:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$
where $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimension of the key vectors. Multiple attention heads are computed in parallel to produce diverse contextual representations, which are then concatenated and linearly transformed to obtain the final global context representation for each event, denoted as $H = (\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_w)$, where $\mathbf{h}_i$ represents the feature vector of the $i$-th log event in the context.
Through multiple layers of self-attention, BERT can thoroughly explore the complex contextual relationships within the log sequence. In each Transformer layer, besides the self-attention module, a feed-forward network (FFN) is used to apply nonlinear transformations to the features. The output features of each layer are processed with residual connections and layer normalization, ensuring effective information flow and mitigating issues such as gradient vanishing.
In this manner, the BERT encoder not only preserves the local semantic information of the log events but also captures long-range temporal and semantic dependencies across the entire sequence, greatly enhancing the expressive power of log sequence modeling. The feature sequence $H = (\mathbf{h}_1, \dots, \mathbf{h}_w)$ produced by the BERT encoder serves as input to the subsequent sequence enhancement module, providing rich contextual information and high-dimensional features for further anomaly detection tasks.
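The scaled dot-product attention at the core of the encoder can be written out directly. The sketch below is a single-head, pure-Python version intended only to make the computation explicit; it omits the learned query, key, and value projections, multiple heads, masking, and batching, and the toy vectors are arbitrary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(q . k / sqrt(d_k)) weighted values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# A query aligned with the first key puts more weight on the first value.
focused = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)

# Identical keys give uniform weights, so the output is the mean of the values.
uniform = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

Because the weights for each query sum to one, every output is a convex combination of the value vectors, which is how an event representation absorbs information from any position in the window.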
4.5. Sequential Enhancement Module
Although the BERT model excels at capturing the global contextual semantics of log sequences, the inherent characteristics of the Transformer architecture limit its ability to explicitly model sequential dependencies and temporal structures. Log data often exhibit strong sequential dependencies, especially in real-world anomaly scenarios where abnormal events are frequently associated with specific patterns of event sequences. To address this limitation and enhance the model’s ability to capture explicit sequential relationships, we propose a novel Gated Residual Bidirectional LSTM (GR-BiLSTM) sequential enhancement module. This module complements the contextual representations learned by the BERT encoder and improves the accuracy of anomaly detection.
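The core design of the GR-BiLSTM module integrates bidirectional LSTM networks, residual connections, and a gating mechanism. It effectively captures the sequential dependencies among log events while dynamically adjusting the importance of different events. Concretely, the contextual feature sequence generated by the BERT encoder is denoted as $H = (\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_w)$, where each vector $\mathbf{h}_i$ encodes rich semantic information of an event but lacks explicit ordering constraints. The GR-BiLSTM first applies bidirectional LSTMs to model sequential dependencies from both forward and backward directions, producing forward and backward hidden states:
$$\overrightarrow{\mathbf{h}}_i = \overrightarrow{\mathrm{LSTM}}\big(\mathbf{h}_i, \overrightarrow{\mathbf{h}}_{i-1}\big), \qquad \overleftarrow{\mathbf{h}}_i = \overleftarrow{\mathrm{LSTM}}\big(\mathbf{h}_i, \overleftarrow{\mathbf{h}}_{i+1}\big),$$
where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ denote the forward and backward LSTM units, respectively, $\mathbf{h}_i$ is the input representation at position $i$, and $\overrightarrow{\mathbf{h}}_i$ and $\overleftarrow{\mathbf{h}}_i$ represent the forward and backward hidden states.
The final BiLSTM output is the concatenation of these two representations:
$$\mathbf{h}_i^{\mathrm{BiLSTM}} = \big[\overrightarrow{\mathbf{h}}_i \,;\, \overleftarrow{\mathbf{h}}_i\big].$$
To preserve the global contextual semantics extracted by BERT while avoiding information loss during sequential modeling, a residual connection is employed. Specifically, the BiLSTM output is combined with the original BERT output to form the enhanced sequential feature representation:
$$\mathbf{r}_i = \mathbf{h}_i + \mathbf{h}_i^{\mathrm{BiLSTM}}.$$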
This residual mechanism ensures that the global semantics captured by BERT are retained, while the explicit sequential dependencies learned by BiLSTM are incorporated, thereby enhancing the robustness of sequence modeling.
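To further improve the model’s sensitivity to abnormal events, we introduce a gating mechanism within the GR-BiLSTM module. This mechanism dynamically adjusts the importance of each log event in the feature space, assigning greater weights to anomaly-sensitive events. Formally, a gating weight $\mathbf{g}_i$ is computed as:
$$\mathbf{g}_i = \sigma\big(W_g \mathbf{r}_i + \mathbf{b}_g\big),$$
where $\mathbf{r}_i$ is the residual-enhanced feature at position $i$, $W_g$ and $\mathbf{b}_g$ denote the weight matrix and bias vector of the gating layer, respectively, and $\sigma$ represents the sigmoid activation function. The final gated feature representation is obtained as:
$$\tilde{\mathbf{h}}_i = \mathbf{g}_i \odot \mathbf{r}_i,$$
where $\odot$ denotes element-wise multiplication.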
Through this gating mechanism, the model automatically emphasizes events that are more likely related to anomalies, thereby improving its detection sensitivity and effectiveness.
In summary, the GR-BiLSTM sequential enhancement module integrates bidirectional LSTMs, residual connections, and a gating mechanism to effectively overcome the limitations of Transformer-based models in explicit sequence modeling. By strengthening the representation of sequential dependencies and adaptively highlighting critical anomalous events, this module provides richer and more discriminative features for downstream loss function optimization and anomaly decision-making.
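The residual and gating steps of the module can be sketched for a single position as follows. Several points are assumptions of this illustration rather than details confirmed by the text: the residual is taken to be an element-wise sum, which requires the BiLSTM output to have the same dimension as the BERT feature; the gate is a sigmoid-activated linear layer applied element-wise; and the BiLSTM output is passed in as a precomputed vector instead of being produced by a recurrent network. All names and numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual(bert_vec, bilstm_vec, gate_weights, gate_bias):
    """Combine BERT and BiLSTM features for one event, then apply a sigmoid gate.

    residual = bert_vec + bilstm_vec             (element-wise sum, assumed)
    gate     = sigmoid(gate_weights @ residual + gate_bias)
    output   = gate * residual                   (element-wise product)
    """
    residual = [h + b for h, b in zip(bert_vec, bilstm_vec)]
    gate = [
        sigmoid(sum(w * r for w, r in zip(row, residual)) + bias)
        for row, bias in zip(gate_weights, gate_bias)
    ]
    return [g * r for g, r in zip(gate, residual)]


bert_vec = [1.0, 2.0]
bilstm_vec = [0.5, -1.0]

# With zero weights and bias every gate value is 0.5, halving the residual feature.
half_open = gated_residual(bert_vec, bilstm_vec, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])

# A large positive bias drives the gate toward 1, passing the residual through.
open_gate = gated_residual(bert_vec, bilstm_vec, [[0.0, 0.0], [0.0, 0.0]], [50.0, 50.0])
```

In a trained model the gate parameters are learned jointly with the rest of the network, so the gate values vary per event and per feature dimension.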
4.6. Loss Function Design
In this study, we introduce two key self-supervised loss functions, Masked Log Key Prediction (MLKP) and Volume of Hypersphere Minimization (VHM), to optimize the model’s learning capability on unlabeled log data. By combining these two objectives, the model is able to capture contextual dependencies and structural patterns of log sequences during training, thereby improving anomaly detection performance and robustness.
The MLKP loss is designed to enable the model to learn contextual relationships within log sequences through a self-supervised prediction task inspired by the Masked Language Model (MLM) used in BERT. Specifically, within a log template sequence $S = (k_1, k_2, \dots, k_w)$, a subset of template positions is randomly selected for masking. These selected tokens $k_i$ are replaced by a special symbol [MASK], and the model is required to predict the original token based on the surrounding context $S_{\setminus \mathcal{M}}$, that is, the sequence with the masked positions hidden. The optimization objective minimizes the discrepancy between the predicted and the true templates using cross-entropy loss:
$$\mathcal{L}_{\mathrm{MLKP}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P\big(k_i \mid S_{\setminus \mathcal{M}}\big),$$
where $\mathcal{M}$ denotes the set of masked positions, and $P(k_i \mid S_{\setminus \mathcal{M}})$ is the predicted probability of recovering the original template $k_i$. By training on this objective, the model learns to capture global contextual dependencies across the log sequence, strengthening its capability to detect anomalies that disrupt normal event patterns.
The VHM loss is designed to optimize the distribution of normal log representations in feature space, ensuring they remain compact and well-clustered. The goal is to minimize the feature volume occupied by normal logs, thereby enhancing the separation between normal and anomalous patterns. Suppose the model generates feature representations of normal logs as $\{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N\}$, where each $\mathbf{z}_j$ is the feature vector of the $j$-th log. Let the centroid of these features be
$$\mathbf{c} = \frac{1}{N} \sum_{j=1}^{N} \mathbf{z}_j.$$
The VHM loss is then defined as:
$$\mathcal{L}_{\mathrm{VHM}} = \frac{1}{N} \sum_{j=1}^{N} \lVert \mathbf{z}_j - \mathbf{c} \rVert^2,$$
where $\lVert \mathbf{z}_j - \mathbf{c} \rVert^2$ denotes the squared Euclidean distance between a log feature $\mathbf{z}_j$ and the cluster centroid $\mathbf{c}$. By minimizing this loss, the model enforces normal log features to be concentrated in a small hyperspherical region, increasing the discriminability of anomalies that lie outside this region.
During training, MLKP and VHM are jointly optimized to balance contextual learning and distribution compactness. The overall training objective is expressed as a weighted sum of the two loss functions:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{MLKP}} + \lambda_2 \mathcal{L}_{\mathrm{VHM}},$$
where $\lambda_1$ and $\lambda_2$ are hyperparameters that control the relative contributions of the two objectives. Through this joint optimization, the model simultaneously learns contextual semantics and structural distribution patterns of log sequences, enabling effective discrimination between normal and anomalous logs without requiring labeled anomaly data.
In summary, the MLKP task enhances the model’s understanding of log sequence context, while the VHM task improves feature compactness for normal logs in latent space. Together, they provide a strong self-supervised learning framework that supports robust anomaly detection under real-world conditions.
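The two objectives and their weighted sum can be computed on toy values as in the sketch below. It assumes that both losses are averaged (over masked positions and over sequences, respectively), that the centroid is recomputed from the features passed in, and that predicted distributions are given as dictionaries from template ID to probability. The function names and the default weights are hypothetical and are not the settings used in the experiments.

```python
import math


def mlkp_loss(predicted_probs, true_ids, eps=1e-12):
    """Mean negative log-probability assigned to the true template at masked positions."""
    total = 0.0
    for probs, true_id in zip(predicted_probs, true_ids):
        total -= math.log(max(probs.get(true_id, 0.0), eps))
    return total / len(true_ids)


def vhm_loss(features):
    """Mean squared Euclidean distance of feature vectors to their centroid."""
    n, dim = len(features), len(features[0])
    center = [sum(f[j] for f in features) / n for j in range(dim)]
    return sum(
        sum((f[j] - center[j]) ** 2 for j in range(dim)) for f in features
    ) / n


def total_loss(mlkp, vhm, lambda_mlkp=1.0, lambda_vhm=0.1):
    """Weighted sum of the two objectives; the default weights are placeholders."""
    return lambda_mlkp * mlkp + lambda_vhm * vhm


# One masked position whose true template (ID 1) received probability 0.5.
mlkp = mlkp_loss([{1: 0.5, 2: 0.5}], [1])

# Two sequence features at distance 1 from their centroid (1.0, 0.0).
vhm = vhm_loss([[0.0, 0.0], [2.0, 0.0]])

loss = total_loss(mlkp, vhm)
```

A confident, correct prediction drives the first term toward zero, while tightly clustered features drive the second toward zero, so the weights trade off predictive accuracy against compactness.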
4.7. Dynamic Threshold Prediction Module
A central issue in log anomaly detection lies in deciding whether a given log sequence should be regarded as normal or anomalous. Many existing methods make this decision by applying fixed thresholds to anomaly scores or reconstruction errors. In practice, however, such static thresholds are highly sensitive to shifts in log data distributions, which commonly arise from system upgrades, workload variations, or changes in the runtime environment. Under these conditions, fixed decision boundaries often lead to performance degradation, manifested as increased false positives or false negatives. To address this problem, this work introduces a Dynamic Threshold Prediction Module (DTPM), which adjusts anomaly decision boundaries in an adaptive manner according to the statistical characteristics of the incoming log data.
The core idea of DTPM is to combine short-term statistical adaptation with long-term drift calibration, thereby providing a robust and self-supervised anomaly decision mechanism. Specifically, the anomaly score of a log sequence is denoted as $s$. A naive static decision rule would classify the sequence as anomalous if $s > \tau$, where $\tau$ is a fixed threshold. Instead, our module dynamically estimates the threshold based on data statistics and adaptive modeling.
First, a fast adaptation layer based on Exponentially Weighted Moving Average (EWMA) is employed to track short-term variations in the anomaly score distribution. The adaptive threshold estimate at time $t$ is defined as:
$$\tau_t^{\mathrm{fast}} = \alpha\, s_t + (1 - \alpha)\, \tau_{t-1}^{\mathrm{fast}},$$
where $s_t$ is the anomaly score observed at time $t$ and $\alpha \in (0, 1)$ is a smoothing parameter controlling the influence of recent scores. This component ensures that the threshold can quickly respond to sudden shifts caused by temporary load fluctuations or bursty workloads.
Second, a slow calibration layer is introduced to handle long-term distribution drift. We design a lightweight gated neural network that takes historical anomaly score statistics as input and outputs a calibration factor $\gamma_t$. The final threshold adjustment is then given by:
$$\tau_t = \gamma_t\, \tau_t^{\mathrm{fast}},$$
where $\gamma_t$ adaptively corrects for systematic shifts in the score distribution, ensuring that the model maintains stable performance even in evolving environments. The gating mechanism inside the neural network regulates the influence of different statistical features, allowing the system to emphasize anomaly-sensitive signals.
The final decision rule of the dynamic thresholding module is:
where
denotes that the sequence is classified as anomalous, and
denotes a normal sequence.
By combining short-term EWMA-based adaptation with long-term gated calibration, the DTPM effectively mitigates the limitations of static thresholds. It enables the model to self-adapt to concept drift in log distributions without requiring labeled anomaly data, thereby improving both robustness and generalization. Furthermore, the lightweight design ensures that threshold updates can be computed efficiently, making the module suitable for real-time log stream processing in large-scale industrial systems.
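The two-layer design described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the fast layer is a standard EWMA over anomaly scores and that the calibration factor multiplies the EWMA estimate. The learned gated network is replaced here by a hand-written sigmoid gate over the coefficient of variation of recent scores. The names `DynamicThreshold`, `window`, and `margin` are hypothetical.

```python
import math
import statistics
from dataclasses import dataclass, field
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class DynamicThreshold:
    alpha: float = 0.2     # EWMA smoothing parameter (fast layer)
    window: int = 50       # hypothetical: number of past scores kept for the slow layer
    margin: float = 3.0    # hypothetical: scales the calibration correction
    ewma: float = 0.0
    threshold: float = 0.0
    history: List[float] = field(default_factory=list)

    def calibration_factor(self) -> float:
        """Stand-in for the gated network: maps score statistics to a factor >= 1."""
        if len(self.history) < 2:
            return 1.0
        mean = statistics.fmean(self.history)
        std = statistics.pstdev(self.history)
        spread = std / (abs(mean) + 1e-8)   # coefficient of variation
        gate = sigmoid(spread)              # gate in [0.5, 1) weights the feature
        return 1.0 + gate * self.margin * spread

    def update(self, score: float) -> int:
        """Consume one anomaly score and return 1 (anomalous) or 0 (normal)."""
        gamma = self.calibration_factor()   # slow layer sees past scores only
        if not self.history:
            self.ewma = score               # initialise the EWMA with the first score
        else:
            self.ewma = self.alpha * score + (1.0 - self.alpha) * self.ewma
        self.threshold = gamma * self.ewma
        self.history.append(score)
        self.history = self.history[-self.window:]
        return int(score > self.threshold)
```

In this sketch the EWMA already contains the current score, so the calibration factor must exceed 1 for a steady stream to stay below the threshold. A trained gated network would learn this correction rather than use a fixed `margin`.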
4.8. Anomaly Decision and Output
After feature extraction and representation learning by the contextual encoder and sequential enhancement modules, the model produces an anomaly score for each log sequence. The final step is to decide whether a given sequence should be classified as normal or anomalous, using the threshold supplied by the dynamic thresholding module.
In the proposed framework, two complementary self-supervised tasks—Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM)—are used to generate reliable anomaly scores. The MLKP task enables the model to learn semantic and contextual dependencies within log sequences. If the model fails to correctly predict masked log keys, this reflects a deviation from learned normal patterns. Consequently, the prediction loss of MLKP is used as an anomaly score, with larger loss values indicating a higher likelihood of abnormality.
At the same time, the VHM task encourages the representations of normal log sequences to remain compact in the latent space. By reducing intra-class variance, normal samples are constrained to lie within a hyperspherical region centered at a centroid. Sequences whose representations fall outside this region lie farther from the centroid, and this distance is also treated as an anomaly score. Together, these two tasks form a robust self-supervised framework, where MLKP captures semantic deviations and VHM focuses on structural deviations.
Formally, given a log sequence $x$, the overall anomaly score $S(x)$ is defined as a weighted combination of the MLKP and VHM components:
$$S(x) = S_{\mathrm{MLKP}}(x) + \lambda\, S_{\mathrm{VHM}}(x)$$
where $\lambda$ is a balancing coefficient. This combined score integrates semantic prediction difficulty and spatial deviation, offering a comprehensive measurement of anomaly likelihood.
The anomaly score $S(x)$ is then passed to the Dynamic Thresholding Prediction Module (DTPM), which adaptively determines the threshold $\tau_t$ according to both short-term fluctuations and long-term drift in log distributions. The final classification decision is made as:
$$\hat{y}(x) = \begin{cases} 1, & S(x) > \tau_t \\ 0, & \text{otherwise} \end{cases}$$
where $\hat{y}(x) = 1$ indicates that the log sequence $x$ is anomalous, and $\hat{y}(x) = 0$ indicates normal behavior.
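As an illustration of how the two score components could be combined, the following Python sketch computes an MLKP term as the mean negative log-likelihood of the true log keys at masked positions and a VHM term as the squared Euclidean distance to the hypersphere center. The additive form with a single weight `lam`, the default value 0.2, and all function names are assumptions made for this example; the exact formulation used in the experiments may differ.

```python
import math
from typing import Dict, List, Sequence


def mlkp_score(pred_probs: List[Dict[str, float]], true_keys: Sequence[str]) -> float:
    """Mean negative log-likelihood of the true log keys at the masked positions."""
    eps = 1e-12  # avoids log(0) when the true key receives no probability mass
    losses = [-math.log(max(p.get(k, 0.0), eps)) for p, k in zip(pred_probs, true_keys)]
    return sum(losses) / len(losses)


def vhm_score(embedding: Sequence[float], center: Sequence[float]) -> float:
    """Squared Euclidean distance between a sequence embedding and the center."""
    return sum((e - c) ** 2 for e, c in zip(embedding, center))


def anomaly_score(pred_probs: List[Dict[str, float]], true_keys: Sequence[str],
                  embedding: Sequence[float], center: Sequence[float],
                  lam: float = 0.2) -> float:
    """Assumed combination: MLKP term plus lam times the VHM term."""
    return mlkp_score(pred_probs, true_keys) + lam * vhm_score(embedding, center)


# Toy predictions over two masked positions (hypothetical log keys E1 and E2).
probs = [{"E1": 0.9, "E2": 0.1}, {"E1": 0.2, "E2": 0.8}]
expected = anomaly_score(probs, ["E1", "E2"], [0.1, 0.0], [0.0, 0.0])
unexpected = anomaly_score(probs, ["E2", "E1"], [2.0, 1.0], [0.0, 0.0])
```

A sequence whose masked keys are predicted poorly and whose embedding lies far from the center receives the larger score, which is the behavior the combined score is meant to capture.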
Finally, anomaly detection results can be reported at the sequence level or further aggregated to the system level, depending on specific application requirements. In real-time monitoring scenarios, detected anomalies are immediately flagged and can be integrated with alerting or visualization systems to support rapid diagnosis. In offline analysis settings, anomaly labels can be leveraged for subsequent root cause analysis, trend identification, or system optimization.
By integrating contextual encoding, sequential enhancement, dual self-supervised objectives, and adaptive thresholding, the proposed framework provides a reliable and interpretable anomaly decision process. This design enables robust performance under dynamic log distributions while preserving the efficiency needed for deployment in large-scale real-world systems.
5. Experiments
5.1. Experimental Setup
This section introduces the experimental setup used to evaluate the effectiveness of the proposed method. Three publicly available datasets commonly used in log anomaly detection research are selected, namely HDFS, BGL, and Thunderbird. These datasets span different application domains and exhibit diverse log formats, providing a comprehensive benchmark for evaluating model performance in a range of real-world scenarios. The following subsections describe the datasets, preprocessing steps, experimental environment, parameter configurations, and baseline methods in detail.
Datasets. We evaluate the proposed method on three widely used benchmark datasets for log anomaly detection: HDFS, BGL, and Thunderbird. The overall statistics of the datasets are summarized in Table 1.
Hadoop Distributed File System (HDFS) [34]. The HDFS dataset is generated by executing Hadoop-based MapReduce jobs on Amazon EC2 clusters and is manually labeled using handcrafted rules to identify anomalous events. It contains 11,172,157 raw log messages, among which 284,818 are labeled as anomalous. For HDFS, log messages are grouped into log sequences based on the session (block) identifiers associated with each log entry. This session-based grouping reflects the execution semantics of distributed jobs. The resulting log sequences have an average length of 19 log events. All normal log sequences in the training split are used for model training.
BlueGene/L Supercomputer System (BGL) [35]. The BGL dataset is collected from a BlueGene/L supercomputer system at Lawrence Livermore National Laboratory (LLNL). Log messages are annotated using alert category tags, where alert messages are treated as anomalous events. The dataset consists of 4,747,963 log messages, including 348,460 anomalous entries. For BGL, log sequences are constructed using a time-based sliding window of 5 min, which captures temporal correlations among system events. The average length of the resulting log sequences is approximately 562 log events. All normal sequences in the training split are used for unsupervised learning.
Thunderbird [35]. Thunderbird is a large-scale log dataset collected from a supercomputer system. To ensure computational feasibility while preserving the temporal characteristics of the data, we use a subset consisting of the first 20,000,000 log messages from the original dataset, among which 758,562 messages are labeled as anomalous. Log sequences are generated using a time-based sliding window of 1 min, resulting in sequences with an average length of approximately 326 log events. Similar to the other datasets, all normal log sequences in the training split are used to learn normal system behavior.
Data Split Strategy. For each dataset, log sequences are divided into training, validation, and testing sets following the experimental protocol commonly adopted in unsupervised log anomaly detection.
The training set consists exclusively of normal log sequences, which are used to learn the patterns of normal system behavior. In line with established experimental settings, the number of normal log sequences used for training is on the order of several thousand for each dataset. A small subset of the training data is further reserved as a validation set for hyperparameter tuning, such as threshold selection.
The testing set contains both normal and anomalous log sequences and is used solely for performance evaluation. Anomaly labels are not used during training or validation.
Sequence construction. For HDFS, raw log messages are grouped into log sequences according to the session (block) identifiers associated with each log entry. For BGL and Thunderbird, log sequences are constructed using time-based sliding windows of 5 min and 1 min, respectively, following standard practice in log anomaly detection.
The resulting sequences are directly used as the model inputs. For sequences shorter than the maximum input length supported by the model, padding is applied. Sequences exceeding the maximum length are truncated to preserve the most recent log events.
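The sequence construction steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: events are given as already-parsed template identifiers, the time windows advance by one full window length unless a `step_s` is supplied (the stride is not specified in the text), and the padding identifier `PAD = 0` and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

PAD = 0  # hypothetical padding identifier


def group_by_session(events: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """HDFS-style grouping: collect event keys per session (block) identifier."""
    sessions: Dict[str, List[int]] = defaultdict(list)
    for block_id, key in events:
        sessions[block_id].append(key)
    return dict(sessions)


def time_windows(events: List[Tuple[float, int]], window_s: float,
                 step_s: Optional[float] = None) -> List[List[int]]:
    """BGL/Thunderbird-style grouping of (timestamp, key) pairs sorted by time."""
    if not events:
        return []
    step = window_s if step_s is None else step_s
    start, end = events[0][0], events[-1][0]
    sequences = []
    t = start
    while t <= end:
        seq = [key for ts, key in events if t <= ts < t + window_s]
        if seq:  # skip windows that contain no events
            sequences.append(seq)
        t += step
    return sequences


def fit_length(seq: List[int], max_len: int) -> List[int]:
    """Truncate to the most recent max_len events, or right-pad shorter sequences."""
    if len(seq) >= max_len:
        return seq[-max_len:]
    return seq + [PAD] * (max_len - len(seq))
```

For example, `time_windows(events, window_s=300)` would correspond to the 5 min setting used for BGL and `window_s=60` to the 1 min setting used for Thunderbird.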
Experimental platform. All experiments were conducted on a server equipped with an NVIDIA RTX 3090 GPU (24 GB memory), an Intel Xeon Silver 4210 CPU, and 128 GB of RAM, running Ubuntu 20.04. The implementation was based on PyTorch 1.12 and Python 3.8. To ensure reproducibility and stability, all software dependencies were strictly version-controlled. Log parsing was carried out using the Drain tool, which efficiently converts unstructured log text into structured event template identifiers for model training and evaluation.
Model parameter settings. The BERT encoder followed the standard configuration with 12 Transformer layers, a hidden size of 768, and 12 attention heads. For the Gated Residual BiLSTM (GR-BiLSTM) module, we employed two BiLSTM layers with 128 hidden units each, balancing performance and efficiency. In the Dynamic Thresholding Prediction Module (DTPM), the fast adaptation layer was implemented using the Exponentially Weighted Moving Average (EWMA) method with a smoothing factor of 0.2, while the slow calibration layer was realized as a gated neural network with 64 hidden units. The model was optimized using the Adam optimizer with a learning rate of and a batch size of 128. In the total loss function, the weights for the Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM) tasks were set to 1.0 and 0.2, respectively, ensuring a balanced contribution.
Baseline methods. We compared our model against the following representative baseline methods:
PCA: Principal Component Analysis reduces dimensionality and extracts dominant features for anomaly detection.
DeepLog: An LSTM-based method that models sequential dependencies among log events and detects anomalies via prediction error.
LogAnomaly: An unsupervised method that detects sequential and quantitative anomalies from log event sequences.
LogBERT: A BERT-based model that captures contextual semantics of logs through self-supervised learning.
LAnoBERT: An extension of LogBERT that integrates LSTM to enhance sequential dependency modeling.
LogGPT: A generative pre-trained Transformer (GPT)-based model that leverages generative tasks for semantic representation and anomaly detection.
LogFiT: A fine-tuning based approach that adapts pre-trained language models such as BERT for log anomaly detection.
5.2. Evaluation Metrics
To comprehensively assess the performance of the proposed method on log anomaly detection tasks, three commonly used evaluation metrics are adopted: Precision, Recall, and F1-score. These metrics are well suited to anomaly detection scenarios, where the data distribution is often highly imbalanced.
Precision reflects the proportion of correctly identified anomalies among all instances predicted as anomalous. A higher precision indicates a lower false alarm rate, meaning that a larger fraction of the detected anomalies correspond to true anomalous events. It is defined as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
where $TP$ (True Positives) denotes the number of correctly predicted anomalous log sequences, and $FP$ (False Positives) denotes the number of normal log sequences incorrectly predicted as anomalous.
Recall evaluates the model’s ability to identify true anomalies from all actual anomalies. A higher recall indicates that the model can detect a greater portion of anomalous logs, though it may come at the cost of higher false alarms. It is defined as:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where $FN$ (False Negatives) denotes the number of anomalous log sequences that were mistakenly classified as normal.
F1-score is the harmonic mean of precision and recall, providing a balanced evaluation when there is a trade-off between the two metrics. It is particularly useful in imbalanced settings, as it ensures that neither precision nor recall is neglected. The F1-score is defined as:
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
By jointly considering these three metrics, we obtain a comprehensive evaluation of the model’s performance in log anomaly detection. This allows us to assess not only the accuracy of anomaly identification but also the model’s robustness in detecting rare yet critical anomalies under imbalanced data conditions.
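For reference, the three metrics can be computed from sequence-level labels as in the short Python sketch below. The function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are choices made for this example, with label 1 denoting an anomalous sequence.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, Recall, and F1-score for binary labels (1 = anomalous, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```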
5.3. Experimental Results and Analysis
This section reports the experimental results of the proposed BERT-LogAnom model on three benchmark datasets and compares its performance with several state-of-the-art baseline methods. Precision, Recall, and F1-score are used as evaluation metrics to systematically evaluate the effectiveness of each approach in log anomaly detection. Overall, the results indicate that BERT-LogAnom consistently outperforms existing methods, especially in its ability to capture complex anomaly patterns in log data.
As shown in Table 2, BERT-LogAnom consistently outperforms baseline methods across all three datasets. On HDFS, BERT-LogAnom achieves an F1-score of 89.98%, surpassing LogBERT (85.56%) and LAnoBERT (85.80%). This improvement demonstrates the effectiveness of the GR-BiLSTM and dynamic thresholding modules in capturing both contextual and sequential dependencies. On BGL, our model obtains an F1-score of 89.84%, which is higher than LogBERT (88.39%) and LogFiT (86.89%), confirming its capability to handle complex and large-scale system logs. On the most challenging Thunderbird dataset, BERT-LogAnom achieves an F1-score of 94.39%, outperforming LogBERT (92.68%), LAnoBERT (92.89%), and LogFiT (92.81%). These results highlight the robustness of our model in heterogeneous and user-driven log environments.
In terms of precision, BERT-LogAnom reaches 96.17% on Thunderbird, the highest among all methods, significantly reducing false positives. At the same time, the recall of 92.68% indicates that the model successfully identifies the majority of anomalies. By contrast, while LogAnomaly achieves competitive recall on Thunderbird (98.74%), its precision drops to 85.69%, leading to a less favorable F1-score. This further confirms the balanced performance of our model across precision and recall.
Among the baseline models, LogBERT and LAnoBERT perform well in capturing contextual semantics, but they struggle with sequential modeling and adaptive thresholding, leading to slightly lower scores compared with our approach. LogGPT and LogFiT also deliver competitive results by leveraging generative and fine-tuning strategies, respectively. Nevertheless, BERT-LogAnom achieves superior performance by combining BERT contextual representation, GR-BiLSTM sequential enhancement, and DTPM adaptive thresholding, leading to consistent improvements across datasets.
5.4. Statistical Robustness and Stability Analysis
Deep learning–based log anomaly detection models are inherently influenced by stochastic factors such as random parameter initialization and mini-batch optimization. To ensure that the reported results reflect the intrinsic performance characteristics of the proposed method rather than incidental randomness, we further analyze the robustness and stability of the experimental outcomes.
For each dataset, the proposed model is trained and evaluated five times with different random seeds, while keeping the dataset splits, model architecture, and all hyperparameters unchanged. For each run, Precision, Recall, and F1-score are computed on the test set. We report the mean and standard deviation of these evaluation metrics across the five runs to characterize performance stability.
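The aggregation over repeated runs can be sketched as follows. The function name `summarize_runs` is hypothetical, the use of the sample standard deviation (division by n - 1) is an assumption since the estimator is not stated, and the numbers are placeholders for illustration only, not results from the experiments.

```python
import statistics
from typing import Dict, List


def summarize_runs(runs: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Mean and sample standard deviation of each metric over repeated runs."""
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = {
            "mean": statistics.fmean(values),
            "std": statistics.stdev(values),  # sample std (n - 1); an assumption
        }
    return summary


# Placeholder values for five seeds; illustrative only, not reported results.
placeholder_runs = [{"f1": 0.90}, {"f1": 0.89}, {"f1": 0.91}, {"f1": 0.90}, {"f1": 0.90}]
summary = summarize_runs(placeholder_runs)
```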
As shown in Table 3, the proposed method demonstrates stable performance on all three datasets, with relatively small standard deviations across evaluation metrics. This indicates that the observed performance improvements are consistent across different random initializations and confirms the robustness of the proposed approach.
To ensure reproducibility, dataset splits and hyperparameters are fixed across experiments; randomness is introduced only through parameter initialization and optimization, which is explicitly analyzed in the stability evaluation (five independent runs).
In summary, the experimental results validate that BERT-LogAnom provides a more robust, precise, and adaptive solution to log anomaly detection compared with existing approaches, particularly in large-scale and heterogeneous environments.
5.5. Temporal Concept Drift Evaluation
To evaluate the effectiveness of the proposed Dynamic Thresholding Prediction Module (DTPM) under temporal distribution shifts, we conduct an additional time-aware evaluation using a chronological data split. For each dataset, log sequences are ordered by timestamp, where normal sequences from an earlier time period are used for training, and sequences from a later period containing both normal and anomalous samples are used for testing. No information from the testing period is used during training or threshold estimation.
We compare the proposed method with and without DTPM to isolate the effect of dynamic thresholding under temporal concept drift. As shown in Table 4, the fixed-threshold variant suffers performance degradation due to distribution mismatch, while the model equipped with DTPM consistently achieves higher F1-scores across all datasets. This demonstrates that DTPM effectively adapts the anomaly decision threshold to evolving score distributions and improves robustness in non-stationary environments.
Following the chronological evaluation protocol, the model is trained exclusively on normal log sequences from an earlier time period, while evaluation is performed on sequences from a subsequent time period that contain both normal and anomalous samples. For the fixed-threshold setting, a single global threshold is estimated from the training data and applied uniformly during testing. In contrast, the proposed DTPM dynamically predicts anomaly thresholds for each sequence based on contextual information, enabling adaptive decision making under temporal distribution shifts.
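A minimal Python sketch of this chronological protocol and of the fixed-threshold baseline is given below. The record fields (`time`, `label`), the function names, and the rule of placing the global threshold at the training mean plus `k` standard deviations are assumptions for illustration; the text does not specify how the fixed threshold is estimated.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Sequence, Tuple


def chronological_split(seqs: List[Dict], cutoff: float) -> Tuple[List[Dict], List[Dict]]:
    """Earlier normal sequences form the training set; all later sequences form the test set."""
    train = [s for s in seqs if s["time"] < cutoff and s["label"] == 0]
    test = [s for s in seqs if s["time"] >= cutoff]
    return train, test


def fixed_threshold(train_scores: Sequence[float], k: float = 3.0) -> float:
    """Hypothetical global threshold: mean plus k standard deviations of training scores."""
    return fmean(train_scores) + k * pstdev(train_scores)


# Toy records: label 1 marks an anomalous sequence.
records = [{"time": t, "label": y} for t, y in zip(range(1, 7), [0, 1, 0, 0, 1, 0])]
train, test = chronological_split(records, cutoff=4)
```

The fixed threshold is computed once from the training period and never updated, which is why it can become miscalibrated when the score distribution shifts in the later period.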
5.6. Ablation Studies and Module Effectiveness
To rigorously assess the contribution of the two core components in BERT-LogAnom—the Gated Residual Bidirectional LSTM (GR-BiLSTM) sequential enhancement and the Dynamic Thresholding Prediction Module (DTPM)—we conduct ablation studies by selectively removing each module (and both) while keeping all other settings identical to the full model. Performance is evaluated on HDFS, BGL, and Thunderbird using Precision, Recall, and F1-score.
Effect of removing GR-BiLSTM.
Table 5 reports the results when the GR-BiLSTM module is removed. Across all three datasets, performance drops are observed, with the largest decline on HDFS (F1 from 89.98% to 86.61%). This confirms that explicitly modeling bidirectional sequential dependencies, coupled with residual fusion and gating, is crucial for capturing order-sensitive anomalies beyond what contextual encoding alone can provide.
Effect of removing DTPM.
Table 6 shows the impact of removing the dynamic thresholding module. The performance consistently decreases, especially on HDFS and BGL (F1 drops of approximately 3 and 2.8 points, respectively). The reduction is smaller on Thunderbird (F1 from 94.39% to 93.44%), suggesting that while contextual and sequential modeling already provides strong separation, adaptive thresholding further stabilizes decisions under distributional shifts and load variability.
Effect of removing both modules.
Table 7 summarizes the results when both GR-BiLSTM and DTPM are removed. The compounded degradation is evident, with F1 decreasing to 86.16% on HDFS and 85.62% on BGL. This demonstrates the complementarity of the two components: GR-BiLSTM strengthens explicit sequence modeling and anomaly-sensitive emphasis, while DTPM enhances robustness to data drift via adaptive decision boundaries.
The ablation results establish that both modules materially contribute to the final performance, and their joint effect is greater than either alone. In particular, the GR-BiLSTM contributes larger gains on datasets with stronger order dependencies (e.g., HDFS), while DTPM yields notable benefits where distribution drift and non-stationarity are more pronounced (e.g., BGL). On Thunderbird, where semantic context is already highly informative, both modules still provide measurable improvements, with DTPM further reducing decision instability.
5.7. Parameter Sensitivity Analysis
To evaluate the sensitivity of BERT-LogAnom to key hyperparameters, we conducted experiments by varying important parameters and observing their impact on model performance. The parameters examined include the number of hidden units and LSTM layers in the Gated Residual Bidirectional LSTM (GR-BiLSTM) module, the smoothing factor in the Dynamic Thresholding Prediction Module (DTPM), and the weight ratio between the Masked Log Key Prediction (MLKP) and Volume Hypersphere Minimization (VHM) tasks.
Figure 5a shows the impact of varying the hidden units in the GR-BiLSTM module. As the number of hidden units increases from 32 to 128, both Precision and Recall improve significantly, indicating that richer sequential representations benefit anomaly recognition. Beyond 128 units, however, performance slightly decreases, suggesting that excessive capacity may introduce redundant information and overfitting.
Figure 5b presents the effect of the smoothing factor (α) in the Dynamic Thresholding Prediction Module (DTPM). The optimal performance occurs around α = 0.2, where the model achieves the best balance between rapid adaptation to short-term fluctuations and stability under long-term drift. Too small a value results in delayed response, whereas too large a value causes instability.
Figure 5c analyzes the effect of the MLKP loss weight (λₘₗₖₚ) on detection accuracy. Increasing λₘₗₖₚ up to 1.2 enhances the model’s discriminative ability by strengthening contextual learning, but further increments lead to marginal declines as the VHM loss becomes underweighted. Overall, these results confirm that BERT-LogAnom maintains stable performance across a wide range of hyperparameters, demonstrating good robustness and generalization.
5.8. Running Efficiency Analysis
This section evaluates the running efficiency of BERT-LogAnom on the HDFS dataset. We compare its training time and inference speed with several baseline models: PCA, DeepLog, LogAnomaly, LogBERT, LAnoBERT, LogGPT, and LogFiT.
Table 8 provides a comparison of training time and inference speed for each model on the HDFS dataset. The results show that BERT-LogAnom achieves competitive training time despite its complex architecture, with 11,000 s per epoch. While more sophisticated models like LogGPT and LogFiT require longer training times (up to 18,000 s and 16,000 s, respectively), BERT-LogAnom maintains a balance between performance and efficiency.
Although BERT-LogAnom is built upon the BERT architecture, which is typically associated with high computational cost, it still achieves efficient processing performance. The training time of the model is comparable to that of other deep learning–based approaches, such as LogBERT and LAnoBERT, which have similar model complexity but require slightly longer training time. In addition, an inference time of 0.9 s per log indicates that BERT-LogAnom is capable of supporting real-time log stream processing, making it suitable for large-scale anomaly detection applications.