Article

BiGRMT: Bidirectional GRU–Recurrent Memory Transformer for Efficient Long-Sequence Anomaly Detection in High-Concurrency Microservices

State Grid Shandong Electric Power Company Tai’an Power Supply Company, Tai’an 271000, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4754; https://doi.org/10.3390/electronics14234754
Submission received: 28 October 2025 / Revised: 21 November 2025 / Accepted: 23 November 2025 / Published: 3 December 2025

Abstract

In high-concurrency distributed systems, log data often exhibit sequence uncertainty and redundancy, which pose significant challenges to the accuracy and efficiency of anomaly detection. To address these issues, we propose BiGRMT, a hybrid architecture that integrates a Bidirectional Gated Recurrent Unit (Bi-GRU) with a Recurrent Memory Transformer (RMT). BiGRMT enhances local temporal feature extraction through bidirectional modeling and adaptive noise filtering in the Bi-GRU, while the RMT component significantly extends the model's capacity for long-sequence modeling via segment-level memory. The Transformer's multi-head attention mechanism continues to capture global temporal dependencies, now with improved efficiency due to the RMT's memory-sharing design. Extensive experiments on three benchmark datasets from LogHub (Spark, BGL (Blue Gene/L), and HDFS (Hadoop Distributed File System)) demonstrate that BiGRMT achieves strong results in terms of precision, recall, and F1-score. It attains a precision of 0.913, outperforming LogGPT (0.487) and slightly exceeding the Temporal Logical Attention Network (TLAN) (0.912). Compared with LogPal, which prioritizes detection accuracy, BiGRMT strikes a better balance by significantly reducing computational overhead while maintaining high detection performance. Even under challenging conditions such as a 50% increase in log generation rate or 20% injected noise, BiGRMT maintains F1-scores of 87.4% and 83.6%, respectively, demonstrating excellent robustness. These findings confirm that BiGRMT is a scalable and practical solution for automated fault detection and intelligent maintenance in complex distributed software systems.

1. Introduction

The rapid development of big data and artificial intelligence has accelerated the adoption of microservice architecture, which decomposes monolithic software systems into fine-grained, business-oriented services [1,2]. Microservices provide high scalability, maintainability, and rapid deployment and communicate through lightweight protocols in distributed, loosely coupled environments. Over time, these systems evolve into complex software ecosystems, with heterogeneous services, diverse implementation technologies, and numerous runtime instances whose states frequently change.
The operation and maintenance of such complex and dynamic systems presents significant challenges. Ensuring stable and continuous operation requires substantial resources and effective monitoring [3]. In this context, artificial intelligence for IT operations (AIOps) has emerged as a key solution. Among its core tasks, log-based anomaly detection is critical for ensuring system reliability, business continuity, and data security. System logs record operational events, status changes, and error messages, providing a rich basis for diagnosing system health. Automated log anomaly detection not only identifies abnormal events but also supports root cause analysis, predicts potential failures, and enables timely interventions [4,5].
However, the increasing scale and complexity of microservice systems create two major challenges for log anomaly detection. First, microservices generate massive volumes of logs, requiring real-time analysis and efficient processing. Second, logs contain complex semantics and redundancy, arising from multi-threaded execution, concurrent processing, and heterogeneous services. Redundant or noisy logs can obscure critical abnormal events, making anomaly detection difficult. These challenges highlight the limitations of existing methods and motivate the development of more effective approaches.
To address these issues, this paper proposes a BiGRMT model that integrates Bi-GRU for local temporal feature extraction, a redundancy filtering mechanism to remove repeated logs, and a Recurrent Memory Transformer (RMT) to capture long-range global dependencies. By combining local and global modeling with noise reduction, the proposed approach improves both the accuracy and efficiency of log anomaly detection in microservice environments.
The main contributions of this work are summarized as follows: (1) We identify the key challenges in log anomaly detection for distributed microservice systems, highlighting the limitations of existing methods in capturing long-range dependencies and filtering redundant information. (2) We propose BiGRMT, a novel framework that combines Bi-GRU with a Recurrent Memory Transformer and introduces a redundancy filtering mechanism, enabling effective modeling of long log sequences and improving anomaly detection accuracy. (3) We conduct comprehensive experiments on multiple real-world microservice log datasets, demonstrating not only superior detection performance but also high efficiency, validating the framework's practical applicability in real-time, high-throughput systems.

2. Related Work

Owing to the complexity of distributed software, detecting abnormal events from log data has become an important task in the intelligent operation and maintenance of distributed software systems. Log-based anomaly detection methods fall into two main categories: traditional machine learning and deep learning.
Traditional machine learning methods extract information from logs produced during system operation and then apply classification or clustering algorithms to detect abnormal events. Converting the text in log files into log keys has become an effective basis for automatic log-based anomaly detection [6]. LogCluster was proposed to address the time-consuming and error-prone nature of fault discovery in large-scale service systems. By clustering logs, it simplifies fault identification: only a small number of representative log sequences need to be inspected to accurately identify problems, reducing the number of logs that must be checked and improving fault detection accuracy [7]. However, as the deployment scale of distributed software systems grows, log files increase in size and diversity, and traditional machine learning methods struggle to cope with complex and changing operation and maintenance scenarios.
Owing to its unsupervised learning capability, deep learning has gradually become the standard approach for log anomaly detection. Deep learning methods fall into two categories: classical deep learning and graph neural networks. By building more complex models, classical deep learning extracts features from log data automatically, reduces reliance on manual feature engineering, and improves anomaly detection performance. PLELog, a GRU neural network model with an attention mechanism, was proposed to alleviate the time-consuming manual labeling problem; it achieves semi-supervised learning by combining probabilistic label estimation with historical anomaly knowledge [8]. NeuralLog [9] bypasses the traditional log parsing step and extracts semantic information directly from raw logs, avoiding interference from parsing errors. A log anomaly detection framework based on ChatGPT (version: GPT-4o mini) was proposed in response to the continuous growth of log data, aiming to detect zero-shot and few-shot anomalies efficiently, improve detection accuracy, and reduce the need for labeled data; however, its reliance on the ChatGPT API results in high computational overhead and certain performance limitations [10]. TLAN [11], a Transformer-based model for logs with complex event dependencies, uses a multi-head attention mechanism to capture the global temporal dependencies of log sequences. Although suitable for distributed, high-concurrency environments, its computational complexity is high, and it struggles with large log files. MultiLog is an innovative multivariate log anomaly detection method for distributed databases; it integrates log embedding, self-attention mechanisms, and LSTM to extract key features, together with clustering classifiers, to detect anomalies efficiently in multi-node and single-node settings [12]. LogPal, a log pattern event generation method that combines template sequences with original log sequences, addresses the anomaly labeling problem caused by massive heterogeneous logs; its improved self-attention mechanism allows the Transformer to adapt automatically to different log types and improve detection accuracy [13]. Traditional deep learning methods have made significant progress in log anomaly detection, but two obvious limitations remain: (1) modeling complex software component structures is limited, because log feature extraction relies on static text and ignores the global topological structure; and (2) deep learning models have poor interpretability, which limits their application and promotion in system anomaly detection.
With graph data structures at their core, graph neural networks are well suited to non-Euclidean data and can capture complex events together with their global and contextual information. They therefore have advantages in handling the dynamic changes and structural irregularities of logs and represent a new research direction. LogGD converts log sequences into graph structures and integrates graph structure information with node semantics, enabling more accurate log anomaly detection and improving the efficiency and accuracy of system fault diagnosis by learning the spatial structural relationships between log events [14]. AutoLog, a semi-supervised deep autoencoder, is trained by periodically sampling logs and computing numerical scores; future log scores can then be classified by the encoder. Because workloads and systems in real environments are highly dynamic and variable, establishing a stable and normative baseline remains challenging [15]. Graph neural networks can model the global topology effectively by taking the logical structure of software components into account, but limitations remain when processing logs in high-concurrency execution environments of distributed systems with microservice architectures.
In addition to traditional deep learning and graph neural network approaches, several Transformer variants designed for long-sequence modeling, such as Longformer [16], Reformer [17], and MemTransformer [18], have recently been proposed. These models reduce the quadratic complexity of standard Transformers through mechanisms such as sparse attention, locality-sensitive hashing, and external memory. However, their applicability in log anomaly detection remains limited. Specifically, Longformer relies on a sliding-window sparse attention mechanism with a small number of global tokens, which constrains its ability to capture fine-grained long-range dependencies in highly interleaved log sequences. Reformer reduces attention costs via locality-sensitive hashing, but the instability of hash bucket assignments may separate semantically related log events, thereby disrupting temporal dependency modeling. MemTransformer leverages external memory to propagate global information across segments; nevertheless, in log environments characterized by high redundancy and noise, its memory mechanism is prone to contamination and exhibits limited capability in capturing short-term local patterns.
To overcome these limitations, we propose BiGRMT, a memory-enhanced, segmentation-based model tailored to the characteristics of log data. Specifically, BiGRMT introduces three key innovations. First, it employs a dual-path architecture that combines the local temporal modeling capabilities of Bi-GRU with the cross-segment long-range dependency modeling of a Recurrent Memory Transformer (RMT), achieving a complementary integration of recurrent and Transformer structures. Second, BiGRMT incorporates workflow separation prior to sequence modeling, allowing the input sequences to be semantically structured and denoised before entering the memory mechanism—capabilities absent in existing long-sequence Transformer models. Third, by maintaining a fixed-length design for the RMT memory tokens, BiGRMT preserves O(N) linear complexity while stably capturing both local and global dependencies in high-concurrency, high-noise microservice log sequences.
In summary, existing deep learning approaches face two major challenges in log anomaly detection: (1) the difficulty of disentangling time series information from highly redundant logs, where normal and abnormal events are intertwined due to multi-threaded outputs and concurrent processing, and (2) the inability to capture long-range temporal dependencies across extended sequences, which often leads to high false alarm and missed detection rates. BiGRMT addresses these challenges by integrating Bi-GRU for local feature extraction, adaptive redundancy filtering to remove repeated logs, and RMT for capturing global dependencies across segments, thereby improving the stability and accuracy of anomaly detection in complex microservice log environments.

3. Problem Description

Large-scale data computing is primarily handled by distributed software systems with microservice architectures. The various microservices or components in such a system interact asynchronously and generate massive amounts of log data. Owing to the complexity of the microservice architecture and the uncertainty of software execution, log anomaly detection faces two key challenges:

3.1. Log Data Have Strong Uncertainty in Their Time Series Attributes

As a result of multiple concurrent operations and asynchronous interactions among service components, the temporal attributes of log data are chaotic and uncertain [19]. As shown in Figure 1, because log output is produced by multiple concurrent threads, the log events of one thread become interleaved with those of other threads, so events from the same thread end up far apart in the time series. This increases the complexity of the log time series and makes temporal dependencies harder to capture [20].
In Figure 1, multiple thread tasks execute concurrently from $t_1$ to $t_n$; $l_0$ to $l_{n+1}$ is a logging sequence that records each thread's activities, state changes, or important events. Logs $l_{10}$, $l_i$, and $l_n$ record the activities of thread $t_i$. Interleaved logging complicates the log timing and thus the analysis of abnormal software events. Because capturing the timing relationships of log data is so important, an incorrect timestamp order will directly cause event correlation analysis to fail [20]. In short, distributed software systems with microservice architectures place higher demands on log anomaly analysis that depends on timing relationships and require the detection model to capture long-distance temporal dependencies in the logs.

3.2. Data Redundancy for High-Concurrency Logs

As a result of the high-concurrency operation of multiple threads or processes in distributed software that handles large-scale data, redundant information appears in the log sequence in the form of large numbers of repeated or noisy logs. Besides increasing the computational cost of real-time processing, redundant logs may also interfere with the accuracy of anomaly detection. Currently, however, most deep learning-based log anomaly detection approaches assign the same weight to data at every time step and lack an effective screening mechanism. It is therefore difficult to filter redundant information, which hampers the extraction of temporal context features around abnormal events and reduces detection accuracy. On an e-commerce platform, for example, many users log in concurrently, producing a large number of repeated login logs. An anomaly detection model will identify these redundant login logs as normal events, and malicious users who attempt frequent login attacks will likewise be mistaken for normal users. Redundant log data thus prevent the model from capturing abnormal features and increase the missed detection rate.
To address these problems, this paper combines Bi-GRU with a Recurrent Memory Transformer to capture the correlations among different execution contexts in a highly concurrent computing environment, modeling multi-threaded concurrent log data from a global perspective and resolving the time series uncertainty introduced by concurrently running threads. Through the RMT mechanism, long-term temporal dependencies are captured by storing and transmitting memory information between segments, effectively alleviating log timing confusion. Bi-GRU performs finer-grained local modeling of nonlinear and long/short-term dependencies in complex time series. Owing to its bidirectional mechanism, Bi-GRU captures not only forward but also backward dependencies, enabling a deeper understanding of the contextual information between events, and it can identify short-term temporal features and cyclic patterns such as frequent cyclic anomalies and short-cycle triggering events. In addition, Bi-GRU is paired with an adaptive redundancy filtering mechanism: by learning the importance of each time step and dynamically adjusting the weights of high-frequency, low-information log events, redundant information is effectively filtered. When modeling local dependencies, this ensures that the model focuses on features that actually contribute to anomaly detection and reduces interference from high-frequency redundant logs.

4. System Design

The distributed software log anomaly detection system analyzes log data in real time and identifies potential anomalies. It has three main modules: log parsing, workflow separation, and BiGRMT anomaly detection. Unstructured log data are converted into a structured format by extracting log templates and parsing time series vectors. The chaotic timing of concurrent log data is resolved by analyzing the temporal relationships of log events so that events are processed in the correct order. Finally, the BiGRMT anomaly detection model accurately identifies abnormal events in log time series through feature extraction and model training. Figure 2 shows the system structure:
As shown in Figure 2, the log parsing module cleans the original logs, formats them and extracts features, and converts the unstructured logs into structured data by extracting timestamps, event types, and log levels. The workflow separation module analyzes the sequence of events in the log, separates different workflows based on timing and dependencies, solves the problem of uncertain log timing, and provides accurate time clues for anomaly detection. In the BiGRMT anomaly detection model, the bidirectional GRU network architecture is integrated with the Recurrent Memory Transformer mechanism to extract global and local features from log data. Using the RMT mechanism, the model stores and transmits memory information between segments, improving its ability to model long sequences. Global temporal dependencies are captured by the Transformer, and local temporal relationships are strengthened by the bidirectional GRU; together, they optimize log event feature extraction. A classification model is then used to predict whether an abnormal event has occurred.
Log parsing is performed using Swisslog [21], which efficiently converts unstructured log data into structured data. To detect log anomalies, log information is converted into time series data. Log processing consists of the following steps. First, the log is tokenized and standardized: tokens are split into words, and predefined dictionaries filter and construct valid words. Besides improving structured processing efficiency, this step also extracts key information for anomaly detection. Suppose there are log word sets $W_i$, $i = 1, 2, 3, \ldots, N$, where $N$ is the number of word sets. Through the longest common subsequence (LCS) algorithm, the parts shared across logs are treated as constants and the differing parts as variables, simplifying the log structure. In the cluster analysis, the system matches the prefixes of log sequences layer by layer to capture similarities. In addition, a similar-template merging algorithm based on information entropy optimizes processing efficiency by computing the similarity between templates and preferentially retaining templates with higher information content to preserve their expressive capability.
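To make the LCS idea behind template extraction concrete, the following minimal Python sketch treats tokens shared by two log lines as the constant part of a template and masks the differing tokens as variables. The function name and the "&lt;*&gt;" placeholder are illustrative assumptions, not the Swisslog implementation.

```python
# Hypothetical LCS-based template sketch: common tokens become the template,
# differing tokens are collapsed into a "<*>" variable marker.
def lcs_template(a, b):
    a, b = a.split(), b.split()
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if a[i] == b[j] else max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the DP table, keeping common tokens and masking the rest
    template, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            template.append(a[i - 1]); i -= 1; j -= 1
        else:
            if not template or template[-1] != "<*>":
                template.append("<*>")
            if dp[i - 1][j] >= dp[i][j - 1]:
                i -= 1
            else:
                j -= 1
    return " ".join(reversed(template))

print(lcs_template("Received block blk_123 of size 512",
                   "Received block blk_987 of size 1024"))
# -> "Received block <*> of size <*>"
```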
During the workflow separation stage, for the log dataset $L = \{l_1, l_2, \ldots, l_N\}$, where $N$ is the number of log entries, each log entry $l_i$ contains a timestamp and other relevant context information. By sorting the timing information of the events $l_i$, a time series $T = \{t_1, t_2, \ldots, t_N\}$ is created, where $t_i$ represents the timestamp of event $l_i$. Then, based on the dependency set $D = \{d_{i,j}\}$ between events, the events belonging to the same workflow are associated into a defined workflow set $W = \{w_1, w_2, \ldots, w_K\}$. The workflow separation method identifies and separates independent workflows by analyzing the dependencies $d_{i,j}$ and timing constraints $t_i < t_j$. Each workflow $w_k$ contains a time-ordered sequence of events $l_{k_1}, l_{k_2}, \ldots$ that satisfies the timing constraints $t_{k_1} < t_{k_2} < \cdots$. These workflows accurately represent the execution order of concurrent tasks in the system.
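As a rough illustration of this step, the following Python sketch sorts log entries chronologically and then groups events that share a dependency key; the field names and the `separate_workflows` helper are hypothetical stand-ins for the dependency analysis described above.

```python
# Sort the log dataset chronologically, then associate events that share a
# dependency key into the same workflow; each workflow stays in time order.
from collections import defaultdict

def separate_workflows(logs):
    ordered = sorted(logs, key=lambda entry: entry["timestamp"])   # T = (t_1, ..., t_N)
    workflows = defaultdict(list)
    for entry in ordered:
        workflows[entry["workflow_id"]].append(entry)              # events of w_k
    return list(workflows.values())                                # t_k1 < t_k2 < ...

logs = [
    {"timestamp": 3, "workflow_id": "A", "template": "E2"},
    {"timestamp": 1, "workflow_id": "A", "template": "E1"},
    {"timestamp": 2, "workflow_id": "B", "template": "E5"},
]
print(separate_workflows(logs))
```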
The input log dataset is first preprocessed to generate a feature representation suitable for model input during anomaly detection. Bi-GRU captures the temporal dependencies of the log sequence through bidirectional computation, extracts richer temporal features, and constructs a global context representation of the log events. To eliminate log events that appear frequently but contribute little to anomaly detection, a high-frequency redundant log filtering mechanism is introduced. The filtered log event sequence $L_{\mathrm{filtered}} = \{l_1, l_2, \ldots, l_M\}$ is input into the RMT mechanism. By combining the multilayer Transformer with memory tokens that store and transmit cross-segment dependency information, efficient long-sequence modeling is achieved, richer temporal features are extracted, and an output feature sequence $H = \{h_1, h_2, \ldots, h_M\}$ is produced. Using the anomaly score of each log entry, $A = \{a_1, a_2, \ldots, a_M\}$, the classifier predicts normal and abnormal events based on the combined feature vector $H$.

5. BiGRMT Anomaly Detection Model

The BiGRMT model integrates the bidirectional GRU and Recurrent Memory Transformer mechanisms to improve anomaly detection on log data, as shown in Figure 3. The advantage of BiGRMT is that it extracts more comprehensive and richer log time series features. Bi-GRU captures the temporal dependencies of the preceding and following contexts through bidirectional information transmission, enhances the understanding of short-term temporal relationships, and is suited to modeling local temporal dependencies. The Transformer uses the self-attention mechanism to capture global temporal dependencies and combines it with the RMT mechanism to store and transmit cross-segment memory information, thereby overcoming the limitations of the Transformer's computational complexity and achieving efficient long-sequence modeling. BiGRMT introduces a log redundancy filtering mechanism to reduce the interference of redundant information and to improve the model's sensitivity to key information and its processing efficiency, which is particularly valuable in high-concurrency environments. As a result, BiGRMT can efficiently suppress the interference of irrelevant information in large-scale log data streams, ensuring accuracy and response speed during real-time anomaly detection.
In Figure 3, BiGRMT consists of three parts: the input layer, the model layer, and the output layer. The input layer preprocesses the original log data, generates feature vectors, and combines them with position embeddings to retain timing information; at the same time, memory tokens are introduced to store global historical information and provide input for the subsequent layers. The model layer uses Bi-GRU for preliminary temporal modeling to extract short-term dependency information and enhance the understanding of local temporal features. Subsequently, the Recurrent Memory Transformer mechanism further models global temporal dependencies through self-attention, and the memory tokens transmit cross-segment memory information to achieve long-sequence modeling. The output layer converts the high-dimensional features extracted by the model layer into the final detection results.

5.1. Input Layer

The input layer of BiGRMT is responsible for preprocessing the raw log data and extracting features. By generating feature vectors and combining them with position embedding, the input layer can effectively retain the time series information. The sine and cosine functions are used to provide position information for each sequence element so that the model can understand the sequential structure of the data [22]. The pre-trained word embedding Word2Vec is used to achieve semantically rich initialization. Reasonable adjustment of the dropout ratio can effectively prevent the loss of information in transmission. A small amount of dropout is added after word embedding and position embedding to ensure information stability.
As a first step, the structured information in the log data is parsed, and the discrete words or events are embedded and mapped into a continuous vector space for model input. The input log sequence $w_1, w_2, w_3, \ldots, w_n$ is transformed into a vector sequence $X = \{x_1, x_2, x_3, \ldots, x_n\}$ through word embedding, where each $x_i \in \mathbb{R}^d$ is a vector of dimension $d$. These vector sequences, which capture the contextual relationships and semantic information between words, serve as input to the deep learning model.
To maintain the relative order relationships in time series data, position information is introduced. Through position embedding, position information is combined with the word embedding: for each element $i$ in the sequence, the position embedding vector $P_i$ is added to the word embedding $X_i$ to obtain the input vector $Z_i = P_i + X_i$. The input features adjusted by position embedding thus contain not only vocabulary information but also time series information, improving the performance of the BiGRMT model.
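The input layer described above can be sketched in PyTorch (the framework used in Section 6) roughly as follows; the module name, vocabulary size, and the 128-dimensional embedding are assumptions for illustration only.

```python
# Word embedding plus fixed sine/cosine position embedding, Z_i = P_i + X_i,
# followed by a light dropout for stability; dimensions are placeholders.
import math
import torch
import torch.nn as nn

class LogInputLayer(nn.Module):
    def __init__(self, vocab_size, d_model=128, max_len=1000, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)          # X_i
        self.dropout = nn.Dropout(dropout)
        pe = torch.zeros(max_len, d_model)                      # P_i (fixed)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # word embedding X
        z = x + self.pe[: token_ids.size(1)]      # Z_i = P_i + X_i
        return self.dropout(z)

z = LogInputLayer(vocab_size=500)(torch.randint(0, 500, (4, 50)))
print(z.shape)   # torch.Size([4, 50, 128])
```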

5.2. Model Layer

The model layer of BiGRMT is responsible for capturing and fusing deep feature information and enhancing the ability to model long sequences through the RMT mechanism. The working process of the model layer is as follows: (1) The embedded feature vector is input into the Bi-GRU module, which processes the context information of the sequence simultaneously through forward and reverse calculations, strengthens the learning of temporal dependencies, and captures bidirectional dependencies, thereby significantly improving the accuracy of anomaly detection [22]. (2) After the Bi-GRU module, an adaptive redundancy filtering mechanism dynamically learns the importance of features at each time step and filters redundant information by adjusting the weights of high-frequency, low-information log events. When modeling local dependencies, this ensures that the model focuses on features with high contributions and reduces the information interference caused by high-frequency redundant logs. (3) The feature vectors produced by Bi-GRU and redundancy filtering are passed to the multi-layer Transformer in the Recurrent Memory Transformer mechanism, which uses segmented processing combined with memory tokens to recursively store and transmit cross-segment dependency information, so that the model can not only use the multi-head self-attention mechanism to mine local features but also capture long-term global dependencies, improving the contextual understanding of log time series [23]. The details are as follows:
During the feature processing of the BiGRMT model, Bi-GRU is first used to process the input log sequence. The bidirectional gated recurrent unit (Bi-GRU) is good at capturing local temporal features in log sequences. By processing the forward and reverse input sequences simultaneously, Bi-GRU can fully extract contextual information, capture richer temporal dependencies, and provide a more refined feature expression for the model. At each time step, the state update of the GRU is expressed by the following formula [24]:
$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right)$
$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right)$
$\tilde{h}_t = \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right)$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
The GRU uses update gates and reset gates to control the retention and discarding of information. By concatenating the previous state $h_{t-1}$ and the current input $x_t$, the update gate $z_t$ determines the influence of the previous state on the current state; its weight matrix is $W_z$. At the current time step $t$, $r_t$ denotes the reset gate, which controls how information from the previous hidden state affects the current candidate hidden state; its weight matrix is $W_r$. Considering the current input $x_t$ and the previous hidden state, $\tilde{h}_t$ represents the candidate hidden state of the current time step $t$, with weight matrix $W_h$. The hidden state of the current time step $t$ combines the influence of the previous hidden state and the candidate hidden state. The final hidden state $h_t$ of the bidirectional GRU is composed of the forward GRU's hidden state $\overrightarrow{h_t}$ and the backward GRU's hidden state $\overleftarrow{h_t}$:
$h_t = \left[\overrightarrow{h_t}, \overleftarrow{h_t}\right]$
For the bidirectional GRU, the final hidden state is a combination of the forward and reverse hidden states, thereby fully capturing the temporal information in the sequence. This feature representation can better adapt to the application scenario of log anomaly detection tasks that require comprehensive contextual information. While capturing local dependencies, Bi-GRU enhances the understanding of global context through a bidirectional structure.
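In PyTorch, the bidirectional concatenation described above is obtained directly from `nn.GRU` with `bidirectional=True`; the sketch below is a minimal stand-in for the Bi-GRU encoder, with dimensions chosen only for demonstration.

```python
# Bi-GRU encoder: with bidirectional=True, each time step's output is the
# concatenation of the forward and backward hidden states, h_t = [h_fwd, h_bwd].
import torch
import torch.nn as nn

class BiGRUEncoder(nn.Module):
    def __init__(self, d_model=128, hidden=128):
        super().__init__()
        self.gru = nn.GRU(d_model, hidden, batch_first=True, bidirectional=True)

    def forward(self, z):                     # (batch, seq_len, d_model)
        h, _ = self.gru(z)                    # (batch, seq_len, 2 * hidden)
        return h

h = BiGRUEncoder()(torch.randn(4, 50, 128))
print(h.shape)   # torch.Size([4, 50, 256])
```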
However, in log data, high-frequency redundant information may interfere with the effect of anomaly detection. Therefore, an adaptive redundant filtering mechanism is introduced to dynamically adjust the importance of features, thereby improving the sensitivity of BiGRMT to abnormal patterns. The working process of the adaptive redundant filtering mechanism is as follows: By dynamically assigning weights to each time step, the importance of features is adaptively adjusted. After Bi-GRU extracts local features, redundant information is filtered out to weaken the interference of noise features on the model and enhance the model’s attention to information with high correlation with anomalies. Let the output of Bi-GRU be represented as
$H = \{h_1, h_2, \ldots, h_T\} \in \mathbb{R}^{T \times d}$
$h_t$ denotes the feature vector at time step $t$, $T$ is the number of time steps, and $d$ is the feature dimension. By learning a weight vector $w = \{w_1, w_2, \ldots, w_T\}$, the adaptive redundancy filtering mechanism dynamically adjusts the importance of each time step. The weight is calculated as follows:
$w_t = \sigma\left(W_w h_t + b_w\right)$
where $w_t$ is the weight of time step $t$, $W_w \in \mathbb{R}^{d \times 1}$ and $b_w \in \mathbb{R}$ are learnable parameters, and $\sigma$ is the sigmoid activation function, which normalizes the weight to $[0, 1]$. The final feature representation after redundancy filtering is obtained by weighted summation:
$H_{\mathrm{filtered}} = \sum_{t=1}^{T} w_t \cdot h_t$
$H_{\mathrm{filtered}}$ is the final feature representation, $w_t \cdot h_t$ is the weighted feature vector of time step $t$, $w_t$ is the weight, and $h_t$ is the feature vector.
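A minimal PyTorch reading of the adaptive redundancy filter, following the two equations above (a sigmoid-gated weight per time step and a weighted combination of the Bi-GRU features); the module name and shapes are assumptions, not the authors' exact implementation.

```python
# Sigmoid-gated per-step weights w_t down-weight high-frequency, low-information
# events; the weighted features and their sum H_filtered are both returned.
import torch
import torch.nn as nn

class RedundancyFilter(nn.Module):
    def __init__(self, d=256):
        super().__init__()
        self.score = nn.Linear(d, 1)          # W_w, b_w

    def forward(self, H):                     # Bi-GRU output, (batch, T, d)
        w = torch.sigmoid(self.score(H))      # w_t in [0, 1], shape (batch, T, 1)
        weighted = w * H                      # w_t * h_t per time step
        pooled = weighted.sum(dim=1)          # H_filtered = sum_t w_t * h_t
        return weighted, pooled

weighted, pooled = RedundancyFilter()(torch.randn(4, 50, 256))
print(weighted.shape, pooled.shape)
```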
Regarding the fusion between Bi-GRU and RMT, let $H_{\mathrm{gru}} \in \mathbb{R}^{T \times d}$ denote the contextual representations generated by Bi-GRU. For each RMT layer, the memory token from the previous segment is concatenated with the Bi-GRU output:
$X_t = \left[M_{t-1};\, H_{\mathrm{gru}}\right]$
RMT then performs attention over both the current segment and the memory token:
$\mathrm{Attn}(X_t) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$
The updated memory token is generated using a gated update mechanism:
$M_t = \sigma\left(W_g \bar{H}_t\right) \odot M_{t-1} + \left(1 - \sigma\left(W_g \bar{H}_t\right)\right) \odot \bar{H}_t$
where $\bar{H}_t$ is the aggregated representation of the current segment. This mechanism allows Bi-GRU to capture local temporal dependencies, while the RMT leverages memory tokens to model long-range cross-segment relationships, enabling both modules to collaborate effectively.
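The gated memory update can be sketched as follows, assuming the products in the equation above are element-wise; `MemoryGate` and its dimensions are illustrative, not the authors' code.

```python
# Gated cross-segment memory update M_t: a sigmoid gate blends the previous
# memory with the current segment summary so useful history is retained.
import torch
import torch.nn as nn

class MemoryGate(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        self.W_g = nn.Linear(d, d, bias=False)

    def forward(self, M_prev, H_bar):              # both (batch, n_mem, d)
        g = torch.sigmoid(self.W_g(H_bar))         # sigma(W_g * H_bar_t)
        return g * M_prev + (1.0 - g) * H_bar      # gated blend -> M_t

M_t = MemoryGate()(torch.zeros(2, 16, 512), torch.randn(2, 16, 512))
print(M_t.shape)   # torch.Size([2, 16, 512])
```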
Although Bi-GRU captures local dependencies well, it still has limitations in capturing global patterns over long time spans. To achieve global modeling, the Recurrent Memory Transformer is therefore introduced after the filtering mechanism. Through its recurrent memory mechanism, the RMT achieves linear computational complexity O(N), effectively breaking the computational bottleneck of the traditional Transformer. To ensure cross-segment information transmission while avoiding the quadratic cost of global attention, RMT uses a memory-augmented, segmented processing scheme. The core computation is as follows: the long sequence is first divided into multiple fixed-length segments; every segment carries a memory token, and the previous segment's memory state is recursively transferred to the current segment. At time step $\tau$, RMT [25] is computed as follows:
$\tilde{H}_\tau^{0} = \left[H_\tau^{\mathrm{mem}};\, H_\tau^{0}\right]$
$\bar{H}_\tau^{N} = \mathrm{Transformer}\left(\tilde{H}_\tau^{0}\right)$
$\left[H_\tau^{\mathrm{mem}};\, H_\tau^{N}\right] := \bar{H}_\tau^{N}$
In this recursive update mechanism, $\tilde{H}_\tau^{0}$ is the memory-augmented input representation of the current segment, $H_\tau^{\mathrm{mem}}$ is the recursively transferred memory state, $H_\tau^{0}$ is the input representation of the current segment, $\bar{H}_\tau^{N}$ is the output representation after processing by the Transformer, and $N$ is the number of Transformer layers. RMT accumulates long-term dependency information between segments while keeping the computational complexity at O(N).
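The segment-level recursion can be illustrated with a plain PyTorch Transformer encoder: memory tokens from the previous segment are prepended to the current segment, processed jointly, and the updated memory is carried forward. This is a simplified sketch under assumed dimensions (512-dimensional embeddings, 8 heads, 2 layers, 16 memory tokens, as listed in Section 6.1), not the exact RMT implementation.

```python
# Segment-by-segment processing with carried-over memory tokens; the cost per
# segment is fixed, so total cost grows linearly with the number of segments.
import torch
import torch.nn as nn

d_model, n_mem, seg_len = 512, 16, 128
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, dropout=0.1,
                               activation="gelu", batch_first=True),
    num_layers=2,
)

def rmt_forward(segments, memory):
    """segments: list of (batch, seg_len, d_model); memory: (batch, n_mem, d_model)."""
    outputs = []
    for seg in segments:
        x = torch.cat([memory, seg], dim=1)              # [H_mem ; H_seg]
        h = encoder(x)                                   # attention over segment + memory
        memory, seg_out = h[:, :n_mem], h[:, n_mem:]     # carry updated memory forward
        outputs.append(seg_out)
    return torch.cat(outputs, dim=1), memory

x = torch.randn(2, 3 * seg_len, d_model)                 # a long log sequence
segments = list(torch.split(x, seg_len, dim=1))          # fixed-length segments
out, mem = rmt_forward(segments, torch.zeros(2, n_mem, d_model))
print(out.shape, mem.shape)
```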
Using the RMT mechanism, the Transformer's multi-head self-attention mechanism enhances the ability to capture global information across time steps and effectively addresses long-distance dependency problems. In the input sequence $x_1, x_2, x_3, \ldots, x_n$, the self-attention mechanism dynamically assigns weights based on the weighted correlation between the query, key, and value matrices, thereby capturing global information efficiently. The self-attention mechanism [26] formula is as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$
$Q$, $K$, and $V$ denote the query, key, and value matrices, obtained by linear transformations of the input $x_1, x_2, x_3, \ldots, x_n$; $d_k$ is the dimension of the key vectors; and softmax normalizes the attention weights. The Transformer learns multiple dependency patterns simultaneously across multiple heads, enriching the feature expression through multi-head attention. With $h$ heads, the final multi-head attention [26] is expressed as follows:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h\right) W^{O}$
Here $\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable weight matrices. Through the stacking of self-attention and feedforward networks in each encoder layer, the input data are gradually transformed into deep feature representations. Each encoder layer produces an output $H^{(l)}$, where $l \in \{1, 2, 3, \ldots, N\}$ indexes the layer. The input sequence $x_1, x_2, x_3, \ldots, x_n$ is finally mapped to a feature representation $H^{(L)}$ in a high-dimensional space, forming a deep representation for downstream use. BiGRMT's stability and efficiency during training are ensured by the residual connections and normalization operations in each layer.

5.3. Output Layer

The output layer of BiGRMT is responsible for converting the high-dimensional features extracted by the model layer into detection results. The output of the BiGRMT model layer first passes through a fully connected layer, which aims to improve the model’s expressiveness so that it can capture more complex feature relationships. Then it passes through the dimensionality reduction layer to compress the high-dimensional features to a smaller dimension. Here, the linear dimensionality reduction method is selected to ensure that key information is retained while reducing the dimension. Finally, the detection results generated by the BiGRMT output layer are used to trigger the alarm mechanism to enhance the stability and security of the system. The specific process is as follows:
The input of the linear layer is the feature output from the previous layer. For an input vector, the linear layer can capture important features in the data and reduce the dimension by performing a linear transformation on it. The formula for linear transformation is as follows:
$y_t = W_o \cdot h_t + b_o$
Here $h_t$ is the output of the Transformer, $y_t$ is the output of the linear layer, $W_o \in \mathbb{R}^{2 d_h \times d_{\mathrm{out}}}$ is the weight matrix of the linear layer, and $b_o \in \mathbb{R}^{d_{\mathrm{out}}}$ is the bias. This linear layer allows BiGRMT to reduce the high-dimensional time series features to the dimensions required by the target task. After the linear layer, dimensionality reduction is applied through a fully connected layer or a Softmax layer. The Softmax function normalizes the output so that the probabilities of the categories sum to 1, which is especially useful for classification problems:
$\hat{y}_t = \mathrm{softmax}(y_t)$
Here $\hat{y}_t$ is the predicted probability distribution and $y_t$ is the linear layer's output. Based on these results, BiGRMT triggers alarms when anomalies are detected.
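A minimal sketch of the output head, assuming a two-class (normal/anomaly) setting: a linear projection followed by softmax normalization, matching the two equations above; the dimensions are placeholders.

```python
# Output head: y_t = W_o h_t + b_o followed by softmax to obtain per-entry
# normal/anomaly probabilities.
import torch
import torch.nn as nn

class AnomalyHead(nn.Module):
    def __init__(self, d_in=256, n_classes=2):
        super().__init__()
        self.proj = nn.Linear(d_in, n_classes)    # y_t = W_o h_t + b_o

    def forward(self, h):                          # (batch, T, d_in)
        return self.proj(h).softmax(dim=-1)        # y_hat_t per log entry

probs = AnomalyHead()(torch.randn(4, 50, 256))
print(probs.shape)   # torch.Size([4, 50, 2])
```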

6. Experiments and Results

6.1. Experimental Setup

The experiments used log data from real distributed systems to comprehensively evaluate BiGRMT's performance. The data come from actual operating environments and contain a variety of abnormal patterns, allowing BiGRMT's generalization ability and anomaly detection to be tested effectively. The code was implemented in Python with the PyTorch v1.10.1 deep learning framework. The experimental environment used Ubuntu 22 (Canonical Ltd., London, UK) and an RTX 3090 Ti graphics card (NVIDIA Corporation, Santa Clara, CA, USA).
All datasets used in this study are publicly available through Loghub [27], including Spark, HDFS, and BGL logs. The Spark dataset is collected from 32 physical computers in a laboratory environment and contains operating status and abnormal events. The HDFS dataset comes from Hadoop runs on Amazon EC2, covering various anomaly types such as disk failure, node crash, and data loss. The BGL dataset contains system-level logs from the IBM Blue Gene/L supercomputer, including normal operations and multiple anomaly types. To prepare the data for model input, logs are cleaned by removing irrelevant fields and normalizing timestamps, and event templates are extracted to convert raw messages into structured sequences. For each dataset, sequences are split chronologically, with 70% used for training, 10% for validation, and 20% for testing, ensuring the model is evaluated on unseen future logs.
The experiments compare multiple baseline methods, including LogGPT [10], TLAN [11], and LogPal [13]. Each method is briefly described below. LogGPT [10] detects log anomalies based on ChatGPT, transferring large-scale corpus knowledge to system log analysis through ChatGPT's language understanding capabilities; experiments show it is effective in zero-shot and few-shot learning scenarios. TLAN [11] is a deep learning framework for detecting log anomalies in distributed systems. By combining time series modeling with logical dependency analysis, it explicitly models the temporal patterns of log events and includes multiscale feature extraction, temporal-logical modeling, cross-component correlation analysis, and adaptive anomaly detection. LogPal [13] is a general anomaly detection model for heterogeneous logs of network systems: FT-tree extracts log templates, pattern events are generated by combining them with the original logs, and anomalies are then detected using an improved Transformer model. In this model, global attention is combined with sparse attention, effectively balancing template and semantic information while reducing the impact of noise.
Precision, recall, and F1-score are mainly used to evaluate model performance. Precision is the proportion of anomalies detected by the model that are actually anomalies; it reflects how accurately the model judges abnormal events. With high precision, the model better distinguishes normal logs from abnormal logs, reducing false positives. Recall is the proportion of abnormal logs that the model correctly detects; it indicates how well the model captures abnormal events. With a high recall, the model finds as many abnormal logs as possible and reduces missed reports. The F1-score is the harmonic mean of precision and recall, and it therefore indicates how well a model balances the two, making it suitable for anomaly detection tasks that demand a balance between false positives and missed detections.
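For reference, the three metrics can be computed with scikit-learn on a small hypothetical set of predictions (1 = anomaly, 0 = normal):

```python
# Precision, recall, and F1 on a toy prediction vector.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
print(precision_score(y_true, y_pred),   # detected anomalies that are real
      recall_score(y_true, y_pred),      # real anomalies that were detected
      f1_score(y_true, y_pred))          # harmonic mean of the two
```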
Word embeddings are initialized with 128-dimensional Word2Vec vectors, and all weight parameters are uniformly initialized. The hidden layer dimension is 128, and the model is optimized with the Adam optimizer using a learning rate of 0.001, a dropout rate of 0.1, and a batch size of 128. The model is trained 100 times with random initialization, and the average result of these 100 runs is used for evaluation. To assess the effectiveness and robustness of the model, multiple metrics were measured, including precision, recall, and F1-score.
The BiGRMT model consists of 1 layer of Bi-GRU and 2 layers of the Recurrent Memory Transformer (RMT). Each RMT layer has 8 attention heads and an embedding dimension of 512. The model uses the GELU activation function and applies a dropout rate of 0.1 during training. It is trained for 100 epochs. Key parameters include the learning rate, batch size, and the number of memory units per RMT layer (16).
To prevent overfitting, early stopping was applied based on validation loss, and L2 regularization was added to the model parameters. Dropout (0.1) was also used during training. The training and validation losses gradually decreased and converged after several epochs, with validation loss closely following training loss, indicating that the model did not suffer from significant overfitting.
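A hedged sketch of this regularization setup: Adam with L2 weight decay and early stopping on the validation loss. The stand-in model, synthetic data, patience, and tolerance are assumptions; only the optimizer choice, learning rate, and 100-epoch budget come from Section 6.1.

```python
# Illustrative early-stopping loop with L2 regularization (weight_decay).
# nn.Linear and the random tensors are placeholders for BiGRMT and real batches.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                      # stand-in for BiGRMT
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.CrossEntropyLoss()
x_val, y_val = torch.randn(64, 16), torch.randint(0, 2, (64,))

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):                      # 100-epoch budget
    x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()
    if val_loss < best_val - 1e-4:            # validation loss improved
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # early stopping
            break
```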

6.2. Experimental Results and Analysis

The experiments use the Spark dataset, which records events during the operation of a distributed computing system. The log data provide rich context for anomaly detection and reflect potential system problems. In this experiment, the BiGRMT model was evaluated against a variety of benchmark anomaly detection models to verify its effectiveness. As shown in Table 1, by separating workflow logs to reduce data confusion, the model can capture semantic information from multiple angles, improving its ability to detect system anomalies. The superiority of BiGRMT in anomaly detection was evaluated by comparing its precision, recall, and F1-score against the baselines. By utilizing the bidirectional information transmission mechanism and the global feature extraction capability of RMT, the BiGRMT model stabilizes gradient propagation during training and mitigates overfitting, and its filtering mechanism combines forward and backward semantic information to improve detection accuracy.
In the table, $|E|$ denotes the number of edges, $n$ the sequence length, $d$ the hidden layer dimension, $m$ the number of model parameters, and $l$ the input length. According to Table 1, BiGRMT performs well in precision, recall, and F1-score. Compared with LogGPT [10], BiGRMT shows a significant improvement in precision and recall, primarily because its RMT mechanism and Bi-GRU bidirectional dependency capture model the temporal dependencies of log sequences better, and its filtering mechanism effectively reduces the interference of redundant logs, improving both detection accuracy and recall. In contrast, LogGPT [10] fails to effectively exploit the temporal features and contextual dependencies in log data, resulting in unsatisfactory performance on both metrics. In comparison with TLAN [11], BiGRMT performs slightly better in precision but slightly worse in recall: it is marginally better at confirming detected anomalies but sacrifices a little in capturing potential ones. The redundancy filtering mechanism in BiGRMT reduces the impact of redundant logs, improving the model's applicability and efficiency in high-concurrency scenarios. In comparison with LogPal [13], BiGRMT has slightly lower precision and recall but a lower computational cost. LogPal [13] has a computational complexity of $O(n \times l \times d)$, where the input length $l$ directly affects the overhead, so high-concurrency environments incur higher computational cost. In contrast, BiGRMT adopts the Recurrent Memory Transformer structure, which reduces the computational complexity to $O(n \times d^2)$, and introduces memory units that efficiently process abnormal correlation data over long time spans, enhancing its efficiency for distributed high-throughput, low-latency processing.

6.3. Performance Evaluation

Table 2 compares BiGRMT with LogGPT, TLAN, and LogPal baseline methods on Spark, HDFS, and BGL datasets:
As shown in Table 2, even though LogPal shows excellent performance on a variety of datasets, BiGRMT offers unique benefits when processing abnormal correlation data over a long period of time. In this way, BiGRMT is more suitable for processing large and complex log files. BiGRMT not only achieves a better balance between precision and recall but also maintains a higher level of stability and efficiency over long time series data when compared with TLAN and LogGPT.
We analyzed and verified the early detection capability of BiGRMT, that is, whether it can provide early warning before an abnormality occurs. As shown in Figure 4, BiGRMT can identify potential anomalies in about one second, indicating that this model can identify potential anomalies earlier, reducing the impact of system failures.
This advantage mainly comes from the fact that the Bi-GRU structure can capture short-term context information, while the RMT mechanism can accumulate abnormal patterns across time periods, enabling BiGRMT to detect potential problems as soon as the log pattern fluctuates abnormally, thereby improving the system’s early warning capabilities.
In this experiment, the Spark log dataset was used to verify BiGRMT's robustness in different environments. We introduced 20% noise, 15% missing data, and a 50% increase in the log generation rate (high load) into the dataset, together with multiple anomaly types that affect detection, such as task timeouts, service crashes, and abnormal call chains. The performance of the models under these conditions is shown in Table 3.
Table 3 shows that LogPal's F1-score is 99.2% under ideal conditions, significantly better than BiGRMT's, suggesting stronger anomaly detection in the ideal setting. Increasing the log generation rate by 50% decreases the F1-scores of both LogPal and BiGRMT, indicating that both models are affected by high load; however, LogPal drops to 82.8% while BiGRMT maintains 87.4%, indicating greater robustness in high-load environments. When 20% noise is introduced, the F1-scores of both models decrease and the false alarm rates increase significantly, showing that noise clearly interferes with performance, but BiGRMT still exhibits better noise resistance thanks to its filtering mechanism. With 15% missing data, the F1-scores of LogPal and BiGRMT fall to 74.1% and 80.2%, respectively; BiGRMT maintains a higher detection capability even with incomplete data because it can process abnormal correlation data over long spans. When multiple anomalies occur simultaneously, such as task timeouts, service crashes, and abnormal call chains, the F1-scores of both models drop significantly, yet BiGRMT remains comparatively robust overall and can adapt to anomaly detection tasks under adverse conditions.
To verify the computational efficiency of BiGRMT and its O(N) complexity, we conducted a performance analysis by measuring GPU memory usage and inference time for varying sequence lengths. Experiments were performed on an RTX 3090 Ti GPU with 24 GB of memory using the Spark dataset. Input sequence lengths ranged from 100 to 1000 tokens, and the corresponding memory usage and inference times were recorded, as summarized in Table 4.
As shown in Table 4, both GPU memory usage and inference time increase roughly linearly with sequence length, confirming the O(N) computational complexity of BiGRMT. This efficiency is achieved through the RMT mechanism, which processes sequences in segments and uses memory tokens to avoid the quadratic complexity O(N²) of standard Transformers. The near-linear scaling highlights BiGRMT's suitability for real-time anomaly detection in high-concurrency environments with long log sequences.
To further demonstrate the efficiency of BiGRMT relative to state-of-the-art models, we compared its parameter count and average inference latency with TLAN, LogPal, and LogBERT. The number of parameters (Params) reflects model complexity and memory footprint, while average inference latency (Latency) is measured on the Spark dataset with sequences of length 1000. Table 5 summarizes the comparison.
As shown in Table 5, BiGRMT achieves the smallest model size and fastest inference among the compared models, owing to its lightweight Bi-GRU and the RMT mechanism, which reduce parameter requirements and maintain linear complexity. This efficiency demonstrates that BiGRMT is highly suitable for real-time anomaly detection in long-sequence, high-concurrency log analysis scenarios.

6.4. Ablation Experiment

To assess the contributions of the key components in BiGRMT, we conducted ablation experiments on the Bi-GRU, the Recurrent Memory Transformer (RMT), and the adaptive redundant filtering module. Table 6 summarizes the results in terms of precision, recall, and F1-score.
The ablation results demonstrate the distinct contributions of BiGRMT’s components. Removing the Bi-GRU results in the largest performance degradation, with F1-score dropping from 90.0% to 86.1%, highlighting its critical role in capturing both forward and backward temporal dependencies in log sequences. Excluding the RMT leads to a slight decrease in F1-score to 89.7%, indicating that RMT is important for extracting global features and modeling long-sequence dependencies efficiently. Similarly, removing the adaptive redundant filtering mechanism causes a moderate reduction in F1-score to 88.7%, confirming its effectiveness in mitigating noise from high-frequency, low-information log events and enhancing anomaly detection. Collectively, these results validate that all three components contribute positively to the overall performance of BiGRMT, with Bi-GRU having the most substantial impact, followed by adaptive filtering and RMT.

7. Conclusions

This paper presents BiGRMT, a hybrid framework that integrates Bi-GRU, adaptive redundancy filtering, and an RMT to jointly capture local and long-range dependencies in log sequences. Bi-GRU enhances short-term feature extraction while suppressing noise, and RMT efficiently models long-sequence global dependencies through segment-level memory, achieving near-linear complexity. Experiments on Spark, BGL, and HDFS datasets demonstrate high precision and F1-scores, maintaining stability with increased log rates and injected noise, highlighting strong robustness. Compared to existing methods, BiGRMT significantly reduces computational overhead while preserving detection performance, and its fixed-memory design and lightweight architecture make it suitable for real-time deployment in high-concurrency microservice environments. Future work will explore model compression, adaptive memory mechanisms, and large-scale engineering validation to further improve scalability, industrial applicability, and deployment efficiency.

Author Contributions

Software, R.Z. (Ruicheng Zhang), R.Z. (Renzun Zhang) and D.Q.; Validation, S.W.; Data curation, R.Z. (Ruicheng Zhang) and K.Y.; Writing—original draft, R.Z. (Ruicheng Zhang) and R.Z. (Renzun Zhang); Writing—review & editing, M.X., D.Q. and X.H.; Visualization, R.Z. (Renzun Zhang); Supervision, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by State Grid Shandong Power Science and Technology Project: Research on Key Technologies for Intelligent Operation and Inspection of Secondary Systems Based on Logical Models and Image Object Recognition Technology (Project No.: 520609240002).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

All authors were employed by the Tai’an Power Supply Company. They declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Abgaz, Y.; McCarren, A.; Elger, P.; Solan, D.; Lapuz, N.; Bivol, M.; Jackson, G.; Yilmaz, M.; Buckley, J.; Clarke, P. Decomposition of monolith applications into microservices architectures: A systematic review. IEEE Trans. Softw. Eng. 2023, 49, 4213–4242. [Google Scholar] [CrossRef]
  2. Razzaq, A.; Ghayyur, S.A. A systematic mapping study: The new age of software architecture from monolithic to microservice architecture—Awareness and challenges. Comput. Appl. Eng. Educ. 2023, 31, 421–451. [Google Scholar] [CrossRef]
  3. Diaz-De-Arcaya, J.; Torre-Bastida, A.I.; Zárate, G.; Miñón, R.; Almeida, A. A joint study of the challenges, opportunities, and roadmap of mlops and aiops: A systematic survey. ACM Comput. Surv. 2023, 56, 1–30. [Google Scholar] [CrossRef]
  4. Guo, H.; Yang, J.; Liu, J.; Bai, J.; Wang, B.; Li, Z.; Zheng, T.; Zhang, B.; Peng, J.; Tian, Q. Logformer: A pre-train and tuning pipeline for log anomaly detection. Proc. AAAI Conf. Artif. Intell. 2024, 38, 135–143. [Google Scholar] [CrossRef]
  5. Lee, Y.; Kim, J.; Kang, P. Lanobert: System log anomaly detection based on bert masked language model. Appl. Soft Comput. 2023, 146, 110689. [Google Scholar] [CrossRef]
  6. Fu, Q.; Lou, J.G.; Wang, Y.; Li, J. Execution anomaly detection in distributed systems through unstructured log analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, Miami, FL, USA, 6–9 December 2009; pp. 149–158. [Google Scholar]
  7. Lin, Q.; Zhang, H.; Lou, J.G.; Zhang, Y.; Chen, X. Log clustering based problem identification for online service systems. In Proceedings of the 38th International Conference on Software Engineering Companion, Austin, TX, USA, 14–22 May 2016; pp. 102–111. [Google Scholar]
  8. Yang, L.; Chen, J.; Wang, Z.; Wang, W.; Jiang, J.; Dong, X.; Zhang, W. Semi-supervised log-based anomaly detection via probabilistic label estimation. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Virtual Event, 25–28 May 2021; pp. 1448–1460. [Google Scholar]
  9. Le, V.H.; Zhang, H. Log-based anomaly detection without log parsing. In Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia, 15–19 November 2021; pp. 492–504. [Google Scholar]
  10. Qi, J.; Huang, S.; Luan, Z.; Yang, S.; Fung, C.; Yang, H.; Qian, D.; Shang, J.; Xiao, Z.; Wu, Z. Loggpt: Exploring chatgpt for log-based anomaly detection. In Proceedings of the 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Melbourne, Australia, 17–21 December 2023; pp. 273–280. [Google Scholar]
  11. Liu, Y.; Ren, S.; Wang, X.; Zhou, M. Temporal logical attention network for log-based anomaly detection in distributed systems. Sensors 2024, 24, 7949. [Google Scholar] [CrossRef] [PubMed]
  12. Zhang, L.; Jia, T.; Jia, M.; Li, Y.; Yang, Y.; Wu, Z. Multivariate log-based anomaly detection for distributed database. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; pp. 4256–4267. [Google Scholar]
  13. Sun, L.; Xu, X. LogPal: A generic anomaly detection scheme of heterogeneous logs for network systems. Secur. Commun. Netw. 2023, 2023, 2803139. [Google Scholar] [CrossRef]
  14. Xie, Y.; Zhang, H.; Babar, M.A. Loggd: Detecting anomalies from system logs with graph neural networks. In Proceedings of the 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), Guangzhou, China, 5–9 December 2022; pp. 299–310. [Google Scholar]
  15. Catillo, M.; Pecchia, A.; Villano, U. AutoLog: Anomaly detection by deep autoencoding of system logs. Expert Syst. Appl. 2022, 191, 116263. [Google Scholar] [CrossRef]
  16. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
  17. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
  18. Burtsev, M.S.; Kuratov, Y.; Peganov, A.; Sapunov, G.V. Memory transformer. arXiv 2020, arXiv:2006.11527. [Google Scholar]
  19. Meng, W.; Liu, Y.; Zhang, S.; Zaiter, F.; Zhang, Y.; Huang, Y.; Yu, Z.; Zhang, Y.; Song, L.; Zhang, M.; et al. Logclass: Anomalous log identification and classification with partial labels. IEEE Trans. Netw. Serv. Manag. 2021, 18, 1870–1884. [Google Scholar] [CrossRef]
  20. Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, S.; Sun, P.; et al. Loganomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; Volume 19, pp. 4739–4745. [Google Scholar]
  21. Li, X.; Chen, P.; Jing, L.; He, Z.; Yu, G. Swisslog: Robust and unified deep learning based log anomaly detection for diverse faults. In Proceedings of the 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), Online, 12–15 October 2020; pp. 92–103. [Google Scholar]
  22. Si, C.; Yu, W.; Zhou, P.; Zhou, Y.; Wang, X.; Yan, S. Inception transformer. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 23495–23509. [Google Scholar]
  23. Zhang, L.; Jia, T.; Wang, K.; Jia, M.; Yang, Y.; Li, Y. Reducing events to augment log-based anomaly detection models: An empirical study. In Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Barcelona, Spain, 24–25 October 2024; pp. 538–548. [Google Scholar]
  24. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  25. Bulatov, A.; Kuratov, Y.; Kapushev, Y.; Burtsev, M. Beyond attention: Breaking the limits of transformer context length with recurrent memory. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17700–17708. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  27. Zhu, J.; He, S.; He, P.; Liu, J.; Lyu, M.R. Loghub: A large collection of system log datasets for ai-driven log analytics. In Proceedings of the 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), Florence, Italy, 9–12 October 2023; pp. 355–366. [Google Scholar]
  28. Guo, H.; Yuan, S.; Wu, X. Logbert: Log anomaly detection via bert. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
Figure 1. Timing confusion diagram.
Figure 2. Distributed software log anomaly event detection system architecture.
Figure 3. BiGRMT model network structure diagram.
Figure 4. Comparison of early detection capability with the same model.
Table 1. Comparison of experimental effects of models.
Metric | LogGPT [10] | TLAN [11] | LogPal [13] | LogBERT [28] | BiGRMT
Precision (%) | 48.7 | 91.2 | 98.3 | 87.0 | 91.3
Recall (%) | 48.9 | 89.4 | 98.3 | 78.1 | 88.8
F1-Score (%) | 55.6 | 90.3 | 98.6 | 82.3 | 90.0
Computational Complexity | O(m × l²) | O(n × d² + |E| × d) | O(n × l × d) | O(n × d²) | O(n × d²)
Table 2. Evaluation of the overall performance of different models.
Dataset | Method | F1-Score (%) | Precision (%) | Recall (%)
Spark | LogGPT | 55.7 | 38.6 | 100.0
Spark | TLAN | 89.7 | 89.1 | 90.3
Spark | LogPal | 99.0 | 99.0 | 99.0
Spark | BiGRMT | 90.0 | 91.3 | 88.8
HDFS | LogGPT | 50.7 | 34.0 | 100.0
HDFS | TLAN | 91.5 | 90.9 | 92.1
HDFS | LogPal | 99.0 | 98.0 | 99.0
HDFS | BiGRMT | 89.7 | 90.3 | 89.2
BGL | LogGPT | 44.4 | 28.6 | 100.0
BGL | TLAN | 89.3 | 88.7 | 89.9
BGL | LogPal | 98.0 | 98.0 | 97.0
BGL | BiGRMT | 91.3 | 92.2 | 90.5
Table 3. Comparison of model performance under different conditions.
Condition | Model | F1-Score (%) | FAR (%) | DL (s)
Normal | LogPal | 99.2 | 2.03 | 1.30
Normal | BiGRMT | 90.3 | 5.12 | 1.45
High Load | LogPal | 82.8 | 10.07 | 1.75
High Load | BiGRMT | 87.4 | 7.25 | 1.56
Noisy Logs | LogPal | 78.0 | 16.34 | 1.85
Noisy Logs | BiGRMT | 83.6 | 12.08 | 1.68
Missing Data | LogPal | 74.1 | 19.57 | 1.92
Missing Data | BiGRMT | 80.2 | 15.14 | 1.72
Multiple Anomalies | LogPal | 68.8 | 25.62 | 2.05
Multiple Anomalies | BiGRMT | 75.4 | 20.49 | 1.80
Table 4. GPU memory usage and inference time for different sequence lengths.
Sequence Length | GPU Memory Usage (MB) | Inference Time (ms)
100 | 1200 | 15
200 | 1400 | 18
500 | 1800 | 25
1000 | 2200 | 35
Table 5. Model parameters and inference latency comparison.
Model | Params (M) | Latency (ms)
LogGPT [10] | 175.2 | 120.5
TLAN [11] | 68.7 | 45.2
LogPal [13] | 92.3 | 65.8
LogBERT [28] | 110.4 | 78.6
BiGRMT | 45.1 | 35.0
Table 6. Comparison of the effects of ablation experiments.
Model Configuration | Precision (%) | Recall (%) | F1-Score (%)
BiGRMT | 91.3 | 88.8 | 90.0
BiGRMT without Bi-GRU | 87.8 | 84.5 | 86.1
BiGRMT without RMT | 91.1 | 88.4 | 89.7
BiGRMT without Adaptive Filtering | 90.0 | 87.5 | 88.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
