Article

LogPPO: A Log-Based Anomaly Detector Aided with Proximal Policy Optimization Algorithms

State Key Laboratory of Photonics and Communications, School of Electronics, Peking University, Beijing 100871, China
*
Author to whom correspondence should be addressed.
Smart Cities 2026, 9(1), 5; https://doi.org/10.3390/smartcities9010005
Submission received: 16 November 2025 / Revised: 18 December 2025 / Accepted: 24 December 2025 / Published: 26 December 2025

Highlights

What are the main findings?
  • Using Large Language Models (LLMs) together with domain classifiers improves log anomaly detection in data-scarce settings, with F1-Score gains of 5–86% over Transformer-based baselines.
  • A PPO-based method aligns LLM outputs with classifier preferences, increasing “confidence” in label predictions.
What are the implications of the main findings?
  • Enhancing automation improves operational stability and resilience in smart city cloud platforms, minimizing downtime and human intervention.
  • Reducing data requirements lowers the practical barriers for deploying NLP-based anomaly detection, making advanced diagnostic solutions feasible for large-scale urban systems.

Abstract

Cloud-based platforms form the backbone of smart city ecosystems, powering essential services such as transportation, energy management, and public safety. However, their operational complexity generates vast volumes of system logs, making manual anomaly detection infeasible and raising reliability concerns. This study addresses the challenge of data scarcity in log anomaly detection by leveraging Large Language Models (LLMs) to enhance domain-specific classification tasks. We empirically validate that domain-adapted classifiers preserve strong natural language understanding, and introduce a Proximal Policy Optimization (PPO)-based approach to align semantic patterns between LLM outputs and classifier preferences. Experiments were conducted against three Transformer-based baselines under few-shot conditions across four public datasets. Results indicate that integrating natural language analyses improves anomaly detection F1-Scores by 5–86% over the baselines, while iterative PPO refinement boosts the classifier’s “confidence” in label prediction. This research pioneers a novel framework for few-shot log anomaly detection, establishing an innovative paradigm for resource-constrained diagnostic systems in smart city infrastructures.

1. Introduction

Modern smart city infrastructures increasingly rely on distributed cloud architectures to deliver essential urban services, from intelligent transportation to energy management [1]. These systems, characterized by large-scale software and hardware integration, generate massive volumes of operational logs that are critical for anomaly detection and reliability assurance. However, the exponential growth in system complexity renders manual log analysis impractical, creating an urgent need for intelligent, automated solutions.
Existing deep-learning approaches for automated log anomaly detection generally fall into two categories: encoder-based and decoder-based methods. Encoder-based models [2,3,4,5,6] typically map log content into high-dimensional embedding vectors for classification using Support Vector Machine (SVM) or neural networks. While effective, these methods suffer from two primary limitations. First, the obtained high-dimensional vector spaces are inherently opaque “black boxes” difficult to interpret. Second, the accuracy of such methods is sensitive to sample size [2], which constrains their application in real-world scenarios characterized by data scarcity—where manually labeled data is significantly smaller than the massive volume of system-generated logs.
To improve transparency and address performance degradation under data scarcity, researchers have explored large language models (LLMs) with decoder-based architectures [5,7]. These models offer interpretability by generating natural language explanations and maintain stable classification performance in data-scarce settings via prompting. However, pure decoder approaches can only generate unstructured analyses rather than explicit labels, impeding full automation by necessitating manual inspection to reach classification conclusions. In addition, because they rely solely on prompting without fine-tuning, they fail to leverage the valuable classification patterns contained in available labeled data.
To address these complementary limitations, we propose a hybrid decoder-encoder architecture, as shown in Figure 1. An LLM is employed to generate interpretable insights, while a fine-tuned encoder extracts explicit classification conclusions from those insights, ensuring robust performance and full automation even with scarce training data.
In summary, to mitigate the aforementioned issues, we propose LogPPO. Our main contributions can be summarized as follows:
(1)
We introduce a decoder-encoder framework specifically designed to utilize both domain-specific expertise and general human knowledge within encoder-based classification models.
(2)
To the best of our knowledge, we are the first to employ the PPO algorithm in log-based anomaly detection tasks to address and align cognitive discrepancies between LLMs and classification models.
(3)
Three Transformer-based models are selected as baselines. Experimental results on four public log datasets show that LogPPO achieves a 5% to 86% improvement in anomaly detection F1-Scores compared to the best-performing baseline.

2. Background and Related Work

2.1. Log Data

During system operation, a substantial volume of log data is generated, which manifests as semi-structured textual data encapsulating critical operational information. Figure 2 illustrates logs extracted from the BGL dataset, comprising three distinct components: the log header, static events, and dynamic parameters [8]. (1) The log header incorporates elements that are operationally independent of system runtime status yet fundamentally associated with log governance mechanisms. (2) Events, also referred to as keywords, systematically document specific system events or particular operational states occurring at the moment of log generation. For instance, the event “RAS KERNEL FATAL data TLB error interrupt” typically signifies that an interrupt was triggered by a Translation Lookaside Buffer (TLB) fault. (3) Dynamic parameters record dynamically configured attributes during system runtime, such as IP addresses and TCP port numbers. Log parsing serves as a foundational preprocessing step that automatically extracts structured log templates from raw log data.
In recent years, a multitude of sophisticated log parsing methodologies have been developed [9], with diverse algorithmic frameworks and intelligent parsing techniques emerging to address the growing complexity of system log analysis, such as AEL [10], LogPPT [11], Spell [12], Drain [13] and ChatGPT-based log parsing [14]. To closely approximate real-world industrial application scenarios, we adopted the Drain algorithm, given its established record of industrial deployments.
According to [15], algorithm-based log parsing tools cannot identify all the parameters in logs, yielding multiple distinct static templates that retain redundant information such as unique IDs. Whereas such errors can introduce noise for classification encoders, modern LLMs are pre-trained on noisy texts and can still identify the core event semantics of a log message even with imperfect template abstractions. Therefore, even if the template is noisy, the analysis remains clean and can guide the classifier correctly.
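To make this limitation concrete, a lightweight post-processing pass can mask dynamic fields that rule-based parsers often leave inside templates; the following is a minimal sketch, and the regexes, placeholders, and function name are our own illustrative assumptions rather than part of Drain:

```python
import re

# Illustrative post-processing pass: mask dynamic fields that rule-based
# parsers often leave behind (patterns and placeholders are our own choices).
PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4, optional port
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),                   # hex addresses / IDs
    (re.compile(r"\b\d+\b"), "<NUM>"),                              # bare numeric IDs
]

def mask_parameters(log_line: str) -> str:
    """Replace dynamic parameters with placeholders, keeping static event text."""
    for pattern, placeholder in PATTERNS:
        log_line = pattern.sub(placeholder, log_line)
    return log_line
```

Masking unique IDs this way collapses near-duplicate templates, though in LogPPO the LLM's robustness to residual noise reduces the need for exhaustive rules.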

2.2. Log Anomaly Detection via Deep Learning Architectures

In recent years, the field of natural language processing (NLP) has witnessed an unprecedented surge of groundbreaking innovations, exemplified by the methods based on Transformer architectures [16] such as ChatGPT [17] and BERT [18], which have fundamentally redefined the paradigms of linguistic computation. Some researchers, observing structural and semantic parallels between human language and log data, have been motivated to explore the application of NLP techniques for log anomaly detection and develop a multitude of methodologies [2,3,4,5,6].
Based on model architectures, they can be categorized into embedding-based methods and generation-based methods; the main distinctions between them are illustrated in Figure 3. In the realm of log anomaly detection, a substantial proportion of tasks fundamentally rely on embedding-based techniques. The remainder of this section reviews methods built on deep learning frameworks.

2.2.1. Embedding-Based Methods

Embedding-based methods typically comprise the following sequential stages: log preprocessing, feature extraction, and anomaly identification, with feature extraction constituting the most pivotal component in the pipeline [19]. Within log preprocessing, log parsing is the central technique, serving as the cornerstone for transforming unstructured log messages into structured formats through log template extraction and parameter identification. There are two primary methodologies for anomaly identification: (1) generate a predictive set of normal tokens or log sequences, then verify whether subsequent logs fall within these normal subsets [8]; (2) employ supervised classification models to determine whether an input log exhibits normal or abnormal patterns. Common classifiers include shallow neural networks and SVMs [20,21].
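Paradigm (1) reduces to a membership check against a learned set of normal tokens or sequences; a hedged sketch (the set-based representation and function name are ours, not from the cited works):

```python
def is_anomalous_by_vocabulary(log_tokens, normal_token_set):
    """Paradigm (1): flag a log whose tokens fall outside the learned normal set."""
    return any(tok not in normal_token_set for tok in log_tokens)
```

In practice the "normal set" is usually predicted by a model over sequences rather than enumerated, but the acceptance test has this same shape.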
Feature extraction refers to the process of projecting log data into a floating-point vector space, where the generated vectors encompass semantically rich information characteristics that prove instrumental in facilitating the identification of abnormal log messages. To maximize the informative value encapsulated within these vectors, several dedicated training methods have been developed specifically for log feature extraction in machine learning frameworks [2,3,4,8,19].
Based on the pre-trained BERT architecture, LogBERT [4] proposes two supervised fine-tuning tasks to systematically capture the intrinsic patterns of normal logs. The first task is Masked Log Key Prediction (MLKP). This methodology masks randomly selected tokens, followed by predictive reconstruction of the obscured tokens. The second task is Volume of Hypersphere Minimization (VHM), which aims to reduce the distance between normal log messages and the centroid within the log vector space. BigLog [3] conducts Masked Language Modeling (MLM) pre-training tasks utilizing a self-curated large-scale dataset. This methodological design aims to cultivate universal representation vectors specifically optimized for log data characterization. KnowLog (model) [2] incorporates three specialized fine-tuning tasks to enhance domain-specific log comprehension: (1) The Abbreviation Prediction Task emulates the MLM task to predict abbreviated terms within contextual sequences, thereby enhancing the model’s comprehension and retention of abbreviated lexical knowledge. (2) The Log Contrastive Learning Task, grounded in contrastive learning methods, facilitates the acquisition of more generalized log semantic embeddings. (3) The Description Discrimination Task, likewise inspired by contrastive learning, employs human language descriptions as teaching materials to enable KnowLog (model) to systematically acquire and internalize knowledge of network terminology.

2.2.2. Generation-Based Methods

Generation-based methods predominantly rely on Prompt Engineering methodologies while refraining from parameter adjustments to the original models. LogPrompt [5] systematically designs three prompting strategies: Self-Prompt, Chain-of-Thought (CoT) Prompting and In-Context Prompting, and achieves state-of-the-art performance in zero-shot learning scenarios. LogExpert [7] constructs a comprehensive database utilizing the Stack Overflow technical forum as a foundational resource. When generating solutions for anomaly log analysis, the system first queries this database to retrieve pertinent information, which is then strategically incorporated into the prompt engineering process. The integration of novel reference materials from external repositories enables the derivation of more holistic and robust resolutions.
Nevertheless, both embedding-based and generation-based methodologies demonstrate inherent limitations. The performance of embedding technologies exhibits a positive correlation with the scale of training datasets. However, empirical observations indicate that real-world online application scenarios [22] are often characterized by data scarcity. The temporal sampling of operational data captures only minuscule segments of system uptime in complex network environments. Consequently, assembling human-labeled datasets of adequate scale is inherently problematic and often exceedingly challenging. Therefore, all experiments were designed to replicate scenarios characterized by data scarcity, mirroring the constrained data conditions commonly encountered in real-world applications. Generation-based technologies, particularly Prompt Engineering techniques, face a critical challenge: while these methods enhance model knowledge acquisition, they simultaneously suffer from inherent bias propagation stemming from the prompts’ latent suppositions.

3. Approach

The architecture of LogPPO is depicted in Figure 4, comprising two key components: a classification model and a generation model. This section provides a sequential exposition of both constituent modules.

3.1. Classification Model

As depicted in Figure 5, we first fine-tuned KnowLog (model) to transform it into a label-output classification model capable of determining whether an input log exhibits normal or abnormal patterns. During the fine-tuning process, we employed the AdamW optimizer with a learning rate of 5 × 10⁻⁵, a batch size of 8, and a weight decay of 0.01. Regarding the training scope, we adopted a full parameter fine-tuning strategy, unfreezing all layers of the KnowLog architecture to enable the model to fully capture the semantic nuances of the log data.
To ensure the integrity of model evaluation and prevent data leakage, we strategically excluded a substantial subset of the dataset from the fine-tuning pipeline while maintaining rigorous validation protocols throughout the model optimization process. At this stage, we have established a classification model capable of performing anomaly detection in log data through sophisticated pattern recognition and semantic analysis.
The LLM was adopted as the initial policy model, tasked with producing multiple analyses of identical log messages to synthesize a curated candidate dataset; the workflow is shown in Figure 6. The log messages and their corresponding analyses are fed into a fully connected layer, which derives the probabilities associated with the real labels. The generated analyses are paired into sets, where the analysis with the higher probability value is designated as the “chosen” candidate, while its counterpart is labeled as “rejected”. This methodological framework ultimately yields a strategically curated “chosen-rejected” dataset [23], which serves as foundational material for training reward models.
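The pairing step can be sketched as follows, assuming each analysis already has the classifier's probability for the true label attached; the function name and dictionary layout are our own illustrative choices:

```python
from itertools import combinations

def build_preference_pairs(analyses, true_label_probs):
    """Pair analyses of the same log; the one the classifier scores higher
    on the true label becomes 'chosen', the other 'rejected'."""
    pairs = []
    for i, j in combinations(range(len(analyses)), 2):
        if true_label_probs[i] == true_label_probs[j]:
            continue  # ties carry no preference signal
        if true_label_probs[i] > true_label_probs[j]:
            pairs.append({"chosen": analyses[i], "rejected": analyses[j]})
        else:
            pairs.append({"chosen": analyses[j], "rejected": analyses[i]})
    return pairs
```

Applied to three analyses of one log with true-label probabilities [0.9, 0.2, 0.5], this yields three preference pairs for reward-model training.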

3.2. Generation Model

The models implemented in our study were selected from the LLaMA series [24]. We selected the LLaMA 2-7B architecture as the reference model due to its decoder-based transformer structure, which is essential for generative text analysis rather than classification. Furthermore, among open-source models of comparable size, including MPT-7B and Falcon-7B, LLaMA 2-7B demonstrates superior performance on reasoning and reading comprehension benchmarks. Its 7-billion parameter size offers an optimal balance between reasoning capability and computational efficiency, aligning with the resource constraints of smart city infrastructure. The workflow for tuning the generation model using PPO is illustrated in Figure 7.
To guide the learning process of the policy model, we first trained a reward model as the foundational step. While a more straightforward approach would involve directly employing the classification model to train the policy model, practical experimentation revealed suboptimal outcomes: the policy model may generate garbled outputs characterized by anomalous character sequences. The underlying intuition is that while classification models can identify correct labels, they may not serve as effective instructors to guide policy models in optimizing their parameters. To enhance mutual comprehension between the policy model and reward model during reinforcement learning, aligning their parameter distributions during the initialization phase is recommended [25].
This alignment mechanism mirrors the phenomenon where students sharing congruent knowledge frameworks demonstrate heightened mutual understanding. We removed the final unembedding layer from the reference model and replaced it with a linear output layer. The refined reference model, formally designated as the reward model, now generates a scalar value through its forward propagation, serving as a predictive metric for estimating the reward value of the input text sequence.
The reward model processes a pair of chosen and rejected analyses as inputs, generating corresponding predicted reward values for comparative evaluation. To enforce the fundamental preference alignment principle that chosen responses should receive higher rewards than rejected counterparts, we employ the following loss function:
$$\mathcal{L}(\varphi) = -\ln \sigma\!\left( r(L_T, L_A^{\text{chosen}}) - r(L_T, L_A^{\text{rejected}}) \right), \qquad (1)$$
where $\varphi$ denotes the trainable parameters of the reward model, $\sigma$ is the sigmoid function, and $r(L_T, L_A)$ represents the reward value predicted by the reward model when processing the input log template $L_T$ and log analysis $L_A$.
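A scalar sketch of this pairwise loss, using plain Python floats in place of model outputs (helper names are ours):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: small when the chosen analysis outscores
    the rejected one, large when the ordering is violated."""
    return -math.log(sigmoid(r_chosen - r_rejected))
```

The loss equals ln 2 when both rewards tie and shrinks as the margin between chosen and rejected rewards grows.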
After completing the training of the reward model, we can proceed to train the policy model. We employ gradient ascent, a classical reinforcement learning technique, for parametric refinement of the policy model, as rigorously expressed in (2):
$$\theta_{\text{new}} = \theta_{\text{old}} + \alpha \nabla_\theta J(\theta), \qquad (2)$$
where $\alpha$ is the learning rate, $\theta$ represents the parameters of the policy model, and $J(\theta)$ is the objective function to be maximized.
In this context, the interactive dynamics of the classification model are designed as the “environment”. At time step $t$, the intelligent agent (LLM) resides in state $s_t$ (defined as the text sequence up to the current time step) and executes action $a_t$ (predicting the next token based on the current input text sequence). The policy $\pi$ of the LLM is exclusively governed by the model’s parameters $\theta$, so the strategy can be formally denoted as $\pi_\theta(a_t \mid s_t)$. After the policy model completes the entire output sequence through $T$ sequential steps, a trajectory $\tau = \{s_1, a_1, \ldots, s_T, a_T\}$ is generated, whereupon the reward model assigns a corresponding scalar reward $r_T$ to evaluate the complete action sequence. Following the framework established by Ouyang et al. (2022) [26], a sequence-level reward is used, where the reward model’s score is assigned to the end of the sequence:
$$r_t = \begin{cases} r(L_T, L_A), & \text{if } t = T,\\ 0, & \text{if } t < T. \end{cases} \qquad (3)$$
This is because the reward model evaluates the complete response; intermediate tokens are transitional states lacking full semantic integrity and cannot be assessed individually. The return $R_t$ is formulated as the exponentially discounted summation of future rewards, expressed in (4):
$$R_t = \sum_{k=t}^{T} \gamma^{\,k-t} r_k, \quad t \le T, \qquad (4)$$
where γ denotes the decay factor. This formulation systematically reduces the weight of rewards from temporally distant states relative to the terminal state, reflecting their diminished relevance to the final return estimation.
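Combining the sequence-level reward in (3) with the discounted return in (4), a single backward pass computes every $R_t$; a hedged sketch (helper name ours):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every step t
    by accumulating rewards backward through the trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With rewards of the form [0, 0, ..., r_T], only the terminal reward is propagated backward, discounted by γ at each step.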
The policy gradient is calculated by (5):
$$\nabla_\theta J(\theta) = \mathbb{E}_\tau \left[ \sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, \Phi_t \right], \qquad (5)$$
where $\Phi_t$ can be taken directly as $R_t$; engineering implementations, however, typically adopt the advantage function $A(t)$, which substantially reduces variance in practical applications. A more detailed introduction of $A(t)$ is given in Section 4.4.
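Although the gradient above is the vanilla policy gradient, PPO additionally constrains each update with a clipped surrogate objective; a per-token scalar sketch, where the clipping range ε = 0.2 and the function name are illustrative assumptions:

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    `ratio` is pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to move the new policy far from the old one in a single update."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)
```

Maximizing this objective instead of the raw surrogate prevents destructively large policy updates, which is why PPO is well suited to aligning the LLM with the classifier.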

4. Experiments

In this section, we evaluate our approach by answering the following research questions (RQs):
  • RQ1: How effective is LogPPO under conditions of data scarcity?
  • RQ2: How effective are the LogPPO components?
  • RQ3: Why is the Decoder-Encoder architecture effective for incorporating natural language analyses?
  • RQ4: How do hyperparameter settings affect the convergence of LogPPO?

4.1. Experimental Metrics and Datasets

Our experiments are conducted on three widely-adopted benchmark datasets (BGL, Spirit, Thunderbird) along with a novel dataset named KnowLog (dataset). The evaluation metric is the F1-Score [27], which comprehensively integrates both Precision and Recall through their harmonic mean, ensuring a balanced and robust assessment of model performance. The details of the experimental datasets are described as follows:
The Blue Gene/L (BGL) dataset [28,29], collected at Lawrence Livermore National Laboratory (LLNL), comprises 4,747,963 system log messages. Anomaly labels are determined by the first character of each log message: entries whose leading field is a hyphen (“-”) are non-alert (normal) messages, while any other tag marks an alert. The Spirit and Thunderbird datasets [28] were both collected at Sandia National Laboratories (SNL), comprising 272,298,969 and 211,212,192 log messages respectively, and serve as critical resources for AIOps research in log anomaly detection. Each log entry has been manually annotated with an alert category tag to indicate whether it is normal or not. The KnowLog dataset directly extracts log templates and their corresponding descriptions from device vendors’ technical documentation. These descriptions, which encapsulate the vendors’ official interpretations of the logs, are utilized as the ground truth to classify whether log templates are normal or not.
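The BGL labeling rule reduces to a one-line check; a sketch following the convention that a leading hyphen marks a non-alert (normal) message (function name ours):

```python
def bgl_is_anomalous(raw_line: str) -> bool:
    """BGL convention: the leading label field is '-' for non-alert (normal)
    messages; any other alert tag marks the entry as anomalous."""
    return not raw_line.startswith("-")
```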

4.2. RQ1: How Effective Is LogPPO Under Conditions of Data Scarcity?

Initially, we employed Drain to extract log templates from four publicly available log datasets. We posit that accurate comprehension of log templates serves as the cornerstone for proper comprehension of sequences; consequently, the experiments in this study focus predominantly on log template analysis. In our experimental configuration, the training and test sets were partitioned at a ratio of 1:9 through randomized sampling. Specifically, this resulted in a training set size of 34 samples for BGL, 73 for Spirit, 93 for Thunderbird, and 150 for the KnowLog dataset. This setup deliberately mirrors operational scenarios where annotated datasets are substantially smaller than the unlabeled log data.
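The randomized 1:9 partition can be reproduced in a few lines; the seed, helper name, and ratio parameter are our own choices:

```python
import random

def few_shot_split(samples, train_ratio=0.1, seed=42):
    """Randomly partition templates into a small training set and a
    large test set (1:9 by default), without replacement."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]
```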
Drawing upon the latest research advancements, we have selected three Transformer-based methodologies, incorporating two embedding-based approaches [2,3] and one generation-based model [5] as comparative baselines. To balance the experiment, we implemented LogPrompt using few-shot prompt engineering, providing it with demonstration examples to mirror the information access of our trained model. We observed that adding these examples successfully improved LogPrompt’s performance metrics, indicating that the baseline benefited significantly from this setup compared to a purely zero-shot approach. Consequently, to ensure the most rigorous comparison, we selected the best results achieved across these configurations. The experimental outcomes are summarized in Table 1.
Our method has demonstrated superior performance across all four datasets. Concurrently, we observe that embedding-based methodologies yield suboptimal F1-Scores under data scarcity conditions, which aligns with empirical findings demonstrating the positive correlation between the efficacy of embedding-based approaches and dataset scale in prior experimental validations [5].
As shown in Table 1, embedding-based methods (BigLog and KnowLog) are hampered by extremely low recall, indicating they fail to capture the vast majority of actual anomalies. For instance, on the Thunderbird and Spirit datasets, BigLog demonstrates limited sensitivity towards anomalies. Conversely, the generation-based method (LogPrompt) struggles with low precision (hovering around 11–13% on datasets like BGL, Spirit, and Thunderbird), implying it is prone to excessive false alarms despite its high recall.
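The harmonic mean underlying the F1-Score makes this trade-off explicit: either a very low precision or a very low recall drags the score toward zero. A minimal helper (ours, not from the evaluation code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, a detector with high recall but precision near 0.12 is still capped at a low F1, which is the pattern observed for the generation-based baseline.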
This comparison highlights the specific contribution of our PPO component that PPO algorithm strikes a critical balance. It effectively addresses the “low precision” issue of LogPrompt without reverting to the “low recall” limitations of embedding methods. By aligning the LLM with the classifier’s preferences, LogPPO significantly reduces false positives to achieve a balance between identifying actual faults and minimizing false alarms. Consequently, LogPPO achieves the highest F1-Scores across all datasets, demonstrating its robustness in data-scarcity scenarios.
The utilization of human linguistic knowledge demonstrates a substantial enhancement in F1-Scores, elevating performance by 5% to 86% compared to the best-performing baseline LogPrompt, and demonstrates an even more pronounced advantage over the other two baselines. When the available labeled log data for training is significantly outnumbered by the unlabeled logs requiring detection, encoder-based models struggle to capture enough knowledge for effective anomaly identification. However, the natural language knowledge retained within the classification model enables more effective capturing of semantic correlations among log keys, thereby facilitating comprehensive log comprehension and enhancing classification performance. While generation-based methods exhibit stable F1-Scores, their lack of fine-tuning prevents them from fully leveraging the classification knowledge embedded in annotated log data, thereby failing to achieve state-of-the-art performance.
While BigLog employs straightforward pre-training tasks to acquire general-purpose vector representations, its lack of specialized optimization for domain-specific terminology and abbreviations results in underperformance compared to KnowLog (model). KnowLog (model) incorporates contrastive learning methods to enhance its acquisition of domain-specific knowledge and technical abbreviations. However, it fails to fully leverage the substantial natural human linguistic knowledge preserved within the model architecture, resulting in incomplete information capture and suboptimal utilization of the model’s inherent capabilities. KnowLog (model) demonstrates notable performance on the KnowLog dataset, primarily attributable to its domain-specific training regimen on analogous data architectures, thereby achieving an elevated F1-Score. Notably, even under these optimized conditions, the integration of analyses can further enhance anomaly detection robustness.
Notably, BigLog yielded an F1-Score of 0 on the Spirit and Thunderbird datasets. Our examination of the prediction outputs confirms that the model converged to a trivial “All-Normal” prediction strategy, classifying all test instances as negative. This phenomenon arises because the optimization algorithm succumbed to “Majority Class Bias” [30] under the dual constraints of severe data scarcity (1:9 split) and extreme dataset imbalance (skewed class distribution). Because there were too few anomalous samples for the model to learn their distinct patterns, the classifier minimized the global training loss by predicting the majority label (“Normal”) for every instance. This degenerate solution results in zero True Positives, mathematically forcing the Recall and F1-Score to zero.
To evaluate the incremental effectiveness of incorporating natural language analyses, we conducted a visualization-based comparative experiment. The 768-dimensional vector representations generated by embedding models pose multifaceted analytical challenges within their native high-dimensional space, a phenomenon predominantly attributable to the intrinsic curse of dimensionality inherent in high-dimensional vector spaces. Therefore, this study employs the classical dimensionality reduction algorithm, Principal Component Analysis (PCA) [31], to project high-dimensional vectors into a two-dimensional subspace while preserving critical information through variance maximization, thereby facilitating subsequent visualization and analytical processes.
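The projection step can be sketched with NumPy's SVD; the helper name is ours, and the actual experiments may instead use a library implementation such as scikit-learn's PCA:

```python
import numpy as np

def pca_2d(vectors: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings onto their top-2 principal components."""
    centered = vectors - vectors.mean(axis=0)
    # SVD of the centered data; rows of vt are principal directions,
    # ordered by the variance they explain.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

Because singular values are sorted in decreasing order, the first projected coordinate always carries at least as much variance as the second.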
We performed randomized sampling of both normal and abnormal data subsets. As depicted in Figure 8, within the two-dimensional projection space, normal logs (red) and abnormal logs (blue) demonstrate a marked overlapping distribution pattern. In comparison, with the incorporation of natural language analyses, the projection results between different classes exhibit distinct inter-class separation characteristics. After integrating natural language analyses, the model demonstrates enhanced comprehension of contextual semantic patterns in log data at the representational level. This advancement reveals distinctive differences between normal and abnormal data distributions, thereby significantly reducing the complexity of subsequent classification tasks by establishing more distinct decision boundaries in the latent space.
To further validate the robustness of LogPPO under strict data scarcity, we evaluated the model’s performance across varying few-shot scenarios. As the training size increased from 5 to 30 samples, the F1-score demonstrated a consistent upward trend, as illustrated in Figure 9, improving from 0.443 (5-shot) to 0.499 (30-shot). This indicates that LogPPO can effectively extract patterns even with extremely limited data (e.g., 5 samples) while steadily benefiting from additional annotated examples.

4.3. RQ2: How Effective Are the LogPPO Components?

To further validate the effectiveness of our components, we conducted an ablation study comparing three settings: (1) Log Template only, (2) Log Template combined with human-mimicking analyses generated by the Pre-PPO LLM (via specific prompt engineering to simulate expert reasoning), and (3) Log Template combined with Post-PPO analyses.
As shown in Table 2, the integration of human-mimicking analyses surpasses the template-only baseline in most cases (e.g., F1-score increases from 0.056 to 0.105 on BGL), confirming the value of semantic explanations (“analysis matters”). However, the Post-PPO model achieves the best performance across all datasets (e.g., reaching 0.444 on BGL). This confirms the effectiveness of PPO optimization (“alignment matters”). While human-like linguistic features are beneficial, the alignment provided by PPO—optimizing the analysis specifically for the classifier’s discrimination—is essential for maximizing anomaly detection accuracy.
Following the previous subsection, we map two categories of semantic vectors, “log template combined with pre-PPO analysis” and “log template combined with post-PPO analysis”, into a two-dimensional subspace through dimensionality reduction. Figure 10 demonstrates that the “log template combined with pre-PPO analysis” approach exhibits obvious cluster overlaps in specific regions. Conversely, the “log template combined with post-PPO analysis” methodology manifests substantially enhanced inter-class separability with clearer cluster boundaries.
The separation observed in Figure 10 illustrates the positive impact of PPO on semantic vector distribution. Specifically, the “Normal” cluster (blue) represents logs describing routine operations and informational updates, whereas the “Abnormal” cluster (red) represents failure states. This improved separability stems from the model’s capability to interpret and articulate domain-specific expertise. While raw logs often contain obscure abbreviations (e.g., “TLB error”) that lead to high-dimensional overlap, the generated analyses translate these into explicit semantic features. Furthermore, PPO aligns the LLM with the classifier. It teaches the LLM to use specific words that the classifier recognizes easily—such as explicitly describing a state as “routine” or “critical.” This clear difference in wording naturally pushes the semantic vectors apart, creating the clear separation between the normal and abnormal groups observed in the figures.
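The dimensionality reduction behind Figure 10 can be sketched with a plain principal component analysis (PCA) [31]. The snippet below is a minimal illustration, assuming the semantic vectors are already available as a NumPy array; it is not the exact pipeline used in our experiments.

```python
import numpy as np

def project_2d(vectors):
    """Project high-dimensional semantic vectors onto their top two
    principal components (plain-NumPy PCA)."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)                    # center the embeddings
    # SVD of the centered matrix: rows of Vt are the principal axes,
    # ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                       # (n_samples, 2) coordinates
```

Plotting the two columns of the result, colored by the normal/abnormal label, reproduces the kind of scatter shown in Figure 10.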
We formally define the confidence score of a classification model as the predicted probability associated with the real label, as shown in Figure 6. Figure 11 presents the curve of classification-model confidence scores versus optimization steps, demonstrating a notable 5% improvement in confidence after 100 optimization steps. This indicates that the generation model dynamically adjusts its policy parameters through the PPO process to produce textual outputs that align with the classification model's learned patterns, enabling the classification model to develop steadily growing confidence in its predictions.
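As a concrete illustration of this definition, the confidence score is simply the softmax probability the classifier assigns to the ground-truth label. A minimal sketch (variable names are ours, not from the released code):

```python
import numpy as np

def confidence_score(logits, true_label):
    """Softmax probability assigned to the real label: the
    'confidence score' tracked in Figure 11."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                     # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs[true_label])
```

For a binary normal/abnormal head with logits [2.0, 0.5] and true label 0, the score is about 0.82; PPO training pushes this average upward over optimization steps.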

4.4. RQ3: Why Is the Decoder-Encoder Architecture Effective for Incorporating Natural Language Analyses?

During the fine-tuning process, pre-trained language models often experience Catastrophic Forgetting [32], a phenomenon characterized by the gradual erosion of general linguistic knowledge acquired during their initial pre-training phase. To evaluate the extent of knowledge retention in our encoder-based classification model, we utilized the General Language Understanding Evaluation (GLUE) benchmark [33], specifically employing the MRPC, SST-2, and STS-B tasks. Figure 12 illustrates the variation of testing metrics with respect to fine-tuning epochs. After 50 epochs of fine-tuning, the three benchmarks showed only marginal declines (0.030, 0.057, and 0.084), confirming that the model effectively retains the linguistic knowledge acquired during pre-training.
This finding validates our proposed Decoder-Encoder architecture. Our design relies on the encoder's ability to process hybrid inputs: the rigid, technical syntax of raw logs and the flexible, natural language analyses generated by the LLM. If the encoder were to lose its general linguistic capability through overfitting on logs, it would fail to comprehend the semantic context provided by the LLM's analyses. The observed stability therefore confirms that the KnowLog model operates with "dual capabilities": it possesses the domain expertise to interpret system logs while retaining the general "common sense" required to understand human-readable explanations. This synergy allows the classifier to leverage the high-quality insights from the decoder, directly leading to the improved anomaly detection performance reported in Section 4.2.

4.5. RQ4: How Do Hyperparameter Settings Affect the Convergence of LogPPO?

The PPO algorithm leverages the Generalized Advantage Estimation (GAE) methodology [34] to define the advantage function, enabling efficient and stable convergence dynamics. As in the preceding methodological framework, $s_t$ denotes the agent's state at time $t$, and the value function $V(s_t)$ estimates the expected return attributable to state $s_t$. When the agent executes action $a_t$ and transitions to $s_{t+1}$, it receives the reward $r_t$ together with the post-transition value estimate $V(s_{t+1})$. Here, $V(s_t)$ and $V(s_{t+1})$ are state-value estimates produced by the learned value function, whereas $r_t$ is the environmentally generated reward signal, which aligns directly with the optimization objective. The expected return of state $s_t$ therefore admits two estimates, $r_t + \gamma V(s_{t+1})$ and $V(s_t)$; the former embodies the critical optimization direction because it explicitly integrates immediate environmental feedback. The parameter $\gamma$ serves as a discount factor governing the exponential decay of future reward contributions.
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). \tag{6}$$
Hence, (6) quantifies the residual divergence between the agent's predicted return and the empirically observed outcome; this residual is denoted $\delta_t$.
Aggregating these residuals over multiple steps with exponentially decaying weights $\gamma^l$ yields the standard expression for the k-step advantage:
$$\hat{A}_t^{(k)} = \sum_{l=0}^{k-1} \gamma^l \delta_{t+l}. \tag{7}$$
The advantage function defined by GAE is then obtained by applying the weighting coefficients $(1-\lambda)\lambda^{k-1}$ to form a weighted average of the k-step advantages:
$$\hat{A}_t^{GAE(\gamma,\lambda)} = (1-\lambda)\left(\hat{A}_t^{(1)} + \lambda \hat{A}_t^{(2)} + \lambda^2 \hat{A}_t^{(3)} + \cdots\right) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}. \tag{8}$$
For a trajectory comprising $T$ steps, the mean advantage over the trajectory is calculated through (9):
$$\overline{\hat{A}_t^{GAE(\gamma,\lambda)}} = \frac{1}{T}\sum_{t=0}^{T} \hat{A}_t^{GAE(\gamma,\lambda)}. \tag{9}$$
The closer the advantage values approach zero, the smaller the magnitude of parameter updates in the generation model, indicating progressive convergence.
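The estimator defined by (6), (8) and (9) is straightforward to implement with the usual backward recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The following is a minimal NumPy sketch for a single trajectory, not the project's actual training code:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.9, lam=0.9):
    """Per-step GAE advantages for one trajectory.
    `values` holds V(s_0)..V(s_T), one more entry than `rewards`.
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), then
    A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    r = np.asarray(rewards, dtype=float)
    v = np.asarray(values, dtype=float)
    deltas = r + gamma * v[1:] - v[:-1]
    adv = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):  # backward recursion
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv
```

Setting `lam=0` collapses each advantage to the single-step TD error, while `lam=1` recovers the Monte Carlo form; the trajectory mean of the returned array is the convergence indicator of (9).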
From (8), it is clear that the hyperparameters γ and λ are the principal governing factors in the parametric configuration space. This section therefore presents experiments on these two hyperparameters to demonstrate their influence on convergence.
Equation (8) also reveals that rewards propagate further in time as γ approaches 1. Setting γ = 1 yields the undiscounted cumulative k-step advantage shown in (10):
$$\hat{A}_t^{(k)} = \delta_t + \delta_{t+1} + \cdots + \delta_{t+k-1}. \tag{10}$$
The temporal propagation of estimation errors arises from the inherent discrepancy between the value model's estimates $V(s_t)$ and their ground-truth counterparts. The magnitude of error amplification grows with the discount factor γ, so higher values of γ may hinder convergence. As empirically demonstrated in Figure 13, the configuration with γ = 0.99 converges markedly more slowly than its γ = 0.9 counterpart.
The closer γ approaches zero, the more the agent prioritizes short-term rewards over long-term considerations. When γ is set to 0, only the discrepancy at the current time step is taken into account, and the k-step advantage estimate collapses to the single-step Temporal Difference (TD) error:
$$\hat{A}_t^{(k)} = \delta_t. \tag{11}$$
This configuration imposes pronounced limitations on text-oriented tasks that require long-horizon planning, and can prevent the optimization from converging. Figure 13 exemplifies this with γ = 0.8, where persistent oscillations of the advantage values emerge during later training phases, ultimately precipitating a failure to converge.
The role of λ is similar to that of γ, but λ governs the trade-off between variance and bias in the advantage function. As λ approaches 1, the advantage computation progressively converges to the Monte Carlo estimate. When λ is set to 1, the GAE expression condenses into:
$$GAE(\gamma,1):\quad \hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t). \tag{12}$$
The Monte Carlo estimate aggregates empirical rewards across full trajectories and depends only minimally on the value function's predictions, so its estimation bias is theoretically minimal. However, because each trajectory reward is a stochastic variable, summing successive rewards accumulates variance. Substantial variance introduces excessive noise, which may decelerate convergence.
As exemplified in Figure 14, convergence with λ = 0.99 is comparatively slower than with λ = 0.90. As λ approaches zero, the advantage computation increasingly approximates the single-step TD error. When λ is set to zero, the GAE formulation reduces to:
$$GAE(\gamma,0):\quad \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t). \tag{13}$$
In this configuration the GAE computation relies on the fewest stochastic variables and thus exhibits the least randomness, but it hinges almost entirely on the value function's estimates. When those estimates are inaccurate, they introduce systematic errors into the advantage computation, producing maximal bias. Excessive bias drives the estimates far from the true advantage values, precipitating convergence failure in the iterative optimization. As illustrated in Figure 14, the λ = 0.8 trajectory oscillates during later training phases, showing no tendency to converge.
Beyond convergence dynamics, we further investigated the sensitivity of the final anomaly detection performance to two critical PPO hyperparameters: the learning rate and the clip ratio. We selected the F1-Score as the primary metric for this evaluation and performed a grid search on the BGL dataset, testing learning rates of $\{5\times10^{-6}, 1\times10^{-5}, 5\times10^{-5}\}$ paired with clip ratios of $\{0.1, 0.2, 0.3\}$.
As summarized in Table 3, model performance exhibits clear sensitivity to these configurations. A learning rate of $1\times10^{-5}$ combined with a clip ratio of 0.2 yielded the optimal F1-Score of 0.4444. We observed that determining the appropriate step size is crucial. For instance, at the optimal learning rate, increasing the clip ratio to 0.3 caused a significant performance decline to 0.2759, suggesting that overly aggressive policy updates may destabilize the alignment between the generation and classification models. Conversely, a lower learning rate of $5\times10^{-6}$ constrained the F1-Scores to a suboptimal range (0.22–0.28) regardless of the clip ratio, indicating insufficient policy exploration.
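The sweep behind Table 3 amounts to a simple nested loop. The sketch below is illustrative: `train_and_eval` is a hypothetical placeholder for a full PPO training run that returns an F1-Score on the validation split.

```python
def grid_search(train_and_eval,
                learning_rates=(5e-6, 1e-5, 5e-5),
                clip_ratios=(0.1, 0.2, 0.3)):
    """Exhaustive sweep over (learning rate, clip ratio) pairs,
    keeping the configuration with the best F1-Score."""
    best_lr, best_clip, best_f1 = None, None, -1.0
    for lr in learning_rates:
        for clip in clip_ratios:
            f1 = train_and_eval(lr, clip)   # one full PPO training run
            if f1 > best_f1:
                best_lr, best_clip, best_f1 = lr, clip, f1
    return best_lr, best_clip, best_f1
```

On BGL, this kind of sweep selected the learning rate 1e-5 with clip ratio 0.2 (F1 = 0.4444).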

5. Discussion of Computational Cost and Deployment

5.1. Computational Cost

To evaluate the practical feasibility of our LLM-based approach, we analyzed the operational costs using an LLaMA-2-7B model on a single NVIDIA A10 GPU. With batch processing, the system achieves a throughput of approximately 3600 logs per hour. Based on AWS cloud pricing (specifically the g5.xlarge instance at $1.01/h), the computational cost is estimated at $0.3 per 1000 logs. In contrast, manual inspection by a skilled expert (assuming a $15/h wage and a throughput of 120 logs/h) incurs a cost of $125.00 for the same volume.
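The cost comparison reduces to a one-line throughput calculation; the sketch below reproduces the figures quoted above (AWS g5.xlarge at $1.01/h and roughly 3600 logs/h for the LLM, versus $15/h and 120 logs/h for a human expert).

```python
def cost_per_1000_logs(price_per_hour, logs_per_hour):
    """Cost in dollars to analyze 1000 logs at a given hourly
    price and throughput."""
    return 1000.0 * price_per_hour / logs_per_hour

llm_cost = cost_per_1000_logs(1.01, 3600)    # about $0.28 per 1000 logs
human_cost = cost_per_1000_logs(15.0, 120)   # $125.00 per 1000 logs
```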
Beyond monetary savings, the proposed architecture offers critical advantages in scalability and availability. Smart city platforms operate around the clock, generating logs continuously. Unlike human operators constrained by fatigue and working hours, our system provides continuous, consistent analysis. We argue that the computational cost is a justifiable investment to automate a task that was previously labor-intensive and unscalable. Furthermore, future deployments can leverage 4-bit quantization techniques (e.g., GPTQ [35] or AWQ [36]) to reduce VRAM usage from 14 GB to 6 GB, enabling deployment on GPUs with less memory.

5.2. Deployment Challenges and Continuous Adaptation

In future work, we will focus on bridging the gap between experimental validation and real-world deployment in large-scale smart city platforms. To address deployment challenges such as inference latency, knowledge distillation can be explored to transfer the reasoning capabilities of the larger PPO-aligned LLM into lightweight architectures, improving the real-time analysis capability.
Furthermore, to handle potential variations in future log data streams, continuous learning frameworks driven by pseudo-labeling can be employed. High-confidence predictions on unlabeled data can be incorporated for self-training, while only uncertain, low-confidence samples are routed to human experts. Larger generation and classification models can also be used to improve labeling quality. This effectively addresses the challenge of data drift and reduces the labeling cost required to maintain model reliability in evolving environments.
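The confidence-based routing described above can be sketched as follows; `predict_proba` stands in for any classifier returning the probability of the abnormal class, and the 0.9 threshold is purely illustrative.

```python
def route_predictions(samples, predict_proba, threshold=0.9):
    """Split unlabeled samples: confident predictions become
    pseudo-labels for self-training, uncertain ones go to experts."""
    self_train, to_expert = [], []
    for sample in samples:
        p_abnormal = predict_proba(sample)
        confidence = max(p_abnormal, 1.0 - p_abnormal)
        if confidence >= threshold:
            # pseudo-label: 1 = abnormal, 0 = normal
            self_train.append((sample, int(p_abnormal >= 0.5)))
        else:
            to_expert.append(sample)           # human review queue
    return self_train, to_expert
```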

6. Conclusions

In large-scale smart city platforms, system logs serve as high-fidelity data streams, capturing causal relationships and error contexts with fine temporal granularity, making them essential for maintaining operational resilience and reliability. However, generating high-quality labeled datasets for anomaly detection remains prohibitively expensive and time-consuming, creating a persistent challenge of data scarcity. This makes the optimal utilization of existing datasets a valuable problem. Most previously proposed methodologies employ domain-specific knowledge from log data to tune pre-trained models, with the ultimate objective of deriving a classification model through this specialized adaptation. Our experimental findings indicate that such fine-tuning does not substantially degrade general natural language knowledge: the classification model retains enough capacity to comprehend and process natural human linguistic expressions. We therefore employ LLMs to generate analyses of log templates and feed these analyses, alongside the templates, into the classification model. This methodology exploits both the classification model's inherent natural language processing capabilities and its domain-specific expertise in professional contexts. In parallel, this study conducted experiments under data-scarce conditions across four publicly available datasets to simulate real-world operational constraints. Three Transformer-based methods were selected as baselines, and the experimental results demonstrate that LogPPO achieves a substantial improvement in F1-Score, with gains ranging from 5% to 86% over the best-performing baseline LogPrompt and a pronounced advantage over the other two baselines.
The empirical findings demonstrate that the integration of natural language analysis techniques substantially enhances the model’s semantic comprehension of log data.
To enhance the interpretability of LLM-generated analyses for classification models, we employ the PPO algorithm to align the generative preferences of LLMs with the semantic comprehension preferences of classification models. The experimental results demonstrate that classification models exhibit greater familiarity with the semantic patterns in analyses generated by PPO-optimized generation models, resulting in statistically significant increases in the probabilities assigned to real labels. Furthermore, this study systematically investigates the impact of hyperparameter configurations on the convergence properties of PPO, offering practical guidance for engineering implementations.
In the future, we will remain committed to exploring methodologies that optimize the utilization of existing datasets, such as enhancing the efficacy of PPO algorithms or investigating alternative reinforcement learning frameworks. Concurrently, we will investigate methodologies to endow LLMs with enriched domain-specific knowledge, thereby facilitating the generation of more accurate analyses.

Author Contributions

Z.W., J.D. and C.Y. conceived the idea for this work. C.Y. supervised the research. Z.W. was responsible for the theoretical derivation, algorithm development, fabrication, and manuscript preparation. Z.W. and J.D. built the measurement system and conducted experimental measurements. Z.W., J.D. and C.Y. reviewed and revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key Research and Development Program of China (No.2024YFB2908303).

Data Availability Statement

The data supporting this study’s findings are available from the corresponding author upon reasonable request.

Acknowledgments

During the preparation of this study, the authors used LLaMA 2-7B model for the purposes of generating log analysis text and creating datasets required for reinforcement learning experiments. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. He, S.; He, P.; Chen, Z.; Yang, T.; Su, Y.; Lyu, M.R. A Survey on Automated Log Analysis for Reliability Engineering. ACM Comput. Surv. 2021, 54, 130. [Google Scholar] [CrossRef]
  2. Ma, L.; Yang, W.; Xu, B.; Jiang, S.; Fei, B.; Liang, J.; Zhou, M.; Xiao, Y. KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024. [Google Scholar] [CrossRef]
  3. Tao, S.; Liu, Y.; Meng, W.; Ren, Z.; Yang, H.; Chen, X.; Zhang, L.; Xie, Y.; Su, C.; Qiao, X.; et al. Biglog: Unsupervised Large-scale Pre-training for a Unified Log Representation. In Proceedings of the 2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS), Orlando, FL, USA, 19–21 June 2023; pp. 1–11. [Google Scholar] [CrossRef]
  4. Guo, H.; Yuan, S.; Wu, X. LogBERT: Log Anomaly Detection via BERT. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  5. Liu, Y.; Tao, S.; Meng, W.; Yao, F.; Zhao, X.; Yang, H. LogPrompt: Prompt Engineering Towards Zero-Shot and Interpretable Log Analysis. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), Lisbon, Portugal, 14–20 April 2024; pp. 364–365. [Google Scholar] [CrossRef]
  6. Chen, R.; Zhang, S.; Li, D.; Zhang, Y.; Guo, F.; Meng, W.; Pei, D.; Zhang, Y.; Chen, X.; Liu, Y. LogTransfer: Cross-System Log Anomaly Detection for Software Systems with Transfer Learning. In Proceedings of the 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE), Coimbra, Portugal, 12–15 October 2020; pp. 37–47. [Google Scholar] [CrossRef]
  7. Wang, J.; Chu, G.; Wang, J.; Sun, H.; Qi, Q.; Wang, Y.; Qi, J.; Liao, J. LogExpert: Log-based Recommended Resolutions Generation using Large Language Model. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER), Lisbon, Portugal, 14–20 April 2024; pp. 42–46. [Google Scholar] [CrossRef]
  8. Wang, X.; Song, J.; Zhang, X.; Tang, J.; Gao, W.; Lin, Q. LogOnline: A Semi-Supervised Log-Based Anomaly Detector Aided with Online Learning Mechanism. In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 141–152. [Google Scholar] [CrossRef]
  9. Le, V.H.; Zhang, H. Log-based Anomaly Detection Without Log Parsing. In Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia, 15–19 November 2021; pp. 492–504. [Google Scholar] [CrossRef]
  10. Jiang, Z.M.; Hassan, A.E.; Flora, P.; Hamann, G. Abstracting Execution Logs to Execution Events for Enterprise Applications (Short Paper). In Proceedings of the 2008 The Eighth International Conference on Quality Software, Oxford, UK, 12–13 August 2008; pp. 181–186. [Google Scholar] [CrossRef]
  11. Le, V.H.; Zhang, H. Log Parsing with Prompt-based Few-shot Learning. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 2438–2449. [Google Scholar] [CrossRef]
  12. Du, M.; Li, F. Spell: Streaming Parsing of System Event Logs. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 859–864. [Google Scholar] [CrossRef]
  13. He, P.; Zhu, J.; Zheng, Z.; Lyu, M.R. Drain: An Online Log Parsing Approach with Fixed Depth Tree. In Proceedings of the 2017 IEEE International Conference on Web Services (ICWS), Honolulu, HI, USA, 25–30 June 2017; pp. 33–40. [Google Scholar] [CrossRef]
  14. Le, V.H.; Zhang, H. Log Parsing: How Far Can ChatGPT Go? In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 November 2023; pp. 1699–1704. [Google Scholar] [CrossRef]
  15. Ma, Z.; Chen, A.R.; Kim, D.J.; Chen, T.H.P.; Wang, S. LLMParser: An Exploratory Study on Using Large Language Models for Log Parsing. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE), Lisbon, Portugal, 14–20 April 2024; pp. 1209–1221. [Google Scholar] [CrossRef]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  17. Wang, T.; Zhu, Q. ChatGPT—Technical Research Model, Capability Analysis, and Application Prospects. In Proceedings of the 2024 IEEE 7th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 15–17 March 2024; Volume 7, pp. 787–796. [Google Scholar] [CrossRef]
  18. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  19. Nguyen, H.T.; Nguyen, L.V.; Le, V.H.; Zhang, H.; Le, M.T. Efficient Log-based Anomaly Detection with Knowledge Distillation. In Proceedings of the 2024 IEEE International Conference on Web Services (ICWS), Shenzhen, China, 7–13 July 2024; pp. 578–589. [Google Scholar] [CrossRef]
  20. Hearst, M.; Dumais, S.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  21. Liang, Y.; Zhang, Y.; Xiong, H.; Sahoo, R. Failure Prediction in IBM BlueGene/L Event Logs. In Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA, 28–31 October 2007; pp. 583–588. [Google Scholar] [CrossRef]
  22. Zhao, N.; Wang, H.; Li, Z.; Peng, X.; Wang, G.; Pan, Z.; Wu, Y.; Feng, Z.; Wen, X.; Zhang, W.; et al. An empirical investigation of practical log anomaly detection for online service systems. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, 23–28 August 2021; pp. 1404–1415. [Google Scholar] [CrossRef]
  23. Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.M.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; Christiano, P. Learning to summarize from human feedback. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  24. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  25. Zheng, R.; Dou, S.; Gao, S.; Hua, Y.; Shen, W.; Wang, B.; Liu, Y.; Jin, S.; Liu, Q.; Zhou, Y.; et al. Secrets of RLHF in Large Language Models Part I: PPO. arXiv 2023, arXiv:2307.04964. [Google Scholar] [CrossRef]
  26. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Sydney, Australia, 2022; Volume 35, pp. 27730–27744. [Google Scholar]
  27. Powers, D. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  28. Oliner, A.; Stearley, J. What Supercomputers Say: A Study of Five System Logs. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07), Edinburgh, UK, 25–28 June 2007; pp. 575–584. [Google Scholar] [CrossRef]
  29. Zhu, J.; He, S.; He, P.; Liu, J.; Lyu, M.R. Loghub: A large collection of system log datasets for ai-driven log analytics. In Proceedings of the 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), Florence, Italy, 9–12 October 2023; pp. 355–366. [Google Scholar]
  30. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
  31. Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
  32. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  33. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP; Linzen, T., Chrupała, G., Alishahi, A., Eds.; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 353–355. [Google Scholar] [CrossRef]
  34. Jaques, N.; Ghandeharioun, A.; Shen, J.H.; Ferguson, C.; Lapedriza, A.; Jones, N.; Gu, S.; Picard, R. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv 2019, arXiv:1907.00456. [Google Scholar] [CrossRef]
  35. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2022, arXiv:2210.17323. [Google Scholar]
  36. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Xiao, G.; Han, S. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Getmobile Mobile Comp. Comm. 2025, 28, 12–17. [Google Scholar] [CrossRef]
Figure 1. An example of LogPPO workflow. The blue arrow denotes the initial phase: gaining the analysis through a Post-PPO LLM (referred to as the generation model). The orange arrow signifies the subsequent step whereby both the log template and analysis are synergistically input into the classification model. The green arrow marks the final stage, wherein the classification model produces predictive outcomes.
Figure 2. An example of log parsing. Upon parsing, the raw log is segmented into distinct components, with event part emerging as the most critical constituent.
Figure 3. Generalized architectures of embedding-based method, generation-based method and LogPPO. (a) Most embedding-based methods transform log templates into semantic vectors, which are inputted directly into classifier to produce predictive results. (b) Analyses generated by generation-based methods necessitate the application of supplementary programs or experts to derive predictive results. (c) LogPPO not only generates analyses but also obviates the degradation of automation level caused by post process.
Figure 4. Architecture of LogPPO. LogPPO framework primarily comprises two core components: a classification model and a generation model. The classification model performs predictive results, while the generation model is responsible for generating log analyses.
Figure 5. Workflow of developing the classification model.
Figure 6. An example of “chosen-reject” dataset acquisition process. The softmax function generates probabilistic distributions across distinct class labels. In this illustration, the left-side floating-point numbers represent probabilities corresponding to real labels, which we formally designate as confidence scores. The analyses are categorized as either “chosen” or “rejected” through a comparative assessment of their confidence scores.
Figure 7. Workflow of developing the generation model using PPO.
Figure 8. Visualization demonstration about effectiveness of analyses generated from generation model. The red dots indicate the normal logs, while the blue symbols indicate anomalous logs. After employing analyses generated from generation model, the distinction between normal logs and abnormal logs is more discernible. (a) Semantic vectors derived from embedded log templates without analyses generated from generation model. (b) Semantic vectors obtained through embedding concatenated outputs of log templates and analyses generated from generation model.
Figure 9. F1-Scores of LogPPO across different training sizes, demonstrating its few-shot robustness.
Figure 10. Visualization of the effectiveness of PPO. After tuning the generation model with PPO, a more discernible distinction emerges between normal and abnormal logs. (a) Semantic vectors obtained by embedding the concatenation of the log templates and analyses generated by the pre-PPO generation model. (b) Semantic vectors obtained by embedding the concatenation of the log templates and analyses generated by the post-PPO generation model.
Figure 11. Confidence scores during PPO optimization process.
Figure 12. Performance of KnowLog (model) on natural language processing tasks during the fine-tuning process. We selected three tasks (MRPC, SST-2, and STS-B [33]), with corresponding evaluation metrics of F1-Score, Accuracy, and Spearman Correlation Coefficient (SCC), respectively. As depicted in the figure, KnowLog (model) retains a substantial reservoir of natural language knowledge.
Figure 13. Absolute mean advantage value for a trajectory versus PPO optimization steps under varied γ configurations. Lowering γ is effective when convergence is sluggish, while raising γ is useful when the system tends not to converge.
Figure 14. Absolute mean advantage value for a trajectory versus PPO optimization steps under varied λ configurations. Lowering λ may be considered when convergence is sluggish, whereas raising λ can help mitigate non-convergence.
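The γ and λ hyperparameters studied in Figures 13 and 14 are the standard discount and trace-decay factors of Generalized Advantage Estimation (GAE), which PPO commonly uses to compute advantages. The following is a minimal sketch under the assumption of scalar per-step rewards and value estimates, with a terminal value of zero; the function names are ours, not the paper's.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE) over a single trajectory.

    gamma discounts future rewards; lam blends one-step and multi-step
    TD errors, trading bias against variance.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Sweep backwards so each step accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def mean_abs_advantage(advantages):
    """The kind of quantity tracked in convergence plots: mean of |advantage|."""
    return sum(abs(a) for a in advantages) / len(advantages)
```

Smaller γ or λ shortens the effective horizon over which TD errors accumulate, which is consistent with the captions' observation that lowering either value can speed up sluggish convergence.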
Table 1. Comparison of Precision, Recall, and F1-Score across different datasets.
| Dataset | Method | Precision | Recall | F1-Score |
|---|---|---|---|---|
| BGL | KnowLog (model) | 1.0000 | 0.0286 | 0.0556 |
| BGL | BigLog | 0.2500 | 0.0286 | 0.0513 |
| BGL | LogPrompt | 0.1357 | 1.0000 | 0.2389 |
| BGL | LogPPO (Ours) | 0.4324 | 0.4571 | 0.4444 |
| Spirit | KnowLog (model) | 1.0000 | 0.0333 | 0.0645 |
| Spirit | BigLog | 0.0000 | 0.0000 | 0.0000 |
| Spirit | LogPrompt | 0.1304 | 1.0000 | 0.2308 |
| Spirit | LogPPO (Ours) | 0.6667 | 0.2000 | 0.3077 |
| Thunderbird | KnowLog (model) | 1.0000 | 0.0769 | 0.1429 |
| Thunderbird | BigLog | 0.0000 | 0.0000 | 0.0000 |
| Thunderbird | LogPrompt | 0.1130 | 1.0000 | 0.2031 |
| Thunderbird | LogPPO (Ours) | 1.0000 | 0.1538 | 0.2667 |
| KnowLog (dataset) | KnowLog (model) | 0.4697 | 0.3748 | 0.4169 |
| KnowLog (dataset) | BigLog | 0.2961 | 0.3161 | 0.3058 |
| KnowLog (dataset) | LogPrompt | 0.3733 | 0.7478 | 0.4980 |
| KnowLog (dataset) | LogPPO (Ours) | 0.4739 | 0.5803 | 0.5217 |
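The F1-Scores in Table 1 are the harmonic mean of precision and recall. As a sanity check, a minimal sketch (the helper name is ours) reproduces, for example, the BGL row for LogPPO from its precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# BGL / LogPPO row in Table 1: P = 0.4324, R = 0.4571
bgl_logppo_f1 = round(f1_score(0.4324, 0.4571), 4)  # → 0.4444, matching the table
```

The harmonic mean also explains LogPrompt's pattern in Table 1: perfect recall with low precision (it flags nearly everything as anomalous) still yields a low F1-Score.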
Table 2. Ablation Study: F1-Score Comparison of Different Analysis Integration Strategies.
| Dataset | (1) Log Template Only | (2) Template + Pre-PPO Analysis | (3) Template + Post-PPO Analysis |
|---|---|---|---|
| BGL | 0.0560 | 0.1053 | 0.4444 |
| Spirit | 0.0645 | 0.1818 | 0.2778 |
| Thunderbird | 0.1429 | 0.1429 | 0.2667 |
| KnowLog (dataset) | 0.4169 | 0.4733 | 0.5197 |
Table 3. Hyperparameter Sensitivity Analysis of F1-scores on BGL Dataset.
| Learning Rate | Clip Ratio 0.1 | Clip Ratio 0.2 | Clip Ratio 0.3 |
|---|---|---|---|
| 5 × 10⁻⁶ | 0.2609 | 0.2222 | 0.2857 |
| 1 × 10⁻⁵ | 0.3750 | 0.4444 | 0.2759 |
| 5 × 10⁻⁵ | 0.3438 | 0.2800 | 0.3143 |
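The clip ratio swept in Table 3 enters PPO through its clipped surrogate objective, which bounds how far a single update can move the policy. Below is a minimal single-action sketch of the standard clipped loss (the function name and signature are ours; the paper's actual training loop is not shown here).

```python
import math

def ppo_clip_loss(logprob_new, logprob_old, advantage, clip_ratio=0.2):
    """PPO clipped surrogate for one action; loss is the negated objective.

    The probability ratio is clipped to [1 - clip_ratio, 1 + clip_ratio],
    so a larger clip_ratio permits more aggressive policy updates.
    """
    ratio = math.exp(logprob_new - logprob_old)
    clipped = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    # Take the pessimistic (smaller) of the unclipped and clipped objectives.
    return -min(ratio * advantage, clipped * advantage)
```

For a positive advantage, once the ratio exceeds 1 + clip_ratio the gradient incentive stops growing, which is why the clip ratio and learning rate interact in the sensitivity results of Table 3.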
Share and Cite

Wang, Z.; Dong, J.; Yang, C. LogPPO: A Log-Based Anomaly Detector Aided with Proximal Policy Optimization Algorithms. Smart Cities 2026, 9, 5. https://doi.org/10.3390/smartcities9010005
