Article

An AI-Based Risk Analysis Framework Using Large Language Models for Web Log Security

by Hoseong Jeong and Inwhee Joe *
Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
Electronics 2025, 14(17), 3512; https://doi.org/10.3390/electronics14173512
Submission received: 21 July 2025 / Revised: 26 August 2025 / Accepted: 29 August 2025 / Published: 2 September 2025

Abstract

Web log data analysis is essential for monitoring and securing modern software systems. However, traditional manual analysis methods struggle to cope with the rapidly growing volumes and complexity of log data, resulting in inefficiencies and potential security risks. To address these challenges, this paper proposes an AI-driven log analysis framework utilizing advanced natural language processing techniques from large language models (LLMs), specifically ChatGPT. The framework aims to automate log data normalization, anomaly detection, and risk assessment, enabling the real-time identification and mitigation of security threats. Our objectives include reducing dependency on human analysis, enhancing the accuracy and speed of threat detection, and providing a scalable solution suitable for diverse web service environments. Through extensive experimentation with realistic log scenarios, we demonstrate the effectiveness of the proposed framework in swiftly identifying and responding to web-based security threats, ultimately improving both security posture and operational efficiency.

1. Introduction

1.1. Background

In modern society, various organizations—including corporations, government agencies, and educational institutions—utilize web services for information dissemination, communication, and diverse online transactions. The significance of web services is increasing worldwide [1,2], making the stability and security of web servers a crucial concern. Log data contain a wealth of information regarding user requests, server responses, and access attempts; consequently, they can be leveraged not only for monitoring system performance but also for efficiently detecting security threats [3,4]. However, as log data become more voluminous and structurally complex, traditional manual analysis methods demand excessive time and labor, thereby proving inefficient. In response to these challenges, recent years have witnessed a surge in research aimed at detecting abnormal log activities within system logs [5,6]. Furthermore, the development of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, has spurred an increasing number of studies that focus on automatically identifying anomalies and security threats from log data. With ongoing improvements in LLMs, the integration of LLM-based log analysis frameworks is expected to have a positive impact on the security and operational efficiency of web services.

1.2. Proposed System Overview

This paper proposes a system designed to systematically collect logs generated from a web server and automatically normalize the vast amount of information obtained in this process for subsequent analysis. Concretely, the setup comprises one web server, one log-collection/analysis server, and one server designated for simulating security vulnerabilities. The logs produced by the web server are normalized using tools such as Logstash before being stored in a database on the log collection/analysis server. Subsequently, the GPT-3.5 Turbo model (ChatGPT) analyzes these logs. The analysis results are provided in real time via an administrator web page, enabling a quick assessment of log status and the necessary countermeasures. If a log entry is determined to be malicious, a security policy is dispatched to the web server to automatically block the suspicious activity. This approach is expected to significantly reduce the time spent on log analysis and enhance accuracy.

1.3. Research Objectives and Expected Contributions

The primary objective of this research is to present the design and implementation of the proposed system and to verify its effectiveness through experiments using real-world log data. Specifically, the goal is to construct an environment that can automatically extract critical information from web log data and swiftly identify, alert, and block security threats, thereby enabling rapid responses from web service operators. Additionally, the proposed log analysis system offers the following key benefits:
  • Automated Data Collection and Analysis: By reducing the need for direct log analysis by server administrators and security specialists, the system facilitates prompt monitoring and response to security threats.
  • In-Depth Log Data Utilization: AI-based analytics yield more refined insights from log data, thereby enhancing decision-making processes in user behavior analysis, system performance optimization, and the early detection of security threats.
  • Applicability in Various Environments: By proposing a novel method of leveraging web log data, the system can be adopted across a range of web server environments and has the potential to advance the field of log analysis research.

2. Literature Review

2.1. Purpose and Selection Criteria

This literature review aims to explore recent developments relevant to the integration of large language models (LLMs) in automated log analysis and cybersecurity applications. The selected studies were categorized into three key research areas, as follows:
  • The application of LLMs to log parsing or interpretation tasks.
  • The development of prompting techniques or in-context learning methods applicable to system-generated data.
  • The evaluation of security analysis frameworks involving automated detection or policy updates.
These selection criteria were aligned with the primary objective of this study—leveraging LLMs to improve the interpretability, adaptability, and responsiveness of security log systems in real-world environments.

2.1.1. Application of LLMs to Log Parsing or Interpretation Tasks

The authors of [7] explore the effectiveness of large language models in log analysis for cybersecurity. They benchmark multiple models, including BERT-base-cased, RoBERTa-base, DistilRoBERTa-base, GPT-2 (124M), and GPT-Neo (1.3B), to assess their ability to analyze system and application logs. The study implements an experimentation pipeline called LLM4Sec, which facilitates log analysis, model evaluation, and security interpretation. By fine-tuning models on six datasets, the research demonstrates that DistilRoBERTa achieves the best performance, surpassing existing state-of-the-art methods. The findings highlight the importance of fine-tuning for domain adaptation, enabling models to effectively identify security threats. The study also emphasizes the benefits of visualization techniques, such as t-SNE and SHAP, for enhancing model interpretability in log analysis.
In [8], the authors investigate the capability of ChatGPT in log parsing, which is an essential task for analyzing system logs in large-scale software environments. Traditional log parsers rely on predefined rules or data-driven models, requiring extensive labeled data and computational resources. ChatGPT, as a large language model, offers a potential alternative through zero-shot and few-shot learning. The study evaluates its performance using different prompting techniques, and benchmarks it against existing log parsers. The results show that while ChatGPT performs competitively in certain cases, its accuracy varies depending on the prompt structure and log format. Few-shot prompting significantly improves its parsing capability by providing example templates, allowing it to generalize better across log types. However, challenges remain in handling log-specific variables, maintaining consistency across different datasets, and designing optimal prompts. The research highlights the potential of ChatGPT in automated log analysis but emphasizes the need for further refinement to achieve higher reliability in production environments.
The authors of [9] explore automated log file analysis using AI-driven techniques, addressing challenges in traditional manual inspection methods. They highlight the limitations of conventional log analysis, such as the reliance on human interpretation, difficulty in detecting new errors, and issues with scalability. The study reviews AI approaches, particularly the use of large language models (LLMs) like LLaMA 2, which enhance log analysis by automating error detection, anomaly identification, and summarization. The proposed framework leverages AI to preprocess raw logs, extract patterns, and dynamically adapt to new log formats without predefined rules. The research identifies gaps in existing methods, emphasizing the need for improved real-time log tracking, flexible parsing models, and continuous learning mechanisms. The AI-driven tool integrates a conversational interface to enable intuitive interaction and real-time error detection. Future advancements could enhance scalability, cybersecurity applications, and cross-domain adaptability, making AI-powered log analysis a critical solution for efficient system monitoring and maintenance.

2.1.2. Prompting Techniques and In-Context Learning

In [10], the authors provide a systematic survey of prompting methods in natural language processing, introducing the paradigm of “pre-train, prompt, and predict”. Unlike traditional supervised learning, which relies on labeled datasets, prompting reformulates tasks into a format that pre-trained language models can handle directly. By using templates with unfilled slots, language models generate predictions with minimal additional training. The survey categorizes prompting techniques based on manual and automated template engineering, discrete and continuous prompts, and tuning strategies. It also explores multi-prompt learning, training methods, and their impact on tasks such as classification, question answering, and text generation. Challenges in prompt design, answer mapping, and optimization are discussed, emphasizing the need for effective prompt engineering to enhance model performance. The study concludes that prompting offers a flexible and efficient approach to adapting language models but requires further research to optimize generalization and interpretability in diverse NLP applications.
The authors of [11] explore chain-of-thought prompting as a method to enhance reasoning in large language models. Traditional prompting methods struggle with complex tasks, but chain-of-thought prompting introduces intermediate reasoning steps, allowing models to break down problems into logical sequences. This approach significantly improves performance in arithmetic, commonsense, and symbolic reasoning tasks. The study evaluates various large models, including PaLM and GPT-3, showing that chain-of-thought prompting enables them to achieve state-of-the-art results in mathematical problem-solving benchmarks. The findings suggest that reasoning abilities emerge as model size increases, making prompting an effective alternative to extensive fine-tuning. The paper highlights the advantages of interpretability, computational efficiency, and adaptability across different reasoning tasks. However, challenges remain in ensuring reasoning consistency and generalization across diverse prompts. The study concludes that chain-of-thought prompting expands the range of tasks that large models can perform and provides insights into optimizing model reasoning capabilities.
The authors of [12] investigate the effectiveness of in-context learning for code intelligence tasks using large language models. They explore how the selection, ordering, and number of demonstration examples impact the performance of models on tasks such as code summarization, bug fixing, and program synthesis. The study finds that selecting diverse yet relevant examples, ordering them by similarity to the test case, and carefully adjusting the number of demonstrations significantly improve model performance. Instance-level demonstrations outperform task-level ones, and excessive examples may lead to input truncation, reducing effectiveness. The results demonstrate that optimizing in-context demonstrations leads to substantial performance improvements, highlighting the importance of structured prompt engineering for code-related tasks. The study provides practical insights for leveraging large language models in software development, emphasizing the need for well-designed prompts to maximize efficiency and accuracy.

2.1.3. Evaluation of Security Analysis Frameworks

In [13], the authors evaluate static web vulnerability analysis tools, focusing on their effectiveness in detecting security flaws in web applications. With the increasing number of cyber threats, manual vulnerability assessment is impractical, necessitating the use of automated tools. The study compares two open-source static analysis tools, OWASP WAP and RIPS, by testing them against deliberately vulnerable web applications. The evaluation considers factors such as detection accuracy, false positive rates, and efficiency in identifying common vulnerabilities like SQL injection, cross-site scripting, and file inclusion. Experimental results show that OWASP WAP outperforms RIPS in terms of precision, although both tools have limitations in detecting certain types of vulnerabilities. The study highlights the challenges of static analysis, including its dependency on source code structure and its inability to detect runtime issues. The findings emphasize the importance of combining static and dynamic analysis methods for comprehensive security assessments. The paper concludes that while static tools provide valuable insights, further advancements are needed to improve their accuracy and applicability in real-world scenarios.

2.2. Synthesis and Research Gap

Based on the above categorization, the reviewed literature can be synthesized into the following three thematic groups:
  • LLM-based log analysis and security interpretation;
  • Prompting strategies and in-context learning for model adaptation;
  • Vulnerability detection and evaluation of automated security tools.
Collectively, these studies demonstrate the growing applicability of LLMs to log analytics, the importance of prompt engineering, and the increasing relevance of automation in cybersecurity. However, most existing research tends to isolate these areas, lacking an integrated framework that connects LLM-driven log interpretation with real-time threat detection, risk assessment, and automated security response. Moreover, there is a noticeable gap in leveraging LLMs not only for understanding log data but also for generating actionable alerts and dynamically reconfiguring security policies. This study addresses these gaps by proposing a unified system that integrates LLM-based log analysis with real-time alert generation and autonomous security configuration.

3. System Design and Implementation

To analyze log data generated by a web server, we configured a system consisting of one web server, one log collection/analysis server, and another server dedicated to simulating security vulnerabilities. In this environment, we designed and tested a web log risk analysis system, where logs stored on the log collection/analysis server are processed by an AI model, while the results are displayed on an administrator web interface.
This system systematically collects and refines logs generated by the web server, analyzes them with an AI model, and automates security configurations based on the analysis results. Administrators can intuitively monitor the system’s status. The data are shared within a central database, and when a critical security threat is detected, the system reconfigures security policies on the web server. Through this process, the designed framework enables log analysis and risk mitigation, as illustrated in Figure 1.
To support preprocessing and enhance extensibility, the pipeline integrates Logstash at the data collection stage. This enables the ingestion of not only web logs from OWASP Juice Shop but also logs from other heterogeneous web applications or servers in a unified and consistent format. The same regular expression-based structure detection and field extraction described above are applied within Logstash, allowing diverse log formats to be normalized into a standardized schema. Key fields such as date/time, IP address, URL, HTTP method, and status codes are extracted and structured before being stored in the database. This configuration ensures that logs from multiple sources can be processed and analyzed within a single framework without requiring modifications to downstream components used for AI model analysis.
The database (DB) of this system acts as the central hub where all data are stored. The DB stores three types of logs: sanitized logs processed in the log data processing stage, AI analysis logs generated by the AI analysis model, and summary AI analysis logs, which contain the risk evaluations and security settings determined by the Security Settings Program. The input-output flow includes storing sanitized logs in the DB, retrieving them for AI analysis, saving AI analysis results back to the DB, reading AI analysis logs for security risk assessment, and saving summary AI analysis logs, which serve as references for web server security updates. In addition, administrators can query the database through the admin page to gain insight. Given the large volume of logs, the schema design must consider partitioning, indexing, and archiving strategies, and a combination of NoSQL and relational databases may be used to optimize response speed.
The AI model retrieves sanitized logs from the DB, applies internal algorithms such as tokenization and context recognition, and assigns a log data risk analysis level to each analyzed log. The analyzed results are stored back in the DB in the form of AI analysis logs.
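As a concrete illustration of this step, the following is a minimal sketch of how an analysis request might be issued with the OpenAI Python client; the analyze_web_log() helper, the system-prompt wording, and the risk-scale string are illustrative assumptions rather than the system's exact implementation (the production prompt format is discussed in Section 5.1.4).

```python
# Minimal sketch of the AI analysis step, assuming the OpenAI Python client
# (openai >= 1.0). The prompt wording and analyze_web_log() helper are
# illustrative, not the system's exact implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RISK_SCALE = "1=Information, 2=Warning, 3=Error, 4=Critical Error"

def analyze_web_log(sanitized_log: str) -> str:
    """Ask the model to assign a risk level and briefly justify it."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are a web security analyst. Rate each log on "
                        f"the scale {RISK_SCALE} and justify the rating briefly."},
            {"role": "user", "content": sanitized_log},
        ],
        temperature=0,  # deterministic output aids reproducibility
    )
    return response.choices[0].message.content
```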
The Security Settings Program reads AI analysis logs from the DB and detects the structure and patterns using regular expressions to extract key information such as attack IPs, attack frequency, and log data risk analysis levels. It then generates summary AI analysis logs. If the risk analysis level is determined to be critical and repeated attacks from a specific IP are detected, the system immediately enforces security policies on the web server and applies countermeasures.
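The blocking rule can be made concrete with a short sketch. The five-critical-events-in-60-seconds threshold follows the configuration reported in Section 4.2; the register_event() and block_ip() names are hypothetical.

```python
# Sketch of the repeated-attack blocking rule: enforce a firewall policy only
# after five or more Level 4 (Critical Error) events from the same IP arrive
# within 60 seconds. block_ip() is a hypothetical enforcement callback.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
CRITICAL_THRESHOLD = 5

_critical_events = defaultdict(deque)  # ip -> timestamps of critical events

def register_event(ip, risk_level, block_ip):
    """Record an analyzed event; trigger blocking when the threshold is met."""
    if risk_level < 4:              # only Critical Error events count
        return False
    now = time.monotonic()
    window = _critical_events[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()            # discard events outside the 60 s window
    if len(window) >= CRITICAL_THRESHOLD:
        block_ip(ip)                # e.g., push a firewall rule to the web server
        window.clear()
        return True
    return False
```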
The administration page is developed within the log collection and analysis server to reduce the time required for log analysis and system status monitoring. The web interface allows administrators to search logs based on date and time, enabling them to access analyzed logs for specific periods. Additionally, it provides insights into potential future security issues, allowing proactive security management.
In summary, logs from the web server are preprocessed and normalized using Logstash before being stored in a database on the log collection and analysis server. The stored logs are then analyzed by the ChatGPT-based AI model and the results are written back into the database. Once analysis is complete, administrators can access the processed log data through a dedicated web page. If a malicious attack is detected, the system automatically enforces a security policy to block the identified threat. By implementing this approach, the proposed system provides a comprehensive end-to-end solution that encompasses log collection, security threat detection, and automated response mechanisms. This architecture enables web service providers to swiftly identify and mitigate potential security threats, ensuring robust protection for their infrastructure.

3.1. Materials and Methods

Data Ingestion and Normalization

Web Log Data and Their Importance
Web log data are a collection of records generated by web servers in response to user requests. These data include a variety of information, such as the user’s IP address, the time of the request, the URL of the requested page, the HTTP response code, and user agent information. The importance of web log data is evident in several respects, including the following:
  • User Behavior Analysis: Web logs provide in-depth insights into how users interact with a website. This can be used to improve the usability of the website and optimize the user experience.
  • System Monitoring: Analyzing web log data reveals performance issues in the web server and clarifies traffic patterns, which is essential for maintaining the stability and efficiency of the system.
  • Security Threat Detection: Analyzing abnormal request patterns, suspicious IP addresses, and error codes enables security threats to be detected and addressed early. Web logs thus serve as an important data source in cybersecurity.
Logstash
Logstash, one of the core components of the Elastic Stack, is an open-source data processing pipeline that collects, transforms, and stores data from various sources. It is designed primarily to process data generated from log files, system metrics, and web applications, and is widely used across diverse use cases and environments owing to its flexibility and scalability. The structure of Logstash is shown in Figure 2. Its configuration comprises three core stages [14,15].
  • Input: Logstash can collect data from a variety of input sources. This includes files, HTTP, syslog, Twitter, and lightweight data collectors such as Beats. Through input plugins, Logstash supports real-time data streaming.
  • Filter: Collected data can be transformed or processed through filters. Filters perform tasks such as changing the structure of data, extracting necessary information, or adding or removing fields. The most representative filters include grok (pattern matching to structure log data), mutate (data transformation), drop (deleting events that meet specific conditions), and date (parsing and standardizing date information).
  • Output: Processed data can be sent to one or more destinations. This function supports various output targets such as Elasticsearch, file systems, databases, and messaging queues. Data can be indexed, stored, and analyzed through output plugins.
Web Log Data Generation
When web users access a web service, access logs are generated on the web server. In this study, we collect and analyze log data using these access logs. The log format employed is the Combined Log Format, which extends the Common Log Format (CLF) by adding two fields, “referer” and “user agent”. These fields indicate the link through which a visitor reached the page and the browser or device used, enabling the collection of more detailed web logs; a field-extraction sketch follows the format definitions below.
  • Common Log Format (CLF): This format logs the client’s IP address, the remote identity (identd), the authenticated user ID, the request time, the request line, the HTTP status code, and the response size in bytes.
  • Combined Log Format: In addition to the CLF fields, this format adds “referer” and “user agent”, indicating which link the visitor used to reach the page and which browser or device they used; this is the format used to collect web logs in this study.
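As referenced above, the following is a minimal sketch of field extraction from a Combined Log Format line, written in plain Python for illustration; in the actual pipeline this step is performed by Logstash’s grok filter. The sample log line is fabricated and mimics an SQL injection attempt.

```python
# Sketch of Combined Log Format field extraction with a regular expression.
# The pattern follows the standard format; the sample line is fabricated.
import re

COMBINED_LOG = re.compile(
    r'(?P<ip>\S+) (?P<identd>\S+) (?P<userid>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

sample = ('172.168.100.10 - - [21/Jul/2025:10:15:32 +0900] '
          '"GET /rest/products/search?q=%27%20OR%201=1-- HTTP/1.1" '
          '200 512 "-" "Mozilla/5.0"')

match = COMBINED_LOG.match(sample)
if match:
    fields = match.groupdict()  # ip, identd, userid, time, method, url, ...
    print(fields["ip"], fields["method"], fields["url"], fields["status"])
```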
Web Log Data Cleansing and Storage
The logs refined in the filter pipeline are forwarded to the output pipeline, as shown in Figure 3. In the output pipeline, the JDBC plugin maps the cleaned fields to the corresponding columns of the web log table in the database of the log collection/analysis server, through which they are transmitted and stored. For data cleansing and storage, we implemented the pipeline by creating a logstash.conf file containing the pipeline code, as shown in Figure 4.
Log Data Risk Analysis Level Design
To facilitate administrator identification and risk recognition of AI-analyzed log data in the web log risk analysis system, four risk levels were designed. Level 1 (Information) provides simple information to administrators. Level 2 (Warning) provides advance warning of potential problems. Level 3 (Error) indicates a problem but does not require immediate action. Level 4 (Critical Error) requires immediate action by the system administrator. The log data risk analysis level is defined in Figure 5.
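For illustration, this four-level scale can be encoded as a small enumeration shared by the analysis and security components; the RiskLevel name is our own.

```python
# The four-level scale of Figure 5 as an enum, keeping level semantics
# consistent across the AI analysis and Security Settings components.
from enum import IntEnum

class RiskLevel(IntEnum):
    INFORMATION = 1     # simple information for administrators
    WARNING = 2         # advance warning of a potential problem
    ERROR = 3           # a problem not requiring immediate action
    CRITICAL_ERROR = 4  # requires immediate administrator action
```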

3.2. Database Schema Design

The schemas of the three log tables of the system proposed in this paper are as follows. The Web Log table, populated in the first log data processing step, is defined in Table 1; the Analyze_Log table, which holds the AI analysis logs generated by the AI analysis model, is defined in Table 2; and the Summary_Analyze_Log table, which holds the risk assessments and security settings determined by the Security Settings Program, is defined in Table 3. A hypothetical schema sketch follows the list below.
  • Web Log
  • Analyze_Log
  • Summary_Analyze_Log
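As referenced above, the following is a hypothetical sketch of the three tables. Because Tables 1–3 are not reproduced here, the column names are illustrative, inferred from the fields discussed in the text, and SQLite stands in for the system’s actual database.

```python
# Hypothetical schema sketch for the three log tables; column names are
# inferred from the text, not copied from Tables 1-3.
import sqlite3

conn = sqlite3.connect("weblog_analysis.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS web_log (
    id INTEGER PRIMARY KEY, log_time TEXT, ip TEXT, method TEXT,
    url TEXT, status INTEGER, referer TEXT, user_agent TEXT);
CREATE TABLE IF NOT EXISTS analyze_log (          -- AI analysis output
    id INTEGER PRIMARY KEY, web_log_id INTEGER REFERENCES web_log(id),
    risk_level INTEGER, analysis TEXT);
CREATE TABLE IF NOT EXISTS summary_analyze_log (  -- Security Settings output
    id INTEGER PRIMARY KEY, attack_ip TEXT, attack_count INTEGER,
    risk_level INTEGER, action_taken TEXT);
""")
```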

3.3. Web Application Security Vulnerabilities

The OWASP Top 10 provides important criteria for identifying and understanding security vulnerabilities in web applications. This list includes security threats that frequently occur in web applications, such as SQL injection, cross-site scripting (XSS), sensitive data exposure, XML external entity (XXE) attacks, improper access control, and security configuration errors. OWASP has been exploring ways to strengthen the security of web applications centered on these vulnerabilities. For example, existing studies have emphasized the use of parameterized queries to counter SQL injection and the validation and encoding of user input to prevent XSS attacks. To prevent sensitive data exposure, encryption of data in transit via HTTPS and the hashing of sensitive values such as passwords have been widely studied. Existing studies based on the OWASP Top 10 provide specific methodologies for improving various security aspects of web applications, which helps developers build more secure web applications. In addition, the systematic management and mitigation of these vulnerabilities contributes to improving the overall security level of web applications and plays an important role in developing a strategic approach to responding to security threats. This paper verifies the performance of the proposed system by detecting three of the OWASP Top 10 web application security vulnerabilities, namely SQL injection, cross-site scripting (XSS), and file upload attacks, and by providing the corresponding analysis results.

3.4. ChatGPT Log Analysis Performance

Log data reflect the operating status of IT systems, and analyzing them effectively is essential for system security and performance optimization. It is therefore important to determine how well ChatGPT can process and analyze log data. Previous studies have evaluated ChatGPT in three main areas: log parsing, the process of extracting meaningful information from log data; log analysis, the process of detecting abnormal signs or identifying patterns based on the extracted data; and log summarization, the process of condensing massive log data to provide key information. In parsing, ChatGPT showed excellent performance across various types of log data, leveraging its natural language processing capabilities to extract key information, although its accuracy may drop when the log format is complex or irregular. In analysis, ChatGPT proved to be a very useful tool for identifying patterns and detecting abnormal signs, and it was particularly effective in the early detection of security threats; however, false positives or inaccurate analysis results remain possible, and additional research is needed to mitigate them. In summarization, ChatGPT excelled at condensing massive log data into key information, which helps system administrators and security personnel quickly understand a situation and respond, but some important information may be omitted in the process, so auxiliary analysis tools may be needed. Overall, ChatGPT performs well in parsing, analyzing, and summarizing log data and can be a useful tool for improving the operation and security management of IT systems; considering its limitations in certain situations, however, it is best used as an auxiliary tool, and future research should aim to overcome these limitations and further improve its performance [16].
These findings point to three limitations. First, accuracy may be reduced when the log format is complex or irregular; to address this, we limit the scope of log data analysis to web logs. Second, false detections or inaccurate analysis results remain possible.
Third, some important information may be omitted during the summarization process, so an auxiliary analysis tool may be needed. To address this, we designed the log data risk analysis level as an auxiliary analysis aid, minimizing the risk that important information is omitted from log analysis summaries. In addition, previous studies have focused only on log data analysis performance; however, improving the operation and security management of IT systems also depends on the administrator’s judgment. We therefore designed an administrator web page that helps web service providers respond quickly by identifying and warning of potential risks such as security threats. In this way, the paper goes beyond assessing the possibilities and limitations of ChatGPT-based log data analysis and proposes a method for utilizing it in an actual operating environment, enabling system administrators to manage and analyze log data efficiently and respond quickly to security threats.

4. Experiments and Evaluation

This section describes the system’s design, implementation, and evaluation, and is organized as follows. First, we describe the general results of web log analysis using benign and noisy data, focusing on the end-to-end data flow from log parsing to risk-level assignment and visualization. Next, we evaluate the system’s response to simulated attack scenarios—SQL injection, cross-site scripting (XSS), and file upload vulnerabilities—chosen based on the OWASP Top 10 categories. Each scenario is assessed in terms of threat detection accuracy and risk-level classification. Lastly, we demonstrate how the system responds to repeated high-risk behavior through automated firewall policy activation.
The evaluation environment was intentionally scoped to simulated attack scenarios within the OWASP Juice Shop platform, which is a widely adopted and reproducible testbed for web vulnerability testing. This choice enables focused analysis under controlled conditions aligned with known security benchmarks. While further testing on external datasets may enhance generalizability, this study aims to validate the feasibility and coherence of the system across an end-to-end workflow—from raw log ingestion to actionable threat response—in a realistic yet manageable domain setting.

4.1. General Web Log Analysis Results

A web log analysis system was designed and evaluated to detect and analyze security threats to web servers. The system accurately parsed and processed seven different web logs (varying in IP addresses, timestamps, response contents, etc.) generated for testing. The original logs were parsed line-by-line, refined, and transmitted to the log collection and analysis server. These logs were then processed by an AI analysis model, which assigned log levels to each IP and performed relationship and risk analysis. Logs identified as potential or detected threats were displayed on the admin page, enabling administrators to quickly assess potential future security issues. The logs displayed on the admin page after risk analysis was completed are shown in Figure 6.

4.2. Web Attack Log Risk Analysis Results

To evaluate the system’s ability to detect and analyze security risks, simulated web vulnerability attacks were conducted. The system was tested against various attack types, including SQL injection, where malicious SQL code is injected into the application’s database queries to manipulate the database or expose sensitive data. Additionally, cross-site scripting (XSS) attacks were performed, in which malicious scripts are injected into web pages to execute within other users’ browsers.
Lastly, file upload vulnerability attacks were conducted, where an attacker attempts to upload and execute malicious files on the server via an insecure file upload function. The web log risk analysis system was tested against these attack scenarios to assess its threat detection capabilities.

4.2.1. SQL Injection Attack

An analysis of the attack indicates that two SQL injection attempts were made from the same IP address (172.168.100.10) within a short time frame, and both requests received a 200 OK response. While classified as “Warning,” the behavior suggests a potential threat, as the requests targeted vulnerable URLs. If unaddressed, this IP may continue to exploit system vulnerabilities, posing risks such as unauthorized access or service disruption. The system successfully detected the attack, assessed the risk, and generated appropriate response recommendations for administrators. The details of this detection and analysis are illustrated in Figure 7.

4.2.2. Cross-Site Scripting (XSS) Attack

An analysis of the attack classified the risk level as “Error” due to a cross-site scripting (XSS) payload found in a GET request from IP address 172.168.100.10 to 172.168.30.30. Although the request received a 200 OK response, the presence of a script injection in the “name” parameter represents a serious security threat. The system successfully detected the XSS attempt, assessed the risk, and recommended appropriate countermeasures. The details of this detection and analysis are illustrated in Figure 8.

4.2.3. File Upload Vulnerability Attack

An analysis of the attack classified the risk level as “Critical Error” due to repeated attempts from IP address 172.168.100.10 to upload a web shell and access sensitive system files such as /etc/passwd, /etc/group, and /etc/hosts. These actions indicate a serious risk of unauthorized access or system compromise. The system accurately detected this behavior, assessed the threat, and generated appropriate response recommendations. The details of this detection and analysis are illustrated in Figure 9.
The test results from these three types of web attacks confirm that the system effectively detects and analyzes web security vulnerabilities and provides administrators with appropriate countermeasures. Additionally, the Security Settings Program was activated to send and apply firewall-blocking security configurations to the web server only after five or more “Critical” risk-level attacks from the same IP address were detected within 60 s. This conservative threshold mitigates false positives and reduces the likelihood of unintended lockouts. The details are shown in Figure 10.

5. Results and Discussion

5.1. Experimental Setup

This section details the experimental setup designed to objectively evaluate the performance of the proposed ChatGPT-based web log risk analysis system. A controlled experimental environment was established, and a comparative analysis was conducted across various large language models (LLMs). We elaborate on the selection of comparison models, the dataset construction process, and the methodology for generating reference logs for evaluation.

5.1.1. Selection of Comparative Models and Their Rationale

To identify the optimal LLM to serve as the core component for generating risk alerts and summaries within the web log risk analysis system, we selected a diverse set of leading models, each possessing distinct characteristics. The rationale behind the selection of each model is as follows.
  • GPT-4o: As OpenAI’s latest flagship model, GPT-4o boasts advanced multimodal capabilities encompassing text, image, and audio understanding. For this study, our primary focus was on its exceptional text comprehension and generation abilities. We selected GPT-4o with the expectation that it would deliver a state-of-the-art performance in analyzing complex web log data and producing high-quality risk alerts. Its extensive pre-training on vast datasets provides remarkable generalization and reasoning capabilities across various domains, which we anticipate will be advantageous in explaining complex and unpredictable web attack patterns.
  • GPT-4o-mini: This model represents a lightweight version of GPT-4o, emphasizing faster response times and improved efficiency while aiming to retain a performance level comparable to GPT-4o. Given the critical importance of minimizing latency in real-time or near-real-time risk analysis systems, GPT-4o-mini was included to explore the balance between performance and operational efficiency. This inclusion is crucial for assessing the practical utility of the system in an actual operational environment.
  • Gemini 2.5 Flash (Gemini): Developed by Google DeepMind, Gemini 2.5 Flash is part of the Gemini series, optimized for speed and high efficiency. Its extensive training on diverse data imbues it with outstanding multilingual comprehension and generation capabilities. We hypothesized that this would prove advantageous in processing various formats of web logs and security threat terminologies. Gemini 2.5 Flash was thus evaluated as a viable alternative suitable for web log analysis environments demanding high-volume data processing and rapid responses.
  • LLaMA3: As an open-source large language model developed by Meta AI, LLaMA3 is widely adopted within research communities that value transparency and customization. LLaMA3 holds significant potential for performance maximization through domain-specific fine-tuning. In this research, it was chosen as a baseline to explore the current performance levels of open-source models and their future scalability. This provides important insights into the applicability of open-source models compared to proprietary API-based alternatives.
These selected models, each characterized by different architectures, training data, and optimization objectives, enable a comprehensive multi-faceted analysis of their impact on the text generation capabilities of the web log risk analysis system.

5.1.2. Dataset Construction

To facilitate the experiments, a dataset simulating a real-world web attack environment was constructed. This dataset is an indispensable component for verifying how effectively the system operates under actual threat conditions.
  • OWASP Juice Shop Deployment: An OWASP Juice Shop environment, specifically designed for learning about and testing web application vulnerabilities, was deployed. This platform is ideal for systematically simulating various security vulnerabilities commonly found in real web services, thereby enabling the collection of relevant logs.
  • Application of OWASP Top 10 Attack Scenarios: Based on the major web security vulnerabilities outlined in the OWASP Top 10 (e.g., SQL injection, cross-site scripting (XSS), broken access control, etc.), realistic attack scenarios were devised. These scenarios were then executed against the deployed OWASP Juice Shop to generate and collect diverse attack logs. The logs collected serve as a realistic dataset reflecting actual security threat situations.
The collected log data underwent normalization and analysis through the system’s log analysis module, which then served as the basis for the LLMs to generate risk alerts and summaries.

5.1.3. Generation of Reference Logs for Evaluation

To objectively evaluate the quality of the risk alerts generated by the LLMs, accurate and reliable reference texts are indispensable. In this study, we utilized the GPT-3.5 Turbo model to generate “ideal” risk alerts or summaries (reference logs) corresponding to each web attack log [17].
  • Rationale for Using GPT-3.5 Turbo: GPT-3.5 Turbo was selected as the core LLM for the web log risk analysis system proposed in this study. Consequently, this model best represents the fundamental approach and understanding of web log analysis and risk alert generation within the scope of this research. To ensure evaluation fairness and establish a benchmark for the “ideal” alerts expected from the system, it was deemed most appropriate to use GPT-3.5 Turbo, which forms the basis of our current system, to generate the reference logs. This choice maintains consistency between the system’s internal logic and the language model, providing reference data that most closely align with our objectives.
  • Role of Reference Logs: The reference logs generated by GPT-3.5 Turbo are concise and clear summaries of the core information from each attack type and log data. These served as the “ground truth” for calculating ROUGE, BLEU, METEOR, and BERT-Score for the alerts generated by the comparative models in the experiments, thus providing a quantitative benchmark for evaluating each model’s text generation capabilities.

5.1.4. Prompt Construction

A consistent prompt format was used when querying each comparative model to ensure fair evaluation. The prompts included specific risk information extracted from web log data, instructing the models to generate security threat alerts or summaries based on this information.
Figure 11 illustrates an example of the prompt provided to each model. The prompt encompassed key information from the analyzed web logs (e.g., attack type, source IP, target URL, detected patterns, etc.) and requested the model to generate a risk alert in a format that security administrators could immediately understand and act upon. This setup enabled a fair evaluation of summarization and natural language transformation.
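For illustration, a prompt following the described format might be assembled as below; the template wording is a paraphrase of the structure shown in Figure 11, not the exact prompt used in the experiments.

```python
# Illustrative prompt assembly; field names mirror the key information
# listed in the text (attack type, source IP, target URL, detected pattern).
PROMPT_TEMPLATE = (
    "You are a security analyst. Based on the following web log analysis, "
    "write a risk alert that an administrator can act on immediately.\n"
    "Attack type: {attack_type}\n"
    "Source IP: {source_ip}\n"
    "Target URL: {target_url}\n"
    "Detected pattern: {pattern}\n"
    "Include the risk level, likely impact, and recommended countermeasures."
)

prompt = PROMPT_TEMPLATE.format(
    attack_type="SQL injection",
    source_ip="172.168.100.10",
    target_url="/rest/products/search",
    pattern="' OR 1=1--",
)
```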
This meticulously designed experimental setup ensures the reliability and reproducibility of the research findings, offering significant implications for applying LLMs in web log risk analysis systems.

5.2. Quantitative Evaluation

5.2.1. Evaluation Metrics

In this study, we utilized several representative natural language processing (NLP) evaluation metrics to comprehensively assess the quality of text-based notifications and summaries generated by our proposed web log risk analysis system. These metrics measure the similarity and semantic alignment between generated texts and reference texts, allowing for a holistic evaluation of the accuracy, clarity, and practical utility of the system’s risk alerts.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is widely used for evaluating summarization and text generation systems. ROUGE assesses the extent to which the key content of a reference text is effectively covered by generated texts based on the overlap of n-grams.
  • ROUGE-1f (F-score of Unigram Overlap): This metric evaluates overlap at the unigram level, indicating how well essential keywords from the reference text are incorporated in the generated text. A high ROUGE-1f score ensures that critical keywords essential for risk alerts are not omitted.
  • ROUGE-2f (F-score of Bigram Overlap): ROUGE-2f measures overlap at the bigram level, focusing on evaluating sentence fluency and syntactic correctness. In the context of web log risk alerts, it reflects the model’s capability to accurately express specific threat patterns or types through precise phrasing.
  • ROUGE-Lf (F-score of Longest Common Subsequence): Based on the longest common subsequence (LCS), ROUGE-Lf captures the inclusion of essential information, irrespective of strict word order. Higher ROUGE-L scores imply that the system effectively identifies and accurately represents key information within risk notifications, which is a crucial factor for rapid situational awareness.
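For reference, all three variants report an F-score combining the recall and precision of the matched units (unigrams for ROUGE-1, bigrams for ROUGE-2, and the LCS for ROUGE-L); the standard definition is as follows.

```latex
R_N = \frac{\#\,\text{matched } n\text{-grams}}{\#\,n\text{-grams in reference}}, \qquad
P_N = \frac{\#\,\text{matched } n\text{-grams}}{\#\,n\text{-grams in candidate}}, \qquad
\text{ROUGE-}N_f = \frac{2\,P_N R_N}{P_N + R_N}
```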
BLEU (Bilingual Evaluation Understudy), originally developed for machine translation evaluation, has also been broadly adopted for evaluating various text generation tasks. BLEU measures the n-gram precision between generated and reference texts, reflecting the accuracy and naturalness of generated sentences. High BLEU scores indicate the system’s ability to generate risk alerts that are both accurate and naturally phrased, enabling administrators to quickly understand and respond without confusion.
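For reference, the standard BLEU definition combines the modified n-gram precisions $p_n$ with uniform weights $w_n = 1/N$ (typically $N = 4$) and a brevity penalty for candidates shorter than the reference, where $c$ and $r$ denote the candidate and reference lengths.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad
\mathrm{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right)
```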
METEOR (Metric for Evaluation of Translation with Explicit Ordering) was developed to complement BLEU by capturing finer semantic matches. Unlike BLEU, METEOR considers not only exact matches but also matches based on stems, synonyms, and paraphrases. Thus, METEOR provides a more nuanced evaluation of semantic similarity, significantly enhancing the interpretability of risk notifications. A higher METEOR score indicates the system’s proficiency in recognizing and communicating threat information, even when expressed in varied ways.
BERT-Score (Bidirectional Encoder Representations from Transformers Score) is a state-of-the-art embedding-based evaluation metric leveraging the BERT model to measure contextual similarity between generated and reference texts. It captures subtle semantic differences often missed by traditional n-gram metrics.
  • BERTp (Precision): This measures how closely each token in the generated text semantically aligns with tokens in the reference text.
  • BERTr (Recall): This evaluates how effectively each token in the reference text is semantically represented by tokens in the generated text.
  • BERTf (F-score): This is the harmonic mean of precision and recall, providing an overall measure of semantic coherence.
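For reference, with pre-normalized contextual embeddings $x_i$ for reference tokens and $\hat{x}_j$ for candidate tokens, the standard greedy-matching definitions are as follows.

```latex
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top}\hat{x}_j, \qquad
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j, \qquad
F_{\mathrm{BERT}} = \frac{2\,P_{\mathrm{BERT}} R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```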
Given its ability to capture contextual semantics beyond simple lexical overlap, the BERT-Score is particularly valuable for assessing whether risk notifications accurately and deeply reflect the actual threat scenarios. This makes it essential for effectively communicating complex or nuanced security threats.
Collectively, these metrics are instrumental in determining whether the text-based alerts generated by our web log risk analysis system transcend the mere enumeration of log data, instead providing high-quality, actionable insights that empower security analysts and administrators to quickly and accurately perceive and respond to potential security threats.
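For reproducibility, these metrics can be computed with widely used open-source packages; the snippet below is a sketch assuming the rouge-score, nltk, and bert-score libraries (the paper does not name its own tooling), with fabricated example texts.

```python
# Sketch of metric computation with common open-source packages; library
# choices are assumptions, and the texts are fabricated examples.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.meteor_score import meteor_score  # needs nltk wordnet data
from bert_score import score as bert_score

reference = "Critical SQL injection attempts from 172.168.100.10; block the IP."
candidate = "Repeated SQL injection from 172.168.100.10 detected; blocking recommended."

rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print({name: s.fmeasure for name, s in rouge.score(reference, candidate).items()})

ref_tok, cand_tok = reference.split(), candidate.split()
print("BLEU:", sentence_bleu([ref_tok], cand_tok, weights=(0.5, 0.5)))  # bigram BLEU
print("METEOR:", meteor_score([ref_tok], cand_tok))

P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERT-F:", float(F1[0]))
```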

5.2.2. Experimental Results and Analysis

The performance of various large language models (LLMs) that are applicable to the developed system was comparatively evaluated using the NLP metrics described previously.
Table 4 presents the performance scores of each model. As shown in Table 4, the GPT-4o-mini model consistently exhibited the best overall performance across all evaluation metrics. In particular, GPT-4o-mini demonstrated significantly superior scores on the ROUGE metrics, such as ROUGE-1f (0.6781), ROUGE-2f (0.5451), and ROUGE-Lf (0.6087), clearly surpassing the other models. This indicates that GPT-4o-mini excels in effectively capturing and communicating essential information when generating risk notifications.
Additionally, GPT-4o-mini achieved the highest scores in relation to both BLEU (43.6969) and METEOR (0.5353) metrics, indicating its exceptional grammatical accuracy, fluency, and semantic similarity to the reference texts. These results suggest that GPT-4o-mini is highly effective at transforming complex security risk information extracted from web logs into user-friendly and easily comprehensible alerts.
Furthermore, the GPT-4o-mini model attained the highest BERTf score (0.9112), emphasizing its ability to accurately capture and reflect the contextual meaning of the reference texts. Thus, GPT-4o-mini emerges as the most effective LLM model for generating meaningful, contextually accurate, and practically valuable risk alerts in web log risk analysis scenarios.
In summary, this performance evaluation clearly demonstrates that the GPT-4o-mini model exhibits the highest performance among the tested models, making it an ideal candidate as the central text-generation component for the web log risk analysis system proposed in this study. The Gemini model also displayed a strong competitive performance, indicating its potential to significantly contribute to enhancing the reliability and efficiency of the system. Collectively, these quantitative findings substantiate that the developed system can effectively strengthen web service security in practical environments and can meaningfully reduce administrators’ response time.

5.3. Qualitative Evaluation Criteria

To address the limitations of conventional NLP evaluation metrics in capturing the real-world utility of generated security alerts, this study incorporates a qualitative evaluation conducted by three domain experts. Their assessments focused on four key quality dimensions—Accuracy, Completeness, Actionability, and Logical Consistency—to holistically evaluate the responses of various large language models (LLMs). The following table (Table 5) presents the four expert-defined evaluation metrics and their corresponding criteria. This expert-driven evaluation was further subjected to statistical analysis; specifically, the Wilcoxon signed-rank test was performed to determine whether the observed differences in quality scores across models were statistically significant, using a significance level of α = 0.05. The results and interpretations of this analysis are reported in the subsequent section.
  • Accuracy: This criterion assessed whether the generated alerts accurately interpreted and reflected core information from the web logs, such as IP addresses, timestamps, HTTP methods/endpoints, and status codes, without factual errors. Furthermore, it examined whether the assigned risk level was based on factual relations derived from the log data. A higher accuracy score indicated that the system effectively conveyed the essence of the threat without distortion.
  • Completeness: The evaluation of completeness focused on whether the alert comprehensively included all critical clues that are necessary to describe a specific threat situation (e.g., patterns of repeated requests, suspicious user agents (UAs), attempts to access sensitive endpoints, etc.). Experts meticulously checked for any omission of potential risk factors, ensuring that the alert provided sufficient information for security administrators to gain a comprehensive understanding of the situation.
  • Actionability: This criterion evaluated whether the proposed response measures (e.g., monitoring specific traffic, applying blocking rules, reviewing user sessions, etc.) were at a practical level that an actual operations team could immediately implement. Moreover, it critically assessed whether the suggested actions were presented with clear priorities, enabling security personnel to respond efficiently and without confusion.
  • Logical Consistency: The logical consistency assessment measured how naturally and logically the information flowed within the generated alert. Specifically, it examined whether the progression from inter-log relationship analysis to technical risk assessment and then to the inference of threat intent was organically connected. It also ensured that the conclusion clearly corresponded to the presented evidence, which is a crucial factor in enhancing the reliability and comprehensibility of the alerts.
Each expert assigned a score on a scale from 1 (very poor) to 4 (excellent) for each criterion. The aggregated expert scores served as the foundational data for subsequent statistical significance analysis (Wilcoxon signed-rank test) between the models.
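For illustration, the paired test used in the following subsections can be run with SciPy as sketched below; the score vectors are fabricated placeholders standing in for per-item expert ratings of two models.

```python
# Sketch of the Wilcoxon signed-rank test on paired expert scores;
# the ratings below are fabricated placeholders, not experimental data.
from scipy.stats import wilcoxon

model_a_scores = [4, 3, 4, 4, 3, 4, 3, 4, 4, 3]  # e.g., GPT-4o ratings
model_b_scores = [3, 3, 3, 4, 2, 4, 3, 3, 4, 2]  # e.g., GPT-4o-mini ratings

stat, p_value = wilcoxon(model_a_scores, model_b_scores)
print(f"W = {stat}, p = {p_value:.4f}")  # significant if p < 0.05
```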

5.4. Qualitative Evaluation Results

This section reports the detailed statistical analysis results of the expert-based qualitative evaluation, providing interpretations and discussions focusing on significant differences across various quality metrics. In this study, the response quality of various large language models (LLMs) was compared across four dimensions: Accuracy, Completeness, Actionability, and Logical Consistency. The Wilcoxon signed-rank test was performed to evaluate rank differences between models for each quality metric, with the significance level set at the commonly used α = 0.05.

5.4.1. Accuracy Evaluation Results

Accuracy assessment measured how factually correct and error-free the LLM-generated risk alerts were in relation to the actual log data and threat information. The table (Table 6) below presents the Wilcoxon signed-rank test results for Accuracy rank comparisons between LLM pairs.
In the accuracy evaluation, no statistically significant differences were observed in comparisons between most model pairs (p > 0.05). This suggests that overall, LLMs exhibited a comparable performance in providing accurate information. However, a statistically significant difference (p = 0.0338) was found between GPT-4o-mini and LLaMA3, indicating a systematic difference in their accuracy evaluations. Consequently, it is highly probable that GPT-4o-mini provided more accurate responses compared to LLaMA3. For other model pairs, the absence of statistical significance implies a similar accuracy performance or no clear preference from expert evaluations.

5.4.2. Completeness Evaluation Results

Completeness evaluation assessed how sufficiently and comprehensively the LLM-generated risk alerts included essential information pertaining to a specific threat situation. The table (Table 7) below presents the Wilcoxon signed-rank test results for Completeness rank comparisons between LLM pairs.
In terms of completeness, GPT-4o showed a statistically significant superiority over GPT-4o-mini (p = 0.0482). This indicates that GPT-4o provided more comprehensive and thorough responses to the queries. Furthermore, the comparison between GPT-4o-mini and LLaMA3 yielded a p-value of 0.0572, which is close to the significance level, suggesting potential for a statistically significant difference to emerge with additional data or repeated experiments. For other model pairs, no significant differences were observed.

5.4.3. Actionability Evaluation Results

Actionability evaluation measured how clearly and practically the LLM-generated risk alerts provided information that could lead to immediate and appropriate actions by security administrators. The table (Table 8) below presents the Wilcoxon signed-rank test results for Actionability rank comparisons between LLM pairs.
The Actionability comparison results revealed statistically significant differences in two model pairs. First, GPT-4o demonstrated significantly higher actionability than GPT-4o-mini (p = 0.0457). Second, GPT-4o-mini exhibited markedly superior actionability compared to LLaMA3 (p = 0.0074). These findings suggest that the more advanced models provide responses that can lead to more concrete and practical measures. Specifically, GPT-4o-mini showed a distinct advantage over LLaMA3 in presenting structured action guidelines and actionable information.

5.4.4. Logical Consistency Evaluation Results

Logical Consistency evaluation assessed whether the LLM-generated risk alerts were logically and naturally connected, making them easy to understand. The table (Table 9) below presents the Wilcoxon signed-rank test results for Logical Consistency rank comparisons between LLM pairs.
In terms of logical consistency, statistically significant differences were observed in two model pairs. GPT-4o generated more coherent responses compared to GPT-4o-mini (p = 0.0233). Additionally, GPT-4o-mini showed more consistent results in response structure and logical flow compared to LLaMA3 (p = 0.0159). This corroborates the notion that more advanced models perform better in generating natural and well-structured text.

5.4.5. Comprehensive Analysis and Implications

Synthesizing the statistical analysis results leads to the following conclusions:
  • GPT-4o demonstrated a superior performance over GPT-4o-mini in most categories, particularly showing statistically significant differences in Completeness, Actionability, and Logical Consistency. This suggests that while GPT-4o-mini is a strong contender, the full GPT-4o model, despite potentially being slower or more resource-intensive, provides a qualitatively higher standard in critical aspects of practical utility for security professionals.
  • GPT-4o-mini showed statistically superior or significant tendencies across all key metrics compared to LLaMA3, indicating a strong competitiveness in the mid-performance range. This positions GPT-4o-mini as a highly effective model that balances performance with efficiency.
  • Gemini-2.5 showed no statistically significant differences when compared to either GPT-4o or GPT-4o-mini, suggesting that this model exhibits a balanced performance at an intermediate level. Its consistent, non-significant differences imply it is a robust alternative, offering a comparable qualitative experience in many aspects.
These findings highlight that the latest LLMs differentiate themselves not only in basic accuracy but also in the aspects that determine practical usability. When selecting a model for real-world deployment, it is therefore crucial to prioritize the metrics that matter for the intended use case; a multifaceted evaluation, rather than a single score, is essential for informed model selection in web log risk analysis.
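As an illustration of such multifaceted selection, the sketch below aggregates per-metric quality scores into a single use-case-weighted composite. Both the models' scores and the weights are hypothetical; an operator would substitute measured values and weights reflecting their own priorities.

```python
# Illustrative sketch of use-case-weighted model selection; the metric
# scores and weights below are hypothetical, not values from this study.
models = {
    "GPT-4o":      {"accuracy": 3.4, "completeness": 3.6,
                    "actionability": 3.5, "coherence": 3.7},
    "GPT-4o-mini": {"accuracy": 3.3, "completeness": 3.3,
                    "actionability": 3.2, "coherence": 3.4},
}
# An SME that prioritizes immediately executable guidance might weight
# actionability most heavily.
weights = {"accuracy": 0.3, "completeness": 0.2,
           "actionability": 0.4, "coherence": 0.1}

def composite(scores: dict) -> float:
    """Weighted sum of the per-metric scores."""
    return sum(weights[m] * scores[m] for m in weights)

best = max(models, key=lambda name: composite(models[name]))
print(best, {name: round(composite(s), 2) for name, s in models.items()})
```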

5.5. Discussion on Findings

This study aimed to design and evaluate a ChatGPT-based web log risk analysis system capable of automatically extracting useful information from web log data, identifying potential security threats, and alerting web service providers for quick response. To this end, a comprehensive performance evaluation was conducted, encompassing a quantitative analysis using established natural language processing (NLP) metrics, as well as a qualitative assessment through expert evaluation.
The quantitative evaluation consistently demonstrated that GPT-4o-mini exhibited the highest overall performance across standard NLP metrics, including ROUGE, BLEU, METEOR, and BERT-Score. Its superior scores across these metrics highlighted its exceptional ability to effectively capture essential information, maintain grammatical accuracy and fluency, and ensure semantic similarity and contextual meaning when generating risk notifications. This suggests that GPT-4o-mini excels in transforming complex security risk information into comprehensible and user-friendly alerts, making it a highly efficient choice for automated text generation within such a system. The Gemini 2.5 Flash model also proved to be a strong contender in the quantitative aspects, showing a competitive performance similar to that of GPT-4o-mini.
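For reproducibility, these four metric families can be computed with widely used open-source packages (rouge-score, nltk, bert-score). The sketch below uses hypothetical reference and candidate alerts rather than outputs from this study.

```python
# Sketch of the quantitative metrics, computed with common open-source
# packages (pip install rouge-score nltk bert-score). The texts are
# hypothetical placeholders, not alerts generated in this study.
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # needs nltk wordnet data
from bert_score import score as bert_score

reference = "Critical risk: repeated SQL injection attempts from 10.0.0.5; block the source IP."
candidate = "High risk: repeated SQL injection from 10.0.0.5 detected; recommend blocking the IP."

# ROUGE-1/2/L F-measures.
rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print({k: v.fmeasure for k, v in rouge.score(reference, candidate).items()})

# Sentence-level BLEU with smoothing (short texts otherwise score near 0).
ref_tok, cand_tok = reference.split(), candidate.split()
print(sentence_bleu([ref_tok], cand_tok,
                    smoothing_function=SmoothingFunction().method1))

# METEOR expects pre-tokenized inputs in recent NLTK versions.
print(meteor_score([ref_tok], cand_tok))

# BERTScore precision/recall/F1 (downloads a model on first use).
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(P.item(), R.item(), F1.item())
```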
However, the qualitative expert evaluation provided a crucial, complementary perspective, revealing nuances not fully captured by automated metrics. While GPT-4o-mini showed a statistically significant edge in Accuracy over LLaMA3, the full GPT-4o model demonstrated a statistically significant superiority over GPT-4o-mini in Completeness, Actionability, and Logical Consistency. These qualitative distinctions are particularly vital for a security system, where the ability of an alert to comprehensively convey all necessary information, guide immediate and effective action, and present a logically coherent understanding of the threat can be paramount. This suggests that the larger scale and more sophisticated reasoning capabilities of GPT-4o may lead to a higher perceived quality and practical utility for human experts in complex security scenarios, despite its lower scores on the automated metrics.
Conversely, GPT-4o-mini consistently outperformed LLaMA3 across all qualitative metrics, showing a statistically significant superiority or strong positive tendencies. This reinforces GPT-4o-mini’s position as a robust and highly effective model, bridging the gap between high-end performance and computational efficiency. The Gemini 2.5 Flash model maintained a balanced qualitative performance, showing no statistically significant differences when compared to GPT-4o or GPT-4o-mini, indicating its reliability as a viable alternative.
In conclusion, the multifaceted evaluation underscores that the choice of an LLM for a web log risk analysis system is a strategic decision involving trade-offs between various performance dimensions. While GPT-4o-mini stands out for its overall quantitative excellence and strong qualitative performance against lighter models, offering an excellent balance of efficiency and quality, GPT-4o demonstrates a critical advantage in providing more comprehensive, actionable, and logically coherent alerts as perceived by security experts. This study therefore highlights that for critical security applications where the precision and utility of human-interpretable alerts are paramount, investing in models like GPT-4o may yield superior operational benefits, even if it entails higher computational demands. For scenarios where a strong balance of performance and efficiency is key, GPT-4o-mini is an exceptional candidate. Ultimately, this research provides valuable insights into the differential capabilities of leading LLMs for security text generation, advocating for a nuanced, context-aware model selection process to achieve optimal system effectiveness in real-world cybersecurity environments.

6. Conclusions and Future Work

The proposed system demonstrated high accuracy in identifying threats within web log data. Additionally, it provided an admin page that enables system administrators to easily monitor log data and take the necessary actions. This significantly enhances the efficiency of security operations, allowing security personnel to focus on higher-level security management tasks.
One of the key contributions of this study lies not in inventing a new detection algorithm, but in explicitly specifying and integrating each stage of the end-to-end pipeline—ranging from log collection and normalization, through LLM-based interpretation and risk analysis, to database storage, visualization, and policy configuration. This architectural integration demonstrates the operational feasibility of applying LLMs to real-time cybersecurity workflows.
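To make the interpretation stage concrete, the following is a condensed sketch of how a normalized log row (shaped like Table 1) could be passed to the OpenAI Chat Completions API. The prompt wording and the analyze_log helper are illustrative assumptions, not the system's exact implementation.

```python
# Condensed sketch of the LLM interpretation stage of the pipeline,
# assuming normalized rows shaped like Table 1. The prompt wording and
# the analyze_log helper are hypothetical, not the exact ones used here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_log(row: dict) -> str:
    prompt = (
        "Assess the security risk of this web access log entry and assign "
        "a level (Low/Medium/High/Critical) with a one-line rationale:\n"
        f"{row}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

row = {"client_ip": "10.0.0.5", "verb": "GET",
       "request": "/rest/products/search?q=')) UNION SELECT ...--",
       "response": "200"}
print(analyze_log(row))  # the result would then be stored and visualized
```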
Several novel elements distinguish this system from prior work, as follows: (1) a domain-specific focus on web access logs, (2) a hybrid evaluation strategy that combines quantitative NLP metrics with expert-based qualitative assessments focusing on actionability and logical consistency, (3) a formalized risk-level classification used to trigger responses, (4) automated security policy generation based on interpretation outputs (e.g., “Critical + repeated IP”), and (5) a unified full-stack architecture linking all operational components.
From a practical standpoint, the system supports actionable decision-making through automatic policy enforcement, provides operational guidance for model selection, and stores all alert metadata in a structured schema, enabling a retrospective analysis of patterns and policy effectiveness. These characteristics make it particularly valuable for resource-constrained environments such as small- and medium-sized enterprises (SMEs).
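As an illustration of the “Critical + repeated IP” trigger mentioned above, the sketch below assumes risk rows shaped like Table 3; the repetition threshold and the iptables blocking rule are illustrative assumptions rather than the exact production policy.

```python
# Sketch of the "Critical + repeated IP" policy trigger, assuming rows
# shaped like Table 3. The threshold and the iptables rule form are
# illustrative assumptions, not the exact production policy.
import subprocess

REPEAT_THRESHOLD = 5  # hypothetical: repeated offenses before blocking

def enforce(risk_row: dict) -> None:
    if risk_row["level"] == "Critical" and int(risk_row["count"]) >= REPEAT_THRESHOLD:
        # Drop further traffic from the offending source IP
        # (requires root privileges on the web server).
        subprocess.run(
            ["iptables", "-A", "INPUT", "-s", risk_row["ip"], "-j", "DROP"],
            check=True,
        )

enforce({"ip": "10.0.0.5", "level": "Critical", "count": "7"})
```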
Nevertheless, this study has some limitations. It primarily focuses on log management and vulnerability analysis within a specific web application environment, namely the OWASP Juice Shop. Therefore, when applied to different types of applications or operational contexts, additional fine-tuning may be necessary to ensure optimal performance. Moreover, the analytical framework is currently dependent on specific tools and models, which may affect generalizability.
In future work, we plan to expand the system’s scope to include additional log domains, refine real-time policy automation logic, and explore generalization to unseen web environments. Furthermore, the integration of continuously learning models and adaptive prompt strategies may enhance the long-term resilience against evolving threats. This study underscores the value of combining AI-driven language models with automated security workflows and sets the stage for future advances in intelligent and scalable cybersecurity systems.

Author Contributions

H.J.: conceptualization, methodology, software, validation, investigation, resources, data curation, writing—original draft preparation, and writing—review and editing. I.J.: conceptualization, methodology, validation, investigation, supervision, and writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mi, H.; Wang, H.; Zhou, Y.; Lyu, M.R.T.; Cai, H. Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems. IEEE Trans. Parallel Distrib. Syst. 2013, 24, 1245–1255.
  2. Reinsel, D.; Gantz, J.; Rydning, J. The Digitization of the World from Edge to Core; IDC White Paper; International Data Corporation: Needham, MA, USA, 2018; Volume 13.
  3. Kent, K.; Souppaya, M. Guide to Computer Security Log Management; NIST Special Publication 800-92; National Institute of Standards and Technology (NIST): Gaithersburg, MD, USA, 2006.
  4. Awotipe, O. Log Analysis in Cyber Threat Detection; M.S. Creative Component, Department of Information Assurance, Iowa State University: Ames, IA, USA, 2019.
  5. Du, M.; Li, F.; Zheng, G.; Srikumar, V. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017; pp. 1285–1298.
  6. Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, S.; Sun, P.; et al. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 4739–4745.
  7. Karlsen, E.; Luo, X.; Zincir-Heywood, N.; Heywood, M. Benchmarking Large Language Models for Log Analysis, Security, and Interpretation. J. Netw. Syst. Manag. 2024, 32, 59.
  8. Le, V.-H.; Zhang, H. Log Parsing: How Far Can ChatGPT Go? In Proceedings of the 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), Luxembourg, 11–15 September 2023; pp. 1699–1700.
  9. Lohar, P.; Baraskar, T. Automated AI Tool for Log File Analysis. In Proceedings of the 2025 6th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI), Goathgaun, Nepal, 7–8 January 2025; pp. 1762–1766.
  10. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pretrain, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35.
  11. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837.
  12. Gao, S.; Wen, X.-C.; Gao, C.; Wang, W.; Zhang, H.; Lyu, M.R. What makes good in-context demonstrations for code intelligence tasks with LLMs? In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering, Kirchberg, Luxembourg, 11–15 September 2023.
  13. Tyagi, S.; Kumar, K. Evaluation of Static Web Vulnerability Analysis Tools. In Proceedings of the 2018 Fifth International Conference on Parallel, Distributed and Grid Computing (PDGC), Solan, India, 20–22 December 2018; pp. 1–6.
  14. Elasticsearch, Logstash, Kibana. ELK Stack. Available online: www.elastic.co/elk-stack (accessed on 21 February 2025).
  15. A Toolkit for Automated Log Parsing. 2023. Available online: https://github.com/logpai/logparser (accessed on 21 February 2025).
  16. Mudgal, P.; Wouhaybi, R. An Assessment of ChatGPT on Log Data. In Proceedings of the First International Conference, AIGC 2023, Shanghai, China, 25–26 August 2023; pp. 148–169.
  17. GPT-3.5 Turbo. 2023. Available online: https://platform.openai.com/docs/models/gpt-3.5-turbo (accessed on 15 February 2025).
Figure 1. Architectural workflow of web log risk analysis system.
Figure 2. Logstash structure.
Figure 3. Refined log data.
Figure 4. The logstash.conf settings.
Figure 5. Log data threat analysis levels.
Figure 6. Results of general web log analysis.
Figure 7. SQL injection attack log analysis result.
Figure 8. XSS attack log analysis results.
Figure 9. File upload vulnerability attack log analysis results.
Figure 10. Security settings results.
Figure 11. The prompt provided to each model.
Table 1. Sanitized logs processed in the log data processing stage.
Column Name | Data Type | NULL | Key
id | integer | NOT NULL | Primary key
client_ip | character varying (100) | NOT NULL |
agent | character varying (100) | NOT NULL |
http_ver | character varying (100) | NOT NULL |
verb | character varying (100) | NOT NULL |
response | character varying (100) | NOT NULL |
bytes | character varying (100) | NOT NULL |
request | character varying (100) | NOT NULL |
referrer | character varying (100) | NOT NULL |
log_created_at | timestamp | NOT NULL |
created_at | timestamp | NOT NULL |
Table 2. AI analysis logs generated by the AI analysis model.
Column Name | Data Type | NULL | Key
id | integer | NOT NULL | Primary key
message | character varying (3000) | NOT NULL |
created_at | timestamp | NOT NULL |
Table 3. Risk evaluations and security settings determined by the Security Settings Program.
Column Name | Data Type | NULL | Key
id | integer | NOT NULL | Primary key
ip | character varying (100) | NOT NULL |
level | character varying (100) | NOT NULL |
count | character varying (100) | NOT NULL |
analyze_time | timestamp | NOT NULL |
created_at | timestamp | NOT NULL |
Table 4. Evaluation metrics for different models.
Model | ROUGE-1f | ROUGE-2f | ROUGE-Lf | BLEU | METEOR | BERTp | BERTr | BERTf
GPT-4o | 0.5865 | 0.4730 | 0.5416 | 24.7855 | 0.3923 | 0.9019 | 0.8648 | 0.8829
GPT-4o-MINI | 0.6781 | 0.5451 | 0.6087 | 43.6969 | 0.5353 | 0.9164 | 0.9061 | 0.9112
Gemini 2.5 Flash | 0.6496 | 0.5214 | 0.5775 | 42.1649 | 0.5298 | 0.9146 | 0.9068 | 0.9107
LLaMA3 | 0.5190 | 0.3913 | 0.4497 | 24.9215 | 0.4315 | 0.8742 | 0.8884 | 0.8812
The best value for each metric is achieved by GPT-4o-MINI, except for BERTr, where Gemini 2.5 Flash is highest.
Table 5. Expert evaluation criteria. Each metric was scored on a 1–4 scale.
Accuracy
  • Whether there are errors in interpreting core information in logs, such as IP, timestamp, HTTP method/endpoint, and status code.
  • Whether the risk-level judgment was based on factual relations.
Completeness
  • Whether all key clues, such as repeated requests, suspicious user agents, and attempts to access sensitive endpoints, are mentioned.
  • Whether any potential risk factors are omitted.
Actionability
  • Whether the proposed response measures (monitoring, blocking rules, session review, etc.) are at a level that the operations team can implement immediately.
  • Whether specific priorities are presented.
Logical Consistency
  • Whether the flow from relationship analysis → technical risk assessment → inference of threat intent is naturally connected.
  • Whether the conclusion clearly matches the presented evidence.
Table 6. Wilcoxon signed-rank test results for Accuracy rank comparisons between LLMs.
Model Pair | W-Statistic | p-Value
GPT-4o vs. GPT-4o-mini | 1638.0 | 0.0926
GPT-4o vs. Gemini-2.5 | 1834.0 | 0.3808
GPT-4o vs. LLaMA3 | 1965.5 | 0.7367
GPT-4o-mini vs. Gemini-2.5 | 1865.0 | 0.4399
GPT-4o-mini vs. LLaMA3 | 1529.0 | 0.0338
Table 7. Wilcoxon signed-rank test results for Completeness rank comparisons between LLMs.
Model Pair | W-Statistic | p-Value
GPT-4o vs. GPT-4o-mini | 1570.5 | 0.0482
GPT-4o vs. Gemini-2.5 | 1785.0 | 0.2723
GPT-4o vs. LLaMA3 | 1971.5 | 0.7996
GPT-4o-mini vs. Gemini-2.5 | 1637.0 | 0.1069
GPT-4o-mini vs. LLaMA3 | 1587.5 | 0.0572
Table 8. Wilcoxon signed-rank test results for Actionability rank comparisons between LLMs.
Model Pair | W-Statistic | p-Value
GPT-4o vs. GPT-4o-mini | 1575.5 | 0.0457
GPT-4o vs. Gemini-2.5 | 1737.5 | 0.1653
GPT-4o vs. LLaMA3 | 1881.0 | 0.4627
GPT-4o-mini vs. Gemini-2.5 | 1643.5 | 0.0968
GPT-4o-mini vs. LLaMA3 | 1382.5 | 0.0074
Table 9. Wilcoxon signed-rank test results for Logical Consistency rank comparisons between LLMs.
Model Pair | W-Statistic | p-Value
GPT-4o vs. GPT-4o-mini | 1502.0 | 0.0233
GPT-4o vs. Gemini-2.5 | 1807.5 | 0.3190
GPT-4o vs. LLaMA3 | 1846.5 | 0.3942
GPT-4o-mini vs. Gemini-2.5 | 1680.5 | 0.1537
GPT-4o-mini vs. LLaMA3 | 1458.5 | 0.0159