ATIRS: Towards Adaptive Threat Analysis with Intelligent Log Summarization and Response Recommendation

Park, Daekyeong; Min, Byeongjun; Lim, Sungwon; Kim, Byeongjin

doi:10.3390/electronics14071289


by Daekyeong Park *, Byeongjun Min, Sungwon Lim and Byeongjin Kim

Cyber Battlefield Field, Hanwha Systems, Seongnam-si 13524, Republic of Korea

* Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1289; https://doi.org/10.3390/electronics14071289

Submission received: 26 February 2025 / Revised: 21 March 2025 / Accepted: 24 March 2025 / Published: 25 March 2025

(This article belongs to the Special Issue AI-Based Solutions for Cybersecurity)


Abstract

Modern maritime operations rely on diverse network components, increasing cybersecurity risks. While security solutions like Suricata generate extensive network alert logs, ships often operate without dedicated security personnel, requiring general crew members to review and respond to alerts. This challenge is exacerbated when vessels are at sea, where limited external support delays threat mitigation. To address this, we propose the Adaptive Threat Intelligence and Response Recommendation System (ATIRS), a small language model (SLM)-based framework that automates network alert log summarization and response recommendation. The ATIRS processes real-world Suricata network alert log data and converts unstructured alerts into structured summaries, allowing the response recommendation model to generate contextually relevant and actionable countermeasures. It then suggests appropriate follow-up actions, such as IP blocking or account locking, to support timely and effective threat response. Additionally, the ATIRS employs adaptive learning, continuously refining its recommendations based on user feedback and emerging threats. Experimental results from shipboard network data demonstrate that the ATIRS significantly reduces the Mean Time to Respond (MTTR) while alleviating the burden on crew members, allowing for faster and more efficient threat mitigation, even in resource-constrained maritime environments.

Keywords:

maritime cybersecurity; shipboard network; Suricata; alert summarization; automated response

1. Introduction

The size and complexity of internal ship networks are growing rapidly as the maritime industry combines cutting-edge technologies such as satellite communications, maritime IoT, and automated navigation systems [1]. These advances improve operational efficiency, safety, and economy, but they also introduce new cybersecurity threats [2,3,4].

In fact, as various devices and sensors inside the ship, crew work systems, and remote control modules are interconnected, attackers are more likely to target the maritime environment and attempt ransomware, spoofing, radio frequency (RF) jamming, and supply chain attacks [5,6,7]. These threats can directly affect ships’ safety and operational efficiency at sea [8,9]. In particular, since it is difficult to receive external assistance when a ship is sailing in open waters after leaving the port, there is a greater risk as sailors may struggle to respond immediately in the event of a security incident [10,11].

A real-world example of such a threat is the 2023 ransomware attack on the Norwegian classification society DNV, which compromised its ShipManager system, a widely used fleet management software. The attack led to a system shutdown, affecting nearly 1000 vessels worldwide and disrupting critical onboard operations, forcing shipowners to rely on offline procedures [12]. Similarly, in 2022, the Port of Lisbon fell victim to a ransomware attack by the LockBit group, disrupting administrative and operational networks. While core shipping activities remained functional, the attack exposed security gaps in port infrastructure, demonstrating how cybercriminals can target maritime logistics hubs to disrupt trade and demand ransoms [13].

Another significant case occurred in 2020, when CMA CGM, one of the largest shipping companies, was hit by the Ragnar Locker ransomware attack, forcing a two-week system shutdown that delayed container operations and disrupted global logistics chains. This attack demonstrated the severe operational risks posed by ransomware in the maritime industry and how prolonged recovery times can exacerbate financial losses and supply chain disruptions [14]. These incidents underscore the urgency of rapid threat response in maritime cybersecurity. Given the interconnected nature of modern shipping and port operations, delays in addressing cyber threats can escalate into widespread operational failures, financial losses, and compromised safety at sea.

To prevent this, many shipping companies and shipowners are introducing Intrusion Detection System (IDS) and Intrusion Prevention System (IPS) solutions into ship internal networks. Among these, Suricata is in the spotlight for its network traffic analysis capabilities, supported by an active open-source community. Suricata monitors network flow in real-time and generates alert logs using predefined signature rules or anomaly detection methods [15,16]. Although Suricata and similar IDS tools can detect potential threats, they lack automated context-aware analysis or advanced summarization functions, requiring security experts to interpret logs and decide on responses.

However, ships usually have no security experts on board, so when large volumes of alert logs are generated, crew members, who typically have only basic user-level IT knowledge, often have to analyze these logs on their own. Unlike trained cybersecurity professionals, most seafarers are not well-versed in network security protocols, log analysis, or advanced threat detection techniques, making it difficult for them to interpret complex security alerts accurately. Large-scale alert logs generated by IDS solutions such as Suricata can appear in diverse and extensive forms due to the characteristics of the ship’s internal environment (e.g., satellite communication, limited bandwidth, and maritime IoT).

This places a significant burden on crew members, who have limited manpower and time resources to manually assess security events. Misinterpreting or overlooking critical alerts due to a lack of cybersecurity expertise can result in delayed or inadequate responses, potentially exposing vessels to serious threats such as ransomware, GPS spoofing, or unauthorized access. In addition, providing immediate remote assistance from land is challenging because communication bandwidth with the land control center is limited during long voyages. As a result, crew members may struggle to identify high-priority threats amid the overwhelming daily torrent of alerts and determine the appropriate actions to take, increasing the likelihood of delayed threat responses and serious security incidents [17]. Moreover, the absence of real-time cybersecurity guidance forces crew members to rely on their limited IT knowledge and experience, heightening the risk of misjudgments in critical situations.

Recently, large language models (LLMs) have been actively introduced in the field of cybersecurity to address these challenges [18,19]. However, leading LLMs such as GPT-4 [20], LLaMA3 [21], and Gemini [22] require enormous computing resources. Therefore, attempts to introduce small language models (SLMs), which alleviate the high resource demands of LLMs and can be optimized for specific purposes or environments, are increasing [23]. SLMs not only save memory and computational resources by using relatively small parameter sizes but also maintain efficiency and high performance through fine-tuning and quantization techniques tailored to specific domains. Due to these characteristics, they are gaining attention as cost-effective solutions that can be implemented in resource-constrained environments.

In this paper, we propose the Adaptive Threat Intelligence and Response Recommendation System (ATIRS), an intelligent log summarization and automatic response recommendation system that uses SLMs to address the constraints of the ship environment and the challenges of security operations. Unlike traditional IDS solutions, which focus primarily on detecting suspicious traffic and raising alerts, the ATIRS summarizes network alert logs and provides users with response recommendations based on these summaries. Specifically, the ATIRS aims to improve the efficiency of security operations through the following process. First, network alert logs collected inside the ship are preprocessed, and the SLM-based summarization module analyzes and concisely summarizes their key information. This step extracts essential data such as attack type, risk level, source/destination IP, and occurrence time so that the SLM-based response recommendation module can interpret them more effectively.

Additionally, to address the limitations of rule-based analysis, the natural language understanding and generation capabilities of the SLM are leveraged to capture the context of complex security logs and generate intuitive signature summaries. Then, the SLM-powered response recommendation module automatically suggests follow-up actions such as IP blocking, account locking, and network segment separation, allowing crews to respond swiftly and efficiently even in environments without dedicated security personnel. Furthermore, the ATIRS improves recommendation accuracy for similar or new alerts by applying adaptive learning, which continuously incorporates users’ response choices and new threat data.

The main contributions of this study can be summarized as follows:

Automated Network Alert Log Summarization: The system automatically analyzes and summarizes key information such as major attack types, severity levels, and timestamps from large volumes of logs, enabling crew members with limited security expertise to understand them more easily. Additionally, the SLM-based response recommendation module utilizes this information to suggest appropriate response actions.
Automated Response Recommendation: Instead of relying on traditional rule-based approaches, the system leverages SLM inference capabilities based on alert log summaries to recommend response actions automatically. This helps mitigate the lack of security expertise among crew members while reducing the burden caused by excessive alerts.
Performance Evaluation Based on a Ship Environment: The proposed system is implemented and evaluated using network alert logs collected from an actual ship environment. The effectiveness of the system is verified by measuring quantitative benefits such as reduced Mean Time to Respond (MTTR) and alleviation of the crew’s alert analysis workload.

The remainder of this paper is organized as follows:

Section 2 reviews related work on IDS/IPS systems, the application of LLMs in cybersecurity, and LLM fine-tuning techniques. Section 3 describes the overall architecture of the proposed ATIRS system. Section 4 introduces the dataset, experimental settings, and evaluation methods used in the study. Section 5 presents the experimental results and analysis. Finally, Section 6 provides conclusions and discusses potential future research directions.

2. Related Work

2.1. IDS/IPS Systems

IDS detects and blocks abnormal activities by comparing normal network patterns with anomalous cyberattack patterns at a defined network boundary [24,25]. Praptodiyono et al. [26] proposed a hybrid approach that integrates Suricata (IDS) and pfSense (firewall) to mitigate Distributed Denial of Service (DDoS) attacks. Their study demonstrated that this integration effectively detected and blocked malicious traffic, thereby improving the Quality of Service (QoS). Experimental results indicated a 1.08% increase in throughput and an approximately 88% reduction in delay and jitter, confirming enhanced network performance.

Ouiazzane et al. [27] proposed a hybrid, intelligent, and distributed intrusion detection system (KIDS). Using the CICIDS2017 dataset, they performed anomaly detection with a decision tree (DT) and identified known attacks using Suricata. The DT algorithm effectively distinguished normal traffic with an accuracy of 99.9%, maintained a low false alarm rate, and quickly blocked known attacks through Suricata’s multi-threading function. As a result, the proposed system achieved high detection performance and a low false alarm rate, demonstrating significant effectiveness in responding to cyber threats.

Ohri et al. [28] applied Suricata IPS to address security vulnerabilities in ONOS controllers within SDN (Software-Defined Networking) environments. Their experimental results showed that Suricata effectively blocked web-based DDoS traffic and successfully defended against malicious traffic targeting the ONOS control layer. This study was the first to focus on protecting ONOS in an SDN environment and provides valuable insights into enhancing SDN security.

Ernawati et al. [29] compared the detection performance of three IDSs—PSAD, Portsentry, and Suricata—in identifying port scanning, DDoS SYN flood, and brute-force attacks. The experimental results showed that both Suricata and PSAD achieved excellent performance, with 100% detection accuracy. Additionally, Suricata had advantages in CPU and disk usage, Portsentry exhibited relatively low RAM usage, and PSAD recorded the fastest detection speed, with an average of 4.21 s. Overall, Suricata and PSAD were evaluated as highly suitable network IDSs.

Bada et al. [30] compared and analyzed open-source IDSs such as Suricata, Snort, and Bro to assess their detection performance across various traffic scenarios. Their experiments compared the accuracy of detecting different types of threats, including Denial of Service (DoS) and User-to-Root (U2R) attacks.

Nam et al. [31] and Xing et al. [32] proposed Suricata OpenFlow and SnortFlow, which are elastic and distributed intrusion detection and prevention systems (IDPS) designed for DoS attack defense in virtualized SDN environments. These systems were developed to achieve high scalability and efficiency by integrating Snort- and Bro-based intrusion detection capabilities with OpenFlow-based network reconfiguration technology.

2.2. LLMs for Cyber Security

LLMs, trained on extensive text datasets with vast linguistic information, have become a key technology that has significantly advanced human language processing, generation, and understanding. These models can handle complex linguistic tasks and generate coherent and meaningful outputs based on contextual information acquired from large amounts of text [33]. As a result, recent research paradigms are shifting toward solving complex problems using LLMs. In particular, the ability of LLMs to analyze massive text data from diverse sources is emerging as a game changer in the field of cybersecurity [34,35].

Balasubramanian et al. [36] developed CYGENT, a cybersecurity conversational agent based on GPT-3.5, and proposed a framework for automating security log analysis and summarization. CYGENT summarizes log data using a fine-tuned GPT-3 model and demonstrated superior performance in a comparative evaluation against the CodeT5 model, with the GPT-3 Davinci model showing the best results.

Roy et al. [37] introduced PhishLang, a framework designed to enhance LLM-based phishing website detection and explainability. PhishLang integrates with GPT-3.5 Turbo to provide phishing detection and explanation functions for warning messages, building a detection system that is both faster and more lightweight than conventional machine learning models. In a 3.5-month test, PhishLang detected approximately 26,000 phishing URLs and effectively identified threats that were not listed in existing blacklists.

Ferrag et al. [38] proposed a lightweight BERT-based security model (SecurityBERT) for detecting cyber threats in IoT/IIoT environments. This model employs the Privacy-Preserving Fixed-Length Encoding (PPFLE) technique to convert network traffic data into a format that the BERT model can process. SecurityBERT demonstrated superior detection performance compared to traditional machine learning and deep learning models (CNN, RNN). In an experiment using the Edge-IIoTset dataset, it achieved 98.2% accuracy, and its small model size (16.7 MB) and short inference time (0.15 s) proved its feasibility for real-time detection in resource-constrained IoT devices.

Sánchez et al. [39] developed a malware detection framework utilizing LLM-based system call analysis. This study explored methods to enhance malware detection performance by applying transfer learning to pre-trained LLMs. The experimental results showed that the BigBird and Longformer models achieved an accuracy and F1-Score of approximately 0.86. Additionally, the study found that the context length of system calls had a significant impact on detection performance. Based on this finding, the balance between real-time detection capabilities and computational complexity was analyzed. The study suggested the potential for real-time malware detection applications in military and high-risk environments.

Houssel et al. [40] investigated methods to enhance the explainability of network intrusion detection systems (NIDS) using LLMs. They evaluated the performance of NetFlow-based malicious traffic detection with OpenAI’s GPT-4 and Meta’s LLaMA3 models and conducted a comparative analysis against traditional machine learning models (Random Forest, LSTM, etc.). The results showed that while LLMs were less effective than traditional models in precise attack detection, they exhibited strong potential as auxiliary tools for explaining detected threats. In particular, when combined with Retrieval-Augmented Generation (RAG) and function-calling capabilities, LLMs showed high potential as complementary explanation tools for existing NIDS.

Zhang et al. [41] proposed an automatic NIDS framework using LLMs. They compared and analyzed GPT-4, GPT-3.5, and LLaMA models, incorporating the in-context learning (ICL) technique to enhance detection performance. The research findings indicated that applying ICL to GPT-4 increased detection accuracy and F1-Score by over 90%, and with only 10 examples, both metrics exceeded 95%. This study highlights the potential of AI-driven automation in wireless network security and 6G environments.

Sufi [42] introduced a GPT-based cyber threat intelligence extraction framework for Open-Source Intelligence (OSINT). By analyzing past cyber incident reports using an LLM, the framework automatically extracted seven key cyber threat factors: attacker type, target, attack source country, attack destination country, attack level, attack type, and attack time. Additionally, a CNN-based anomaly detection system and an NLP-based Q&A system were integrated to automate cyber threat analysis and enhance explainability. Analyzing 214 major cyber incident reports, the framework achieved a Precision of 96%, Recall of 98%, and F1-Score of 97%, outperforming traditional machine learning-based cyber threat detection methods.

2.3. LLM Fine-Tuning

General-purpose LLMs such as GPT-4, LLaMA3, and Gemini are often not optimized for the specific tasks required by users because they were not trained with a particular purpose in mind [43]. To address this issue, a fine-tuning process can be performed, as shown in Figure 1, to adapt the model’s output to a specific domain or use case.

The existing Full Fine-Tuning (FFT) method adjusts all the weights of the model, which requires a high computational cost and large memory resources [44]. FFT is relatively inexpensive for smaller models such as BERT [45] and TinyBERT [46], but the computational burden increases significantly for larger models such as LLaMA3 and GPT-4. Consequently, the Parameter-Efficient Fine-Tuning (PEFT) technique is emerging as a more efficient alternative. Figure 2 illustrates the LLM fine-tuning method using the PEFT technique.

PEFT is an intermediate approach between the existing FFT method and the transfer learning method, significantly reducing the computational cost and memory usage required for fine-tuning a large model. The first section (FFT) of Figure 2 illustrates the method of updating all the weights of the model. A pre-trained model with 16-bit precision (FP16) weights is used, and when an input feature is provided, the forward and backward processes are performed to update all the weights. This method offers the highest performance but has the disadvantage of requiring high computational cost and memory usage.

The second section (LoRA Fine-Tuning) of Figure 2 explains the PEFT method using the Low-Rank Adaptation (LoRA) technique. In this method, the existing model’s weights are frozen, and only additional low-rank matrices (A, B) are learned. The original weight matrix W remains unchanged, and the backward process is performed only on the newly added matrices A and B. This allows for maintaining similar performance while reducing memory usage and computational complexity compared to FFT. As a result, a single pre-trained model can be efficiently optimized for various use cases [47].
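The update described above can be written out in the standard LoRA formulation. The dimension symbols $d$, $k$, the rank $r$, and the scaling factor $\alpha$ are notation assumed here for illustration; they are not taken from Figure 2.

```latex
h = W x + \Delta W\, x = W x + \frac{\alpha}{r}\, B A\, x,
\qquad
W \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)

\text{trainable parameters: } r\,(d + k) \quad \text{versus} \quad d\,k \text{ for full fine-tuning}
```

As a purely illustrative calculation, with $d = k = 4096$ and $r = 16$ the adapter holds $16 \times 8192 = 131{,}072$ trainable values against $16{,}777{,}216$ in $W$, or roughly 0.78% of the original matrix.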

The third section (QLoRA Fine-Tuning) in Figure 2 represents the Quantized Low-Rank Adaptation (QLoRA) technique. This method is similar to LoRA but provides additional memory savings by quantizing the existing model’s weights to 4-bit precision (NF4). Here, the original weight matrix W is frozen, and only the low-rank matrices A and B are learned. QLoRA has the advantage of maintaining high performance while further reducing computational complexity and memory usage compared to LoRA [48].
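The memory effect of 4-bit quantization can be estimated with simple arithmetic. The sketch below is a rough, hedged estimate that counts only the stored base weights; it ignores quantization constants, activations, optimizer state, and the LoRA matrices themselves. The 7-billion parameter count is an illustrative value, not a figure reported for this experiment.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 7e9  # illustrative model size (assumption)

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights, as in FFT and LoRA
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit weights, as in QLoRA

print(f"16-bit base weights: {fp16_gb:.1f} GB")
print(f" 4-bit base weights: {nf4_gb:.1f} GB")
```

Under these assumptions the frozen base weights shrink by a factor of four, which is where most of the memory saving of QLoRA over LoRA comes from.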

As shown in Figure 2, PEFT enables more efficient computation than the existing FFT method, particularly facilitating the fine-tuning of large models in hardware-constrained environments. By appropriately utilizing PEFT techniques such as LoRA and QLoRA, existing LLMs can be cost-effectively optimized for various use cases and effectively applied even in resource-limited environments.

In this paper, we propose the ATIRS, a system based on a pre-trained SLM for summarizing network alert logs collected aboard a ship and generating response recommendations. The core idea of the ATIRS is to transform network security log data into a format that an SLM can process while leveraging its pre-trained knowledge. To achieve this, key information such as attack type, risk level, and source/destination IP addresses is extracted and summarized from the alert logs. Additionally, the system automatically suggests appropriate security measures, such as IP blocking and account locking, through a response recommendation module.

Moreover, a PEFT technique—specifically, QLoRA—is applied to maximize computational efficiency in hardware-constrained environments, reducing training costs without compromising performance. Consequently, the ATIRS is designed to summarize network threats in real time and effectively recommend response actions while maintaining high performance without requiring FFT.

3. Methodology

This section describes the methodology of ATIRS, which consists of Tokenization & Chat Template Conversion, and Fine-Tuning for Network Alert Log Summarization and Response Recommendation Generation.

3.1. Overview

The overall structure of the ATIRS is illustrated in Figure 3.

The ATIRS is designed to process network alert logs and generate log summaries and response recommendations efficiently using a fine-tuned transformer-based SLM. To effectively handle security-related tasks while maintaining computational efficiency, ATIRS employs a multi-stage processing pipeline consisting of the following:

Tokenization & Chat Template Conversion:
- A network alert log tokenizer is trained to bridge the modality gap between structured log data and natural language.
- Tokenized logs are converted into chat templates, ensuring compatibility with instruction-based fine-tuning for various SLMs.
Fine-Tuning for Log Summarization and Response Recommendation:
- ATIRS fine-tunes Phi-3.5 [49] (3.8B), Qwen2.5 [50] (7B), and Gemma-2 [51] (9B) to enhance their ability to summarize network logs and recommend security responses.
- Fine-tuning is optimized using QLoRA, enabling 4-bit and 8-bit low-rank adaptation to balance performance and computational efficiency.

The fine-tuning process follows a two-stage approach:

Stage B (Network Alert Log Summarization): Converts network alert logs into structured summaries, improving interpretability.
Stage C (Response Recommendation): Uses the summarized logs to generate appropriate security response actions, assisting personnel without cybersecurity expertise in making informed decisions.

By structuring the pipeline in this manner, the ATIRS effectively processes security alerts, reduces computational overhead, and improves response efficiency in resource-constrained environments.
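To make the pipeline input concrete, the following sketch parses one alert and pulls out the fields the paper names (attack type, risk level, source/destination IP, and occurrence time). It assumes the alerts are available as Suricata EVE JSON records, which the paper does not state explicitly. The sample record, its values, and the helper name `extract_key_fields` are invented for illustration. In the ATIRS the summary itself is produced by the fine-tuned SLM, not by a rule like this.

```python
import json
from typing import Optional

# A made-up alert in the shape of Suricata's EVE JSON output (all values illustrative).
SAMPLE_EVE_LINE = json.dumps({
    "timestamp": "2024-01-01T00:00:00.000000+0000",
    "event_type": "alert",
    "src_ip": "192.0.2.10",
    "src_port": 51234,
    "dest_ip": "198.51.100.20",
    "dest_port": 22,
    "proto": "TCP",
    "alert": {
        "signature_id": 1000001,
        "signature": "Example SSH brute-force attempt",
        "category": "Attempted Administrator Privilege Gain",
        "severity": 1,
    },
})

# The paper's dataset keeps Severity 1 and 2 alerts only.
HIGH_SEVERITY = {1, 2}


def extract_key_fields(eve_line: str) -> Optional[dict]:
    """Return the key fields of a high-severity alert, or None for anything else."""
    event = json.loads(eve_line)
    if event.get("event_type") != "alert":
        return None
    alert = event.get("alert", {})
    if alert.get("severity") not in HIGH_SEVERITY:
        return None
    return {
        "time": event.get("timestamp"),
        "src": f'{event.get("src_ip")}:{event.get("src_port")}',
        "dest": f'{event.get("dest_ip")}:{event.get("dest_port")}',
        "protocol": event.get("proto"),
        "attack_type": alert.get("signature"),
        "category": alert.get("category"),
        "severity": alert.get("severity"),
    }


fields = extract_key_fields(SAMPLE_EVE_LINE)
print(fields)
```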

3.2. Network Alert Logs and Summary Tokenization

To effectively process network security alerts, the ATIRS employs a specialized tokenization mechanism designed to bridge the modality gap between structured log data and natural language. This approach enhances token efficiency, ensuring that network alert logs are formatted optimally for downstream tasks such as log summarization and response recommendation. A comparative example of base tokenization versus ATIRS tokenization is illustrated in Figure 4.

Unlike standard tokenization, which often results in excessive token length and redundant segmentation of structured data, the ATIRS introduces a domain-specific tokenization strategy. The base tokenization method, as shown in Figure 4, produces a tokenized output with excessive fragmentation and the redundant splitting of critical fields such as timestamps, IP addresses, and protocol descriptions. This results in an inflated token length (1040), increasing computational overhead and reducing model efficiency.

Conversely, ATIRS tokenization optimizes the processing of network alert logs by preserving the structural integrity of key fields. As shown in Figure 4, the ATIRS tokenizer processes the same alert log while maintaining proper formatting of numerical values, network-related terms, and timestamps, reducing the overall token length to 604. This optimized representation allows for more efficient fine-tuning while retaining essential contextual information.

To achieve this, the ATIRS employs a domain-adapted Byte Pair Encoding (BPE) tokenizer, trained specifically on network security text. This tokenizer extends the vocabulary of standard SLM tokenizers, incorporating network-specific terminology, security-related event codes, and structured log patterns. As a result, the ATIRS tokenization approach ensures that network security alerts are effectively transformed into concise yet informative token sequences, reducing token redundancy and enhancing processing efficiency.
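The effect of training BPE on domain text can be illustrated with a toy trainer. This is a minimal sketch of the generic BPE merge procedure, not the ATIRS tokenizer: it starts from single characters, skips the whitespace pre-tokenization that production tokenizers apply, and learns merges from three invented log-like lines. The function names and the corpus are assumptions made for illustration.

```python
from collections import Counter


def apply_merge(seq: list, pair: tuple) -> list:
    """Replace every adjacent occurrence of `pair` in `seq` with one fused symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(corpus: list, num_merges: int) -> list:
    """Learn BPE merges by repeatedly fusing the most frequent adjacent symbol pair."""
    sequences = [list(line) for line in corpus]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text: str, merges: list) -> list:
    """Tokenize `text` by replaying the learned merges in training order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Invented log-like training lines (not from the paper's dataset).
corpus = [
    "src_ip=192.168.0.10 dest_ip=192.168.0.20 proto=TCP",
    "src_ip=192.168.0.11 dest_ip=192.168.0.20 proto=UDP",
    "src_ip=192.168.0.12 dest_ip=192.168.0.21 proto=TCP",
]
merges = train_bpe(corpus, num_merges=40)

sample = "src_ip=192.168.0.13 dest_ip=192.168.0.20 proto=TCP"
tokens = encode(sample, merges)
print(len(sample), "characters ->", len(tokens), "tokens")
```

Because recurring fragments such as field names and address prefixes are fused into single symbols, an unseen line of the same shape encodes to far fewer tokens than characters, which is the mechanism behind the shorter sequences described above.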

This specialized tokenization not only improves model efficiency by reducing token length but also enhances downstream tasks such as log summarization and response recommendation. By applying this technique, the ATIRS ensures that network alert logs are optimally formatted for large-scale processing, making it well suited for resource-constrained environments.

3.3. Chat Template Conversion

To standardize the input format for different language models, the ATIRS incorporates a chat template conversion mechanism. Since each SLM (e.g., Phi-3.5, Qwen2.5, Gemma-2) utilizes distinct chat formatting rules, a unified conversion step is necessary to ensure consistency in prompt structure. Figure 5 illustrates how the same input is adapted to different chat templates for these models.

The conversion process involves restructuring the network alert log prompts into the specific format required by each model. In the case of Phi-3.5, the template follows a structured format with designated <|system|>, <|user|>, and <|assistant|> tags, ensuring that system instructions and user queries are properly distinguished. Similarly, Qwen2.5 adopts a comparable format but uses <|im_start|> and <|im_end|> markers instead. On the other hand, Gemma-2 structures its interactions using <start_of_turn> and <end_of_turn> markers.

This template conversion serves several key functions: (I) Since each SLM interprets input differently, standardizing prompts eliminates formatting inconsistencies that could impact response quality. (II) Properly formatted prompts help models better distinguish between system messages, user queries, and expected responses. (III) By automating chat template adaptation, the ATIRS streamlines the training process across multiple models without requiring manual intervention.

By applying these structured chat templates, the ATIRS ensures seamless integration with different SLMs, optimizing performance in both network alert log summarization and response recommendation tasks. The automated chat template conversion mechanism is essential for maintaining model adaptability and robustness in processing network security alerts.
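The conversion step can be sketched as a small dispatcher that renders one (system, user) pair in each model's format. The strings below approximate the chat formats published with each model; the authoritative template is the one shipped with each model's tokenizer, which in practice would be applied through a helper such as `apply_chat_template` in Hugging Face Transformers. Folding the system text into the user turn for Gemma-2 is an assumption, since that model defines no separate system role, and the leading beginning-of-sequence token is omitted. All function names and the sample prompt are hypothetical.

```python
def render_phi(system: str, user: str) -> str:
    """Phi-3.5 style: role tags, each turn closed by <|end|>."""
    return f"<|system|>\n{system}<|end|>\n<|user|>\n{user}<|end|>\n<|assistant|>\n"


def render_qwen(system: str, user: str) -> str:
    """Qwen2.5 style (ChatML): turns wrapped in <|im_start|> ... <|im_end|>."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


def render_gemma(system: str, user: str) -> str:
    """Gemma-2 style: no system role, so the system text is folded into the user turn (assumption)."""
    return (
        f"<start_of_turn>user\n{system}\n\n{user}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


RENDERERS = {"phi-3.5": render_phi, "qwen2.5": render_qwen, "gemma-2": render_gemma}


def to_chat_prompt(model_name: str, system: str, user: str) -> str:
    """Render the same instruction and alert text in the format expected by `model_name`."""
    return RENDERERS[model_name](system, user)


system_msg = "Summarize the following network alert log."  # illustrative instruction
user_msg = "ET SCAN example alert text"                     # illustrative alert text
for name in RENDERERS:
    print(f"--- {name} ---")
    print(to_chat_prompt(name, system_msg, user_msg))
```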

3.4. Fine-Tuning and PEFT

To enhance the adaptability of the ATIRS in resource-constrained environments, we employ a PEFT approach using LoRA. This method enables efficient optimization by fine-tuning only a small subset of additional trainable parameters while keeping the majority of the pre-trained model frozen.

The fine-tuning process in the ATIRS consists of two key stages: network alert log summarization, referred to as the ATIRS-Summarization Model (ATIRS-SM), and response recommendation, referred to as the ATIRS-Response Recommendation Model (ATIRS-RM). Both stages leverage LoRA-based PEFT techniques to optimize SLMs for cybersecurity tasks while maintaining computational efficiency. Figure 6 illustrates how LoRA is integrated into the transformer architecture during fine-tuning, ensuring effective adaptation to the specific needs of each stage.

LoRA modifies the transformer’s self-attention mechanism by introducing low-rank matrices into the key (K), query (Q), and value (V) projections. Instead of updating the full weight matrices, LoRA injects trainable rank-decomposed matrices ($A_{q}, B_{q}$ for Q; $A_{k}, B_{k}$ for K; and $A_{v}, B_{v}$ for V) while keeping the pre-trained parameters frozen.

This approach significantly reduces memory consumption and computational complexity while enabling effective domain adaptation. The residual connection helps preserve the base model’s general knowledge while allowing the LoRA layers to specialize in cybersecurity-related tasks, minimizing the risk of catastrophic forgetting.
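A single LoRA-adapted projection can be sketched in a few lines. The example below is a dependency-free illustration of the standard LoRA forward pass, not the training code used for the ATIRS, which would rely on a deep learning framework. The class name `LoRALinear`, the matrix sizes, and the rank and scaling values are all assumptions chosen for illustration. The same pattern would be applied separately to the Q, K, and V projections.

```python
import random


def matvec(M: list, x: list) -> list:
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """A frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W: list, rank: int, alpha: float, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # pre-trained weight, kept frozen
        # A starts with small random values and B with zeros, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x: list) -> list:
        base = matvec(self.W, x)                    # frozen path
        update = matvec(self.B, matvec(self.A, x))  # low-rank path: B (A x)
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self) -> int:
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Illustrative 8x8 identity "projection" with a rank-2 adapter.
W = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(W, rank=2, alpha=4.0)
x = [float(i) for i in range(8)]

print("output before training:", layer.forward(x))
print("trainable:", layer.trainable_params(), "of", len(W) * len(W[0]), "base weights")
```

Because B is initialized to zero, the adapted layer reproduces the base model exactly before any training step, which is one reason the base model's behaviour is preserved at the start of fine-tuning.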

ATIRS-SM transforms raw network alert logs into concise summaries. The model processes network security alerts, which often contain complex and unstructured information. Using parameter-efficient fine-tuning, particularly LoRA, pre-trained SLMs such as Phi-3.5, Qwen2.5, and Gemma-2 are adapted to generate structured and informative summaries. These summaries provide essential contextual information that assists ATIRS-RM in generating accurate and relevant response recommendations.

Once the logs are summarized, ATIRS-RM utilizes these summaries to fine-tune pre-trained models for response recommendation. Rather than relying on raw network alert logs, ATIRS-RM learns from structured summaries, enabling a more effective mapping between security events and corresponding response actions. The response generation process involves analyzing summarized alerts and selecting the most suitable security measures, ensuring that the recommended actions are both contextually appropriate and actionable.

By employing hierarchical fine-tuning, ATIRS-RM optimizes its response recommendation process, leveraging LoRA-based adaptation to enhance efficiency while maintaining high-quality recommendations. The integration of ATIRS-SM and ATIRS-RM establishes a streamlined cybersecurity workflow, where security alerts are first condensed into structured summaries and then mapped to corresponding response actions.

3.5. Adaptive Learning Strategy

Maritime vessels typically lack dedicated cybersecurity experts onboard, requiring crew members to execute response actions based on the recommendations generated by the ATIRS. To enhance performance and adaptability, the ATIRS integrates an adaptive learning strategy, allowing the model to refine its response recommendation capabilities based on real-world feedback.

Once the ATIRS generates a response recommendation, onboard personnel review and execute the suggested actions. Their final decisions serve as implicit feedback, which, along with metadata such as alert severity and response modifications, is securely stored in the onboard system. Due to intermittent network connectivity, security experts periodically access these feedback data via satellite communication and synchronize them with the central security management system, where they are aggregated and analyzed to improve future recommendations.

To efficiently adapt to evolving threats, the ATIRS employs a periodic fine-tuning approach based on accumulated feedback, ensuring continuous improvement without requiring full model retraining. Instead of updating the entire model, the ATIRS utilizes QLoRA to modify only the low-rank adaptation layers while keeping the quantized pre-trained model weights frozen. This enables the ATIRS to rapidly integrate response strategies based on validated user feedback, improving the accuracy of response recommendations for both recurring and emerging threats, while minimizing computational overhead.

4. Experiment

4.1. Datasets

In this study, we deployed the Suricata IDS on an actual ship’s internal network to monitor security events in a real-world maritime environment. High-severity (Severity 1 and 2) alert log data collected through this system were used to train and evaluate the proposed ATIRS-SM and ATIRS-RM models. These logs were sourced from a real-world maritime cybersecurity environment, capturing security events that occurred while the ship was actively operating. The dataset used in our experiments consists of network alert logs generated during the following period:

Start: Sun Nov 10 2024 00:00:00 (UTC)
End: Fri Jan 03 2025 23:59:59 (UTC)

The collected dataset contains security logs reflecting various cyber threats encountered in maritime operations, including unauthorized access attempts, network scanning activities, potential malware infections, abnormal traffic patterns, and other intrusion attempts. Suricata logs generally classify severity into three levels. Severity 1 indicates the highest level of threat, encompassing critical security events that require immediate response and action. Severity 2 represents medium-level threats that require additional monitoring and analysis. In contrast, Severity 3 indicates low-level threats, including minor or potential risk factors. This study focuses exclusively on Severity 1 and 2 alerts to effectively summarize significant security events and recommend appropriate response strategies.

To ensure high-quality data for model training and evaluation, redundant and low-relevance logs were filtered during preprocessing. Identical logs with the same timestamp, source/destination IP, and threat signatures were removed to prevent duplication. Additionally, security experts reviewed and validated false positives, eliminating incorrect alerts. Based on these findings, Suricata IDS rules were updated to minimize false detections in future monitoring. By reducing false detections and filtering out duplicate and low-relevance logs, these preprocessing steps refined the dataset to focus on high-severity security events, ensuring that the model is trained to provide effective response recommendations for critical threats.
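The severity filtering and duplicate removal described above could be sketched as follows. The sketch assumes alerts are available as Suricata EVE JSON lines with the usual `timestamp`, `src_ip`, `dest_ip`, and `alert.signature` / `alert.severity` fields. The function name `preprocess`, the sample events, and the signatures are hypothetical, and the expert review of false positives is not modeled.

```python
import json

def preprocess(eve_lines, keep_severities=(1, 2)):
    """Keep high-severity alerts and drop exact duplicates."""
    seen = set()
    kept = []
    for line in eve_lines:
        event = json.loads(line)
        if event.get("event_type") != "alert":
            continue  # ignore non-alert records such as flow events
        alert = event.get("alert", {})
        if alert.get("severity") not in keep_severities:
            continue  # Severity 3 alerts are out of scope
        # Duplicate key: same timestamp, source/destination IP, and signature.
        key = (event.get("timestamp"), event.get("src_ip"),
               event.get("dest_ip"), alert.get("signature"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(event)
    return kept

def make_alert(ts, src, dst, signature, severity):
    """Build a hypothetical EVE-style alert record for the example."""
    return {"timestamp": ts, "event_type": "alert", "src_ip": src, "dest_ip": dst,
            "alert": {"signature": signature, "severity": severity}}

# Hypothetical events using documentation-only IP ranges.
sample = [
    make_alert("2024-11-10T00:00:01+0000", "192.0.2.10", "198.51.100.5", "EXAMPLE scan", 2),
    make_alert("2024-11-10T00:00:01+0000", "192.0.2.10", "198.51.100.5", "EXAMPLE scan", 2),
    make_alert("2024-11-10T00:00:02+0000", "192.0.2.11", "198.51.100.5", "EXAMPLE info", 3),
    {"timestamp": "2024-11-10T00:00:03+0000", "event_type": "flow"},
    make_alert("2024-11-10T00:00:04+0000", "192.0.2.12", "198.51.100.6", "EXAMPLE exploit", 1),
]
eve_lines = [json.dumps(event) for event in sample]
kept = preprocess(eve_lines)
```

In this toy input, the duplicate, the Severity 3 alert, and the flow record are dropped, leaving two alerts.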

Distribution and Types of Alerts that Occurred in the Actual Ship Network

The types, descriptions, and distributions of the collected network alert log data are summarized in Table 1. The dataset encompasses a variety of alert categories, including information leakage attempts, privilege escalation attempts, network scanning activities, and other cyber threats encountered in the ship’s network.

However, the initial analysis revealed several limitations in the existing classification system. Some alert categories were excessively detailed, adding unnecessary complexity, while others had very few occurrences, making statistical analysis unreliable and potentially leading to biased model training. Sparse alert categories also posed challenges in learning effective response strategies, as there were insufficient instances for the model to generalize appropriate recommendations.

Additionally, the presence of semantically overlapping categories made it difficult to clearly distinguish different types of security events, potentially causing inconsistencies in response mapping. To address these issues, we systematically reconstructed the alert categories based on similarity, frequency, and analytical feasibility.

Reconstruction of Alert Categories

Some alert types with fundamentally similar meanings were grouped. For instance, both Attempted Information Leak and Information Leak pertain to data exposure events. To simplify classification and improve analytical efficiency, they were merged into a single category, “Information Leak and Attempted Information Leak”.

Similarly, Attempted User Privilege Gain, Successful User Privilege Gain, Attempted Administrator Privilege Gain, and Successful Administrator Privilege Gain all indicate privilege escalation attempts or successes. These were consolidated into the category “User and Administrator Privilege Escalation Attempts and Successes”.

Certain alert types with very low occurrences were also reorganized for better analysis. For example, Successful Administrator Privilege Gain was recorded only twice. However, it is closely related to Attempted Administrator Privilege Gain (27 occurrences) and Successful User Privilege Gain (2 occurrences). Therefore, rather than being classified separately, these alerts were grouped under the category “User and Administrator Privilege Escalation Attempts and Successes”.

Likewise, Device Retrieving External IP Address Detected (7 occurrences) and Attempted Denial of Service (6 occurrences) were integrated into a broader category, “Miscellaneous Anomalous Traffic and Malicious Code Detection”, to improve analytical efficiency. The reconstructed alert categories are summarized in Table 2.
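The merging rules described above amount to a lookup from original to reconstructed category names. The following sketch encodes only the mappings and occurrence counts quoted in the text, so the merged totals are partial and are not the figures of Table 2. The names `CATEGORY_MAP`, `reconstruct_category`, and `merge_counts` are illustrative.

```python
from collections import Counter

INFO_LEAK = "Information Leak and Attempted Information Leak"
PRIV_ESC = "User and Administrator Privilege Escalation Attempts and Successes"
MISC = "Miscellaneous Anomalous Traffic and Malicious Code Detection"

# Only the merges named in the text; other categories pass through unchanged.
CATEGORY_MAP = {
    "Attempted Information Leak": INFO_LEAK,
    "Information Leak": INFO_LEAK,
    "Attempted User Privilege Gain": PRIV_ESC,
    "Successful User Privilege Gain": PRIV_ESC,
    "Attempted Administrator Privilege Gain": PRIV_ESC,
    "Successful Administrator Privilege Gain": PRIV_ESC,
    "Device Retrieving External IP Address Detected": MISC,
    "Attempted Denial of Service": MISC,
}

def reconstruct_category(original):
    """Map an original alert category to its reconstructed category."""
    return CATEGORY_MAP.get(original, original)

def merge_counts(original_counts):
    """Aggregate per-category counts under the reconstructed categories."""
    merged = Counter()
    for category, count in original_counts.items():
        merged[reconstruct_category(category)] += count
    return merged

# Occurrence counts quoted in the text (a partial view of the dataset).
partial_counts = {
    "Successful Administrator Privilege Gain": 2,
    "Attempted Administrator Privilege Gain": 27,
    "Successful User Privilege Gain": 2,
    "Device Retrieving External IP Address Detected": 7,
    "Attempted Denial of Service": 6,
}
merged = merge_counts(partial_counts)
```

Even with these partial counts, the sparse two-occurrence categories end up in a group with at least 31 instances, which illustrates why merging eases the sparsity problem.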

The reconstructed alert logs were manually reviewed by security experts, who applied clear criteria to eliminate false positives. Additionally, they conducted a detailed analysis of alert categories and signatures to structure the dataset as follows:

(I) Network Alert Log Summary Dataset (NALSD): This dataset consists of raw network alert logs and summaries that have been manually written and validated by security experts. It is used to train ATIRS-SM, enabling the model to generate concise and informative summaries from network alert logs. Security experts meticulously review each log entry to ensure that the summaries accurately capture key aspects of security incidents while maintaining clarity and relevance.
(II) Network Alert Summarization-based Response Recommendation Dataset (NASR-RD): This dataset consists of summaries generated by ATIRS-SM from raw network alert logs, along with their corresponding response recommendations. Security experts rigorously review and validate both the summaries and the recommendations to ensure accuracy and effectiveness. Representative examples of response actions included in the dataset are blocking suspicious IP addresses, isolating compromised devices, disabling or locking user accounts, and revoking access privileges. These expert-validated actions provide a practical foundation for training ATIRS-RM, enabling it to learn how to generate contextually appropriate and operationally feasible response strategies based on summarized security logs.
(III) Network Alert Log Response Recommendation Dataset (NALRD): This dataset is used to evaluate the effectiveness of the ATIRS by comparing it against baseline models. Unlike dataset (II), which trains ATIRS-RM using summaries generated by ATIRS-SM, this dataset consists of raw network alert logs that have been manually reviewed by security experts and paired with recommended responses.

To establish a baseline for comparison, a baseline model is trained directly on this dataset using a “raw network alert log → recommended response” approach, bypassing the summarization step. In contrast, the ATIRS follows a “raw network alert log → summary → recommended response” flow. Comparing the performance of both models quantifies the impact of the summarization process on response recommendation accuracy. This evaluation ensures that the ATIRS provides a tangible improvement in security operations by generating more effective and context-aware responses.

Additionally, this study aims to develop a summarization and response recommendation system based on high-severity alert logs, offering a robust solution for effective real-time security threat response. Accordingly, the dataset is partitioned into 70% for training, 15% for testing, and 15% for validation to ensure a well-balanced evaluation of the proposed approach.
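A 70/15/15 partition of this kind can be sketched as below. The text does not say whether the split was random or stratified by alert category, so the seeded random shuffle, the function name `split_dataset`, and the integer stand-ins for records are assumptions made for illustration.

```python
import random

def split_dataset(records, train_pct=70, test_pct=15, seed=42):
    """Shuffle records and split them into train / test / validation parts."""
    items = list(records)
    random.Random(seed).shuffle(items)       # seeded for reproducibility
    n_train = len(items) * train_pct // 100  # integer arithmetic avoids float rounding
    n_test = len(items) * test_pct // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]    # remainder, about 15%
    return train, test, validation

# Integers stand in for (log, summary) or (summary, response) pairs.
train, test, validation = split_dataset(range(1000))
```

Assigning the remainder to the validation part ensures that every record lands in exactly one partition, even when the dataset size is not divisible by 100.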

4.2. Settings

This section describes the training configurations of the ATIRS-SM and ATIRS-RM models. The experiments were conducted using Phi-3.5, Qwen2.5, and Gemma-2 as backbone models, with all models sharing the same hyperparameters and training settings.

Additionally, QLoRA-based 4-bit and 8-bit quantization were applied for model optimization, allowing efficient fine-tuning even in resource-constrained environments. The key hyperparameters used in the ATIRS experiments are summarized in Table 3.

A batch size of 4 was selected to balance memory efficiency and training stability, as larger batch sizes were impractical on the NVIDIA A30 HBM2 (24 GB) GPU due to the additional memory overhead introduced by quantization and low-rank adaptation. The learning rate of $2 \times 10^{-4}$ was empirically chosen based on its stability across different backbone models, ensuring effective gradient updates while avoiding catastrophic forgetting. Since QLoRA introduces trainable low-rank weight matrices on top of the quantized pre-trained model, updates are efficiently applied while minimizing memory overhead. Unlike full fine-tuning, QLoRA allows gradient computation through the quantized model while primarily adapting the newly introduced low-rank matrices.

AdamW was used as the optimizer due to its robust weight decay properties, which help maintain generalization when fine-tuning pre-trained language models with quantization. A warmup ratio of 0.1 was applied to gradually adjust the learning rate, preventing sudden gradient spikes that could arise from quantized training. The training process was controlled by setting max_steps to 1000, which fixes the number of optimization updates used for task-specific adaptation. Although training_epochs was set to 1, max_steps took precedence, meaning that training could conclude before completing a full pass through the dataset if the step limit was reached. This step-based approach is particularly useful in QLoRA fine-tuning, where a full dataset pass is often unnecessary for effective adaptation, and it keeps training controlled and computationally efficient while preventing unnecessary overtraining.

For LoRA-specific configurations, a rank (r) of 8 and a LoRA alpha of 16 were chosen, as they provide a good trade-off between parameter efficiency and model expressiveness, allowing effective adaptation without excessive computational overhead. A LoRA dropout rate of 0.1 was applied to enhance generalization and prevent overfitting, ensuring stable adaptation across both high- and low-resource settings. Additionally, all linear layers were selected as target modules to ensure comprehensive adaptation of critical model components, while the bias parameter was set to None to maintain training efficiency. Furthermore, the training and evaluation of the ATIRS models were conducted in a PyTorch 2.6.0 and Python 3.9.21 environment. All experiments were performed on a server equipped with two Intel Xeon Gold 5317 processors, an NVIDIA A30 HBM2 (24 GB) GPU, and 128 GB of RAM, ensuring efficient execution of QLoRA-based fine-tuning while accommodating large-scale alert log processing.
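To make the interaction between these settings concrete, the sketch below collects the stated hyperparameters and derives a few quantities from them. The class `QLoraSettings` and the function `learning_rate_at` are hypothetical names rather than any library's API. The constant learning rate after warmup and the absence of gradient accumulation are assumptions, since the text states neither.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QLoraSettings:
    """Hyperparameters as stated in the text (hypothetical container)."""
    batch_size: int = 4
    learning_rate: float = 2e-4
    warmup_ratio: float = 0.1
    max_steps: int = 1000
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1

    @property
    def lora_scaling(self):
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_rank

    @property
    def warmup_steps(self):
        return round(self.max_steps * self.warmup_ratio)

    @property
    def examples_seen(self):
        # Assumes one batch per optimizer step (no gradient accumulation).
        return self.batch_size * self.max_steps

def learning_rate_at(step, cfg):
    """Linear warmup to the peak rate, then constant (post-warmup schedule assumed)."""
    if step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate

cfg = QLoraSettings()
```

Under these assumptions, the low-rank update is scaled by 2, the first 100 steps are warmup, and training processes 4000 examples in total, so whether a full pass over the training split completes depends on its size.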

4.3. Evaluation Metrics

To evaluate the performance of the trained model, we used Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE)-N, ROUGE-L, ROUGE-Lsum, and Metric for Evaluation of Translation with Explicit ORdering (METEOR) [52,53,54,55]. The equations for performance evaluation are as follows.

The BLEU formula is shown as follows:

$\mathrm{BLEU} = \mathrm{BP} \times \exp\left(\sum_{n=1}^{N} w_{n} \log p_{n}\right),$

(1)

The brevity penalty (BP) is defined as follows:

$\mathrm{BP} = \begin{cases} 1, & \text{if } c > r, \\ \exp\left(1 - \frac{r}{c}\right), & \text{if } c \leq r. \end{cases}$

(2)

Explanation:

BLEU is a widely used metric in machine translation and text generation that evaluates the similarity between generated text and reference text. It calculates the precision of overlapping word sequences, known as n-grams, denoted $p_{n}$ for n-grams of order n, and combines them using a weighted geometric mean. Here, N represents the highest n-gram order considered, typically set to 4 to account for unigrams, bigrams, trigrams, and 4-grams. The weights $w_{n}$ determine the relative contribution of each n-gram precision score and are usually distributed equally as $w_{n} = \frac{1}{N}$, ensuring that all n-grams influence the final BLEU score in a balanced manner.

To prevent artificially high scores from overly short outputs, BLEU applies a brevity penalty (BP). If the generated text (c) is at least as long as the reference text (r), no penalty is applied, and BP remains 1. However, if the generated text is shorter than the reference, BP applies an exponential penalty that grows with the length discrepancy, reducing the BLEU score.
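A minimal, unsmoothed sentence-level implementation of the BLEU definition and its brevity penalty is sketched below. It assumes whitespace tokenization, a single reference, N = 4, and uniform weights. Evaluation toolkits typically add smoothing and corpus-level aggregation, so this is an illustration of the formula rather than the scoring code used in the experiments, and the example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed BLEU with uniform weights w_n = 1 / max_n and one reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precisions.append(math.log(overlap / total))
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "block source ip and lock the user account"
truncated = "block source ip and lock"
score_same = bleu(reference, reference)
score_short = bleu(truncated, reference)
```

Every n-gram of the truncated hypothesis appears in the reference, so all precisions equal 1 and its score is determined entirely by the brevity penalty exp(1 - 8/5).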

Advantages:

(I) Widely used in text generation and translation: BLEU is a standard benchmark metric in NLP research. (II) Computationally efficient: It provides a quick and reliable measure without requiring human annotations. (III) Language-independent: It can be applied to multiple languages without modification. (IV) Handles multiple reference texts: BLEU allows comparison against multiple reference translations, improving evaluation robustness. (V) Prevents length bias: The brevity penalty ensures that shorter outputs are not unfairly advantaged.

The ROUGE-1 and ROUGE-2 (ROUGE-N) formulas are as follows:

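$\mathrm{ROUGE\text{-}N} = \frac{\sum_{\mathrm{ngram}_{n} \in \mathrm{Ref}} \min\left(\mathrm{Count}_{hyp}(\mathrm{ngram}_{n}), \mathrm{Count}_{ref}(\mathrm{ngram}_{n})\right)}{\sum_{\mathrm{ngram}_{n} \in \mathrm{Ref}} \mathrm{Count}_{ref}(\mathrm{ngram}_{n})}.$

(3)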

Explanation:

ROUGE is a widely used metric in text summarization and evaluates how much of the reference summary is covered by the generated summary. ROUGE-N measures the recall-based overlap of n-grams between the generated and reference texts. ROUGE-1 uses unigrams, while ROUGE-2 uses bigrams.
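The recall-oriented ROUGE-N definition can be sketched in a few lines. The sketch assumes whitespace tokenization and a single reference, and the example sentences are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(hypothesis, reference, n=1):
    """Recall-based ROUGE-N: clipped n-gram matches divided by reference n-grams."""
    hyp_counts = ngram_counts(hypothesis.split(), n)
    ref_counts = ngram_counts(reference.split(), n)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(hyp_counts[gram], count) for gram, count in ref_counts.items())
    return matched / total

reference = "block the source ip address"
hypothesis = "block the ip address now"
rouge_1 = rouge_n(hypothesis, reference, n=1)  # 4 of 5 reference unigrams covered
rouge_2 = rouge_n(hypothesis, reference, n=2)  # 2 of 4 reference bigrams covered
```

Dropping the single word "source" costs one unigram but two bigrams, which shows why ROUGE-2 is the stricter of the two variants.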

Advantages:

(I) Effectively assesses how much key information from the reference summary is retained. (II) Simple and computationally efficient for large-scale evaluations. (III) Adaptable for both recall-based and precision-based evaluations (e.g., ROUGE-P). (IV) Widely adopted in summarization research, facilitating comparison with existing models.

The ROUGE-L formula is shown as follows:

$R_{LCS} = \frac{\mathrm{LCS}(ref, hyp)}{\mathrm{length}(ref)},$

(4)

$P_{LCS} = \frac{\mathrm{LCS}(ref, hyp)}{\mathrm{length}(hyp)},$

(5)

$\mathrm{ROUGE\text{-}L} = \frac{(1 + \beta^{2}) R_{LCS} P_{LCS}}{R_{LCS} + \beta^{2} P_{LCS}}.$

(6)

Explanation:

ROUGE-L measures the similarity between the generated text and the reference text using the Longest Common Subsequence (LCS). Unlike n-gram-based metrics, LCS-based evaluation considers sequence order while allowing gaps in matches. $R_{LCS}$ (recall) represents the proportion of the reference text covered by the LCS, while $P_{LCS}$ (precision) measures the proportion of the generated text that matches the LCS. The final ROUGE-L score is computed as an F-measure that balances recall and precision, where $\beta$ determines the relative importance of recall and precision. A typical choice is $\beta = 1$, giving equal weight to both recall and precision.
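The LCS-based recall, precision, and F-measure can be computed with a standard dynamic program, as in the sketch below. It assumes whitespace tokenization and a single reference, and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token on either side
        prev = curr
    return prev[-1]

def rouge_l(hypothesis, reference, beta=1.0):
    """ROUGE-L F-measure built from LCS recall and precision."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)
    precision = lcs / len(hyp)
    return (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)

reference = "block the source ip address"
hypothesis = "block the ip address now"
score = rouge_l(hypothesis, reference)  # LCS is "block the ip address" (4 tokens)
```

The LCS skips over the missing word "source", which is how ROUGE-L rewards correct ordering while tolerating gaps.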

Advantages:

(I) Captures sentence structure by considering the longest common subsequence rather than isolated n-grams. (II) Accounts for sequence order, favoring outputs that preserve logical flow. (III) More flexible than strict n-gram overlap, allowing partial matches within longer text sequences.

The ROUGE-Lsum formula is shown as follows:

$\mathrm{ROUGE\text{-}Lsum} = \frac{(1 + \beta^{2}) R_{LCS} P_{LCS}}{R_{LCS} + \beta^{2} P_{LCS}},$

(7)

where

$R_{LCS} = \frac{\mathrm{LCS}(\mathrm{tokens}(ref), \mathrm{tokens}(hyp))}{\mathrm{length}(\mathrm{tokens}(ref))}, \quad P_{LCS} = \frac{\mathrm{LCS}(\mathrm{tokens}(ref), \mathrm{tokens}(hyp))}{\mathrm{length}(\mathrm{tokens}(hyp))}.$

Explanation:

ROUGE-Lsum extends ROUGE-L to multi-sentence or document-level summaries by applying the LCS-based recall and precision over the entire text rather than at the single-sentence level. Specifically, all sentences from the reference and generated summaries are treated as continuous token sequences ($\mathrm{tokens}(ref)$, $\mathrm{tokens}(hyp)$), and the Longest Common Subsequence is computed on these full sequences. Then, $R_{LCS}$ and $P_{LCS}$ are calculated as above, leading to a single F-measure (with $\beta$ balancing recall vs. precision).

Advantages:

(I) Offers a holistic, document-level evaluation of summarization quality. (II) Preserves sequence order, rewarding coherent text structures. (III) More robust for longer or multi-paragraph summaries than single-sentence ROUGE-L.

The METEOR formula is shown as follows:

$\mathrm{METEOR} = (1 - \mathrm{penalty}) \times F_{mean},$

(8)

$F_{mean} = \frac{P \times R}{\alpha P + (1 - \alpha) R},$

(9)

$\mathrm{penalty} = \gamma \left(\frac{C}{M}\right)^{\theta}.$

(10)

Explanation:

METEOR improves upon BLEU by considering recall (R) and precision (P) at the word level, while applying a penalty if matching words are fragmented into separate chunks. Here, C is the number of chunks of contiguous matched words and M is the total number of matched words. The more disjointed the matches (the larger C is relative to M), the higher the penalty, reducing the overall score. Unlike BLEU, METEOR enhances evaluation by incorporating synonym matching, stemming, and paraphrasing, enabling a more flexible and linguistically informed assessment that better aligns with human judgments. The parameters $\alpha$, $\gamma$, and $\theta$ are empirically optimized to enhance performance.
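The interplay of F_mean and the fragmentation penalty can be illustrated with a simplified METEOR that uses exact word matches only. The sketch omits stemming, synonym, and paraphrase matching, and it aligns words greedily from left to right rather than searching for the alignment with the fewest chunks. The defaults alpha = 0.9, gamma = 0.5, and theta = 3 follow the original METEOR formulation and are assumptions here, since the parameter values used in the experiments are not stated.

```python
def meteor_exact(hypothesis, reference, alpha=0.9, gamma=0.5, theta=3.0):
    """Simplified METEOR: exact matches only, greedy left-to-right alignment."""
    hyp, ref = hypothesis.split(), reference.split()
    used = [False] * len(ref)
    alignment = []  # (hypothesis index, reference index) pairs
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if not used[j] and token == ref_token:
                used[j] = True
                alignment.append((i, j))
                break
    m = len(alignment)  # M: number of matched words
    if m == 0:
        return 0.0
    precision, recall = m / len(hyp), m / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # C: a new chunk starts whenever a match is not adjacent in both texts.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):
            chunks += 1
    penalty = gamma * (chunks / m) ** theta
    return (1 - penalty) * f_mean

reference = "block the source ip address"
hypothesis = "block the ip address now"
score = meteor_exact(hypothesis, reference)  # M = 4 matches in C = 2 chunks
```

In this example, P = R = 0.8 gives F_mean = 0.8, and the two chunks yield a penalty of 0.5 × (2/4)^3 = 0.0625, for a score of 0.75.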

Advantages:

(I) Demonstrates a strong correlation with human evaluations due to its linguistic considerations. (II) More flexible than BLEU, as it accounts for synonyms, stemming, and paraphrasing, providing a more comprehensive evaluation. (III) Includes recall, which is crucial for summarization and translation evaluation. (IV) Applies a penalty for fragmented matches, favoring outputs with better word sequence alignment, thus improving overall coherence.

In this study, we leverage the strengths and unique characteristics of multiple evaluation metrics, by adopting BLEU, ROUGE (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum), and METEOR to comprehensively assess our models. Each metric provides a distinct perspective: BLEU focuses on n-gram precision, ROUGE measures recall-based coverage and sequence overlap, while METEOR captures broader lexical and semantic variations by incorporating synonym matching and stemming.

By analyzing results from multiple metrics, we aim to gain a holistic understanding of our models’ performance and mitigate potential biases that may arise from relying on a single metric. Ultimately, a model that consistently achieves high scores across these evaluation measures will be regarded as robust and well-balanced.

5. Results

In this study, we evaluated the performance of the proposed ATIRS-SM and ATIRS-RM using BLEU, ROUGE-N, ROUGE-L, ROUGE-Lsum, and METEOR metrics. The backbone models used were Phi-3.5, Qwen2.5, and Gemma-2, and each model was evaluated under 4-bit and 8-bit QLoRA settings. The following sections present detailed experimental results and analyses for each model.

5.1. Performance Evaluation of ATIRS-SM

Figure 7 and Table 4 present the overall performance metrics and comparative results of ATIRS-SM in both visual and numerical formats.

According to Table 4, Phi-3.5 (4-bit) achieved the highest performance on the primary ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum), indicating that the summarized text effectively preserves the structure and key information of the original text. In particular, the superiority of Phi-3.5 is prominent in environments that require high accuracy, such as the summarization of security alert logs. Furthermore, the minimal performance degradation observed in the 4-bit QLoRA setting compared to the 8-bit setting suggests that high-quality summarization is achievable even in environments with limited computational and memory resources.

Meanwhile, Qwen2.5 (8-bit) recorded the highest BLEU score (0.9761), and Gemma-2 (4-bit) recorded the highest METEOR score (0.9797), each by a slight margin. Model rankings therefore vary somewhat with the evaluation criterion, but Phi-3.5 (4-bit) showed the most balanced results in overall ROUGE performance. These results confirm that it maintains competitive fluency and semantic consistency and provides stable performance in applications such as security log summarization that require real-time processing.

5.2. Performance Evaluation of ATIRS-RM

Figure 8 and Table 5 summarize the performance metrics and relative improvements of ATIRS-RM in both visual and numerical forms.

The 4-bit quantized backbone models, which serve as the Base models, were fine-tuned using data in the format “network alert logs → response recommendation” derived from network alert logs, while the proposed ATIRS-RM was fine-tuned using data in the format “summary → response recommendation” derived from summaries generated by ATIRS-SM. This allowed for a comparative analysis of the performance differences between using the original log data and the summarized data.

The results presented in Table 5 show that the overall performance of ATIRS-RM (4-bit/8-bit) significantly improved compared to the Base model. Moreover, the performance differences between the 4-bit and 8-bit versions were minimal for each backbone model, with some metrics even showing slightly higher scores for the 4-bit version. This indicates that when applying the QLoRA technique, 4-bit quantization can substantially reduce memory usage and computational cost while minimizing accuracy loss.

Phi-3.5-based ATIRS-RM:

Compared to the Base model, ATIRS-RM (4-bit) improved ROUGE-1 from 0.4361 to 0.5831 (+33.71%) and ROUGE-2 from 0.2245 to 0.2974 (+36.17%). This indicates that the model produced response recommendations that closely align with the reference responses in terms of structure and key information. Additionally, between the 4-bit and 8-bit versions, the 4-bit version achieved higher scores in certain metrics such as ROUGE-1, ROUGE-L, and ROUGE-Lsum, while ROUGE-2 and METEOR were either slightly higher for the 8-bit version or nearly identical. Consequently, the model achieved up to a 36% improvement over the Base model, while simultaneously delivering memory and computational efficiency in a 4-bit environment.

Qwen2.5-based ATIRS-RM:

Qwen2.5 (4-bit) achieved slightly higher scores on most metrics, including BLEU (0.2609), ROUGE-1 (0.5836), ROUGE-2 (0.3158), and METEOR (0.5471), compared to Qwen2.5 (8-bit). It recorded up to a 26.17% improvement (in ROUGE-2) over the Base model, confirming that it generates much more accurate and comprehensive response recommendations.

Gemma-2-based ATIRS-RM:

Gemma-2 (4-bit) also exhibited performance improvements of up to 28.97% (in ROUGE-Lsum) over the Base model. The performance between the 4-bit and 8-bit versions was very similar, with some metrics favoring the 4-bit version and others the 8-bit version. For example, ROUGE-1 was slightly higher for the 8-bit version (0.5843), whereas BLEU was marginally higher for the 4-bit version (0.2593).
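The percentage gains quoted above are relative improvements over the Base model, that is, (tuned − base) / base × 100. A minimal sketch, using the Phi-3.5 ROUGE-1 scores reported in the text, is shown below. The function name `relative_improvement` is illustrative.

```python
def relative_improvement(base, tuned):
    """Relative gain of the tuned score over the base score, in percent."""
    return (tuned - base) / base * 100.0

# Phi-3.5 ROUGE-1: Base model 0.4361, ATIRS-RM (4-bit) 0.5831.
phi35_rouge1_gain = relative_improvement(0.4361, 0.5831)
```

Rounded to two decimals, this gives the +33.71% reported for Phi-3.5 ROUGE-1.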

Comprehensive Performance Analysis

All three models, when fine-tuned using the proposed ATIRS-RM with “summary → response recommendation” data, showed significant improvements across BLEU, ROUGE, and METEOR metrics. This indicates that utilizing ATIRS-SM based data enables the learning of much more refined response recommendations compared to the Base model that relies solely on the original network alert logs.

To support these results, a statistical analysis was conducted based on the findings presented in Table 5. Specifically, each Base model was compared with its corresponding ATIRS-RM model, fine-tuned using the QLoRA technique. A paired t-test was performed to determine whether the improvements in BLEU, ROUGE, and METEOR scores achieved by the ATIRS-RM models were statistically significant.

Table 6 presents the results of this analysis. The findings indicate that the Phi-3.5-based and Qwen2.5-based ATIRS-RM models, in both 4-bit and 8-bit configurations, demonstrated statistically significant improvements over their respective Base models ($p < 0.05$). This confirms that the observed performance gains are unlikely to be due to random variations, reinforcing that QLoRA-based fine-tuning effectively enhances response recommendation capabilities.

In contrast, the Gemma-2-based ATIRS-RM models exhibited numerical improvements over their respective Base models; however, these differences were not statistically significant ($p > 0.05$). This suggests that, unlike Phi-3.5 and Qwen2.5, the adaptation of Gemma-2 using QLoRA did not yield sufficiently consistent performance gains across multiple runs. The lack of statistical significance could be attributed to the model’s inherent architectural differences, the impact of low-rank adaptation on its internal representations, or limitations in dataset alignment with its pre-trained knowledge.
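The mechanics of a paired t-test can be sketched as follows. The scores below are placeholders chosen for illustration and are not taken from Table 5, and what constitutes a pair (here, one score per metric for a Base model and its fine-tuned counterpart) is an assumption. With six pairs there are five degrees of freedom, for which the two-sided 5% critical value of Student's t is 2.571.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(base_scores, tuned_scores):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [t - b for b, t in zip(base_scores, tuned_scores)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev is the sample standard deviation
    return mean(diffs) / standard_error, n - 1

# Placeholder scores for six metrics (NOT the values of Table 5).
base = [0.20, 0.44, 0.22, 0.40, 0.41, 0.45]
tuned = [0.26, 0.58, 0.30, 0.52, 0.53, 0.55]

t_stat, dof = paired_t_statistic(base, tuned)
T_CRITICAL_DF5 = 2.571  # two-sided 5% critical value for 5 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_DF5
```

A large t statistic arises when the per-pair gains are consistent relative to their spread. Gains that vary widely across pairs can fail the test even when their mean is positive, which matches the pattern described for Gemma-2.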

Although ATIRS tokenization introduces a slight increase in tokenization time compared to the Base model (1.963 ms vs. 1.187 ms per text), it significantly reduces the token sequence length from 1040 to 604. This reduction is particularly beneficial for Transformer-based models, where inference complexity scales with sequence length. As a result, while tokenization takes marginally longer, overall computational efficiency in downstream processing, including model inference, is improved. This makes ATIRS-RM not only more effective in response recommendation generation but also more scalable for real-time cybersecurity applications.
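A back-of-the-envelope calculation shows the size of this trade-off. It uses the token counts and tokenization times quoted above, together with the general fact that self-attention cost grows quadratically with sequence length while most other transformer components grow roughly linearly. The variable names are illustrative, and actual savings depend on the model and hardware.

```python
base_len, atirs_len = 1040, 604          # average token sequence lengths
base_tok_ms, atirs_tok_ms = 1.187, 1.963 # tokenization time per text

length_ratio = atirs_len / base_len            # linear-cost components
attention_ratio = length_ratio ** 2            # quadratic self-attention component
tokenization_overhead_ms = atirs_tok_ms - base_tok_ms
```

The summarized input is about 58% of the original length, so the quadratic attention term drops to roughly a third of its original cost, in exchange for less than one millisecond of extra tokenization time per text.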

In summary, ATIRS-RM exhibited substantial performance enhancements over the Base model, and applying 4-bit QLoRA can reduce memory usage while maintaining or even improving the accuracy of response recommendations. In particular, the Phi-3.5-based ATIRS-RM demonstrated consistently superior performance across all metrics. To further analyze the efficiency of different models under QLoRA-based fine-tuning, inference speed and memory usage were measured across multiple backbone models. The results are presented in Table 7.

These results indicate that while the Qwen2.5-based ATIRS-RM 4-bit achieves faster inference speed (2737.28 ms), it requires significantly more GPU memory (8.5681 GB) compared to the Phi-3.5-based ATIRS-RM 4-bit (3896.89 ms, 3.9668 GB). This makes the Phi-3.5-based ATIRS-RM 4-bit the most memory-efficient choice for real-time applications in constrained environments, where reducing memory consumption is crucial.

Additionally, while the Qwen2.5-based ATIRS-RM 8-bit achieves significantly faster inference time (1903.26 ms) compared to the Phi-3.5-based ATIRS-RM 8-bit (4735.03 ms), it comes at the cost of higher GPU memory usage (12.2661 GB vs. 5.5861 GB). This trade-off suggests that the Phi-3.5-based ATIRS-RM 8-bit remains a viable option for environments where memory constraints outweigh processing speed.

Furthermore, the Gemma-2-based ATIRS-RM 4-bit and 8-bit models exhibit significantly longer inference times in both settings, with the slowest performance across all tested configurations. This suggests that they may not be suitable for real-time applications in maritime and edge computing scenarios, where low-latency responses are critical.

Overall, these findings validate that the Phi-3.5-based ATIRS-RM 4-bit provides the best balance between inference efficiency and memory footprint, making it the most practical model for deployment in resource-constrained maritime cybersecurity environments.

6. Conclusions and Further Research

In this study, we proposed the ATIRS, a framework designed for real-time network alert log summarization and response recommendation in resource-constrained environments. The ATIRS integrates network alert log summarization (ATIRS-SM) and response recommendation (ATIRS-RM) to assist crew members with limited cybersecurity expertise in understanding and responding to security threats. Experimental results demonstrated that Phi-3.5 (4-bit)-based ATIRS-SM achieved the highest performance in ROUGE metrics, ensuring accurate security log summarization. Additionally, ATIRS-RM significantly improved response recommendation accuracy, outperforming the Base model by up to 36.17%.

Traditionally, security incidents occurring during voyages cannot be addressed immediately, as responses are typically handled by security personnel after docking. However, the ATIRS automates log summarization and provides real-time response recommendations, enabling crew members without prior security expertise to take immediate action. In actual testing, analyzing 10 security incidents manually required an average response time of 30 min, whereas the ATIRS reduced this response time to 5–10 min, significantly improving operational efficiency.

In conclusion, the ATIRS has been experimentally validated as an effective solution for overcoming the limitations of conventional maritime security operations. By enabling real-time threat detection and response in constrained hardware environments, the ATIRS enhances cybersecurity operations with minimal computational overhead.

In future work, we aim to improve the ATIRS by expanding the dataset with real-time network data and applying data augmentation techniques to enhance the model’s generalization capability. Additionally, we plan to refine the feedback integration process—building on the adaptive learning strategy described in Section 3.5—by introducing automated validation mechanisms to ensure higher-quality user feedback before incorporating it into fine-tuning cycles.

Furthermore, we will explore reinforcement learning-based optimization techniques to enhance the ATIRS’s ability to dynamically adapt to real-time security incidents and refine response recommendations based on long-term incident resolution outcomes. To further improve real-time adaptation, we will investigate integrating Retrieval-Augmented Generation (RAG), enabling the ATIRS to leverage up-to-date security intelligence from external threat databases while maintaining a lightweight model adaptation process. These enhancements will ensure that the ATIRS continuously refines its recommendations and remains effective in rapidly evolving maritime cybersecurity environments.

Author Contributions

Conceptualization, D.P. and B.M.; Methodology, D.P. and B.M.; Software, D.P.; Validation, D.P., B.M. and S.L.; Formal analysis, D.P. and S.L.; Investigation, D.P. and B.M.; Writing—original draft, D.P.; Writing—review and editing, D.P., B.M., S.L. and B.K.; Supervision, B.M. and B.K.; Project administration, B.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because they contain sensitive security information and are subject to confidentiality agreements related to real-world vessel network data. Requests to access the datasets should be directed to the corresponding author and will be reviewed in accordance with company policies.

Acknowledgments

The authors express their sincere gratitude to all individuals and organizations who have contributed to the advancement of research in this field. Their invaluable insights and pioneering work have laid the foundation for this study. Additionally, the authors extend their appreciation to the anonymous reviewers for their constructive feedback and insightful suggestions, which have significantly enhanced the clarity and quality of this manuscript.

Conflicts of Interest

All authors were employed by the company Hanwha Systems (South Korea). The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Figure 1. Comparison of LLM Fine-Tuning Approaches: Pretraining, Conventional Fine-Tuning, and Parameter-Efficient Fine-Tuning.

Figure 2. Parameter-efficient fine-tuning: LoRA and QLoRA compared to FFT.

Figure 3. Proposed ATIRS architecture.

Figure 4. Comparison of base tokenization and ATIRS tokenization for network alert logs.

Figure 5. Comparison of chat templates used for different backbone models (Phi-3.5, Qwen2.5, and Gemma-2).

Figure 6. LoRA-based Fine-Tuning Mechanism in Transformer Architecture.

Figure 7. Training performance of the ATIRS-SM network traffic alert log-based summarization models across different steps. Sub-figures illustrate the following: (a) BLEU, (b) ROUGE-1, (c) ROUGE-2, (d) ROUGE-L, (e) ROUGE-Lsum, and (f) METEOR. Each plot compares multiple backbone models (Phi-3.5, Qwen2.5, Gemma-2) under 4-bit and 8-bit QLoRA quantization settings.

Figure 8. Training performance of the ATIRS-RM network traffic alert log summarization-based recommendation models across different steps. Sub-figures illustrate the following: (a) BLEU, (b) ROUGE-1, (c) ROUGE-2, (d) ROUGE-L, (e) ROUGE-Lsum, and (f) METEOR. Each plot compares multiple backbone models (Phi-3.5, Qwen2.5, Gemma-2) under 4-bit and 8-bit QLoRA quantization settings.

Table 1. Distribution, severity, and description of alerts that occurred in the actual ship network.

Severity	Alert Type	Description	Total Count
1 (High)	Potential Corporate Privacy Violation	Activities that may indicate a breach of corporate confidential information.	394
	A Network Trojan was detected	Indicates infection with malicious code such as a network Trojan.	127
	Web Application Attack	Detects attack attempts targeting web applications.	1005
	Attempted Administrator Privilege Gain	Detects attempts to gain administrator privileges.	27
	Information Leak	Represents cases where information was leaked.	1
	Successful User Privilege Gain	Represents successful cases of user privilege acquisition.	2
	Attempted User Privilege Gain	Detects attempts to gain user privileges.	8
	Successful Administrator Privilege Gain	Represents successful cases of administrator privilege acquisition.	2
2 (Medium)	Attempted Information Leak	Detects attempts to leak sensitive information externally.	11,575
	Misc Attack	Includes other uncategorized types of attacks.	13,262
	Potentially Bad Traffic	Detects abnormal or potentially malicious traffic.	2390
	Attempted Denial of Service	Detects attempts at a denial-of-service attack.	6
	Detection of a Network Scan	Detects network scanning activities, suggesting a potential attack.	4
	Access to a Potentially Vulnerable Web Application	Detects attempts to access a vulnerable web application.	14
	Generic Protocol Command Decode	Detects anomalies occurring during protocol command decoding.	89
	Decode of an RPC Query	Detects anomalies occurring during RPC query decoding.	15
	Device Retrieving External IP Address Detected	Detects activities where a device queries an external IP address.	7

Table 2. Reconstructed Alert Categories and Included Types.

New Alert Category	Description	Included Alert Types	Total Count
Information Leaks and Attempted Information Leak	Includes attempts to leak sensitive information as well as actual leaks.	Attempted Information Leak, Information Leak	11,576
Potential Corporate Privacy Violation	Activities that may indicate a breach of corporate confidential information.	Potential Corporate Privacy Violation	394
Web Application Vulnerability Attack and Access Attempts	Includes attacks targeting web applications and attempts to access vulnerable web applications.	Web Application Attack, Access to a Potentially Vulnerable Web Application	1019
User and Administrator Privilege Escalation Attempts and Successes	Includes attempts and successful cases of user and administrator privilege escalation.	Attempted User Privilege Gain, Successful User Privilege Gain, Attempted Administrator Privilege Gain, Successful Administrator Privilege Gain	39
Network Scanning and Protocol Anomalies	Includes network scanning activities and protocol-related anomalies.	Detection of a Network Scan, Decode of an RPC Query, Generic Protocol Command Decode	108
Miscellaneous Anomalous Traffic and Malicious Code Detection	Includes trojan infections, anomalous traffic, and other uncategorized attacks.	A Network Trojan was detected, Potentially Bad Traffic, Misc Attack, Attempted Denial of Service, Device Retrieving External IP Address Detected	15,792
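
The regrouping from Table 1 into Table 2 can be checked mechanically. The sketch below transcribes the per-type counts from Table 1 and the category membership from Table 2, then recomputes each category total. The dictionary and function names are illustrative and are not taken from the ATIRS code.

```python
# Per-type alert counts transcribed from Table 1.
TABLE1_COUNTS = {
    "Potential Corporate Privacy Violation": 394,
    "A Network Trojan was detected": 127,
    "Web Application Attack": 1005,
    "Attempted Administrator Privilege Gain": 27,
    "Information Leak": 1,
    "Successful User Privilege Gain": 2,
    "Attempted User Privilege Gain": 8,
    "Successful Administrator Privilege Gain": 2,
    "Attempted Information Leak": 11575,
    "Misc Attack": 13262,
    "Potentially Bad Traffic": 2390,
    "Attempted Denial of Service": 6,
    "Detection of a Network Scan": 4,
    "Access to a Potentially Vulnerable Web Application": 14,
    "Generic Protocol Command Decode": 89,
    "Decode of an RPC Query": 15,
    "Device Retrieving External IP Address Detected": 7,
}

# New category -> included alert types, transcribed from Table 2.
CATEGORY_MAP = {
    "Information Leaks and Attempted Information Leak": [
        "Attempted Information Leak", "Information Leak"],
    "Potential Corporate Privacy Violation": [
        "Potential Corporate Privacy Violation"],
    "Web Application Vulnerability Attack and Access Attempts": [
        "Web Application Attack",
        "Access to a Potentially Vulnerable Web Application"],
    "User and Administrator Privilege Escalation Attempts and Successes": [
        "Attempted User Privilege Gain", "Successful User Privilege Gain",
        "Attempted Administrator Privilege Gain",
        "Successful Administrator Privilege Gain"],
    "Network Scanning and Protocol Anomalies": [
        "Detection of a Network Scan", "Decode of an RPC Query",
        "Generic Protocol Command Decode"],
    "Miscellaneous Anomalous Traffic and Malicious Code Detection": [
        "A Network Trojan was detected", "Potentially Bad Traffic",
        "Misc Attack", "Attempted Denial of Service",
        "Device Retrieving External IP Address Detected"],
}


def regroup(counts, mapping):
    """Sum the per-type counts inside each reconstructed category."""
    return {cat: sum(counts[t] for t in types) for cat, types in mapping.items()}


totals = regroup(TABLE1_COUNTS, CATEGORY_MAP)
for category, total in totals.items():
    print(f"{total:>6}  {category}")
```

The recomputed totals agree with the Total Count column of Table 2, and every alert type in Table 1 is assigned to exactly one category.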

Table 3. Hyperparameters used in the ATIRS experiments.

Trainer Parameters
Batch Size	4
Optimizer	AdamW
Training Epochs	1
Max Steps	1000
Learning Rate	$2 \times 10^{-4}$
Warmup Ratio	0.1
LoRA Parameters
Rank (r)	8
LoRA Alpha	16
LoRA Dropout	0.1
Target Modules	All Linear Layers
Bias	None
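
A few quantities follow directly from the values in Table 3. The sketch below records the table as plain Python dataclasses and derives the LoRA scaling factor (alpha divided by rank, as defined for LoRA), the number of warmup steps implied by the warmup ratio, and the number of trainable parameters LoRA adds to one linear layer. The class and function names are hypothetical, the 4096 × 4096 layer shape is an illustrative assumption rather than a dimension of any backbone used in the paper, and the sample count assumes a single device with no gradient accumulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainerParams:
    """Trainer parameters transcribed from Table 3."""
    batch_size: int = 4
    optimizer: str = "AdamW"
    epochs: int = 1
    max_steps: int = 1000
    learning_rate: float = 2e-4
    warmup_ratio: float = 0.1


@dataclass(frozen=True)
class LoraParams:
    """LoRA parameters transcribed from Table 3."""
    rank: int = 8
    alpha: int = 16
    dropout: float = 0.1
    target_modules: str = "all linear layers"
    bias: str = "none"


def lora_scaling(p: LoraParams) -> float:
    """Factor applied to the low-rank update: alpha / r."""
    return p.alpha / p.rank


def warmup_steps(t: TrainerParams) -> int:
    """Warmup steps implied by the warmup ratio and the step budget."""
    return round(t.max_steps * t.warmup_ratio)


def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer."""
    return rank * (d_in + d_out)


trainer, lora = TrainerParams(), LoraParams()
print("LoRA scaling:", lora_scaling(lora))
print("warmup steps:", warmup_steps(trainer))
# Assumption: single device, no gradient accumulation.
print("samples seen:", trainer.batch_size * trainer.max_steps)
# Hypothetical 4096 x 4096 projection, for illustration only.
print("added params:", lora_added_params(4096, 4096, lora.rank))
```

For the hypothetical 4096 × 4096 layer, rank 8 adds 65,536 trainable parameters against 16,777,216 in the frozen weight matrix, which illustrates why the adapter is cheap to train.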

Table 4. Summary model performance comparison.

Backbone	Variants	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	ROUGE-Lsum	METEOR
Phi-3.5	ATIRS-SM (4 bit)	0.9692	0.9858	0.9852	0.9861	0.9859	0.9781
Phi-3.5	ATIRS-SM (8 bit)	0.9682	0.9851	0.9841	0.9851	0.9851	0.9777
Qwen2.5	ATIRS-SM (4 bit)	0.9745	0.9843	0.9821	0.9844	0.9843	0.9792
Qwen2.5	ATIRS-SM (8 bit)	0.9761	0.9851	0.9834	0.9852	0.9852	0.9788
Gemma-2	ATIRS-SM (4 bit)	0.9645	0.9831	0.9781	0.9836	0.9836	0.9797
Gemma-2	ATIRS-SM (8 bit)	0.9634	0.9826	0.9763	0.9829	0.9828	0.9782

Note: Bold values indicate the best performance for each evaluation metric.

Table 5. Comparison of different fine-tuning methods (base vs. QLoRA/ATIRS-RM).

Backbone	Variants	BLEU	ROUGE-1	ROUGE-2	ROUGE-L	ROUGE-Lsum	METEOR
Phi-3.5	Base	0.2437	0.4361	0.2245	0.4216	0.4239	0.4223
	ATIRS-RM (4 bit)	0.2604	0.5831	0.2974	0.5556	0.5557	0.5565
	ATIRS-RM (8 bit)	0.2526	0.5817	0.3057	0.5513	0.5519	0.5571
	Best Imprv.	↑ 6.85%	↑ 33.71%	↑ 36.17%	↑ 31.78%	↑ 31.10%	↑ 31.92%
Qwen2.5	Base	0.2386	0.4845	0.2503	0.4565	0.4569	0.4517
	ATIRS-RM (4 bit)	0.2609	0.5836	0.3158	0.5501	0.5503	0.5471
	ATIRS-RM (8 bit)	0.2551	0.5809	0.3113	0.5476	0.5475	0.5467
	Best Imprv.	↑ 9.35%	↑ 20.45%	↑ 26.17%	↑ 20.50%	↑ 20.44%	↑ 21.12%
Gemma-2	Base	0.2452	0.5523	0.2905	0.5219	0.4224	0.5181
	ATIRS-RM (4 bit)	0.2593	0.5833	0.3062	0.5453	0.5448	0.5449
	ATIRS-RM (8 bit)	0.2588	0.5843	0.3057	0.5449	0.5443	0.5472
	Best Imprv.	↑ 5.75%	↑ 5.79%	↑ 5.40%	↑ 4.48%	↑ 28.97%	↑ 5.61%

“Best Imprv.” indicates the relative improvement of the better-scoring ATIRS-RM variant (4 bit or 8 bit) over the Base model. Note: Bold values indicate the best performance for each evaluation metric.
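
As a worked example of how a “Best Imprv.” entry can be computed, the sketch below takes the higher of the two fine-tuned scores and expresses its gain relative to the base score. The function name and data layout are illustrative; the numbers are the first three Phi-3.5 columns of Table 5.

```python
def relative_improvement(base: float, tuned_scores) -> float:
    """Percentage gain of the best fine-tuned score over the base score."""
    best = max(tuned_scores)
    return (best - base) / base * 100.0


# Phi-3.5 scores from Table 5: metric -> (Base, 4 bit, 8 bit).
PHI35 = {
    "BLEU": (0.2437, 0.2604, 0.2526),
    "ROUGE-1": (0.4361, 0.5831, 0.5817),
    "ROUGE-2": (0.2245, 0.2974, 0.3057),
}

improvements = {
    metric: relative_improvement(base, (q4, q8))
    for metric, (base, q4, q8) in PHI35.items()
}
for metric, gain in improvements.items():
    print(f"{metric}: +{gain:.2f}%")
```

Recomputing from the four-decimal scores printed in the table reproduces most entries exactly. A few differ in the last digit, which would be expected if the published percentages were derived from unrounded scores.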

Table 6. Paired t-test results for model performance.

Base Model	Comparison Model	t-Statistic	p-Value	Significance
Phi-3.5	ATIRS-RM (4 bit)	−5.0977	0.0038	Yes
Phi-3.5	ATIRS-RM (8 bit)	−4.9399	0.0043	Yes
Qwen2.5	ATIRS-RM (4 bit)	−6.3956	0.0014	Yes
Qwen2.5	ATIRS-RM (8 bit)	−5.8296	0.0021	Yes
Gemma-2	ATIRS-RM (4 bit)	−2.3009	0.0697	No
Gemma-2	ATIRS-RM (8 bit)	−2.3267	0.0675	No
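
Table 6 does not state which observations are paired. The sketch below assumes that each test pairs the six metric scores of the Base model with those of the fine-tuned variant in Table 5 (five degrees of freedom). Under that assumption the t-statistics for the two pairs checked here match the reported values. The function name is illustrative, and the critical value 2.571 is the standard two-sided 5% threshold of Student's t distribution with five degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t(a, b) -> float:
    """Paired t-statistic for samples a and b (differences taken as a - b)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Scores from Table 5, in the order
# BLEU, ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum, METEOR.
phi_base = [0.2437, 0.4361, 0.2245, 0.4216, 0.4239, 0.4223]
phi_4bit = [0.2604, 0.5831, 0.2974, 0.5556, 0.5557, 0.5565]
gemma_base = [0.2452, 0.5523, 0.2905, 0.5219, 0.4224, 0.5181]
gemma_4bit = [0.2593, 0.5833, 0.3062, 0.5453, 0.5448, 0.5449]

T_CRIT = 2.571  # two-sided 5% critical value, Student's t, 5 degrees of freedom

t_phi = paired_t(phi_base, phi_4bit)
t_gemma = paired_t(gemma_base, gemma_4bit)

for name, t in (("Phi-3.5 (4 bit)", t_phi), ("Gemma-2 (4 bit)", t_gemma)):
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"{name}: t = {t:.4f} ({verdict} at the 5% level)")
```

The negative sign reflects that the base scores are lower than the fine-tuned ones. A library routine such as `scipy.stats.ttest_rel` would return the corresponding p-value as well.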

Table 7. Comparison of average inference time (ms) and GPU memory usage (GB) for 100 inferences across each backbone model (Phi-3.5, Qwen2.5, Gemma-2) under 4-bit and 8-bit QLoRA quantization settings.

Backbone	Variants	Avg Inference Time (ms)	GPU Memory Usage (GB)
Phi-3.5	ATIRS-SM (4 bit)	3896.89	3.9668
Phi-3.5	ATIRS-SM (8 bit)	4735.03	5.5861
Qwen2.5	ATIRS-SM (4 bit)	2737.28	8.5681
Qwen2.5	ATIRS-SM (8 bit)	1903.26	12.2661
Gemma-2	ATIRS-SM (4 bit)	7706.03	6.9872
Gemma-2	ATIRS-SM (8 bit)	10,132.08	11.3898
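
Because the target hardware is constrained, Table 7 is naturally read as a trade-off between latency and memory. The sketch below picks the fastest variant that fits within a given GPU memory budget. The budgets are hypothetical examples, the function name is illustrative, and the selection rule is one possible reading of the table rather than a procedure described in the paper.

```python
# (backbone, variant) -> (avg inference time in ms, GPU memory in GB), from Table 7.
TABLE7 = {
    ("Phi-3.5", "4 bit"): (3896.89, 3.9668),
    ("Phi-3.5", "8 bit"): (4735.03, 5.5861),
    ("Qwen2.5", "4 bit"): (2737.28, 8.5681),
    ("Qwen2.5", "8 bit"): (1903.26, 12.2661),
    ("Gemma-2", "4 bit"): (7706.03, 6.9872),
    ("Gemma-2", "8 bit"): (10132.08, 11.3898),
}


def fastest_within(budget_gb: float):
    """Return the lowest-latency variant whose memory use fits the budget, or None."""
    fitting = {k: v for k, v in TABLE7.items() if v[1] <= budget_gb}
    if not fitting:
        return None
    return min(fitting, key=lambda k: fitting[k][0])


# Hypothetical memory budgets, for illustration only.
for budget in (2.0, 8.0, 16.0):
    print(f"{budget:>4.1f} GB -> {fastest_within(budget)}")
```

With these measurements, an 8 GB budget selects Phi-3.5 (4 bit), while a 16 GB budget admits the faster Qwen2.5 (8 bit).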

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Park, D.; Min, B.; Lim, S.; Kim, B. ATIRS: Towards Adaptive Threat Analysis with Intelligent Log Summarization and Response Recommendation. Electronics 2025, 14, 1289. https://doi.org/10.3390/electronics14071289

