Article
Peer-Review Record

Labeling Network Intrusion Detection System (NIDS) Rules with MITRE ATT&CK Techniques: Machine Learning vs. Large Language Models

Big Data Cogn. Comput. 2025, 9(2), 23; https://doi.org/10.3390/bdcc9020023
by Nir Daniel 1,2,*, Florian Klaus Kaiser 3, Shay Giladi 1, Sapir Sharabi 1, Raz Moyal 1, Shalev Shpolyansky 1, Andres Murillo 4, Aviad Elyashar 2,5 and Rami Puzis 1,2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 8 December 2024 / Revised: 20 January 2025 / Accepted: 23 January 2025 / Published: 26 January 2025
(This article belongs to the Special Issue Generative AI and Large Language Models)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper is presented in a logical manner. Both the methodology and results sections appear to be presented clearly. I found the paper interesting, as it touched on two of the most popular topics, i.e., LLMs and cyber security. Overall, the paper is good.

Author Response

Thank you for your positive feedback. We are glad that the paper resonated with you and that the focus on LLMs and cyber security was engaging. Your input is much appreciated.

Reviewer 2 Report

Comments and Suggestions for Authors

Review of the article

"Labeling NIDS Rules with MITRE ATT&CK Techniques: Machine Learning vs. Large Language Models"

 

The paper addresses the problem of increasing the efficiency of analyzing weakly structured Cyber Threat Intelligence text data using large language models, with the goal of automating the development of hypotheses about the tactics and techniques used to exploit vulnerabilities.

The first proposed approach uses large language models to directly map Network Intrusion Detection System rules to MITRE ATT&CK tactics and techniques for enterprise information systems and ICS. The second approach uses LLMs to create classical machine learning models that solve the same problem. A detailed analysis of related work on hybrid models combining classical machine learning and large language models is provided.

The main conclusions and results are novel and significant for improving systems that analyze and process Cyber Threat Intelligence data, in order to reduce the labor intensity and workload of subject matter experts.

The proposed estimates of the experimental results are justified.

The interpretation of the results of the computational experiment and the reliability of their analysis are sufficient. The provided links to the repository with the source code and data set make it possible to verify the correctness of the computational experiment.

The provided list of bibliographic references reflects the depth of the research problem.

The abstract, introduction and conclusions are presented correctly.

 

Main remarks:

• The abstract and introduction must clearly formulate the purpose of the study. The tasks and solution methods are described in sufficient detail.

• In Cyber Threat Intelligence data processing tasks, some of the data used for enrichment (analysis context) is collected in the security operations center (SOC) and is "sensitive" information for the customer organization. Transferring such data to external services (LLMs) is therefore undesirable. In this regard, results comparing the performance of cloud models with locally deployed models would be very important.

• In formula (6), the designation "F1-score" must be corrected.

 

The work may be published in its current form.

Author Response

Thank you for your comprehensive and thoughtful review. We appreciate your constructive feedback. We addressed the suggested corrections as follows:

Comment 1: The abstract and introduction must clearly formulate the purpose of the study. The tasks and solution methods are described in sufficient detail.
Response 1:
We added the following to the abstract: 
By utilizing automation, the presented methods will enhance the efficiency of SOC alert analysis and decrease the workload on analysts.

Comment 2: In Cyber Threat Intelligence data processing tasks, some of the data for enrichment (analysis context) is collected in the information security monitoring center (SOC) and is “sensitive” information for the customer organization. Therefore, transferring such data to external services (LLM) is undesirable. In this regard, the results of comparing the performance of cloud models with locally deployed models would be very important.
Response 2:
We believe that this is a very promising field for future research. We therefore added the following to the conclusion:

  • Cloud Versus Local: Given the sensitivity of data collected within SOCs, the computational constraints, and critical response times, it would be of special interest to deploy high-quality local models to analyze the rules.

Comment 3: In formula (6) it is necessary to correct the designation “F1-score”.
Response 3: We changed the font of F1-score and ensured that the notation in the formula is consistent with the rest of the paper.
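For reference, the standard F1-score definition that the corrected formula is assumed to express (the paper's exact notation may differ slightly) is:

```latex
\text{F1-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```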

Your input has been valuable in refining the paper.

Reviewer 3 Report

Comments and Suggestions for Authors

This paper represents a significant contribution to the ongoing exploration of how Large Language Models (LLMs) can enhance Network Intrusion Detection System (NIDS) operations, a critical area in cybersecurity. By investigating the capabilities of prominent LLMs—ChatGPT, Claude, and Gemini—to associate Snort rules with MITRE ATT&CK tactics and techniques, the authors address a pressing challenge in Security Operations Centers (SOCs): the lack of explainability and connection between NIDS alerts and attack methodologies.

The study is particularly innovative in demonstrating how LLMs can autonomously design and execute machine learning pipelines. Tasks such as model selection, feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF), and multi-label classification were entirely guided by the LLMs, highlighting their potential to streamline complex workflows. The emphasis on optimizing evaluation metrics, particularly the F1-score, underscores the utility of LLMs in creating scalable and efficient systems for cybersecurity.
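To make this concrete, a minimal sketch of such a pipeline is shown below, assuming scikit-learn; the rules, labels, and model choice are illustrative assumptions, not the paper's exact LLM-generated configuration.

```python
# A minimal sketch of the described workflow: TF-IDF features feeding a
# multi-label classifier, one binary model per ATT&CK technique. The rule
# texts and technique IDs below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

rules = [
    'alert tcp any any -> any 22 (msg:"SSH brute force attempt"; sid:1;)',
    'alert tcp any any -> any 80 (msg:"SQL injection attempt"; sid:2;)',
]
labels = [["T1110"], ["T1190"]]  # illustrative ATT&CK technique IDs

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)  # rules x techniques binary indicator matrix

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(rules, y)
print(mlb.inverse_transform(model.predict(rules)))
```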

The findings are insightful and practical. The results reveal that while LLMs like ChatGPT and Claude achieve F1 scores exceeding 0.6, traditional machine learning models remain superior in terms of precision, recall, and overall accuracy. This suggests that a hybrid approach—leveraging the interpretability and scalability of LLMs alongside the precision of traditional models—could unlock new possibilities for SOC operations. Furthermore, the study identifies the superiority of T-ICL2 prompt templates, which guide LLMs toward task-specific reasoning, though it also raises concerns about the increased computational costs of In-Context Learning (ICL) in real-time and large-scale applications. 
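For concreteness, an ICL-style prompt of the general kind such templates represent might look as follows; the example rule, its label, and the wording are illustrative assumptions, and the paper's actual T-ICL2 template differs.

```python
# An illustrative in-context-learning (ICL) prompt for rule labeling.
# The rules, label, and instructions are assumptions for illustration only.
EXAMPLE_RULE = 'alert tcp any any -> any 22 (msg:"SSH brute force attempt"; sid:1;)'
EXAMPLE_LABEL = "T1110 (Brute Force)"
TARGET_RULE = 'alert tcp any any -> any 445 (msg:"SMB remote service access"; sid:2;)'

prompt = f"""You are a SOC analyst. Label each Snort rule with the MITRE ATT&CK
technique(s) it detects. Answer with technique IDs only.

Rule: {EXAMPLE_RULE}
Techniques: {EXAMPLE_LABEL}

Rule: {TARGET_RULE}
Techniques:"""

print(prompt)  # this text would be sent to ChatGPT, Claude, or Gemini
```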

One of the most compelling aspects of this paper is its practical relevance. By addressing the explainability gap in NIDS alerts, the research provides a pathway to more effective alert triage and hypothesis generation in SOCs. The detailed comparison between LLMs also lays a foundation for further exploration into optimizing LLM performance for cybersecurity use cases.

The authors should expand the references on generative AI with work on LLM-driven code generation and AI-assisted programming. A discussion of OpenAI's advancements in code generation would be particularly valuable, especially in the context of generating secure code for tasks such as intrusion detection and attack mitigation.

Chen, M. et al. Evaluating Large Language Models Trained on Code. arXiv 2021. https://arxiv.org/abs/2107.03374

Wong, M.-F. et al. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 2023, 25(6), 888.

Additionally, incorporating formal reasoning techniques could further bridge the gap between NIDS rules and attack hypotheses, offering SOC analysts more structured and reliable explanations for alerts as LLMs are used to generate code and analyze the logs of intermediate routers and switches in networks. Analysts often spend significant time investigating unclear rules, and LLMs that can autonomously reason about rule logic and connect rules to attack tactics would greatly enhance analyst efficiency.

Moreover, while the current study focuses on the explainability and scalability of LLMs, future work could evaluate their application in real-time SOC environments, where computational constraints and response times are critical. Exploring fine-tuning methods for LLMs using domain-specific datasets could also improve accuracy and expand their applicability.

 

In conclusion, this paper presents a robust foundation for leveraging LLMs to improve SOC operations and address the evolving cybersecurity landscape. Its focus on hybrid approaches, efficient prompt design, and the autonomous capabilities of LLMs opens new avenues for research and practical application. Incorporating advancements in OpenAI technologies and formal reasoning methods could further enrich this promising area of study, making NIDS operations more effective and SOCs more resilient in the face of modern threats.

Author Response

Thank you for your comprehensive and thoughtful review. We appreciate your constructive feedback. We addressed the suggested corrections as follows:

Comment 1: The authors should expand the references on generative AI with work on LLM-driven code generation and AI-assisted programming. A discussion of OpenAI's advancements in code generation would be particularly valuable, especially in the context of generating secure code for tasks such as intrusion detection and attack mitigation.

Chen, M. et al. Evaluating Large Language Models Trained on Code. arXiv 2021. https://arxiv.org/abs/2107.03374

Wong, M.-F. et al. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 2023, 25(6), 888.

Response 1: We added the following section to the discussion:

6.4. LLM-Driven Code Generation

A key aspect of this study is the use of LLMs to autonomously develop ML pipelines, a task traditionally requiring significant human expertise. OpenAI's Codex, as highlighted by Chen et al. [45], demonstrates the potential of LLMs trained on code to generate robust, domain-specific workflows. In our work, the LLMs selected the ML models, feature engineering, and hyperparameter tuning strategies that achieved an F1-score of 0.87 in technique labeling, further underlining their capability to handle complex cybersecurity tasks.

Code generation driven by LLMs has its own set of limitations. Wong et al. [46] discuss the challenges of generating secure and reliable code in AI-assisted programming, especially in sensitive domains such as cybersecurity. Moreover, proper prompt engineering remains essential to ensure that the generated code meets domain-specific requirements.

In general, embedding LLMs in cybersecurity workflows is a promising direction for future research, both for automation and for infrastructure development, given the continuous advancement of LLMs.
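To illustrate the hyperparameter tuning step referenced above, a minimal sketch assuming scikit-learn is given below; the model, parameter grid, and variable names are illustrative assumptions rather than the exact LLM-generated configuration.

```python
# A sketch of F1-oriented hyperparameter tuning for a multi-label rule
# classifier; the grid and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

# Search TF-IDF n-gram ranges and regularization strength, scoring by
# micro-averaged F1, the metric the study optimizes.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1_micro", cv=3)
# search.fit(rules, y)   # rules: list of rule strings; y: binary label matrix
# print(search.best_params_, search.best_score_)
```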

Comment 2: Additionally, incorporating formal reasoning techniques could further bridge the gap between NIDS rules and attack hypotheses, offering SOC analysts more structured and reliable explanations for alerts as LLMs are used to generate the code and analyze the log of intermediate routers and switches in networks.  Analysts often spend significant time investigating unclear rules, and LLMs that can autonomously reason about rule logic and connect them to attack tactics would greatly enhance their efficiency.

Response 2: Thank you for pointing this out. We consider formal reasoning techniques beyond the scope of this paper, as adding them would necessitate an in-depth analysis; however, we consider them an interesting research direction and therefore added the following to the discussion:

[...] Furthermore, in future research, additional prompt design techniques, especially formal reasoning techniques, and additional LLMs should be tested to further boost the performance of LLMs on the proposed labeling task, allowing practical application within SOCs.

Comment 3: Moreover, while the current study focuses on the explainability and scalability of LLMs, future work could evaluate their application in real-time SOC environments, where computational constraints and response times are critical. Exploring fine-tuning methods for LLMs using domain-specific datasets could also improve accuracy and expand their applicability.

Response 3: We added the following to the conclusion as future work:

  • Domain-Specific Fine-Tuning: Fine-tuning LLMs with domain-specific datasets could improve their accuracy and reduce the need for extensive prompt engineering while increasing applicability. This would be particularly beneficial for complex domains such as ICS and for real-time SOC environments with computational constraints and critical response times.

         [...]

  • Cloud Versus Local: Given the sensitivity of data collected within SOCs, the computational constraints, and critical response times, it would be of special interest to deploy high-quality local models to analyze the rules.

Your input has been valuable in refining the paper.
