Abstract
Large Language Models (LLMs) have emerged as powerful tools in cyber security, enabling automation, threat detection, and adaptive learning. Their ability to process unstructured data and generate context-aware outputs supports both operational tasks and educational initiatives. Despite their growing adoption, current research often focuses on isolated applications, lacking a systematic understanding of how LLMs align with domain-specific requirements and pedagogical effectiveness. This highlights a pressing need for comprehensive evaluations that address the challenges of integration, generalization, and ethical deployment in both operational and educational cyber security environments. Therefore, this paper provides a comprehensive and State-of-the-Art review of the significant role of LLMs in cyber security, addressing both operational and educational dimensions. It introduces a holistic framework that categorizes LLM applications into six key cyber security domains, examining each in depth to demonstrate their impact on automation, context-aware reasoning, and adaptability to emerging threats. The paper highlights the potential of LLMs to enhance operational performance and educational effectiveness while also exploring emerging technical, ethical, and security challenges. The paper also uniquely addresses the underexamined area of LLMs in cyber security education by reviewing recent studies and illustrating how these models support personalized learning, hands-on training, and awareness initiatives. The key findings reveal that while LLMs offer significant potential in automating tasks and enabling personalized learning, challenges remain in model generalization, ethical deployment, and production readiness. Finally, the paper discusses open issues and future research directions for the application of LLMs in both operational and educational contexts. This paper serves as a valuable reference for researchers, educators, and practitioners aiming to develop intelligent, adaptive, scalable, and ethically responsible LLM-based cyber security solutions.
1. Introduction
Large Language Models (LLMs) have emerged as powerful tools across a wide range of fields, revolutionizing the way information is processed, generated, and understood. In domains such as healthcare, finance, education, cyber security, and software engineering, LLMs enhance productivity by automating complex tasks, generating human-like text, and enabling advanced data analysis []. Their ability to understand context, synthesize large volumes of information, and produce coherent responses has led to advances in personalized learning, intelligent tutoring, automated report writing, code generation, and even clinical decision support []. LLMs also foster innovation by bridging knowledge gaps and democratizing access to expertise. Their scalability and adaptability make them invaluable for both individual users and large organizations, positioning them as key enablers of efficiency, accessibility, and creative problem-solving in the digital age [,].
The cyber security domain faces persistent challenges, such as an evolving threat landscape, a global skills shortage, and the increasing sophistication of cyberattacks. Traditional tools often struggle to keep pace with zero-day exploits, social engineering tactics, and large-scale data breaches []. In response to these issues, LLMs have emerged as powerful tools, offering significant impact and benefits across the cyber security lifecycle. LLMs can accelerate threat intelligence by extracting insights from unstructured data, generating indicators of compromise (IOCs), and summarizing security reports in real-time []. They assist in vulnerability assessment, automate incident response workflows, and support hands-on training through interactive simulations and personalized guidance. Furthermore, LLMs enhance security awareness initiatives by delivering tailored educational content to diverse user profiles.
Recent statistics highlight the rapidly growing integration of LLMs into cyber security and education, where more than 55% of organizations have adopted generative AI tools, including LLMs, for cyber security operations, such as threat detection and incident response, and 82% plan to increase their investment in AI-driven security solutions over the next two years []. The global market for LLMs in cyber security is projected to exceed USD 11.2 billion by 2033, driven by increasing demand for scalable, intelligent defence mechanisms []. In parallel, LLM adoption in education is also accelerating, where AI tools are being used to support personalized learning, simulate attacks in safe environments, and train students in ethical hacking and incident responses [].
The integration of LLMs into cyber security presents a complex and evolving research problem, shaped by dynamic threat landscapes, data sensitivity, and ethical concerns. While recent studies have examined LLM applications in areas such as threat intelligence, malware analysis, penetration testing, and incident response, most focus on isolated tasks and overlook how LLMs align with domain-specific requirements, data constraints, or operational workflows [,]. Similarly, recent work on LLMs in cyber security education often lacks depth, with limited attention to instructional effectiveness, long-term learning outcomes, or ethical use in the classroom setting []. To date, existing reviews have largely overlooked this dual perspective, offering either narrowly scoped technical surveys or education-focused discussions that do not reflect operational complexities. This fragmented landscape reveals a clear gap, which is the absence of a comprehensive and integrative review that critically examines LLM applications across both operational and educational domains while addressing technical, pedagogical, and ethical dimensions. As interest in LLMs grows, there is a critical need to move beyond siloed research and examine their role across both operational and educational contexts in an integrated manner. Our study is motivated by this need for a comprehensive understanding that bridges technological innovation with educational and ethical imperatives.
This paper provides a comprehensive and State-of-the-Art review of the evolving role of LLMs in cyber security, addressing both operational and educational dimensions. It goes beyond the existing literature by offering a holistic framework that categorizes LLM applications into six major cyber security domains, with each domain examined in depth to highlight the significance of LLMs in context-aware reasoning, automation, and adaptability to emerging threats. The paper also presents various opportunities for applying LLMs across key cyber security domains, demonstrating their potential to enhance both operational performance and educational effectiveness. Moreover, it identifies and discusses emerging challenges related to LLM integration in cyber environments. The paper uniquely addresses the underexamined area of LLMs in cyber security education by reviewing recent studies and highlighting how these models support personalized learning, hands-on training, and awareness initiatives. Finally, the paper discusses open issues and future research directions for the application of LLMs in both operational and educational contexts.
Compared to similar reviews, this paper offers a uniquely comprehensive perspective on the application of LLMs in cyber security by bridging both operational and educational dimensions, which is an approach not previously explored in the existing literature. While prior surveys have focused primarily on the technical implementation of LLMs in specific security domains, this paper stands out by introducing a holistic framework that systematically categorizes LLM applications across six critical areas of cyber security. This framework not only clarifies the diverse roles LLMs play in automating tasks, enabling context-aware reasoning, and responding to emerging threats but also provides a structured lens through which to assess their broader impact. In addition, this is the first review to explore the significant potential of LLMs in cyber security education, examining how these models enhance personalized learning, hands-on training, and awareness initiatives. By integrating both practical deployments and pedagogical strategies, the paper addresses a critical gap in current research and sets a foundation for future interdisciplinary research.
The contributions of this paper can be summarized as follows:
- Provides a comprehensive, State-of-the-Art review of LLM applications in cyber security, spanning both operational and educational domains.
- Proposes a holistic framework that systematically categorizes LLM use across six core cyber security domains, highlighting their roles in automation, context-aware reasoning, and adaptability to evolving threats.
- Explores diverse practical opportunities for leveraging LLMs across various cyber security functions, highlighting their emerging capabilities.
- Identifies and critically examines emerging challenges in integrating LLMs into cyber environments, including technical, ethical, and security-related implications.
- Provides an in-depth review of LLMs in cyber security education, analysing their applications in personalized learning, interactive training, and awareness-building while discussing key challenges.
- Outlines open issues and suggests future research directions for the responsible and effective use of LLMs in both cyber security practice and education.
The rest of this paper is structured as follows: Section 2 presents an overview of LLMs, their architecture, and the evolution of AI/ML in cyber security; Section 3 compares traditional cyber security systems with those incorporating LLMs; Section 4 reviews recent studies applying LLMs across various cyber security domains, while Section 5 explores their opportunities and practical applications. Section 6 outlines key challenges in adopting LLMs for cyber security. Section 7 focuses on LLMs in cyber security education, highlighting their benefits, applications, and challenges. Section 8 discusses open issues and future research directions for the use of LLMs in cyber security for both operational and educational dimensions, and Section 9 concludes the paper.
2. Background
This section provides an overview of LLMs, their technical foundations, and how advancements in AI and ML have historically shaped cyber security practices.
2.1. An Overview of LLMs
LLMs have greatly impacted Natural Language Processing (NLP), enabling machines to comprehend and generate human-like text with remarkable fluency. These models are trained on vast amounts of data encompassing diverse linguistic patterns and knowledge domains []. Prominent examples such as GPT-4, PaLM, and LLaMA have demonstrated proficiency across a spectrum of tasks, including text summarization, translation, and code generation []. The evolution of LLMs has been driven by advancements in computational resources and the availability of extensive training data, adhering to scaling laws that correlate model performance with size and data volume [,].
The architecture of LLMs is primarily based on the transformer model, introduced by Vaswani et al. [], which relies on self-attention mechanisms to process sequential data efficiently. Unlike traditional RNNs (Recurrent Neural Networks) or LSTMs (Long Short-Term Memory), transformers process entire sequences in parallel, making them highly scalable for large datasets. The architecture consists of an encoder–decoder structure with multiple layers of attention and feedforward neural networks []. The essential components of LLM architecture include multi-head attention, positional encoding, and layer normalization. Multi-head attention allows the model to focus on different parts of the input simultaneously, capturing complex linguistic patterns. Positional encoding ensures the model retains information about word order, while layer normalization stabilizes training. Modern LLMs, like GPT-4 and PaLM, scale these components to billions of parameters, enhancing their ability to generate coherent and contextually relevant text [,].
Several researchers have proposed various architectures and layer configurations for LLMs [,,]. However, the most common architecture consists of six layers, as shown in Figure 1, as follows:
Figure 1.
Layered architecture of LLMs.
- I.
- Input Layer: Tokenization and Embeddings
LLMs begin by transforming raw text into machine-readable tokens, words, or sub-words through tokenization. These tokens are converted into semantic vector representations using word embeddings. To capture word order, positional embeddings are added, compensating for the transformer’s lack of inherent sequential awareness [].
- II.
- Encoder/Decoder Layer
At the core of LLMs are transformer encoder and decoder layers that utilize self-attention to capture contextual relationships among tokens. Using queries, keys, and values, the model identifies relevant words and computes attention weights. Multi-head attention allows parallel context capture, while feedforward networks, residual connections, and layer normalization enhance stability and performance [].
- III.
- Stacking of Layers (Transformer Blocks)
Transformer layers are stacked to form deep networks, enabling hierarchical language understanding. As depth increases, the model captures increasingly abstract and complex patterns, facilitating tasks from grammar recognition to advanced reasoning and dialogue handling [].
- IV.
- Output Layer: Decoding and Prediction
The output layer generates predictions from internal representations. Autoregressive models, like GPT, predict the next token in a sequence, while masked models, like BERT, fill in missing tokens. A softmax function generates probability distributions over the vocabulary, allowing outputs for tasks like completion and question-answering [].
- V.
- Training and Fine-Tuning Layers
LLMs are pretrained on large text corpora to learn general language patterns using objectives like masked or next-token prediction. They are then fine-tuned on specific tasks, adjusting model weights and hyperparameters to enhance performance in targeted applications, such as summarization or translation [,].
- VI.
- Optimization and Scaling Layers
Given their scale, LLMs require significant computational resources. Gradient descent minimizes loss during training, while techniques like model/data parallelism support scalability. Inference efficiency is improved through methods such as quantization, pruning, and distillation, reducing resource demands without sacrificing accuracy [].
2.2. History of AI/ML in Cyber Security
The integration of Artificial Intelligence (AI) and Machine Learning (ML) into cyber security has evolved significantly over the past two decades, driven by the increasing scale, complexity, and sophistication of cyber threats. Traditional cyber security mechanisms, primarily rule-based systems, were effective in earlier computing environments where threats were relatively predictable and less dynamic. However, the surge in zero-day vulnerabilities, polymorphic malware, and targeted attacks necessitated more adaptive and intelligent defence mechanisms, giving rise to the adoption of AI/ML techniques []. Initial applications of ML in cyber security emerged in the early 2000s, particularly in spam filtering and Intrusion Detection Systems (IDSs). One of the earliest successful applications was the use of Naïve Bayes (NB) and Support Vector Machines (SVM) for spam classification, as demonstrated by Sahami et al. []. These early models laid the groundwork for anomaly detection techniques, where ML models learned normal network behaviour and flagged deviations as potential threats []. As cyber threats evolved, the need for more complex models became a necessity. Researchers began applying unsupervised learning and clustering techniques for malware detection and behavioural analysis [].
The 2010s witnessed a shift towards Deep Learning (DL) techniques, enabled by advances in computational power and large-scale datasets. RNNs and Convolutional Neural Networks (CNNs) found applications in detecting botnets, network intrusions, and malware families [,]. These models offered the advantage of learning hierarchical representations from raw data, significantly improving detection accuracy. For example, Saxe and Berlin [] introduced a DL network for static malware classification that outperformed traditional feature engineering approaches. Despite these advances, AI/ML in cyber security also presented challenges. Adversarial ML became an area of growing concern, highlighting the vulnerability of ML models to evasion and poisoning attacks []. Additionally, the explainability of AI systems remained a key issue, particularly in critical domains like threat detection and incident response. In recent years, the emergence of transfer learning and pretrained models has opened new frontiers in cyber security applications, with models like BERT and GPT being adapted for threat intelligence, phishing detection, and code analysis []. These developments laid the foundation for today’s exploration of LLMs in cyber security, enabling a more contextual, human-like understanding of threats, behaviours, and mitigations.
While each generation of AI/ML approaches in cyber security has contributed notable advancements, their evolution also reflects a series of trade-offs. Traditional ML models, such as SVMs and decision trees, offered transparency and efficiency but required extensive feature engineering and struggled with adaptability to novel threats. DL techniques improved detection accuracy and scalability by learning hierarchical features; however, they lacked interpretability and introduced high computational demands []. More recently, pretrained language models and LLMs have demonstrated strong performance in handling unstructured data and performing complex reasoning tasks. However, their use introduces new concerns, including susceptibility to adversarial prompts, hallucination, and ethical risks related to bias and misuse []. Thus, while the effectiveness of AI approaches has generally increased over time, so too have their limitations, requiring careful consideration of robustness, transparency, and governance in modern deployments. This trajectory highlights the need for critical evaluation as AI systems are integrated into increasingly sensitive cyber environments (Figure 2).
Figure 2.
Evolution of AI and ML in cyber security leading to LLMs [].
3. Traditional vs. LLM-Based Cyber Security
Traditional cyber security systems have long relied on deterministic, rule-based mechanisms and statistical ML approaches to detect and mitigate cyber threats. These methods have been instrumental in protecting against a wide range of attacks, such as malware, phishing, network intrusions, and Denial-of-Service (DoS) attacks. Signature-based systems, such as antivirus software and IDSs, are built on predefined patterns of known threats. While effective against well-understood and previously identified attacks, these systems often struggle to adapt to novel or obfuscated threats []. Beyond signature-based systems, traditional anomaly detection techniques leverage statistical methods or classical ML algorithms like Decision Trees (DTs), SVMs, and K-nearest neighbours (KNNs). These systems require extensive feature engineering and domain knowledge to design inputs that represent security events, such as packet payloads or user behaviour patterns []. While these approaches can identify deviations from normal behaviour, they are often limited by false positives, lack of scalability, and reduced adaptability in dynamic and adversarial environments. Although these approaches have demonstrated effectiveness in controlled environments, empirical studies have increasingly shown their limitations in dynamic or adversarial settings.
Moreover, traditional cyber security tools typically operate in siloed architectures, focusing on narrow domains, such as endpoint protection, firewall filtering, or network traffic analysis. They lack a holistic understanding across contexts, languages, and modalities, making it difficult to detect complex multi-stage attacks or interpret intent in unstructured data like emails, chats, or social media posts []. LLMs represent a paradigm shift in cyber security applications. Unlike traditional models that rely on handcrafted features and limited context, LLMs are pretrained on massive amounts of data and fine-tuned to understand complex linguistic and semantic structures. This enables LLMs to excel in natural language understanding, making them particularly effective in tasks such as phishing detection, threat intelligence summarization, vulnerability triage, and social engineering simulation [].
A key differentiator of LLMs is their contextual reasoning ability. They can generate, interpret, and summarize text-based artefacts such as logs, code snippets, or threat reports, which traditional systems cannot handle natively. For example, LLMs can analyse an email and determine whether its tone, structure, and wording match common phishing tactics, without the need for explicit rules or labelled datasets []. Similarly, in malware analysis, LLMs can assist in deobfuscating code and suggesting interpretations or remediations, functioning as intelligent assistants rather than just alert generators. Another major advantage of LLMs is their adaptability through few-shot and zero-shot learning, where the model performs tasks without extensive retraining. This is crucial in cyber security, where new threats emerge constantly, and labelled data is rare [].
Additionally, LLMs facilitate the automation of routine but time-consuming tasks that typically require human intervention, such as generating policy reports, summarizing lengthy security logs, and even writing secure code snippets. Traditional systems, while capable of identifying known threat patterns, often require significant analyst involvement to interpret and act on results. LLMs can streamline this process by offering actionable insights or recommended mitigations in natural language, significantly reducing response time in Security Operations Centres (SOCs) []. Furthermore, their ability to interact conversationally allows for seamless integration with analysts through chatbots or co-pilot tools, enabling real-time assistance and bridging the gap between data complexity and human interpretation []. Table 1 compares traditional cyber security systems and LLM-based systems.
Table 1.
Comparison of traditional cyber security systems vs. LLM-based systems.
While Table 1 offers a conceptual comparison, it is also important to anchor these contrasts in real-world evaluations. Several empirical studies provide insights into the comparative performance of traditional and LLM-based approaches. For instance, Noever [] demonstrated that GPT-4 detected approximately four times more vulnerabilities than traditional static analysers, such as Fortify and Snyk, while also proposing viable remediations. In a separate study, Sheng et al. [] showed that LLM-driven systems, like LProtector, outperformed baseline vulnerability detection tools on C/C++ codebases in terms of F1 score, demonstrating higher accuracy and recall. Similarly, Nahmias et al. [] reported that LLMs trained on contextual persuasion features outperformed classical classifiers in spear-phishing detection, suggesting strong capability in identifying subtle semantic cues that traditional filters miss.
Although LLMs provide countless benefits, the integration of LLMs into cyber security workflows must be approached with caution. While they augment human capabilities, they also introduce new dependencies and vulnerabilities. For instance, overreliance on model outputs without verification may lead to overlooked threats or the implementation of flawed mitigations. Moreover, the ethical concerns around bias, data leakage, and misuse of generative outputs necessitate strong governance frameworks []. These concerns are even more noticeable in critical infrastructure environments, where real-time processing constraints, regulatory compliance requirements, and the need for transparent, explainable decision-making limit the viability of opaque, high-latency models. As a result, the deployment of LLMs in such contexts demands additional scrutiny, human oversight, and integration with trusted, auditable systems. In addition, as highlighted by Sovrano et al. [], LLMs exhibit position-based bias, frequently missing vulnerabilities that appear near the end of input sequences, a flaw not typically observed in deterministic systems. Moreover, empirical evaluations by Isozaki et al. [] and Happe et al. [] found that while LLMs like GPT-4-turbo could automate penetration testing sub-tasks, they still struggled with multi-step reasoning and error recovery, especially without human intervention. These findings highlight that while LLMs offer promising new capabilities, they are not universally superior and often require complementary integration with traditional methods. Therefore, a hybrid approach that combines the precision and predictability of traditional systems with the flexibility and intelligence of LLMs is currently the most pragmatic pathway forward.
Building on this hybrid approach, a growing number of industry solutions are adopting hybrid models that combine both paradigms. Traditional tools, such as signature-based IDSs or rule-based firewalls [], provide reliability, interpretability, and regulatory familiarity. However, they lack adaptability to novel, unstructured, or evolving threats []. On the other hand, LLMs offer enhanced context awareness and generative capabilities, excelling in natural language understanding and real-time threat analysis [,]. Hybrid architectures aim to utilize the strengths of both: using deterministic models for baseline threat detection while leveraging LLMs for tasks like log summarization, phishing analysis, or contextual alert prioritization [,]. This layered approach not only improves detection accuracy but also addresses critical operational concerns, such as explainability and analyst trust. Industry implementations, including LLM-assisted co-pilots integrated into SOCs [], reflect this trend and point toward a pragmatic path forward where legacy systems and AI-powered tools merge in a secure, scalable manner.
4. LLMs in Cyber Security
The increasing adoption of LLMs in cyber security reflects a notable shift in how modern threats are detected, understood, and mitigated. As cyber threats grow in scale, complexity, and subtlety, traditional systems often fall short in responding to evolving attack vectors, particularly those that leverage social engineering, code obfuscation, or multi-stage strategies. In contrast, LLMs and their fine-tuned derivatives have demonstrated a remarkable ability to understand, generate, and contextualize human language, making them uniquely suited for analysing unstructured data. This paper advances the existing literature by offering a holistic framework that categorizes LLM applications into six key cyber security domains, including vulnerability detection, anomaly detection, cyber threat intelligence, blockchain security, penetration testing, and digital forensics. These domains were selected based on the maturity of research, practical impact, and the demonstrated value of LLMs in addressing real-world cyber security challenges. Within each domain, this paper examines in depth how LLMs contribute to context-aware reasoning, automation, and adaptability to emerging threats (Figure 3).
Figure 3.
Holistic framework to categorize LLM applications into six key cyber security domains.
This categorization not only brings clarity to the fragmented landscape of LLM applications in cyber security but also serves as a structured lens through which researchers and practitioners can assess current capabilities and identify future opportunities. The chart, shown in Figure 4, reinforces this categorization by illustrating the distribution of published research papers across the six identified domains from 2023 to 2025. Notably, vulnerability detection (VD) and anomaly detection (AD) exhibit steady growth, reflecting their foundational role in proactive cyber defence. Cyber threat intelligence (CTI) saw a dramatic spike in 2024, suggesting heightened research interest in leveraging LLMs for extracting actionable insights from threat data. In contrast, areas like penetration testing (Pentest) and blockchain security (BC) show relatively modest and consistent publication volumes, potentially indicating emerging research interest in these areas. Digital forensics (DF), while initially underexplored in 2023, witnessed significant attention in 2024, possibly driven by new investigative use cases and tools.
Figure 4.
Published papers on LLM-based systems in each cyber security domain (2023–2025).
4.1. Vulnerability Detection
The application of LLMs in vulnerability detection has rapidly evolved into one of the most active and impactful areas of LLM-driven cyber security research. A wide range of studies reviewed by Zhang et al. [] have explored how LLMs can enhance traditional vulnerability detection techniques by improving automation, reducing false positives, and enabling the semantic understanding of code and configurations. The key to this evolution is the integration of LLMs into both static and dynamic analysis pipelines.
Several studies have demonstrated the use of LLMs to detect vulnerabilities directly from source code. Zhou et al. [,], and Tamberg and Bahsi [] proposed approaches using LLMs to identify insecure coding patterns and semantic weaknesses, emphasizing the benefit of pretrained language models over traditional pattern-matching tools. Similarly, Mahyari [] and Mao et al. [] leveraged LLM models, like CodeBERT and GPT-3, for vulnerability classification tasks, showing that LLMs could generalize across different programming languages and vulnerability types with minimal retraining. Beyond classification, several researchers focused on vulnerability localization and explanation. For example, Cheshkov et al. [] designed a method to highlight precise code lines contributing to a vulnerability, while Purba et al. [] applied LLMs to reason about causality in code defects. These efforts represent a shift toward interpretable vulnerability detection, where models not only identify vulnerable code but also offer human-readable explanations.
Dozono et al. [] extended this evaluation across multiple programming languages, developing CODEGUARDIAN, a VSCode-integrated tool that enhances developers’ accuracy and speed in detecting vulnerabilities. Berabi et al. [] also proposed DeepCode AI Fix, a system that utilizes program analysis to focus LLMs’ attention on relevant code snippets, thereby improving the efficiency of vulnerability fixes. Similarly, Noever [] assessed GPT-4′s performance against traditional static analysers, like Snyk and Fortify, finding that GPT-4 identified approximately four times more vulnerabilities and provided viable fixes with a low false positive rate. Sheng et al. [] also proposed LProtector, an LLM-driven vulnerability detection system that leverages GPT-4o and Retrieval-Augmented Generation (RAG) to identify vulnerabilities in C/C++ codebases, outperforming State-of-the-Art baselines in F1 score.
LLMs have also been integrated into software development environments. Khare et al. [] and Jensen et al. [] demonstrated how developer-facing tools powered by LLMs could offer real-time feedback on insecure code during the writing process, bridging the gap between detection and secure development. Shestov et al. [] extended this by proposing LLM-assisted bug reporting systems that correlate vulnerabilities with developer comments and test results. Other works, like those by Li et al. [] and Kouliaridis et al. [], explored the fine-tuning of LLMs for vulnerability datasets, improving detection precision and recall. Guo et al. [] and Wang et al. [] focused on model robustness in the face of obfuscated or adversarial code, an important area given the tendency of attackers to hide exploit logic. At the infrastructure level, Bakhshandeh et al. [] and Mathews et al. [] showed how LLMs could identify misconfigurations in containerized environments and cloud infrastructures, extending vulnerability detection beyond traditional application code. LLMs were also utilized as virtual security response experts. Lin et al. [] demonstrated that LLMs could assist in vulnerability mitigation by retrieving relevant CVE and CWE data and suggesting remediation strategies. Kaheh et al. [] took this a step further, arguing that LLMs should not only answer queries but also perform actions, such as instructing a firewall to block a malicious IP.
Recent research has increasingly focused on the application of LLMs for vulnerability detection and code security. Dolcetti et al. [] and Feng et al. [] explored enhancing LLM performance by integrating them with traditional techniques like testing, static analysis, and code graph structures, improving both detection and repair capabilities. Similarly, Lin and Mohaisen [] investigated the effectiveness of fine-tuned LLMs and general-purpose models, highlighting the role of dataset quality and token length in shaping model reliability. Torkamani et al. [] proposed CASEY, a triaging tool that accurately classifies CWEs and predicts severity levels, demonstrating practical utility in automating security assessments. In parallel, broader methodological and empirical advancements have surfaced. Ghosh et al. [] introduced CVE-LLM, embedding ontology-driven semantics from CVE and CWE databases to enhance context-aware vulnerability evaluations. Also, Sheng et al. [] provided a State-of-the-Art survey mapping out the LLM landscape in software security and identifying key challenges. Sovrano et al. [] also offered critical insight into LLM limitations by identifying the “lost-in-the-end” effect, where vulnerabilities near the input’s end are frequently missed, calling for smarter input management strategies. Table 2 summarizes the contributions and limitations of recent LLM-based systems for vulnerability detection in ascending order per year of publication.
Table 2.
Summary of contributions and limitations of recent LLM-based systems for vulnerability detection.
Although research on LLM-based vulnerability detection continues to advance, it also uncovers recurring challenges that limit practical deployment. Many systems, such as those proposed by Li et al. [] and Mahyari [], rely on fine-tuning LLMs with labelled datasets to enhance accuracy. However, this approach introduces bias risks and reduces performance on rare or unseen vulnerabilities. Several works, including those by Purba et al. [] and Wang et al. [], focus on enriching LLM reasoning or generalization through causality analysis or code embeddings but at the cost of scalability and compute intensity. A common strength across reviewed studies, such as those by Khare et al. [], Jensen et al. [], and Shestov et al. [], is integration with developer environments for real-time feedback and contextualized insights. However, these systems suffer from trust issues, hallucinations, or reliance on high-quality developer context. Additionally, limitations in generalizability across codebases or languages [,] and challenges in handling obfuscated or deeply nested code [,] highlight persistent gaps. Other studies aim to address these weaknesses by ensuring scalability, cross-language robustness, and enhanced resilience to code complexity, which is lacking in the current systems [,].
Several LLM-based tools show promising potential for production deployment. CODEGUARDIAN [], for example, is integrated into Visual Studio Code, enabling real-time vulnerability detection during coding, which is a valuable feature for secure software development pipelines. DeepCode AI Fix [] extends this by not only identifying but also suggesting fixes, making it suitable for continuous integration workflows. LProtector [] combines GPT-4o with RAG to effectively detect C/C++ vulnerabilities in large codebases, showing promise for enterprise-scale applications. These tools go beyond experimental use and are designed with real-world developer workflows in mind. Their architecture and integration features position them well for practical deployment; however, broader industry validation is still needed.
4.2. Anomaly Detection
Anomaly detection plays a crucial role in cyber security, and recent studies have increasingly leveraged LLMs to enhance this capability across various contexts, from system logs to email filtering and traffic inspection. A standout contribution in log-based anomaly detection is LogPrompt [], which utilizes LLMs to parse and semantically understand unstructured logs. It applies chain-of-thought reasoning and in-context learning to distinguish between normal and anomalous behaviour, thus moving beyond static rule sets. LEMUR [] further expands this field by employing entropy sampling for efficient log clustering and uses LLMs to differentiate between invariant tokens and parameters with high accuracy, greatly improving the effectiveness of template merging. In a practical cloud environment, ScaleAD [] offers an adaptive anomaly detection system where a Trie-based agent flags potential anomalies, and an LLM is queried to assess log semantics and assign confidence scores. Similarly, LogGPT [] introduces a full pipeline that preprocesses logs, crafts context-rich prompts, and parses LLM outputs for anomaly evaluation. These techniques allow the system to intelligently detect anomalies even in high-volume environments. Adding to the landscape, Han et al. [] improve GPT-2′s focus in log sequence analysis through a Top-K reward mechanism, which directs model attention to the most relevant tokens, thus enhancing detection precision.
Outside of logs, LLMs have been evaluated for their performance in phishing and spam detection. The Improved Phishing and Spam Detection Model (IPSDM) [] fine-tunes DistilBERT and RoBERTa for email filtering, showing LLMs’ promise in real-time protection against social engineering. Si et al. [] also highlight ChatGPT’s potential, outperforming BERT on low-resource Chinese datasets but showing limitations on large-scale English corpora. Nahmias et al. [] targeted spear-phishing detection with a unique approach using LLMs to generate contextual document vectors that reflect persuasion tactics, outperforming classical classifiers in recognizing malicious intent. Further, Heiding et al. [] examined the dual-use nature of LLMs in both crafting and detecting phishing emails. Their analysis demonstrates that models like GPT-4, Claude, and PaLM outperform humans in detection tasks, though they warn of adversarial misuse. An additional threat domain is covered by Vörös et al. [], who propose a student–teacher LLM approach for malicious URL detection via knowledge distillation, making the model suitable for lightweight deployments. Also, Guastalla et al. [] fine-tuned LLMs for DDoS detection on pcap-based and IoT datasets, adapting to real-world traffic complexities.
Other research has extended LLM-based anomaly detection beyond logs and text into diverse data modalities. Alnegheimish et al. [] proposed sigllm, a zero-shot time series anomaly detector that converts numerical sequences into a textual form, enabling LLMs to forecast and flag anomalies without specific training. Gu et al. [] introduced AnomalyGPT, leveraging vision–language models like MiniGPT-4 for industrial defect detection using synthetic image–text pairs, eliminating manual thresholding. Elhafsi et al. [] also applied LLMs in robotic systems, identifying semantic inconsistencies in visual behaviours, while Li et al. [] demonstrated that LLMs can detect outliers in tabular data with zero-shot capabilities, which was further improved through fine-tuning for real-world contexts.
Guan et al. [] proposed a framework combining BERT and LLaMA models to detect anomalies in system records. By aligning semantic vectors from log messages, the model effectively captures anomalies without relying on traditional log parsing methods. Su et al. [] conducted a systematic literature review on the use of LLMs in forecasting and anomaly detection. The review highlights the potential of LLMs in these areas while also addressing challenges such as data requirements, generalizability, and computational resources. Ali and Kostakos [] integrated LLMs with traditional ML techniques and explainability tools, like SHAP, in a hybrid model called HuntGPT, which enriches anomaly detection through human-interpretable insights.
Recently, Liu et al. [] introduced CYLENS, an LLM-powered assistant that supports security analysts in interpreting anomalies, attributing threats, and suggesting mitigations by leveraging contextual memory and historical CTI through expert-style dialogues. Wang and Yang [] also proposed a federated LLM framework that performs anomaly detection across distributed and privacy-sensitive environments by processing multimodal data without centralizing it. Qian et al. [] focused on Android malware detection with LAMD, a multi-tiered system that integrates program slicing, control flow graphs, and LLM prompting to detect and explain malicious behaviour while also reducing hallucinations through consistency verification. Benabderrahmane et al. [] target stealthy APTs with APT-LLM, encoding low-level system traces into text and using LLMs to generate semantic embeddings, which are analysed via autoencoders to detect subtle deviations.
Additionally, Meguro and Chong [] proposed AdaPhish to advance phishing defence by automating phishing email analysis and anonymization using LLMs and vector databases, which serve both detection and organizational education. Another tool named SHIELD [] goes beyond detection by combining statistical and graph-based techniques with LLM-generated APT kill chain narratives, which provide security analysts with actionable and human-readable summaries for stealthy threat detection. In the same way, Akhtar et al. [] reviewed the LLM-log analysis landscape and identified major methodologies, like fine-tuning, RAG, and in-context learning, and outlined the trade-offs and practical challenges in real deployments. Walton et al. [] also introduced MSP, which enhances malware reverse engineering with structured LLM-driven summarization that categorizes threats and pinpoints specific malicious behaviours. Table 3 provides a summary of the contributions and limitations of recent LLM-based systems for anomaly detection in ascending order per year of publication.
Table 3.
Summary of contributions and limitations of recent LLM-based systems for anomaly detection.
The reviewed literature demonstrates varied approaches to utilize LLMs in anomaly detection but also highlights common limitations. Several studies, such as those by Qi et al. [] and Liu et al. [], focus on LLM-based log analysis, emphasizing structured outputs and integration with existing systems. However, these approaches suffer from latency issues or reliance on prompt quality. Similarly, phishing and spam detection systems [,] showcase the effectiveness of LLMs in recognizing persuasive language but are hindered by risks of misuse, generalization gaps, or poor performance on novel formats. Other studies proposed solutions to address issues of real-time detection, cross-domain generalization, and reduced dependency on handcrafted prompts. For example, systems, like those by Gandhi et al. [] and Benabderrahmane et al. [], attempt to provide deeper semantic understanding through kill chain alignment or provenance embeddings; however, they still face challenges in computational overhead and interpretability. While many papers suggest innovative techniques, like federated learning [] or zero-shot detection [], their practical deployment remains limited.
4.3. Cyber Threat Intelligence (CTI)
The integration of LLMs into CTI has become a prominent area of study, as researchers assess how these models can assist with threat detection, analysis, and decision-making. Recent advances in CTI leverage LLMs to transform unstructured and distributed threat data into actionable insights. Mitra et al. [] proposed LocalIntel, a framework that synthesizes global threat databases, like CVE and CWE, with organization-specific knowledge, allowing for personalized, high-context intelligence summaries. Perrina et al. [] automate CTI report generation using LLMs to extract threat knowledge from unstructured sources, highlighting LLMs’ capability in structuring scattered data. Fayyazi and Yang [] also fine-tuned LLMs with structured threat ontologies, resulting in models that excel in producing precise, technique-specific threat narratives. Also, Schwartz et al. [] proposed LMCloudHunter, which translates OSINT into candidate detection rules for signature-based systems using LLMs, thus connecting high-level intelligence with operational defences.
Other studies have focused on extracting CTI from unstructured text. For example, Clairoux-Trepanier et al. [] evaluated GPT-based LLMs for extracting key threat indicators, showing promise in identifying IOCs and attack tactics. Siracusano et al. [] introduced the aCTIon framework, combining LLMs with a parsing engine to export CTI in the STIX format, enabling integration with existing intelligence platforms. Hu et al. [] extended this idea using fine-tuned LLMs to build and populate knowledge graphs from raw CTI, improving the structured representation. Zhang et al. [] also proposed AttacKG, a full LLM-based pipeline that rewrites and summarizes CTI into attack graphs. Fieblinger et al. [] validated open-source LLMs for extracting entity–relation triples, confirming their role in enriching semantic CTI knowledge graphs.
Park and You [] proposed CTI-BERT, a domain-adapted BERT model trained on cyber security-specific corpora to enhance the extraction of structured threat intelligence from unstructured data. Gao et al. [] also presented ThreatKG, which uses LLMs to extract entities and relationships from CTI reports to construct and update a cyber threat knowledge graph, supporting continuous integration into analytic workflows. To address the challenge of limited labelled data, Sorokoletova et al. [] proposed 0-CTI, a versatile framework supporting both supervised and zero-shot learning for threat extraction and STIX alignment, ensuring standardized data communication. Wu et al. [] focused on verifying extracted intelligence, using LLMs to identify key claims in reports and validate them against a structured knowledge base, enhancing the credibility of threat intelligence. Fayyazi et al. [] demonstrated how integrating RAG with LLMs enriches and improves the accuracy of MITRE ATT&CK TTP summaries. Singla et al. [] also showed LLMs’ effectiveness in analysing software supply chain vulnerabilities and identifying causes and propagation paths. Zhang et al. [] tackled bug report deduplication by enhancing traditional similarity-based models with LLM-driven semantic insights, leading to more accurate identification of duplicate reports.
To support strategic defence reasoning, Jin et al. [] proposed Crimson, a system that aligns LLM outputs with MITRE ATT&CK tactics through a novel Retrieval-Aware Training (RAT) approach. This fine-tuning method significantly reduced hallucinations while improving alignment with expert knowledge. Similarly, Tseng et al. [] proposed an LLM-powered agent to automate SOC tasks, such as extracting CTI from reports and generating regular expressions for detection rules. Rajapaksha et al. [] also introduced a QA system using LLMs and RAG to support context-rich attack investigations. Shah and Parast [] applied GPT-4o with one-shot fine-tuning to automate CTI workflows, reducing manual effort while ensuring accuracy. Shafee et al. [] assessed LLM-driven chatbots for OSINT, showing their ability to classify threat content and extract entities. Alevizos and Dekker [] also proposed an AI-powered CTI pipeline that fosters collaboration between analysts and LLMs, enhancing both speed and precision in threat intelligence generation. Daniel et al. [] also evaluated LLMs’ ability to map Snort NIDS rules to MITRE ATT&CK techniques, highlighting their scalability and explainability. Similarly, Alturkistani et al. [] explored the significant role of LLMs in ICT education, highlighting their potential to enhance personalised learning, automate content generation, and support practical skill development. Table 4 summarizes the contributions and limitations of recent LLM-based systems for CTI in ascending order per year of publication.
Table 4.
Summary of the contribution and limitations of recent LLM-based systems in CTI.
The reviewed studies highlight the growing role of LLMs in enhancing CTI through automation, semantic understanding, and integration with structured formats. Many studies, such as those by Perrina et al. [], Siracusano et al. [], and Zhang et al. [], employ LLMs to convert unstructured threat data into structured formats, like STIX or knowledge graphs, offering semantic depth and automation. However, these systems suffer from limitations like hallucinated outputs, parsing errors, and cumulative complexity across multi-stage pipelines. Similarly, RAG-based approaches, such as those by Jin et al. [] and Fayyazi et al. [], improve factual accuracy by grounding responses in external sources, but their performance heavily depends on the quality and structure of the retrieval corpus. Tools designed for operational use, such as those by Wu et al. [] and Rajapaksha et al. [], show promise in supporting SOC analysts through verification and context-aware reasoning; however, they struggle with latency and incomplete context. Furthermore, domain-specific models like CTI-BERT [] offer improved accuracy but lack generalizability. A common limitation across various studies, especially those by Hu et al. [] and Liu et al. [], is the difficulty in grounding LLM-generated outputs and managing conflicting or ambiguous intelligence, which poses risks to reliability and practical deployment.
4.4. Blockchain Security
Recent advancements in blockchain security have seen the integration of LLMs to enhance smart contract auditing and anomaly detection. Chen et al. [] conducted an empirical study assessing ChatGPT’s performance in identifying vulnerabilities within smart contracts. Their findings indicate that while ChatGPT exhibits high recall, its precision is limited, with performance varying across different vulnerability types. David et al. [] also explored the feasibility of employing LLMs, specifically GPT-4 and Claude, for smart contract security audits. Their research focused on optimizing prompt engineering and evaluating model performance using a dataset of 52 compromised DeFi smart contracts. The study revealed that LLMs correctly identified vulnerability types in 40% of cases but also demonstrated a high false positive rate, underscoring the continued necessity for manual auditors.
Gai et al. [] proposed BLOCKGPT, a dynamic, real-time approach to detecting anomalous blockchain transactions. By training a custom LLM from scratch, BLOCKGPT acts as an IDS, effectively identifying abnormal transactions without relying on predefined rules. In experiments, it successfully ranked 49 out of 124 attacks among the top three most abnormal transactions, showcasing its potential in real-time anomaly detection. Hu et al. [] proposed GPTLENS, an adversarial framework that enhances smart contract vulnerability detection by breaking the process into two stages: generation and discrimination. In this framework, the LLM serves dual roles as both AUDITOR and CRITIC, aiming to balance the detection of true vulnerabilities while minimizing false positives. The experimental results demonstrate that this two-stage approach yields significant improvements over conventional one-stage detection methods. Similarly, Sun et al. [] proposed GPTScan, a tool that combines GPT with static analysis to detect logic vulnerabilities in smart contracts. Unlike traditional tools that focus on fixed-control or data-flow patterns, GPTScan leverages GPT’s code understanding capabilities to identify a broader range of vulnerabilities. The evaluations on diverse datasets showed that GPTScan achieved high precision for token contracts and effectively detected ground-truth logic vulnerabilities, including some missed by human auditors. Hossain et al. [] combined LLMs with traditional ML models to detect vulnerabilities in smart contract code, achieving over 90% accuracy.
Yu et al. [] proposed Smart-LLaMA, a vulnerability detection framework leveraging domain-specific pretraining and explanation-guided fine-tuning to identify and explain smart contract flaws with improved accuracy. Similarly, Zaazaa and El Bakkali [] proposed SmartLLMSentry, which employs ChatGPT and in-context learning to detect Solidity-based vulnerabilities with over 90% precision. Also, Wei et al. [] proposed LLM-SmartAudit, a multi-agent conversational system that enhances auditing by outperforming traditional tools in identifying complex logical issues. He et al. [] conducted a comprehensive review on integrating LLMs in blockchain security tasks, covering smart contract auditing, anomaly detection, and vulnerability repair. Moreover, Ding et al. [] proposed SmartGuard, which uses LLMs to retrieve semantically similar code snippets and apply Chain-of-Thought prompting to improve smart contract vulnerability detection, achieving near-perfect recall. Ma et al. [] proposed iAudit, a two-stage LLM-based framework for smart contract auditing. The system employs a Detector to flag potential vulnerabilities and a Reasoner to provide contextual explanations, improving audit interpretability. However, its effectiveness is contingent on the quality and diversity of fine-tuning data. Similarly, Bu et al. [] introduced vulnerability detection in DApps by fine-tuning LLMs on a large dataset of real-world smart contracts. Their method excelled in identifying complex, non-machine-auditable issues, achieving high precision and recall. However, its robustness depends on the continual inclusion of diverse, evolving contract patterns. Table 5 summarizes the contributions and limitations of recent LLM-based systems in blockchain security in ascending order per year of publication.
Table 5.
Summary of the contribution and limitations of recent LLM-based systems in blockchain security.
The reviewed studies highlight the growing interest in applying LLMs for smart contract auditing and blockchain security, with varying approaches in model architecture, data usage, and integration techniques. A common trend among various studies, such as those by Hu et al. [], Ma et al. [], and Zaazaa and El Bakkali [], is the adoption of multi-stage or fine-tuned frameworks aimed at improving vulnerability detection accuracy and interpretability. Other studies, including those by David et al. [] and Chen et al. [], focused on evaluating general-purpose LLMs, like GPT-4, revealing promising recall rates but also significant limitations in precision and high false positive rates. Real-time detection systems, like BLOCKGPT [] and Smart-LLaMA [], highlight the importance of domain-specific training; however, they suffer from a dependence on large, high-quality datasets. Similarly, systems like LLM-SmartAudit [] and the one studied by Hossain et al. [] explore hybrid architectures, though they introduce additional complexity and face challenges with generalization to evolving platforms. While tools like SmartGuard [] and GPTScan [] explore novel prompting or static analysis integration, their performance is constrained when multiple or subtle vulnerabilities exist. Generally, most systems demonstrate potential but struggle with generalizability, data dependency, scalability, or interpretability, highlighting the need for balanced, robust, and empirically validated solutions.
Several tools show growing readiness for real-world deployment, particularly in smart contract auditing. iAudit [] simulates a structured audit process using LLM-driven detection and reasoning, making it suitable for automated contract security assessments. SmartLLMSentry [] and Smart-LLaMA [] apply pretrained and fine-tuned LLMs to detect vulnerabilities in Solidity code, achieving high accuracy on real-world datasets. GPTScan [] and SmartGuard [] combine static analysis with LLM reasoning to uncover logic vulnerabilities often overlooked by traditional tools. While large-scale production deployments are not yet widely documented, these tools demonstrate strong potential for integration into blockchain security workflows.
4.5. Penetration Testing
The integration of LLMs into penetration testing has seen significant advancements through recent research. Deng et al. [] introduced PentestGPT, an automated framework that utilizes LLMs to perform tasks such as tool usage, output interpretation, and suggesting subsequent actions. Despite its modular design aimed at mitigating context loss, the study highlighted challenges in maintaining a comprehensive understanding throughout the testing process. Building upon this, Isozaki et al. [] developed a benchmark to evaluate LLMs like GPT-4o and Llama 3.1-405B in automated penetration testing. Their findings indicate that while these models show promise, they currently fall short in executing end-to-end penetration tests without human intervention, particularly in areas like enumeration, exploitation, and privilege escalation.
Exploring the offensive capabilities of LLMs, Fang et al. [] demonstrated that GPT-4 could autonomously exploit 87% of tested one-day vulnerabilities when provided with CVE descriptions, outperforming other models and traditional tools. However, the model’s effectiveness significantly decreased without detailed descriptions, exploiting only 7% of vulnerabilities. In the reconnaissance phase, Temara [] examined ChatGPT’s ability to gather information such as IP addresses and network topology, aiding in penetration test planning. While the tool provided valuable insights, the study emphasized the need for further validation of the accuracy and reliability of the data obtained. Deng et al. [] evaluated PentestGPT, demonstrating a 228.6% increase in task completion compared to GPT-3.5. The modular design improved automation for penetration testing sub-tasks, though challenges with context-maintenance and complex scenarios persist.
Happe et al. [] proposed a fully automated privilege escalation tool to evaluate LLMs in ethical hacking, revealing that GPT-4-turbo exploited 33–83% of tested vulnerabilities, outperforming GPT-3.5 and LLaMA3. The study also explored how factors like context length and in-context learning impact performance. Happe and Cito [] also developed an AI sparring partner using GPT-3.5 to analyse machine states, recommend concrete attack vectors, and automatically execute them in virtual environments. Bianou et al. [] proposed PENTEST-AI, a multi-agent LLM-based framework that automates penetration testing by coordinating agents through the MITRE ATT&CK framework. While it improves test coverage and task distribution, its reliance on predefined attack patterns limits adaptability to emerging threats. Table 6 summarizes the contributions and limitations of recent LLM-based systems in penetration testing in ascending order per year of publication.
Table 6.
Summary of the contribution and limitations of recent LLM-based systems in penetration testing.
The reviewed studies highlight the growing interest in applying LLMs to automate and enhance penetration testing; however, they also reveal common challenges and diverging strategies. Several studies, such as those by Deng et al. [,] and Bianou et al. [], proposed modular or multi-agent frameworks, demonstrating improved task decomposition and alignment with frameworks like MITRE ATT&CK. However, these systems face difficulties in handling complex scenarios or adapting to novel threats due to their reliance on predefined patterns or limited context retention. Other studies, such as those by Temara [] and Fang et al. [], examined specific capabilities of LLMs, such as reconnaissance or exploiting known vulnerabilities, but their findings highlight a dependency on structured input and the need for better validation mechanisms. Studies by Happe and Cito [] and Happe et al. [] also explored the use of LLMs as supportive tools for human testers; however, they emphasized concerns around ethical deployment and the models’ inability to manage errors and maintain focus during tasks. Isozaki et al. [] also proposed benchmarks to evaluate LLM performance, but their findings confirm that human oversight remains crucial for end-to-end testing. Generally, while these approaches show promise, limitations persist in scalability, reliability, and context management across tasks.
4.6. Digital Forensics
LLMs have shown significant promise in enhancing the efficiency and accuracy of digital forensics by automating complex tasks such as evidence search, anomaly detection, and report generation. Many studies have investigated their effectiveness in addressing traditional challenges in forensic investigations, with promising results in areas like ransomware triage, synthetic dataset generation, and cyber incident analysis. Yin et al. [] provided an overview of how LLMs are transforming digital forensics, emphasizing their capabilities and limitations while highlighting the need for forensic experts to understand LLMs to maximize their potential in various forensic tasks. Similarly, Scanlon et al. [] evaluated ChatGPT’s impact on digital forensics, identifying low-risk applications, like evidence searching and anomaly detection, but also noting challenges related to data privacy and accuracy. Wickramasekara et al. [] also explored the integration of LLMs into forensic investigations, highlighting their potential to improve efficiency and overcome technical and judicial barriers, though emphasizing the need for appropriate constraints to fully realize their benefits.
Bin Oh et al. [] proposed volGPT, an LLM-based approach designed to enhance memory forensics, specifically in triaging ransomware processes. Their research demonstrates high accuracy and efficiency in identifying ransomware-related processes from memory dumps, providing more detailed explanations during triage than traditional methods. Similarly, Voigt et al. [] combined LLMs with GUI automation tools to generate realistic background activity for synthetic forensic datasets. This innovation addresses the challenge of creating realistic “wear-and-tear” artefacts in synthetic disk images, improving forensic education and training. In addition, Michelet and Breitinger [] explore the use of local LLMs to assist in writing digital forensic reports, demonstrating that while LLMs can aid in generating content, human oversight remains essential for accuracy and completeness, highlighting their limitations in complex tasks like proofreading. Bhandarkar et al. [] evaluated the vulnerabilities of traditional DFIR pipelines in addressing the threats posed by Neural Text Generators (NTGs) like LLMs. Their study introduced a co-authorship text attack, CS-ACT, revealing significant weaknesses in current methodologies and emphasizing the need for advanced strategies in source attribution. Loumachi et al. [] also proposed GenDFIR, a framework that integrates LLMs with RAG for enhancing cyber incident timeline analysis. By structuring incident data into a knowledge base and enabling semantic enrichment through LLMs, GenDFIR automates timeline analysis and improves threat detection.
Similarly, Zhou et al. [] presented an LLM-driven approach that constructs Forensic Intelligence Graphs (FIGs) from mobile device data, achieving over 90% coverage in identifying evidence entities and their relationships. Also, Kim et al. [] evaluated the application of LLMs, like GPT-4o, Gemini 1.5, and Claude 3.5, in analysing mobile messenger communications, demonstrating improved precision and recall in interpreting ambiguous language within real crime scene data. Sharma [] proposed ForensicLLM, a locally fine-tuned model based on LLaMA-3.1-8B, which outperformed its base model in forensic Q&A tasks and showed high accuracy in source attribution. Xu et al. [] conducted a tutorial exploring the potential of LLMs in automating digital investigations, emphasizing their role in evidence analysis and knowledge graph reconstruction. On the other hand, Cho et al. [] discussed the broader implications of LLM integration in digital forensics, emphasizing the need for transparency, accountability, and standardization to fully realize their benefits. Table 7 summarizes the contributions and limitations of recent LLM-based systems in digital forensics in ascending order per year of publication.
Table 7.
Summary of the contribution and limitations of recent LLM-based systems in digital forensics.
The reviewed studies demonstrate increasing interest in applying LLMs to digital forensics while also exposing key limitations. Several studies focus on task-specific applications, such as ransomware triage [], mobile message analysis [], and forensic report writing []. Other studies, such as those by Scanlon et al. [] and Yin et al. [], take a broader view, evaluating the overall impact and limitations of LLMs across digital forensic tasks. Common themes include the need for human oversight due to LLM hallucinations and errors [,], limited generalizability to diverse datasets [], and challenges in prompt engineering and model selection []. While tools like GenDFIR [] and ForensicLLM [] show promise in enhancing timeline analysis and Q&A tasks, respectively, their effectiveness is constrained by scalability and training data limitations. Furthermore, the lack of standardization, transparency, and ethical controls is a recurring concern [,]. Generally, while LLMs offer promising potential in digital forensics, robust validation, domain adaptation, and expert supervision remain essential for reliable deployment.
Table 8 offers a comprehensive overview of popular LLMs and their applicability across the six cyber security domains that this paper is focusing on. It highlights the versatility and evolution of these models over time, showcasing their strengths and limitations in specific areas. While many models demonstrate utility in tasks like vulnerability detection and threat intelligence, fewer are readily equipped for niche applications, such as blockchain security and digital forensics. This variation reflects the need for task-specific fine-tuning and the integration of domain-relevant datasets to fully unlock each model’s potential. All the models listed in the table can be applied; however, their effectiveness in these specialized areas is generally limited without fine-tuning. These models are not explicitly designed for such tasks, so they require adaptation using domain-specific datasets and objectives. With proper fine-tuning and contextual alignment, they can support targeted functions, such as anomaly detection, smart contract analysis, threat pattern recognition, and forensic data interpretation, though their performance may still vary depending on the complexity of the task.
Table 8.
Popular LLM models across six cyber security domains.
Across the six surveyed domains mentioned earlier, LLMs have introduced capabilities that extend well beyond the scope of traditional rule-based or statistical approaches. Numerous empirical studies show measurable gains in key performance metrics. For instance, LProtector [] achieved State-of-the-Art F1 scores on C/C++ vulnerability detection, and AnomalyGPT [] demonstrated superior detection of industrial defects compared to classical thresholds. Similarly, Nahmias et al. [] reported that contextual vectors derived from LLMs outperformed traditional classifiers in spear-phishing detection. These results suggest that when applied with targeted domain knowledge, LLMs can meaningfully improve detection accuracy, reduce false positives, and deliver more context-aware security insights.
However, these advantages come with critical limitations. A consistent challenge across domains is the limited robustness of LLMs under adversarial, ambiguous, or high-stakes operational conditions. Studies, such as the one by Sovrano et al. [], highlight position-based bias in LLM outputs, while Happe et al. [] showed that GPT-4-turbo struggles with privilege escalation tasks requiring multi-step reasoning. In CTI, tools like LMCloudHunter [] and 0-CTI [] improve automation but rely heavily on high-quality retrieval sources or STIX formats, which can break in noisy environments. Meanwhile, systems like PentestGPT [,] perform well in isolated tasks but still require human oversight for complex logic flows. These empirical findings highlight that while LLMs bring substantial innovations, they are not universally applicable nor fully reliable as stand-alone systems. Critical deployment risks, such as hallucination, cost, latency, and explainability, persist across all domains. Therefore, despite their promising trajectory, LLMs must be deployed within hybrid architectures that retain deterministic safeguards and enable human-in-the-loop oversight.
5. Opportunities for LLMs in Cyber Security
In addition to the opportunities for LLMs in cyber security discussed in the previous section by reviewing State-of-the-Art studies regarding vulnerability detection, anomaly detection, cyber threat intelligence, blockchain, penetration testing, and digital forensics, there are several emerging areas in cyber security where LLMs can play a pivotal role. This section explores these new opportunities, demonstrating the potential of LLMs in the evolving cyber security landscape.
5.1. Malware Analysis and Detection
Malware detection is vital in cyber security. Unlike traditional ML models that rely on known signatures, LLMs analyse context, behaviour, and patterns in large datasets. They can detect new or unknown malware variants using textual descriptions, code snippets, and network traffic, offering a more adaptive and intelligent detection approach []. LLMs such as GPT-3, BERT, and CodeBERT have demonstrated their usefulness in analysing code snippets and software behaviours to detect malware signatures and anomalies []. Their NLP capabilities allow for a contextual understanding of code and file behaviour, offering improved detection over traditional signature-based approaches. Rahali and Akhloufi [] proposed MalBERTv2, a code-aware BERT-based model for malware detection, and demonstrated that it significantly outperformed traditional models by incorporating semantic features of code into the classification process.
LLMs are particularly effective when applied to dynamic analysis, where they can monitor and interpret the behaviour of a program during execution. Rondanini et al. [] demonstrated that LLMs, when trained on large datasets of both benign and malicious behaviour, can predict new malware behaviours more accurately than traditional signature-based systems. LLMs’ contextual analysis allows them to recognize subtle indicators of suspicious activity that may not align with conventional signatures but still indicate malicious intent. Furthermore, LLMs can generate malware code and simulate attack behaviours for cyber security training purposes. Liu et al. [] demonstrated how LLMs were used to automatically generate sophisticated malware code for penetration testing, enabling security teams to test their defences against realistic and evolving threats.
5.2. Real-Time Phishing Detection
Phishing remains a major threat, often evading traditional filters. LLMs enhance detection by analysing email content and identifying suspicious elements, like misleading URLs, attachments, and writing styles, making them effective against evolving and sophisticated phishing tactics [,]. They can understand subtle cues, contextual anomalies, and linguistic patterns that suggest malicious intent. Jamal and Wimmer [] demonstrated how fine-tuned LLMs significantly outperform traditional ML models in identifying phishing content, especially zero-day phishing emails, due to their contextual reasoning capabilities.
By continuously learning from new data and adapting to evolving phishing tactics, LLMs can provide an additional layer of protection against email-based threats. This proactive, context-driven approach is crucial in defending against the increasing sophistication of phishing schemes. Furthermore, studies have shown that LLMs can analyse entire email threads rather than single messages in isolation, allowing them to detect more nuanced social engineering attacks that unfold across multiple messages. For example, Nahmias et al. [] utilized a fine-tuned LLM model to classify phishing emails with a focus on conversational flow and semantic coherence, achieving significant gains over static filters. Additionally, when integrated with SIEM systems, LLMs can help in real-time threat prioritization by flagging high-risk emails for immediate response, thereby reducing the burden on human analysts [].
5.3. Automated Vulnerability Management
Automating vulnerability management with LLMs offers great potential. This process, which typically demands manual effort and expertise, can be streamlined by LLMs that analyse and remediate vulnerabilities. Trained on extensive datasets, LLMs can predict flaws in new code or systems, easing the burden on security teams and enhancing efficiency []. Wang et al. [] showed that LLMs could analyse code for patterns that correspond to common vulnerabilities, such as SQL injection, cross-site scripting (XSS), or buffer overflows. The model’s ability to recognize subtle coding practices that may lead to security holes helps identify vulnerabilities early in the development lifecycle.
LLMs are increasingly used to improve software security through secure code generation and automated vulnerability remediation. Pearce et al. [] showed that Codex could detect vulnerable code patterns and suggest secure alternatives with greater accuracy than traditional static analysis tools. Ghosh et al. [] demonstrated LLMs’ ability to repair vulnerabilities, like buffer overflows, while explaining their fixes. Lin and Mohaisen [] found that LLMs can recommend secure coding alternatives to mitigate risks, aiding developers without deep security expertise. Additionally, LLMs trained on large codebases can generate security-hardened code snippets, reducing human error and promoting secure development practices []. These capabilities highlight LLMs’ potential to proactively support developers in writing and maintaining secure software.
5.4. Threat Intelligence Automation
LLMs can significantly enhance threat intelligence by automating the extraction of valuable insights from unstructured data sources, like blogs, news articles, and dark web forums, streamlining the process of identifying emerging threats []. A key application of LLMs is the analysis of OSINT to detect emerging threats. Shafee et al. [] demonstrated how LLMs can process massive volumes of OSINT data to identify trends in attack methods, tools, and targets. By analysing social media posts, forums, and even code repositories, LLMs can highlight potential vulnerabilities or zero-day exploits before they become widely known. This early detection allows organizations to take preventive actions to mitigate these risks.
LLMs can also automate the creation of threat intelligence reports by summarizing key findings from raw data. Gao et al. [] demonstrated that LLMs can take raw, unstructured threat data and generate structured reports, making it easier for cyber security professionals to digest and act on relevant intelligence. This capability is especially valuable in high-volume environments where security teams need to process large amounts of data quickly. In addition, fine-tuned LLMs can be used to classify and prioritize threats based on contextual relevance, helping analysts distinguish between benign anomalies and critical attack indicators []. For example, the use of transformer-based models has enabled cyber security platforms to automatically cluster similar threat indicators and associate them with specific adversary tactics and techniques [,].
5.5. Social Media and Dark Web Monitoring
Cybercriminals frequently use social media and dark web forums to communicate, share tools, and coordinate attacks. LLMs can monitor these platforms to detect potential threats by identifying discussions about exploits, attack vectors, and vulnerabilities []. They can be trained to recognize illicit cyber activities, such as the sale of exploit kits or ransomware-as-a-service. Shah and Parast [] demonstrated how LLMs can extract relevant intelligence from unstructured dark web data, aiding cyber security teams and law enforcement in identifying emerging threats. Additionally, LLMs can analyse social media posts to detect early signs of attacks, such as leaked data or targeted threats [].
Beyond passive monitoring, LLMs can actively support cyber threat-hunting workflows by identifying disinformation campaigns, social engineering tactics, and recruitment efforts by malicious actors. These models detect linguistic markers and behavioural patterns linked to bots, troll farms, or coordinated inauthentic behaviour []. Fang et al. [] used a fine-tuned GPT-based model to attribute social media content to known threat actors through linguistic fingerprinting and thematic clustering, which is particularly valuable for tracking APTs and state-sponsored campaigns. On the dark web, LLMs also help uncover relationships between threat actors by analysing language patterns, aliases, and interaction threads, revealing hidden networks and collaboration []. Social media and dark web monitoring is a promising area for LLMs in cyber security, offering opportunities to further develop automated threat detection by extracting actionable intelligence from unstructured and covert online communications.
5.6. Incident Response and Playbook Generation
LLMs can transform incident response by automating playbook creation and providing real-time, tailored recommendations. Kim et al. [] demonstrated that LLMs can generate customized playbooks based on specific attack scenarios, enabling rapid, consistent responses and reducing human error. LLMs also support decision-making and coordination across teams and tools by analysing real-time data []. Additionally, Scanlon et al. [] showed that LLMs can integrate with SOAR systems to automate mitigation actions, such as blocking malicious IPs or isolating affected systems, streamlining response efforts and improving efficiency.
Moreover, Wudaliet al. [] demonstrated how LLMs could convert NIST or MITRE ATT&CK documentation into automated workflows for policy enforcement or compliance reporting. These models help bridge the gap between human-readable policy language and machine-executable logic. By translating regulatory and procedural documents into actionable formats, LLMs ensure the consistent application of security policies across systems. This is particularly valuable for organizations navigating complex compliance environments, where aligning technical controls with legal and industry standards is critical []. LLMs can also assist in continuously monitoring for policy violations and recommending remediation steps when deviations are detected []. Incident response and playbook generation represent a key opportunity for LLMs in cyber security, where further exploration could enhance automated, context-aware decision-making and streamline coordinated responses to evolving threats.
5.7. Advanced Penetration Testing and Red Teaming
Penetration testing and red teaming are crucial activities for identifying vulnerabilities and testing an organization’s security defences. LLMs can play a role in automating these activities by generating attack simulations and analysing the effectiveness of defences []. LLMs can simulate sophisticated attack strategies, including advanced persistent threats (APTs), which can test an organization’s ability to withstand complex cyberattacks. Temara [] showed that LLMs could generate detailed penetration testing scripts and attack scenarios based on the latest tactics, techniques, and procedures used by cybercriminals. In red-teaming exercises, LLMs can act as virtual adversaries, simulating attacks to identify gaps in security defences. As reported by Brown et al. [], LLMs can generate realistic adversarial strategies, from phishing to exploiting vulnerabilities in web applications, providing security teams with a more comprehensive understanding of their weaknesses.
Moreover, LLMs are being used to augment automated reconnaissance and payload crafting, streamlining the preparation phases of red teaming. Happe et al. [] demonstrated that fine-tuned LLMs can craft custom payloads, obfuscate malicious commands, and adapt exploits based on target system configurations. These capabilities enable red teams to execute context-aware attacks that mimic real-world adversaries, increasing the realism and impact of penetration tests. Additionally, in continuous security validation scenarios, LLMs can generate test cases dynamically to evaluate endpoint and perimeter defences against evolving threats. This shift toward LLM-driven red teaming transforms the landscape of proactive defence by enabling scalable, intelligent, and adaptive adversarial simulation that would otherwise require extensive manual effort and expertise [,].
In summary, LLMs are transforming cyber security by offering intelligent, context-aware solutions across several critical domains. They enhance malware analysis by interpreting code behaviour, identifying unknown threats, and simulating malware for testing. LLMs also outperform traditional filters in phishing detection through their deep linguistic understanding and ability to analyse entire email threads. For vulnerability management, LLMs predict and remediate flaws, suggest secure coding practices, and generate secure code. They automate threat intelligence by extracting actionable insights from unstructured sources like OSINT and dark web forums. LLMs also monitor social media and dark web communications, uncovering disinformation campaigns and threat actor networks. In incident response, they generate adaptive playbooks, translate policy documents into workflows, and integrate with SOAR systems for real-time mitigation. Furthermore, LLMs bolster penetration testing and red teaming by simulating complex attacks and crafting adaptive payloads. These diverse applications highlight the vast and evolving potential of LLMs to automate, scale, and strengthen cyber defence capabilities.
6. Challenges of LLMs in Cyber Security
Despite the promising potential of LLMs in cyber security, they also present significant challenges. This section explores these challenges and possible solutions that should be implemented for the successful integration of LLMs into cyber security practices.
6.1. Model Reliability and Accuracy
One of the most persistent concerns in cyber security applications of LLMs is model reliability. LLMs often suffer from “hallucinations”, where they generate plausible but factually incorrect information. This is especially critical in security contexts, where accuracy can mean the difference between a successful defence and a breach []. According to Yan et al. [], hallucinations in cyber security question-answering systems can mislead learners and practitioners alike, making LLM outputs unreliable without external validation. Additionally, due to the static nature of their training data, many LLMs lack up-to-date knowledge of emerging threats, tactics, and vulnerabilities. This makes them ill-suited for real-time threat intelligence unless augmented with retrieval systems or live data [].
Furthermore, model reliability varies based on prompt structure, context, and domain-specific knowledge, making reproducibility a challenge. In high-stakes environments such as SOCs, where precision is critical, even minor inaccuracies in model output can mislead analysts or delay incident response []. To mitigate these issues, researchers have explored RAG approaches, where models pull in external, validated documents during inference [,]. Rajapaksha et al. [] proposed CyberRAG, which combines LLM capabilities with domain-specific knowledge graphs to validate outputs, improving answer reliability. However, such methods still require continuous updates to knowledge bases and ontologies to remain effective in rapidly evolving threat landscapes.
6.2. Explainability and Transparency
Another major challenge is the lack of explainability and transparency in LLM behaviour. These models operate as “black boxes”, producing outputs without clearly indicating how they arrived at those decisions. In mission-critical applications like intrusion detection or automated incident response, this opacity can undermine trust and accountability []. Users and auditors may be unable to verify or understand why a particular alert was generated or why a certain response action was recommended. Zhou et al. [] argue that the inability to trace model reasoning poses ethical and operational risks, particularly in regulated sectors. In cyber security, where evidence-based investigation is essential, the lack of transparency limits the usefulness of LLM-generated intelligence for forensic or legal purposes. Similarly, Hossain et al. [] emphasized that the opaque nature of LLM decisions makes it difficult for security teams to trust and verify model-generated insights, particularly in environments where transparency is mandated, such as government or financial sectors [].
Efforts to increase interpretability, such as attention heatmaps or prompt engineering, are not always effective or practical in complex cyber scenarios. For instance, the AISecKG ontology described by Agrawal et al. [] helps contextualize and validate model responses, offering a path toward greater transparency. However, Explainable AI (XAI) techniques are still maturing and integrating them into LLMs at scale without compromising performance is a significant hurdle []. Until these models can provide interpretable reasoning behind their outputs, their integration into decision-making processes in cyber security will remain constrained.
6.3. Adversarial Use of LLMs
LLMs present a double-edged sword in cyber security, as their capabilities can also be weaponized by malicious actors. One growing concern is the generation of sophisticated malware through prompt engineering []. As demonstrated by Yan et al. [], LLMs can be manipulated to write polymorphic code that mimics legitimate software and executes malicious payloads. Although providers have added safety filters, jailbreaks and prompt injections can often bypass these constraints. This makes LLMs a potential tool for low-skill attackers to develop functional malware with minimal effort []. Brown et al. [] demonstrated how GPT-4 could be prompted to generate obfuscated code resembling known malware, bypassing basic detection systems. This raises alarm over LLMs being weaponized by threat actors for automated exploit development.
Another alarming application is in social engineering, where LLMs can generate highly convincing phishing emails or social media messages. The natural language capabilities of LLMs make them ideal tools for crafting convincing phishing emails, as demonstrated in a study by Meguro and Chong [], where LLMs outperformed humans in generating deceptive content. Threat actors can use these models to craft personalized lures at scale, mimicking the style and tone of legitimate communications [,]. Prompt injection attacks can also compromise systems that integrate LLMs into workflows, enabling attackers to manipulate output or extract sensitive data. These risks highlight the need for robust input sanitization and access controls, which are often lacking in early-stage LLM deployments.
6.4. Security and Privacy Risks
LLMs are prone to prompt leakage, where sensitive data embedded in queries can be inadvertently exposed in outputs or logs. This risk is particularly severe in settings involving private threat intelligence or incident response. Moreover, the training data of many LLMs may contain sensitive or proprietary information scraped from online sources, leading to unintentional data disclosure []. Prompt leakage, where sensitive inputs can be inferred or exposed via crafted queries, is a notable risk. Yan et al. [] showed that LLMs can unintentionally memorize and regurgitate snippets of sensitive training data, including passwords or confidential information, especially if exposed during pretraining. This raises compliance issues under laws like GDPR or CCPA, particularly if user data is included without consent [].
Data poisoning is another critical issue. Attackers can manipulate the training data or fine-tuning process to embed backdoors or biases into the model. Ferrag et al. [] indicated that backdoored LLMs can produce specific outputs when triggered, allowing adversaries to covertly influence model behaviour. In cyber security, this could result in a model that ignores specific IoCs or misclassifies malicious code. Since LLMs rely heavily on large-scale, often uncrated data, ensuring dataset integrity is a monumental task. Techniques such as differential privacy, data auditing, and secure fine-tuning are being explored to mitigate these risks, but they require significant resources and expertise to implement effectively [,].
6.5. Ethics and Governance
Ethical considerations and governance frameworks are also essential when deploying LLMs in cyber security. A primary issue is the dual-use nature of these models; while they can be used defensively, they can also empower threat actors. This duality complicates discussions around open access, model release strategies, and responsible disclosure []. Kambourakis et al. [] argue that governance must evolve to address the societal risks posed by foundation models, especially in sensitive domains. In cyber security, this means implementing rigorous access controls, usage policies, and audit mechanisms to prevent misuse. Also, Yao et al. [] argue that governance in cyber security applications must ensure that LLMs are used responsibly, preventing misuse by users.
Moreover, regulatory compliance is an evolving landscape. LLMs must align with data protection regulations like GDPR, which mandates transparency, accountability, and user rights. However, enforcing these principles is challenging given the opaque nature of LLMs and the scale at which they operate []. For example, Article 22 of the GDPR grants individuals the right not to be subject to automated decisions with legal consequences, which may apply to automated LLM-based threat assessments []. Meeting such requirements demands significant investments in compliance tooling, documentation, and human oversight, all of which add to operational complexity.
6.6. Evaluation and Benchmarking
LLM evaluation in cyber security is limited by benchmarks that fail to capture complex, evolving threats, like polymorphic malware or obfuscated code, causing general-purpose models to underperform in domain-specific tasks []. Static datasets also struggle to reflect rapidly changing attack patterns. While CyberRAG uses ontology-based validation to align with cyber security rules [], it still depends on human oversight for complex cases, highlighting the need for dynamic, domain-specific evaluation frameworks to ensure reliable and automated model assessment. Another critical issue lies in the lack of standardized evaluation metrics tailored to cyber security applications. Many studies use accuracy or F1-scores without considering the cost of false positives or the severity of false negatives in threat detection scenarios []. Additionally, benchmarking LLMs across different subfields, such as vulnerability detection, threat intelligence summarization, and anomaly detection, requires nuanced, task-specific datasets that are rarely available due to sensitivity or proprietary concerns []. This scarcity of high-quality, annotated, and up-to-date data severely hampers reproducibility and cross-comparison, making it difficult to gauge true performance or progress in applying LLMs to cyber security.
6.7. Resource and Infrastructure Demands
LLMs are computationally intensive, requiring significant hardware for both training and inference. In cyber security environments where real-time processing is crucial, the latency introduced by large models can hinder operational effectiveness. Deploying LLMs for real-time threat detection may not be feasible in low-resource environments, like edge devices or air-gapped systems [,]. Moreover, maintaining an up-to-date, secure deployment pipeline for LLMs adds to the operational burden. This includes periodic retraining with fresh data, validating security updates, and ensuring compliance with internal and external standards. These challenges may limit LLM adoption in organizations with limited technical infrastructure or cyber security expertise [].
Figure 5 shows the radar chart of the key challenges associated with the use of LLMs in cyber security, highlighting the severity of each issue on a scale from 0 to 10. The graph identifies seven major areas of concern that were discussed earlier. The most critical challenge is the adversarial use of LLMs, rated at the maximum score of 10, emphasizing the serious threat posed by malicious exploitation, such as the generation of malware, phishing content, and prompt injection attacks. Security and privacy risks (9) and model reliability and accuracy (9) also rank high, indicating substantial concerns regarding the trustworthy and safe deployment of these models, particularly in real-time or sensitive environments. Explainability and transparency (8) further reinforce this issue, as opaque model reasoning can exacerbate the risks of hallucinations and erode trust in automated decision-making. The challenge of evaluation and benchmarking received the lowest score (6), suggesting that while it remains important, it is currently perceived as less urgent compared to direct threats and vulnerabilities.
Figure 5.
A radar chart of key challenges of LLMs in cyber security highlights the severity of each issue on a scale from 0 to 10.
Beyond simply ranking these challenges, the radar chart offers insight into their interdependence. For example, poor model reliability often stems from a lack of transparency, and both can contribute to heightened security risks. Similarly, adversarial misuse is closely tied to gaps in governance and the absence of robust input validation. The moderate rating for ethics and governance (7) reflects the growing recognition of dual-use concerns and regulatory implications but also highlights the need for more mature policy and oversight mechanisms. The relatively high score for resource and infrastructure demands (8) illustrates how technical feasibility, such as latency constraints and computational requirements, can significantly impact deployment in operational settings. Finally, although the challenge of evaluation and benchmarking received the lowest score, it plays a foundational role; without domain-specific benchmarks and robust metrics, it becomes difficult to systematically address other challenges.
7. LLMs in Cyber Security Education
The digital transformation driven by AI, cloud computing, and IoT has significantly broadened the cyber threat landscape, increasing the global demand for cyber security professionals. In the UK alone, there is a notable shortfall of approximately 11,200 cyber security specialists, with 37% of businesses lacking essential in-house skills []. This skills gap poses a serious threat to national infrastructure, economic resilience, and personal data privacy. To address this, UK higher education institutions have introduced a variety of cyber security degree programmes at both undergraduate and postgraduate levels. These are closely aligned with professional frameworks such as the Cyber Security Body of Knowledge (CyBOK) [] and government-backed schemes like the NCSC-certified degree programme []. Such alignment ensures that graduates acquire both foundational theories and practical competencies needed in the industry. As cyber threats and technologies evolve, universities must adapt by integrating emerging innovations, like AI and LLMs, into their curricula while promoting inclusive and flexible learning opportunities. Bridging the skills gap is a multifaceted challenge requiring educational reform and societal engagement.
Cyber Security Education (CSE) faces a multitude of challenges, many of which are rooted in the rapidly evolving nature of the field, a persistent skills shortage, and the complexity of delivering hands-on, interdisciplinary learning [,,,]. These challenges can be summarized as follows:
- Curriculum Lag: Academic courses often fail to keep pace with fast-changing industry threats due to slow update and validation processes [].
- Lecturer Shortage: There is a limited pool of educators with both academic and real-world cyber security expertise, especially in this specialized area [].
- Practical Learning Barriers: Many institutions lack the infrastructure for hands-on training, such as cyber labs and ranges, which are expensive and complex to maintain [].
- Steep Learning Curve: The technical breadth of cyber security can overwhelm students, particularly those from non-technical backgrounds [].
- Assessment Difficulties: Ensuring fairness and academic integrity in practical assessments is challenging amid the rising use of online solutions [].
- Lack of Interdisciplinary Content: Curricula often overemphasize technical skills, neglecting essential areas like policy, law, and ethics [,].
- Access and Equity: Financial, geographic, and institutional barriers prevent many students from accessing high-quality cyber security education [].
7.1. Related Work of LLMs in CSE
LLMs are revolutionizing cyber security education by enabling interactive, scalable, and personalized learning experiences. As the cyber security landscape becomes increasingly complex and dynamic, LLMs offer promising solutions to bridge knowledge gaps, automate content creation, and enhance educational support. Several recent studies have explored their application in this domain. Zhao et al. [] introduced CyberRAG, an ontology-aware RAG system that enhances question-answering by integrating domain-specific knowledge, thereby reducing hallucinations and improving response accuracy. Agrawal et al. [] also proposed CyberQ, a system that combines LLMs with knowledge graphs to generate relevant questions and answers for cyber security education. This approach automates the creation of educational content, facilitating more dynamic and responsive learning environments. CyberQ complements CyberRAG by focusing on content generation, whereas CyberRAG emphasizes answer validation. Further extending this work, Zhao et al. [] developed CyberBOT, a chatbot deployed in a graduate-level course that validates LLM outputs using course-specific materials and cyber security ontologies, demonstrating the practical value of structured reasoning in educational contexts.
Shepherd [] examined a UK master‘s-level cyber security program, uncovering key vulnerabilities in assessments due to generative AI misuse, especially in projects and reports. Factors like block teaching and a largely international student body heightened these risks. The study advocates for LLM-resistant assessment designs, AI detection tools, and an ethics-focused academic culture. Also, Ohm et al. [] conducted a three-year study on ChatGPT’s impact across different student cohorts. While weekly assignment performance remained stable, final exam scores declined with increased ChatGPT usage. Despite a mandatory disclosure policy, 80% of students admitted anonymously to using ChatGPT; however, none reported it officially. This highlights growing concerns around honesty, dependency, and maintaining integrity in AI-supported education.
Nizon-Deladoeuille et al. [] evaluated six leading LLMs on penetration testing tasks, finding models like GPT-4o mini and WhiteRabbitNeo particularly effective in offering context-aware guidance. This highlights LLMs’ capability to support hands-on technical training. Al-Dhamari and Clarke [] also focused on enhancing cyber security awareness programs through GPT-driven personalization. Their work demonstrated that adaptive, profile-based training significantly improves learner engagement and outcomes, suggesting LLMs can address the limitations of static, one-size-fits-all instruction. Extending the application to support underrepresented learners, Wang et al. [] proposed CyberMentor, a platform using agentic workflows and RAG to offer tailored mentorship and skills guidance. Tann et al. [] also evaluated leading LLMs, including ChatGPT, Bard, and Bing, in professional certification exams and Capture-The-Flag (CTF) challenges. They found ChatGPT effective in answering factual multiple-choice questions (82% accuracy) but limited in conceptual reasoning. Performance varied across different CTF challenges, with prompt engineering playing a significant role. The study also highlighted vulnerabilities in LLM safety mechanisms, revealing how jailbreak prompts can bypass ethical safeguards.
Shao et al. [] explored the effectiveness of LLMs in solving CTF challenges using both human-in-the-loop (HITL) and fully automated workflows. Their results show that LLMs outperform average human participants, highlighting their potential for automated cyber security training and assessment. This study highlights the ability of LLMs to solve real-world security puzzles, paving the way for their use in cyber security education. In addition, Nelson et al. [] proposed SENSAI, an AI-powered tutoring system deployed in an applied cyber security curriculum. SENSAI uses LLMs to deliver real-time, context-aware feedback by integrating with students’ terminals and files, significantly improving problem-solving efficiency across over 15,000 sessions. Its scalability, comparable in cost to a single teaching assistant, shows promise for broad educational reach. Additionally, Yamin et al. [] proposed CyExec, a framework that uses GPT models and RAG to dynamically craft complex, adaptive cyber security scenarios. By turning LLM hallucinations into creative leverage, CyExec offers diverse and realistic threat simulations rigorously validated through expert review. Park and Simmons [] also explored the use of generative AI in cyber security education to address challenges like outdated materials, diverse student needs, and the absence of teaching assistants. It highlights AI’s potential to create dynamic, personalized learning environments but also raises concerns about privacy, data ownership, and transparency in its implementation. Table 9 summarizes the contributions and limitations of recent LLM-based systems in cyber security education in ascending order per year of publication.
Table 9.
Summary of the contribution and limitations of recent LLM-based systems in cyber security education.
The studies reviewed emphasize the expanding role of LLMs in cyber security education, particularly in enhancing engagement, personalizing learning, and automating content generation. A recurring similarity across various studies, such as those by Agrawal et al. [], Yamin et al. [], and Zhao et al. [], is the integration of knowledge graphs or ontologies to improve question generation or response accuracy. However, this reliance introduces limitations around scalability and the need for regular content updates. Several studies, such as those by Tann et al. [] and Shao et al. [], highlight the effectiveness of LLMs in challenge-based scenarios, like CTFs, though they also caution against generalizing findings due to dataset or domain specificity. In addition, studies, like those by Al-Dhamari and Clarke [], Nelson et al. [], and Wang et al. [], focus on LLM-driven training and mentoring platforms, showcasing the potential for scalable and inclusive learning while recognizing risks related to AI bias, privacy, and the need for long-term validation. Shepherd [] and Ohm et al. [] offer more critical views, raising concerns about assessment integrity and cognitive dependency.
7.2. Opportunities for LLMs in CSE
LLMs offer innovative opportunities in cyber security education by enabling personalized, interactive, and scalable learning. They can automate content creation, simulate real-world scenarios, and provide adaptive support. LLMs can offer several opportunities/benefits to cyber security education, including the following:
7.2.1. Personalized Learning and Tutoring
LLMs present a novel approach to personalized learning by acting as intelligent tutors capable of delivering adaptive, real-time feedback. Through context-aware assistance, these models can cater to diverse learner profiles, offering both foundational and advanced explanations tailored to individual needs []. Al-Dhamari and Clarke [] demonstrated that GPT-driven personalization enhances cyber security awareness by increasing engagement through profile-based content. Similarly, Wang et al. [] developed CyberMentor, a RAG-powered platform that provides tailored guidance to under-represented students, emphasising the role of agentic workflows in skill development.
7.2.2. Curriculum Enhancement & Content Generation
LLMs facilitate efficient curriculum development by automating the generation of cyber security content, such as case studies, lecture materials, and assessments []. Agrawal et al. [] developed CyberQ, a system that utilizes LLMs with knowledge graphs to dynamically produce relevant questions and answers, thereby enriching learning environments with responsive and diverse content. Park and Simmons [] emphasized how generative AI can address the limitations of static curricula, particularly the rapid obsolescence of teaching materials in cyber security, by enabling continuously updated and student-centred educational resources.
7.2.3. Practical Skills Development
Beyond theoretical learning, LLMs contribute meaningfully to hands-on skills development. Nizon-Deladoeuille et al. [] evaluated multiple LLMs and found models like GPT-4o mini to be particularly adept at offering context-sensitive support in penetration testing, helping students grasp complex command-line tasks. Shao et al. [] also highlighted the effectiveness of LLMs in solving CTF challenges through both human-in-the-loop and automated approaches, demonstrating their potential to foster real-world cyber security competencies in training environments.
7.2.4. Assessment and Feedback Automation
LLMs can streamline assessment by generating individualized questions, automatically grading submissions, and offering rubric-aligned feedback. Zhao et al. [] implemented CyberBOT in a graduate cyber security course, where it validated LLM-generated answers using domain-specific ontologies and course materials, ensuring accurate and contextually grounded feedback. However, concerns remain regarding LLM misuse in assessments. Shepherd [] identified vulnerabilities in assessment integrity, especially in project-based tasks, and stressed the need for LLM-resilient assessment designs to maintain academic honesty and mitigate the risks posed by generative AI [].
7.2.5. Promoting Cyber Awareness
Promoting foundational cyber security awareness among students is another area where LLMs excel. LLMs demonstrated that adaptive, LLM-driven training modules significantly improve awareness and retention compared to traditional methods []. These systems not only explain best practices but can also simulate cyber incidents, allowing learners to experience realistic threat scenarios []. Park and Simmons [] advocate for LLMs as tools to address diverse learner needs, suggesting their application in broad-based cyber hygiene education across academic contexts. Feng et al. [] also showed that integrating ChatGPT into cyber security MOOCs increased student engagement and conceptual understanding, particularly in explaining abstract concepts like access control or encryption protocols.
7.2.6. Research and Innovation
LLMs play a pivotal role in advancing research and innovation in cyber security education by facilitating the exploration of new threats, trends, and technologies. These models can process vast amounts of data, helping researchers analyse emerging security challenges and generate novel solutions []. LLMs also support innovation by enabling the rapid creation of educational content, generating research ideas, and automating repetitive tasks, thus accelerating the pace of academic progress. Additionally, LLMs enhance collaboration by fostering the sharing of insights and knowledge across research teams, making them invaluable tools in the ongoing development of cutting-edge cyber security practices and solutions.
7.2.7. Peer Learning Support
LLMs empower educators as co-creators, instructional aides, and scalable mentors. Nelson et al. [] presented SENSAI, an AI-powered tutoring system integrated into a cyber security curriculum, delivering real-time feedback during student labs. With over 15,000 interactive sessions and operational costs equivalent to a single teaching assistant, SENSAI illustrates the scalability and efficiency of LLM-supported faculty assistance. This reduces workload, supports inclusive teaching, and improves peer learning environments by supplementing instructional gaps, especially in large or diverse cohorts. Beyond automation, LLMs foster enriched peer learning environments by offering consistent support, addressing common misconceptions, and enabling timely interventions tailored to diverse learner needs.
7.3. Challenges of LLMs in CSE
Despite their potential, the use of LLMs in cyber security education comes with several challenges, which include the following:
7.3.1. Accuracy and Hallucination
One of the most critical challenges in using LLMs in cyber security education is their tendency to “hallucinate”, which produces confident-sounding but factually incorrect information. This issue becomes particularly problematic in a domain like cyber security, where precision and correctness are vital [,]. Students may rely on these outputs without validating the information, leading to the internalization of misconceptions or faulty practices. This not only undermines the quality of learning but can also be dangerous if applied in real-world security contexts. Tools like CyberRAG [] have attempted to mitigate this by integrating domain ontologies for validation, but hallucinations remain a persistent issue, especially when LLMs are used without appropriate safeguards or human oversight.
7.3.2. Ethical and Academic Integrity
The integration of LLMs in educational contexts has raised significant ethical concerns, particularly regarding their misuse by students. Tools like ChatGPT can easily generate entire assignments, reports, or code snippets, enabling students to bypass critical learning processes. Research by Shepherd [] and Ohm et al. [] highlights that many students use LLMs without disclosure, which jeopardizes academic integrity and skews assessment outcomes. Furthermore, while some institutions have implemented disclosure policies, enforcement is inconsistent and often ineffective []. The challenge is cultivating a culture of responsible AI use that prioritizes transparency, ethics, and learning, alongside technical solutions, like AI detection tools and LLM-resistant assessments.
7.3.3. Assessment Validity
Traditional assessment formats, such as essays, written reports, and basic coding tasks, are increasingly vulnerable to automation by LLMs. Students can now generate high-quality responses with minimal effort, making it difficult for educators to distinguish between original work and AI-generated content. This erodes the reliability of assessment as a tool to evaluate true learning and skill development [,]. To address this, educators are being urged to adopt LLM-resilient assessment formats that emphasize critical thinking, problem-solving, and collaboration. These may include oral exams, live coding, group projects, and scenario-based tasks. The key challenge is ensuring these redesigned assessments remain scalable, inclusive, and aligned with learning outcomes in diverse and often large cohorts.
7.3.4. Data Privacy and Security
Deploying LLMs in educational environments raises critical concerns around data privacy, security, and compliance. When students or educators input prompts that include sensitive personal information, institutional data, or even real cyber security incidents, this information could be processed and retained by third-party AI providers. This is particularly concerning in jurisdictions governed by strict data protection laws, such as the GDPR []. Institutions must therefore carefully evaluate the terms of use of commercial LLMs and consider privacy-preserving alternatives or on-premise deployments. Transparency about data handling, student consent, and secure integration with learning platforms is essential to prevent accidental exposure of confidential information and ensure compliance with ethical and legal standards.
7.3.5. Limited Conceptual Reasoning
Despite their impressive performance in producing syntactically correct and well-structured outputs, LLMs often lack deep conceptual reasoning abilities. They are primarily pattern recognition systems, trained on vast amounts of data without true understanding. Tann et al. [] found that while models like ChatGPT performed well on factual multiple-choice questions, they struggled with tasks requiring deeper insight, contextual adaptation, or logical reasoning. This limitation becomes a barrier in advanced cyber security topics, where understanding abstract principles, system-level thinking, and contextual decision-making is essential []. Relying too heavily on LLMs for instruction or self-study in such areas may hinder the development of higher-order cognitive skills and critical thinking in students.
7.3.6. Instructor Readiness and Training
A significant barrier to effective LLM integration in cyber security education is the readiness of instructors to use these tools thoughtfully and critically. Many educators are still unfamiliar with prompt engineering, AI-assisted content validation, and the pedagogical implications of LLMs. Without proper training, instructors may either misuse the technology or avoid it altogether, leading to inconsistent learning experiences for students. Moreover, the fast-evolving nature of LLMs demands ongoing professional development to keep up with their capabilities, limitations, and ethical considerations. Institutions must, therefore, invest in capacity-building initiatives that equip faculties with the skills and confidence to harness LLMs in ways that enrich, rather than compromise, the learning environment.
8. Open Issues and Future Directions
The integration of LLMs into cyber security and cyber security education is promising but still emerging, with several open issues and future directions that researchers must explore. These include the following:
8.1. Trustworthiness and Hallucination Mitigation
A major concern with deploying LLMs in cyber security is their tendency to produce hallucinated or factually incorrect information. In high-stakes domains, such as digital forensics, malware analysis, or threat detection, even small inaccuracies can compromise security protocols. This issue becomes more pronounced when LLMs are used without domain-specific grounding, leading to responses that may appear confident but lack technical validity or relevance. While tools like CyberRAG [,] have attempted to mitigate this through ontology-aware RAG, much work is still needed to enhance output reliability. Future research should focus on integrating structured knowledge sources into LLM pipelines, using adaptive validation for trustworthiness, and exploring hybrid models combining LLMs with symbolic reasoning to improve transparency and reduce errors, especially in cyber security education. Also, future research should involve user-centred evaluations and adversarial testing protocols to assess the reliability of hallucination mitigation techniques in real-world scenarios.
8.2. Privacy and Security
A critical issue in the deployment of LLMs across cyber security domains is the potential for unintended data leakage and exposure of sensitive information. These models, when integrated into threat analysis systems, incident response platforms, or automated security tools, often process logs, configurations, or code snippets that may contain proprietary or confidential data. The risk is particularly severe when using third-party LLMs via cloud-based APIs, where prompts and outputs could be cached or monitored, inadvertently exposing organizational vulnerabilities []. This challenge extends to training and simulation environments, including educational settings, where student data or simulated attack scenarios may also be compromised []. More research is needed to develop privacy-preserving LLM architecture for secure, local deployment, using techniques like federated learning, differential privacy, and zero-trust access controls. Additionally, AI governance frameworks must be created to ensure responsible LLM integration in both industry and academic cyber security settings. Future research should utilize empirical case studies and simulation-based testing with federated architectures to assess the effectiveness of privacy-preserving LLM deployments.
8.3. Explainability
A key limitation of current LLMs is their lack of explainability, which undermines their credibility in both teaching and professional cyber security workflows. Students and practitioners need more than just correct answers; they need to understand the reasoning behind those answers to develop critical thinking and investigative skills []. This black-box nature is especially problematic in scenarios like policy auditing, secure coding, or threat mitigation, where step-by-step justification is crucial []. Future research should prioritize the development of interpretable LLMs that can articulate their reasoning processes in human-readable formats. Techniques such as chain-of-thought prompting, attention heatmaps, or integration with argumentation frameworks could help users trace the model’s logic. Embedding rationale generation directly into educational tools would empower learners to question, refine, and validate AI guidance, fostering deeper understanding and accountability. Also, user studies and think-aloud protocols can be used to evaluate how effectively LLMs convey their reasoning and enhance user understanding.
8.4. Adversarial Robustness
LLMs are vulnerable to adversarial inputs and jailbreak prompts that can manipulate their outputs or bypass ethical safeguards, posing a serious risk in both cyber security education and practice []. This problem is particularly critical in environments like CTF challenges or simulation-based training, where learners may experiment with edge-case queries or adversarial code []. Such vulnerabilities not only risk misinformation but can also normalize harmful behaviour if left unchecked. Future research should focus on developing prompt sanitization, adversarial training, and red-teaming frameworks for education while incorporating safety constraints and ethical reasoning in LLMs. Proactive monitoring using anomaly detection can prevent misuse and foster responsible AI engagement in instructional settings. Also, future research should employ red-teaming exercises and controlled adversarial trials to evaluate model resilience and identify failure modes under attack conditions.
8.5. Standardization and Benchmarks
The lack of standardized evaluation benchmarks for LLMs in cyber security poses a barrier to rigorous comparison and progress tracking []. Without consistent metrics and datasets, it is difficult to assess which models perform best across tasks such as phishing detection, malware triage, or cyber threat intelligence extraction []. Researchers should focus on building community-driven benchmarks that reflect real-world complexity and adversarial conditions. This includes curating diverse and dynamic datasets, defining robust metrics for utility and robustness, and fostering open competitions to drive innovation. Community-driven design workshops and open benchmarking competitions could also facilitate the development of standardized, task-specific evaluation frameworks.
8.6. Regulatory Compliance
The deployment of LLMs in cyber security intersects with complex regulatory frameworks around data protection, intellectual property, and algorithmic accountability. Using LLMs for threat detection or compliance auditing may involve the processing of personally identifiable information (PII), proprietary threat signatures, or sensitive logs, raising serious concerns under laws like GDPR, CCPA, and NIS2 []. There is also growing pressure to ensure algorithmic transparency and explainability in automated decision-making systems. To navigate this evolving landscape, future research should integrate legal and ethical compliance into the LLM design lifecycle, focusing on data minimization, consent-aware data handling, and audit trails. Interdisciplinary design research and policy analysis are recommended to co-develop compliant AI systems with input from legal and cyber security experts. Collaboration between cyber security experts, legal scholars, and ethicists is essential for creating regulation-aligned AI tools and frameworks that support ethical innovation in LLM applications.
8.7. Standardized Curriculum Framework
Despite the promising use cases of LLMs in platforms like CyberQ [] and SENSAI [], there is currently no consensus on how best to integrate LLMs into cyber security curricula. Without standardized frameworks, their use remains fragmented and inconsistent, making it difficult to align AI-powered tools with established learning outcomes, accreditation standards, or ethical guidelines. This gap can lead to unequal learning experiences and complicate assessment design. To overcome this, future directions should focus on the co-design of curriculum-aligned LLM platforms in collaboration with educators, accrediting bodies, and industry partners. This includes developing instructional blueprints for scaffolded LLM use, ethical disclosure policies, and AI literacy modules that prepare students to work responsibly with generative tools. Creating a repository of best practices, assessment templates, and validated prompt libraries can further aid in seamless and equitable integration.
8.8. Augmented Intelligence
While LLMs offer the potential to improve the cyber security workforce shortage by automating routine tasks such as log analysis, report generation, and incident summarization, over-reliance on automation may contribute to skills degradation or foster a false sense of expertise among junior analysts []. This concern extends to education, where students might lean on generative models without fully developing critical thinking or hands-on technical skills. Future research should focus on augmented intelligence, designing LLMs as collaborative tools that enhance human judgment. In education, scaffolded learning designs should integrate LLMs for feedback while preserving student input. In industry, upskilling programs and transparent AI policies are crucial to maintaining analyst competence alongside AI augmentation.
8.9. Evaluation Frameworks
As the adoption of LLMs in cyber security continues to expand, there is still a lack of standardized frameworks for evaluating the maturity, reliability, and pedagogical effectiveness of these models across diverse operational and educational settings []. Current studies often report outcomes in isolated contexts without consistent metrics for impact, scalability, or real-world feasibility [,]. Future research should focus on developing comprehensive evaluation frameworks that account for technical performance (e.g., accuracy, robustness, and interpretability), deployment readiness (e.g., integration complexity and resource requirements), and educational value (e.g., learning outcomes, engagement, and inclusivity). Such frameworks would guide researchers in benchmarking tools, assist practitioners in selecting appropriate solutions, and support educators in aligning LLM-based tools with pedagogical goals. Cross-institutional studies, expert validation, and stakeholder engagement will be essential in building robust and context-aware assessment models.
Table 10 summarizes the open issues and suggested future research directions. In summary, the integration of LLMs into cyber security and education presents significant promise but also introduces several critical open issues that require further research. The key challenges include ensuring trustworthiness and mitigating hallucinations, protecting sensitive data through privacy-preserving architectures, and addressing the lack of transparency in model outputs. Adversarial vulnerabilities and misuse highlight the need for robust safety measures and ethical frameworks. The absence of standardized benchmarks and evaluation protocols hinders the consistent assessment of model performance, while regulatory compliance adds complexity due to evolving data protection laws. Furthermore, the lack of a standardized curriculum framework limits the effective integration of LLMs into education, risking fragmented learning experiences. Concerns also persist around over-reliance on AI, which may diminish critical thinking and practical skills. To address these, future work should focus on building interpretable, secure, regulation-compliant LLM systems that support augmented intelligence and include robust evaluation and curricular integration frameworks.
Table 10.
Summary of open issues and suggested future research directions for LLMs in cyber security.
9. Conclusions
This paper presents a comprehensive review of the current and emerging applications of LLMs in cyber security, spanning both operational and educational contexts. It analyses the landscape of recent research, identifies key challenges, and introduces a structured framework to map LLM usage across six major domains: vulnerability detection, anomaly detection, cyber threat intelligence, blockchain security, penetration testing, and digital forensics. The review also examines the role of LLMs in cyber security education, highlighting how these models are being leveraged for personalized learning, hands-on simulation, and awareness-building. The key contribution of this paper lies in its holistic and integrative perspective, which distinguishes it from the existing literature. Prior reviews have tended to focus either on narrow technical implementations or on isolated use cases, often overlooking broader operational workflows and educational impact. In contrast, this paper systematically categorizes LLM applications across a wide spectrum of cyber security functions and uniquely integrates the educational dimension, which is an area that remains underexplored in current surveys. Furthermore, the paper identifies open issues and proposes future research directions that emphasize hybrid AI–human approaches, privacy-preserving architectures, and domain-specific evaluation frameworks. By bridging technical depth with educational insight, the paper advances the State of the Art and lays a foundation for more responsible, effective, and inclusive LLM adoption in the cyber security field. To support this transition, policymakers should promote ethical AI literacy and governance standards, while curriculum designers should embed LLM tools through scaffolded, skill-aligned learning frameworks.
Future work should focus on the co-development of privacy-preserving and explainable LLM systems, particularly for deployment in regulated environments and educational settings. Institutions are encouraged to implement hybrid AI–human frameworks that tackle the automation capabilities of LLMs while simultaneously promoting critical thinking and informed decision-making. Within the educational domain, the development of standardized curriculum integration blueprints and ethically grounded usage policies is essential to ensure equitable access, transparency, and the preservation of essential skills. Additionally, the establishment of open, domain-specific benchmarks is crucial to enable reproducible research and robust performance validation, guiding the evolution of LLMs in alignment with technical excellence and broader societal responsibilities.
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflicts of interest.
References
- Scanlon, M.; Breitinger, F.; Hargreaves, C.; Hilgert, J.-N.; Sheppard, J. ChatGPT for Digital Forensic Investigation: The Good, The Bad, and The Unknown. arXiv 2023, arXiv:2307.10195. [Google Scholar] [CrossRef]
- Kouliaridis, V.; Karopoulos, G.; Kambourakis, G. Assessing the Effectiveness of LLMs in Android Application Vulnerability Analysis. In Attacks and Defenses for the Internet-of-Things; Springer: Cham, Switzerland, 2025; Volume 15397, pp. 139–154. [Google Scholar] [CrossRef]
- Jin, J.; Tang, B.; Ma, M.; Liu, X.; Wang, Y.; Lai, Q.; Yang, J.; Zhou, C. Crimson: Empowering Strategic Reasoning in Cybersecurity through Large Language Models. arXiv 2024, arXiv:2403.00878. [Google Scholar] [CrossRef]
- Jada, I.; Mayayise, T.O. The impact of artificial intelligence on organisational cyber security: An outcome of a systematic literature review. Data Inf. Manag. 2024, 8, 100063. [Google Scholar] [CrossRef]
- Gandhi, P.A.; Wudali, P.N.; Amaru, Y.; Elovici, Y.; Shabtai, A. SHIELD: APT Detection and Intelligent Explanation Using LLM. arXiv 2025, arXiv:2502.02342. [Google Scholar] [CrossRef]
- Shafee, S.; Bessani, A.; Ferreira, P.M. Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness. arXiv 2024, arXiv:2401.15127. [Google Scholar] [CrossRef]
- Sheng, Z.; Chen, Z.; Gu, S.; Huang, H.; Gu, G.; Huang, J. LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights. arXiv 2025, arXiv:2502.07049. [Google Scholar] [CrossRef]
- Mudassar Yamin, M.; Hashmi, E.; Ullah, M.; Katt, B. Applications of LLMs for Generating Cyber Security Exercise Scenarios. IEEE Access 2024, 12, 143806–143822. [Google Scholar] [CrossRef]
- Agrawal, G. AISecKG: Knowledge Graph Dataset for Cybersecurity Education. In Proceedings of the AAAI-MAKE 2023: Challenges Requiring the Combination of Machine Learning 2023, San Francisco, CA, USA, 27–29 March 2023; Available online: https://par.nsf.gov/biblio/10401616-aiseckg-knowledge-graph-dataset-cybersecurity-education (accessed on 22 April 2025).
- Kumar, P. Large language models (LLMs): Survey, technical frameworks, and future challenges. Artif. Intell. Rev. 2024, 57, 260. [Google Scholar] [CrossRef]
- Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly. High-Confid. Comput. 2024, 4, 100211. [Google Scholar] [CrossRef]
- Raiaan, M.A.K.; Mukta, M.S.H.; Fatema, K.; Fahad, N.M.; Sakib, S.; Mim, M.M.J.; Ahmad, J.; Ali, M.E.; Azam, S. A Review on Large Language Models: Architectures, Applications, Taxonomies, Open Issues and Challenges. IEEE Access 2024, 12, 26839–26874. [Google Scholar] [CrossRef]
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. arXiv 2024, arXiv:2307.06435. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar] [CrossRef]
- Schmid, L.; Hey, T.; Armbruster, M.; Corallo, S.; Fuchß, D.; Keim, J.; Liu, H.; Koziolek, A. Software Architecture Meets LLMs: A Systematic Literature Review. arXiv 2025, arXiv:2505.16697. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv 2021, arXiv:2101.03961. [Google Scholar]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
- Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv 2020, arXiv:1909.08053. [Google Scholar] [CrossRef]
- Sahami, M.; Dumais, S.; Heckerman, D.; Horvitz, E. A Bayesian Approach to Filtering Junk Email. AAAI Technical Report; 1998; pp. 55–62. Available online: https://aaai.org/papers/055-ws98-05-009/ (accessed on 22 April 2025).
- Laskov, P.; Düssel, P.; Schäfer, C.; Rieck, K. Learning Intrusion Detection: Supervised or Unsupervised? In Image Analysis and Processing—ICIAP 2005; Roli, F., Vitulano, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 50–57. [Google Scholar]
- Eskin, E.; Arnold, A.; Prerau, M.; Portnoy, L.; Stolfo, S. A Geometric Framework for Unsupervised Anomaly Detection; Barbará, D., Jajodia, S., Eds.; Springer US: Boston, MA, USA, 2002; Volume 6, pp. 77–101. [Google Scholar]
- Kim, G.; Lee, S.; Kim, S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Expert Syst. Appl. 2014, 41, 1690–1700. [Google Scholar] [CrossRef]
- Saxe, J.; Berlin, K. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 20–22 October 2015; pp. 11–20. [Google Scholar]
- Biggio, B.; Roli, F. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognit. 2018, 84, 317–331. [Google Scholar] [CrossRef]
- Raff, E.; Barker, J.; Sylvester, J.; Brandon, R.; Catanzaro, B.; Nicholas, C. Malware Detection by Eating a Whole EXE. arXiv 2017, arXiv:1710.09435. [Google Scholar] [CrossRef]
- Kolias, C.; Kambourakis, G.; Stavrou, A.; Gritzalis, S. Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset. IEEE Commun. Surv. Tutor. 2016, 18, 184–208. [Google Scholar] [CrossRef]
- Sommer, R.; Paxson, V. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 16–19 May 2010; pp. 305–316. [Google Scholar]
- García-Teodoro, P.; Díaz-Verdejo, J.; Maciá-Fernández, G.; Vázquez, E. Anomaly-based network intrusion detection: Techniques, systems and challenges. Comput. Secur. 2009, 28, 18–28. [Google Scholar] [CrossRef]
- Liu, X.; Yu, Z.; Zhang, Y.; Zhang, N.; Xiao, C. Automatic and Universal Prompt Injection Attacks against Large Language Models. arXiv 2024, arXiv:2403.04957. [Google Scholar] [CrossRef]
- Uddin, M.A.; Sarker, I.H. An Explainable Transformer-based Model for Phishing Email Detection: A Large Language Model Approach. arXiv 2024, arXiv:2402.13871. [Google Scholar] [CrossRef]
- Yan, B.; Li, K.; Xu, M.; Dong, Y.; Zhang, Y.; Ren, Z.; Cheng, X. On protecting the data privacy of Large Language Models (LLMs) and LLM agents: A literature review. High-Confid. Comput. 2025, 5, 100300. [Google Scholar] [CrossRef]
- Akinyele, A.R.; Ajayi, O.O.; Munyaneza, G.; Ibecheozor, U.H.B.; Gopakumar, N.; Akinyele, A.R.; Ajayi, O.O.; Munyaneza, G.; Ibecheozor, U.H.B.; Gopakumar, N. Leveraging Generative Artificial Intelligence (AI) for cybersecurity: Analyzing diffusion models in detecting and mitigating cyber threats. GSC Adv. Res. Rev. 2024, 21, 1–14. [Google Scholar] [CrossRef]
- Gupta, P.; Ding, B.; Guan, C.; Ding, D. Generative AI: A systematic review using topic modelling techniques. Data Inf. Manag. 2024, 8, 100066. [Google Scholar] [CrossRef]
- European Commission. Proposal for a Regulation Laying Down Harmonised Rules on Artificial Intelligence|Shaping Europe’s Digital Future. Available online: https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-laying-down-harmonised-rules-artificial-intelligence (accessed on 20 April 2025).
- Noever, D. Can Large Language Models Find And Fix Vulnerable Software? arXiv 2023, arXiv:2308.10345. [Google Scholar] [CrossRef]
- Nahmias, D.; Engelberg, G.; Klein, D.; Shabtai, A. Prompted Contextual Vectors for Spear-Phishing Detection. arXiv 2024, arXiv:2402.08309. [Google Scholar] [CrossRef]
- Sovrano, F.; Bauer, A.; Bacchelli, A. Large Language Models for In-File Vulnerability Localization Can Be “Lost in the End”. Proc. ACM Softw. Eng. 2025, 2, FSE041. [Google Scholar] [CrossRef]
- Isozaki, I.; Shrestha, M.; Console, R.; Kim, E. Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements. arXiv 2025, arXiv:2410.17141. [Google Scholar] [CrossRef]
- Happe, A.; Kaplan, A.; Cito, J. LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks. arXiv 2025, arXiv:2310.11409. [Google Scholar] [CrossRef]
- Zhang, C.; Liu, H.; Zeng, J.; Yang, K.; Li, Y.; Li, H. Prompt-Enhanced Software Vulnerability Detection Using ChatGPT. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 276–277. [Google Scholar]
- Zhou, X.; Cao, S.; Sun, X.; Lo, D. Large Language Model for Vulnerability Detection and Repair: Literature Review and the Road Ahead. arXiv 2024, arXiv:2404.02525. [Google Scholar] [CrossRef]
- Zhou, X.; Tran, D.-M.; Le-Cong, T.; Zhang, T.; Irsan, I.C.; Sumarlin, J.; Le, B.; Lo, D. Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection. arXiv 2024, arXiv:2407.16235. [Google Scholar] [CrossRef]
- Tamberg, K.; Bahsi, H. Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study. IEEE Access 2025, 13, 29698–29717. [Google Scholar] [CrossRef]
- Mahyari, A.A. Harnessing the Power of LLMs in Source Code Vulnerability Detection. arXiv 2024, arXiv:2408.03489. [Google Scholar] [CrossRef]
- Mao, Q.; Li, Z.; Hu, X.; Liu, K.; Xia, X.; Sun, J. Towards Explainable Vulnerability Detection with Large Language Models. arXiv 2025, arXiv:2406.09701. [Google Scholar] [CrossRef]
- Cheshkov, A.; Zadorozhny, P.; Levichev, R. Evaluation of ChatGPT Model for Vulnerability Detection. arXiv 2023, arXiv:2304.07232. [Google Scholar] [CrossRef]
- Purba, M.D.; Ghosh, A.; Radford, B.J.; Chu, B. Software Vulnerability Detection using Large Language Models. In Proceedings of the 2023 IEEE 34th International Symposium on Software Reliability Engineering Workshops (ISSREW), Florence, Italy, 9–12 October 2023; pp. 112–119. [Google Scholar]
- Dozono, K.; Gasiba, T.E.; Stocco, A. Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study. arXiv 2024, arXiv:2408.06428. [Google Scholar] [CrossRef]
- Berabi, B.; Gronskiy, A.; Raychev, V.; Sivanrupan, G.; Chibotaru, V.; Vechev, M. DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models. arXiv 2024, arXiv:2402.13291. [Google Scholar] [CrossRef]
- Sheng, Z.; Wu, F.; Zuo, X.; Li, C.; Qiao, Y.; Hang, L. LProtector: An LLM-driven Vulnerability Detection System. arXiv 2024, arXiv:2411.06493. [Google Scholar] [CrossRef]
- Khare, A.; Dutta, S.; Li, Z.; Solko-Breslin, A.; Alur, R.; Naik, M. Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities. arXiv 2024, arXiv:2311.16169. [Google Scholar] [CrossRef]
- Jensen, R.I.T.; Tawosi, V.; Alamir, S. Software Vulnerability and Functionality Assessment using LLMs. arXiv 2024, arXiv:2403.08429. [Google Scholar] [CrossRef]
- Shestov, A.; Levichev, R.; Mussabayev, R.; Maslov, E.; Cheshkov, A.; Zadorozhny, P. Finetuning Large Language Models for Vulnerability Detection. arXiv 2024, arXiv:2401.17010. [Google Scholar] [CrossRef]
- Li, H.; Hao, Y.; Zhai, Y.; Qian, Z. The Hitchhiker’s Guide to Program Analysis: A Journey with Large Language Models. arXiv 2023, arXiv:2308.00245. [Google Scholar] [CrossRef]
- Guo, Y.; Patsakis, C.; Hu, Q.; Tang, Q.; Casino, F. Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection. In Computer Security—ESORICS 2024: 29th European Symposium on Research in Computer Security, Bydgoszcz, Poland, 16–20 September 2024, Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2024; pp. 271–289. [Google Scholar]
- Wang, J.; Huang, Z.; Liu, H.; Yang, N.; Xiao, Y. DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism. arXiv 2023, arXiv:2309.15324. [Google Scholar] [CrossRef]
- Bakhshandeh, A.; Keramatfar, A.; Norouzi, A.; Chekidehkhoun, M.M. Using ChatGPT as a Static Application Security Testing Tool. arXiv 2023, arXiv:2308.14434. [Google Scholar] [CrossRef]
- Mathews, N.S.; Brus, Y.; Aafer, Y.; Nagappan, M.; McIntosh, S. LLbezpeky: Leveraging Large Language Models for Vulnerability Detection. arXiv 2024, arXiv:2401.01269. [Google Scholar] [CrossRef]
- Lin, Y.-Z.; Mamun, M.; Chowdhury, M.A.; Cai, S.; Zhu, M.; Latibari, B.S.; Gubbi, K.I.; Bavarsad, N.N.; Caputo, A.; Sasan, A.; et al. HW-V2W-Map: Hardware Vulnerability to Weakness Mapping Framework for Root Cause Analysis with GPT-assisted Mitigation Suggestion. arXiv 2023, arXiv:2312.13530. [Google Scholar] [CrossRef]
- Kaheh, M.; Kholgh, D.K.; Kostakos, P. Cyber Sentinel: Exploring Conversational Agents in Streamlining Security Tasks with GPT-4. arXiv 2023, arXiv:2309.16422. [Google Scholar] [CrossRef]
- Dolcetti, G.; Arceri, V.; Iotti, E.; Maffeis, S.; Cortesi, A.; Zaffanella, E. Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis. arXiv 2025, arXiv:2412.14841. [Google Scholar] [CrossRef]
- Feng, R.; Pearce, H.; Liguori, P.; Sui, Y. CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection. arXiv 2025, arXiv:2501.04510. [Google Scholar] [CrossRef]
- Lin, J.; Mohaisen, D. Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows. arXiv 2025, arXiv:2502.00064. [Google Scholar] [CrossRef]
- Torkamani, M.J.; NG, J.; Mehrotra, N.; Chandramohan, M.; Krishnan, P.; Purandare, R. Streamlining Security Vulnerability Triage with Large Language Models. arXiv 2025, arXiv:2501.18908. [Google Scholar] [CrossRef]
- Ghosh, R.; von Stockhausen, H.-M.; Schmitt, M.; Vasile, G.M.; Karn, S.K.; Farri, O. CVE-LLM: Ontology-Assisted Automatic Vulnerability Evaluation Using Large Language Models. arXiv 2025, arXiv:2502.15932. [Google Scholar] [CrossRef]
- Liu, Y.; Tao, S.; Meng, W.; Wang, J.; Ma, W.; Chen, Y.; Zhao, Y.; Yang, H.; Jiang, Y. Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies. In Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, Lisbon, Portugal, 15–16 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 35–46. [Google Scholar]
- Zhang, W.; Guan, X.; Yunhong, L.; Zhang, J.; Song, S.; Cheng, X.; Wu, Z.; Li, Z. Lemur: Log Parsing with Entropy Sampling and Chain-of-Thought Merging. arXiv 2025, arXiv:2402.18205. [Google Scholar] [CrossRef]
- Liu, J.; Huang, J.; Huo, Y.; Jiang, Z.; Gu, J.; Chen, Z.; Feng, C.; Yan, M.; Lyu, M.R. Log-based Anomaly Detection based on EVT Theory with feedback. arXiv 2023, arXiv:2306.05032. [Google Scholar] [CrossRef]
- Qi, J.; Huang, S.; Luan, Z.; Yang, S.; Fung, C.; Yang, H.; Qian, D.; Shang, J.; Xiao, Z.; Wu, Z. LogGPT: Exploring ChatGPT for Log-Based Anomaly Detection. In Proceedings of the 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Melbourne, Australia, 13–15 December 2023; pp. 273–280. [Google Scholar]
- Han, X.; Yuan, S.; Trabelsi, M. LogGPT: Log Anomaly Detection via GPT. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 5–18 December 2023; pp. 1117–1122. [Google Scholar]
- Jamal, S.; Wimmer, H. An Improved Transformer-based Model for Detecting Phishing, Spam, and Ham: A Large Language Model Approach. arXiv 2023, arXiv:2311.04913. [Google Scholar] [CrossRef]
- Si, S.; Wu, Y.; Tang, L.; Zhang, Y.; Wosik, J.; Su, Q. Evaluating the Performance of ChatGPT for Spam Email Detection. arXiv 2025, arXiv:2402.15537. [Google Scholar] [CrossRef]
- Heiding, F.; Schneier, B.; Vishwanath, A.; Bernstein, J.; Park, P.S. Devising and Detecting Phishing: Large Language Models vs. Smaller Human Models. arXiv 2023, arXiv:2308.12287. [Google Scholar] [CrossRef]
- Vörös, T.; Bergeron, S.P.; Berlin, K. Web Content Filtering Through Knowledge Distillation of Large Language Models. In Proceedings of the 2023 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Venice, Italy, 26–29 October 2023; pp. 357–361. [Google Scholar]
- Guastalla, M.; Li, Y.; Hekmati, A.; Krishnamachari, B. Application of Large Language Models to DDoS Attack Detection. In Security and Privacy in Cyber-Physical Systems and Smart Vehicles; Chen, Y., Lin, C.-W., Chen, B., Zhu, Q., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 83–99. [Google Scholar]
- Alnegheimish, S.; Nguyen, L.; Berti-Equille, L.; Veeramachaneni, K. Large language models can be zero-shot anomaly detectors for time series? arXiv 2024, arXiv:2405.14755. [Google Scholar] [CrossRef]
- Gu, Z.; Zhu, B.; Zhu, G.; Chen, Y.; Tang, M.; Wang, J. AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models. arXiv 2023, arXiv:2308.15366. [Google Scholar] [CrossRef]
- Elhafsi, A.; Sinha, R.; Agia, C.; Schmerling, E.; Nesnas, I.; Pavone, M. Semantic Anomaly Detection with Large Language Models. arXiv 2023, arXiv:2305.11307. [Google Scholar] [CrossRef]
- Li, A.; Zhao, Y.; Qiu, C.; Kloft, M.; Smyth, P.; Rudolph, M.; Mandt, S. Anomaly Detection of Tabular Data Using LLMs. arXiv 2024, arXiv:2406.16308. [Google Scholar] [CrossRef]
- Guan, W.; Cao, J.; Qian, S.; Gao, J.; Ouyang, C. LogLLM: Log-based Anomaly Detection Using Large Language Models. arXiv 2025, arXiv:2411.08561. [Google Scholar] [CrossRef]
- Su, J.; Jiang, C.; Jin, X.; Qiao, Y.; Xiao, T.; Ma, H.; Wei, R.; Jing, Z.; Xu, J.; Lin, J. Large Language Models for Forecasting and Anomaly Detection: A Systematic Literature Review. arXiv 2024, arXiv:2402.10350. [Google Scholar] [CrossRef]
- Ali, T.; Kostakos, P. HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs). arXiv 2023, arXiv:2309.16021. [Google Scholar] [CrossRef]
- Liu, X.; Liang, J.; Yan, Q.; Jang, J.; Mao, S.; Ye, M.; Jia, J.; Xi, Z. Cyber Defense Reinvented: Large Language Models as Threat Intelligence Copilots. arXiv 2025, arXiv:2502.20791. [Google Scholar] [CrossRef]
- Wang, Y.; Yang, X. Design and implementation of a distributed security threat detection system integrating federated learning and multimodal LLM. arXiv 2025, arXiv:2502.17763. [Google Scholar] [CrossRef]
- Qian, X.; Zheng, X.; He, Y.; Yang, S.; Cavallaro, L. LAMD: Context-driven Android Malware Detection and Classification with LLMs. arXiv 2025, arXiv:2502.13055. [Google Scholar] [CrossRef]
- Benabderrahmane, S.; Valtchev, P.; Cheney, J.; Rahwan, T. APT-LLM: Embedding-Based Anomaly Detection of Cyber Advanced Persistent Threats Using Large Language Models. arXiv 2025, arXiv:2502.09385. [Google Scholar] [CrossRef]
- Meguro, R.; Chong, N.S.T. AdaPhish: AI-Powered Adaptive Defense and Education Resource Against Deceptive Emails. In Proceedings of the 2025 IEEE 4th International Conference on AI in Cybersecurity (ICAIC), Houston, TX, USA, 5–7 February 2025; pp. 1–7. [Google Scholar]
- Akhtar, S.; Khan, S.; Parkinson, S. LLM-based event log analysis techniques: A survey. arXiv 2025, arXiv:2502.00677. [Google Scholar] [CrossRef]
- Walton, B.J.; Khatun, M.E.; Ghawaly, J.M.; Ali-Gombe, A. Exploring Large Language Models for Semantic Analysis and Categorization of Android Malware. arXiv 2025, arXiv:2501.04848. [Google Scholar] [CrossRef]
- Mitra, S.; Neupane, S.; Chakraborty, T.; Mittal, S.; Piplai, A.; Gaur, M.; Rahimi, S. LOCALINTEL: Generating Organizational Threat Intelligence from Global and Local Cyber Knowledge. arXiv 2025, arXiv:2401.10036. [Google Scholar] [CrossRef]
- Perrina, F.; Marchiori, F.; Conti, M.; Verde, N.V. AGIR: Automating Cyber Threat Intelligence Reporting with Natural Language Generation. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; pp. 3053–3062. [Google Scholar]
- Fayyazi, R.; Yang, S.J. On the Uses of Large Language Models to Interpret Ambiguous Cyberattack Descriptions. arXiv 2023, arXiv:2306.14062. [Google Scholar] [CrossRef]
- Schwartz, Y.; Benshimol, L.; Mimran, D.; Elovici, Y.; Shabtai, A. LLMCloudHunter: Harnessing LLMs for Automated Extraction of Detection Rules from Cloud-Based CTI. arXiv 2024, arXiv:2407.05194. [Google Scholar] [CrossRef]
- Clairoux-Trepanier, V.; Beauchamp, I.-M.; Ruellan, E.; Paquet-Clouston, M.; Paquette, S.-O.; Clay, E. The Use of Large Language Models (LLM) for Cyber Threat Intelligence (CTI) in Cybercrime Forums. arXiv 2024, arXiv:2408.03354. [Google Scholar] [CrossRef]
- Siracusano, G.; Sanvito, D.; Gonzalez, R.; Srinivasan, M.; Kamatchi, S.; Takahashi, W.; Kawakita, M.; Kakumaru, T.; Bifulco, R. Time for aCTIon: Automated Analysis of Cyber Threat Intelligence in the Wild. arXiv 2023, arXiv:2307.10214. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models’, presented at the International Conference on Learning Representations. 2022. Available online: https://openreview.net/forum?id=nZeVKeeFYf9 (accessed on 21 April 2025).
- Zhang, Y.; Du, T.; Ma, Y.; Wang, X.; Xie, Y.; Yang, G.; Lu, Y.; Chang, E.-C. AttacKG+:Boosting Attack Knowledge Graph Construction with Large Language Models. arXiv 2024, arXiv:2405.04753. [Google Scholar] [CrossRef]
- Fieblinger, R.; Alam, M.T.; Rastogi, N. Actionable Cyber Threat Intelligence using Knowledge Graphs and Large Language Models. arXiv 2024, arXiv:2407.02528. [Google Scholar] [CrossRef]
- Park, Y.; You, W. A Pretrained Language Model for Cyber Threat Intelligence. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, Singapore, 6–10 December 2023; Wang, M., Zitouni, I., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 113–122. [Google Scholar]
- Gao, P.; Liu, X.; Choi, E.; Ma, S.; Yang, X.; Song, D. ThreatKG: An AI-Powered System for Automated Open-Source Cyber Threat Intelligence Gathering and Management. arXiv 2024, arXiv:2212.10388. [Google Scholar] [CrossRef]
- Sorokoletova, O.; Antonioni, E.; Colò, G. Towards a scalable AI-driven framework for data-independent Cyber Threat Intelligence Information Extraction. arXiv 2025, arXiv:2501.06239. [Google Scholar] [CrossRef]
- Wu, Z.; Tang, F.; Zhao, M.; Li, Y. KGV: Integrating Large Language Models with Knowledge Graphs for Cyber Threat Intelligence Credibility Assessment. arXiv 2024, arXiv:2408.08088. [Google Scholar] [CrossRef]
- Fayyazi, R.; Taghdimi, R.; Yang, S.J. Advancing TTP Analysis: Harnessing the Power of Large Language Models with Retrieval Augmented Generation. arXiv 2024, arXiv:2401.00280. [Google Scholar] [CrossRef]
- Singla, T.; Anandayuvaraj, D.; Kalu, K.G.; Schorlemmer, T.R.; Davis, J.C. An Empirical Study on Using Large Language Models to Analyze Software Supply Chain Security Failures. In Proceedings of the 2023 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses, Copenhagen, Denmark, 30 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 5–15. [Google Scholar]
- Zhang, T.; Irsan, I.C.; Thung, F.; Lo, D. CUPID: Leveraging ChatGPT for More Accurate Duplicate Bug Report Detection. arXiv 2024, arXiv:2308.10022. [Google Scholar] [CrossRef]
- Tseng, P.; Yeh, Z.; Dai, X.; Liu, P. Using LLMs to Automate Threat Intelligence Analysis Workflows in Security Operation Centers. arXiv 2024, arXiv:2407.13093. [Google Scholar] [CrossRef]
- Rajapaksha, S.; Rani, R.; Karafili, E. A RAG-Based Question-Answering Solution for Cyber-Attack Investigation and Attribution. arXiv 2024, arXiv:2408.06272. [Google Scholar] [CrossRef]
- Shah, S.; Parast, F.K. AI-Driven Cyber Threat Intelligence Automation. arXiv 2024, arXiv:2410.20287. [Google Scholar] [CrossRef]
- Alevizos, L.; Dekker, M. Towards an AI-Enhanced Cyber Threat Intelligence Processing Pipeline. Electronics 2024, 13, 2021. [Google Scholar] [CrossRef]
- Daniel, N.; Kaiser, F.K.; Giladi, S.; Sharabi, S.; Moyal, R.; Shpolyansky, S.; Murillo, A.; Elyashar, A.; Puzis, R. Labeling NIDS Rules with MITRE ATT&CK Techniques: Machine Learning vs. Large Language Models. arXiv 2024, arXiv:2412.10978. [Google Scholar] [CrossRef]
- Alturkistani, H.; Chuprat, S. Artificial Intelligence and Large Language Models in Advancing Cyber Threat Intelligence: A Systematic Literature Review. Res. Sq. 2024, 1–49. [Google Scholar] [CrossRef]
- Paul, S.; Alemi, F.; Macwan, R. LLM-Assisted Proactive Threat Intelligence for Automated Reasoning. arXiv 2025, arXiv:2504.00428. [Google Scholar] [CrossRef]
- Chen, C.; Su, J.; Chen, J.; Wang, Y.; Bi, T.; Yu, J.; Wang, Y.; Lin, X.; Chen, T.; Zheng, Z. When ChatGPT Meets Smart Contract Vulnerability Detection: How Far Are We? ACM Trans. Softw. Eng. Methodol. 2024, 34, 3702973. [Google Scholar] [CrossRef]
- David, I.; Zhou, L.; Qin, K.; Song, D.; Cavallaro, L.; Gervais, A. Do you still need a manual smart contract audit? arXiv 2023, arXiv:2306.12338. [Google Scholar] [CrossRef]
- Gai, Y.; Zhou, L.; Qin, K.; Song, D.; Gervais, A. Blockchain Large Language Models. arXiv 2023, arXiv:2304.12749. [Google Scholar] [CrossRef]
- Hu, S.; Huang, T.; İlhan, F.; Tekin, S.F.; Liu, L. Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives. arXiv 2023, arXiv:2310.01152. [Google Scholar] [CrossRef]
- Sun, Y.; Wu, D.; Xue, Y.; Liu, H.; Wang, H.; Xu, Z.; Xie, X.; Liu, Y. GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, Lisbon, Portugal, 14–20 April 2024; pp. 1–13. [Google Scholar]
- Hossain, S.M.M.; Altarawneh, A.; Roberts, J. Leveraging Large Language Models and Machine Learning for Smart Contract Vulnerability Detection. In Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2025; pp. 00577–00583. [Google Scholar]
- Yu, L.; Chen, S.; Yuan, H.; Wang, P.; Huang, Z.; Zhang, J.; Shen, C.; Zhang, F.; Yang, L.; Ma, J. Smart-LLaMA: Two-Stage Post-Training of Large Language Models for Smart Contract Vulnerability Detection and Explanation. arXiv 2024, arXiv:2411.06221. [Google Scholar] [CrossRef]
- Zaazaa, O.; Bakkali, H.E. SmartLLMSentry: A Comprehensive LLM Based Smart Contract Vulnerability Detection Framework. J. Metaverse 2024, 4, 126–137. [Google Scholar] [CrossRef]
- Wei, Z.; Sun, J.; Zhang, Z.; Zhang, X.; Li, M.; Hou, Z. LLM-SmartAudit: Advanced Smart Contract Vulnerability Detection. arXiv 2024, arXiv:2410.09381. [Google Scholar] [CrossRef]
- He, Z.; Li, Z.; Yang, S.; Ye, H.; Qiao, A.; Zhang, X.; Luo, X.; Chen, T. Large Language Models for Blockchain Security: A Systematic Literature Review. arXiv 2025, arXiv:2403.14280. [Google Scholar] [CrossRef]
- Ding, H.; Liu, Y.; Piao, X.; Song, H.; Ji, Z. SmartGuard: An LLM-enhanced framework for smart contract vulnerability detection. Expert Syst. Appl. 2025, 269, 126479. [Google Scholar] [CrossRef]
- Ma, W.; Wu, D.; Sun, Y.; Wang, T.; Liu, S.; Zhang, J.; Xue, Y.; Liu, Y. Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications. arXiv 2024, arXiv:2403.16073. [Google Scholar] [CrossRef]
- Bu, J.; Li, W.; Li, Z.; Zhang, Z.; Li, X. Enhancing Smart Contract Vulnerability Detection in DApps Leveraging Fine-Tuned LLM. arXiv 2025, arXiv:2504.05006. [Google Scholar] [CrossRef]
- Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. arXiv 2024, arXiv:2308.06782. [Google Scholar] [CrossRef]
- Fang, R.; Bindu, R.; Gupta, A.; Kang, D. LLM Agents can Autonomously Exploit One-day Vulnerabilities. arXiv 2024, arXiv:2404.08144. [Google Scholar] [CrossRef]
- Temara, S. Maximizing Penetration Testing Success with Effective Reconnaissance Techniques using ChatGPT. arXiv 2023, arXiv:2307.06391. [Google Scholar] [CrossRef]
- Deng, G.; Liu, Y.; Robotics, A.; Klagenfurt, A.-A.-U.; Liu, P.; Li, Y.; Zhang, T.; Liu, Y.; Klagenfurt, A.-A.-U.; Rass, S. PentestGPt: Evaluating and Harnessing Large Language Models for Automated Penetration Testing. arXiv 2024, arXiv:2308.06782v2. [Google Scholar]
- Happe, A.; Cito, J. Getting pwn’d by AI: Penetration Testing with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023; ACM: San Francisco CA USA, 2023; pp. 2082–2086. [Google Scholar]
- Bianou, S.G.; Batogna, R.G. PENTEST-AI, an LLM-Powered Multi-Agents Framework for Penetration Testing Automation Leveraging Mitre Attack. In Proceedings of the 2024 IEEE International Conference on Cyber Security and Resilience (CSR), London, UK, 2–4 September 2024; pp. 763–770. [Google Scholar]
- Yin, Z.; Wang, Z.; Xu, W.; Zhuang, J.; Mozumder, P.; Smith, A.; Zhang, W. Digital Forensics in the Age of Large Language Models. arXiv 2025, arXiv:2504.02963. [Google Scholar] [CrossRef]
- Wickramasekara, A.; Breitinger, F.; Scanlon, M. Exploring the Potential of Large Language Models for Improving Digital Forensic Investigation Efficiency. Forensic Sci. Int. Digit. Investig. 2025, 52, 301859. [Google Scholar] [CrossRef]
- Oh, D.B.; Kim, D.; Kim, D.; Kim, H.K. volGPT: Evaluation on triaging ransomware process in memory forensics with Large Language Model. Forensic Sci. Int. Digit. Investig. 2024, 49, 301756. [Google Scholar] [CrossRef]
- Voigt, L.L.; Freiling, F.; Hargreaves, C.J. Re-imagen: Generating coherent background activity in synthetic scenario-based forensic datasets using large language models. Forensic Sci. Int. Digit. Investig. 2024, 50, 301805. [Google Scholar] [CrossRef]
- Michelet, G.; Breitinger, F. ChatGPT, Llama, can you write my report? An experiment on assisted digital forensics reports written using (local) large language models. Forensic Sci. Int. Digit. Investig. 2024, 48, 301683. [Google Scholar] [CrossRef]
- Bhandarkar, A.; Wilson, R.; Swarup, A.; Zhu, M.; Woodard, D. Is the Digital Forensics and Incident Response Pipeline Ready for Text-Based Threats in LLM Era? arXiv 2024, arXiv:2407.17870. [Google Scholar] [CrossRef]
- Loumachi, F.Y.; Ghanem, M.C.; Ferrag, M.A. GenDFIR: Advancing Cyber Incident Timeline Analysis Through Retrieval Augmented Generation and Large Language Models. arXiv 2024, arXiv:2409.02572. [Google Scholar] [CrossRef]
- Zhou, H.; Xu, W.; Dehlinger, J.; Chakraborty, S.; Deng, L. An LLM-driven Approach to Gain Cybercrime Insights with Evidence Networks. In Proceedings of the 2024 USENIX Symposium on Usable Privacy and Security (SOUPS), Philadelphia, PA, USA, 11–13 August 2024; pp. 1–3. [Google Scholar]
- Kim, K.; Lee, C.; Bae, S.; Choi, J.; Kang, W. Digital Forensics in Law Enforcement: A Case Study of Llm-Driven Evidence Analysis; Social Science Research Network: Rochester, NY, USA, 2025; p. 5110258. [Google Scholar] [CrossRef]
- Sharma, B.; Ghawaly, J.; McCleary, K.; Webb, A.M.; Baggili, I. ForensicLLM: A local large language model for digital forensics. Forensic Sci. Int. Digit. Investig. 2025, 52, 301872. [Google Scholar] [CrossRef]
- Xu, E.; Zhang, W.; Xu, W. Transforming Digital Forensics with Large Language Models: Unlocking Automation, Insights, and Justice. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, Boise, IH, USA, 22–25 October 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 5543–5546. [Google Scholar]
- Cho, S.-H.; Kim, D.; Kwon, H.-C.; Kim, M. Exploring the potential of large language models for author profiling tasks in digital text forensics. Forensic Sci. Int. Digit. Investig. 2024, 50, 301814. [Google Scholar] [CrossRef]
- Rahali, A.; Akhloufi, M.A. MalBERTv2: Code Aware BERT-Based Model for Malware Identification. Big Data Cogn. Comput. 2023, 7, 60. [Google Scholar] [CrossRef]
- Rondanini, C.; Carminati, B.; Ferrari, E.; Gaudiano, A.; Kundu, A. Malware Detection at the Edge with Lightweight LLMs: A Performance Evaluation. arXiv 2025, arXiv:2503.04302. [Google Scholar] [CrossRef]
- Pearce, H.; Ahmad, B.; Tan, B.; Dolan-Gavitt, B.; Karri, R. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. arXiv 2021, arXiv:2108.09293. [Google Scholar] [CrossRef]
- Zeng, J.; Huang, R.; Malik, W.; Yin, L.; Babic, B.; Shacham, D.; Yan, X.; Yang, J.; He, Q. Large Language Models for Social Networks: Applications, Challenges, and Solutions. arXiv 2024, arXiv:2401.02575. [Google Scholar] [CrossRef]
- Wudali, P.N.; Kravchik, M.; Malul, E.; Gandhi, P.A.; Elovici, Y.; Shabtai, A. Rule-ATT&CK Mapper (RAM): Mapping SIEM Rules to TTPs Using LLMs. arXiv 2025, arXiv:2502.02337. [Google Scholar] [CrossRef]
- Ferrag, M.A.; Alwahedi, F.; Battah, A.; Cherif, B.; Mechri, A.; Tihanyi, N.; Bisztray, T.; Debbah, M. Generative AI in Cybersecurity: A Comprehensive Review of LLM Applications and Vulnerabilities. arXiv 2025, arXiv:2405.12750. [Google Scholar] [CrossRef]
- General Data Protection Regulation (GDPR)—Legal Text. General Data Protection Regulation (GDPR). Available online: https://gdpr-info.eu/ (accessed on 23 April 2025).
- Cyber Security Skills in the UK Labour Market 2023. Available online: https://www.gov.uk/government/publications/cyber-security-skills-in-the-uk-labour-market-2023 (accessed on 20 April 2025).
- CyBOK—The Cyber Security Body of Knowledge. Available online: https://www.cybok.org/ (accessed on 20 April 2025).
- National Cyber Security Centre NCSC-Certified Degrees. Available online: https://www.ncsc.gov.uk/information/ncsc-certified-degrees (accessed on 20 April 2025).
- Zhao, C.; Agrawal, G.; Kumarage, T.; Tan, Z.; Deng, Y.; Chen, Y.-C.; Liu, H. Ontology-Aware RAG for Improved Question-Answering in Cybersecurity Education. arXiv 2024, arXiv:2412.14191. [Google Scholar] [CrossRef]
- Agrawal, G.; Pal, K.; Deng, Y.; Liu, H.; Chen, Y.-C. CyberQ: Generating Questions and Answers for Cybersecurity Education Using Knowledge Graph-Augmented LLMs. Proc. AAAI Conf. Artif. Intell. 2024, 38, 23164–23172. [Google Scholar] [CrossRef]
- Zhao, C.; Maria, R.D.; Kumarage, T.; Chaudhary, K.S.; Agrawal, G.; Li, Y.; Park, J.; Deng, Y.; Chen, Y.-C.; Liu, H. CyberBOT: Towards Reliable Cybersecurity Education via Ontology-Grounded Retrieval Augmented Generation. arXiv 2025, arXiv:2504.00389. [Google Scholar] [CrossRef]
- Shepherd, C. Generative AI Misuse Potential in Cyber Security Education: A Case Study of a UK Degree Program. arXiv 2025, arXiv:2501.12883. [Google Scholar] [CrossRef]
- Ohm, M.; Bungartz, C.; Boes, F.; Meier, M. Assessing the Impact of Large Language Models on Cybersecurity Education: A Study of ChatGPT’s Influence on Student Performance. In Proceedings of the 19th International Conference on Availability, Reliability and Security, Vienna, Austria, 30 July 2024–2 August 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–7. [Google Scholar]
- Nizon-Deladoeuille, M.; Stefánsson, B.; Neukirchen, H.; Welsh, T. Towards Supporting Penetration Testing Education with Large Language Models: An Evaluation and Comparison. In Proceedings of the 2024 11th International Conference on Social Networks Analysis, Management and Security (SNAMS), Gran Canaria, Spain, 9–11 December 2024; pp. 227–229. [Google Scholar]
- Al-Dhamari, N.; Clarke, N. GPT-Enabled Cybersecurity Training: A Tailored Approach for Effective Awareness. arXiv 2024, arXiv:2405.04138. [Google Scholar] [CrossRef]
- Wang, T.; Zhou, N.; Chen, Z. CyberMentor: AI Powered Learning Tool Platform to Address Diverse Student Needs in Cybersecurity Education. arXiv 2025, arXiv:2501.09709. [Google Scholar] [CrossRef]
- Tann, W.; Liu, Y.; Sim, J.H.; Seah, C.M.; Chang, E.-C. Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions. arXiv 2023, arXiv:2308.10443. [Google Scholar] [CrossRef]
- Shao, M.; Chen, B.; Jancheska, S.; Dolan-Gavitt, B.; Garg, S.; Karri, R.; Shafique, M. An Empirical Evaluation of LLMs for Solving Offensive Security Challenges. arXiv 2024, arXiv:2402.11814. [Google Scholar] [CrossRef]
- Nelson, C.; Doupé, A.; Shoshitaishvili, Y. SENSAI: Large Language Models as Applied Cybersecurity Tutors. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1, Pittsburgh, PA, USA, 26 February–1 March 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 833–839. [Google Scholar]
- Joon; Simmons, R. Innovating Cybersecurity Education Through AI-augmented Teaching. ECCWS 2024, 23, 476–482. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).