HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs)

Ali, Tarek; Kostakos, Panos; Sheikhi, Saeid

doi:10.3390/telecom7030073

Open AccessFeature PaperEditor’s ChoiceArticle

HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs)

by

Tarek Ali

,

Panos Kostakos

^*

and

Saeid Sheikhi

Center for Applied Computing, University of Oulu, 90014 Oulu, Finland

^*

Author to whom correspondence should be addressed.

Telecom 2026, 7(3), 73; https://doi.org/10.3390/telecom7030073

Submission received: 1 April 2026 / Revised: 13 May 2026 / Accepted: 20 May 2026 / Published: 8 June 2026

Download

Browse Figures

Versions Notes

Abstract

Machine learning (ML) methods for network anomaly detection are emerging as effective proactive strategies in threat hunting, substantially reducing the time required for threat detection and response. However, the challenges in training and maintaining ML models, coupled with frequent false positives, diminish their acceptance and trustworthiness. In response, Explainable AI (XAI) techniques have been introduced to enable cybersecurity operations teams to assess alerts generated by AI systems more confidently. Despite these advancements, XAI tools have encountered limited acceptance from incident responders and have struggled to meet the decision-making needs of both analysts and model maintainers. Large Language Models (LLMs) offer a unique approach to tackling these challenges. Through tuning, LLMs have the ability to discern patterns across vast amounts of information and meet varying functional requirements. In this research, we introduce the development of HuntGPT, a specialized intrusion detection dashboard created to implement a Random Forest classifier trained utilizing the KDD99 dataset. The tool incorporates XAI frameworks like SHAP and Lime, enhancing user-friendliness and intuitiveness of the model. When combined with a GPT-3.5 Turbo conversational agent, HuntGPT aims to deliver detected threats in an easily explainable format, emphasizing user understanding and offering a smooth interactive experience. We investigate the system’s comprehensive architecture and its diverse components, assess the prototype’s technical accuracy using the Certified Information Security Manager (CISM) Practice Exams, and analyze the quality of response readability across six unique metrics. Our results indicate that conversational agents, underpinned by LLM technology and integrated with XAI, can enable a robust mechanism for generating explainable and actionable AI solutions, especially within the realm of intrusion detection systems.

Keywords:

intrusion detection; security; ChatGPT; XAI; Chatbots; OpenAI; explainable AI; security awareness

1. Introduction

In recent decades, there has been a substantial escalation in cyberattacks targeting critical and enterprise infrastructure. By 2025, anticipated annual financial damages from these cyber-attacks are projected to reach $10.5 trillion USD, a substantial leap from $3 trillion USD recorded in 2015 [1]. To counteract the evolving cyber threats, the National Institute of Standards and Technology (NIST) introduced a Cybersecurity Framework in 2014. This framework prescribes iterative cybersecurity policies for identification, protection, detection, response, and recovery processes related to cyber incidents [2].

In this backdrop, human experts play a vital role in analyzing extensive telemetry data and Indicators of Compromise (IoC) to isolate real threats [3,4]. Consequently, building on the foundation laid by the NIST Framework, an extensive ecosystem, comprising tools, methodologies, and techniques, has been established to enable the proactive identification of threats, a process referred to as Cyber Threat Hunting (CTH) [5,6,7]. Threat-hunting tools enable analysts to apply their specialized knowledge to formulate and test threat hypotheses by analyzing system telemetry and threat intelligence from external sources [3].

Machine learning-based anomaly detection tools are particularly noteworthy, as they are designed to uncover both known and unknown threats. Generally, network anomalies are categorized into performance-related anomalies, such as file server failures and transient congestion, and security-related anomalies, including Denial-of-Service attacks, spoofing, and network intrusions [8]. Evidently, the infusion of machine learning into CTH tools has notably increased the incidence of false positives in real-world operational environments [9].

Explainable Artificial Intelligence (XAI) is at the forefront of several proposed conceptual enhancements to existing cybersecurity frameworks, aiming to address the challenges posed by the integration of machine learning [10]. A key advancement in this domain is the evolution of Cybertrust frameworks, emphasizing the need to integrate explainable, interpretable, and actionable AI into cybersecurity operations [11]. Nevertheless, the swift advancements in the domain may lead to information overload for incident responders and ML model maintainers, potentially resulting in a sluggish adoption rate [12].

Large Language Models (LLMs), which are driving the rapid development of autonomous agents, show significant potential to transform the cybersecurity landscape. Their ability to seamlessly integrate diverse AI tasks and adapt to various use cases positions them as versatile solutions that can boost XAI adoption and reduce operational costs. Specifically, large language models, and conversational agents in particular, have demonstrated outstanding capabilities for actionable AI applications, which are essential for providing response suggestions to threat responders.

In this paper, we introduce a novel prototype, HuntGPT, designed to integrate actionable, interpretable, and explainable AI into cybersecurity operations. The prototype is designed to perform analysis on network traffic, utilizing a Random Forest classifier as the anomaly detection model. This model, trained using the KDD99 dataset [13], is deployed to classify the acquired packets systematically. We utilize explainability frameworks such as SHAP and LIME, in pair with a conversational agent powered by the Language Learning Model API from OpenAI, which is predominantly known to users as ChatGPT. The prototype is evaluated for technical accuracy using the Certified Information Security Manager (CISM) certification [14], and appraised across six metrics to gauge response readability.

The contribution of this paper is system-level: we integrate an ML classifier, SHAP/LIME explanations, and an LLM-based conversational interface into a single analyst-facing dashboard, in which the LLM is conditioned on a structured feature-attribution summary, the raw XAI plots are shown alongside its narrative, and iterative dialogue serves as the accessibility mechanism for non-expert users. We do not claim a new ML algorithm, XAI method, or LLM; the novelty lies in the integration architecture and its first feasibility evaluation.

The remainder of the paper is structured as follows: Section 2 reviews the relevant literature and summarizes findings from related studies. Section 3 details the system architecture and development process. Section 4 presents the evaluation and results. Section 5 discusses limitations, threats to validity, and security considerations. Finally, Section 6 concludes the paper and outlines future research directions.

2. Background

Adopting efficient cybersecurity strategies is challenging for small and medium-sized enterprises (SMEs) due to constraints such as limited budgets, a lack of skilled personnel, and insufficient time allocated to cybersecurity planning [15]. Maintaining a fully staffed Security Operations Center (SOC)—typically requiring multiple tiers of analysts, malware engineers, forensic specialists, and dedicated tooling, alongside SIEM, EDR, and forensic platforms—is therefore beyond the reach of many smaller organizations. This situation has motivated extensive research into automated, machine-learning-driven detection systems that can reduce analyst workload and lower the barrier of entry for advanced threat hunting. Table 1 provides a concise synthesis of representative machine-learning-based network anomaly detection studies discussed in this section, situating our work within the broader research landscape. The remainder of the section explores three key enabling technology areas that contribute to advancements in cybersecurity operations: network anomaly detection, explainable artificial intelligence, and conversational agents.

2.1. Network Anomaly Detection

The purpose of an anomaly detection mechanism is to analyze, understand, and characterize network traffic behavior, as well as to identify or classify abnormal traffic instances, such as malicious attempts, from normal traffic. Thus, from a machine learning perspective, the anomaly detection problem is a classification problem [24]. Over the years, detection systems have undergone considerable evolution, leading to the development of diverse approaches and deployment methods, including in fifth-generation (5G) communication networks and decentralized architectures.

Several Machine Learning (ML) techniques have been extensively applied in the domain of network anomaly detection, encompassing both supervised and unsupervised algorithms. Yihunie et al. [16] reviewed five representative algorithms: Stochastic Gradient Descent, Random Forests, Logistic Regression, Support Vector Machine, and Sequential Model, applying them to the NSL-KDD dataset. The empirical results from their study indicated that the Random Forest Classifier surpassed the other examined classifiers in terms of performance. Optimization-based approaches have also been explored for improving ML-IDS performance; for example, Painted Wolf Optimization has been applied to train an intrusion detection model and demonstrated competitive accuracy and F-measure in abnormal-traffic detection [25].

Eltanbouly et al. [17] introduce a hybrid system that combines the Random Forest and K-means algorithms. The proposed system is bifurcated into two distinct phases. The initial phase, known as the online phase, focuses on misuse detection by leveraging the Random Forest algorithm, followed by the offline phase, which categorizes random attacks through the use of the weighted K-means algorithm. Similarly, Zhao et al. [18] proposed a Multi-Task Deep Neural Network in Federated Learning (MT-DNN-FL) to simultaneously detect network anomalies, recognize VPN (Tor) traffic, and classify traffic, while preserving data confidentiality. Experimental results on representative datasets demonstrated superior detection and classification performance compared to several baseline methods under a centralized training architecture.

Preuveneers et al. [19] proposed a blockchain-based federated learning method that enables auditing of model updates without centralizing training data, thereby enhancing transparency and accountability in detecting malicious behavior. The experiments show that while integrating blockchain increases complexity, the impact on performance is minimal (ranging from 5 to 15%), and the method is adaptable to more sophisticated neural network architectures and diverse use cases. Notably, the adoption of federated self-learning for anomaly detection and threat hunting is a growing trend in IoT devices, focusing on improving detection accuracy while prioritizing data privacy [20,21,22]. In parallel, feature-selection research has begun combining federated learning with compact language models, demonstrating that small language models can support privacy-preserving, resource-aware feature selection for intrusion detection across datasets such as NSL-KDD, UNSW-NB15, and TON IoT [26].

Lately, the widespread adoption of 5G networks has increased the demand for the development of automated Intrusion Detection Systems (IDS), leading to a boost in specialized research in this domain [27]. Sheikhi et al. [23] focused on employing a federated learning-based method to identify DDoS attacks on the GTP protocol within a 5G core network.More recent work has extended this direction by applying unsupervised federated LSTM-based time-series modeling to 5G core traffic, demonstrating detection of distributed PFCP and IP-spoofing attacks on the GTP protocol while preserving local traffic privacy [28]. This approach capitalizes on the collective intelligence of various devices to proficiently and confidentially recognize DDoS attacks. While ML models exhibit strong performance in controlled settings, their efficacy in real-world environments is perceived as a significant barrier to their adoption [9]. In the following section, we will discuss various explainability methods that have been proposed as potential solutions to overcome the barriers associated with the real-world performance of ML models.

2.2. Explainable AI

Explainable artificial intelligence (XAI) refers to advanced techniques that aim to make the outputs of machine learning models more understandable and transparent to users. These techniques allow a practical deployment of ML models as they provide methods to ensure that the trained model is trustworthy by detecting biases in the model or in the corresponding training data, and increasing transparency in the predictions of the models by providing explanations for which input features had the most impact on the output of the model. For many critical applications in defense, medicine, finance, and law, explanations are essential for users to understand, trust, and effectively manage these new, artificial intelligence partners [29].

Recent research in Explainable Artificial Intelligence (XAI) has been actively applied to cybersecurity, particularly in specialized use cases like intrusion detection and malware identification [11]. Nguyen et al. develop GEE [30], a framework for detecting and explaining anomalies in network flow traffic. GEE comprises two components. The first one consists of an unsupervised Variational Autoencoder (VAE) model for detecting network anomalies. The second one is a gradient-based fingerprinting technique for explaining the detected anomalies in the VAE. The evaluation shows that their approach is effective in detecting different anomalies as well as identifying fingerprints that are good representations of these attacks.

Han et al. developed the DeepAID framework [31] to interpret unsupervised DL-based anomaly detection systems for cybersecurity. The approach helps security analysts understand why a certain sample is considered anomalous by searching for the difference between the anomaly and a normal reference data point. Additionally, they propose a model distiller that serves as an extension to the black-box DL model using a simpler and easier-to-understand finite-state machine model that allows analysts to get involved in the model decision-making process. While XAI holds promise in enhancing the adoption of ML models within existing cybersecurity frameworks, several challenges and considerations remain to be addressed.Recent work on logic-based XAI for 5G intrusion detection further highlights this need, showing that feature attributions can be complemented with high-fidelity rule extraction and validated LLM-generated explanations to make IDS decisions more transparent and actionable [32].

Nyre-Yu et al. [12] conducted a pilot study within an operational environment to assess an Explainable Artificial Intelligence (XAI) tool, focusing on insights gleaned from real-time interactions between cybersecurity analysts and XAI. The initial findings showed that the deployed XAI tools were not heavily used and did not improve analyst decision accuracy, highlighting the importance of aligning explanations with analysts’ workflows, operational environments, and trust requirements. Likewise, a recent systematic review [33] highlighted that research on XAI for cybersecurity shows that many XAI applications are crafted without a thorough understanding of their integration into analyst workflows. Moreover, the security literature frequently fails to differentiate among diverse use cases or to clearly separate the roles of model users and designers, which could lead to diminished adoption.

2.3. Chatbots for Security & ChatGPT

Conversational agents, often referred to as chatbots, have gained attention for their role in supporting cybersecurity within businesses through sharing network and security information with non-technical staff [34].

The security-focused chatbot introduced in [35], named SecBot, demonstrates the role of a conversational agent in supporting cybersecurity planning and management. SecBot applies concepts of neural networks and Natural Language Processing (NLP) to interact and extract information from a conversation to (a) identify cyber-attacks, (b) indicate solutions and configurations, and (c) provide insightful information for the decision on cybersecurity investments and risks.

Another significant advancement in Natural Language Understanding (NLU) with proven success in cybersecurity is the development of Generative Pre-trained Transformers (GPT) language models. These models can operate as standalone tools; for example, applying GPT to formulate cybersecurity policies, as demonstrated in McIntosh et al. [36], helps deter and mitigate the impact of ransomware attacks involving data exfiltration. The results of the study indicated that, in specific scenarios, policies generated by GPT could surpass those created by humans, mainly when supplied with customized input prompts. Similarly, Setianto et al. [37] developed a runtime system, GPT-2C, that uses a fine-tuned GPT-2 model to parse logs from a live Cowrie SSH honeypot effectively, achieving 89% inference accuracy in parsing Unix commands with minimal execution latency.

Furthermore, research has placed emphasis on exploring the potential threats posed by Large Language Models (LLMs) such as OpenAI’s ChatGPT and Bard, particularly their ability to facilitate cyberattacks [38,39]. Nonetheless, limited research has been conducted on the potential integration of LLMs into cyber-hunting interfaces. Bringing together conversational agents and GPT offers new opportunities for delivering knowledge and insights to non-professionals. Preliminary research and trials from other domains provide insight into this potential. For instance, ref. [40] examines the viability of leveraging Explainable AI (XAI) and language models, such as ChatGPT, to transform how financial knowledge is conveyed to those outside the financial sector. The findings suggest that ChatGPT holds significant promise in demystifying intricate financial principles for a broader audience.

Finally, prompt engineering, an emerging research area, shows promise as a solution to some of LLMs’ shortcomings, as simple, effective prompts have been shown to improve GPT-3’s reliability [41]. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets: generalizability, social biases, calibration, and factuality [41]. However, evaluating the performance of large language models across diverse domains presents an ongoing challenge. Several research studies have examined ChatGPT’s performance in diverse fields. Gilson et al. [42] investigated its performance on the United States Medical Licensing Examination, highlighting the ability of ChatGPT to provide logical and informative context in the majority of its responses. Furthermore, when ChatGPT delivers accurate and up-to-date information, users report higher satisfaction levels and make more informed decisions [43]. Therefore, understanding and quantifying the accuracy of the data provided in the cybersecurity domain is of high importance.

3. Architecture

The proposed system is organized into three distinct layers, each designed to fulfill specific functions and ensure optimal performance, as illustrated in Figure 1. Below, we outline the elements and functions of each layer, clarifying their roles and interactions within the overall system architecture.

Analytics engine: This powerhouse layer is responsible for performing the network packet analysis, examining network data, and detecting and managing anomalies and inconsistencies in the network flows.
Data Storage: We leverage Elasticsearch as our primary document storage, prized for its near real-time search capabilities, scalability, and reliability. It houses all detected network anomalies and the corresponding original flow data. For storing plots and images, Amazon S3 buckets are our go-to, guaranteeing security and accessibility.
User Interface (UI): The dashboard UI, constructed with Gradio, is the interactive front-end of our system, presenting the analytic engine’s outcomes to human analysts in a user-friendly manner. It integrates with OpenAI’s Language Model API, enabling seamless interactions between analysts and the system for ongoing discussions and analysis.

We selected this configuration due to its modular design, which prioritizes the segregation of components. This allows each layer to be developed, maintained, and enhanced independently, providing flexibility and promoting efficient scalability. The autonomous nature of each module ensures streamlined adaptability to evolving requirements, reinforcing the robustness and responsiveness of the system.

3.1. Component Diagram

In this section, we narrow our focus to provide more detailed insights into the architecture of the system, engaging in a concise discussion on each primary component. Figure 2 depicts a comprehensive illustration of the diverse components comprising our system, which include (i) the Anomaly Detection Application Server and (ii) the Intrusion Detection System Dashboard.

3.1.1. Anomaly Detection Application Server

The Anomaly Detection Application Server serves as the central orchestrator of the entire anomaly detection process. This component integrates several sub-modules to facilitate the efficient and accurate identification of anomalous network behavior.

ML Model Loader: loading the pre-trained machine learning model to assess incoming data points and then process the anomalies through explainability frameworks (SHAP and LIME) to provide interpretable explanations for predictions.
Elasticsearch Connector: Enabling seamless communication with Elasticsearch, the connector module handles authentication and index management, ensuring secure access to the Elasticsearch cluster. This component establishes a secure connection, creates and verifies indices, and ultimately enables the efficient storage of information about both detected and original packets.
Prediction: The prediction component analyzes individual network flows to determine the presence of anomalies.
Explainer: This component generates prediction factors, plots, and JSON documents. The explainer indexes the generated data into Elasticsearch, constructing a structured repository that facilitates efficient querying and exploration of detected and original network flows. To augment the interpretability of our findings, the explainer component uploads plot images to an AWS S3 bucket. These plots enhance our understanding of the model dynamics.
Elasticsearch: Elasticsearch plays a pivotal role in storing and organizing information. The system leverages Elasticsearch to manage both the “Detected Packets Index” and the “Original Packets Index,” optimizing data accessibility and analysis.
AWS S3 Bucket: Serving as a centralized repository for our visual resources, housing the uploaded plots.

3.1.2. Intrusion Detection System Dashboard

Our IDS Dashboard enhances trust in the AI system by revealing model insights and contributing features. It allows users to inspect original data packets for manual anomaly detection. AI-generated explanations further clarify predictions and suggest an appropriate course of action. The dashboard also facilitates interactive discussions with the AI assistant, offering custom insights into the model’s operations. This component integrates several sub-modules to facilitate its work in presenting explainability and fairness to the end user.

OpenAI Connector: The OpenAI Connector is mainly used for authentication with the OpenAI API and also initiates the prerecorded message history. It also keeps track of the user conversations.
Anomaly Packet Data Fetching: It looks through all the documents in the Elasticsearch index and extracts all the important information we need from that document.
OpenAI API Unit: Integrates the detected packet flow data with the curated fine-tuning prompts and feeds them to the OpenAI API.
AI Assistant Analysis: The AI Assistant receives all the data preserved in the document, accompanied by the refined prompts from the unit, and it generates a comprehensive analysis for the human agent. This analysis not only reveals the details but also enables direct interactive communication with the human agent, enabling a seamless exchange of information and insights.

3.2. Anomaly Detection Model & Dataset

The model driving our network anomaly detection is a Random Forest classifier trained and tested on a 10% sample of the KDD99 dataset [13], achieving an accuracy of 0.9973. The training dataset consisted of 494,021 records, among which 97,277 were normal, 391,458 DOS, 4107 Probe, 1126 R2L, and 52 U2R connections. In each connection, there are 41 attributes describing different features of the connection and a label assigned to each [44].

3.3. Elasticsearch Data Schemas

For telemetry storage, we principally utilize two Elasticsearch indices: detected-packets and original-packets. The “detected-packets” index serves as a structured repository for crucial information related to detected network anomalies. This schema is designed to store the attributes extracted using the two explainability frameworks, SHAP and LIME, which together contribute to a comprehensive understanding of the identified anomalies.

The schema for the “detected-packets” index is designed to align with the objective of offering insights into the workings of the anomaly detection model, ensuring transparency and a deeper understanding of its mechanisms. It provides a structured approach to storing textual predictions, influential factors, and other data, thereby enhancing the dashboard’s ability to explain anomalies effectively. Table 2 provides an overview of the “Detected Packets” index, detailing each label and its corresponding description to ensure clarity and a comprehensive understanding of the data structure. The table emphasizes key attributes pivotal for anomaly detection and subsequent analysis.

The “original-packets” index requires a more complex and dynamic mapping. An in-depth review of each field in the document is unnecessary, as the KDD99 dataset description [13] already provides comprehensive coverage. The schema for the original packets is crucial, as it aids anomaly detection through manual reviews and enhances trust between the human agent and the intrusion detection system.

3.4. User Experience & Use Cases

Collectively, the elements of the IDS Dashboard we previously discussed help users make well-informed decisions about network anomalies by incorporating explanatory visualizations, manual inspection, AI-generated explanations, and interactive conversations. The system is designed to be versatile and can assist users in multiple scenarios.

First, the Detection Engine provides updated threat identification and classification of incidents, swiftly categorizing network threats and enabling security analysts to make informed responses through insightful visualizations and AI-generated explanations available on the Dashboard. Second, the dashboard enables ML model Interpretability, serving as a crucial tool for developers and data scientists to comprehend model functionality and its shortcomings, aiding the creation of more advanced models. Third, to address the prevalent concern of soaring security operations costs, our dashboard promotes collaborative analysis and reporting by providing a user-friendly interface suitable for individuals with varying levels of security expertise and by equipping users with essential tools, thereby optimizing operational effectiveness and cost efficiency.

4. Evaluation and Results

4.1. Prototype Functionality

The dashboard shown in Figure 3 features a neatly organized layout that seamlessly integrates the multiple components of the HuntGPT. In the upper section, the dashboard provides a visual representation of the ML model’s inner workings. Moving down, the middle area presents the original data for manual inspection when needed. Lastly, the lower part is the AI assistant, which provides context and explainability to the end user and enables ongoing conversations. The system user can download a complete report of the incident, including all graphs and data related to the incident, using the ‘Generate Report’ function. This setup ensures user-friendliness while uniting various functions.

4.2. Response Quality Analysis

In this section, we examine how well our prototype explains detected anomalies and assists users through its chatbot feature. Assessing the chatbot’s responses, a common practice with conversational agents, posed distinctive challenges for quantifying the required metrics for performance appraisal. The analysis can be segmented into two key components: (i) Technical Knowledge in Cybersecurity and (ii) Response Evaluation.

We first examine whether ChatGPT (i.e., GPT-3.5 Turbo) possesses the requisite technical knowledge in the field of cybersecurity to effectively assist the user. A critical aspect of this analysis involves comparing ChatGPT’s knowledge to that of a certified IT professional. Following this knowledge assessment, we evaluate the responses provided by ChatGPT in terms of their quality and appropriateness. Special attention is paid to the level of difficulty in the answers since our goal is to provide knowledge to users with minimal cybersecurity experience.

4.2.1. Technical Knowledge in Cybersecurity

In this assessment, our primary focus is to measure the background cybersecurity competence of the LLM substrate (GPT-3.5-turbo) that powers the conversational layer of HuntGPT. This measurement asks a narrower question than “does HuntGPT correctly explain a given anomaly?”, which would require a task-based user study against a forensic ground truth, and which we identify as future work. Instead, it asks whether the LLM, considered as a component, has baseline professional knowledge of the cybersecurity domain at all. To address this question, we conducted tests using a set of standardized cybersecurity certification exams. The detailed list of the standardized exams used for our assessment is presented below:

CISM Certified Information Security Manager Practice Exams: The updated self-study guide, written by Peter H. Gregory, featuring hundreds of practice exam questions that match those on the live test [14].
ISACA official CISM practice Quiz. A free practice quiz includes questions from ISACA’s test prep solutions that are the same difficulty level as ISACA’s official CISM exam [45].
ISACA official cybersecurity fundamentals practice quiz: A practice quiz including questions from ISACA’s test prep solutions that are the same level of difficulty as ISACA’s official Cybersecurity Fundamentals exam [46].

Based on the results presented in Table 3 and visualized in Figure 4, GPT-3.5-turbo demonstrates substantial proficiency in cybersecurity knowledge, with success rates ranging from 72% to 82.5% across varied and reputable exams. However, there is room for improvement, particularly in the ISACA official cybersecurity fundamentals practice quiz, where the model achieved a lower success rate of 72%. Nevertheless, preliminary results from our evaluation are highly promising and indicate that the model possesses the capacity to provide well-informed security decisions.

We emphasize that these scores reflect only the LLM’s general cybersecurity knowledge and do not, in themselves, validate HuntGPT’s per-alert explanation fidelity. The latter is the subject of the future user study described in Section 6.

4.2.2. Response Evaluation

We evaluated the responses provided by GPT-3.5-turbo in terms of their quality and appropriateness. Table 4 presents the comprehensive findings of our study, in which we evaluated 20 generated explanations and 20 chat logs. Each chat log included approximately six questions regarding the detected anomaly. An example of a chat log is illustrated in Table 5.

We evaluated the responses using six unique readability formulas using Python version 3.11 and the py-readability-metrics library version 1.4 [47], each applicable to different fields due to their distinct characteristics. The Flesch-Kincaid Grade Level, adopted by the U.S. Army, is used for assessing the complexity of technical manuals. The Flesch Reading Ease is a standard test used by the U.S. Department of Defense to determine the readability of its documents and forms. The Dale-Chall readability formula is widely adopted in schools and educational institutions to assess the difficulty level of texts. The Automated Readability Index was initially designed for real-time monitoring of readability on electric typewriters. The Coleman-Liau Index has been specifically used to evaluate the readability of medical documents. Finally, the Linsear Write was purportedly developed for the United States Air Force to assess the readability of their technical manuals.

The readability results in Table 4, visualized side-by-side in Figure 5, were evaluated at the graduate level or equivalent under most formulas for both AI explanations and chatbot logs, indicating the presence of complex linguistic structures in the generated texts. While chat conversations can vary in readability depending on question types and required user details, the generated text is usually readable and typically requires only a basic, non-specialized college education for comprehension.

We note the apparent tension between these graduate-level readability scores and our stated goal of explicitly supporting users with limited cybersecurity expertise. We do not claim that the initial, one-shot explanation produced for an alert is itself suitable for a non-expert reader: that explanation must, by construction, name technical entities (specific features, attack categories, protocol fields) using their technical names, and flattening it to plain language would compromise precision. The accessibility mechanism we propose in HuntGPT is, instead, the follow-up dialogue: a user without prior security knowledge can ask iterative, clarifying questions, and the assistant scales the level of detail to the question. Table 5 illustrates this scaling, a user who does not know what a firewall is can move from a definition to a free alternative to installation instructions to a home-network configuration in a single conversation. We acknowledge that this accessibility claim has not yet been empirically validated with non-expert users in a controlled study, and we identify such a study as future work in Section 6.

5. Limitations, Threats to Validity, and Security Considerations

This section makes explicit the boundaries of the empirical claims of this paper and the security considerations that arise from placing an LLM in a SOC-adjacent role.

5.1. Dataset and Detection-Engine Scope

The Random Forest classifier reported in Section 3.2 is trained and evaluated on the KDD99 dataset. KDD99 is used here as a well-known, publicly available baseline to demonstrate the architectural integration among the classifier, the SHAP/LIME layer, the Elasticsearch and S3 storage layers, and the LLM-driven conversational interface. The anomaly-detection performance numbers reported on KDD99 should therefore not be read as a claim about contemporary threat detection: KDD99 does not represent modern protocol mixes (TLS 1.3, QUIC, encrypted IoT traffic) or modern attack classes. Retraining and benchmarking the detection engine on contemporary datasets such as CSE-CIC-IDS2018 and UNSW-NB15 is the first item of our future work plan. The architectural contribution of the paper, the way SHAP/LIME outputs are mediated to an analyst through an LLM, is independent of the choice of underlying classifier or dataset.

5.2. Evaluation Alignment

The evaluation in Section 4.2 measures two distinct, narrow quantities: (i) the readability of the LLM-generated explanations and chat responses, computed over 20 explanations and 20 chat logs; and (ii) the general cybersecurity competence of GPT-3.5-turbo, computed by running the model on standardized CISM and ISACA practice exams. Neither of these measurements validates the most operationally important property of HuntGPT, the fidelity of per-alert explanations relative to a forensic ground truth. This limitation is consistent with broader concerns in LLM evaluation, where narrow benchmark or exam-style scores may fail to capture real-world utility, robustness, and task-specific reliability [48]. Validating that property requires a task-based user study with security analysts, in which analysts are shown the original packet features, the SHAP/LIME plot, and the LLM narrative. It is asked to judge whether the narrative correctly reflects the statistical evidence and, separately, the underlying network reality. We identify this study as the central next step of this research program.

5.3. Semantic Fidelity and Information Loss

The natural-language layer is, by construction, a lossy projection of the SHAP and LIME outputs. The prototype mitigates this in two ways. First, the dashboard always displays the raw SHAP and LIME plots alongside the LLM narrative (see Figure 3), so the analyst can cross-check the linguistic summary against the underlying statistical evidence rather than relying solely on the narrative. Second, the LLM is conditioned on the structured Factors field in the Elasticsearch schema (Table 2), which preserves an auditable textual description of the top contributing features; this is the same description rendered to the analyst, so the LLM’s input is observable. Quantitative measurement of semantic fidelity, for instance, the fraction of cases in which the LLM narrative preserves the top-k SHAP features and their directional contribution, is identified as future work.

5.4. Adversarial Robustness and the LLM Attack Surface

Placing an LLM in a SOC triage workflow introduces three distinct risk classes that we acknowledge here without claiming to have resolved them.

Hallucinations. The LLM operates on a structured Factors string assembled from SHAP/LIME output, not on free-form analyst text, which reduces the surface for hallucination compared with open-domain question answering. The risk is nevertheless real, and the dashboard’s display of raw SHAP/LIME plots is the primary mitigation: an analyst can detect cases in which the narrative diverges from the statistical evidence. Quantitative hallucination measurement is future work.

Adversarial manipulation of feature attributions. An attacker who can manipulate packet features to induce a misleading SHAP attribution could, in principle, induce a misleading LLM narrative. This risk is upstream of the LLM; it is fundamentally an adversarial machine learning (ML) attack on the classifier and the post-hoc explainer. Defending against such attacks requires the use of certified-robust feature attribution methods and adversarially trained classifiers; this is an active area of research and is out of scope for the present system-architecture paper.

Prompt injection via ingested telemetry. Any free-text field ingested from the original-packets index (e.g., HTTP payload fragments) could in principle contain content designed to influence the LLM. The prototype mitigates this in two ways: (i) only sanitized, numerically encoded feature values are passed to the LLM, so free-form payload text is never injected into the prompt; and (ii) the system prompt explicitly instructs the model to treat all input as untrusted telemetry, not as instructions. Empirical red-teaming of this defense is identified as future work.

5.5. Context-Window Management

The current prototype is per-alert: each detected anomaly is summarized in an independent LLM call. The system prompt, the structured Factors string, and the rolling chat history fit comfortably within GPT-3.5-turbo’s context window, so context overflow is not a constraint in the current design. Multi-alert correlation, summarizing a sequence of related alerts in a single LLM call, or maintaining a longer-running session over a campaign, raises distinct context-window and prompt-engineering challenges and is an explicit future-work item.

5.6. Latency

HuntGPT is positioned as a SOC triage and explanation aid, not as an inline detector. The Random Forest classifier produces detections at line rate, and the LLM is invoked only when an analyst opens an alert in the dashboard. In that interactive-triage setting, the per-call latency of a single OpenAI completion (a few seconds in our usage) is acceptable. Quantitative end-to-end latency benchmarking and an evaluation of cheaper or locally hosted LLM substitutes are needed before any inline deployment and are identified as future work.

5.7. Scale and Breadth of the LLM Evaluation

The LLM-substrate evaluation reported in Section 4.2.1 covers 75 questions from three cybersecurity practice-exam sources and is restricted to a single LLM (GPT-3.5-turbo). It is therefore a feasibility test, not a benchmark. Two extensions are needed before strong claims about LLM choice can be made. First, the question set should be expanded in both volume and topical breadth, beyond general certification material to include operational categories such as incident analysis, malware classification, and threat-actor attribution. Second, the same expanded question set should be run against multiple LLM substrates, GPT-4-class models, Claude, and open-weight models such as Llama-3, so that the choice of LLM substrate becomes a controlled variable rather than a fixed component of the prototype. We identify both extensions as priority items in our future work plan.

6. Conclusions and Future Work

In conclusion, the study illustrates the efficacy of integrating LLM-based conversational agents with Explainable AI (XAI) within intrusion detection systems. Our prototype, HuntGPT, combines the advanced capabilities of GPT-3.5-turbo with a user-friendly dashboard, adeptly explaining the latent details of detected anomalies. The model demonstrated substantial proficiency in cybersecurity, as evidenced by its success rates ranging from 72% to 82.5% across various reputable cybersecurity certification exams. However, these results also highlight areas for improvement, primarily focusing on enhancing proficiency in fundamental cybersecurity concepts where the model achieved a lower success rate.

The extensive readability analysis of generated responses indicates that the content produced is generally comprehensible to individuals with a basic college education, fostering user understanding and interaction. The implemented conversational agent effectively generates actionable responses and promotes user engagement by communicating complex concepts, thereby making a substantial contribution to enhancing user comprehension in cybersecurity.

This research provides valuable insights into integrating advanced AI models into interactive, user-focused cybersecurity applications. The achieved success rates and the readability of generated responses emphasize the potential of implementing such integrated models in real-world applications, providing a solid foundation for developing more sophisticated, explainable, interpretable, actionable, and user-friendly cybersecurity solutions. This study serves as a stepping stone for further research and development in creating LLM-driven security tools that integrate XAI in response to the evolving landscape of cybersecurity threats and user needs.

Our future work plan addresses, in order of priority, the empirical limitations identified in Section 5. First, we will retrain and evaluate the detection engine on contemporary intrusion-detection datasets, CSE-CIC-IDS2018 and UNSW-NB15, and report performance under modern protocol mixes and attack classes. Second, we will run a task-based user study with security analysts of varying experience levels, in which the analysts judge the fidelity of HuntGPT’s per-alert explanations against a forensic ground truth and are timed on triage tasks with and without the LLM layer. Third, we will design and evaluate a response-trigger module that converts the assistant’s recommendations into machine-readable mitigation (e.g., Sigma or Suricata rules or executable response snippets) and emits them to the SIEM, thereby moving from decision support to closed-loop response. Fourth, we will empirically characterize the system’s exposure to hallucinations, prompt injection via ingested telemetry, and adversarial manipulation of feature attributions. Fifth, we will benchmark end-to-end latency and evaluate cheaper, locally hosted LLM substitutes (e.g., open-weight models) as alternatives to the OpenAI API.

Author Contributions

Conceptualization, T.A., S.S. and P.K.; methodology, T.A. and P.K.; software, T.A.; validation, T.A., P.K. and S.S.; formal analysis, T.A. and P.K.; investigation, P.K.; resources, P.K.; data curation, T.A.; writing—original draft preparation, T.A. and P.K.; writing—review and editing, T.A., S.S. and P.K.; visualization, T.A., S.S. and P.K.; supervision, P.K. and S.S.; project administration, P.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The KDD Cup 1999 dataset used to train and evaluate the Random Forest classifier in this study is openly available from the UCI Machine Learning Repository at https://doi.org/10.24432/C51C7N (accessed on 1 September 2023) (reference [13] in this manuscript). The Certified Information Security Manager (CISM) and ISACA practice exam materials used for the technical-knowledge evaluation in Section 4.2.1 are subject to copyright held by their respective publishers (McGraw-Hill [14] and ISACA [45]); the exam questions are therefore not redistributed in this study, and only aggregate evaluation results are reported. No additional proprietary datasets were generated or analyzed.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Morgan, S. 2023 Cybersecurity Almanac: 100 Facts, Figures, Predictions, and Statistics. 2023. Available online: https://cybersecurityventures.com/cybersecurity-almanac-2023/ (accessed on 23 September 2023).
Cybersecurity, C.I. Framework for Improving Critical Infrastructure Cybersecurity; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2018; Volume 4162018. Available online: https://nvlpubs.nist.gov/nistpubs/cswp/nist.cswp.04162018.pdf (accessed on 1 September 2023).
Shu, X.; Araujo, F.; Schales, D.L.; Stoecklin, M.P.; Jang, J.; Huang, H.; Rao, J.R. Threat intelligence computing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Toronto, ON, Canada, 15–19 October 2018; pp. 1883–1898. [Google Scholar]
Bhuyan, M.H.; Bhattacharyya, D.K.; Kalita, J.K. Network Anomaly Detection: Methods, Systems and Tools. IEEE Commun. Surv. Tutor. 2014, 16, 303–336. [Google Scholar] [CrossRef]
Mavroeidis, V.; Jøsang, A. Data-Driven Threat Hunting Using Sysmon. In Proceedings of the 2nd International Conference on Cryptography, Security and Privacy, ICCSP 2018, New York, NY, USA, 16–18 March 2018; pp. 82–88. [Google Scholar] [CrossRef]
Subramanian, K.; Meng, W. Threat Hunting Using Elastic Stack: An Evaluation. In Proceedings of the 2021 IEEE International Conference on Service Operations and Logistics, and Informatics (SOLI), Singapore, 11–12 December 2021; pp. 1–6. [Google Scholar] [CrossRef]
Aldauiji, F.; Batarfi, O.; Bayousef, M. Utilizing Cyber Threat Hunting Techniques to Find Ransomware Attacks: A Survey of the State of the Art. IEEE Access 2022, 10, 61695–61706. [Google Scholar] [CrossRef]
Thottan, M.; Ji, C. Anomaly detection in IP networks. IEEE Trans. Signal Process. 2003, 51, 2191–2204. [Google Scholar] [CrossRef]
Cavallaro, L.; Kinder, J.; Pendlebury, F.; Pierazzi, F. Are Machine Learning Models for Malware Detection Ready for Prime Time? IEEE Secur. Priv. 2023, 21, 53–56. [Google Scholar] [CrossRef]
Charmet, F.; Tanuwidjaja, H.C.; Ayoubi, S.; Gimenez, P.F.; Han, Y.; Jmila, H.; Blanc, G.; Takahashi, T.; Zhang, Z. Explainable artificial intelligence for cybersecurity: A literature survey. Ann. Telecommun. 2022, 77, 789–812. [Google Scholar] [CrossRef]
Linkov, I.; Galaitsi, S.; Trump, B.D.; Keisler, J.M.; Kott, A. Cybertrust: From Explainable to Actionable and Interpretable Artificial Intelligence. Computer 2020, 53, 91–96. [Google Scholar] [CrossRef]
Nyre-Yu, M.; Morris, E.; Moss, B.C.; Smutz, C.; Smith, M. Explainable AI in Cybersecurity Operations: Lessons Learned from xAI Tool Deployment. In Proceedings of the Usable Security and Privacy (USEC) Symposium, San Diego, CA, USA, 28 April 2022; Volume 28. [Google Scholar]
Stolfo, S.; Fan, W.; Lee, W.; Prodromidis, A.; Chan, P. KDD Cup 1999 Data; UCI Machine Learning Repository: Irvine, CA, USA, 1999. [Google Scholar] [CrossRef]
Gregory, P. CISM Certified Information Security Manager Practice Exams, 2nd ed.; McGraw Hill LLC: New York, NY, USA, 2023. [Google Scholar]
Heikkilä, M.; Rättyä, A.; Pieskä, S.; Jämsä, J. Security challenges in small- and medium-sized manufacturing enterprises. In Proceedings of the 2016 International Symposium on Small-Scale Intelligent Manufacturing Systems (SIMS), Narvik, Norway, 21–24 June 2016; pp. 25–30. [Google Scholar] [CrossRef]
Yihunie, F.; Abdelfattah, E.; Regmi, A. Applying Machine Learning to Anomaly-Based Intrusion Detection Systems. In Proceedings of the 2019 IEEE Long Island Systems, Applications and Technology Conference (LISAT), Farmingdale, NY, USA, 3 May 2019; pp. 1–5. [Google Scholar] [CrossRef]
Eltanbouly, S.; Bashendy, M.; AlNaimi, N.; Chkirbene, Z.; Erbad, A. Machine Learning Techniques for Network Anomaly Detection: A Survey. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha, Qatar, 2–5 February 2020; pp. 156–162. [Google Scholar] [CrossRef]
Zhao, Y.; Chen, J.; Wu, D.; Teng, J.; Yu, S. Multi-Task Network Anomaly Detection Using Federated Learning. In Proceedings of the 10th International Symposium on Information and Communication Technology, Hanoi, Vietnam, 4–6 December 2019; pp. 273–279. [Google Scholar] [CrossRef]
Preuveneers, D.; Rimmer, V.; Tsingenopoulos, I.; Spooren, J.; Joosen, W.; Ilie-Zudor, E. Chained Anomaly Detection Models for Federated Learning: An Intrusion Detection Case Study. Appl. Sci. 2018, 8, 2663. [Google Scholar] [CrossRef]
Mothukuri, V.; Khare, P.; Parizi, R.M.; Pouriyeh, S.; Dehghantanha, A.; Srivastava, G. Federated-Learning-Based Anomaly Detection for IoT Security Attacks. IEEE Internet Things J. 2022, 9, 2545–2554. [Google Scholar] [CrossRef]
Nguyen, T.D.; Marchal, S.; Miettinen, M.; Fereidooni, H.; Asokan, N.; Sadeghi, A.R. DÏoT: A Federated Self-learning Anomaly Detection System for IoT. In Proceedings of the 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), Dallas, TX, USA, 7–10 July 2019; pp. 756–767. [Google Scholar] [CrossRef]
Sheikhi, S.; Kostakos, P. Cyber threat hunting using unsupervised federated learning and adversary emulation. In Proceedings of the 2023 IEEE International Conference on Cyber Security and Resilience (CSR), Venice, Italy, 31 July–2 August 2023; pp. 315–320. [Google Scholar] [CrossRef]
Sheikhi, S.; Kostakos, P. DDoS attack detection using unsupervised federated learning for 5G networks and beyond. In Proceedings of the 2023 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit), Gothenburg, Sweden, 6–9 June 2023; pp. 442–447. [Google Scholar] [CrossRef]
Bhattacharyya, D.K.; Kalita, J. Network Anomaly Detection: A Machine Learning Perspective; CRC Press: Boca Raton, FL, USA, 2013. [Google Scholar] [CrossRef]
Sheikhi, S. Painted Wolf Optimization: A Novel Nature-Inspired Metaheuristic Algorithm for Real-World Optimization Problems. Comput. Mater. Contin. 2026, 87, 7. [Google Scholar] [CrossRef]
Sheikhi, S. SLMFORGE: Small Language Models for Federated Feature Selection via Union Aggregation in Cybersecurity. In Proceedings of the 2025 IEEE International Conference on Big Data (BigData), Macau, China, 8–11 December 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 4784–4792. [Google Scholar]
Porambage, P.; Gür, G.; Osorio, D.P.M.; Liyanage, M.; Gurtov, A.; Ylianttila, M. The roadmap to 6G security and privacy. IEEE Open J. Commun. Soc. 2021, 2, 1094–1122. [Google Scholar] [CrossRef]
Sheikhi, S.; Kostakos, P. Advancing security in 5g core networks through unsupervised federated time series modeling. In Proceedings of the 2024 IEEE International Conference on Cyber Security and Resilience (CSR), London, UK, 2–4 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 353–356. [Google Scholar]
Gunning, D.; Stefik, M.; Choi, J.; Miller, T.; Stumpf, S.; Yang, G.Z. XAI—Explainable artificial intelligence. Sci. Robot. 2019, 4, eaay7120. [Google Scholar] [CrossRef] [PubMed]
Nguyen, Q.P.; Lim, K.W.; Divakaran, D.M.; Low, K.H.; Chan, M.C. Gee: A gradient-based explainable variational autoencoder for network anomaly detection. In Proceedings of the 2019 IEEE Conference on Communications and Network Security (CNS), Washington, DC, USA, 10–12 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 91–99. [Google Scholar]
Han, D.; Wang, Z.; Chen, W.; Zhong, Y.; Wang, S.; Zhang, H.; Yang, J.; Shi, X.; Yin, X. Deepaid: Interpreting and improving deep learning-based anomaly detection in security applications. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 3197–3217. [Google Scholar]
Sheikhi, S.; Kostakos, P.; Loven, L. ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks. arXiv 2026, arXiv:2604.18052. [Google Scholar]
Nadeem, A.; Vos, D.; Cao, C.; Pajola, L.; Dieck, S.; Baumgartner, R.; Verwer, S. Sok: Explainable machine learning for computer security applications. In Proceedings of the 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), Delft, The Netherlands, 3–7 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 221–240. [Google Scholar]
Ben Peretz, D. A Siri for Network Security: How Chatbots Can Enhance Business Agility. 2020. Available online: https://www.infosecurity-magazine.com/opinions/network-chatbots-agility/ (accessed on 22 July 2023).
Franco, M.F.; Rodrigues, B.; Scheid, E.J.; Jacobs, A.; Killer, C.; Granville, L.Z.; Stiller, B. SecBot: A Business-Driven Conversational Agent for Cybersecurity Planning and Management. In Proceedings of the 2020 16th International Conference on Network and Service Management (CNSM), Izmir, Turkey, 2–6 November 2020; pp. 1–7. [Google Scholar] [CrossRef]
McIntosh, T.; Liu, T.; Susnjak, T.; Alavizadeh, H.; Ng, A.; Nowrozy, R.; Watters, P. Harnessing GPT-4 for Generation of Cybersecurity GRC Policies: A Focus on Ransomware Attack Mitigation. Comput. Secur. 2023, 134, 103424. [Google Scholar] [CrossRef]
Setianto, F.; Tsani, E.; Sadiq, F.; Domalis, G.; Tsakalidis, D.; Kostakos, P. GPT-2C: A Parser for Honeypot Logs Using Large Pre-Trained Language Models. In Proceedings of the 2021 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’21, Virtual, 8–11 November 2022; pp. 649–653. [Google Scholar] [CrossRef]
Nour, B.; Pourzandi, M.; Debbabi, M. A Survey on Threat Hunting in Enterprise Networks. IEEE Commun. Surv. Tutor. 2023, 25, 2299–2324. [Google Scholar] [CrossRef]
Sebastian, G. Do ChatGPT and other AI chatbots pose a cybersecurity risk?: An exploratory study. Int. J. Secur. Priv. Pervasive Comput. (IJSPPC) 2023, 15, 1–11. [Google Scholar]
Yue, T.; Au, D.; Au, C.C.; Iu, K.Y. Democratizing financial knowledge with ChatGPT by OpenAI: Unleashing the Power of Technology. SSRN 2023. [Google Scholar] [CrossRef]
Si, C.; Gan, Z.; Yang, Z.; Wang, S.; Wang, J.; Boyd-Graber, J.; Wang, L. Prompting gpt-3 to be reliable. arXiv 2022, arXiv:2210.09150. [Google Scholar]
Gilson, A.; Safranek, C.W.; Huang, T.; Socrates, V.; Chi, L.; Taylor, R.A.; Chartash, D. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med. Educ. 2023, 9, e45312. [Google Scholar] [CrossRef] [PubMed]
Chu, M.N. Assessing the Benefits of ChatGPT for Business: An Empirical Study on Organizational Performance. IEEE Access 2023, 11, 76427–76436. [Google Scholar] [CrossRef]
Olusola, A.A.; Oladele, A.S.; Abosede, D.O. Analysis of KDD’99 intrusion detection dataset for selection of relevance features. In Proceedings of the World Congress on Engineering and Computer Science. WCECS, San Francisco, CA, USA, 20–22 October 2010; Volume 1, pp. 20–22. [Google Scholar]
ISACA. CISM Practice Quiz. 2023. Available online: https://www.isaca.org/credentialing/cism/cism-practice-quiz (accessed on 1 October 2023).
ISACA. Cybersecurity Fundamentals Practice Quiz. Available online: https://www.isaca.org/-/media/info/csxf-practice-quiz/index.html (accessed on 1 October 2023).
py-Readability-Metrics: A Python Library for Computing Readability Metrics. Available online: https://pypi.org/project/py-readability-metrics/ (accessed on 1 October 2023).
Sheikhi, S.; Loven, L.; Kostakos, P. Beyond the Leaderboard: A Survey of the Science of Evaluation, Benchmarking, and Methodologies for Large Language Models. IEEE Access 2026, 14, 66493–66515. [Google Scholar] [CrossRef]

Figure 1. High-level diagram of dashboard integration.

Figure 2. System components diagram. The arrows indicate the direction of data, API, and message flow between the intrusion detection dashboard, OpenAI LLM, message history, message exchange, Elasticsearch document storage, AWS S3 bucket, anomaly detection application server, and data source.

Figure 3. Sections of detection and explainability in the dashboard. The local explanation panel uses red and green bars to indicate negative and positive feature contributions, respectively. The top feature-contribution panel uses different colors to distinguish the attack classes shown in the legend.

Figure 4. GPT-3.5-turbo success rates on the three cybersecurity practice-exam sources of Table 3. The chart highlights consistent performance across the two CISM sources and a lower score on the ISACA Cybersecurity Fundamentals quiz.

Figure 5. Readability of HuntGPT-generated text across the six formulas used in Table 4, comparing one-shot anomaly explanations against follow-up chatbot answers. The two text types are within one or two scale points of each other across all six formulas.

Table 1. Summary of representative machine-learning-based studies on network anomaly detection reviewed in this work.

Study	Approach	Dataset/Setting	Key Contribution
Yihunie et al. [16]	Comparative study of SGD, Random Forest, Logistic Regression, SVM, and a sequential model	NSL-KDD	Identified Random Forest as the best-performing classifier among the five evaluated baselines.
Eltanbouly et al. [17]	Hybrid Random Forest and weighted K-means method	Public IDS benchmarks	Proposed a two-phase scheme combining online misuse detection using Random Forest with offline categorization of random attacks using weighted K-means.
Zhao et al. [18]	Multi-task deep neural network under federated learning, MT-DNN-FL	Multiple public traffic datasets	Performed joint anomaly detection, VPN/Tor traffic recognition, and traffic classification while preserving data confidentiality.
Preuveneers et al. [19]	Chained anomaly detection with blockchain-audited federated learning	Network intrusion benchmarks	Enabled auditable and transparent model updates with only 5–15% performance overhead, while supporting richer neural architectures.
Mothukuri et al. [20]	Federated learning for IoT anomaly detection	IoT traffic	Developed a privacy-preserving anomaly detection approach tailored to IoT security attacks.
Nguyen et al. [21]	DÏoT federated self-learning system	IoT device telemetry	Enabled self-learning anomaly detection across distributed IoT deployments without centralized data collection.
Sheikhi and Kostakos [22]	Unsupervised federated learning combined with adversary emulation	Threat-hunting environments	Linked unsupervised federated learning with adversary emulation to improve cyber threat-hunting coverage.
Sheikhi and Kostakos [23]	Unsupervised federated learning for DDoS detection	5G core network, GTP protocol	Demonstrated DDoS detection in 5G core networks and showed the applicability of federated approaches to next-generation infrastructures.

Table 2. Detected Packets Index.

Label	Description
_id	Represents a unique identifier for each detected packet entry.
Prediction	Stores the prediction outcome associated with the detected packet, whether normal or malicious, and also the specific genre of the attack.
Factors	Captures the contributing factors that influenced the prediction outcome and the key attributes that led to the anomaly’s detection. We store textual descriptions of these factors for later translation into visual plots and for feeding into the chatbot to provide contextual answers.
Exp-img	Stores references to the plots used in presenting the prediction process, enhancing the comprehensibility of the anomaly detection process. Exp-img represents the local explanation plot provided by the LIME tabular explainer.
Shap-img	Shap-img represent the top features contributing to the predicted anomaly class using the SHAP framework.
Original-data	Denotes the number of the detected packets, which can be used as a foreign key to access the complete information regarding the detected packets.

Table 3. Comparison of CISM and Cybersecurity Exam Preparations: GPT-3.5 Success Rates.

Exam	No. of Questions	GPT-3.5 Turbo Success Rate
CISM Certified Information Security Manager Practice Exams [14]	40	82.5%
ISACA official CISM practice Quiz [45]	10	80%
ISACA official cybersecurity fundamentals practice quiz [46]	25	72%

Table 4. Evaluating generated text readability.

Readability Formula	Generated Anomaly Explanation		Chatbot Answers
Readability Formula	Score	Grade Level	Score	Grade Level
Flesch-Kincaid Grade Level	15.7	16	14.9	15
Flesch Reading Ease	22.7	graduate	23.9	graduate
Dale Chall	12.6	graduate	12.11	graduate
Automated Readability Index	16.3	graduate	15.9	graduate
Coleman Liau Index	15.3	15	16.3	16
Linsear Write	17.4	17	15.8	16

Table 5. Examples of submitted queries to HuntGPT and the related actionable responses.

Query	Chatbot Response
How can I prevent such an attack	To prevent DoS attacks, you can implement measures like traffic monitoring, firewalls, load balancers, and rate limiting to detect and mitigate abnormal traffic patterns. Ensuring network redundancy and having a robust incident response plan can also help minimize the impact.
What is a firewall, and how do I implement it (with examples)?	A firewall is a network security device that filters incoming and outgoing network traffic according to predefined security rules. Examples include Cisco ASA, Palo Alto Networks NGFW, Fortinet FortiGate, and Check Point. To implement a firewall, configure rule sets to allow or block specific traffic and define security policies to secure your network from unauthorized access.
Are there any free ones?	Yes. Examples include ZoneAlarm (free version for personal use) and Windows Firewall (built-in on Windows). Free solutions vary in features and protection level.
How can I install ZoneAlarm?	To install ZoneAlarm: visit the official website, navigate to Products/Downloads, download the free version, run the installer, and follow the on-screen prompts (license agreement, installation options, etc.). [proceeds to give detailed instructions on downloading ZoneAlarm]
What would be a good configuration for a small home network?	Install and update ZoneAlarm; use medium/high security mode; enable automatic program control; customize program permissions; enable Stealth Mode; set up Network Zones; monitor traffic; keep it updated.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ali, T.; Kostakos, P.; Sheikhi, S. HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs). Telecom 2026, 7, 73. https://doi.org/10.3390/telecom7030073

AMA Style

Ali T, Kostakos P, Sheikhi S. HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs). Telecom. 2026; 7(3):73. https://doi.org/10.3390/telecom7030073

Chicago/Turabian Style

Ali, Tarek, Panos Kostakos, and Saeid Sheikhi. 2026. "HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs)" Telecom 7, no. 3: 73. https://doi.org/10.3390/telecom7030073

APA Style

Ali, T., Kostakos, P., & Sheikhi, S. (2026). HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs). Telecom, 7(3), 73. https://doi.org/10.3390/telecom7030073

Article Menu

HuntGPT: Integrating Machine Learning-Based Anomaly Detection and Explainable AI with Large Language Models (LLMs)

Abstract

1. Introduction

2. Background

2.1. Network Anomaly Detection

2.2. Explainable AI

2.3. Chatbots for Security & ChatGPT

3. Architecture

3.1. Component Diagram

3.1.1. Anomaly Detection Application Server

3.1.2. Intrusion Detection System Dashboard

3.2. Anomaly Detection Model & Dataset

3.3. Elasticsearch Data Schemas

3.4. User Experience & Use Cases

4. Evaluation and Results

4.1. Prototype Functionality

4.2. Response Quality Analysis

4.2.1. Technical Knowledge in Cybersecurity

4.2.2. Response Evaluation

5. Limitations, Threats to Validity, and Security Considerations

5.1. Dataset and Detection-Engine Scope

5.2. Evaluation Alignment

5.3. Semantic Fidelity and Information Loss

5.4. Adversarial Robustness and the LLM Attack Surface

5.5. Context-Window Management

5.6. Latency

5.7. Scale and Breadth of the LLM Evaluation

6. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI