1. Introduction
The vision of continuous auditing, where compliance is monitored in near real-time, has gained urgency with the exponential growth of the Internet of Things (IoT). The velocity, volume, and heterogeneity of IoT traffic render periodic manual audits insufficient [1,2]. While professional bodies such as the Institute of Internal Auditors (IIA) have championed continuous auditing as critical for modern three-line defence architectures [3], Large Language Model (LLM) adoption remains limited (only 18% of organizations use AI for security compliance [4]), hindered by concerns about black-box AI, testing errors, and unexplainable outputs [5,6].
The stakes are increasingly high in regulated industries transitioning to IoT environments. Organizations in sectors such as aviation and healthcare face regulatory pressure from standards such as ETSI EN 303 645 [7] and HIPAA [8], with non-compliance exposing them to legal liability, reputational damage, and operational disruptions. A single undetected vulnerability can cascade into supply chain breaches affecting millions of devices, as the Mirai botnet attacks demonstrated. Professional auditors must provide legally defensible documentation tracing each compliance verdict to observable network evidence [9]. However, manual analysis is time-consuming: a single 24 h capture from a small IoT deployment may contain millions of flows. This creates a critical need for automated systems that detect anomalies and generate natural-language compliance explanations at scale while maintaining evidentiary standards.
A fundamental technical problem persists: current AI-driven security systems exhibit an accuracy–faithfulness gap where high detection performance does not guarantee trustworthy explanations. Intrusion detection systems may achieve F1 scores exceeding 0.97 [
10,
11], yet LLM-generated compliance analyses can contain statements unsupported by retrieved evidence. For instance, an LLM might correctly flag a DDoS attack but cite an inapplicable control clause, or claim packet features justify a verdict when those features were absent from a retrieved context. Such unfaithful explanations render audit trails legally indefensible, undermining AI-assisted auditing’s core value proposition.
The research landscape has evolved along three tracks. First, traditional machine learning approaches achieve high detection accuracy (F1 > 0.95) through ensemble methods and deep neural networks on benchmarks like CIC-IoT2023 [
10] but provide no natural-language explanations. Second, LLM-based compliance reporting generates fluent explanations mapping attacks to regulatory provisions [
12], yet rarely measures whether narratives are grounded in actual network evidence versus parametric knowledge hallucinations [
13]. Third, Retrieval-Augmented Generation (RAG) grounds LLM outputs in external knowledge bases via rule-based heuristics [
14] or knowledge graph traversal [
10].
Alternative approaches present limitations. Long-context LLMs can theoretically ingest entire PCAP session logs but suffer from lost-in-the-middle degradation [15], high inference costs, and no mechanism for attributing claims to specific evidence. Fine-tuning [16] embeds compliance knowledge into model weights but requires costly retraining for policy updates and leaves no inspectable retrieval trace. RAG addresses these limitations by retrieving query-relevant evidence regardless of corpus size, generating explanations grounded in retrieved context, and maintaining transparent retrieval traces that satisfy auditability requirements. However, the application of RAG to IoT network security auditing remains underexplored.
This reveals a critical gap: no prior work has compared retrieval paradigms on generating faithful audit explanations from real IoT traffic using quantitative metrics. While detection accuracy and architectural explainability (via LIME/SHAP and knowledge graphs) have been explored, the field lacks empirical evidence on which retrieval strategies produce LLM explanations genuinely grounded in observable network evidence. This deficiency carries direct implications for Responsible AI deployment. Frameworks such as the EU AI Act [17] and NIST AI Risk Management Framework (RMF) [18] require high-risk AI systems used in security auditing to demonstrate transparency, traceability, and accountability. AI-assisted auditing that presents hallucinated evidence as compliance justification violates these principles regardless of detection accuracy.
We address this gap by presenting RAGAS as an evaluation framework comparing three retrieval paradigms for generating faithful explanations in IoT network security auditing: rule-based heuristic scoring, dense vector embeddings, and knowledge graph traversal. RAGAS was selected for three reasons: (1) its metrics (faithfulness, context precision, context recall) map directly onto evidentiary and completeness requirements of compliance auditing, providing diagnostic insight beyond surface-level metrics like BLEU and ROUGE; (2) alternative frameworks like ARES [
19] require approximately 150 domain-annotated training samples, whereas this study has 30 human-annotated ground truth samples (each requiring manual PCAP inspection); (3) RAGAS operates predominantly reference-free, enabling ongoing evaluation in production without human labeling overhead [
20]. To validate architectural feasibility, we developed SACA (Security Audit Compliance Agent), a proof-of-concept prototype integrating Graph RAG with interactive graph visualization and continuous RAGAS monitoring.
Based on the current research landscape and identified gaps, our contributions are:
Adoption of RAGAS as the closest available proxy for network-domain RAG evaluation, while acknowledging protocol-aware faithfulness metrics as future work.
Quantification of the accuracy–faithfulness gap, where over 40% of LLM statements remain unsupported by retrieved evidence despite F1 scores exceeding 0.97.
Comparative analysis of three retrieval methods (rule-based, vector RAG, and graph RAG) in network flow data analysis, challenging assumptions that learned embeddings always outperform hand-crafted features.
SACA prototype integrating Graph RAG with interactive knowledge graph visualization and continuous RAGAS monitoring, demonstrating architecturally feasible auditable compliance reporting with human-in-the-loop oversight.
A handcrafted benchmark of 30 compliance scenarios validated by two expert auditors (inter-rater agreement assessed with Cohen’s $\kappa$; see Section 3.1.3), with a reusable RAGAS evaluation harness for reproducible faithfulness measurement in IoT network security.
The remainder of this paper is structured as follows:
Section 2 reviews related work in AI-driven IoT compliance solutions.
Section 3 details the experimental methodology, including ground truth construction and evaluation protocols.
Section 4 presents experimental results with the SACA prototype architecture.
Section 5 discusses findings, limitations, and implications for practice.
Section 6 concludes with future research directions.
2. Related Work
The intersection of artificial intelligence and IoT security compliance has generated a diverse body of research, yet systematic evaluation of explanation faithfulness remains largely unexplored.
Table 1 categorizes the current landscape of AI-driven solutions for auditing IoT security compliance across five main paradigms, revealing distinct strengths and persistent limitations in each approach.
Traditional machine learning and deep learning approaches have dominated early efforts in automated IoT security compliance. Studies such as Bagaa et al. [
21] and Sharma et al. [
22] demonstrate that ensemble methods combining deep learning and neural architectures can achieve high detection accuracy for intrusion classification and device profiling against standards like NIST IR 8259 [
23]. However, these approaches suffer from fundamental limitations in the compliance domain where explainability is paramount. While detection F1 scores exceed 0.95, these systems provide no mechanism for auditors to verify the reasoning chain from network observations to regulatory verdicts, rendering them unsuitable for audit scenarios where traceable findings are legally required.
The emergence of large language models has opened new possibilities for automated compliance analysis through natural language reasoning. Recent work by Oranekwu et al. [
24] and Ghosh et al. [
25] applies BERT and GPT architectures to map security findings to regulatory frameworks such as GDPR and ENISA guidelines, while Lin et al. [
26] and Hosseini et al. [
27] leverage ontologies and knowledge graphs to prioritize vulnerabilities in context. Despite these advances, the explainability problem persists in a different form. While LLMs can generate fluent natural-language explanations, the field lacks quantitative frameworks to validate that these explanations are faithful to the underlying evidence rather than plausible-sounding hallucinations or parametric knowledge.
Hybrid frameworks attempt to address explainability through post hoc interpretability methods. Tuncel et al. [
28] and Sawarkar et al. [
29] integrate SHAP and LIME with ensemble classifiers to provide feature attribution for GDPR-compliant auditing, while Abbas et al. [
30] and Cha et al. [
31] combine federated learning with blockchain to enable privacy-preserving forensic analysis. These approaches offer transparency into model decision boundaries but do not address the fundamental question of whether retrieved evidence supports generated explanations. SHAP values reveal which features influenced a classification but cannot verify that a natural-language compliance verdict accurately reflects the underlying network traffic patterns.
Table 1.
Current landscape of AI-driven solutions for auditing IoT security compliance.
| AI Category | Key Studies | AI Components | Security Compliance Features | Limitations |
|---|---|---|---|---|
| Machine Learning (ML)/Deep Learning (DL) | [21,22,32,33] | SVM, Random Forest, CNNs, LSTM, ensemble methods | Real-time intrusion detection, device classification (NIST IR 8259) | Reliance on simulated datasets; adversarial vulnerability; limited real-world deployment evidence |
| Large Language Models (LLMs)/NLP | [24,25,26,34,35,36] | BERT, GPT, ontologies, knowledge graphs | Automated compliance mapping (GDPR, ENISA), vulnerability prioritization | Limited explainability; domain adaptation needs; sparse quantitative benchmarks |
| Hybrid Frameworks | [28,29,30,31,37] | Federated learning, blockchain, CatBoost, SHAP/LIME | Privacy-preserving auditing (GDPR), forensic analysis | Scalability in real-world IoT systems; computational overhead; privacy-utility trade-offs |
| Retrieval Augmented Generation (RAG)/Agentic AI | [25,34,38] | LLM + RAG, deep reinforcement learning | Dynamic vulnerability prioritization, DDoS defense | Immature implementations; sparse quantitative data; lack of standardized evaluation frameworks |
| Federated Learning/Blockchain | [28,30,39,40,41] | Federated ensemble, adversarial ML | GDPR-compliant architectures, real-time response | Privacy-utility trade-offs; limited longitudinal assessment; inconsistent metric reporting |
The most relevant paradigm to our work is the emerging field of Retrieval Augmented Generation for security applications. Ikegami et al. [
34] and Ghosh et al. [
25] demonstrate that grounding LLM outputs in retrieved vulnerability databases improves prioritization accuracy, while Kurniawan et al. [
38] apply knowledge graph RAG to cybersecurity retrieval-based reasoning tasks. However, these implementations remain immature, with limited quantitative evaluation of explanation faithfulness. The lack of standardized evaluation frameworks identified in
Table 1 is particularly significant in this category, where existing studies report task-specific accuracy metrics but rarely measure whether LLM-generated explanations genuinely reflect the retrieved context or instead rely on parametric knowledge that may hallucinate non-existent evidence.
Federated learning and blockchain approaches prioritize privacy and auditability through architectural guarantees rather than post hoc verification. Abbas et al. [
30] and Vutukuru et al. [
39] demonstrate GDPR-compliant intrusion detection through decentralized learning, while Tuncel et al. [
28] provide cryptographic audit trails for compliance decisions. These systems offer strong privacy guarantees but face scalability challenges in heterogeneous IoT deployments and exhibit privacy-utility trade-offs that limit detection performance. Moreover, the focus on architectural compliance does not address explanation faithfulness, as even cryptographically auditable systems can produce unfaithful natural-language explanations if the underlying LLM hallucinates.
The critical gap revealed by this landscape is the absence of quantitative faithfulness evaluation for compliance explanations in the IoT security domain. While detection accuracy has been extensively studied and architectural guarantees have been explored, no prior work has systematically measured the gap between detection performance and explanation quality. Professional auditors require not just accurate anomaly detection but a traceable explanation that withstands regulatory scrutiny. Existing RAG implementations for security lack the standardized evaluation frameworks needed to validate faithfulness, and traditional ML approaches provide no natural-language reasoning at all. This paper addresses this gap by introducing RAGAS faithfulness metrics to the IoT compliance domain and conducting a comparison of three retrieval paradigms on explanation quality rather than detection accuracy.
3. Materials and Methods
This section details the experimental design, dataset preparation, retrieval strategies, and evaluation methodology used to compare three RAG paradigms for IoT network security compliance analysis. Generative AI was used in this section to generate graphics; the scripts are openly available in this paper’s GitHub repository at https://github.com/obrina/security_audit_compliance_agent_v2 (accessed on 4 May 2026).
3.1. Dataset and Ground Truth Construction
Our evaluation methodology required constructing a ground truth dataset that enables controlled comparison of retrieval paradigms across realistic IoT attack scenarios.
Figure 1 illustrates the end-to-end data pipeline from raw network captures (CIC-IoT2023 dataset) through flow extraction and expert validation to the final ground truth scenarios. Details about the pipeline are explained in the following sections.
3.1.1. Source Data: CIC-IoT2023
We use the CIC-IoT2023 dataset [
42], a comprehensive IoT intrusion detection benchmark comprising raw PCAP files capturing traffic from 105 IoT devices under 33 attack scenarios across 7 categories: DDoS, DoS, Reconnaissance, Web-based, Brute Force, Spoofing, and Backdoor. The dataset provides labeled network flows with 97 statistical features extracted at 5 s windows, including packet counts, byte volumes, inter-arrival times, TCP flags, and protocol indicators. All PCAP files are publicly available through the Canadian Institute for Cybersecurity website.
3.1.2. GNN4ID Pipeline: From PCAP to Structured Flow Data
To transform raw PCAP files into the structured flow dataset used in our experiments, we employed the GNN4ID (Graph Neural Network for Intrusion Detection) pipeline taken from the work of Farrukh et al. [
10].
Figure 2 illustrates the complete flow from PCAP ingestion to the final CSV outputs used in our experiments.
Step 1. Flow and Packet-Level Feature Extraction. Each PCAP file is processed using NFStreamer [43], a high-performance network flow analysis library, with a custom plugin (My_Custom) that extends the standard flow extraction to capture per-packet metadata. The plugin hooks into the on_init method (flow creation) and on_update method (each subsequent packet) to record a per-flow list of: payload hex data (payload_data), inter-packet arrival times (delta_time), packet direction (forward/reverse), IP packet size (ip_size), transport layer size (transport_size), payload size (payload_size), and all TCP flags (SYN, CWR, ECE, URG, ACK, PSH, RST, FIN). Extraction is invoked via a special pipeline that accepts a PCAP file path and an output directory argument, returning CSV files containing both the 97 standard NFStreamer flow features and the supplementary per-packet data columns.
Step 2. Temporal Feature Engineering. The extracted CSV files are enhanced with additional temporal features using a script that implements rolling-window aggregation over the preceding 350 flows (configurable), grouped by source-destination IP pair and by destination IP alone. The engineered features include:
Rolling SYN/ACK/RST/FIN/PSH sums: Counts of TCP flag occurrences in the window, computed per source-destination pair and per destination.
Rolling UDP/ICMP/DNS request counts: Protocol-specific activity indicators.
Rolling unique port counts: Number of distinct destination ports contacted within the window, capturing scanning behavior.
Rolling vulnerable port access: Binary indicator tracking access to commonly targeted ports (20, 21, 22, 23, 25, 53, 80, 110, 143, 443, 445, 3389, 8080).
Rolling average duration: Mean flow duration within the window per destination, capturing connection persistence.
Packet size variation: Standard deviation across min/max packet sizes from both directions, capturing payload irregularity.
These temporal features are critical for distinguishing sustained attack behavior (e.g., a flood that concentrates SYN packets on one destination, or a scan in which a single source contacts many ports) from spiky but benign traffic, and they directly inform the compliance reasoning used in our experiments. The script sorts flows chronologically by bidirectional_first_seen_ms before window computation to ensure temporal causality.
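The rolling-window idea can be sketched in a few lines of plain Python. This is a minimal illustration and not the GNN4ID script itself. The field names `src_ip`, `dst_ip`, `dst_port`, and `syn_packets` are assumed for the example, and the window is assumed to cover only the flows that precede the current one within the same source-destination pair.

```python
from collections import defaultdict, deque


def add_rolling_features(flows, window=350):
    """Return flows in chronological order with two rolling features added.

    Assumption: each feature is computed over the preceding `window` flows
    of the same (src_ip, dst_ip) pair, excluding the current flow.
    """
    # Sort first so that a window never contains flows from the future.
    ordered = sorted(flows, key=lambda f: f["bidirectional_first_seen_ms"])
    history = defaultdict(lambda: deque(maxlen=window))
    enriched = []
    for flow in ordered:
        past = history[(flow["src_ip"], flow["dst_ip"])]
        enriched.append({
            **flow,
            # Rolling SYN sum over the window (hypothetical column name).
            "rolling_syn_sum": sum(p["syn_packets"] for p in past),
            # Number of distinct destination ports seen in the window.
            "rolling_unique_ports": len({p["dst_port"] for p in past}),
        })
        past.append(flow)  # deque drops the oldest flow once full
    return enriched
```

Grouping by destination IP alone would follow the same pattern with `flow["dst_ip"]` as the key.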
Step 3. Data preprocessing. The preprocessed CSVs are split 80/20 into training and test sets. For attack classes, only flows where the attacker’s MAC address (identified from the CIC-IoT2023 topology documentation) appears as either source or destination MAC are retained, ensuring that non-attack background traffic is excluded. For benign traffic, flows containing attacker MAC addresses are removed to eliminate contaminated samples. This train-test separation follows standard machine learning practice to prevent data leakage and ensure generalization. The training set (df_class_train.csv, N = 102,228 flows) is used exclusively for GNN model training to learn attack pattern representations and optimize network weights. The test set (df_class_test.csv, N = 25,557 flows) contains unseen flows reserved for faithfulness evaluation. Our ground truth compliance scenarios reference flows from the test set only, ensuring that explanation quality is measured on data the model has never encountered during training. This prevents overfitting artifacts from contaminating faithfulness assessments and ensures that attribution scores reflect genuine model reasoning rather than memorization.
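The MAC-based filtering and the 80/20 split described above might look roughly as follows. This is a sketch under stated assumptions: the column names `src_mac` and `dst_mac`, the random shuffle, and the fixed seed are all choices made for the example, and the original pipeline may split differently.

```python
import random


def filter_by_attacker_mac(flows, attacker_macs, is_attack_class):
    """Keep attacker flows for attack classes; drop them for benign traffic."""
    macs = set(attacker_macs)

    def involves_attacker(flow):
        return flow["src_mac"] in macs or flow["dst_mac"] in macs

    if is_attack_class:
        return [f for f in flows if involves_attacker(f)]
    return [f for f in flows if not involves_attacker(f)]


def split_train_test(flows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the flows and cut it into train and test parts."""
    shuffled = list(flows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```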
Step 4. Graph-Centric Class Balancing. The original dataset exhibits severe class imbalance, with Backdoor attacks represented by only 1090 flows compared to over 67,000 DDoS flows. To address the imbalance, the preprocessing follows the GNN4ID workflow. A target of 20,000 samples per class is established for the training set. Classes falling below this threshold, such as Backdoor and BruteForce, are upsampled using SMOTE-inspired augmentation or random duplication of flow-packet graph structures to prevent the model from neglecting minority attack signatures. Conversely, majority classes like DDoS and DoS are downsampled to 20,000 to maintain a balanced gradient during training. Unlike standard tabular models, the test set is kept partially imbalanced (capped at 4000 samples per class) to provide a more realistic evaluation of the NIDS’s precision. Critically, while raw IP addresses and MAC addresses are removed to ensure model generalization, packet-level payload bytes (up to 1500 features) are preserved as attributes for packet_nodes, which are then linked to flow_nodes (82 statistical features) to form the heterogeneous graph objects required for the graph model.
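As a rough illustration of the balancing step, the sketch below downsamples majority classes and upsamples minority classes by random duplication until each reaches a common target. It covers only the random-duplication variant mentioned above, not the SMOTE-inspired augmentation, and the `label` key and seed are assumptions made for the example.

```python
import random
from collections import defaultdict


def balance_classes(samples, target=20000, label_key="label", seed=0):
    """Return a list with exactly `target` samples per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample in samples:
        by_class[sample[label_key]].append(sample)

    balanced = []
    for items in by_class.values():
        if len(items) >= target:
            # Majority class: draw `target` samples without replacement.
            balanced.extend(rng.sample(items, target))
        else:
            # Minority class: keep everything, then duplicate at random.
            extra = [rng.choice(items) for _ in range(target - len(items))]
            balanced.extend(items + extra)
    return balanced
```

Capping the test set (at 4000 samples per class in the text) would reuse only the downsampling branch.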
3.1.3. Expert Validation of Ground Truth Scenarios
To establish authoritative ground truth, we conducted a structured expert validation exercise on all 30 compliance scenarios. Two independent expert auditors, each with over five years of experience in network security analysis and network forensics, independently reviewed each scenario to verify: (1) the correctness of the attack evidenced in the traffic flow and/or payload; (2) the correctness of the attack class to ETSI provision mapping; (3) the logical consistency between flow feature values and the attack verdict. The validation process used a standardized assessment form that documented the attack category, GT identifiers, and primary ETSI provision for each scenario group.
The 30 scenarios span eight scenario groups mapped to seven ETSI EN 303 645 provisions: BruteForce Dictionary (GT-08, GT-24 → Provision 5.1), XSS/SQLi/Uploading (GT-01, 02, 03, 16, 18, 19 → Provision 5.13), Host Discovery/Reconnaissance (GT-04, 10, 20, 27 → Provision 5.6), DNS/ARP Spoofing (GT-05, 07, 11, 12, 21, 23 → Provision 5.5), DoS/DDoS (GT-09, 14, 15, 25, 26, 30 → Provision 5.9), Session Hijacking (GT-13, 28 → Provision 5.5), C2 Data Exfiltration (GT-17, 29 → Provision 5.10), and Mirai Botnet/Malware (GT-06, 22 → Provision 5.7). This distribution ensures comprehensive coverage of the IoT threat landscape relevant to compliance auditing.
Inter-rater agreement between the two experts was assessed using Cohen’s $\kappa$ (95% CI: 0.81–1.03), indicating almost perfect agreement according to Landis and Koch’s interpretation guidelines [44]. This validation process ensures that the ground-truth mappings between network evidence and ETSI provisions are technically accurate and reproducible.
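For readers unfamiliar with the statistic, Cohen’s kappa compares the observed agreement $p_o$ between two raters with the agreement $p_e$ expected by chance, $\kappa = (p_o - p_e)/(1 - p_e)$. The sketch below computes it from two label lists; the labels in the usage note are illustrative and are not the auditors’ actual ratings.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["5.1", "5.5", "5.5", "5.9"], ["5.1", "5.5", "5.9", "5.9"])` has observed agreement 0.75 and chance agreement 0.3125.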
3.1.4. Ground Truth Scenarios
The expert-validated PCAP analysis informed the construction of 30 ground-truth compliance scenarios stored in compliance_ground_truth.json. Each scenario is a structured JSON object containing:
id: A unique identifier (GT-01 through GT-30).
pcap_scenario: A textual summary of the flow characteristics (e.g., “BruteForce-Dictionary attack. src2dst_packets = 4, Rolling_SYN = 0, unique_ports = 235”).
etsi_provision: The ETSI EN 303 645 provision being tested (5.1, 5.5, 5.6, 5.7, 5.9, 5.13, or 5.10/6.1).
question: A natural-language compliance question framed from an auditor’s perspective.
ground_truth: The authoritative compliance answer citing both traffic features and the ETSI provision violation.
attack_class: The CIC-IoT2023 attack category.
cic_iot2023_label: The numeric class label from the dataset.
strategy_b_features: Key flow-level features used across all three retrieval strategies (Rolling_SYN_Sum, Rolling_ACK_Sum, Rolling_UDP_Sum, src2dst_packets, src2dst_bytes, bidirectional_mean_ps, packet_size_variation, Unique_Ports).
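A small validator makes the record structure concrete. The sketch assumes that `strategy_b_features` is stored as a mapping from feature name to value, which the text does not state explicitly. `validate_scenario` is a hypothetical helper and not part of the published repository.

```python
import re

REQUIRED_KEYS = {
    "id", "pcap_scenario", "etsi_provision", "question", "ground_truth",
    "attack_class", "cic_iot2023_label", "strategy_b_features",
}
FEATURE_KEYS = {
    "Rolling_SYN_Sum", "Rolling_ACK_Sum", "Rolling_UDP_Sum", "src2dst_packets",
    "src2dst_bytes", "bidirectional_mean_ps", "packet_size_variation",
    "Unique_Ports",
}
ID_PATTERN = re.compile(r"GT-(0[1-9]|[12][0-9]|30)")  # GT-01 through GT-30


def validate_scenario(scenario):
    """Return a list of problems; an empty list means the record is well formed."""
    problems = []
    missing = REQUIRED_KEYS - scenario.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if not ID_PATTERN.fullmatch(str(scenario.get("id", ""))):
        problems.append("id must be GT-01 through GT-30")
    features = scenario.get("strategy_b_features") or {}
    missing_features = FEATURE_KEYS - set(features)
    if missing_features:
        problems.append(f"missing features: {sorted(missing_features)}")
    return problems
```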
The attack class-to-ETSI provision mapping follows established cybersecurity principles. Brute Force attacks map to Provision 5.1 because password guessing exploits weak or default credentials, directly violating the requirement for unique device passwords. Web-based attacks (SQL injection, XSS, and uploading) map to Provision 5.13 as they exploit a failure to properly validate input data. Spoofing attacks (ARP, DNS, and session hijacking) map to Provision 5.5 because identity manipulation undermines secure communication and data confidentiality requirements. Reconnaissance activities (port scanning, OS fingerprinting) map to Provision 5.6 as they reveal unnecessarily exposed attack surfaces. Finally, DoS/DDoS attacks map to Provision 5.9 regarding system resilience against outages, while malware like Mirai maps to Provision 5.7 because unauthorized code execution compromises software integrity.
Table 2 summarizes these mappings.
3.2. Retrieval Strategies
We compare three retrieval paradigms, each transforming network flows into contexts for LLM generation. The paradigms span a spectrum from fully transparent rule-based logic to learned semantic representations to structured ontological reasoning.
3.2.1. Rule-Based Heuristic Scoring
Inspired by NetTraceAgentix [
45], this strategy computes an anomaly score using hand-crafted weights adapted from packet-level to flow-level features. The scoring function is defined as:
where:
$w_1$, $w_2$, $w_3$, $w_4$: Calibrated weights for SYN flood, ACK flood, UDP flood, and port scanning indicators, respectively (values: 0.4, 0.3, 0.2, 0.1)
$\mathbb{1}[\cdot]$: Indicator function that equals 1 when the condition is true, 0 otherwise
SYN, ACK, UDP: Rolling counts of TCP SYN flags, TCP ACK flags, and UDP packets in the 350-flow window
Unique_Ports: Number of distinct destination ports contacted within the window
Thresholds (300 for SYN/ACK, 200 for UDP): Empirically validated from 5 representative samples per attack class
The weights w were calibrated on 5 representative samples covering each attack class to ensure discriminative power while maintaining interpretability. Retrieved contexts consist of the top-5 flows by score, serialized in a human-readable format that explicitly states the attack type, anomaly score, and contributing features. For example: “Attack detected: BruteForce. Anomaly score: 49.2. Top features: Rolling_SYN_Sum = 359, Rolling_ACK_Sum = 4327, Rolling_UDP_Sum = 0, src2dst_packets = 8, Unique_Ports = 1.”
This approach provides complete transparency and requires no embedding model, making it computationally efficient and auditable. However, it may overlook semantic relationships between flows and provisions that require contextual understanding beyond numerical thresholds.
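To make the retrieval step concrete, the sketch below shows one possible reading of a weighted-indicator score followed by top-5 selection. It is an assumption-laden illustration, not the authors’ scoring function. The weights and the SYN/ACK/UDP thresholds come from the text, but the port-scan threshold `PORT_SCAN_THRESHOLD` is hypothetical, and the resulting scores are not expected to match the scale of the example score quoted above.

```python
WEIGHTS = {"syn": 0.4, "ack": 0.3, "udp": 0.2, "ports": 0.1}
THRESHOLDS = {"syn": 300, "ack": 300, "udp": 200}
PORT_SCAN_THRESHOLD = 10  # hypothetical; the text gives no port threshold


def anomaly_score(flow, port_threshold=PORT_SCAN_THRESHOLD):
    """Sum the weights of every indicator whose threshold is exceeded."""
    indicators = {
        "syn": flow["Rolling_SYN_Sum"] > THRESHOLDS["syn"],
        "ack": flow["Rolling_ACK_Sum"] > THRESHOLDS["ack"],
        "udp": flow["Rolling_UDP_Sum"] > THRESHOLDS["udp"],
        "ports": flow["Unique_Ports"] > port_threshold,
    }
    return sum(WEIGHTS[name] for name, fired in indicators.items() if fired)


def top_k_flows(flows, k=5):
    """Return the k highest-scoring flows, best first."""
    return sorted(flows, key=anomaly_score, reverse=True)[:k]
```

Because every term is a thresholded indicator with a fixed weight, an auditor can trace any score back to the exact conditions that fired.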
3.2.2. Vector RAG (Dense Embeddings)
Flows and ETSI provisions are embedded into a 768-dimensional continuous space using nomic-embed-text via Ollama. Each flow is serialized as a natural-language description that contextualizes numerical features within attack narratives. For instance: “Attack type: BruteForce. Flow characteristics: 8 packets, mean size 40.0 bytes, SYN count 359. Relevant ETSI provision: 5.1.” This serialization bridges structured numerical data with the linguistic domain where embedding models excel.
Embeddings are stored in ChromaDB, a vector database optimized for similarity search. Retrieval uses cosine similarity in the embedding space to identify the top-5 flows most semantically related to each compliance query. The document corpus includes 30 ground-truth flows plus ETSI provision texts, totaling 35 documents. This approach captures latent semantic relationships that may not be apparent from feature values alone, but it sacrifices the determinism of rule-based scoring.
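The two mechanical steps of this strategy, serialization and cosine-similarity ranking, can be sketched without any external service. The code below deliberately does not call Ollama or ChromaDB; it takes embedding vectors as plain lists, and the function names are hypothetical.

```python
import math


def serialize_flow(attack_type, packets, mean_size, syn_count, provision):
    """Render a flow as the natural-language description used for embedding."""
    return (
        f"Attack type: {attack_type}. "
        f"Flow characteristics: {packets} packets, mean size {mean_size:.1f} bytes, "
        f"SYN count {syn_count}. Relevant ETSI provision: {provision}."
    )


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_documents(query_vec, corpus, k=5):
    """Rank document ids in `corpus` (id -> vector) by similarity to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in corpus.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]
```

In a real deployment the vectors would come from the embedding model and the ranking from the vector database; the arithmetic is the same.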
3.2.3. Graph RAG (Knowledge Graph Traversal)
We construct a Neo4j knowledge graph with three node types: Flow, AttackClass, and Provision, connected by two relationship types that encode domain ontology, as illustrated in Figure 3.
Each Flow node stores the 8 key features as properties, AttackClass nodes represent the 8 threat categories, and Provision nodes contain ETSI requirement texts. Retrieval uses LightRAG [46], which combines local traversal for entity-focused queries (e.g., retrieving flows matching specific feature patterns) with global traversal for relationship-focused queries (e.g., identifying multi-hop paths such as Flow → BruteForce → Provision 5.1).
The edges explicitly encode compliance violations. Retrieved contexts include both the flow features and the ontological path, enabling the LLM to reason about compliance relationships deterministically. This structured representation preserves the logical semantics of regulatory mapping while maintaining traceability through graph traversal paths.
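The multi-hop path Flow → AttackClass → Provision can be illustrated with an in-memory adjacency list. This sketch uses neither Neo4j nor LightRAG, and the relationship names `CLASSIFIED_AS` and `VIOLATES` are hypothetical labels for the two relationship types. The two example edges follow the scenario groups listed in Section 3.1.3.

```python
from collections import defaultdict

EDGES = [
    ("GT-08", "CLASSIFIED_AS", "BruteForce"),
    ("BruteForce", "VIOLATES", "Provision 5.1"),
    ("GT-09", "CLASSIFIED_AS", "DoS/DDoS"),
    ("DoS/DDoS", "VIOLATES", "Provision 5.9"),
]


def build_graph(edges):
    """Turn (source, relation, target) triples into an adjacency mapping."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return dict(graph)


def compliance_paths(graph, start, max_hops=2):
    """Enumerate paths from `start`, alternating node and relation names."""
    paths = []

    def walk(node, path, hops):
        neighbours = graph.get(node, [])
        if hops == max_hops or not neighbours:
            paths.append(path)  # reached the hop limit or a leaf node
            return
        for relation, target in neighbours:
            walk(target, path + [relation, target], hops + 1)

    walk(start, [start], 0)
    return paths
```

Each returned path doubles as an audit trace: it lists every node and relationship that links a flow to the provision it is said to violate.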
3.3. LLM Generation
To isolate the influence of retrieval strategies from inherent model biases, three LLMs were evaluated under identical prompting conditions: DeepSeek-R1-8B represents the “Reasoning” category (Chain-of-Thought), Qwen-2.5-7B represents a highly optimized generalist model, and Llama-3.2-3B represents the “Edge AI” category, testing whether inference is feasible on a gateway-class device. The use of local inference via the Ollama framework provides a twofold benefit: first, it mirrors a confidentiality-preserving deployment in which network metadata and compliance documents remain within the local security perimeter; second, it supports experimental reproducibility by eliminating the variability and “model drift” associated with cloud-based APIs.
The prompt template follows a standardized structure to ensure fair comparison. It presents the compliance question, followed by retrieved evidence contexts, followed by ETSI provision definitions, and concludes with an instruction to provide a compliance verdict based solely on the evidence. The template is: “Question: {compliance_question}. Retrieved Evidence: {retrieved_contexts}. ETSI EN 303 645 Provisions: {etsi_provision_texts}. Based on the evidence, provide a compliance verdict.”
Each of the 30 scenarios is evaluated with each LLM and each retrieval strategy. This yields 30 scenarios multiplied by 3 LLMs multiplied by 3 strategies, totaling 270 evaluation runs. Temperature is set to 0.7 for all local models to balance determinism with response diversity. Lower temperatures (0.3–0.5) were also tried, but they produced shorter responses that are problematic for RAGAS evaluation.
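The prompt template and the resulting evaluation grid can be written down directly. In this sketch the strategy identifiers and the choice to join multiple contexts with a space are assumptions; the template text and model names are taken from the description above.

```python
from itertools import product

PROMPT_TEMPLATE = (
    "Question: {compliance_question}. "
    "Retrieved Evidence: {retrieved_contexts}. "
    "ETSI EN 303 645 Provisions: {etsi_provision_texts}. "
    "Based on the evidence, provide a compliance verdict."
)
MODELS = ["DeepSeek-R1-8B", "Qwen-2.5-7B", "Llama-3.2-3B"]
STRATEGIES = ["rule_based", "vector_rag", "graph_rag"]  # hypothetical identifiers
SCENARIO_IDS = [f"GT-{i:02d}" for i in range(1, 31)]


def build_prompt(question, contexts, provisions):
    """Fill the shared template so that every model sees the same structure."""
    return PROMPT_TEMPLATE.format(
        compliance_question=question,
        retrieved_contexts=" ".join(contexts),
        etsi_provision_texts=" ".join(provisions),
    )


def evaluation_grid():
    """Every (scenario, model, strategy) combination: 30 x 3 x 3 = 270."""
    return list(product(SCENARIO_IDS, MODELS, STRATEGIES))
```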
3.4. RAGAS Evaluation
We use RAGAS [20] with GPT-4o-mini as the judge LLM to compute five metrics that operationalize different dimensions of explanation quality. The metrics are defined over four fundamental components: question ($Q$), ground truth ($G$), generated answer ($A$), and retrieved context ($C$). Together they provide quantifiable measures of RAG system reliability:
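Context Precision measures the proportion of relevant items in the retrieved context relative to the question, calculated as:
$$\text{Context Precision@}K = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\sum_{k=1}^{K} v_k}$$
where:
$K$: Total number of retrieved context chunks
$k$: Index of each retrieved chunk ($k = 1, \dots, K$)
$v_k$: Binary relevance indicator for the k-th chunk (1 if relevant to Q, 0 otherwise)
$\text{Precision@}k$: Precision computed at rank position k, defined as $\text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}$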
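A short worked example may help here. The sketch below computes Context Precision from a ranked list of binary relevance flags, following the standard RAGAS definition in which precision at each rank is averaged over the relevant positions. In RAGAS itself the flags come from the judge LLM; here they are simply passed in.

```python
def context_precision_at_k(relevance):
    """Average of Precision@k over the ranks k where the chunk is relevant.

    `relevance` is the list of 0/1 flags for the retrieved chunks, in rank order.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # nothing relevant was retrieved
    hits = 0
    weighted = 0.0
    for k, v in enumerate(relevance, start=1):
        hits += v                    # true positives among the top k chunks
        weighted += (hits / k) * v   # Precision@k counts only at relevant ranks
    return weighted / total_relevant
```

With flags `[1, 0, 1]`, the relevant ranks are 1 and 3, giving Precision@1 = 1 and Precision@3 = 2/3, so the score is (1 + 2/3) / 2 = 5/6. Placing relevant chunks earlier in the ranking raises the score.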
Context Recall evaluates how well the retrieved context $C$ captures all relevant information from the ground truth $G$ needed to answer the question:
$$\text{Context Recall} = \frac{|G \cap C|}{|G|}$$
where:
$|G|$: Cardinality (count) of unique statements in the ground truth answer
$|C|$: Cardinality (count) of unique statements in the retrieved context
$|G \cap C|$: Count of ground truth statements that are verifiable in the retrieved context
$|\cdot|$: Set cardinality operator (counts the number of elements)
Faithfulness measures whether the generated answer $A$ is consistent with the retrieved context $C$, preventing hallucination:
$$\text{Faithfulness} = \frac{|S_C|}{|S_A|}$$
where:
$S_A$: Set of atomic factual statements extracted from the generated answer
$S_C$: Subset of claims that can be logically entailed from the retrieved context
$|\cdot|$: Set cardinality operator (counts the number of elements)
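Both Context Recall and Faithfulness reduce to the same arithmetic once the judge LLM has issued a verdict per statement: the share of statements marked as supported. The sketch below assumes those verdicts are already available as booleans; the statement extraction and entailment checks performed by RAGAS are not reproduced.

```python
def supported_fraction(verdicts):
    """Share of statements whose verdict is True."""
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("at least one statement is required")
    return sum(bool(v) for v in verdicts) / len(verdicts)


def context_recall(ground_truth_statement_found):
    """One flag per ground-truth statement: is it verifiable in the context?"""
    return supported_fraction(ground_truth_statement_found)


def faithfulness(answer_claim_supported):
    """One flag per claim in the answer: is it entailed by the context?"""
    return supported_fraction(answer_claim_supported)
```

For instance, an answer with four atomic claims of which two are entailed by the retrieved context scores `faithfulness([True, False, True, False]) == 0.5`.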
Answer Relevancy assesses how well the generated answer $A$ addresses the original question $Q$:
$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\left(E_{q_i}, E_q\right)$$
where:
$N$: Number of artificially generated questions (default $N = 3$)
$E_q$: Embedding vector of the original question
$E_{q_i}$: Embedding vector of the i-th artificial question generated from answer A
$\cos(\cdot, \cdot)$: Cosine similarity function between two embedding vectors
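The Answer Relevancy computation is a mean of cosine similarities, sketched below. The embeddings are passed in as plain lists; in practice they would come from an embedding model, and the questions would be generated from the answer by the judge LLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_relevancy(question_embedding, generated_question_embeddings):
    """Mean similarity between the original and the regenerated questions."""
    if not generated_question_embeddings:
        raise ValueError("at least one generated question is required")
    similarities = [cosine(question_embedding, e) for e in generated_question_embeddings]
    return sum(similarities) / len(similarities)
```

An answer that drifts off topic yields regenerated questions that are far from the original in embedding space, which lowers the mean.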
Answer Correctness evaluates the factual accuracy of the generated answer by comparing it against the ground truth, combining both semantic similarity and factual alignment:
$$\text{Answer Correctness} = \alpha \cdot F_1(A, G) + \beta \cdot \text{Sim}(A, G)$$
where:
$F_1(A, G)$: Token-level score (harmonic mean of precision and recall) between answer A and ground truth G
$\text{Sim}(A, G)$: Semantic similarity score based on embedding-based approaches (range: 0–1)
$\alpha$, $\beta$: Weighting parameters to balance precision and semantic coherence
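A minimal sketch of the weighted combination follows. It assumes whitespace tokenization and case folding for the token-level score, and it takes the semantic similarity as a precomputed number. The weights are required arguments because no particular values are fixed here.

```python
from collections import Counter


def token_f1(answer, truth):
    """Harmonic mean of token precision and recall between two strings."""
    answer_tokens = answer.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # occurs in both texts.
    overlap = sum((Counter(answer_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_correctness(answer, truth, semantic_similarity, alpha, beta):
    """Weighted sum of token-level F1 and a precomputed similarity in [0, 1]."""
    return alpha * token_f1(answer, truth) + beta * semantic_similarity
```

For the pair "provision 5.1 violated" and "provision 5.1 is violated", precision is 1 and recall is 3/4, so the token-level score is 6/7.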
RAGAS Faithfulness is the primary metric for evaluating the accuracy–faithfulness gap, as it directly measures hallucination risk by verifying statement-level grounding. A faithfulness score of 0.5 indicates that half of the LLM’s claims are unsupported by retrieved evidence, which represents a critical audit trail deficiency. While surface-level metrics such as BLEU and F1 score measure lexical overlap between generated and reference text, and BERTScore captures semantic similarity via contextual embeddings, none of these verify whether individual claims are grounded in the retrieved context. A response may achieve high BLEU or BERTScore by reproducing provision language verbatim while still fabricating the supporting evidence, which is precisely the failure mode that matters most in compliance auditing. RAGAS Faithfulness addresses this gap directly by decomposing each response into atomic claims and checking each against the retrieved context, making it the appropriate metric for audit trail reliability assessment.
3.5. Data Filtering and Statistical Analysis
RAGAS evaluation relies on an internal LLM-as-judge cycle in which a separate LLM evaluates each generated response against the retrieved context. This judgment cycle does not always produce a valid result: the judge LLM may return a null score, fail to parse the response structure, or time out, resulting in missing metric values for that evaluation instance [
20]. Consequently, the number of valid scored evaluations is consistently lower than the number of QA pairs attempted.
Statistical analysis follows standard procedures for between-group comparisons [
47]. One-way ANOVA tests whether the mean RAGAS Faithfulness scores differ significantly across the three retrieval strategies by partitioning total variance into between-method variance and within-method variance arising from LLM and scenario variability [
48]. A statistically significant ANOVA result (
) indicates that at least one method differs from the others, but does not identify which pair. Pairwise Welch’s
t-tests are therefore conducted for each method pair (Graph RAG vs. Rule-based, Graph RAG vs. Vector RAG, Rule-based vs. Vector RAG), with Bonferroni correction applied to control the family-wise error rate at
[
49].
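The test statistics behind this procedure can be sketched with the standard library alone. The scores below are hypothetical and are not the study's data; the family-wise level of 0.05 is an assumed value, and converting the statistics to p-values would additionally require the F and t distributions (for example from a statistics package).

```python
from itertools import combinations
from statistics import mean, variance


def one_way_anova_f(groups: dict[str, list[float]]) -> float:
    """F statistic: between-group mean square over within-group mean square."""
    all_scores = [x for g in groups.values() for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


scores = {  # hypothetical faithfulness scores, not the study's data
    "Graph RAG": [0.6, 0.7, 0.5, 0.8],
    "Rule-based": [0.5, 0.4, 0.6, 0.7],
    "Vector RAG": [0.3, 0.6, 0.5, 0.4],
}
f_stat = one_way_anova_f(scores)
pairs = list(combinations(scores, 2))
alpha_per_test = 0.05 / len(pairs)  # Bonferroni; 0.05 is an assumed family-wise level
welch = {pair: welch_t(scores[pair[0]], scores[pair[1]]) for pair in pairs}
```

Welch's variant is used for the pairwise step because it does not assume equal variances or equal group sizes, which matters when the number of valid evaluations differs between methods.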
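The Coefficient of Variation (CV) measures relative dispersion of faithfulness scores within each retrieval strategy across LLMs and attack classes [50], computed as:
$$\text{CV} = \frac{\sigma}{\mu} \times 100\%$$
where:
$\sigma$: Standard deviation of the faithfulness scores
$\mu$: Mean of the faithfulness scores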
A lower CV indicates more consistent performance independent of the absolute score level, which is particularly relevant for auditing workflows requiring predictable explanation quality across diverse IoT attack scenarios.
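As a worked example, the sketch below computes the CV from the per-LLM mean faithfulness values that Section 4.2 reports for Rule-based retrieval (0.486, 0.582, and 0.522). Using the sample standard deviation is an assumption; with it, the result lands close to the 9.1% reported for Rule-based retrieval, with the small difference plausibly due to rounding of the inputs.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores: list[float]) -> float:
    """Sample standard deviation divided by the mean, as a percentage."""
    return stdev(scores) / mean(scores) * 100


# Per-LLM mean faithfulness for Rule-based retrieval as quoted in Section 4.2
# (DeepSeek-R1, Qwen-2.5, Llama-3.2).
rule_based_by_llm = [0.486, 0.582, 0.522]
cv_rule_based = coefficient_of_variation(rule_based_by_llm)
```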
3.6. Proof-of-Concept Prototype: SACA
To demonstrate the feasibility of auditable compliance reporting, we developed SACA (Security Audit Compliance Agent), a proof-of-concept web application that integrates Graph RAG with interactive Neo4j visualization.
Figure 4 illustrates the complete system architecture, showing the data flow from raw network captures through knowledge graph construction to compliance verdict generation with continuous RAGAS evaluation.
The pipeline begins at the Edge Layer, where Security Policy documents and network traffic from IoT devices are ingested into the system. The PCAP Processor extracts flow-level features and constructs the initial data representation. These flows are then serialized according to the chosen retrieval strategy and stored in their respective backends: heuristic scoring databases for Rule-based retrieval, ChromaDB for Vector RAG embeddings, or Neo4j for Graph RAG knowledge graphs. When a compliance query is issued through the User Interface, the system retrieves relevant contexts using the selected strategy, passes them to the LLM Generator for natural-language verdict synthesis, and presents results through an interactive dashboard. The Evaluation component continuously monitors faithfulness using RAGAS metrics with a judge LLM, flagging low-confidence outputs for human review. This architecture operationalizes the experimental findings by integrating Graph RAG with human-in-the-loop oversight, ensuring that automated compliance assessments maintain auditable evidence chains from traffic observations to regulatory verdicts.
4. Results
This section presents the experimental results across five dimensions: overall method comparison, LLM-specific performance, precision-recall trade-offs, attack-class robustness, and statistical significance testing. Results are organized to first establish high-level patterns before examining granular sources of variation.
4.1. Retrieval Strategy Comparison
Figure 5 shows mean RAGAS Faithfulness, Context Precision, and Context Recall across the three retrieval strategies. Graph RAG achieves the highest faithfulness (0.570), followed by Rule-based (0.524) and Vector RAG (0.509), though overlapping confidence intervals indicate substantial variance. Graph RAG also achieves near-perfect Context Precision (0.996) compared to Vector RAG (0.856) and Rule-based (0.814), meaning virtually all its retrieved flows are relevant to the compliance question. All three methods show similarly low Context Recall (none exceeding 0.224), suggesting that flow-level features impose a fundamental coverage ceiling regardless of retrieval strategy. Furthermore, Graph RAG leads on Answer Relevancy (0.645), while Vector RAG leads on Context Recall (0.224) and Answer Correctness (0.399). Averaged across all five metrics, Graph RAG achieves the highest composite score (0.552), ahead of Vector RAG (0.513) and Rule-based (0.476).
Graph RAG and Vector RAG exhibit complementary strengths: Vector RAG’s higher Context Recall reflects broader semantic coverage, while Graph RAG’s near-perfect Precision with moderate Recall (0.189) is more favorable for auditing workflows where irrelevant retrieved flows increase auditor review burden. Graph RAG’s consistent top-two performance across all five metrics reinforces the methodological robustness observed in the LLM-agnostic and attack-class-agnostic analyses.
Statistical testing via one-way ANOVA reveals that faithfulness differences are not statistically significant at
, with
and
. This non-significance is explored in depth in
Section 5.
Table 3 summarizes the complete statistical profile for all three methods across all five RAGAS metrics. From 120 QA evaluations attempted per retrieval strategy, the final filtered dataset comprises 77 valid evaluations for Rule-based retrieval, 75 for Vector RAG, and 112 for Graph RAG after removing instances where any RAGAS metric returned a null value.
The table reveals that Context Precision achieves statistical significance (), with Graph RAG’s near-perfect score substantially exceeding both alternatives. Context Recall also achieves significance (), though with Vector RAG outperforming Graph RAG (0.224 vs. 0.189). All other metrics, including the primary metric of faithfulness, show non-significant differences despite numerical variations.
Potential Bias from Data Filtering
We acknowledge that the differential filtering rates across methods introduce a potential source of bias that warrants careful examination. Graph RAG retains 93.3% of evaluations (112/120), whereas Rule-based retains only 64.2% (77/120) and Vector RAG retains 62.5% (75/120). The null evaluations are disproportionately concentrated in local LLM evaluations (DeepSeek-R1-8B and Llama-3.2-3B), where format compliance failures were more frequent for non-graph retrieval strategies. To assess the magnitude of this bias, we performed a worst-case sensitivity analysis [
51] in which all null evaluations are imputed with a faithfulness score of zero. Under this imputation, Graph RAG’s mean decreases by 0.038 (from 0.570 to 0.532), while Rule-based and Vector RAG decrease by 0.188–0.191 (to 0.336 and 0.318 respectively). This widens Graph RAG’s advantage from 0.045–0.061 in the filtered data to 0.196–0.214 in the imputed analysis.
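The imputed means follow directly from the filtered means and retention counts: scoring every null evaluation as zero leaves the sum of scores unchanged while the denominator grows to the 120 attempted evaluations. A short sketch using the figures reported in this section (the function and variable names are illustrative):

```python
def worst_case_mean(filtered_mean: float, n_valid: int, n_attempted: int) -> float:
    """Mean after scoring every null evaluation as zero."""
    return filtered_mean * n_valid / n_attempted


# (filtered mean faithfulness, valid evaluations) as reported in Section 4.1
reported = {
    "Graph RAG": (0.570, 112),
    "Rule-based": (0.524, 77),
    "Vector RAG": (0.509, 75),
}
imputed = {
    method: round(worst_case_mean(mu, n_valid, 120), 3)
    for method, (mu, n_valid) in reported.items()
}
```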
This asymmetry indicates that the differential filtering works against finding Graph RAG to be superior: by removing more low-quality responses from Rule-based and Vector RAG than from Graph RAG, the filtering artificially raises the apparent means of the weaker methods, narrowing the observed performance gap. In other words, the reported faithfulness differences between methods are likely an underestimate of the true difference.
Consequently, the non-significant ANOVA result () should be interpreted with the understanding that reduced sample sizes for Rule-based and Vector RAG diminish statistical power asymmetrically, and that the true population difference may be larger than observed. We retain the filtered analysis as our primary findings because imputing zeros conflates format compliance failures with genuine retrieval quality, but we encourage readers to interpret the non-significance within the context of this conservative filtering bias.
4.2. Faithfulness Consistency Across LLMs
Figure 6 decomposes faithfulness scores by LLM and method, revealing important consistency patterns that are obscured in aggregate statistics. Graph RAG shows consistent performance across LLMs with a coefficient of variation of 10.1%, while Vector RAG exhibits the highest variability at 21.5%. Rule-based scoring achieves the lowest CV at 9.1%, comparable to Graph RAG.
Examining individual LLM performance reveals heterogeneous patterns. DeepSeek-R1 achieves its highest faithfulness with Graph RAG (0.623 compared to 0.515 for Vector RAG and 0.486 for Rule-based). However, Qwen-2.5 and Llama-3.2 both achieve the highest faithfulness with Rule-based scoring (0.582 and 0.522 respectively), suggesting that the optimal retrieval strategy may be LLM-dependent. The low variance of Graph RAG across LLMs suggests that it produces more LLM-agnostic evidence, reducing sensitivity to model-specific reasoning patterns.
Table 4 presents the complete breakdown of faithfulness scores by LLM and method.
The coefficient of variation quantifies robustness to LLM choice, with lower values indicating more stable performance. Graph RAG’s CV of 10.1% suggests that switching between DeepSeek, Qwen, and Llama introduces only modest faithfulness variation, whereas Vector RAG’s CV of 21.5% indicates that LLM selection substantially impacts explanation quality. For production deployment where LLM availability and pricing may necessitate model switching, this consistency advantage is operationally valuable.
Potential Bias from Data Filtering
Rule-based and Graph RAG show comparably consistent performance across LLMs, with coefficients of variation of 9.1% and 10.1% respectively, both substantially lower than Vector RAG at 21.5%. However, Rule-based’s CV should be interpreted cautiously: its higher filtering rate (62–64% retention) disproportionately affects certain LLMs, leaving as few as valid evaluations for some models compared to –30 for Graph RAG. With such small per-LLM samples, a single outlier response can substantially shift the mean and artificially deflate the CV, making Rule-based appear more consistent than it truly is.
4.3. Precision-Recall Trade-Offs
Figure 7 visualizes the precision–recall relationship across retrieval methods, with bubble size displaying the Faithfulness score. The three systems occupy distinct regions of the precision–recall space, reflecting different context retrieval scores.
Graph RAG achieves near-perfect context precision (0.996) at a recall of 0.189, positioning it as the precision-dominant approach, retrieving a small but highly relevant evidence set with virtually no irrelevant context included. Vector RAG achieves the highest recall (0.224) at a precision of 0.856, reflecting a broader retrieval strategy that surfaces more available evidence at the cost of increased noise. Rule-based retrieval occupies the least favorable position, combining the lowest recall (0.124) with the lowest precision (0.814), suggesting that deterministic pattern matching is both conservative and imprecise relative to the learned retrieval methods.
From a compliance auditing perspective, these precision–recall differences carry distinct practical implications. Graph RAG’s near-perfect precision means that virtually every retrieved flow is relevant, reducing auditor workload by eliminating false positive evidence that requires manual dismissal. Vector RAG’s higher recall (0.224) at lower precision (0.856) reflects a quantity-over-quality retrieval pattern that casts a wider semantic net but introduces noise that auditors must scrutinize and discard. Rule-based scoring combines the lowest precision (0.814) with the lowest recall (0.124), indicating conservative retrieval that misses relevant contexts while still admitting noise, potentially leading to incomplete compliance assessments.
The universally low recall across all methods, with none exceeding 22.4%, indicates a fundamental limitation. Flow-level features alone appear insufficient to fully support ground-truth compliance answers, which often reference packet-level signatures (e.g., SQL injection payloads, malware beacon patterns) invisible in aggregated statistics. This limitation is explored in
Section 5.4.
4.4. Method Robustness
Figure 8 shows per-attack-class faithfulness scores across five groups, revealing method robustness across threat types. The five-class scheme reflects the ETSI provision structure: Provisions 5.1, 5.3, 5.5, and 5.6 map to distinct attack classes (Brute Force, Web-Based, Spoofing, and Reconnaissance respectively), while Provision 5.7 covers a mixed set of 10 scenarios spanning Backdoor, DDoS, DoS, and Benign flows, which we aggregate under the label Malware/DoS.
Graph RAG demonstrates the most consistent performance across attack classes with a CV of 33.7%, ranging from 0.261 (Spoofing) to 0.739 (Reconnaissance). Rule-based and Vector RAG exhibit higher variability with CVs of 38.4% and 42.1% respectively. The best-performing method varies by class. Rule-based scoring achieves the highest faithfulness on Brute Force attacks (0.763), likely because these attacks exhibit clear threshold-based signatures (e.g., high SYN counts) that align well with heuristic scoring. Vector RAG excels on Reconnaissance attacks (0.826), possibly because scanning patterns produce semantically coherent traffic descriptions that embed well. Graph RAG’s own strongest classes are Reconnaissance (0.739) and Malware/DoS (0.653), where multi-hop reasoning through the attack-provision graph may better capture causality chains spanning multiple flows.
Spoofing emerges as the most challenging class for both Rule-based (0.326) and Graph RAG (0.261), while Vector RAG handles Spoofing comparatively better (0.635). The practical implication of Graph RAG’s low CV is robustness to unknown attack distributions. In real-world auditing, the attack type is not known in advance, and an auditor cannot reliably predict whether incoming traffic will be Brute Force (where Rule-based excels) or Reconnaissance (where Vector RAG excels), but Graph RAG provides stable faithfulness regardless of threat category.
4.5. Statistical Significance
Figure 9 presents box plots of faithfulness score distributions across the three retrieval methods. Despite observable numerical differences in mean faithfulness, one-way ANOVA reveals no statistically significant differences between methods (
,
,
). All pairwise comparisons yield
after Bonferroni correction, with effect sizes below Cohen’s conventional small-effect threshold of
[
52], indicating negligible practical significance.
Table 5 summarises the complete pairwise comparison results.
Descriptive statistics reveal the source of non-significance. All three methods exhibit high within-group variance, with standard deviations ranging from 0.27 to 0.31 and interquartile ranges spanning 0.27 to 0.50. Rule-based and Vector RAG share an identical median of 0.500, reflecting convergent central tendency despite differing distributional shapes; Graph RAG’s higher median (0.646) relative to its mean (0.570) indicates a left-skewed distribution, visible in the extended lower whisker of
Figure 9, wherein a subset of LLM-scenario combinations produce markedly low faithfulness scores that depress the mean below the median. Distributions overlap substantially across all three methods, consistent with the non-significant ANOVA result.
The lack of statistical significance has important implications for interpreting the findings. While Graph RAG exhibits numerically higher mean faithfulness, this advantage cannot be confidently attributed to methodological superiority rather than sampling variation. The small sample sizes (75 to 112 valid evaluations per method) limit statistical power to detect true differences. This limitation is specifically addressed in
Section 5.4.
4.6. Prototype: SACA
To demonstrate that auditable compliance reporting is architecturally feasible with current technology, we developed SACA (Security Audit Compliance Agent), a proof-of-concept web application integrating Graph RAG with interactive knowledge graph visualization.
4.6.1. Key Features
Figure 10 and
Figure 11 show the SACA interface in operation. The Security Overview Dashboard (
Figure 10) presents a threat assessment with a security score computed by aggregating faithfulness-weighted provision violations. The interactive knowledge graph visualization displays the ontological structure with Flow nodes (green), AttackClass nodes (orange), and Provision nodes (red), allowing auditors to click on nodes to inspect properties and trace evidence paths. The recommendations panel provides actionable remediation guidance derived from violated provisions.
The Executive Report view (
Figure 11) generates audit-ready documentation suitable for auditor review. The report provides an executive summary translating technical findings, specific provision violations with supporting flow evidence, and recommendations for remediation prioritized by risk severity. Low-faithfulness outputs are flagged for manual review, where auditors can view per-verdict faithfulness scores computed via RAGAS and investigate flagged verdicts below threshold.
Human-in-the-Loop
SACA is designed around the human auditor’s workflow. First, the auditor uploads a PCAP file via the web interface, triggering the PCAP Processor to extract flows and construct the knowledge graph in Neo4j, with payload data made available to the analysis through vector embeddings. Second, the system generates automated compliance verdicts by querying the Graph RAG engine for evidence and calling the LLM Generator for each ETSI provision. Third, the auditor reviews the security overview dashboard to understand high-level threat patterns and identify critical violations. Fourth, the auditor interactively explores the knowledge graph to trace attack paths, such as following the chain from an Attacker IP through Port 8080 to an IoT Sensor device to a Provision 5.6 violation. Fifth, the auditor sets a configurable faithfulness threshold (for example, between 0.3 and 0.5), manually reviews the verdicts flagged below it, and either accepts the AI verdict with annotation or overrides it with corrected reasoning.
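The flagging step of this workflow can be sketched as a simple filter over per-verdict faithfulness scores. The `Verdict` class, the default threshold of 0.4, and the example scores are hypothetical illustrations, not SACA's actual data model.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    provision: str       # ETSI provision the verdict refers to
    faithfulness: float  # per-verdict RAGAS faithfulness score


def flag_for_review(verdicts: list[Verdict], threshold: float = 0.4) -> list[Verdict]:
    """Return verdicts whose faithfulness falls below the review threshold."""
    return [v for v in verdicts if v.faithfulness < threshold]


# Hypothetical verdicts for three provisions.
verdicts = [Verdict("5.1", 0.76), Verdict("5.5", 0.26), Verdict("5.6", 0.41)]
flagged = flag_for_review(verdicts)
```

Keeping the threshold as a parameter lets the auditor trade review workload against the risk of accepting weakly grounded verdicts.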
5. Discussion
This section interprets the experimental findings, analyzes the SACA proof-of-concept prototype with responsible AI principles, discusses the accuracy–faithfulness gap, and acknowledges limitations.
5.1. Interpretation of Findings
Our experimental findings yield several insights on how the retrieval paradigm comparison should be interpreted. While no method achieves statistically significant superiority in faithfulness, the patterns of consistency, accuracy–faithfulness divergence, and retrieval architecture reveal important design principles for deploying RAG in compliance auditing contexts.
5.1.1. Statistical Non-Significance and the Consistency Argument
The central finding of this work is not that Graph RAG is superior to other methods, but rather that it demonstrates superior consistency and robustness across multiple variations. This consistency manifests in three key dimensions that matter for production deployment.
First, Graph RAG’s faithfulness is comparatively insensitive to the choice of language model: its CV across LLMs is 10.1%, comparable to Rule-based at 9.1% and well below Vector RAG at 21.5%. This property is critical for production deployment, where LLM availability and constraints (e.g., token limit, price, memory) may necessitate model switching.
Second, Graph RAG exhibits the lowest CV across attack classes at 33.7%, compared to 38.4% for Rule-based and 42.1% for Vector RAG. In real-world auditing contexts, the attack type is unknown in advance, making robustness more valuable than peak performance on specific classes. An auditor cannot reliably predict whether incoming traffic will be Brute Force (where Rule-based excels) or Reconnaissance (where Vector RAG excels), but Graph RAG provides the most stable faithfulness across threat categories. This attack-type-agnostic consistency reduces the risk of undetected faithfulness degradation when deployed against novel attack distributions.
Third, Graph RAG achieves a near-perfect context precision of 0.996, meaning that virtually all retrieved flows are relevant to the compliance question. By comparison, Rule-based (0.814) and Vector RAG (0.856) retrieve substantially more noise alongside the signal. For auditors reviewing AI-generated compliance verdicts, high precision directly reduces workload by eliminating false positive evidence that must be manually scrutinized and dismissed.
The high variance observed across all methods is attributed to three compounding factors. Small sample sizes of only 75 to 112 successful evaluations per method from the planned 120 evaluations limit statistical power. LLM output variability introduces stochasticity even with identical prompts, as non-deterministic sampling can produce qualitatively different explanations for the same query. RAGAS judge variability contributes additional noise, as the GPT-4o-mini judge may score borderline cases inconsistently, particularly for statements that are partially supported by context.
We therefore reframe the contribution of this work. Graph RAG does not guarantee higher faithfulness than alternatives in all scenarios, but it provides more predictable and stable faithfulness across the multiple sources of variation that characterize real-world deployment. For risk-averse compliance contexts where unpredictable faithfulness degradation could expose organizations to regulatory liability, this consistency advantage has more practical value than mean performance alone.
5.1.2. The Accuracy–Faithfulness Gap
One of the empirical findings of this work is that high detection accuracy does not guarantee faithful explanations, exposing a fundamental disconnect between intrusion detection performance metrics and explainability and traceability requirements. Graph-based intrusion detection systems (GNN4ID) on the CIC-IoT2023 dataset achieve F1 scores exceeding 0.97 [
10], yet our experiments show that over 40% of statements in LLM-generated compliance answers are unsupported by retrieved evidence, with average faithfulness ranging from 0.50 to 0.57 across methods.
This accuracy–faithfulness gap arises from three architectural misalignments between detection optimization and explanation generation. First, detection models optimize for classification accuracy by learning feature correlations that maximize discriminative power, not explainability. A model may correctly predict DDoS based on subtle multivariate patterns involving packet size variation, yet these correlations are invisible in the top-5 retrieved flows when each flow is scored independently. The detection model’s internal representation captures complex feature interactions that enable accurate classification, but the retrieval stage linearizes this information into a ranked list of individual flows, severing the multidimensional relationships.
Second, LLMs fill gaps with prior knowledge when retrieved contexts prove insufficient. With context recall never exceeding 22.4% in our experiments, the LLM must have inferred substantial portions of the compliance answer from its parametric knowledge rather than from evidence. The LLM generates plausible-sounding compliance reasoning by drawing on network traffic patterns learned during pre-training, which may or may not align with the specific evidence at hand.
Third, RAGAS Faithfulness measures statement-level grounding rather than verdict-level correctness. Even if the overall compliance verdict is correct, meaning the LLM correctly identifies a provision violation, individual supporting claims may be hallucinated. A statement such as “This violates Provision 5.3 due to SQL injection exploiting CVE-2023-XXXX” may be unsupported if the retrieved context mentions web-based attacks but does not specify SQL injection or reference CVE identifiers. The LLM infers reasonable attack details to construct a coherent narrative, but these details are not traceable to the evidence.
For responsible AI in security auditing, this gap is unacceptable from both professional and legal perspectives. A legally defensible audit trail requires that every claim be traceable to evidence, such that an external reviewer can reconstruct the reasoning chain without relying on the AI system’s internal knowledge. Professional audit standards demand documented evidence for each finding, not plausible inferences. Furthermore, hallucinated reasoning introduces liability risk, as regulators may challenge verdicts where supporting evidence is incomplete or fabricated. Our work demonstrates that measuring faithfulness is feasible via RAGAS, which provides a quantitative metric that complements detection accuracy measures.
5.1.3. Why Graph RAG Outperforms Dense Retrieval on Structured Data
Contrary to the widespread assumption that learned embeddings universally outperform hand-crafted features, we find that rule-based heuristics achieve comparable faithfulness to Vector RAG at 0.524 versus 0.509 with . Graph RAG numerically outperforms both at 0.570, though without statistical significance. This result challenges conventional wisdom and merits careful explanation.
The comparable performance of rule-based and vector retrieval stems from the nature of network flow data. Network flows are highly structured and numerical, with features like Rolling_SYN_Sum = 458 carrying precise semantic meaning as a SYN flood indicator. This numerical precision may be diluted when embedded into a high-dimensional continuous space optimized for natural-language similarity. The nomic-embed-text model was trained on web text to capture semantic relationships in linguistic data, not to preserve the magnitude relationships and threshold semantics of network security features. When a flow with Rolling_SYN_Sum = 458 is embedded, the resulting 768-dimensional vector may be proximate to other attack-related flows, but the critical threshold information, such as the fact that 458 exceeds the 300-packet heuristic for flagging SYN floods, is obscured in the continuous representation.
Compliance reasoning is ontological rather than semantic in character. The relationship between Brute Force attacks and Provision 5.1 is a hard logical link defined by regulatory interpretation, not a soft similarity that admits degrees. Graph traversal preserves this structure by explicitly encoding the VIOLATES relationship in the knowledge graph. When a compliance query asks about Brute Force violations, graph traversal follows the deterministic path from Flow nodes to the BruteForce node to the Provision 5.1 node, returning exactly the flows and provisions connected by this ontological chain.
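The deterministic traversal described above can be illustrated with a toy in-memory graph. The `VIOLATES` relationship and the mapping of Brute Force to Provision 5.1 and Reconnaissance to Provision 5.6 come from the text; the `CLASSIFIED_AS` relationship name, the flow identifiers, and the dictionary layout are hypothetical stand-ins for the Neo4j schema.

```python
# Toy knowledge graph as adjacency lists keyed by (source node, relationship).
GRAPH = {
    ("flow-001", "CLASSIFIED_AS"): ["BruteForce"],      # hypothetical edge name
    ("flow-002", "CLASSIFIED_AS"): ["Reconnaissance"],
    ("flow-003", "CLASSIFIED_AS"): ["BruteForce"],
    ("BruteForce", "VIOLATES"): ["Provision 5.1"],
    ("Reconnaissance", "VIOLATES"): ["Provision 5.6"],
}


def flows_violating(provision: str) -> list[str]:
    """Follow Flow -> AttackClass -> Provision and return the matching flows."""
    # Attack classes that have a VIOLATES edge to the requested provision.
    classes = {
        node for (node, rel), targets in GRAPH.items()
        if rel == "VIOLATES" and provision in targets
    }
    # Flows linked to any of those attack classes.
    return sorted(
        node for (node, rel), targets in GRAPH.items()
        if rel == "CLASSIFIED_AS" and classes.intersection(targets)
    )


evidence = flows_violating("Provision 5.1")
```

Because the result set is defined by explicit edges rather than a similarity cutoff, a flow is either on the path to the provision or it is not, which is the property the text attributes to graph traversal.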
Rule-based systems benefit from direct encoding of domain expertise. Our heuristic weights were validated on representative samples and directly encode the attack signatures that expert auditors expect to see, such as SYN counts exceeding 300 for flood detection or unique port diversity exceeding 10 for scanning detection. These thresholds are not arbitrary but reflect empirically observed attack characteristics in real network traffic. In contrast, the embedding model has no prior knowledge of network security conventions and must learn these patterns implicitly from the serialized flow descriptions. The semantic similarity between “SYN count 458” and “SYN count 301” may not reflect their equivalent classification as SYN floods, whereas the rule-based heuristic treats both identically by applying the same threshold test.
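The threshold behaviour discussed above can be sketched as follows. The two thresholds (300 SYN packets, 10 unique ports) and the `Rolling_SYN_Sum` feature name are taken from the text; the `unique_ports` key, the flag labels, and the function itself are hypothetical and do not reproduce the study's full heuristic weighting.

```python
SYN_FLOOD_THRESHOLD = 300  # from the text: SYN counts exceeding 300
PORT_SCAN_THRESHOLD = 10   # from the text: unique port diversity exceeding 10


def heuristic_flags(flow: dict) -> list[str]:
    """Apply fixed threshold tests to a flow's numeric features."""
    flags = []
    if flow.get("Rolling_SYN_Sum", 0) > SYN_FLOOD_THRESHOLD:
        flags.append("syn_flood")
    if flow.get("unique_ports", 0) > PORT_SCAN_THRESHOLD:  # hypothetical feature name
        flags.append("port_scan")
    return flags


# Both flows exceed the SYN threshold and therefore receive the same flag.
flags_458 = heuristic_flags({"Rolling_SYN_Sum": 458})
flags_301 = heuristic_flags({"Rolling_SYN_Sum": 301})
```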
This finding suggests that hybrid approaches combining graph structure with embedding-based retrieval for unstructured text may be optimal. The graph could encode structured compliance ontology while embeddings handle free-text regulatory language, leveraging the strengths of each representation. Such hybrid architectures are a promising direction for future work.
5.2. Evaluation Prioritization for Network Traffic Auditing
The selection of appropriate evaluation metrics is a domain-dependent decision in the deployment of RAG systems. While many standard RAG evaluation benchmarks assess five dimensions (Faithfulness, Context Precision, Context Recall, Answer Correctness, and Answer Relevancy), their relative importance varies considerably across use cases and domains. In the context of network traffic auditing, especially compliance rather than forensics, the system usually ingests packet capture (PCAP) data and identifies violation evidence against a defined security policy ruleset. This section discusses the rationale for prioritizing specific RAG evaluation metrics within the compliance audit domain and contextualizes the observed scores accordingly.
5.2.1. Faithfulness as the Primary Metric
In compliance auditing, every violation claim surfaced by the system must be directly traceable to raw evidence present in the retrieved network flows or payload data. Faithfulness, defined as the proportion of answer claims that are grounded in the retrieved context, is therefore the most critical metric in this setting. An unfaithful response, wherein the system asserts a policy violation that cannot be substantiated by the underlying PCAP evidence, constitutes a fabrication that may carry serious legal and regulatory consequences. Unlike general-purpose question answering, where a partially hallucinated response may be merely inconvenient, a falsely generated violation claim in a compliance report risks wrongful attribution, erodes auditor credibility, and potentially exposes the organization to liability.
The evaluated systems returned faithfulness scores in the range of 0.509–0.570, indicating that approximately half of the generated claims were grounded in the retrieved context. It is important to note that a faithfulness score of 0.5 does not represent a random chance-level baseline. Unlike metrics derived from classifiers such as the F1 score [53], which has a definable chance-level floor dependent on class distribution, faithfulness is formally defined as the ratio of supported claims to total claims in the generated response [20] and has no inherent probabilistic floor. The observed scores, therefore, indicate a substantive grounding deficiency that warrants targeted remediation before deployment in production.
5.2.2. Context Precision and Audit Report Credibility
Context Precision, the proportion of retrieved chunks that are relevant to the query, ranks as the second priority metric in this domain. In compliance auditing, the retrieved context forms the evidentiary basis of the generated report. The inclusion of irrelevant network flows or payload segments in the evidence could introduce noise that may mislead auditors, inflate reported violations, or undermine the audit findings. The Graph RAG architecture achieved a near-perfect context precision score of 0.996, demonstrating its capacity to retrieve only policy-relevant traffic patterns. This property is particularly advantageous when audit reports are subject to review by legal or regulatory bodies.
5.2.3. Context Recall and Audit Completeness
Although Context Recall is ranked third in the metric priority order for this domain, its operational significance must not be understated. A compliance audit derives its legal and regulatory validity from the completeness of its coverage. A system that exhibits high context precision but low recall retrieves a narrow but accurate subset of the relevant evidence, leaving the majority of policy violations undetected. This precision–recall imbalance is particularly evident in the Graph RAG system, which pairs near-perfect precision with critically low recall: it retrieves only the evidence it is most confident about while omitting a substantial portion of the violation landscape. For audit purposes, a minimum recall threshold should be defined and enforced as a deployment criterion, as a report that fails to surface the majority of violations may provide a false assurance of compliance.
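The precision–recall imbalance and the proposed recall gate can be sketched as follows. The sketch uses the set-based definitions given in the text (proportion of retrieved chunks that are relevant, and proportion of relevant evidence that is retrieved), not the rank-weighted variants some evaluation libraries compute. The function names, flow identifiers, and the `min_recall` value of 0.8 are illustrative assumptions, not values from the study.

```python
def context_precision(retrieved: set[str], relevant: set[str]) -> float:
    """Proportion of retrieved chunks that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def context_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Proportion of relevant chunks that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0


def passes_recall_gate(retrieved: set[str], relevant: set[str],
                       min_recall: float = 0.8) -> bool:
    """Deployment criterion: reject retrieval runs that miss too much evidence.

    min_recall is a hypothetical threshold; the text does not prescribe one.
    """
    return context_recall(retrieved, relevant) >= min_recall


# Invented example: a retriever that is accurate but narrow.
relevant_flows = {"f1", "f2", "f3", "f4", "f5"}
retrieved_flows = {"f1", "f2"}

precision = context_precision(retrieved_flows, relevant_flows)
recall = context_recall(retrieved_flows, relevant_flows)
deployable = passes_recall_gate(retrieved_flows, relevant_flows)
```

In this toy case every retrieved flow is relevant, yet three of the five relevant flows are missed, so the run fails the gate despite perfect precision.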
5.2.4. Metric Priority Summary
Table 6 summarises the recommended metric priority order for network traffic compliance auditing alongside the rationale for each ranking. Indicative scores are meant for comparison with the adjacent study by Arazzi [54], which describes an IoT RAG system evaluated using RAGAS metrics. These represent observed results from a comparable IoT domain setting, not normative deployment thresholds. To the best of our knowledge, no universal RAGAS threshold standard currently exists in the literature.
5.2.5. Architectural Implications
The evaluation results suggest that no single retrieval architecture satisfies all metric priorities simultaneously. The Graph RAG system excels in context precision and answer relevancy, making it well-suited for the final evidence presentation stage of an audit pipeline. However, its critically low context recall renders it insufficient as a standalone detection mechanism. Conversely, the Vector RAG system demonstrates comparatively broader recall while maintaining adequate precision.
Based on these observations, a hybrid pipeline architecture is best suited for future prototypes, wherein Vector RAG or hybrid sparse-dense retrieval is employed during the initial violation sweep to maximize coverage, followed by Graph RAG re-ranking and evidence consolidation to ensure that only well-supported, policy-relevant findings are surfaced to the auditor. Such an architecture would align the system design with the metric priority order established in this section, balancing audit completeness with evidentiary rigor.
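The two-stage design described above can be sketched as a small composition of a recall-oriented sweep and a precision-oriented re-ranker. Everything below is a hypothetical illustration: `hybrid_retrieve`, the toy keyword sweep standing in for vector retrieval, and the `PROVISION_LINKS` table standing in for graph connectivity are assumptions, not the prototype's actual components.

```python
from typing import Callable, Sequence

Retriever = Callable[[str], list[str]]
Reranker = Callable[[str, Sequence[str]], list[str]]


def hybrid_retrieve(query: str, sweep: Retriever, rerank: Reranker,
                    top_k: int = 3) -> list[str]:
    """Stage 1 maximizes coverage; stage 2 keeps only well-supported evidence."""
    candidates = sweep(query)
    return rerank(query, candidates)[:top_k]


# Toy data (invented): flow id -> textual summary of the flow.
FLOWS = {
    "flow-1": "syn flood to port 80",
    "flow-2": "dns lookup",
    "flow-3": "syn flood to port 443",
}
# Toy stand-in for the knowledge graph: number of edges linking a flow to provisions.
PROVISION_LINKS = {"flow-1": 2, "flow-3": 1}


def toy_sweep(query: str) -> list[str]:
    """Broad stand-in for vector retrieval: any shared term counts as a hit."""
    terms = set(query.lower().split())
    return [fid for fid, text in FLOWS.items() if terms & set(text.split())]


def toy_rerank(query: str, candidates: Sequence[str]) -> list[str]:
    """Stand-in for graph re-ranking: drop unlinked flows, order by link count."""
    linked = [c for c in candidates if c in PROVISION_LINKS]
    return sorted(linked, key=lambda c: PROVISION_LINKS[c], reverse=True)


evidence = hybrid_retrieve("syn dns lookup", toy_sweep, toy_rerank)
```

Passing the two stages as callables keeps the composition independent of any particular retriever, so either stage could be swapped or evaluated separately against the metric priorities discussed in this section.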
5.3. Implications for Responsible AI
Our findings support three actionable recommendations for deploying responsible AI in network security compliance auditing contexts, as studied by Shruti et al. [55] and Xia et al. [56].
First, professional audit standard bodies should mandate faithfulness measurement alongside traditional detection accuracy. Detection accuracy metrics such as F1, precision, and recall quantify classification performance but provide no information about explanation quality. An intrusion detection system achieving F1 exceeding 0.97 may generate compliance explanations in which over 40% of statements are unsupported by evidence, as our experiments demonstrate. Regulatory frameworks such as the EU AI Act and NIST AI RMF implicitly incorporate faithfulness requirements for high-risk AI systems:
EU AI Act requirements: for high-risk systems, the Act explicitly mandates accuracy, robustness, and explainability. These are the functional equivalents of “faithfulness” in a compliance context, ensuring the AI performs reliably as intended.
NIST AI RMF characteristics: NIST frames these as voluntary “trustworthy” characteristics, specifically highlighting that AI must be valid and reliable, explainable, and interpretable.
Second, organizations should prioritize robustness over peak performance when selecting RAG architectures for production deployment. We argue that Graph RAG’s consistency across LLMs and across attack types makes it more suitable for production than methods with higher variance, even if mean performance is similar. Production systems encounter diverse operating conditions, including LLM availability, novel attack patterns not represented in training data, and varying threat distributions across a heterogeneous network environment. A method that achieves peak faithfulness on specific LLM-attack combinations but degrades unpredictably under distribution shift introduces unacceptable operational risk.
Third, compliance AI systems must implement human-in-the-loop oversight with explainability mechanisms. Interactive knowledge graph visualization enables auditors to verify evidence chains by tracing retrieval paths from traffic observations through attack classifications to provision violations, satisfying transparency requirements without requiring auditors to understand the AI system’s internal architecture. Furthermore, as demonstrated by our prototype SACA, such a system could flag low-faithfulness outputs below configurable thresholds (e.g., 0.3–0.5) for recalibration or fine-tuning, ensuring that high-hallucination verdicts never reach final reports without human validation.
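The threshold-based flagging described above can be sketched as a simple routing rule. The function name, the routing labels, and the default threshold of 0.5 (taken from the upper end of the example range in the text) are illustrative assumptions and do not describe the SACA implementation.

```python
def route_verdict(faithfulness_score: float, threshold: float = 0.5) -> str:
    """Decide whether a generated verdict needs human validation.

    threshold is configurable; 0.5 is an assumed default within the 0.3-0.5
    example range mentioned in the text.
    """
    if not 0.0 <= faithfulness_score <= 1.0:
        raise ValueError("faithfulness score must lie in [0, 1]")
    if faithfulness_score < threshold:
        return "needs_human_validation"
    return "eligible_for_report"


# Invented scores for three generated verdicts.
decisions = [route_verdict(s) for s in (0.2, 0.5, 0.9)]
```

A stricter deployment could raise the threshold so that more verdicts are held back, trading auditor workload for a lower chance of unsupported claims reaching a final report.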
5.4. Limitations
We acknowledge three significant limitations that constrain the generalizability of our findings and should be taken into consideration for future work.
The lack of statistical significance indicates that our study is underpowered to detect small-to-medium differences between methods. With only 75 to 112 valid evaluations per method, the study achieves approximately 30% power to detect a medium effect size at the chosen significance level. To achieve the conventional 80% power threshold, we would require approximately 250 evaluations per method, more than doubling the current sample. This limitation arises from two practical constraints. Expert validation presents a bottleneck because constructing gold-standard ground-truth scenarios requires specialized domain expertise at the intersection of network security and regulatory interpretation, limiting the scale of the dataset we could produce within project resources.
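The power reasoning above can be reproduced in outline with a normal approximation for a two-sided, two-sample comparison of means. The effect size and significance level are not stated numerically above, so the values used here (Cohen's d = 0.25, alpha = 0.05) are illustrative assumptions only, and the function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_sample(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-sample test (normal approximation).

    The negligible probability of rejecting in the wrong tail is ignored.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n_per_group / 2) - z_crit)


def n_for_power(d: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Approximate evaluations needed per method to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


# Assumed inputs (not taken from the study): d = 0.25, alpha = 0.05.
assumed_d = 0.25
current_power = power_two_sample(assumed_d, n_per_group=75)
required_n = n_for_power(assumed_d)
```

Under these assumed inputs the approximation gives power of roughly one third at 75 evaluations per method and a requirement of roughly 250 per method, which is of the same order as the figures quoted above. A different effect size or significance level would change both numbers substantially.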
We evaluate only ETSI EN 303 645, which has a specific ontological structure with 5 provisions and clear attack-class mappings. The findings may not generalize to other standards with different structural properties. IEC 62443 [57] for industrial control systems contains over 300 detailed requirements organized in a hierarchical capability model, presenting scalability challenges for knowledge graph construction. NISTIR 8259A [58] for IoT device manufacturers employs a different threat model emphasizing manufacturer responsibilities rather than device behaviors, potentially disrupting the attack-to-provision mappings we relied upon. PCI DSS for the payment card industry focuses on transactional security with time-based requirements, such as quarterly vulnerability scans, introducing temporal reasoning that our static graph does not capture. Graph RAG’s advantage may be strongest for standards with clear ontological structure, which include entity-relationship hierarchies, and weaker for unstructured regulatory text lacking explicit semantic markup.
RAGAS relies on an LLM judge, specifically GPT-4o-mini in our implementation, to score faithfulness by decomposing answers into claims and verifying each against context. This introduces two sources of variability. Inter-judge reliability is unexamined because we used a single judge model for all evaluations. Different judge models, such as Gemini or Claude, may score the same answer differently due to variations in instruction following and claim decomposition strategies, affecting absolute faithfulness scores.
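One way inter-judge reliability could be examined in follow-up work is to have two judge models label the same fixed list of claims and compute Cohen's kappa over their verdicts. The sketch below is an assumption about how such a check might look, not part of the reported evaluation; it also presumes a shared claim decomposition, which the text notes may itself differ between judges.

```python
def cohens_kappa(judge_a: list[bool], judge_b: list[bool]) -> float:
    """Chance-corrected agreement between two judges on binary claim verdicts."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty claim list")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    rate_a = sum(judge_a) / n  # share of claims judge A marks as supported
    rate_b = sum(judge_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both judges gave one identical constant label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented verdicts from two hypothetical judge models on four claims.
verdicts_a = [True, True, False, False]
verdicts_b = [True, False, False, False]
kappa = cohens_kappa(verdicts_a, verdicts_b)
```

Kappa close to 1 would suggest that absolute faithfulness scores are stable across judges, while low values would indicate that reported scores depend materially on the choice of judge model.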
While RAGAS Faithfulness employs LLM-based semantic entailment, which was originally designed for natural language corpora, its application remains valid in this context as the evaluation operates on the textual representation of network evidence rather than raw packet data. However, we acknowledge that domain-specific faithfulness metrics, capable of verifying claims against network protocol ground truth, represent a valuable direction for future work.