1. Introduction
The escalating complexity and frequency of cyberattacks underscore the critical need for robust and transparent mechanisms in digital forensic analysis and incident response. Intrusion detection systems (IDSs) serve as a pivotal component in identifying anomalous activities within network traffic. While machine learning (ML) and deep learning (DL) models have demonstrated significant capabilities in automating and scaling forensic tasks, their practical adoption in forensic contexts remains hindered by a key limitation: the lack of interpretability, which restricts the trust and usability of AI models in critical decision-making environments. This “black-box” nature is particularly concerning in high-stakes legal and forensic domains, where transparency and accountability are paramount and model outputs must be explainable, auditable, and suitable for evidentiary purposes. Without clear justifications for predictions, the deployment of such models risks undermining legal processes and eroding stakeholder confidence. Addressing this challenge is therefore essential to ensure the trustworthiness and admissibility of AI-driven forensic tools.
This study focuses on the forensic analysis of intrusion detection systems (IDSs), with specific emphasis on the incorporation of explainable artificial intelligence (XAI) to enhance the forensic interpretability of ML-based intrusion detection models. XAI has emerged as a promising approach to address the interpretability challenge by elucidating the underlying logic of model predictions. In forensic applications, these capabilities are essential for expert validation, transparent documentation of investigative reasoning, and adherence to regulatory standards [
1].
Among the various XAI techniques, SHAP (SHapley Additive Explanations) and LIME (Local Interpretable Model-Agnostic Explanations) have gained prominence for offering both global and local interpretability. SHAP, grounded in cooperative game theory, assigns consistent and accurate attribution values to features. In contrast, LIME builds local surrogate models that approximate the behavior of complex classifiers, thereby facilitating the understanding of individual predictions [
2,
3].
Recent research has sought to bridge the gap between automated intrusion detection and interpretability. For instance, Siganos et al. [
4] demonstrated the integration of SHAP explanations within IDSs deployed in IoT environments, underlining the relevance of interpretability in decision-making and model trustworthiness. Additionally, a systematic review by Mohale et al. [
1] highlighted the increasing integration of XAI in IDSs, emphasizing the necessity of transparent and interpretable mechanisms in cybersecurity.
Despite these advancements, the adoption of XAI methods in the forensic interpretation of IDS outputs remains limited. To address this gap, the present work proposes a structured methodology to evaluate and compare the forensic utility of two widely recognized XAI techniques, SHAP and LIME, applied to two representative AI-based IDS models: XGBoost, a gradient-boosted decision tree ensemble, and TabNet, a deep learning architecture optimized for tabular data. Both models were trained and validated using the UNSW-NB15 dataset. The analysis examines not only classification performance, but also the consistency, interpretability, and forensic applicability of the explanations produced, with the goal of promoting the integration of XAI into digital forensic procedures by improving model transparency, trust, and evidentiary reliability in AI-assisted IDSs. The novelty of the proposed approach lies in combining SHAP and LIME within a forensic framework evaluated across XGBoost and TabNet, introducing forensic-oriented metrics (fidelity, stability, and Jaccard similarity) and aligning the evaluation with legal standards, thereby offering a practical contribution to explainable AI in cybersecurity.
The main contributions of this work are summarized as follows:
Comparative Application of XAI Techniques in Cybersecurity: A rigorous comparison of SHAP and LIME is conducted in the context of intrusion detection, focusing on their interpretability contributions in forensic scenarios.
Dual-Model Evaluation Framework (XGBoost and TabNet): The study contrasts two machine learning paradigms—a tree-based ensemble model (XGBoost) and a deep learning architecture for tabular data (TabNet)—to assess how the explanation methods perform across different model types.
Multi-Criteria Explainability Assessment: The evaluation includes quantitative metrics such as the Jaccard similarity index, fidelity score, and explanation stability, providing a multidimensional perspective on the quality of explanations.
Use of a Realistic and Public Benchmark Dataset: The UNSW-NB15 dataset is employed to ensure realism, replicability, and alignment with network intrusion detection use cases.
Forensic-Centric and Trust-Oriented Interpretability: The study shows how SHAP and LIME can support evidence validation, regulatory compliance, and transparent decision-making in cybersecurity contexts.
The remainder of this paper is structured as follows.
Section 2 provides a comprehensive review of related work, emphasizing the integration of explainable artificial intelligence (XAI) techniques in intrusion detection systems.
Section 3 outlines the materials and methods, detailing the dataset characteristics, model architectures (XGBoost and TabNet), preprocessing pipeline, and implementation of SHAP and LIME for forensic interpretability.
Section 4 presents the experimental results, including classification performance, global and local explainability analyses, comparative evaluations, and metric-based assessments of fidelity, stability, and feature consistency.
Section 5 discusses the implications of the findings, highlights limitations, and offers strategic recommendations for the application of XAI in forensic intrusion detection. Finally,
Section 6 concludes the paper, summarizing the main contributions and identifying directions for future research.
3. Materials and Methods
This section describes the resources used and the methodology adopted to evaluate the explainability of intrusion detection models using explainable artificial intelligence (XAI) techniques in a forensic context. The selected dataset, the machine learning algorithms employed, the computational environments used, and the specific parameters applied to the SHAP and LIME methods are detailed. The proposed experimental approach seeks to ensure the reproducibility, traceability, and validity of the results, in compliance with forensic standards and good scientific research practices.
3.1. Explainable AI Methodology for Forensic Intrusion Detection
The methodological approach of this study integrates machine learning and explainable AI techniques to improve the transparency and interpretability of intrusion detection systems. Each stage of the process is outlined in
Figure 1.
The proposed methodology consists of five main stages. First, the UNSW-NB15 dataset is acquired and preprocessed to extract relevant features. Second, two models, XGBoost and TabNet, are trained for network intrusion classification. Third, predictions are evaluated using performance metrics. Fourth, explainability is incorporated through SHAP and LIME to generate both local and global interpretations. Finally, the insights derived from the explanations are synthesized to support expert-driven decision-making in forensic and cybersecurity contexts. Within this framework, SHAP and LIME are evaluated as post hoc interpretability techniques applied to two representative models: XGBoost, a tree-based ensemble learner, and TabNet, a deep learning architecture tailored for tabular data, both trained and validated on the UNSW-NB15 dataset. The evaluation considers classification performance alongside forensic interpretability criteria, including the consistency of explanations, alignment with forensic principles, and granularity of insight. The methodology follows core principles of digital forensics to ensure analytical validity and legal admissibility: relevant network features are extracted and preprocessed while preserving their integrity, interpretable ML/DL models are applied and supported by SHAP and LIME, and the results are presented in forensic-ready formats that enhance transparency, reproducibility, and usability in investigative and judicial contexts. This theoretical and methodological integration ensures that the experimental design is not only technically rigorous, but also aligned with the legal and procedural standards required in digital forensic investigations.
3.1.1. The UNSW-NB15 Dataset
The UNSW-NB15 dataset consists of 49 features organized into five main categories, each representing a distinct analytical perspective of network behavior. These feature groups enable a comprehensive understanding of traffic patterns and support effective intrusion detection using interpretable machine learning models [
18,
19].
The basic features include core connection metadata such as IP addresses, port numbers, protocol types, data-transfer volumes, and connection duration. These attributes offer foundational insights into the structure and directionality of traffic flows. From a forensic perspective, they are essential for tracking the origin and destination of communications, identifying targeted services, and reconstructing session context. Variables such as srcip, dsport, and dur are frequently highlighted by explainability methods (e.g., SHAP) due to their direct correlation with attack behaviors.
The content features describe transport-layer and application-layer elements, including TCP sequence numbers, average packet sizes, and HTTP transaction depths. These features allow for the inspection of payload-related behavior, which is crucial for detecting protocol misuse, content-based attacks, or unusual data patterns. Their inclusion supports forensic interpretations related to exfiltration, service probing, and tunneling.
The time features focus on the temporal dynamics of packet transmission. These include jitter, inter-packet arrival times, round-trip delays, and connection establishment times. Such metrics are indispensable in detecting stealthy or time-based attacks, such as slow-rate denial-of-service or command-and-control beaconing. They also provide the temporal granularity needed to establish event timelines during post-incident analysis.
The additional generated features consist of synthetic or derived attributes based on behavioral patterns across multiple flows. These include indicators of session similarity, the frequency of repeated services, and flow aggregation over temporal windows.
Finally, the labeled attributes identify whether a record corresponds to normal traffic or to a specific attack category. While these attributes are not used as model input, they are critical for supervised learning and model evaluation. In forensic contexts, they support the validation of automated detection results against known threat categories.
Overall, the structured division of features in UNSW-NB15 not only supports accurate detection through machine learning, but also enhances the interpretability of models in digital forensics. The categorization facilitates targeted analysis, improves the explainability of the model output, and supports the generation of legally sound evidence.
Table 1,
Table 2,
Table 3,
Table 4,
Table 5 and
Table 6 present the full set of 49 features from the UNSW-NB15 dataset, categorized for analysis and model interpretation.
To improve the interpretability of the model and prevent overfitting [
20], a total of 39 features were selected from the original 49 available in UNSW-NB15.
The ten excluded features were classified into two principal groups. First, several attributes, namely srcip, dstip, sport, dsport, proto, state, and service, were identified as nominal identifiers or high-cardinality categorical variables. These features often function as quasi-identifiers or exhibit a large number of unique values, which complicates their integration into the model without advanced encoding techniques. Their inclusion could hinder model interpretability or increase the risk of overfitting.
Second, the temporal attributes stime and ltime, representing the start and end times of network flows, were omitted due to their incompatibility with non-sequential models such as XGBoost and TabNet. These models lack an inherent mechanism to model temporal dependencies or sequential order between samples. When improperly handled, timestamp variables may act as implicit indicators of data order or collection sequence, leading to data leakage by allowing the model to learn from non-generalizable temporal patterns. This can result in overly optimistic evaluation metrics and poor generalization to unseen data. Moreover, the raw timestamp values are highly unique and context-dependent, offering minimal semantic contribution to the classification task. Unlike derived features such as packet counts or byte volume, which capture consistent behavioral signatures, timestamps often reflect environment-specific logging practices. Their inclusion can introduce noise, increase model complexity, and promote overfitting to artifacts unrelated to actual attack behavior.
This exclusion process prioritized features with higher predictive value and removed those deemed redundant or noisy [
21]. Although the excluded features offer contextual information, they do not contribute significantly to detecting malicious activities. The streamlined feature set improves model simplicity and enhances transparency when applying explainability methods such as SHAP and LIME. This is particularly relevant for ensuring the forensic applicability of the models and meeting the ethical and legal standards in digital forensic analysis [
22].
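As a minimal illustration of this selection step, the sketch below drops the identifier and timestamp columns named above from a loaded copy of the dataset. The file name and variable names are illustrative placeholders rather than part of the original pipeline.

```python
import pandas as pd

# Columns excluded in this study, as listed above (identifiers, high-cardinality
# categorical attributes, and raw timestamps).
EXCLUDED = ["srcip", "dstip", "sport", "dsport", "proto",
            "state", "service", "stime", "ltime"]

df = pd.read_csv("UNSW-NB15_full.csv")  # illustrative file name
features = df.drop(columns=[c for c in EXCLUDED if c in df.columns])
print(f"Retained {features.shape[1]} of {df.shape[1]} columns")
```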
3.1.2. Model Architectures: XGBoost and TabNet
This section introduces the two models employed in this study. XGBoost is a gradient-boosted tree ensemble algorithm known for its efficiency on structured data, while TabNet is a neural network architecture that uses attention and sparsity to perform feature selection and capture complex interactions. Both models were integrated to enhance the detection and interpretation of cyber threats embedded in network traffic, and each was selected for its ability to handle structured data, enable interpretability, and deliver robust classification performance within the cybersecurity domain.
XGBoost was selected due to its demonstrated efficacy in classification tasks within cybersecurity and anomaly-detection domains [
23]. As a scalable and efficient gradient-boosting framework, it is particularly well suited for modeling structured data and capturing nonlinear interactions among features. The algorithm incorporates advanced regularization mechanisms (L1 and L2) and optimized tree construction techniques, which contribute to improved generalization performance and a reduced risk of overfitting. Moreover, XGBoost offers native compatibility with post hoc explainability tools such as SHAP and LIME, enabling transparent interpretation of model predictions, an essential requirement for intrusion detection systems operating in forensic or security-sensitive contexts.
Complementing this, TabNet is a deep learning architecture specifically designed for modeling tabular data, incorporating sequential attention and sparse feature selection mechanisms to identify and prioritize the most relevant attributes during training [
24]. This architecture achieves a balance between high predictive performance and intrinsic interpretability, an essential property for applications in cybersecurity and intrusion detection systems. Unlike conventional deep neural networks, which often operate as opaque black-box models, TabNet enables native explainability by design. Its built-in interpretability capabilities make it a strong complement to post hoc analysis techniques such as SHAP and LIME, providing transparent insights into model behavior at both global and instance levels.
In the context of this research, TabNet is employed to detect sophisticated attack patterns embedded in network traffic. Its sequential decision-making process and feature sparsity not only enhance classification accuracy, but also contribute to model robustness and transparency. The training configuration used for the TabNet classifier (detailed in Section 3.5) was kept fixed to ensure consistent evaluation and a fair comparison across models.
By integrating XGBoost and TabNet, this study leverages the complementary strengths of gradient boosting and deep learning with interpretability, thus aligning with the operational needs of modern intrusion detection systems, where performance, transparency, and forensic utility are paramount.
3.1.3. Prediction and Explanation Models: LIME and SHAP
To interpret the decisions of the trained classifiers, this work integrates two widely adopted XAI methods: LIME, which generates local surrogate explanations, and SHAP, which assigns feature attributions grounded in game theory. Together, these components provide both predictive power and the interpretable insights essential for transparent and legally defensible outcomes. LIME is a model-agnostic technique widely adopted for interpreting the decisions of black-box classifiers. It works by generating a synthetic dataset composed of perturbed instances derived from the original input, along with the corresponding predictions obtained from the target model. An interpretable surrogate model is then trained on this local neighborhood, assigning higher weights to instances closer to the observation of interest. This surrogate model provides a localized, linear approximation of the decision boundary, enabling the analysis of individual feature contributions.
The primary objective of LIME is to construct a locally faithful approximation that reveals the most influential features affecting the model’s decision in a given context. This local interpretability makes it particularly suitable for forensic applications, where transparency and case-specific explanations are critical. Formally, LIME explains a prediction $f(x)$ by approximating it locally with an interpretable model $g \in G$, typically linear, trained on a perturbed sample space $Z$ around $x$. The optimization objective is
$$\xi(x) = \arg\min_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g),$$
where $\mathcal{L}(f, g, \pi_x)$ is a loss function measuring the fidelity of $g$ to $f$ under the locality distribution $\pi_x$, and $\Omega(g)$ penalizes the complexity of $g$. LIME thus produces locally faithful explanations that are especially useful in justifying individual IDS alerts.
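The following sketch illustrates how such a local surrogate explanation can be generated with the LimeTabularExplainer API. The variables X_train, X_test, feature_names, class_names, and the fitted model are assumed to come from the training pipeline described later, and the parameter values are illustrative rather than the exact configuration reported in Table 9.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

# Build the explainer from the training data statistics.
explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=feature_names,
    class_names=class_names,
    mode="classification",
    discretize_continuous=True,
)

# Explain a single IDS alert: a local linear surrogate g is fitted on
# perturbed samples weighted by their proximity to the instance of interest.
exp = explainer.explain_instance(
    data_row=np.asarray(X_test)[0],
    predict_fn=model.predict_proba,   # black-box probability function f
    num_features=10,                  # complexity budget for the surrogate g
    num_samples=5000,                 # size of the perturbed neighborhood Z
)
print(exp.as_list())                  # (feature condition, weight) pairs
```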
Recent studies have demonstrated LIME’s utility in producing actionable explanations across various domains, including healthcare and cybersecurity [
25,
26]. Enhancements in computational efficiency and scalability have further increased its applicability to large-scale datasets [
27,
28]. Furthermore, its flexibility in supporting different interpretable model types, such as decision trees or Lasso regression, has broadened its adoption [
29].
On the other hand, SHAP is a model-agnostic technique designed to interpret the prediction of a specific instance $x$ by assigning an importance value to each input feature. The method is grounded in cooperative game theory, specifically in the concept of Shapley values [30]. In this framework, features are considered as “players” in a cooperative game, and the model’s output represents the “payout” to be fairly distributed among them. Shapley values thus quantify the average marginal contribution of each feature across all possible feature combinations, ensuring theoretically consistent and equitable allocation [31]. The contribution of feature $i$ to a model prediction $f(x)$ is computed as
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}\big(x_{S \cup \{i\}}\big) - f_S\big(x_S\big) \right],$$
where $F$ is the full set of features and $S$ is a subset not containing $i$. SHAP ensures consistency, local accuracy, and additive feature attribution, making it ideal for forensic applications where interpretability must be both complete and legally defensible.
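A minimal sketch of this attribution process using the KernelExplainer variant (the one configured later in Table 9) is shown below; the variables model, X_train, and X_test are assumed to come from the training pipeline described in Section 3.5, and the sampling budgets are illustrative.

```python
import numpy as np
import shap

# Background set used to approximate the expected model output E[f(x)].
background = shap.sample(np.asarray(X_train), 100)
explainer = shap.KernelExplainer(model.predict_proba, background)

# Local explanation: one additive Shapley attribution per feature (and class)
# for a single test instance; nsamples controls the coalition-sampling budget.
instance = np.asarray(X_test)[:1]
shap_values = explainer.shap_values(instance, nsamples=200)

# Aggregating the absolute attributions over many instances yields the global
# feature-importance rankings used in the summary analyses of Section 4.2.
print(np.shape(shap_values))
```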
One of SHAP’s core advantages is its capacity to provide both local and global interpretability. Locally, it explains a single prediction by identifying the features that most influenced that specific output. Globally, it aggregates these values across the dataset to offer insights into overall model behavior and feature-importance trends. This dual capability makes SHAP highly suitable for domains where explainability and auditability are critical, such as cybersecurity and digital forensics.
These properties, along with efficiency, ensure that SHAP provides the most theoretically grounded method for feature attribution [
25]. Unlike alternative techniques such as LIME, which relies on local linear approximations, SHAP is built on solid theoretical guarantees, particularly completeness and consistency [
32,
33]. Recent studies have further confirmed SHAP’s applicability and reliability in high-stakes environments like finance and healthcare, where trust and transparency in AI systems are essential [
34,
35]. The efficiency property, unique to Shapley values and SHAP, guarantees the fair distribution of the model output among the features of an instance. This ensures that SHAP is the only known method capable of delivering complete and consistent explanations—an essential requirement in legally regulated domains [
2]. By adhering to these principles, SHAP satisfies key regulatory expectations for AI transparency and positions itself as a critical component in the development of ethically aligned AI systems [
36,
37].
Finally, LIME and SHAP are complementary approaches for explaining the decisions of complex machine learning models. While LIME provides fast, localized insights through linear surrogates, SHAP offers both global and local perspectives rooted in rigorous theoretical foundations, albeit with higher computational costs. The SHAP/LIME analysis phase focuses on generating interpretability by computing feature contributions and visualizing decision structures. In the subsequent insights phase, these outputs are leveraged to identify key variables, optimize predictive performance, and justify decisions to stakeholders. This structured process underscores the importance of XAI in ensuring model reliability, accountability, and usability in real-world applications. To visually illustrate the conceptual and methodological differences between LIME and SHAP,
Figure 2 contrasts their interpretability characteristics. This is a general representation intended to demonstrate the typical behavior of each model explanation approach over a two-feature input space.
Figure 2 provides a conceptual visualization of how LIME and SHAP generate explanations over a simplified two-feature input space (Feature A and Feature B). These features represent any pair of attributes extracted from the UNSW-NB15 dataset used in the classification of benign versus malicious network flows. On the left, LIME represents a localized approach using linear surrogate models over perturbed input spaces, which provides intuitive but potentially unstable explanations. On the right, SHAP leverages Shapley values to offer globally consistent and theoretically grounded attributions, ensuring completeness and robustness. This visual aid supports the comparative analysis by highlighting their respective scopes, mathematical foundations, and suitability in forensic interpretability scenarios. Both techniques significantly enhance the transparency and interpretability of model behavior.
Table 7 presents a comparative summary of these two methods.
The table highlights key differences in theoretical guarantees, interpretability scope, interaction awareness, and applicability to various analytical scenarios. These distinctions underscore the complementary nature of LIME and SHAP, particularly in forensic contexts, where explanation fidelity, stability, and legal defensibility are critical.
3.2. Performance Assessment and Explainability in Intrusion Detection Models
Model effectiveness is primarily assessed using accuracy as the main evaluation metric. In addition, confusion matrices are used to visualize and interpret the distribution of predictions across classes, providing detailed insights into the model’s ability to differentiate between legitimate and malicious traffic. To improve transparency in decision-making, particularly within the domain of intrusion detection, post hoc explainability methods, namely SHAP and LIME, are applied to identify the most influential features that contribute to the decision-making of the model. To mitigate the risk of overfitting, the machine learning models implemented in this study incorporate architectural strategies designed to improve generalization. In the case of XGBoost, this includes validation-based early stopping and L1/L2 regularization techniques, that is, the inclusion of penalty terms based on the absolute (L1) and squared (L2) magnitudes of model parameters, which effectively limit model complexity and prevent over-reliance on specific features. Conversely, TabNet employs a sparse, attentive mechanism that selectively emphasizes the most relevant features, inherently promoting generalization while preserving interpretability. The combination of these strategies, adaptive learning dynamics and built-in regularization, ensures stable and efficient training, reducing the need for additional overfitting-control procedures.
To evaluate the suitability of explainable artificial intelligence (XAI) techniques in forensic applications, it is essential to adopt metrics that go beyond general interpretability and address legal, procedural, and reliability concerns. Three core metrics are commonly adopted in the literature: the fidelity score, stability analysis, and Jaccard similarity.
Fidelity has been widely discussed as a means of assessing how faithfully an explanation reflects the underlying model’s behavior, especially in local contexts [
38]. High fidelity is critical for expert validation and for defending model decisions in legally sensitive environments [
33].
Stability, often evaluated by introducing controlled perturbations to the input, reflects the robustness and reproducibility of XAI methods. Stable explanations are increasingly considered essential for admissibility of AI-generated evidence in digital forensics [
39,
40].
Jaccard similarity has gained traction in comparative studies of XAI methods, quantifying overlap in the top explanatory features across methods or instances [
41]. This metric supports the identification of convergent explanatory patterns, which is useful in multi-perspective forensic validation settings.
Table 8 summarizes these metrics and their relevance for trustworthy and transparent forensic analysis.
The evaluation of XAI methods through fidelity, stability, and Jaccard similarity offers a multidimensional understanding of their explanatory reliability. These metrics are not only essential for validating interpretability in machine learning, but also play a critical role in digital forensics, where explanations must be transparent, reproducible, and legally defensible. By incorporating these evaluation criteria, this study enhances the rigor of forensic AI analysis and supports the integration of explainability tools in high-stakes cybersecurity environments.
3.3. System Environment and XAI Hyperparameter Settings
All experiments were conducted using a workstation equipped with an Intel Core i9-13900K processor, 64 GB of RAM, and an NVIDIA RTX 4090 GPU with 24 GB of dedicated memory. The operating system was Ubuntu 22.04 LTS (64-bit). The models were implemented using Python 3.10, leveraging the PyTorch 2.1 deep learning framework and the Scikit-learn library for traditional machine learning metrics and utilities.
To ensure transparency and reproducibility,
Table 9 summarizes the key hyperparameter settings used in our experiments for both SHAP (KernelExplainer) and LIME (LimeTabularExplainer). These parameters were selected following best practices and the prior literature to ensure consistent and interpretable results across different model instances.
These settings were selected based on standard practice and sensitivity trade-offs reported in the literature [
42], and remained consistent across all models to ensure fair comparison.
3.4. Data-Processing Pipeline
The UNSW-NB15 dataset was selected due to its comprehensive representation of contemporary cyber threats, making it a widely accepted benchmark for evaluating intrusion detection systems. Recent studies have validated its effectiveness in cybersecurity research involving machine learning approaches [
43]. The dataset comprises 175,341 distinct network traffic records, each labeled as either normal traffic or as one of the following attack categories: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms.
Prior to model training, input features underwent a structured preprocessing pipeline. Categorical variables, such as protocol and state, were numerically encoded; ordinal encoding was used for XGBoost, while TabNet handled embeddings internally. Continuous features were normalized using min–max scaling to ensure proportional influence across the models. Features deemed non-informative or as having high cardinality, such as IP addresses and port numbers, were excluded to reduce noise and improve model interpretability. This configuration reflects commonly accepted best practices in cybersecurity applications of XGBoost, balancing predictive performance, regularization, and explainability for robust intrusion detection modeling [
44]. Regarding feature normalization, only numerical attributes were subjected to min–max scaling within the [0, 1] range. Categorical features, when present, were first transformed using one-hot encoding, resulting in binary indicator variables. Due to their discrete and bounded nature, these encoded variables were not further scaled. This distinction ensured that normalization preserved the semantic structure of both numerical and categorical attributes without introducing distortions in interpretability or model behavior.
Table 10 summarizes the applied preprocessing techniques by stage.
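A minimal sketch of these preprocessing stages is shown below, assuming a loaded DataFrame df. The column lists are illustrative placeholders, since the retained categorical attributes depend on the specific experiment.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Illustrative column lists; adjust to the features retained for a given run.
categorical_cols = ["proto", "state", "service"]
numeric_cols = [c for c in df.columns
                if c not in categorical_cols + ["attack_cat", "label"]]

# Ordinal encoding of categorical attributes (used for the XGBoost pipeline;
# TabNet consumes the integer codes and handles embeddings internally).
df[categorical_cols] = OrdinalEncoder().fit_transform(df[categorical_cols].astype(str))

# Min-max scaling of numerical attributes into the [0, 1] range.
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
```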
To ensure robust model evaluation, the dataset was partitioned following standard practices: 70% for training, 20% for validation, and 10% for testing [
45]. In addition, care was taken to ensure that the distribution of attack categories was proportionally maintained across all the training, validation, and test subsets. This balancing strategy was implemented to preserve the statistical representativeness of the overall dataset and to reduce the risk of sampling bias [
46], where certain attack types might otherwise be over- or under-represented in specific subsets. Maintaining this distribution helped to ensure that the model was exposed to a similar variety of attack patterns in each phase of learning and evaluation, ultimately supporting more robust and generalizable results.
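The sketch below shows one way to obtain such a stratified 70/20/10 partition with scikit-learn, assuming X and y hold the processed features and attack-category labels; the seed value is illustrative.

```python
from sklearn.model_selection import train_test_split

# First split off 70% for training, stratifying on the attack category.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Split the remaining 30% into validation (20% overall) and test (10% overall).
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=1/3, stratify=y_tmp, random_state=42)
```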
Nonetheless, one important limitation of the UNSW-NB15 dataset is the inherent class imbalance. Several attack categories, such as Worms and Backdoor, are significantly under-represented compared to dominant classes like Generic or Exploits. This imbalance can skew both model training and post hoc explainability analysis, resulting in less reliable interpretations for minority attack types, which are often of high forensic relevance. Although this study focused on evaluating explainability techniques under standard dataset conditions, future work will incorporate class-rebalancing strategies to mitigate this issue. These include oversampling methods (e.g., SMOTE), class weighting during model training, and stratified performance evaluation. Such strategies are essential for improving model robustness and forensic interpretability in real-world settings where rare attacks must be reliably detected and explained.
3.5. Model Training and Explainability Setup
Following preprocessing and feature selection, two machine learning models were trained to support a comparative analysis of explainable artificial intelligence (XAI). The first model employed XGBoost, a tree-based ensemble method known for its efficiency and effectiveness in anomaly detection tasks within cybersecurity. The second model utilized TabNet, a deep learning architecture designed for tabular data that combines embedded feature selection with native interpretability.
Both models were trained using the processed dataset and were optimized using early stopping based on validation loss to mitigate overfitting. This training strategy ensured efficient learning convergence while preserving model-generalization capacity. The classifiers were further evaluated through SHAP and LIME to explore their respective explainability characteristics and to assess how model transparency varied across inherently interpretable and black-box-like algorithms.
XGBoost was initialized with 100 boosting rounds (n_estimators = 100) and a moderate learning rate of 0.1, which balanced convergence speed with model stability. Early stopping was applied with a patience of 10 rounds to prevent overfitting, based on validation loss (early_stopping_rounds = 10).
Tree complexity was controlled using a maximum tree depth of 6 (max_depth = 6) and a minimum child weight of 1 (min_child_weight = 1), allowing the model to learn detailed patterns without excessive granularity. To promote generalization, both row (subsample = 0.8) and column (colsample_bytree = 0.8) subsampling were used during training. The classifier was optimized for binary classification using a logistic objective (objective = ’binary:logistic’), and the evaluation metric selected was log loss (eval_metric = ’logloss’), which provides a probabilistic performance measure that is suitable for imbalanced datasets. A fixed seed (random_state = 42) ensured reproducibility, and the native label encoder was disabled (use_label_encoder = False) to allow for custom preprocessing and compatibility with external label formats.
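For reference, these settings translate into the following configuration using the XGBoost scikit-learn wrapper. X_train, y_train, X_val, and y_val are assumed from the data split, and, depending on the installed XGBoost version, early_stopping_rounds may need to be passed to the constructor instead of fit().

```python
from xgboost import XGBClassifier

# XGBoost configuration as stated above.
xgb_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
    use_label_encoder=False,
)

xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    early_stopping_rounds=10,   # stop if validation log loss stops improving
    verbose=False,
)
```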
TabNet was initialized with 8 decision dimensions (n_d = 8) and 8 attentive transformation dimensions (n_a = 8), forming the core of its feature representation and attention-based decision process. The number of sequential decision steps was set to 3 (n_steps = 3), and a sparsity regularization coefficient (lambda_sparse) was used to enforce selective attention on the most informative features, while the relaxation parameter controlled decision-step behavior and enhanced feature reuse. No categorical embeddings were explicitly defined in cat_emb_dim or cat_idxs, given the preprocessed input format. Optimization was performed using the Adam optimizer (optimizer_fn = torch.optim.Adam), with learning rate decay managed through a step scheduler that reduced the learning rate by a factor of 0.1 every 10 epochs. The model was deployed on the designated computational device (device_name = str(device)). This configuration was selected to ensure a balance between training efficiency, regularization, and model interpretability.
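A corresponding sketch of the TabNet setup is given below. The values that the text leaves unspecified (the relaxation parameter, lambda_sparse, and the learning rate) are filled with the pytorch-tabnet library defaults purely as placeholders, not as the values actually used in this study.

```python
import torch
from pytorch_tabnet.tab_model import TabNetClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

tabnet_model = TabNetClassifier(
    n_d=8, n_a=8, n_steps=3,
    gamma=1.3,                       # placeholder: relaxation parameter not stated above
    lambda_sparse=1e-3,              # placeholder: sparsity coefficient not stated above
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),  # placeholder learning rate
    scheduler_fn=torch.optim.lr_scheduler.StepLR,
    scheduler_params=dict(step_size=10, gamma=0.1),  # decay LR by 0.1 every 10 epochs
    device_name=str(device),
)

tabnet_model.fit(
    X_train, y_train,                # numpy arrays assumed
    eval_set=[(X_val, y_val)],
    patience=10,                     # early stopping on validation loss
)
```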
4. Results
This section presents a comparative evaluation of the XGBoost and TabNet models trained on the UNSW-NB15 dataset, followed by an interpretability analysis using LIME and SHAP. The discussion integrates classification performance metrics with post hoc explanations, emphasizing their relevance for forensic analysis and model transparency.
4.1. Classification Performance Overview
The analysis begins with a review of the training dynamics using accuracy curves, and continues with an assessment of the models’ ability to generalize based on an independent test set. This approach allows for evaluating not only the models’ capacity to fit the training data, but also their effectiveness in handling previously unseen instances.
Figure 3 presents the training and validation accuracy curves for both XGBoost and TabNet across the full training process. The plot reveals a consistent improvement in both the training and validation phases, with XGBoost and TabNet achieving comparable validation performance towards the end of training. Specifically, XGBoost reaches a validation accuracy of 97.8%, while TabNet attains 97.68%. While both models perform well on validation data, XGBoost demonstrates a slight advantage in test generalization. This may be due to its ensemble-based structure, which can better capture feature interactions and mitigate the effects of class imbalance. In contrast, TabNet may be more sensitive to rare class misclassifications, potentially affecting its robustness in the case of unseen data.
To further assess generalization, we examined the models on the held-out test set, where XGBoost achieved a test accuracy of 85.91%, slightly outperforming TabNet, which achieved 84.41%. The discrepancy between validation and test accuracy suggests a moderate degree of overfitting in both models, although each exhibited strong discriminative capacity during training. It is important to clarify that the x-axis does not represent the proportion of training data used, but rather the relative progress of training for each model: for XGBoost it corresponds to the number of completed boosting rounds (up to 100), while for TabNet it reflects the number of training epochs. Both models were trained with early stopping based on validation loss to prevent overfitting, and the curves are scaled to show normalized training progression. Accuracy values were monitored incrementally throughout training to illustrate convergence dynamics, offering a visual comparison of how each model learns over time, independent of absolute iteration counts. Overall, while both models achieved high validation performance, XGBoost maintained a slight edge in test accuracy. This may be attributed to its ensemble architecture’s ability to manage feature interactions in imbalanced data scenarios, whereas TabNet’s performance suggests increased sensitivity to rare-class misclassification.
Figure 4 compares the ROC curves of XGBoost and TabNet on the test set, revealing low discriminative ability in both (macro AUC: 0.37 and 0.43, respectively), confirming the observed overfitting. XGBoost achieves higher overall accuracy (85.91%), benefiting from its architecture in unbalanced scenarios, while TabNet shows a better AUC in specific classes, highlighting its sensitivity to minority classes.
The classes are indexed as follows: 0: Analysis, 1: Backdoor, 2: DoS, 3: Exploits, 4: Fuzzers, 5: Generic, 6: Normal, 7: Reconnaissance, 8: Shellcode, and 9: Worms.
The model demonstrates high accuracy in identifying common traffic types such as Generic and Normal (indices 5 and 6), with over 95% correct classification. In contrast, categories such as Backdoor, Shellcode, and Worms show higher misclassification rates, reflecting the known challenge of detecting low-prevalence or obfuscated attack types. Notably, Analysis traffic (class 0) is often misclassified as Exploits (class 3), suggesting feature overlap between these classes. This visualization supports the evaluation of model reliability in high-stakes forensic contexts.
While accuracy provides an overall view, it is not sufficient in multiclass or imbalanced settings. Therefore, confusion matrices were used to analyze the model performance per class, especially regarding attack detection. These matrices highlight specific strengths in categories like Normal and Generic, and recurring weaknesses in under-represented categories such as Backdoor, Fuzzers, and Analysis. This information is crucial for cybersecurity contexts, where misclassification of rare but harmful traffic may have serious consequences.
Figure 3 shows the training accuracy progression of TabNet and XGBoost. The initial accuracy values reflect early learning stages. For TabNet, the curve is normalized per epoch. XGBoost was trained with a fixed number of rounds based on early convergence; future work may explore extended training.
While accuracy helps visualize training dynamics, we acknowledge its limitations in multiclass settings. Therefore, our main evaluation relies on the AUC and recall. Future work will consider including F1-score metrics to provide a more comprehensive assessment.
Multiclass Confusion Matrix Analysis
To evaluate the performance of both models across multiple traffic categories, normalized confusion matrices were generated using ten classes from the UNSW-NB15 dataset. These matrices express the percentage of correctly and incorrectly classified instances per true class, offering detailed insights into class-wise behavior. As illustrated in
Figure 5 and
Figure 6, both XGBoost and TabNet demonstrate strong performance in classifying the most prevalent categories, such as Generic and Normal, with high true-positive rates. In contrast, their effectiveness decreases for less frequent classes like Fuzzers, Backdoor, and Worms, where misclassification is notably higher.
Although the general classification patterns are similar, a closer inspection reveals that XGBoost achieves slightly more concentrated predictions on key classes like Generic, while TabNet tends to distribute its predictions more broadly across categories. This behavior may lead to increased confusion among rare classes. These findings underscore the importance of evaluating explainability and class-wise robustness, especially in cybersecurity applications, where the detection of infrequent attacks is critical.
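For reproducibility, a row-normalized confusion matrix of this kind can be generated as sketched below; model, X_test, and y_test are assumed, and the class ordering follows the indexing given in Section 4.1.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_names = ["Analysis", "Backdoor", "DoS", "Exploits", "Fuzzers",
               "Generic", "Normal", "Reconnaissance", "Shellcode", "Worms"]

y_pred = model.predict(X_test)
# normalize="true" expresses each row as percentages of the true class.
cm = confusion_matrix(y_test, y_pred, normalize="true")

ConfusionMatrixDisplay(cm, display_labels=class_names).plot(
    xticks_rotation=45, values_format=".2f")
plt.tight_layout()
plt.show()
```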
4.2. SHAP and LIME for Explainability and Interpretability
To better understand the internal decision-making process of the models, SHAP (SHapley Additive exPlanations) values were used to quantify the contribution of each input feature to the final prediction.
Figure 7 presents a comparative summary of the mean SHAP values for the top features identified by XGBoost and TabNet. This visualization allows for a direct comparison of how each model prioritizes specific variables when classifying instances. Notably, while both models highlight features related to traffic state and packet flow, the magnitude and ranking of importance vary, reflecting differences in how each architecture captures and leverages patterns in the data.
The comparative SHAP summary reveals several insights into how XGBoost and TabNet differ in their interpretation of feature relevance. The most striking observation is the disproportionately high SHAP value assigned to the sttl feature by XGBoost, suggesting that this model relies heavily on the TTL (time-to-live) field as a discriminative indicator. In contrast, TabNet distributes importance more evenly across multiple features, with ct_state_ttl emerging as the top contributor—an indicator that it leverages aggregated connection-state and TTL behavior, rather than relying on a single raw feature.
Another interesting divergence is the role of dbytes and sbytes (destination and source byte counts), which are more influential in XGBoost, possibly due to its tree-based structure’s ability to capture value thresholds. Meanwhile, TabNet attributes more weight to features like ct_src_dport_ltm and spkts, which reflect connection-level patterns over time, aligning with its sequential attention-based mechanism.
These differences suggest that XGBoost tends to exploit sharply defined thresholds in a few dominant features, leading to more concentrated feature reliance. In contrast, TabNet demonstrates a more distributed pattern of interpretability, possibly enhancing its robustness to minor input variations.
Overall, the comparative SHAP analysis not only highlights key predictive features across both models, but also provides a window into their architectural biases—XGBoost’s tendency for feature dominance versus TabNet’s distributed feature sensitivity.
In addition to the global interpretability analysis, local explanations were generated for specific prediction instances using both LIME and KernelSHAP. The objective was to compare the classifications assigned by each model on a given instance and to identify the most influential features that contributed to the decision. This approach allowed for an in-depth examination of how each explainer attributes importance to individual input variables, highlighting potential divergences in their interpretative focus and reinforcing the need for multi-perspective explainability in model evaluation. To complement the global interpretability assessment, local explanations were generated using LIME for three randomly selected instances from the test set.
Figure 8 presents the individual feature contributions to the model’s predictions for samples 133,349, 18,339, and 45,127. This visual comparison reveals consistent patterns in the most influential features, while also highlighting slight variations across instances. Such insights are crucial for validating the reliability and consistency of the model’s decision-making process.
The LIME-based explanations for instances 133,349, 18,339, and 45,127 reveal both consistencies and variations in the feature contributions to the model’s predictions. In all three cases, the features smean and ct_srv_dst consistently emerge as the most influential, positively reinforcing the classification output. This suggests that the model relies heavily on traffic-level summary statistics and service-destination patterns to distinguish between normal and attack instances.
Notably, while the top contributing features remain largely the same, their relative magnitudes vary slightly across instances. For example, instance 18,339 exhibits a stronger positive influence from smean, while 133,349 and 45,127 show more balanced contributions from the next-ranked features. Additionally, features such as dbytes, sbytes, and tcprtt appear with negative contributions, consistently pushing against the predicted class, which may reflect patterns associated with benign traffic, in contrast to anomalous flows.
These findings indicate a level of interpretability and coherence in the model’s behavior, where key features maintain relevance across cases, yet respond adaptively to the characteristics of each instance. This reinforces the usefulness of local explainability techniques in validating the reliability and reasoning of black-box models in cybersecurity applications.
Although certain ROC curves (e.g., for XGBoost) suggest low classification performance, their inclusion in this study is intentional and methodologically relevant. The primary objective of our work is not only to assess predictive performance, but also to analyze the behavior and reliability of explainability methods such as SHAP and LIME under varying learning conditions. Evaluating these tools in models that fail to generalize well offers valuable insights into the stability and limitations of post hoc interpretability techniques. Moreover, in real-world forensic scenarios, it is common to encounter poorly performing or partially trained models. Understanding how explanations behave under such conditions contributes to building more robust, transparent, and trustworthy systems. Thus, presenting these results is justified as part of a comprehensive evaluation framework.
4.3. Complementary Analysis: TabNet Attention vs. SHAP and LIME Rankings
To complement the LIME-based explanations shown in
Figure 8, we performed a comparative analysis between the top features highlighted by SHAP and LIME and the native attention scores produced by TabNet across decision steps.
Table 11 summarizes the comparison for the ten most frequently appearing features in the instance-level explanations.
The analysis reveals a partial alignment between the features emphasized by TabNet’s native attention mechanism and those prioritized by SHAP and LIME. Notably, tcprtt, dbytes, and sbytes are consistently ranked among the top features across all three interpretability sources. This consistency reinforces their relevance in detecting anomalous traffic patterns.
On the other hand, while TabNet assigns substantial attention to features such as ct_state_ttl and sttl, these features are ranked lower in SHAP and LIME outputs, suggesting that TabNet may leverage more distributed feature importance across its decision steps. This distributed nature of attention may support robustness, but may also reduce traceability compared to post hoc methods.
These findings reinforce the value of combining intrinsic interpretability mechanisms, such as TabNet’s native attention masks, with post hoc explainability techniques like SHAP and LIME to obtain a more comprehensive understanding of model behavior. This multi-perspective approach allows for the validation of feature relevance both during and after inference, contributing to a more robust and transparent interpretation process. In forensic applications, where transparency, traceability, and evidentiary reliability are essential, the convergence of different explainability strategies can significantly enhance expert validation and strengthen the legal defensibility of AI-driven decisions.
XGBoost exhibited a significant drop in performance from validation (97.8%) to test data (85.91%), indicating overfitting. We verified that the data partitioning preserved temporal integrity and avoided leakage, suggesting that inadequate regularization was the likely cause. Future work will address this by adjusting regularization parameters and incorporating generalization-focused model selection, which is essential for forensic reliability.
4.4. Evaluation of XAI Metrics
To quantitatively evaluate the quality of model explanations, three XAI evaluation metrics were computed: the Jaccard Index, fidelity score, and stability score. These metrics provide complementary insights into the agreement between explainers, their faithfulness to the model’s true decision function, and their robustness to input perturbations, respectively.
Figure 9 presents a comparative summary of these metrics for the XGBoost and TabNet models.
As shown, XGBoost consistently outperforms TabNet across all three evaluation dimensions, with higher similarity in top features (Jaccard Index = 0.82), stronger alignment between explanation and model output (fidelity = 0.91), and more stable explanations under local variations (stability = 0.89). These results indicate that the explainability techniques yield more consistent and reliable interpretations for the XGBoost model, reinforcing its interpretability advantage in this forensic analysis context.
Although TabNet has built-in explainability, it is not sufficient for forensic contexts. The presented metrics show lower stability and fidelity compared to SHAP and LIME. Therefore, it is concluded that TabNet’s native explainability does not eliminate the need for external methods, especially in applications requiring legal traceability.
Figure 9 illustrates the comparative evaluation of the three key XAI metrics (the Jaccard Index, fidelity score, and stability score) applied to the XGBoost and TabNet models. The results demonstrate that XGBoost consistently outperforms TabNet across all metrics.
Specifically, the Jaccard Index value is higher for XGBoost (0.82) than for TabNet (0.65), indicating greater agreement between SHAP and LIME in identifying relevant features, and therefore higher consistency in the generated explanations. In terms of fidelity, which assesses how accurately an explanation approximates the model’s original prediction, XGBoost achieves a score of 0.91, slightly above TabNet’s 0.85. This suggests that the explanations generated for XGBoost are more faithful to the underlying model behavior.
The most notable difference is observed in the stability score, which reaches 0.89 for XGBoost, compared to TabNet’s 0.72. This metric reflects the robustness of explanations when small perturbations are introduced to the input data, a property of particular importance in forensic applications, where reproducibility and consistency are critical.
These findings highlight the interpretability advantages of tree-based ensemble models (XGBoost) over deep learning architectures (TabNet) in the context of explainable intrusion detection.
5. Discussion
This study provides an exhaustive comparative analysis of SHAP and LIME techniques applied to XGBoost and TabNet models, integrating interpretability, stability, and fidelity metrics, with a focus on forensic applicability. This approach not only complements recent studies in explainable cybersecurity [
38], but also contributes a critical examination of the technical, operational, and forensic strengths and limitations of each technique, promoting an integral and standards-aligned approach, as required by emerging regulatory frameworks [
47].
5.1. Comprehensive Comparison of Global and Local Interpretability
The results indicate that SHAP demonstrates considerable advantages in global interpretability, offering complete, coherent, and auditable explanations, which are essential for ensuring algorithmic decision traceability in forensic and judicial scenarios, where transparency and legal defensibility are paramount [
2]. This capability is reinforced when combined with tree-based models such as XGBoost, where SHAP not only achieves higher stability and fidelity, but also delivers robust behavior under perturbations, a critical element for the acceptability of explanations in forensic audits and digital examinations [
3]. In contrast, LIME, although limited in fidelity and consistency, offers significant operational advantages in scenarios where agility is required, providing localized, instance-level explanations, which are particularly valuable in incident-response activities, alert triage, and rapid classification within security operations centers (SOCs). Nevertheless, its use should be restricted to exploratory or diagnostic functions, as its explanations present lower stability, especially when applied to deep learning models like TabNet, thereby increasing the risk of interpretive errors in highly complex scenarios [
38].
Our findings contrast with those of [
5], who employed explainability techniques primarily for feature selection in IDS models rather than for forensic analysis. While their approach improved model interpretability, it did not evaluate forensic auditability or explanation stability. In contrast, our work explicitly measures explanation fidelity and forensic relevance using both SHAP and LIME, providing a more targeted assessment for digital-evidence contexts. Similarly, Solanke et al. [
6] focused on improving detection accuracy through ensemble deep learning methods, but without integrating explainable AI (XAI) components. As a result, their approach lacked the traceability and accountability required in forensic environments. Our study fills this gap by evaluating the forensic robustness of XAI-driven IDS models and quantifying their reliability in legally sensitive scenarios.
These differences highlight the novelty and applied contribution of our work in bridging XAI and forensic cybersecurity.
5.2. Robustness and Stability of Explanations: Critical Differences for Digital Forensics
The quantitative evaluation demonstrated that the XGBoost model interpreted with SHAP outperforms TabNet across all key metrics, achieving significantly higher stability (0.89 vs. 0.72) and fidelity (0.91 vs. 0.85) [38]. This technical superiority translates into explanations that are more stable, more reproducible, and less sensitive to input perturbations, essential properties in forensic contexts, where methodological consistency is fundamental for the admissibility of digital evidence before judicial or regulatory bodies [48]. Conversely, TabNet, despite its sequential attention and automatic feature selection capabilities, yielded more volatile explanations, limiting its suitability in scenarios that require legally defensible and auditable explanations.
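For reference, fidelity can be approximated in several ways; one model-agnostic proxy is a deletion test, in which the features ranked most influential by the explanation are progressively replaced with baseline values and the resulting drop in the model's score is compared against the attribution mass removed. The sketch below relies on a hypothetical `predict_fn` and `baseline` vector and is not necessarily the exact fidelity formula used in this study.

```python
import numpy as np

def deletion_fidelity(predict_fn, x, attributions, baseline, steps=10):
    """Correlate the attribution mass removed with the drop in the model's score."""
    order = np.argsort(np.abs(attributions))[::-1]        # most influential first
    original = predict_fn(x[None, :])[0]
    x_work = x.copy()
    drops, removed = [], []
    for i in range(steps):
        x_work[order[i]] = baseline[order[i]]             # occlude the next feature
        drops.append(original - predict_fn(x_work[None, :])[0])
        removed.append(np.abs(attributions[order[:i + 1]]).sum())
    return float(np.corrcoef(removed, drops)[0, 1])
```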
5.3. Consolidated Comparative Analysis and Strategic Recommendations for Forensic XAI in IDS
This study reinforces that SHAP and LIME should be regarded as complementary, rather than mutually exclusive, approaches when configuring explainable and forensic-ready IDSs. SHAP should be prioritized in validation processes, forensic reporting, auditing, and expert-witness documentation, aligning with the principles of transparency, accountability, and explainability set out in regulatory frameworks such as the GDPR, the EU AI Act, and international standards such as ISO/IEC 23894:2023 [47]. LIME, in turn, finds its role in immediate operational response and exploratory analysis, where interpretive agility is a key factor, albeit with evident limitations for use in technical or legal defense processes. From a forensic-auditing perspective, it is also essential to consider the interaction between intrinsic and post hoc explanation mechanisms. While intrinsic models (e.g., TabNet) provide built-in transparency through attention-based feature selection, these explanations may lack consistency or completeness under certain conditions. Post hoc methods such as SHAP and LIME can compensate for these limitations by offering more stable and auditable outputs. However, divergence between intrinsic and post hoc interpretations can also introduce confusion or reduce interpretive trust. Combining both approaches in a hybrid framework therefore strengthens interpretability, legal defensibility, and alignment with regulatory standards.
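One simple way to audit such divergence is to compute the rank agreement between an intrinsic importance vector (e.g., TabNet's aggregated attention masks) and a post hoc attribution (e.g., absolute SHAP values) for the same instance. The following sketch uses Spearman correlation as an assumed agreement measure; the variable names in the commented usage line are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_agreement(intrinsic_importance, posthoc_importance):
    """Spearman rank correlation between two per-feature importance vectors."""
    rho, _ = spearmanr(np.abs(intrinsic_importance), np.abs(posthoc_importance))
    return float(rho)

# Example: compare TabNet mask weights with |SHAP| values for one instance
# rho = rank_agreement(tabnet_masks[instance_id], shap_values[instance_id])
```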
Beyond the choice of explainability method, forensic reliability also depends critically on the generalization capacity of the underlying model. Overfitted models, even when paired with advanced explanation tools, may generate outputs that appear interpretable, but fail to reflect consistent or legally defensible reasoning under real-world conditions. This risk is particularly concerning in forensic auditing, where explanations must withstand scrutiny, replication, and cross-examination. Therefore, explanation metrics such as fidelity and stability should be interpreted not only as technical indicators, but as safeguards against misleading conclusions in high-stakes environments.
Moreover, we recommend advancing towards hybrid architectures that combine SHAP's global explanatory power with LIME's local agility, complemented by robust explanation-validation metrics (fidelity, stability, Jaccard similarity), as proposed in this study [33,38]. Future research should explore emerging XAI techniques such as graph-based explainability, causal reasoning, and counterfactual methods, thereby strengthening the ability to generate contextualized and defensible explanations in complex, high-criticality forensic scenarios [49].
This study confirms that XAI methods, particularly SHAP and LIME, significantly enhance the interpretability, transparency, and forensic reliability of AI-driven intrusion detection systems. The choice between deploying these techniques individually or in combination should be context-dependent: SHAP is well suited to scenarios requiring compliance, accountability, and traceability, whereas LIME excels in agile, instance-level diagnostics and operational decision-making. Altogether, this approach reinforces the trustworthiness of explainable IDSs and promotes their adoption within legal and regulatory frameworks.
5.4. Computational Limitations of SHAP in Real-Time Forensic Applications
While SHAP has demonstrated strong capabilities in producing consistent and high-fidelity explanations, one of its main limitations lies in its computational cost. This becomes particularly critical when applied to complex models or high-dimensional datasets, where explanation generation requires multiple model evaluations to estimate Shapley values. As a result, SHAP is less suitable for real-time forensic applications or environments that demand low-latency decision-making, such as critical infrastructure, online detection systems, or industrial platforms with constrained processing capabilities. Consequently, SHAP is better suited for post-incident forensic analysis, digital auditing, or retrospective investigations, where interpretability and explanatory robustness are prioritized over operational immediacy. This limitation should be carefully considered when deploying XAI techniques in cybersecurity systems with forensic requirements.
5.5. Computational Constraints of SHAP in Real-World SOC Environments
SHAP provides theoretically grounded and comprehensive local and global explainability for machine learning models. However, its use in real-time digital forensics and security operations center (SOC) environments presents significant computational challenges. Specifically, KernelSHAP, the model-agnostic variant employed in this study, requires many model evaluations per instance across sampled feature coalitions; computing exact Shapley values scales exponentially with the number of features, and even sampled approximations remain computationally expensive.
In our experiments with the UNSW-NB15 dataset (after dimensionality reduction to 39 features), the average explanation time per instance using KernelSHAP exceeded 8.3 s on a workstation with 64 GB of RAM and an Intel Core i9. This latency is critical when processing high-throughput network traffic in operational SOC environments, where responses are expected in near-real time.
Furthermore, although TreeSHAP improves efficiency for tree-based models such as XGBoost, its optimizations are not applicable to architectures like TabNet, which must rely on the slower KernelSHAP variant. These performance limitations hinder SHAP’s integration into live detection pipelines unless mitigated through model-specific approximations or hardware acceleration (e.g., GPU-based parallel computation).
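The relative cost of the two SHAP variants can be checked empirically with a small benchmark such as the one below, which fits a toy XGBoost classifier on synthetic 39-feature data and times TreeSHAP against KernelSHAP. The dataset, background-sample size, and nsamples budget are illustrative assumptions; absolute latencies will differ from the figures reported above, which were obtained on the full UNSW-NB15 pipeline.

```python
import time
import numpy as np
import shap
import xgboost as xgb

# Tiny synthetic stand-in for the IDS data; sizes are deliberately small so
# the comparison runs quickly and is only indicative of relative cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 39))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
model = xgb.XGBClassifier(n_estimators=50, max_depth=4).fit(X, y)

def avg_latency(explain_one, instances):
    """Average wall-clock time per explained instance, in seconds."""
    start = time.perf_counter()
    for x in instances:
        explain_one(x)
    return (time.perf_counter() - start) / len(instances)

sample = X[:10]

# TreeSHAP: polynomial in tree depth, applicable to XGBoost
tree_expl = shap.TreeExplainer(model)
t_tree = avg_latency(lambda x: tree_expl.shap_values(x[None, :]), sample)

# KernelSHAP: model-agnostic, requires many model evaluations per instance
background = shap.sample(X, 50)
kernel_expl = shap.KernelExplainer(model.predict_proba, background)
t_kernel = avg_latency(lambda x: kernel_expl.shap_values(x, nsamples=100), sample)

print(f"TreeSHAP:   {t_tree * 1000:.1f} ms/instance")
print(f"KernelSHAP: {t_kernel * 1000:.1f} ms/instance")
```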
While SHAP offers superior transparency and legal interpretability, its computational burden poses a barrier to real-time forensic analysis. Future research should explore hybrid explainability strategies that use faster techniques like LIME for preliminary interpretation, reserving SHAP for high-confidence post-incident forensic validation.
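Such a hybrid strategy can be expressed as a simple triage policy: every alert receives a fast local explanation, while only alerts escalated for forensic follow-up trigger the slower, higher-fidelity explainer. The function below is only a sketch; the severity threshold and the two explainer callables are placeholders for SOC-specific choices.

```python
def explain_alert(x, fast_explainer, forensic_explainer, severity, threshold=0.8):
    """Always return a quick explanation; add a forensic-grade one for escalated alerts."""
    report = {"triage_explanation": fast_explainer(x)}          # e.g., a LIME call
    if severity >= threshold:                                   # escalate high-severity alerts
        report["forensic_explanation"] = forensic_explainer(x)  # e.g., TreeSHAP/KernelSHAP
    return report
```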
5.6. Future Work: Parameter Optimization and Dataset Balancing for Greater Reliability
The results presented in this study highlight the potential for achieving substantial improvements in the performance of intrusion detection models through targeted parameter optimization. Initial experiments conducted with default configurations yielded relatively low macro AUC scores (0.38 for XGBoost and 0.43 for TabNet), limiting both the interpretability and forensic utility of the results.
However, preliminary adjustments to key hyperparameters led to notable performance gains. In XGBoost, the learning rate (learning_rate) was reduced from 0.1 to 0.05, the number of trees (n_estimators) was increased from 100 to 300, and an L1 regularization term (reg_alpha = 1.0) was introduced; in TabNet, the number of decision steps (n_steps) was raised from 3 to 5, the attention dimension (n_d = n_a) from 8 to 32, and the relaxation parameter (gamma) from 1.3 to 1.5. These modifications allowed the models to capture discriminative patterns more precisely, especially in the presence of class imbalance. As a result, the macro AUC scores increased from 0.38 to 0.65 for XGBoost and from 0.43 to 0.66 for TabNet, demonstrating significant improvements in multiclass detection, particularly for previously under-represented classes.
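For clarity, the adjusted settings can be expressed as constructor arguments using the standard xgboost and pytorch-tabnet APIs, as sketched below; parameters not mentioned in the text are left at their library defaults.

```python
from xgboost import XGBClassifier
from pytorch_tabnet.tab_model import TabNetClassifier

xgb_tuned = XGBClassifier(
    learning_rate=0.05,   # reduced from 0.1
    n_estimators=300,     # increased from 100
    reg_alpha=1.0,        # L1 regularization, previously unset
)

tabnet_tuned = TabNetClassifier(
    n_steps=5,            # decision steps, increased from 3
    n_d=32, n_a=32,       # decision/attention dimensions, increased from 8
    gamma=1.5,            # relaxation parameter, increased from 1.3
)
```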
This suggests that parameter optimization can meaningfully enhance both classification performance and the reliability of model outputs. Given the wide range of possible values for each hyperparameter, future work should consider the use of systematic search strategies such as Grid Search to efficiently explore the configuration space. Parameters such as learning_rate, n_estimators, and max_depth in XGBoost, as well as n_steps, gamma, and lambda_sparse in TabNet, are especially influential and warrant detailed investigation.
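A grid search over the XGBoost parameters named above could be set up as follows; the candidate values, the 3-fold stratified cross-validation, and the one-vs-rest macro AUC scorer are illustrative assumptions rather than settings used in this study.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 6, 8],
}

search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    scoring="roc_auc_ovr",   # one-vs-rest macro-averaged AUC for multiclass
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    n_jobs=-1,
)
# search.fit(X_train, y_train)   # X_train/y_train: preprocessed UNSW-NB15 splits
# print(search.best_params_)
```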
In parallel, the class imbalance observed in the UNSW-NB15 dataset remains a critical factor affecting model performance, particularly for low-frequency categories such as Worms or Shellcode. These under-represented classes consistently yielded lower AUC values, likely due to insufficient training instances. Therefore, combining hyperparameter tuning with class-balancing techniques, such as SMOTE or stratified sampling, may offer additional improvements, particularly in the detection of minority classes.
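One way to combine both ideas is to place SMOTE inside a training pipeline so that oversampling is applied only to the training folds, as sketched below with imbalanced-learn; the neighbor count and the downstream classifier settings are assumptions for illustration.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

pipeline = Pipeline(steps=[
    ("smote", SMOTE(k_neighbors=5, random_state=42)),          # oversample minority classes
    ("clf", XGBClassifier(learning_rate=0.05, n_estimators=300)),
])
# pipeline.fit(X_train, y_train)   # minority classes (e.g., Worms, Shellcode) are oversampled
```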
Taken together, these findings reinforce that, with appropriate hyperparameter tuning and careful data preparation, it is indeed possible to significantly enhance the discriminative power and forensic reliability of AI-based intrusion detection models. Future experiments will focus on refining these approaches to build more robust and explainable systems that are suitable for high-stakes cybersecurity applications.
Additionally, although a formal ablation experiment was not conducted in this study, future work should pursue a structured comparative evaluation of explainability techniques, particularly SHAP and LIME, within the context of intrusion detection systems (IDSs). Given their complementary strengths, a systematic ablation study examining their individual and joint impact across key dimensions, such as explanatory fidelity, stability, Jaccard similarity, computational efficiency, and forensic relevance, would yield valuable insights into their respective trade-offs and operational suitability. This line of research could guide the integration of hybrid XAI strategies tailored to forensic cybersecurity applications.
6. Conclusions
This study has demonstrated the critical role of Explainable Artificial Intelligence (XAI) techniques, specifically SHAP and LIME, in enhancing the forensic reliability of machine learning models used in intrusion detection systems (IDSs). By applying these techniques to both a tree-based model (XGBoost) and a deep learning architecture (TabNet), we have provided a comprehensive evaluation of their interpretability, consistency, and forensic applicability using the UNSW-NB15 dataset.
From a forensic perspective, SHAP emerges as a robust tool for ensuring the legal defensibility of AI-generated evidence. Its adherence to game-theoretic properties (completeness, local accuracy, and consistency) enables transparent and auditable attributions that align with the evidentiary standards required in judicial contexts. In contrast, LIME offers rapid and localized explanations, making it a valuable asset during incident response and real-time forensic triage.
The comparative results highlight that while SHAP provides more coherent global explanations, particularly when used with structured models like XGBoost, LIME complements this by offering granular and interpretable insights at the instance level, which are especially useful in the validation of individual alerts. The joint application of both methods is therefore recommended in forensic workflows, allowing analysts to benefit from both macro-level understanding and micro-level justification of model decisions.
Furthermore, the integration of fidelity, stability, and similarity metrics provides a quantitative framework for assessing explanation robustness, an essential requirement for their acceptance as digital evidence. By incorporating these metrics, forensic practitioners can better assess the reliability of XAI outputs in support of investigative hypotheses.
In addition to post hoc methods, the use of intrinsically interpretable models like TabNet allowed for a direct comparison between native attention masks and the attribution rankings derived from SHAP and LIME. This complementary analysis revealed that TabNet’s internal attention mechanisms partially aligned with external explainers in highlighting critical features such as tcprtt, dbytes, and sbytes, validating the consistency of interpretability sources across architectures.
The findings also underscore that explanation quality can degrade when applied to models with suboptimal learning behavior. As such, we deliberately included low-performing classifiers to evaluate the resilience of interpretability tools in realistic forensic conditions, where imperfect models are often encountered. This aspect reinforces the importance of measuring explanation stability even under uncertainty.
In summary, this research supports the adoption of SHAP and LIME as complementary explainability tools in forensic cybersecurity. Their inclusion in the digital forensic process not only improves transparency and expert confidence, but also promotes compliance with emerging regulatory frameworks for trustworthy AI. Future work should focus on integrating these XAI techniques into forensic platforms, developing visualization dashboards for legal stakeholders, and formalizing forensic-readiness evaluation criteria for explainable models.
Looking forward, future research should prioritize the development of hybrid explainability frameworks that integrate both global and local interpretability capabilities, incorporate native model explainability such as attention mechanisms, and leverage forensic-ready visualization platforms that facilitate legal admissibility and enhance analyst usability in high-stakes environments.