Article

Adversarial Training for Mitigating Insider-Driven XAI-Based Backdoor Attacks

School of Information Technology, Deakin University, Geelong, VIC 3217, Australia
* Authors to whom correspondence should be addressed.
Future Internet 2025, 17(5), 209; https://doi.org/10.3390/fi17050209
Submission received: 29 March 2025 / Revised: 27 April 2025 / Accepted: 2 May 2025 / Published: 6 May 2025
(This article belongs to the Special Issue Generative Artificial Intelligence (AI) for Cybersecurity)

Abstract

The study investigates how adversarial training techniques can be used to introduce backdoors into deep learning models by an insider with privileged access to training data. The research demonstrates an insider-driven poison-label backdoor approach in which triggers are introduced into the training dataset. These triggers misclassify poisoned inputs while maintaining standard classification on clean data. An adversary can improve the stealth and effectiveness of such attacks by utilizing XAI techniques, which makes the detection of such attacks more difficult. The study uses publicly available datasets to evaluate the robustness of the deep learning models in this situation. Our experiments show that adversarial training considerably reduces backdoor attacks. These results are verified using various performance metrics, revealing model vulnerabilities and possible countermeasures. The findings demonstrate the importance of robust training techniques and effective adversarial defenses to improve the security of deep learning models against insider-driven backdoor attacks.

1. Introduction

Deep learning (DL) is a key technique in many fields, including autonomous systems and healthcare. However, deep learning remains vulnerable to security threats that compromise its reliability and legitimacy. One of the most significant problems is adversarial attacks, in which carefully designed perturbations cause neural networks to produce incorrect predictions [1]. Adversarial training has been a popular defensive technique for handling such threats, where models are trained on adversarial samples to increase their robustness. However, existing studies indicate that adversarial training may be manipulated to create backdoor vulnerabilities, creating further security issues.
In data poisoning attacks, attackers may inject incorrectly labeled or altered training data, generating unexpected results that adversaries may exploit. According to [2], these risks are increased when surrogate models are used, in which adversarial examples are created for the target model by training alternative models. Furthermore, explainable AI (XAI), which was designed to improve the interpretability of models, unintentionally helps attackers by exposing flaws that can be used to refine attack tactics [3].
A backdoor attack is a specific attack category in which the attacker inserts secret triggers into the learning models during the training phase [4]. According to [5], the model performs normally in most situations but intentionally alters its behavior when specific triggers are present. The potential for adversarial training to introduce backdoor vulnerabilities is a serious concern in situations where training data are not securely managed. This danger is increased when an insider, such as an employee within the organization, is familiar with the dataset and the training pipeline and can alter data to introduce a backdoor attack. Such an insider can utilize adversarial perturbations to embed backdoor functionality, inject poisoned samples with backdoor triggers during adversarial training, and exploit knowledge of XAI outputs to further strategize the attack. The attacker can also use surrogate models to maximize the attack while leaving the fewest possible traces.
Nowadays, model explainability is used to aid in interpreting the results produced by machine learning (ML) and deep learning models. However, attackers can use it to create effective attacks using surrogate models and the data accessible to them. Recent studies have focused on the vulnerability of XAI and how attacks can be generated using XAI results [6]. Although XAI and surrogate models have been thoroughly examined in the context of adversarial robustness, the literature has not yet given enough attention to their roles and consequences in adversarially triggered backdoor attacks.
Substantial research has been conducted on backdoor attacks and adversarial training, but little is known about how these two areas interact. Existing research mainly concentrates on adversarial training as a defensive strategy, ignoring the possibility that this method could be used to introduce backdoor vulnerabilities. Furthermore, most studies do not consider insider threats when discussing deep learning security. The paucity of thorough research on how adversarial training can act as a backdoor attack vector highlights a significant weakness in the security of deep learning systems.
In this study, we focus on the effects of the vulnerability of XAI and how a backdoor attack can impact adversarial training. Our primary goal is to use model explainability methods to select a subset of features and their values for backdoor trigger generation and thereby propose an XAI-based backdoor attack. The contributions of the paper are as follows:
  • We propose an insider-driven poison-label backdoor attack that leverages model interpretability from explainable AI (XAI) to study the vulnerability of adversarial training when exploited by an insider.
  • We performed comprehensive experiments to evaluate and analyze the robustness of adversarially trained tabular deep learning models against the insider-driven poison-label attack.

Motivation

The role of insiders in organizations exposes a critical vulnerability. Insiders with access to training data can orchestrate backdoor attacks by manipulating data labeled for training [7]. Researchers like Yan et al. [8] highlight the ease with which insiders can inject triggers during training to achieve control over a model’s outputs, posing risks where access and permissions intersect. The interaction of backdoor triggers and adversarial instances emphasizes the need for robust detection methods and an improved understanding of how these concepts intersect.
Data augmentation techniques often address imbalanced class distributions and data scarcity. Despite their advantages, augmented samples may unintentionally inject adversarially generated data into the training dataset, leaving models vulnerable to manipulation. Properly generated augmented samples resembling the original data distribution enable models to increase performance by fine-tuning their decision boundary. On the other hand, attackers may deliberately inject poorly designed or maliciously crafted samples, which could lead to decision boundaries that accept destructive inputs. Consequently, it is essential to understand the security implications of augmented data, especially in the presence of an insider familiar with training scenarios.
The growing use of deep learning models in critical and significant applications, including cybersecurity, requires reliable and robust solutions. Organizations usually believe that adversarial training makes their artificial intelligence systems more secure. However, they frequently fail to consider situations where insiders could use this technique to create backdoor attacks that are difficult to discover. Attacks are made more complex and severe by insiders who are potential adversaries with privileged access to datasets and training pipelines. Therefore, a thorough analysis of the possible misuse of adversarial training from the perspective of insider threats is important to address a critical gap in the body of existing knowledge.
The rest of this paper is organized as follows. Section 2 provides the state of the art of the main ideas employed in this article. Section 3 presents a technical overview and details of the proposed backdoor attacks and the problem setting adopted to build the solution. Section 4 describes the dataset used for validation and the algorithms used, followed by an evaluation of our proposed method. Section 5 concludes the paper and identifies potential future research directions.

2. Related Work

The literature review in this section is organized to emphasize important topics essential to understanding the security environment of deep learning models, especially in the presence of insiders. In the first section, we introduce generative adversarial networks (GANs), primarily focusing on the differences between using GANs for tabular and image datasets. Following that, we cover adversarial training (AT), emphasizing how it can be used to defend against malicious manipulations and as a possible means of introducing hidden vulnerabilities. Current developments in explainable artificial intelligence (XAI) for cybersecurity are reviewed, focusing on how model interpretability techniques might unintentionally lead to exploitable vulnerabilities. In the final section, the concept of backdoor attacks is examined, as well as the methods by which attackers can stealthily compromise the integrity of the model while it is being trained.

2.1. Generative Adversarial Networks (GANs)

Variational autoencoders (VAEs), generative adversarial networks (GANs), and conventional oversampling techniques like SMOTE are well-known strategies for handling data imbalance in classification tasks. GAN-based methods improve model generalization and robustness by synthesizing realistic samples, using adversarial training to learn the underlying distribution of the minority class [9]. Oversampling techniques such as SMOTE create synthetic minority class instances by interpolating between neighboring instances [10]; although computationally efficient, they may introduce noise and redundancy that limit their effectiveness, particularly in complex, high-dimensional datasets. Compared to traditional oversampling, GANs and VAEs can generate more realistic and varied synthetic samples; however, their training challenges and computational demands require careful execution and hyperparameter adjustment.
Recent research has focused considerable attention on the use of generative adversarial networks (GANs) for the creation of synthetic data. Notably, the methods used for generating tabular data and those used for generating images differ significantly. Recent developments are summarized in this literature overview, highlighting modifications and domain-specific challenges reported in the research.
In tabular data synthesis [11], GAN-based approaches have been significantly tailored to handle heterogeneous data types, including numerical, binary, and categorical variables. Studies focusing on cybersecurity, healthcare, and fraud detection have applied tabular GANs and their variants to produce synthetic datasets that preserve the underlying statistical properties of the original data [12]. The Conditional Tabular GAN (CTGAN) [13] is one of the most prominent models extensively applied to synthesize realistic tabular datasets. For example, in healthcare, the CTGAN and CopulaGAN have been used to generate synthetic electronic health records that maintain logical relationships and distributional similarities with real datasets, enabling improved predictive modeling and mitigating issues related to data scarcity.
In summary, while both tabular and image data synthesis via GANs share the foundational adversarial training framework, they have evolved to address the unique characteristics of their respective data modalities. Tabular data generation emphasizes statistical properties and the preservation of multivariate relationships, whereas image data generation focuses on spatial feature hierarchies and perceptual realism [14,15]. This dichotomy illustrates the versatility of GAN architectures and underscores the importance of domain-specific adaptations to meet the varying challenges of synthetic data generation.

2.2. Adversarial Training

Adversarial training aims to enhance model robustness by exposing it to adversarial examples during training [16]. However, this approach can inadvertently reinforce the backdoor, primarily when the adversary controls the dataset or the poisoning process. According to Gu et al., poorly managed adversarial training can offer a false sense of security, leaving the model perpetually vulnerable to backdoor triggers embedded within the dataset [5]. Moreover, adversarial examples can be used against the trained model, allowing attackers to devise credible backdoors that activate under specific conditions without alerting the system [17].
In a classical adversarial setting, inputs are manipulated through perturbations that aim to mislead the model at inference [18]. However, in a backdoor context, these perturbations often activate the backdoor instead of causing label misclassifications across the board. Existing works [19,20] use data augmentation to defend against data poisoning attacks. Adversarial training-based data augmentation is a method of enhancing data to create more reliable models. Although recent works propose defense methods for adversarial attacks, this work uses GAN-based adversarial training as the defense mechanism and studies its effects.

2.3. Explainable AI for Cybersecurity

This section provides an overview of explainable AI and its existing uses in the cybersecurity domain. AI-based techniques are becoming increasingly precise, and specialists require more justification or knowledge about why certain decisions and recommendations are made [21,22]. Decisions must be clear and explicable for AI systems to be deployed successfully and for professionals to accept or adopt them. One way to increase user confidence in the models is through explainability. Through explainable AI (XAI), humans can perceive, understand, interpret, and describe how an AI model or system reaches a conclusion and makes judgments [23].
Paper [24] presents an empirical study of the privacy risks of explainable machine learning, using a new membership inference attack strategy based on prediction trajectories derived from explanations. In their survey, Baniecki and Biecek [25] explore the emerging field of adversarial explainable AI (AdvXAI), focusing on how adversarial attacks can manipulate explanations provided by explainable AI (XAI) methods, how to defend against such attacks, and how to design robust interpretation methods. The authors highlight the vulnerabilities of current XAI techniques and discuss potential defense mechanisms to enhance their robustness. In this work, we employ XAI-enabled threat generation adopted from these papers.

2.4. Backdoor Attacks

Here, we discuss the backdoor attacks and the state-of-the-art backdoor attacking strategies. Deep learning models have a high prediction capacity and can learn complex tasks. Nevertheless, their enormous capacity makes them open to privacy and security concerns. The model training set, parameters, and inputs are vulnerable to adversarial manipulation. By poisoning the training data, the attacker can reduce the predictive ability of the model or manipulate its behavior to suit the adversarial goals.
Despite recent advances in deep neural networks (DNNs), these systems are still susceptible to attack in volatile situations [5]. A malicious backdoor could be implanted in a model by contaminating the training dataset to make the infected model produce incorrect predictions during inference when the particular trigger emerges.
Backdoor attacks are a class of data poisoning attacks that aim to introduce intentionally harmful behavior into a DL model. This is achieved by introducing a trigger pattern using a poison-label [26] or clean-label attack [27]. Backdoor attacks are effective and practical for attackers since they require little effort to generate malicious inputs and have been shown to target real systems in both digital and physical environments [28].
Recently, there has been a surge in data poisoning attacks, leading to many researchers addressing this problem [29]. Backdoor attacks have been performed in various application settings like computer vision [30] and network traffic analysis [31].
The following section motivates this study and provides the background for the problem definition, followed by the backdoor trigger generation and attack strategy.

3. Proposed Approach

This section provides the overall process followed in the study. Adversaries can exploit the vulnerabilities implanted by insiders. The process starts with training data, which consist of the original dataset used to train a deep learning model. These data are vulnerable to manipulation, particularly when adversaries have access to the training pipeline. Figure 1 illustrates the sequence of processes involved in an insider-driven backdoor attack during adversarial training in deep learning models. The process flow demonstrates how an adversary can exploit the training process to inject hidden triggers.
An insider with privileged access to the training pipeline injects poisoned samples into the dataset. These samples contain backdoor triggers that do not affect the model’s normal behavior during standard training. Poison-label data poisoning, where specific samples are labeled incorrectly, is employed to generate the backdoor attack. The model learns to recognize hidden triggers in the poisoned data after introducing the samples. As a result, the attacker can manipulate the model’s predictions while keeping them highly accurate when using standard inputs.
A surrogate model is a threat model similar to the targeted deep learning model adopted by the adversary. The insider can use a surrogate model to optimize backdoor triggers without directly accessing the original deep learning model. The insider tests the poisoned data on the surrogate model so that they can refine the backdoor attack to ensure it remains stealthy and effective. Section 3.1 explains the proposed XAI-based backdoor attack, and Section 3.2 describes the approach used for adversarial training.

3.1. Insider-Driven Backdoor Attack Using XAI

According to [26], an insider attacker with access to the model or data poses a serious security risk since they may easily initiate a backdoor attack. In poison-label attacks, the labels of the poisoned samples are replaced with the target labels. As a result, the target labels are predicted when the backdoored model identifies the triggers. In this study, we focus on a poison-label attack triggered by an insider.
This work proposes a new backdoor attack exploiting the transparency that explainable AI provides through global model interpretation, as shown in Figure 2.
Adversary’s capacity: We assume the adversary is part of the organization, has access to the dataset, and is capable of launching an adversarial attack through data poisoning during the training and testing stages of the threat detection model generation process.
Adversaries can establish a backdoor by inserting a specific pattern into models. Attackers usually create and add additional data with specific patterns into the training set used to train or improve models. Once contaminated, these models wrongly assign inputs to a particular target class in response to a backdoor trigger but maintain high accuracy for benign data. Model interpretability can provide details of the features that influence the classification. The adversary uses XAI to identify the subtle features that drastically affect the threat identification process. The attacker can therefore use the most critical features to create minimal perturbations that cause malicious data to be classified as non-malicious and mask them so that they undermine insider threat identification.
Adversary’s objective: The adversary uses control over (a subset of) the features in a backdoor attack to cause misclassifications due to the presence of poisoned values in those feature dimensions. Assuming the attack produces a dense area of poisoned samples within the feature subspace encompassing the trigger, the classifier modifies its decision boundary to account for this density. The decision boundary is still subject to the attacker’s control, even if they only have a limited subspace. The attacker can modify both the region of the decision boundary and the density of attack points by carefully choosing the feature dimensions and values of the pattern as well as the number of poisoned data points they inject.
The various explainability strategies used after creating a system or model, called post-model approaches, can produce significant insights into the information that a system or model learns during training. This critical information explains why a sample is classified into a particular class. Using a model to understand how the data behave can be considered a reverse-engineering process and is therefore of interest to adversaries. Here, we consider the vulnerability of the XAI results and adopt Shapley Additive Explanations (SHAP) values [32], which provide a data-driven decision model, to generate a backdoor attack.
In the coming sections, we provide the details of the surrogate threat model being used and the backdoor trigger generation process using XAI.

3.1.1. Trigger Generation

Despite being transparent, XAI can be exploited to perturb data to create complex insider-driven backdoor attacks. Attackers can identify relatively small yet important features, or minor changes that significantly alter the prediction results, by analyzing XAI-generated explanations. The insider analyzes the model using XAI approaches to determine which features significantly affect the predictions.
SHAP [32] offers feature-level attribution, which indicates the exact contribution of each input feature to a prediction, making it an effective XAI technique. According to cooperative game theory, SHAP values measure the significance of features for individual predictions. SHAP provides local explanations that help attackers determine which features have the greatest impact and how a specific input is classified.
In this work, we consider positive SHAP values to indicate features influencing the model to decide on an insider behavior, whereas negative SHAP values show features driving the model to decide on a non-malicious activity. Motivated by this fact, we consider using SHAP values to generate the backdoor trigger using a feature subspace capable of changing the decision boundary of the learning models, thereby ending up in misclassifications. The feature space of the insider threat analysis is in tabular form. Hence, creating a backdoor trigger is quite challenging.
The adversary, being an insider, uses tree-based models to generate predictions and interprets the models to identify the critical features in the prediction. Tree-based ensembles have proven efficient for small amounts of data, which motivated us to use the tree ensemble XGBoost, known for its efficiency even on smaller tabular datasets. The SHAP values of a tree ensemble are the (weighted) average of the SHAP values of the individual trees. Simple logic underpins the significance of SHAP features: features with large absolute SHAP values are important, and to determine global importance, the absolute SHAP values for each feature are averaged across the data, as shown in Equation (1), where M is the number of samples and ϕ_j^(i) is the contribution of feature j for sample i.
I_j = \frac{1}{M} \sum_{i=1}^{M} \left| \phi_j^{(i)} \right|
We used the Tree SHAP model interpretation on the XGBoost algorithm. For tree-based machine learning models, including gradient-boosted trees, decision trees, and random forests, Lundberg et al. created Tree SHAP [33], a variant of SHAP. In more detail, Tree SHAP considers the topology of the tree, and the Shapley value is then used to estimate the importance of features for the instances of interest. The attack becomes efficient with a minimum number of features. When all the features are used to generate a trigger, it creates a new data instance that can contribute to the training process, whereas using only the most significant features to generate a trigger is more like tweaking the existing data to create misinterpretable samples that confuse the training process. We considered features with an importance value greater than zero to generate the trigger for this dataset. Algorithm 1 provides the detailed steps for generating the backdoor trigger (poisoned samples).
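As an illustration of this feature-ranking step, the following Python sketch computes the global importance of Equation (1) with the shap package and an XGBoost surrogate. The synthetic dataset, the model hyperparameters, and the non-zero threshold are placeholders rather than the exact configuration used in the paper.

```python
# Illustrative sketch (not the authors' exact code): ranking trigger-candidate features
# by mean absolute Tree SHAP value on an XGBoost surrogate, as in Equation (1).
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Placeholder for the subset of training data available to the insider.
X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)

surrogate = xgb.XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
surrogate.fit(X_train, y_train)

explainer = shap.TreeExplainer(surrogate)
S = explainer.shap_values(X_train)        # matrix S of shape (M samples, N features)
# Note: for multi-class surrogates, shap_values returns one matrix per class;
# the slice for the targeted (non-malicious) class would be selected first.

importance = np.abs(S).mean(axis=0)       # Equation (1): mean |SHAP| per feature
candidates = np.where(importance > 0)[0]  # keep only features with non-zero importance
ranked = candidates[np.argsort(importance[candidates])[::-1]]
print("Trigger-candidate features, most influential first:", ranked)
```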
Algorithm 1: Insider-driven backdoor attack using XAI
Consider a subset of training data X_Train and use it to train the surrogate model. Apply the SHAP interpretation and obtain the Shapley values as a 2D matrix S. Each row corresponds to a prediction produced by the model, and each column corresponds to a feature the model uses. The SHAP value for each feature indicates how much that feature contributes to the prediction outcome for that row. The size of matrix S is M × N, where M is the number of samples for inference and N is the number of features. A positive SHAP value positively impacts the prediction, causing the model to predict 1 (malicious class). A negative SHAP value indicates a negative influence, causing the model to predict 0 (non-malicious class). The features are ranked based on their influence on the model prediction, and the mean of the absolute importance values of features with non-zero values is considered.
Consider a targeted attack on non-malicious samples. The backdoor trigger is created using the important features that help classify non-malicious instances. The algorithm identifies the most influential features, creating the backdoor trigger set F_Bj; these features are used for malicious trigger generation. Finally, the attacker iterates over the selected training samples x ∈ X, injecting the malicious features F_Bj into each sample. Each poisoned sample b_i thus becomes a combination of original features F_mal and injected backdoor features F_Bj. The attacker assigns these manipulated samples a predefined target label y_target, typically a specific misclassification intended by the attacker. Once the backdoor is generated, the adversary needs to verify the efficacy of the proposed backdoor attack. This is achieved using a surrogate model that mimics the original model used for the prediction. The purpose and details of the surrogate model are provided in the section below.
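A minimal sketch of the poisoning step described above is shown below, under the assumption that the trigger simply overwrites the selected feature dimensions with fixed values; the value-selection rule, poisoning fraction, and target label are illustrative placeholders rather than the exact choices of Algorithm 1.

```python
import numpy as np

def inject_backdoor(X, y, trigger_features, trigger_values, y_target,
                    poison_fraction=0.1, seed=0):
    """Poison-label backdoor: plant the trigger pattern F_Bj in a subset of samples
    and relabel those samples with the attacker's target class y_target."""
    rng = np.random.default_rng(seed)
    X_p, y_p = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(poison_fraction * len(X)), replace=False)
    for j, v in zip(trigger_features, trigger_values):
        X_p[idx, j] = v                   # overwrite the chosen feature dimensions
    y_p[idx] = y_target                   # poison label assigned by the attacker
    return X_p, y_p, idx

# Example usage with the ranked features from the previous sketch (values are placeholders):
# X_poisoned, y_poisoned, poisoned_idx = inject_backdoor(
#     X_train, y_train, trigger_features=ranked[:3],
#     trigger_values=[1.5, -0.8, 2.0], y_target=0)
```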

3.1.2. Surrogate Model

The adversary can use the training data or a subset from the input feature space and pass it to a surrogate model to build a model. This model can be investigated to understand how the features play a role in classifying the data samples. Using this information, the insider creates backdoor data samples. In this work, we use the threat model with the assumption from BadNets [5] that the attacker can acquire and change the training data but cannot access the parameters, structure, or training process of the victim model. As there are publicly available data for insider threat analysis, an attacker can use these data to generate backdoor triggers. The attacker uses the publicly available data to generate a threat model and the transparency of explainable AI to obtain the most important features that influence the threat detection to perform a targeted backdoor attack, such that the insider activities are left unidentified.
According to this assumption, the adversary can exploit the DL training algorithm. Given the accessibility of public datasets, we make the reasonable assumption that the adversary can access the dataset. X_Train denotes the model input space, and each input instance x has a matching class label y. Backdoor attacks are defined by a backdoor trigger applied to each x in the input space. Backdoors are a set of data samples B added to the input space, and B(x) is classified as y_target ≠ y, where y_target is a target label of the attacker’s choice. The backdoored model performs well on most normal inputs but exhibits targeted misclassification when given an input containing a trigger specified by the attacker.
The proposed attack is demonstrated with experiments and results in Section 4. The following section details the adversarial training adopted to perform the threat detection in the presence of poisoned data samples.

3.2. Adversarial Training

This section explains how adversarial training validates the robustness of the model. Here, we perform adversarial training using GAN-generated synthetic samples, focusing on reducing the extreme class imbalance that hinders the usage of efficient threat detection methods for insider analysis. Adversarial samples are merged into the original training set to perform adversarial training.
In the paper [7], we proposed a generative adversarial model named the CWGAN-GP (conditional Wasserstein GAN with gradient penalty) for adversarial training to improve the robustness of insider threat identification, thereby reducing the effects of data imbalance. This GAN model generates realistic synthetic data to help machine and deep learning classifiers generalize more effectively. We evaluated multiple classifiers, including linear, non-linear, and ensemble algorithms, in that work.
However, in this study, the focus is on using DL models for threat identification and their robustness to an insider-driven backdoor attack using XAI. Figure 3 shows the adversarial training process in the presence of the backdoor and synthetic data for data augmentation and the original training data.
We study the impact of a backdoor attack on the adversarially trained models used for insider threat detection, assuming that the attacker can somehow be an insider related to the organization.
We designed multi-class classification DL models such that the goal is to construct a function that, given a new data point, will correctly predict the class to which the point belongs; this can be denoted as f(x, θ) = y. An adversarially trained model merges a subset of the original training data X_Train with the synthetic data X_Synthetic to perform the model training.
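The merge itself is straightforward; a minimal sketch is given below, where X_Train, y_Train and the GAN outputs X_Synthetic, y_Synthetic are placeholder NumPy arrays.

```python
import numpy as np

# Adversarial training set: original data merged with GAN-generated synthetic samples.
X_adv = np.concatenate([X_Train, X_Synthetic], axis=0)
y_adv = np.concatenate([y_Train, y_Synthetic], axis=0)

# Shuffle so the synthetic samples are spread across training batches.
perm = np.random.default_rng(0).permutation(len(X_adv))
X_adv, y_adv = X_adv[perm], y_adv[perm]
```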
The distribution of data samples affects the decision boundary of the classification model, and adding new data samples distorts the original decision boundary. Adversarial training has proven to result in complex decision surfaces, such that adversarial samples are also accommodated in the updated model; this is how AT models become resistant to adversarial attacks. Because backdoor attacks poison the data at training time, they aim to affect the model itself; hence, AT models should be able to resist the poisoned samples.

Training Models

Recent works have investigated the effectiveness of deep learning models on tabular data and concluded that regularized networks could handle tabular data well [34]. This work employs DL tabular learning models to perform multi-class classification. Much research has been carried out to develop transformer architectures that can process huge tabular datasets successfully since transformer architectures were first introduced to tabular data [35]. Similarly, new approaches use self-normalized neural networks (SNNs) [36].
We propose deep learning models adopting the self-normalizing neural network on a conventional Multi-Layer Perceptron (MLP) and a one-dimensional CNN (1DCNN). The features of the training data are strengthened by the use of scaled exponential linear units (SeLUs). Activation can propagate through the network’s layers while maintaining normalization due to the self-normalizing property. SeLUs can thus preserve the network’s stability and convergence while enhancing the model’s generalization abilities.
The details of the proposed architectures are explained in the following sections.
SNN-MLP architecture: We designed the SNN-MLP as a set of fully connected layers followed by a classification layer. There are four fully connected layers with 512, 256, 128, and 64 neurons. The layers use the LeCun uniform initializer, an alpha-dropout rate of 0.05, and a bias initializer set to zeros. The classification layer is a dense layer with neurons equal to the number of classes and softmax activation.
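A sketch of the SNN-MLP in Keras is shown below; the layer sizes, initializers, and alpha-dropout rate follow the description above, while the optimizer and loss are assumptions not stated in the text.

```python
import tensorflow as tf

def build_snn_mlp(n_features, n_classes):
    """SNN-MLP: four SeLU dense layers (512/256/128/64) with LeCun-uniform weights,
    zero biases, and AlphaDropout(0.05), followed by a softmax classification layer."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for units in (512, 256, 128, 64):
        x = tf.keras.layers.Dense(units, activation="selu",
                                  kernel_initializer="lecun_uniform",
                                  bias_initializer="zeros")(x)
        x = tf.keras.layers.AlphaDropout(0.05)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    # Optimizer and loss are assumptions; the text specifies Adam only for the SNN-1DCNN.
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```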
SNN-1DCNN architecture: The model uses two 1D convolutional layers with SeLU activation and batch normalization to learn local patterns and extract features. One-dimensional convolutional layers are more suitable for tabular data. The LeCun uniform initializer is used to initialize the convolutional kernels, and the pooling kernel size is set to 2. Following these 1D layers, three fully connected layers with 64, 32, and 16 neurons are added with an alpha-dropout rate of 0.2; the optimizer is Adam, the iteration number is set to 100, and the batch size for each training round is 1024. Finally, there is a classification layer with the number of neurons equal to the number of classes in the dataset and a softmax activation to perform multi-class classification.
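A corresponding sketch of the SNN-1DCNN follows; the number of convolutional filters and the kernel size are not specified in the text, so the values used here are assumptions.

```python
import tensorflow as tf

def build_snn_1dcnn(n_features, n_classes, filters=(64, 32), kernel_size=3):
    """SNN-1DCNN: two Conv1D blocks (SeLU + batch normalization + pooling of size 2)
    followed by 64/32/16 dense layers with AlphaDropout(0.2) and a softmax output.
    Filter counts and kernel size are assumed, as they are not given in the text."""
    inputs = tf.keras.Input(shape=(n_features, 1))       # each tabular row as a 1D sequence
    x = inputs
    for f in filters:
        x = tf.keras.layers.Conv1D(f, kernel_size, padding="same", activation="selu",
                                   kernel_initializer="lecun_uniform")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.MaxPooling1D(pool_size=2)(x)
    x = tf.keras.layers.Flatten()(x)
    for units in (64, 32, 16):
        x = tf.keras.layers.Dense(units, activation="selu",
                                  kernel_initializer="lecun_uniform")(x)
        x = tf.keras.layers.AlphaDropout(0.2)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training settings stated above: 100 iterations with a batch size of 1024, e.g.,
# model.fit(X_adv[..., None], y_adv, epochs=100, batch_size=1024)
```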
The performance of the model, determined experimentally in this study, depends critically on the size and step length of the convolutional kernels and the number of layers and convolutional kernels. The 1DCNN with SeLUs is used in fault diagnosis applications and has proved useful in combating the overfitting issue [37]. Hence, adopting a similar combination in this work, the issue of overfitting the model is successfully tackled by combining the 1DCNN with the self-normalized neural network, SNN, and utilizing the self-normalization property of the SeLU activation function.
In the existing works, ReLU activation is used; this makes the output of neurons with a negative input value zero, thereby reducing the interdependence among parameters. However, the SeLU activation used in the SNN preserves both positive and negative values and is known to learn faster. The alpha-dropout layer is used to regularize the training process. The dropout layer added to the network improves performance and generalization, and better generalization capability results in reduced overfitting.
TabNet: This study also employs the TabNet [35] model, which maintains the end-to-end and representation learning capabilities of a DNN. It combines the interpretability and sparse feature selection of tree models with the benefits of a DNN, making it comparable to the tree models widely used for tabular data tasks. The model improves the network’s interpretability while also increasing model performance.
TabNet allows different samples to select different features, as the sample mask vector can change. The feature transformer layer performs calculations and processing for the features chosen in the preceding step. The single-feature decision manifolds are combined to form a decision-tree-like structure. TabNet is more effective than decision trees in feature combination and performs feature calculation through a more intricate feature transform layer.
Network architecture: The hyperparameter settings for this study were chosen using TabNet’s hyperparameter reference criteria. An N-steps value in the range [3, 10] is the best selection for the majority of datasets. However, when more feature variables need to be learned, the N-steps value should be higher, although a network that is too deep could result in severely ill-conditioned matrices. Here, we set the number of N-steps to 5.
Given the limited quantity of augmented data utilized in this experiment and considering the recommended N-steps setting for smaller datasets in TabNet, the tuning range of N-steps for this experiment was 3–5, and a setting of 3 produced the best results while leaving the other hyperparameters unaltered. N_d and N_a balance model performance and complexity, with equal values appropriate for most datasets. The values of N_d and N_a should not both be high because this could lead to overfitting and poor generalization. In this case, we set N_d and N_a to the number of features.
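For completeness, a hedged sketch of this TabNet configuration is given below. The paper does not state which implementation was used, so the pytorch-tabnet package is an assumption, and X_adv, y_adv are placeholder NumPy training arrays.

```python
# Assumes the pytorch-tabnet package; the implementation used in the paper is not stated.
from pytorch_tabnet.tab_model import TabNetClassifier

n_features = X_adv.shape[1]              # X_adv, y_adv: placeholder training arrays
clf = TabNetClassifier(
    n_d=n_features,                      # decision-layer width set to the number of features
    n_a=n_features,                      # attention-layer width kept equal to n_d
    n_steps=3,                           # best value found in the 3-5 tuning range
)
clf.fit(X_adv, y_adv, max_epochs=100, batch_size=1024)
preds = clf.predict(X_adv)
```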

4. Experiments

We evaluated the performance of the proposed method. We provide the dataset description followed by the models used for training and performance evaluation. We implemented the method using Python 3, Tensorflow 2, and Keras 3.
The evaluation considers three main aspects of the problem design and attack strategy. Using model evaluation metrics, we investigated the deep learning models on the insider threat dataset. We measured the effectiveness of our proposed backdoor attack using the attack success rate and the model performance when backdoors are injected. We also analyzed the performance drop of the adversarially trained and backdoored models.

4.1. Dataset Description

The lack of real-world data substantially affected the experimentation. Typically, either data collected from the real world or artificially created data are employed. The most frequent malicious insiders are employees or entities linked to an organization with access rights. Employee actions and behaviors are directly tracked and recorded as part of data collection, which raises privacy and confidentiality concerns within an organization. Our experimental analysis leveraged the CMU CERT dataset [38], a widely recognized benchmark dataset for insider threat detection studies [39,40].
Although there are other CERT dataset versions, CERT v4.2 is frequently utilized because it has the most insider cases organized into three scenarios. The data include 70 insiders from 1000 users over 500 days. Additionally, we used version 5.2 with 2000 users and 30 insiders from five scenarios. Malicious insider-related data only make up roughly 1.278 percent of the data distribution in version 4.2, which is significantly skewed. Table 1 shows the dataset description.

4.2. Adversarial Training

We performed adversarial training using GAN-generated synthetic samples, focusing on reducing the extreme class imbalance that hinders the usage of efficient threat detection methods for insider analysis. Adversarial samples were merged into the original training set to perform the adversarial training. The performance of the algorithms was demonstrated in the absence and presence of adversarial attacks like backdoor (train-time) data poisoning attacks.
Here, we illustrate the performance of various classifiers with and without adversarial training using various synthetic data generation methods. The main focus is on GAN-based adversarial training; hence, we used various GAN models like the CGAN, CWGAN-GP, and ACGAN. The evaluation uses various performance metrics commonly used for machine learning problems.

4.2.1. Training Models

We employed three DL models, SNN-MLP, SNN-1DCNN, and TabNet, to detect insider threats. The SNN-MLP and SNN-1DCNN designs use SNNs on an MLP and 1DCNN. The network architecture is detailed in Section 3.2. Table 2 and Table 3 provided in this section give the performance analysis of the proposed DL models. Moreover, we compared these models with the tree-based models known for their efficiency in tabular data. We have performed the GAN-based experiments under various conditions like mode collapse and convergence in the work [7,41] and provided the experimental results. In the tables presented, values shown in bold indicate the best-performing results across the experiments.

4.2.2. Performance Metrics

Precision (P), recall (R), and F-score (F) were used to validate the classification models. High precision and recall are strongly recommended. However, these metrics may not accurately represent how well different models perform. The majority class is typically considered the negative class; since the malicious samples are uncommon, the minority (malicious) class is considered positive. The results were validated against false positives and false negatives and were examined based on precision and recall. We added two further metrics, Cohen’s Kappa (K) and the Matthews Correlation Coefficient (M), to the experimental evaluation since we recognize the significance of the confusion matrix, also known as an error matrix.
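These five metrics can be computed with scikit-learn as sketched below; the macro averaging for the multi-class precision, recall, and F-score is an assumption, since the averaging scheme is not stated in the paper.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             cohen_kappa_score, matthews_corrcoef)

def evaluate(y_true, y_pred):
    """P, R, F, Cohen's Kappa (K), and Matthews Correlation Coefficient (M)."""
    return {
        "P": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "R": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "F": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "K": cohen_kappa_score(y_true, y_pred),
        "M": matthews_corrcoef(y_true, y_pred),
    }
```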
We performed multi-class classification using various models like tree-based ensembles, such as Random Forest (RF), XGBoost (XGB), and LightGBM (LGBM). To enable more complex deep learning models, we propose two DL networks for insider threat detection referred to as the SNN-MLP and SNN-1DCNN. Moreover, we employed TabNet to train the insider threat data.
Table 2 gives the overall performance of various models on the original data with extremely skewed class distribution. The results show that XGBoost provides the best results for the skewed data. All other models, including other tree ensembles, RF, and LGBM, could not model the data well; this illustrates the need to handle the adverse effects of class imbalance.
We trained the models on original data and used adversarial training that combined the original dataset with artificial samples produced by generative adversarial models. We used Random Oversampling (ROS), SMOTE oversampling, and a variational autoencoder (VAE). Table 3 and Table 4 give the performance analysis of the multi-class classification with and without data augmentation.
The Kappa and MCC metrics account for the false positives and false negatives that lead to higher misclassification. In an ideal case, all the metrics should have higher values. Since the original data were insufficient, we applied the oversampling methods ROS and SMOTE. Even though ROS and SMOTE generated more samples, the diversity of the created data was not appreciable, which is reflected in the model performance. Overall, ROS did not help boost the models, whereas SMOTE led to improved metrics for specific models. This trend is seen in both versions of the data, which have varying classes and class distributions.
As shown in Table 3, the original extremely imbalanced data led to worse performance in all models except XGBoost. As usual, the tree-based algorithms work well for insider threat data, even with extreme imbalance. Specifically, the XGBoost algorithm can classify well when the data are highly imbalanced, whereas Random Forest does not perform well. Random Forest slightly increased the metrics when trained using SMOTE oversampling. As seen in the table, TabNet and LightGBM could yield better precision. LightGBM shows improvement in all metrics, whereas TabNet obtained high precision but did not show improvement in the other metrics. Though oversampling is widely used, the data generated by oversampling are not as diverse as those derived from generative models and hence cannot be considered an efficient data augmentation method.
The CGAN is designed as an MLP for both the generator and the discriminator. The generator MLP comprises three fully connected (FC) layers with 32, 64, and 128 neurons, with a LeakyReLU activation function in the first two layers and a linear activation in the last layer. The CGAN discriminator comprises three FC layers with 256, 128, and 32 neurons and LeakyReLU activation, followed by a final FC layer with sigmoid activation. The CGAN is conditioned on minority class labels and uses a latent dimension equal to the number of features in the data. The same architecture is used for the ACGAN as well.
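A Keras sketch of the CGAN generator and discriminator with the layer sizes listed above is shown below; conditioning by concatenating a one-hot label with the input, the LeakyReLU slope (default), and the final projection to the feature dimension are assumptions not spelled out in the text.

```python
import tensorflow as tf

def build_cgan_generator(latent_dim, n_classes, n_features):
    """Generator: 32/64 LeakyReLU layers and a 128-unit linear layer as described;
    the extra projection to n_features is an assumption so that the output matches
    the tabular feature dimension."""
    noise = tf.keras.Input(shape=(latent_dim,))
    label = tf.keras.Input(shape=(n_classes,))           # one-hot class condition (assumed)
    x = tf.keras.layers.Concatenate()([noise, label])
    x = tf.keras.layers.Dense(32)(x)
    x = tf.keras.layers.LeakyReLU()(x)
    x = tf.keras.layers.Dense(64)(x)
    x = tf.keras.layers.LeakyReLU()(x)
    x = tf.keras.layers.Dense(128, activation="linear")(x)
    fake = tf.keras.layers.Dense(n_features, activation="linear")(x)
    return tf.keras.Model([noise, label], fake)

def build_cgan_discriminator(n_features, n_classes):
    """Discriminator: 256/128/32 LeakyReLU layers followed by a sigmoid output."""
    sample = tf.keras.Input(shape=(n_features,))
    label = tf.keras.Input(shape=(n_classes,))
    x = tf.keras.layers.Concatenate()([sample, label])
    for units in (256, 128, 32):
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.LeakyReLU()(x)
    validity = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model([sample, label], validity)
```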
The CWGAN-GP architecture uses four layers in G and D (the critic), with 20, 32, 64, and 100 nodes in G and the same numbers of hidden neurons in descending order in D. The critic is trained five times per generator update, as in the original WGAN. We used LeakyReLU activation in all layers of G and D, with a linear activation in the last layer of G.
Table 4 gives the results of adversarial training using GAN-generated synthetic samples. Adversarial samples are generated using three GAN variants: the CGAN, CWGAN-GP, and ACGAN. We chose these three GAN models as they proved efficient in many existing works. The CGAN is trained on original insider threat data conditioned on the scenario-based class labels to generate data for any malicious scenario. A similar approach is used in training the CWGAN-GP and ACGAN. All three GAN variants are used to create malicious data samples and are used in the training process. Table 5 summarizes the GAN architectures used in this work.
Experiments show that augmentation using GAN-based adversarial training yielded improved metrics during the evaluation. Compared to the CGAN and ACGAN, CWGAN-GP-based training improves performance further. As seen in Table 4, all the metrics improve considerably. XGBoost shows exemplary performance on both datasets. All other models significantly increased all the metrics under adversarial training with GAN-generated synthetic samples.
Figure 4 illustrates the validation loss curves. Because of its instability, the CGAN frequently oscillates and converges more slowly after starting with the largest loss. Faster convergence and lower loss indicate the superiority of the ACGAN over the CGAN, but it still exhibits greater variation. The CWGAN-GP demonstrates improved training stability and generalization, consistently achieving the smoothest decline and the lowest validation loss. This shows that the CWGAN-GP performs better than the CGAN and ACGAN because of its gradient penalty regularization.
Furthermore, we conducted experiments to compare the performance of GAN-based augmentation and other model-agnostic adversarial sample generation methods like the Fast Gradient Sign Method (FGSM), DeepFool (DF), Carlini and Wagner (CW), and Jacobian-based Saliency Map Attack (JSMA). The GAN-based data generation gave improved performance for the CERT datasets. Table 6 provides the results for the comparison.

4.3. Discussion

A successful backdoor attack should have two characteristics: (1) injecting the backdoor should not cause the model’s accuracy on clean data to drop, and (2) the poisoned model should bypass the verification process after injection. When applying triggers to the testing data, we measure the attack success rate and classification performance to evaluate the efficiency of backdoor injection.

4.3.1. Success of Attack

The success of the proposed attack is measured using the attack success rate (ASR), calculated as the percentage of adversarial samples classified with the target label, as given in Equation (2). In addition, as a benchmark, we analyzed the classification performance of each model using clean data and the same training settings.
\mathrm{ASR} = \frac{\#\,\text{successful attacks}}{\#\,\text{total attempts made}}
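Equation (2) can be computed as sketched below, assuming a Keras-style classifier that outputs class probabilities; the model and data names are placeholders.

```python
import numpy as np

def attack_success_rate(model, X_triggered, y_target):
    """Fraction of trigger-carrying inputs classified as the attacker's target label."""
    preds = np.argmax(model.predict(X_triggered), axis=1)
    return float(np.mean(preds == y_target))
```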
Table 7 provides the attack success rate (ASR) for the backdoor generated using XAI. We performed the backdoor attack with different numbers of poisoned samples to see how the poison sample rate affects the attack. A higher ASR means that the attack on the model is more effective.
The data presented illustrate the efficacy and consequences of poison-label backdoor attacks in adversarial training settings with synthetic, GAN-generated data. The attack success rate (ASR) for the CERT datasets shows that deep learning models like the MLP, one-dimensional convolutional neural networks (1DCNNs), and TabNet exhibit different levels of vulnerability. For example, with CGAN-based poisoning, the SNN-MLP exhibits a significant performance loss (F-score drop) of up to 0.289 (28.9%), indicating vulnerability. In contrast, the SNN-1DCNN shows inconsistent results, with slight performance decreases (down to −0.027 with the CGAN), suggesting that the adversarially trained backdoor model occasionally outperforms baseline adversarially trained models. Similarly, TabNet demonstrates an apparent vulnerability to insider-driven attacks, especially under ACGAN poisoning (performance loss of 0.435).
Additionally, the CWGAN-GP, which combines the gradient penalty with the Wasserstein GAN, shows increased robustness among the evaluated models, highlighting its potential to strengthen defense techniques against backdoor attacks. These results highlight the need for rigorous robustness evaluations and careful selection of adversarial training methods, particularly for deep neural architectures like the MLP, 1DCNN, and TabNet, which are frequently used as tabular DL models.
The complexity of the methods depends on the type of samples used in the data. The dataset used in this work is not as complex as image-based datasets commonly used for experiments in the existing works. So, the computational complexity will not be excessive. In this case, the data are not high-dimensional. Moreover, the GAN-based augmentation and the trigger generation are not interlinked; they are independent processes. The underlying assumption is that the attacker performs the trigger generation using a subset of the sample space; hence, trigger generation cannot be too complicated. Hence, the proposed approach does not tend to be computationally complex.

4.3.2. Effect of the Number of Poisoned Samples

The experimental findings show that the efficacy of backdoor attacks correlates significantly with the percentage of poisoned samples inserted into the target class training data, up to a critical threshold of roughly 35%. After this point, the attack success rate usually reaches a plateau, suggesting that additional poisoning yields fewer benefits. The idea behind this saturation phenomenon is that additional poisoned data have no effect after a certain percentage of the training data has been compromised. This behavior emphasizes how crucial it is to learn the ideal poison injection threshold to carry out or prevent such attacks.
In particular, the performance of the models on clean data is almost unchanged regardless of the injection rate of poisoned data samples. This illustrates how poison-label backdoor attacks remain hidden, since models maintain their accuracy on clean inputs, making detection more difficult. Moreover, the studies show that model robustness against these threats is neither significantly improved nor deteriorated by increasing the number of poisoned samples above the identified threshold; however, the threshold depends on the size of the dataset. These findings highlight the need to balance robustness, poisoning rates, and the amount of training data to guarantee the secure and dependable use of deep learning models.

4.4. Robustness Analysis of Backdoor in AT

Here, we provide the robustness analysis of the backdoor in AT models. We used the performance drop metric to study the performance reduction in the adversarially trained DL models with and without the backdoor. The performance comparison among various models reveals that the CWGAN-GP-based training produces a smaller performance drop.
For the analysis, we use the performance drop, denoted M_drop, as shown in Equation (3). M_drop is the difference in the value of a performance metric (precision, recall, or F-score). Let M_AT denote the metric value of the adversarially trained model and M_B the metric value of the backdoored model, where M can be any performance metric used in the analysis.
M_{\mathrm{drop}} = M_{\mathrm{AT}} - M_{B}
M_drop is calculated for precision as P_drop, recall as R_drop, and F-score as F_drop, as shown in Table 7. The performance drop can be positive or negative. A positive value means the adversarial training gives a higher metric value, whereas a negative value means that the adversarially trained backdoored model gives a higher metric value.
The altered model achieves nearly identical metrics on the testing data, with somewhat lower precision for some, but not all, models. The precision, recall, and F-score values indicate a stable model, even in the presence of poisoned samples. The backdoored model favors non-malicious classification since the attack treats all (triggered) malicious samples as non-malicious, slightly altering the malicious-to-non-malicious ratio. The results demonstrate that GAN-based adversarial training can resist the backdoor attack to a large extent. As expected under the ideal attack strategy, the attack does not considerably affect the model performance.
The various GAN models behave differently with respect to the backdoored models. The models trained with GAN samples are not considerably affected by the backdoored samples, as seen in Table 7. Of the three models, the CWGAN-GP resists the backdoor attack better than the CGAN and ACGAN. When trained in the adversarial setting, a model alters the decision boundary to accommodate the newly added adversarial samples, and injecting the backdoor samples also alters the decision boundary. Since adversarial training can be considered a form of data poisoning at training time, it already alters the decision boundary, and hence, the AT models can resist the backdoor to a large extent.
In this work, we consider AT as the defense method. Regularization using data augmentation with clean and perturbed data during training was utilized to fine-tune a trained (poisoned) model to assess the defense against and mitigation of backdoor data poisoning. We show that a defender can significantly reduce backdoor attack effects by fine-tuning the model on a reliable source of clean data across a range of models without explicit knowledge of the poisoning techniques. Moreover, the attack is formulated to investigate backdoor performance on AT in insider threat analysis. Backdoor attacks can occur in insider threat detection and can go unnoticed, so we aim to use AT as a mechanism that can resist the backdoor as far as possible. The backdoor used in this work is framed using global model interpretability. The proposed approach has room for improvement in the sense that the value selection method can be modified.
AT using GAN-based augmentation creates data samples that can themselves be considered poisoned data samples. These samples are generated from within the original data distribution to mimic the training dataset closely. Building a model with these data results in an updated decision boundary that can accommodate the newly added samples, and the model becomes more robust, though without very high performance. Backdoored models also affect the decision boundary as new samples are added. The backdoor poisoning samples are likewise generated from within the original data distribution; hence, they can be treated as transformed original samples, similar to the synthetically generated ones. This is how AT can resist the backdoor attack to the greatest extent possible.

5. Conclusions

This study investigated the vulnerability of deep learning models to insider-driven backdoor attacks that use poison-label data poisoning and adversarial training. In particular, we proposed and evaluated a poison-label backdoor attack that leverages explainable AI (XAI) insights, demonstrating how interpretability techniques intended to improve model transparency might unintentionally enable stealthy and advanced insider attacks. According to empirical analyses conducted using the benchmark CERT datasets, the size and quality of poisoned samples substantially impact attack effectiveness; however, excessive poisoning does not correspondingly increase the adversary’s success.
In the future, we intend to refine the backdoor trigger, poison sample generation, and backdoor detection methods using explainable AI. We will also focus on using methods other than adversarial training to create robust insider threat detection models. Furthermore, the vulnerabilities of the model interpretation and the scope for its usage in cyber attacks can be investigated.

Author Contributions

Conceptualization, R.G.G. and A.S.; methodology, R.G.G. and A.S.; software, R.G.G. and A.S.; validation, R.G.G. and A.S.; writing—original draft preparation, R.G.G.; writing—review and editing, A.S. and Y.X.; supervision, A.S. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  2. Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z.B.; Swami, A. Practical Black-Box Attacks against Machine Learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security—ASIA CCS’17, Saadiyat Island, Abu Dhabi, 2–6 April 2017. [Google Scholar]
  3. Lin, Y.-S.; Lee, W.-C.; Celik, Z.B. What Do You See? Evaluation of Explainable Artificial Intelligence (XAI) Interpretability through Neural Backdoors. arXiv 2020, arXiv:2009.10639. [Google Scholar]
  4. Li, Y.; Jiang, Y.; Li, Z.; Xia, S.-T. Backdoor Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5–22. [Google Scholar] [CrossRef] [PubMed]
  5. Gu, T.; Liu, K.; Dolan-Gavitt, B.; Garg, S. BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access 2019, 7, 47230–47244. [Google Scholar] [CrossRef]
  6. Ali, H.; Khan, M.S.; Al-Fuqaha, A.; Qadir, J. Tamp-X: Attacking Explainable Natural Language Classifiers through Tampered Activations. Comput. Secur. 2022, 120, 102791. [Google Scholar] [CrossRef]
  7. Gayathri, R.; Sajjanhar, A.; Xiang, Y. Adversarial Training for Robust Insider Threat Detection. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022. [Google Scholar]
  8. Yan, Z.; Li, G.; Tian, Y.; Wu, J.; Li, S.; Chen, M.; Poor, H.V. DeHiB: Deep Hidden Backdoor Attack on Semi-Supervised Learning via Adversarial Perturbation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 10585–10593. [Google Scholar]
  9. Li, Z.; Jiang, H.; Wang, X. A novel reinforcement learning agent for rotating machinery fault diagnosis with data augmentation. Reliab. Eng. Syst. Saf. 2021, 253, 110570. [Google Scholar] [CrossRef]
  10. Švábenský, V.; Borchers, C.; Cloude, E.B.; Shimada, A. Evaluating the impact of data augmentation on predictive model performance. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 126–136. [Google Scholar]
  11. Wang, A.X.; Chukova, S.S.; Simpson, C.R.; Nguyen, B.P. Challenges and opportunities of generative models on tabular data. Appl. Soft Comput. 2024, 166, 112223. [Google Scholar] [CrossRef]
  12. Kang, H.Y.J.; Ko, M.; Ryu, K.S. Tabular transformer generative adversarial network for heterogeneous distribution in healthcare. Sci. Rep. 2025, 15, 10254. [Google Scholar] [CrossRef]
  13. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular Data Using Conditional GAN. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  14. Dou, H.; Chen, C.; Hu, X.; Xuan, Z.; Hu, Z.; Peng, S. PCA-SRGAN: Incremental orthogonal projection discrimination for face super-resolution. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1891–1899. [Google Scholar]
  15. Dou, H.; Chen, C.; Hu, X.; Jia, L.; Peng, S. Asymmetric CycleGAN for image-to-image translations with uneven complexities. Neurocomputing 2020, 415, 114–122. [Google Scholar] [CrossRef]
  16. Khazane, H.; Ridouani, M.; Salahdine, F.; Kaabouch, N. A Holistic Review of Machine Learning Adversarial Attacks in IoT Networks. Future Internet 2024, 16, 32. [Google Scholar] [CrossRef]
  17. Gao, Y.; Doan, B.G.; Zhang, Z.; Ma, S.; Zhang, J.; Fu, A.; Nepal, S.; Kim, H. Backdoor Attacks and Countermeasures on Deep Learning: A Comprehensive Review. arXiv 2020, arXiv:2007.10760. [Google Scholar]
  18. Cui, C.; Du, H.; Jia, Z.; Zhang, X.; He, Y.; Yang, Y. Data Poisoning Attacks with Hybrid Particle Swarm Optimization Algorithms against Federated Learning in Connected and Autonomous Vehicles. IEEE Access 2023, 11, 136361–136369. [Google Scholar] [CrossRef]
  19. Borgnia, E.; Cherepanova, V.; Fowl, L.; Ghiasi, A.; Geiping, J.; Goldblum, M.; Goldstein, T.; Gupta, A.K. Strong Data Augmentation Sanitizes Poisoning and Backdoor Attacks without an Accuracy Tradeoff. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  20. Qiu, H.; Zeng, Y.; Guo, S.; Zhang, T.; Qiu, M.; Thuraisingham, B. DeepSweep: An Evaluation Framework for Mitigating DNN Backdoor Attacks Using Data Augmentation. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security, Hong Kong, China, 7–11 June 2021. [Google Scholar]
  21. Agarwal, G. Explainable AI (XAI) for Cyber Defense: Enhancing Transparency and Trust in AI-Driven Security Solutions. Int. J. Adv. Res. Sci. Commun. Technol. 2025, 5, 132–138. [Google Scholar] [CrossRef]
  22. Sahakyan, M.; Aung, Z.; Rahwan, T. Explainable Artificial Intelligence for Tabular Data: A Survey. IEEE Access 2021, 9, 135392–135422. [Google Scholar] [CrossRef]
  23. Eldrandaly, K.A.; Abdel-Basset, M.; Ibrahim, M.; Abdel-Aziz, N.M. Explainable and Secure Artificial Intelligence: Taxonomy, Cases of Study, Learned Lessons, Challenges and Future Directions. Enterp. Inf. Syst. 2022, 17, 2098537. [Google Scholar] [CrossRef]
  24. Liu, H.; Wu, Y.; Yu, Z.; Zhang, N. Please Tell Me More: Privacy Impact of Explainability through the Lens of Membership Inference Attack. IEEE Symp. Secur. Priv. 2024, 31, 4791–4809. [Google Scholar]
  25. Baniecki, H.; Biecek, P. Adversarial Attacks and Defenses in Explainable Artificial Intelligence: A Survey. arXiv 2023, arXiv:2306.06123. [Google Scholar] [CrossRef]
  26. Chen, X.; Liu, C.; Li, B.; Lu, K.; Song, D. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv 2017, arXiv:1712.05526. [Google Scholar]
  27. Saha, A.; Subramanya, A.; Pirsiavash, H. Hidden Trigger Backdoor Attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 9–11 February 2020; Volume 34, pp. 11957–11965. [Google Scholar]
  28. Ning, R.; Li, J.; Xin, C.; Wu, H. Invisible Poison: A Blackbox Clean Label Backdoor Attack to Deep Neural Networks. In Proceedings of the IEEE INFOCOM 2021—IEEE Conference on Computer Communications, Vancouver, BC, Canada, 10–13 May 2021. [Google Scholar]
  29. Miller, D.J.; Xiang, Z.; Kesidis, G. Adversarial Learning Targeting Deep Neural Network Classification: A Comprehensive Review of Defenses against Attacks. Proc. IEEE 2020, 108, 402–433. [Google Scholar] [CrossRef]
  30. Gu, T.; Dolan-Gavitt, B.; Garg, S. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv 2017, arXiv:1708.06733. [Google Scholar]
  31. Ning, R.; Xin, C.; Wu, H. TrojanFlow: A Neural Backdoor Attack to Deep Learning-Based Network Traffic Classifiers. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications, Virtual, 2–5 May 2022. [Google Scholar]
  32. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
  33. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv 2018, arXiv:1802.03888. [Google Scholar]
  34. Kadra, A.; Lindauer, M.; Hutter, F.; Grabocka, J. Well-Tuned Simple Nets Excel on Tabular Datasets. arXiv 2021, arXiv:2106.11189. [Google Scholar]
  35. Arik, S.Ö.; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 6679–6687. [Google Scholar]
  36. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-normalizing neural networks. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  37. Li, Y.; Zou, L.; Jiang, L.; Zhou, X. Fault Diagnosis of Rotating Machinery Based on Combination of Deep Belief Network and One-Dimensional Convolutional Neural Network. IEEE Access 2019, 7, 165710–165723. [Google Scholar] [CrossRef]
  38. Cmu.edu. Insider Threat Test Dataset. Available online: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099 (accessed on 28 March 2025).
  39. Xiao, H.; Zhu, Y.; Zhang, B.; Lu, Z.; Du, D.; Liu, Y. Unveiling shadows: A comprehensive framework for insider threat detection based on statistical and sequential analysis. Comput. Secur. 2024, 138, 103665. [Google Scholar] [CrossRef]
  40. Gao, P.; Zhang, H.; Wang, M.; Yang, W.; Wei, X.; Lv, Z.; Ma, Z. Deep temporal graph infomax for imbalanced insider threat detection. J. Comput. Inf. Syst. 2025, 65, 108–118. [Google Scholar] [CrossRef]
  41. Gayathri, R.G.; Sajjanhar, A.; Xiang, Y.; Ma, X. Anomaly Detection for Scenario-Based Insider Activities Using CGAN Augmented Data. In Proceedings of the 2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Shenyang, China, 20–22 October 2021. [Google Scholar]
Figure 1. Overview of the proposed approach.
Figure 2. Insider-driven backdoor attack using XAI—trigger generation.
Figure 3. Adversarial training in the presence of backdoor and synthetic data, and the original training data.
Figure 4. Validation loss curves for CGAN, ACGAN, and CWGAN-GP.
Table 1. Dataset summary.

Data | No. of Instances (Total) | No. of Instances (Mal.) | No. of Users | No. of Insiders | Scenario S1 | S2 | S3 | S4
v4.2 | 330,452 | 966 | 1000 | 70 | 30 | 30 | 10 | -
v5.2 | 1,048,575 | 906 | 2000 | 99 | 29 | 30 | 10 | 30
Table 2. Performance of models using original data.

Data | Model | P | R | F | K | M   (metrics on the real/original data)
V4 | RF | 0.496 | 0.704 | 0.500 | 0.505 | 0.555
V4 | XGB | 0.754 | 0.748 | 0.739 | 0.920 | 0.921
V4 | LGB | 0.269 | 0.259 | 0.261 | 0.042 | 0.042
V4 | SNN_MLP | 0.276 | 0.870 | 0.289 | 0.069 | 0.187
V4 | SNN_1DCNN | 0.275 | 0.921 | 0.282 | 0.047 | 0.048
V4 | TabNet | 0.582 | 0.660 | 0.611 | 0.943 | 0.944
V5 | RF | 0.325 | 0.445 | 0.332 | 0.241 | 0.272
V5 | XGB | 0.552 | 0.454 | 0.491 | 0.489 | 0.493
V5 | LGB | 0.309 | 0.230 | 0.247 | 0.147 | 0.148
V5 | SNN_MLP | 0.205 | 0.699 | 0.172 | 0.006 | 0.061
V5 | SNN_1DCNN | 0.204 | 0.738 | 0.158 | 0.008 | 0.055
V5 | TabNet | 0.295 | 0.204 | 0.183 | 0.006 | 0.031
Table 3. Performance comparison of various models using oversampling and VAE methods.

Data | Model | ROS (P/R/F/K/M) | SMOTE (P/R/F/K/M) | VAE (P/R/F/K/M)
V4 | RF | 0.549/0.729/0.544/0.343/0.426 | 0.556/0.734/0.615/0.779/0.786 | 0.581/0.712/0.610/0.811/0.810
V4 | XGB | 0.782/0.761/0.759/0.891/0.886 | 0.749/0.518/0.575/0.427/0.279 | 1.000/0.709/0.792/0.789/0.869
V4 | LGB | 0.368/0.866/0.419/0.156/0.290 | 0.391/0.793/0.422/0.071/0.193 | 0.420/0.80/0.481/0.201/0.32
V4 | SNN_MLP | 0.531/0.778/0.551/0.358/0.461 | 0.556/0.764/0.568/0.491/0.563 | 0.510/0.550/0.61/0.331/0.452
V4 | SNN_1DCNN | 0.277/0.921/0.289/0.061/0.174 | 0.44/0.793/0.517/0.487/0.557 | 0.45/0.801/0.53/0.5/0.462
V4 | TabNet | 0.373/0.491/0.298/0.086/0.086 | 0.742/0.487/0.508/0.367/0.462 | 0.341/0.440/0.54/0.381/0.420
V5 | RF | 0.317/0.456/0.341/0.225/0.261 | 0.253/0.464/0.287/0.175/0.234 | 0.280/0.481/0.312/0.190/0.24
V5 | XGB | 0.208/0.719/0.181/0.008/0.059 | 0.207/0.714/0.189/0.012/0.073 | 0.220/0.731/0.21/0.021/0.08
V5 | LGB | 0.372/0.665/0.337/0.069/0.171 | 0.236/0.659/0.262/0.079/0.179 | 0.251/0.67/0.272/0.08/0.181
V5 | SNN_MLP | 0.231/0.606/0.252/0.089/0.177 | 0.221/0.614/0.234/0.048/0.131 | 0.240/0.622/0.261/0.054/0.146
V5 | SNN_1DCNN | 0.204/0.779/0.16/0.007/0.056 | 0.745/0.329/0.724/0.039/0.14 | 0.321/0.750/0.342/0.052/0.159
V5 | TabNet | 0.383/0.364/0.375/0.324/0.324 | 0.625/0.556/0.111/0.134/0.541 | 0.630/0.563/0.121/0.145/0.083
Table 4. Performance comparison of various GAN-based adversarial training on deep learning models.

Data | Model | CGAN (P/R/F/K/M) | ACGAN (P/R/F/K/M) | CWGANGP (P/R/F/K/M)
V4 | RF | 0.751/0.748/0.739/0.920/0.920 | 0.746/0.608/0.609/0.636/0.748 | 0.796/0.727/0.750/0.882/0.846
V4 | XGB | 0.782/0.761/0.759/0.891/0.886 | 0.749/0.518/0.575/0.427/0.279 | 1.000/0.709/0.792/0.789/0.869
V4 | LGB | 0.714/0.875/0.773/0.545/0.567 | 0.718/0.716/0.696/0.816/0.818 | 0.930/0.712/0.768/0.803/0.816
V4 | SNN_MLP | 0.749/0.445/0.514/0.599/0.627 | 0.779/0.787/0.783/0.898/0.898 | 0.832/0.748/0.786/0.938/0.939
V4 | SNN_1DCNN | 0.569/0.429/0.482/0.612/0.653 | 0.750/0.647/0.916/0.916/0.919 | 0.804/0.722/0.752/0.752/0.762
V4 | TabNet | 0.708/0.635/0.667/0.900/0.903 | 0.792/0.973/0.861/0.722/0.744 | 0.935/0.757/0.789/0.890/0.840
V5 | RF | 0.465/0.447/0.454/0.428/0.428 | 0.683/0.511/0.540/0.606/0.643 | 0.815/0.748/0.779/0.935/0.936
V5 | XGB | 0.998/0.751/0.816/0.948/0.949 | 1.000/0.629/0.732/0.861/0.869 | 0.994/0.763/0.820/0.963/0.963
V5 | LGB | 0.764/0.448/0.521/0.620/0.649 | 0.839/0.834/0.819/0.458/0.480 | 0.921/0.746/0.772/0.843/0.848
V5 | SNN_MLP | 0.588/0.389/0.454/0.588/0.636 | 0.815/0.748/0.779/0.935/0.936 | 0.875/0.747/0.798/0.940/0.941
V5 | SNN_1DCNN | 0.562/0.418/0.470/0.592/0.638 | 0.847/0.762/0.724/0.636/0.670 | 0.869/0.763/0.803/0.961/0.961
V5 | TabNet | 0.826/0.942/0.875/0.751/0.760 | 0.889/0.589/0.641/0.677/0.712 | 0.916/0.774/0.824/0.883/0.883
Table 5. GAN architectures used in adversarial training.

CGAN
  Generator: Dense 32 (LeakyReLU) -> Dense 64 (LeakyReLU) -> Dense 128 (Linear)
  Discriminator: Dense 256 (LeakyReLU) -> Dense 128 (LeakyReLU) -> Dense 32 (LeakyReLU) -> Dense 1 (Sigmoid)
  Hyperparameters: Adam (lr = 0.0002, beta_1 = 0.5), 300 epochs, batch size 64

ACGAN
  Generator: Dense 32 (LeakyReLU) -> Dense 64 (LeakyReLU) -> Dense 128 (Linear)
  Discriminator: Dense 256 (LeakyReLU) -> Dense 128 (LeakyReLU) -> Dense 32 (LeakyReLU) -> Dense 1 (Sigmoid)
  Hyperparameters: Adam (lr = 0.0002, beta_1 = 0.5), 300 epochs, batch size 64

CWGANGP
  Generator: Dense 20 (LeakyReLU) -> Dense 32 (LeakyReLU) -> Dense 64 (LeakyReLU) -> Dense 100 (Linear)
  Discriminator: Dense 100 (LeakyReLU) -> Dense 64 (LeakyReLU) -> Dense 32 (LeakyReLU) -> Dense 20 (LeakyReLU) -> Dense 1 (Sigmoid)
  Hyperparameters: Adam (lr = 0.0002, beta_1 = 0.5), 300 epochs, batch size 64
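For readers who want to reproduce the configuration in Table 5, the following is a minimal Keras sketch of the CGAN generator and discriminator using the layer widths, activations, and optimizer settings listed above; the latent size, the label embedding, and the use of the feature dimension for the final generator layer (listed as 128 units in the table) are illustrative assumptions rather than details taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

def build_cgan(n_features=128, n_classes=2, latent_dim=32):
    """Conditional GAN with the layer sizes reported in Table 5 (CGAN row)."""
    # Generator: Dense 32 -> Dense 64 (LeakyReLU) -> linear output layer.
    g_noise = layers.Input(shape=(latent_dim,), name="noise")
    g_label = layers.Input(shape=(1,), dtype="int32", name="g_label")
    g_emb = layers.Flatten()(layers.Embedding(n_classes, latent_dim)(g_label))
    g = layers.Concatenate()([g_noise, g_emb])
    g = layers.LeakyReLU(0.2)(layers.Dense(32)(g))
    g = layers.LeakyReLU(0.2)(layers.Dense(64)(g))
    fake = layers.Dense(n_features, activation="linear")(g)  # "Dense 128, Linear" in Table 5
    generator = models.Model([g_noise, g_label], fake, name="generator")

    # Discriminator: Dense 256 -> 128 -> 32 (LeakyReLU) -> Dense 1 (Sigmoid).
    d_sample = layers.Input(shape=(n_features,), name="sample")
    d_label = layers.Input(shape=(1,), dtype="int32", name="d_label")
    d_emb = layers.Flatten()(layers.Embedding(n_classes, n_features)(d_label))
    d = layers.Concatenate()([d_sample, d_emb])
    d = layers.LeakyReLU(0.2)(layers.Dense(256)(d))
    d = layers.LeakyReLU(0.2)(layers.Dense(128)(d))
    d = layers.LeakyReLU(0.2)(layers.Dense(32)(d))
    validity = layers.Dense(1, activation="sigmoid")(d)
    discriminator = models.Model([d_sample, d_label], validity, name="discriminator")
    discriminator.compile(optimizer=optimizers.Adam(learning_rate=2e-4, beta_1=0.5),
                          loss="binary_crossentropy")
    return generator, discriminator
```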
Table 6. Performance comparison of various adversarial sample generation methods.

Model | V4 (FGSM/DF/CW/JSMA) | V5 (FGSM/DF/CW/JSMA)
RF | 0.512/0.561/0.550/0.489 | 0.408/0.549/0.522/0.463
XGB | 0.999/0.452/0.635/0.612 | 0.750/0.701/0.667/0.636
LGB | 0.633/0.571/0.619/0.571 | 0.615/0.612/0.655/0.539
SNN_MLP | 0.459/0.616/0.474/0.489 | 0.549/0.625/0.761/0.750
SNN_1DCNN | 0.532/0.681/0.612/0.539 | 0.513/0.523/0.554/0.561
TabNet | 0.734/0.714/0.762/0.716 | 0.704/0.725/0.681/0.702
Table 7. Performance drop from the adversarially trained models and the backdoored models.

Data | Model | ASR % | CGAN (Pdrop/Rdrop/Fdrop) | ACGAN (Pdrop/Rdrop/Fdrop) | CWGANGP (Pdrop/Rdrop/Fdrop)
V4 | RF | 90 | 0.026/0.016/0.014 | 0.038/0.079/0.023 | 0.002/0.307/0.244
V4 | XGB | 100 | 0.018/0.313/0.238 | 0.075/0.164/0.154 | 0.2464/0.209/0.217
V4 | LGB | 96 | 0.057/0.238/0.126 | 0.056/0.271/0.276 | 0.077/0.035/0.026
V4 | SNN_MLP | 82 | 0.029/−0.551/−0.289 | 0.020/0.351/0.252 | 0.060/0.015/0.044
V4 | SNN_1DCNN | 96 | −0.027/0.189/0.226 | 0.002/0.068/0.319 | 0.005/0.347/0.290
V4 | TabNet | 92 | 0.038/0.134/0.243 | 0.072/0.435/0.255 | 0.043/0.340/0.322
V5 | RF | 90 | 0.056/0.139/0.135 | 0.241/0.065/0.101 | 0.008/0.009/0.008
V5 | XGB | 100 | 0.00/0.00/0.00 | 0.00/0.00/0.00 | 0.00/0.00/0.00
V5 | LGB | 96 | 0.085/0.005/0.001 | 0.057/0.287/0.186 | 0.027/0.024/0.022
V5 | SNN_MLP | 86 | 0.011/0.049/0.049 | 0.008/0.310/0.262 | 0.001/0.220/0.174
V5 | SNN_1DCNN | 97 | 0.019/−0.065/−0.030 | 0.014/0.317/0.192 | 0.025/0.005/0.067
V5 | TabNet | 95 | 0.023/0.495/0.354 | 0.037/0.014/−0.041 | 0.016/0.058/0.073
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
