Exploring Synergy of Denoising and Distillation: Novel Method for Efficient Adversarial Defense
Abstract
1. Introduction
- A review of current defense strategies against adversarial attacks and a classification of defense mechanisms;
- An analysis of the transferability of adversarial robustness and a definition of the considerations required to build an adversarial defense model;
- The conceptualization and design of a defense model that combines knowledge distillation and feature denoising into an effective countermeasure against adversarial attacks.
- We propose that knowledge distillation for adversarial robustness enhances the effect of feature denoising, enabling a more efficient response to adversarial attacks (a minimal sketch of this fusion follows this list).
- We propose that the use of smoothing in distillation strengthens defense performance under adversarial attacks while maintaining high model accuracy.
- We achieved a notable defense success rate of 72.7%, a significant improvement over the 41.0% success rate of models that rely on denoising defenses alone.
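To make this fusion concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation, of how an adversarially trained teacher, a temperature-smoothed distillation loss, and a feature-denoising block in the student could be combined. The FGSM perturbation step, the simple mean-filter denoising block, and all names and hyperparameters (`teacher`, `student`, `T`, `alpha`, `eps`) are illustrative assumptions.

```python
# Minimal sketch of the distillation-plus-denoising idea (illustrative only;
# not the authors' released DiNo-Net code). Assumed names/hyperparameters:
# `teacher`, `student`, temperature `T`, mixing weight `alpha`, budget `eps`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenoisingBlock(nn.Module):
    """Simple feature-denoising block in the spirit of Xie et al. (2019):
    mean filtering followed by a 1x1 projection and a residual connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        denoised = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        return x + self.proj(denoised)


def fgsm_examples(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                  eps: float = 8 / 255) -> torch.Tensor:
    """One-step FGSM perturbation used to train and evaluate robustness."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad, = torch.autograd.grad(loss, x_adv)
    return (x_adv + eps * grad.sign()).clamp(0, 1).detach()


def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.9) -> torch.Tensor:
    """Temperature-smoothed KL to the robust teacher plus a small
    hard-label cross-entropy term (Hinton-style distillation)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


def train_step(student, teacher, x, y, optimizer, eps: float = 8 / 255):
    """Distill the adversarially trained teacher into the denoising student
    on FGSM-perturbed inputs."""
    teacher.eval()
    x_adv = fgsm_examples(student, x, y, eps)
    with torch.no_grad():
        t_logits = teacher(x_adv)          # robust soft targets
    s_logits = student(x_adv)              # student contains DenoisingBlock(s)
    loss = distillation_loss(s_logits, t_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Scaling the soft term by T² follows the usual distillation convention, keeping its gradient magnitude comparable to the hard-label term as the temperature grows.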
2. Related Works
2.1. Adversarial Attack
2.1.1. Exploring Adversarial Attack
2.1.2. Fast Gradient Sign Method
2.2. Obfuscated Gradient
2.2.1. Shattered Gradient
2.2.2. Stochastic Gradient
2.2.3. Vanishing and Exploding Gradients
2.3. Adversarial Defense
2.3.1. Adversarial Training
2.3.2. Defensive Distillation
2.3.3. Feature Denoising
3. Distillation and deNoise Network (DiNo-Net)
3.1. Adversarial Robustness Training
3.1.1. Adversarial Training for Teacher Model
3.1.2. Feature Denoising for Student Model
3.2. Fusion Method Based on Distillation
4. Experiment and Evaluation
4.1. Experimental Environment
4.2. Experimental Result and Evaluation
4.2.1. Evaluation of Defensive Performance
4.2.2. Evaluation of Defensive Efficiency
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ahmad, T.; Zhu, H.; Zhang, D.; Tariq, R.; Bassam, A.; Ullah, F.; AlGhamdi, A.S.; Alshamrani, S.S. Energetics Systems and artificial intelligence: Applications of industry 4.0. Energy Rep. 2022, 8, 334–361.
- Hu, Z.; Zhang, Y.; Xing, Y.; Zhao, Y.; Cao, D.; Lv, C. Toward human-centered automated driving: A novel spatiotemporal vision transformer-enabled head tracker. IEEE Veh. Technol. Mag. 2022, 17, 57–64.
- Liu, Y.; Ping, Y.; Zhang, L.; Wang, L.; Xu, X. Scheduling of decentralized robot services in cloud manufacturing with deep reinforcement learning. Robot. Comput.-Integr. Manuf. 2023, 80, 102454.
- Alexander, A.; Jiang, A.; Ferreira, C.; Zurkiya, D. An intelligent future for medical imaging: A market outlook on artificial intelligence for medical imaging. J. Am. Coll. Radiol. 2020, 17, 165–170.
- Park, H.C.; Hong, I.P.; Poudel, S.; Choi, C. Data Augmentation based on Generative Adversarial Networks for Endoscopic Image Classification. IEEE Access 2023.
- Guembe, B.; Azeta, A.; Misra, S.; Osamor, V.C.; Fernandez-Sanz, L.; Pospelova, V. The emerging threat of AI-driven cyber attacks: A review. Appl. Artif. Intell. 2022, 36, 2037254.
- Qiu, S.; Liu, Q.; Zhou, S.; Wu, C. Review of artificial intelligence adversarial attack and defense technologies. Appl. Sci. 2019, 9, 909.
- Girdhar, M.; Hong, J.; Moore, J. Cybersecurity of Autonomous Vehicles: A Systematic Literature Review of Adversarial Attacks and Defense Models. IEEE Open J. Veh. Technol. 2023, 4, 417–437.
- Kaviani, S.; Han, K.J.; Sohn, I. Adversarial attacks and defenses on AI in medical imaging informatics: A survey. Expert Syst. Appl. 2022, 198, 116815.
- Finlayson, S.G.; Bowers, J.D.; Ito, J.; Zittrain, J.L.; Beam, A.L.; Kohane, I.S. Adversarial attacks on medical machine learning. Science 2019, 363, 1287–1289.
- Kloukiniotis, A.; Papandreou, A.; Lalos, A.; Kapsalas, P.; Nguyen, D.V.; Moustakas, K. Countering adversarial attacks on autonomous vehicles using denoising techniques: A review. IEEE Open J. Intell. Transp. Syst. 2022, 3, 61–80.
- Yin, H.; Wang, R.; Liu, B.; Yan, J. On adversarial robustness of semantic segmentation models for automated driving. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 5–9 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 867–873.
- Memos, V.A.; Psannis, K.E. NFV-based scheme for effective protection against bot attacks in AI-enabled IoT. IEEE Internet Things Mag. 2022, 5, 91–95.
- Carlini, N.; Hayes, J.; Nasr, M.; Jagielski, M.; Sehwag, V.; Tramer, F.; Balle, B.; Ippolito, D.; Wallace, E. Extracting training data from diffusion models. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 5253–5270.
- Fritsch, L.; Jaber, A.; Yazidi, A. An Overview of Artificial Intelligence Used in Malware. In Proceedings of the Symposium of the Norwegian AI Society; Springer: Berlin/Heidelberg, Germany, 2022; pp. 41–51.
- Mirsky, Y.; Demontis, A.; Kotak, J.; Shankar, R.; Gelei, D.; Yang, L.; Zhang, X.; Pintor, M.; Lee, W.; Elovici, Y.; et al. The threat of offensive AI to organizations. Comput. Secur. 2023, 124, 103006.
- Nguyen, A.; Yosinski, J.; Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 427–436.
- Ayub, M.A.; Johnson, W.A.; Talbert, D.A.; Siraj, A. Model evasion attack on intrusion detection systems using adversarial machine learning. In Proceedings of the 2020 54th Annual Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, 18–20 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
- Oprea, A.; Singhal, A.; Vassilev, A. Poisoning Attacks Against Machine Learning: Can Machine Learning Be Trustworthy? Computer 2022, 55, 94–99.
- Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Athalye, A.; Carlini, N.; Wagner, D. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Cambridge, MA, USA, 2018; pp. 274–283.
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199.
- Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018; pp. 99–112.
- Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083.
- Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 39–57.
- Sharif, M.; Bhagavatula, S.; Bauer, L.; Reiter, M.K. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016; pp. 1528–1540.
- Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing robust adversarial examples. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; PMLR: Cambridge, MA, USA, 2018; pp. 284–293.
- Bai, Y.; Wang, Y.; Zeng, Y.; Jiang, Y.; Xia, S.T. Query efficient black-box adversarial attack on deep neural networks. Pattern Recognit. 2023, 133, 109037.
- Williams, P.N.; Li, K. Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12291–12301.
- Lu, S.; Wang, M.; Wang, D.; Wei, X.; Xiao, S.; Wang, Z.; Han, N.; Wang, L. Black-box attacks against log anomaly detection with adversarial examples. Inf. Sci. 2023, 619, 249–262.
- Bertino, E.; Kantarcioglu, M.; Akcora, C.G.; Samtani, S.; Mittal, S.; Gupta, M. AI for Security and Security for AI. In Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy, Virtual Event, 26–28 April 2021; pp. 333–334.
- Papernot, N.; McDaniel, P.; Wu, X.; Jha, S.; Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In Proceedings of the 2016 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–26 May 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 582–597.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 10347–10357.
- Touvron, H.; Cord, M.; Jégou, H. DeiT III: Revenge of the ViT. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 516–533.
- Chen, G.; Choi, W.; Yu, X.; Han, T.; Chandraker, M. Learning efficient object detection models with knowledge distillation. Adv. Neural Inf. Process. Syst. 2017, 30.
- Chen, L.; Yu, C.; Chen, L. A new knowledge distillation for incremental object detection. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–7.
- Kang, Z.; Zhang, P.; Zhang, X.; Sun, J.; Zheng, N. Instance-conditional knowledge distillation for object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 16468–16480.
- Hahn, S.; Choi, H. Self-knowledge distillation in natural language processing. arXiv 2019, arXiv:1908.01851.
- Fu, H.; Zhou, S.; Yang, Q.; Tang, J.; Liu, G.; Liu, K.; Li, X. LRC-BERT: Latent-representation contrastive knowledge distillation for natural language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2021; Volume 35, pp. 12830–12838.
- Goldblum, M.; Fowl, L.; Feizi, S.; Goldstein, T. Adversarially robust distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3996–4003.
- Hong, I.P.; Choi, G.H.; Kim, P.K.; Choi, C. Security Verification Software Platform of Data-efficient Image Transformer Based on Fast Gradient Sign Method. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing, Tallinn, Estonia, 26–30 March 2023; pp. 1669–1672.
- Hong, I.; Choi, C. Knowledge distillation vulnerability of DeiT through CNN adversarial attack. Neural Comput. Appl. 2023, 1–11.
- Xie, C.; Wu, Y.; Maaten, L.v.d.; Yuille, A.L.; He, K. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 501–509.
- Li, X.; Li, F. Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5764–5772.
- Grosse, K.; Manoharan, P.; Papernot, N.; Backes, M.; McDaniel, P. On the (statistical) detection of adversarial examples. arXiv 2017, arXiv:1702.06280.
- Carlini, N.; Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, TX, USA, 3 November 2017; pp. 3–14.
- Wu, B.; Pan, H.; Shen, L.; Gu, J.; Zhao, S.; Li, Z.; Cai, D.; He, X.; Liu, W. Attacking adversarial attacks as a defense. arXiv 2021, arXiv:2106.04938.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf (accessed on 15 October 2024).
- Xiao, C.; Li, B.; Zhu, J.Y.; He, W.; Liu, M.; Song, D. Generating adversarial examples with adversarial networks. arXiv 2018, arXiv:1801.02610.
- Rauber, J.; Zimmermann, R.; Bethge, M.; Brendel, W. Foolbox Native: Fast adversarial attacks to benchmark the robustness of machine learning models in PyTorch, TensorFlow, and JAX. J. Open Source Softw. 2020, 5, 2607.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
| Model | Defense Model | Attack | F1 Score | Accuracy | Attack Success Rate (%) | Defense Success Rate (%) |
|---|---|---|---|---|---|---|
| ResNet18 [52] | - | False | 0.917 (0.010) | 0.918 (0.010) | 66.5 | 33.5 |
| ResNet18 [52] | - | True | 0.254 (0.019) | 0.253 (0.023) | | |
| ResNet18 [52] | DN [45] | False | 0.819 (0.07) | 0.819 (0.007) | 68.3 | 31.7 |
| ResNet18 [52] | DN [45] | True | 0.142 (0.016) | 0.136 (0.017) | | |
| ResNet18 [52] | ADT [20] | False | 0.766 (0.005) | 0.767 (0.004) | 28.2 | 71.8 |
| ResNet18 [52] | ADT [20] | True | 0.484 (0.008) | 0.484 (0.004) | | |
| ResNet18 [52] | DN + ADT [48] | False | 0.769 (0.004) | 0.770 (0.004) | 28.0 | 72.0 |
| ResNet18 [52] | DN + ADT [48] | True | 0.488 (0.007) | 0.491 (0.006) | | |
| ResNet18 [52] | KD [41] | False | 0.848 (0.004) | 0.848 (0.004) | 28.4 | 71.6 |
| ResNet18 [52] | KD [41] | True | 0.567 (0.009) | 0.564 (0.009) | | |
| DiNo-Net (Proposed) | KD + DN | False | 0.847 (0.002) | 0.847 (0.002) | 27.3 | 72.7 |
| DiNo-Net (Proposed) | KD + DN | True | 0.579 (0.004) | 0.575 (0.004) | | |
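As the table suggests, the two rate columns are complementary for each configuration, i.e., Defense Success Rate (%) = 100 − Attack Success Rate (%); for the proposed DiNo-Net, 100 − 27.3 = 72.7.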
| Model | Defense Method | Parameters | Learning Time (h:mm:ss) | Defense Success Rate (%) |
|---|---|---|---|---|
| ResNet18 [52] | - | 11.17 M | 1:06:22 | 33.5 |
| ResNet18 [52] | DN [45] | 11.24 M | 1:11:46 | 31.7 |
| ResNet18 [52] | ADT [20] | 11.17 M | 1:47:13 | 71.8 |
| ResNet18 [52] | DN + ADT [48] | 11.24 M | 1:47:40 | 72.0 |
| ResNet18 [52] | KD [41] | 11.17 M | 1:12:06 | 71.6 |
| DiNo-Net (Proposed) | KD + DN | 11.24 M | 1:12:33 | 72.7 |