Not So Robust After All: Evaluating the Robustness of Deep Neural Networks to Unseen Adversarial Attacks

Deep neural networks (DNNs) have gained prominence in various applications, such as classification, recognition, and prediction, prompting increased scrutiny of their properties. A fundamental attribute of traditional DNNs is their vulnerability to modifications in input data, which has resulted in the investigation of adversarial attacks. These attacks manipulate the data in order to mislead a DNN. This study aims to challenge the efficacy and generalization of contemporary defense mechanisms against adversarial attacks. Specifically, we explore the hypothesis proposed by Ilyas et al., which posits that DNN image features can be either robust or non-robust, with adversarial attacks targeting the latter. This hypothesis suggests that training a DNN on a dataset consisting solely of robust features should produce a model resistant to adversarial attacks. However, our experiments demonstrate that this is not universally true. To gain further insights into our findings, we analyze the impact of adversarial attack norms on DNN representations, focusing on samples subjected to $L_2$ and $L_{\infty}$ norm attacks. Further, we employ canonical correlation analysis, visualize the representations, and calculate the mean distance between these representations and various DNN decision boundaries. Our results reveal a significant difference between $L_2$ and $L_{\infty}$ norms, which could provide insights into the potential dangers posed by $L_{\infty}$ norm attacks, previously underestimated by the research community.


Introduction
The growth of computing power and data availability has led to the development of more efficient pattern recognition techniques, such as machine learning and its subclasses, neural networks and deep learning. When a sample accurately represents a data population and is of sufficient size, machine learning methods can yield impressive results, even on unseen data, making them suitable for tasks such as classification and prediction [14,21]. Although these methods offer significant advantages, they also come with limitations, such as the sensitivity of deep neural networks (DNNs) to data quality and source, and overfitting when trained on insufficient amounts of data. Researchers continue to develop new approaches and techniques to address these challenges and improve the overall effectiveness of pattern recognition [2,3,13,24]. Despite these attempts, variations in input samples from different domains or insufficient training data can still significantly impact DNNs' performance. As DNNs are increasingly being employed in critical applications such as medicine and transportation, enhancing their robustness is essential due to the potentially severe consequences of their unpredictability.
One approach to examining DNN robustness to diverse inputs is through the use of adversarial attacks for analysis or training. Adversarial attacks aim to generate the smallest adversarial perturbation, i.e., a change in input that results in a misclassification by the model. This study investigates the abilities and drawbacks of modern defense techniques against adversarial attacks, such as adversarial and robust training. The latter refers to the hypothesis proposed by Ilyas et al. [11], which suggests that adversarial attacks exploit non-robust features inherent to the dataset rather than the objects in the images. According to this hypothesis, removing these features from the dataset and training a model on the modified data should render adversarial attacks ineffective. We replicate the experiments from the original paper, training a model on robust features and testing it on unseen attacks. Our tests reveal that the models trained on robust features are generally not resistant to $L_{\infty}$ norm perturbations.
To explain this behavior, we compare the representations of adversarial examples using canonical correlation analysis (CCA) and discover that $L_{\infty}$ norm attacks cause the most dispersion in the latent representation. We visualize neural network representations under $L_2$ and $L_{\infty}$ norm attacks to illustrate the impact of norms on the distributions of representations. Additionally, we test the corresponding adversarially trained models used in robust training, which provides insight into the relationship between robust and adversarial training, suggesting that robust training is a specific case of adversarial training.
These insights may help researchers develop more robust neural networks through adversarial training: while $L_2$ and $L_{\infty}$ norm perturbations look similar, the $L_{\infty}$ norm ones are much harder to resist. To the best of our knowledge, the differences between attack norms have not previously received such attention.
The structure of this paper is as follows. Section 2 presents a brief overview of various attacks on image classifiers and discusses potential defense strategies that we use in our experiments. Section 3 explores current hypotheses regarding the nature of adversarial examples. Section 4 details the experimental methodology and the results. The conclusion of our study is presented in Section 5.

Related works
We focus on attacks on image classifiers as they are the most widespread and mature; however, adversarial attacks are not limited to a particular type of input or task. The reader may find examples of attacks in other domains, such as malicious URL classification [23], communication systems [15], time series classification [12], malware detection in PDF files [4], etc.
In this research, we suppose that an adversary has complete information about the neural network, including weights, gradients, and other internal details (white-box scenario). We also assume that in most cases the adversary's goal is simply to cause the classifier to produce an incorrect output, without specifying a particular target class (untargeted attack scenario). Other possible scenarios can be found, for example, in [6].
Fast gradient sign method (FGSM). FGSM was proposed by Goodfellow et al. in [10]. The adversarial example $x'$ for an image $x$ is calculated as:
$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right) \qquad (1)$$
where $\epsilon$ is the perturbation magnitude and $J$ is the cost function of the neural network with weights $\theta$, calculated for the input image $x$ with true classification label $y$.
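The FGSM update can be illustrated on a toy model. The following is a minimal sketch, assuming a logistic-regression classifier in place of a DNN; the weights, input, and helper names are hypothetical and chosen only so that the gradient with respect to the input has a closed form.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy J(theta, x, y) of a logistic-regression model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad_x(w, b, x, y):
    """Gradient of the loss with respect to the input: (p - y) * w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """One-step attack: x_adv = x + eps * sign(grad_x J(theta, x, y))."""
    g = grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

# Hypothetical model parameters and input (assumptions for illustration).
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, -0.3], 1
x_adv = fgsm(w, b, x, y, eps=0.1)
print(loss(w, b, x, y), loss(w, b, x_adv, y))  # the loss grows after the attack
```

Every coordinate moves by exactly $\epsilon$, so the perturbation sits on the surface of the $L_{\infty}$ ball of radius $\epsilon$.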
Projected gradient descent (PGD). The PGD attack, introduced by Madry et al. in [17], is an iterative variant of FGSM, carrying out an operation similar to (1) with projection onto the $\epsilon$-ball:
$$x^{t+1} = \pi_{x+S}\left(x^{t} + \alpha \cdot \mathrm{sign}\left(\nabla_x J(\theta, x^{t}, y)\right)\right)$$
where $\pi$ is the projection of an adversarial example onto the set of possible perturbations $S$, $\alpha$ is the step size, and $t$ is the iteration index.
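A minimal sketch of the iterative scheme for the $L_{\infty}$ case is given below, again assuming a toy logistic-regression model with hypothetical parameters. For simplicity the attack starts from the clean input; implementations often start from a random point inside the ball instead.

```python
import math

def grad_x(w, b, x, y):
    """Input gradient of the cross-entropy loss of a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]

def project_linf(x_adv, x, eps):
    """Projection onto the L-infinity ball: clip into [x_i - eps, x_i + eps]."""
    return [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]

def pgd_linf(w, b, x, y, eps, alpha, steps):
    """Repeat a signed-gradient step of size alpha, projecting after each step."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_x(w, b, x_adv, y)
        x_adv = [a + alpha * ((gi > 0) - (gi < 0)) for a, gi in zip(x_adv, g)]
        x_adv = project_linf(x_adv, x, eps)
    return x_adv

# Hypothetical model parameters and input (assumptions for illustration).
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, -0.3], 1
x_adv = pgd_linf(w, b, x, y, eps=0.1, alpha=0.04, steps=5)
print(x_adv)
```

With these settings the third step would leave the ball, and the projection pulls the iterate back onto its surface.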
Carlini-Wagner (C-W) attack. Carlini and Wagner [5] introduced a targeted attack which solves the adversarial optimization problem without using $\mathrm{sign}(\nabla)$. The authors derived a new optimization task for adversarial attacks which can be solved by a regular optimizer (for example, SGD) and does not constrain …
DeepFool. Moosavi-Dezfooli et al. [19] provided a simple iterative algorithm to perturb images to the closest wrong class. In other words, DeepFool approximates the orthogonal projection of a sample onto the classifier's decision boundary. This property allows the attack to be used to test the robustness of a model:
$$\rho_{adv}(f) = \frac{1}{|D|} \sum_{x \in D} \frac{\|r(x)\|_2}{\|x\|_2}$$
where $\rho_{adv}$ is the average robustness, $f$ is the classifier, $r$ is the successful perturbation from DeepFool, and $D$ is the dataset. DeepFool can be used to calculate the distance from a data point to the closest point on the decision boundary [18].
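For an affine binary classifier the projection onto the decision boundary has a closed form, which makes the average robustness easy to illustrate. The sketch below assumes such a classifier $f(x) = w \cdot x + b$ with hypothetical weights and data; for a DNN, DeepFool applies this step repeatedly to a local linearization.

```python
import math

def deepfool_linear(w, b, x):
    """Minimal L2 perturbation that moves x onto the boundary w.x + b = 0."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    return [-f / norm_sq * wi for wi in w]

def l2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def average_robustness(w, b, dataset):
    """Mean of ||r(x)||_2 / ||x||_2 over the dataset."""
    ratios = [l2(deepfool_linear(w, b, x)) / l2(x) for x in dataset]
    return sum(ratios) / len(ratios)

# Hypothetical classifier and samples (assumptions for illustration).
w, b = [3.0, 4.0], -5.0
data = [[3.0, 4.0], [0.0, 5.0]]
print(average_robustness(w, b, data))
```

The two samples need perturbations of length 4 and 3 respectively, and both have norm 5, so the ratios averaged here are 0.8 and 0.6.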

Adversarial training as a defense method
One of the most popular approaches to defending neural networks against attacks is called adversarial training. It proposes to "include" possible adversarial examples in the training data sets to prepare a model for the attacks.
The following min-max optimization problem [17] should be solved to obtain the adversarially trained model:
$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim D} \left[ \max_{\delta \in S} J(\theta, x + \delta, y) \right]$$
where $\theta$ are the weights of the neural network, $D$ is the training data set, and $S$ is the space of possible perturbations, $S = \{\delta : \|\delta\|_p < \epsilon\}$ for a given radius $\epsilon$.
Adversarial training was introduced by Goodfellow et al. in [10]. Madry et al. [17] proposed using a PGD attack during the training procedure and reported better robustness against adversarial attacks. However, Wong et al. [28] achieved about the same accuracy against adversarial attacks using simple one-step FGSM.
It is important to note that provable defense (i.e., "certified robustness") against any small-$\epsilon$ attack has already been studied, for example, by Wong and Kolter [27] and Wong et al. [29]. However, the experiments in these works were conducted with relatively small perturbations: in [29] the maximum radius of the $L_{\infty}$ perturbation is $2/255$, while the same norm in the adversarial training package [9] is $8/255$. Thus, we do not include certified robust methods in our research.
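The alternation between the inner maximisation and the outer minimisation can be sketched on a toy problem. The example below is an illustration under stated assumptions, not the training setup used in the experiments: it assumes a logistic-regression model, a hypothetical four-point data set, and a one-step signed-gradient attack as the inner solver. For a linear model this single step already maximises the loss over the $L_{\infty}$ ball.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Inner maximisation: one signed-gradient step of size eps."""
    err = predict(w, b, x) - y  # derivative of the loss w.r.t. the logit
    return [xi + eps * sign(err * wi) for xi, wi in zip(x, w)]

def adversarial_train(data, eps, lr=0.1, epochs=200):
    """Outer minimisation: SGD on the loss evaluated at adversarial inputs."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = fgsm(w, b, x, y, eps)
            err = predict(w, b, x_adv) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x_adv)]
            b -= lr * err
    return w, b

# Hypothetical, linearly separable toy data (assumption for illustration).
data = [([2.0, 0.0], 1), ([-2.0, 0.0], 0), ([1.5, 0.5], 1), ([-1.5, -0.5], 0)]
w, b = adversarial_train(data, eps=0.5)
print(w, b)
```

Replacing `fgsm` with a multi-step attack gives the PGD-based variant; the outer loop is unchanged.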

Hypothesis about the cause of adversarial attacks
The exact reason why neural networks are susceptible to small changes in input data remains unclear. Goodfellow et al. [10] argued that adversarial examples result from models being overly linear rather than nonlinear. However, another perspective considers poor generalization as the source of attacks. Ilyas et al. [11] hypothesized that neural networks' vulnerability to adversarial attacks arises from their data representation. Classifiers aim to extract useful features from data to minimize a cost function. Ideally, these features should be related to object classification (robust features), but neural networks may utilize unexpected properties specific to a particular dataset (non-robust features), which can be manipulated by an adversary. If a classifier can be trained on a dataset containing only robust features, it should be resistant to adversarial attacks. We refer to this process as robust training.
We decided to challenge this hypothesis for several reasons. First, it is not fully proven, except for the toy example in [11] and experiments on robust and non-robust dataset creation. Second, subsequent works such as [25] and [30] build on this hypothesis, although it might not be entirely accurate. For example, Zhang et al. [30] proposed a similar experiment, referring to [11], and extended it to universal perturbations. Although the results of [11] were discussed in [8], the accuracy of the robust trained model has not been tested anywhere but in the original work, and we would like to fill this gap.
By testing this hypothesis, we discover certain properties of adversarial attacks and further explore them through our experiments. We found not only cases where robust features can be corrupted, but also that $L_{\infty}$-norm attacks are more dangerous than those with the $L_2$ norm. We hypothesize that even if perturbations of these two norms are both imperceptible to humans, $L_{\infty}$-norm attacks have a greater impact on model representations.
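One geometric reason why the two budgets are not directly comparable can be shown with a short calculation. The sketch below assumes a CIFAR-10-sized input (32x32 pixels, 3 channels) and an $L_{\infty}$ budget of $8/255$; the function name is hypothetical. An $L_{\infty}$ perturbation may change every coordinate by $\epsilon$ at once, so its $L_2$ norm can reach $\epsilon \sqrt{d}$ for a $d$-dimensional input.

```python
import math

def max_l2_of_linf_ball(eps, num_coordinates):
    """Largest L2 norm of any perturbation whose L-infinity norm is at most eps."""
    return eps * math.sqrt(num_coordinates)

dims = 32 * 32 * 3   # coordinates of a CIFAR-10-sized image
eps_inf = 8 / 255    # L-infinity budget
print(max_l2_of_linf_ball(eps_inf, dims))
```

For these values the bound is roughly 1.74, so an $L_{\infty}$ ball of radius $8/255$ is not contained in any $L_2$ ball of smaller radius.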

Experiments
In this work, we establish the following Research Aims (RAs):
- (RA 1) Our research aims to explore the generalization of robust and adversarial training methods. Specifically, we investigate the classification accuracy of robust and adversarial trained models when subjected to various unseen attacks. Additionally, we aim to compare the behavior of these models under identical attack conditions and develop a methodology to measure such differences.
- (RA 2) Our study also seeks to identify potential shortcomings of robust trained models. We aim to determine the proximity of adversarial and benign samples in the latent space of a robust trained model, i.e., measure the distance between the representations. The degree of similarity between representations of samples with small $L_{\infty}$ or $L_2$ norms would indicate the stability of a model.
- (RA 3) Addressing the previous RA, we aim to investigate the impact of robust and adversarial training methods on the decision boundary of a model. We seek to determine how the mean distance between samples and decision boundaries varies for different models when exposed to adversarial attacks.
To address the research aims, we present a series of experiments. In the first experiment, we replicate robust training from [11], employing a broader testing setup that includes various attack norms, data sets, and model architectures. To compare the performance of adversarial and robustly trained models, we conduct the Kolmogorov-Smirnov test. Next, we analyze the representations of neural networks from the first experiment by singular value canonical correlation analysis (SVCCA) and perform principal component analysis (PCA) for their visualization. Finally, we assess the mean distance from samples to the decision boundaries of different models using the DeepFool attack. In the rest of the section, we describe in detail the experiment setups and the achieved results.

The broad testing of robust and adversarial training (RA 1)
Here and later, we refer to training based on empirical risk minimization as "regular" since it does not involve any adversarial attack and is traditionally used to train neural networks.
The motivation for these experiments stems from the work of Tramer et al. [26], which demonstrated that even adversarially trained models could be compromised by unseen attacks. The core premise of this experiment involves taking an adversarially trained deep model, trained to defend against a specific attack on a given dataset, utilizing it to develop the robust version of this model, and then evaluating the generalisation performance of both models against unseen attacks. This process is undertaken as follows:
1. We take every image within the dataset and use the process outlined in [11] to extract an image that only possesses robust features. The adversarially trained model aids in this extraction. Initially, we generate a random noise image for each target image. Then, we iteratively compute the representations of both the random and target images as interpreted by the pre-trained model. At each iteration, we slightly adjust the random image to minimize the distance between the representation vectors of the two images. After a set number of steps, this method produces a modified dataset.
2. We then train the regular version of our selected adversarially trained model on this modified dataset, ultimately resulting in a robustly trained model.
3. Subsequently, we test the generalisation capacity of both the adversarial and robust models against attacks which were not considered during the adversarial training phase and hence were not used in the formation of the robust model.
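The extraction in the first step can be sketched as gradient descent on the distance between two representations. The example below is a minimal illustration under strong assumptions: a fixed linear map `W` stands in for the representation layer of the adversarially trained model, the target "image" has three coordinates, and pixel-range clipping is omitted. All names and values are hypothetical.

```python
import random

def represent(W, x):
    """Hypothetical stand-in for the model's representation: g(x) = W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def rep_distance_sq(W, a, b):
    """Squared Euclidean distance between the representations of a and b."""
    return sum((u - v) ** 2 for u, v in zip(represent(W, a), represent(W, b)))

def robustify(W, x_target, steps=200, lr=0.05, seed=0):
    """Adjust a random image until its representation matches the target's."""
    rng = random.Random(seed)
    x_r = [rng.random() for _ in x_target]   # start from random noise
    g_target = represent(W, x_target)
    for _ in range(steps):
        diff = [u - v for u, v in zip(represent(W, x_r), g_target)]
        # Gradient of ||W x_r - W x_target||^2 w.r.t. x_r is 2 W^T diff.
        grad = [2 * sum(W[i][j] * diff[i] for i in range(len(W)))
                for j in range(len(x_r))]
        x_r = [xj - lr * gj for xj, gj in zip(x_r, grad)]
    return x_r

W = [[1.0, 0.5, 0.0], [0.0, 1.0, -0.5]]   # hypothetical representation map
x_target = [0.2, 0.7, 0.4]                # hypothetical target image
x_r = robustify(W, x_target)
print(rep_distance_sq(W, x_r, x_target))
```

The resulting input matches the target in representation space without having to equal it in input space, which is what lets the modified dataset keep only the features the model uses.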
This entire experiment is implemented for two distinct models: ResNet50 and InceptionV3. For ResNet50, we create both the adversarially trained and robust versions, targeting the CIFAR-10 [16] dataset and defending against the PGD attack with $L_2$ and $L_{\infty}$ norms. We then test their performance against FGSM ($L_1$, $L_2$, $L_{\infty}$ norms), PGD ($L_1$, $L_2$, $L_{\infty}$ norms), C-W ($L_2$ norm), and DeepFool ($L_2$ norm) attacks. The performance results of the $L_2$-trained model are outlined in Table 1, and those of the $L_{\infty}$-trained model are presented in Table 2.
The adversarially trained and robust versions of InceptionV3, on the other hand, were trained only against the PGD attack with the $L_{\infty}$ norm for the CIFAR-10 dataset. The testing parameters for these models were akin to those used with ResNet50, and the results are displayed in Table 3. All tables list the corresponding values of $\epsilon$ and the number of steps for iterative attacks used in each case. Note that the values of $\epsilon$ differ from those used for ResNet50 because InceptionV3 has a larger input shape (224x224 vs. 32x32, respectively).
On examination, it is apparent that both tested models demonstrate reasonable stability against some variations of PGD and FGSM attacks. However, all attacks with an $L_{\infty}$ norm significantly undermined the accuracy of the models. Hence, the models do not generalise to all attack types: a simple change of norm or increase in perturbation budget has a drastic impact. Despite this, the models did demonstrate resistance to certain unseen attacks, particularly those with an $L_1$ norm, exhibiting only a minor drop in accuracy.
The accuracy of adversarially and robustly trained models might vary, but a similar ratio between them implies similar behavior. We evaluate this using the Kolmogorov-Smirnov goodness-of-fit test. The null hypothesis $H_0$, tested at a significance level of 0.05, asserts that the samples, i.e., the accuracies of robust and adversarial trained models under different attacks, are from the same distribution. The test for the accuracies presented in Table 1 yields a p-value of 0.42, so we cannot reject $H_0$. Therefore, robust training may be considered a specific case of adversarial training, with its strengths and weaknesses.
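The statistic behind the two-sample Kolmogorov-Smirnov test is the largest gap between the two empirical distribution functions. The sketch below computes only this statistic, on hypothetical accuracy values that are not taken from the tables; in practice a library routine such as `scipy.stats.ks_2samp` also returns the p-value.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    d = 0.0
    for v in list(a) + list(b):
        cdf_a = sum(1 for u in a if u <= v) / len(a)
        cdf_b = sum(1 for u in b if u <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical accuracies under four attacks (illustrative values only).
robust_acc = [0.80, 0.60, 0.10, 0.75]
adv_acc = [0.85, 0.65, 0.05, 0.70]
print(ks_statistic(robust_acc, adv_acc))
```

A small statistic, and hence a large p-value, means the data give no grounds to reject the hypothesis that both accuracy samples come from the same distribution.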
To further extend this experiment, we manage the entire adversarial training pipeline ourselves, performing the process from scratch. Owing to computational constraints, we select the more manageable ResNet18 architecture. The models are trained on the PGD attack with 5 iterations and $L_2$ and $L_{\infty}$ norms. For datasets, we utilize CIFAR-10 (the results are displayed in Table 4) and CINIC-10 [7] (refer to Table 5), and for test attacks, we use PGD and FGSM. Consistent with the previous pattern, attacks with the $L_{\infty}$ norm result in a more significant drop in accuracy than similar ones with the $L_2$ norm.

Comparison of representations under adversarial attacks (RA 2)
Canonical correlation analysis (CCA) is a method utilized to compare the representations of neural networks. Its objective is to identify linear combinations of two sets of random variables that maximize their correlation. In previous studies [20], [1], CCA has been employed to compare activations from different layers of neural networks. Raghu et al. [22] proposed an extension of CCA, singular value canonical correlation analysis (SVCCA), for the analysis of neural networks.

In this study, we employ SVCCA to compare the representations of the original and corresponding adversarial images. For each experiment, we take a batch of 128 images, compute the related adversarial examples under some attack, calculate SVCCA for the representations, and take the mean. We test three models in each experiment: a regular ResNet50 and two adversarial ResNet50 models trained with PGD using $L_2$ and $L_{\infty}$ norms. In this experiment, a high mean correlation coefficient indicates that the representations are similar to each other and that small perturbations do not impact the model.
While the attack norms are the primary variables in these experiments, we also test two different attacks (FGSM and PGD) to mitigate threats to validity. The means of SVCCA for the attacks are presented in Table 6, while Table 7 shows the same for the best and worst cases from Tables 1 and 2.
Overall, the results presented in Tables 6 and 7 demonstrate that the impact of adversarial attacks on models is most significant when using $L_{\infty}$-norm perturbations. This effect is evident even in models that have been trained specifically to handle $L_{\infty}$-norm attacks. These findings emphasize the importance of considering the norms of adversarial attacks when evaluating the robustness of models because testing on the $L_2$ norm alone may provide a false sense of security.

Visualization of representations (RA 2)
The visualization of representations presents a challenge due to the high dimensionality of representation vectors. Nonetheless, such visualization can be useful for analysis purposes. To address this issue, we employ Principal Component Analysis (PCA) to reduce the dimensionality of representations from 512 to 2. We limit our experimentation to ResNet18 due to computational constraints.

We present the visualization of representations for different combinations of models and norms of PGD attacks in Figures 1-5. Figure 1 depicts the representations of samples from CIFAR-10 for a regularly trained ResNet18, which serves as a baseline case. In Figures 2 and 3, we visualize the data as adversarial samples with $L_2$ and $L_{\infty}$ norms of attack, respectively. Furthermore, we examine the representations of adversarial samples for adversarially trained models in Figures 4 and 5. To challenge the models, we use attack norms that differ from those used in training: the model trained on PGD with the $L_2$ norm is tested on the $L_{\infty}$-norm PGD attack (Figure 5), and vice versa (Figure 4). We group the representations of different classes by color in all figures to show how the representations of different classes intermingle under adversarial attacks.
The representations of the regularly trained network are clustered according to the classes in the data set, as depicted in Figure 1. However, when subjected to adversarial attacks, all representations become heavily mixed, resulting in a more challenging classification task due to overlapping representations of different classes. The most mixed representations were observed on the regularly trained network under the $L_{\infty}$-norm PGD attack (Figure 3). In contrast, adversarial training, as shown in Figures 4 and 5, resulted in some class representations (e.g., "Automobile" or "Truck") remaining clustered while becoming closer to each other than those without attacks. The same pattern was observed for the $L_2$-norm attack (Figure 3).

However, the impact of these norms on representations is not immediately apparent from a plain accuracy analysis. Indeed, the visualization of the distribution of representations yielded intriguing patterns that warrant further investigation.

Distance to decision boundary (RA 3)
This study seeks to investigate the impact of adversarial training on the decision boundaries of models. Specifically, the mean distance between samples and the decision boundary is examined to determine how it differs for adversarially trained models. We use the idea of Mickisch et al. [18], who utilized the DeepFool attack to measure this distance.
The decision boundary is defined as the set of input images where two or more classes share the same maximum probability, indicating that the classifier is uncertain about the class of the image. Table 8 displays the mean distance between the decision boundaries of the models and images from CIFAR-10, calculated using both $L_2$ and $L_{\infty}$ distances. The "steps" column represents the mean number of DeepFool iterations spent during attack generation. The comparison is made between the models from Tables 1, 2, and 4.
The results indicate that the mean distance to the decision boundary for ResNet18 models (for all types of training) is relatively small, especially for the $L_{\infty}$ norm. A small distance from a sample to the decision boundary implies that it is easier to misclassify this sample because it does not require a significant perturbation. Interestingly, the distance for the $L_{\infty}$-norm adversarially trained model is actually smaller than the training $\epsilon$. However, testing of this model under the PGD attack (Table 4) suggests that it has some robustness. Therefore, the reliability of the popular method of model testing used in this study is called into question.
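Part of the gap between the two distances is a property of the norms themselves, which can be seen on an affine boundary. The sketch below assumes a hypothetical binary classifier $f(x) = w \cdot x + b$: the smallest $L_2$ perturbation reaching the boundary has length $|f(x)| / \|w\|_2$, while the smallest $L_{\infty}$ perturbation has size $|f(x)| / \|w\|_1$, which is never larger.

```python
import math

def boundary_distances(w, b, x):
    """Distance from x to the hyperplane w.x + b = 0 in L2 and L-infinity."""
    f = abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    d_l2 = f / math.sqrt(sum(wi * wi for wi in w))   # dual norm of L2 is L2
    d_linf = f / sum(abs(wi) for wi in w)            # dual norm of L-inf is L1
    return d_l2, d_linf

# Hypothetical classifier and sample (assumptions for illustration).
w, b = [3.0, 4.0], -5.0
x = [3.0, 4.0]
d_l2, d_linf = boundary_distances(w, b, x)
print(d_l2, d_linf)
```

Because $\|w\|_1$ can exceed $\|w\|_2$ by a factor of up to $\sqrt{d}$ in $d$ dimensions, $L_{\infty}$ distances on image-sized inputs can be numerically much smaller than $L_2$ distances to the same boundary.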

Conclusion
The experiments demonstrate that a model trained on a "robust" data set is still vulnerable to some attacks; thus, adversarial attacks do not compromise only non-robust features. Therefore, the robust features do not generalize well, especially under the $L_{\infty}$ norm of attack. We assume that the small difference between clean and adversarial inputs for the $L_{\infty}$ attack leads to a huge gap in latent space between them; SVCCA analysis of different attack representations confirms this assumption. Our visualization of neural network representations also displays the difference between $L_2$ and $L_{\infty}$ norms of attack. In light of these results, we recommend that researchers in the field of adversarial attacks and defense mechanisms pay closer attention to $L_{\infty}$-norm attacks to avoid a false sense of security. It is crucial to consider this norm in their tests to ensure the robustness of models against potential attacks.
The distance of the sample $x$ to the decision boundary $D$ can be measured as: $d(x) = \min_{\epsilon} \|\epsilon\|_p \ \text{s.t.} \ x + \epsilon \in D$. The pretrained ResNet50 model is used with 20 steps of PGD attack during training, while the ResNet18 models are trained under PGD with only 5 steps. The testing dataset is CIFAR-10. The outcomes of the ResNet18 and ResNet50 models with different training configurations are presented in Table 8.

Table 1 :
Robust ResNet50 trained on $L_2$ data set and related adversarial model. The Epsilon column stands for the perturbation budget, Norm for the way this budget is measured, Steps for the maximum number of iterations for adversarial example creation, Robust acc. for the accuracy of the robust trained model, and Adv. acc. for the accuracy of the corresponding adversarial trained model.

Resistance to $L_1$ norm attacks is crucial for these models in terms of generalisation, given they were not exposed to such an attack during training.

Table 2 :
Robust ResNet50 trained on $L_{\infty}$ data set and related adversarial model

Table 3 :
Robust InceptionV3 trained on $L_{\infty}$ data set

Table 4 :
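Accuracy of ResNet18-s trained on the CIFAR-10 data set. [Table body lost in extraction; surviving fragments: "trained, $L_{\infty}$ norm", "$L_{\infty}$ attack", "89 %", "$L_{\infty}$ attack", "87 %", "$L_{\infty}$ attack".]

Both $L_{\infty}$ and $L_2$ attacks demonstrate a marked decrease in accuracy, as evidenced by the aforementioned experiments (Tables 4 and 5).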

Table 5 :
Accuracy of ResNet18-s trained on the CINIC-10 data set. [Table body lost in extraction; surviving fragments: "$L_{\infty}$ attack", "87 %", "$L_{\infty}$ attack", "85 %", "$L_{\infty}$ attack".]

where $c$ is the number of classes. Under this definition, the use of the DeepFool attack looks natural because it aims to find a perturbation to the closest wrong class.

Table 7 :
Mean correlation coefficient for the best and worst cases in Tables 1 and 2

Table 8 :
Mean distance of samples to decision boundary