Clustering Approach for Detecting Multiple Types of Adversarial Examples

By adding intentional feature perturbations to input data, an adversary generates adversarial examples that deceive a deep learning model. Since adversarial examples have recently been regarded as one of the most severe problems in deep learning technology, defense methods against them have been actively studied. Effective defense methods against adversarial examples fall into one of three architectures: (1) model retraining architecture; (2) input transformation architecture; and (3) adversarial example detection architecture. Among them, defense methods using the adversarial example detection architecture have drawn particular attention because, unlike the other two, they do not make wrong decisions on legitimate input data. In this paper, we note that current defense methods using the adversarial example detection architecture can classify the input data only as either legitimate or adversarial. That is, they can detect adversarial examples but cannot classify the input data into multiple classes of data, i.e., legitimate input data and various types of adversarial examples. To classify the input data into multiple classes while increasing the accuracy of the clustering model, we propose an advanced defense method using the adversarial example detection architecture, which extracts key features from the input data and feeds the extracted features into a clustering model. From experimental results on various application datasets, we show that the proposed method can detect adversarial examples while also classifying their types. We also show that the accuracy of the proposed method outperforms that of recent defense methods using the adversarial example detection architecture.


Introduction
As a core part of current real-world applications, deep learning technology has been widely applied not only in various general fields but also in security-sensitive fields such as self-driving cars [1], malware classification [2], and face recognition systems [3]. However, many studies have shown that deep learning technology is very vulnerable to adversarial examples, which deceive a deep learning model by adding human-imperceptible perturbations to the input data. As a representative example, Hussain et al. recently showed at WACV 2021 that even deep learning-based defense methods, such as DeepFake detectors, can be bypassed by adversarial examples [4].
To defend against adversarial examples, many defense methods have been proposed. Based on the type of defense architecture, such defense methods are commonly classified into three categories, i.e., model retraining architecture, input transformation architecture, and adversarial example detection architecture. The model retraining architecture makes the deep learning model more robust before adversaries generate adversarial examples, by retraining the deep learning model or training new models [5][6][7]. The model retraining architecture is known to be effective against adversarial examples with large perturbations, such as the Fast Gradient Sign Method (FGSM) [8] and the Basic Iterative Method (BIM) [9]. On the other hand, the input transformation architecture reduces the perturbation of adversarial examples by transforming the input data before feeding it into the deep learning model [10][11][12]. The input transformation architecture can provide robustness with low memory usage and computation cost against adversarial examples with small perturbations, such as DeepFool [13] and C&W [14]. Even though previous defense methods belonging to these two architectures provide good robustness against adversarial examples, their classification accuracy decreases when predicting legitimate input data because both architectures affect the legitimate input data or the deep learning model. Specifically, for defense methods using the model retraining architecture, the classification accuracy of the deep learning model decreases because they train or retrain the model with a noise-containing dataset. For defense methods using the input transformation architecture, the classification accuracy of the deep learning model decreases because they transform not only adversarial examples but also legitimate input data.
To address this problem, defense methods using adversarial example detection architecture have been actively studied [15][16][17][18]. As shown in Figure 1a, these methods classify the input data as either legitimate or adversarial. To achieve this, they compute the probability that the input data is an adversarial example, or compare the distributions of adversarial examples and legitimate input data. However, they do not classify the input data into multiple classes of data, i.e., legitimate input data and various types of adversarial examples. Classifying the input data into multiple classes is important for researchers and security experts who want to analyze the characteristics of each adversarial example. Based on these characteristics, they can establish an appropriate defense method or design a new defense method for unknown adversarial examples. In this paper, to overcome this limitation, we propose an advanced defense method using adversarial example detection architecture. Different from the previous detection-based defense methods, the proposed method can classify the types of adversarial examples, which means that the proposed method can be utilized as an analysis tool for studying the characteristics of adversarial examples. The high-level operation of the proposed method is as follows: first, given an input data (a legitimate input data or an adversarial example), the proposed method extracts adversarial perturbations using denoising techniques, such as a binary filter and a median smoothing filter. Second, the proposed method reduces the dimensionality of the extracted adversarial perturbations. Finally, the proposed method uses a clustering model as a detector.
Specifically, the clustering model classifies the reduced features of input data into multiple classes, i.e., legitimate input data and various types of adversarial examples.
From the experimental results under various datasets collected from diverse applications, we show that only the proposed method can classify the types of the adversarial examples. We also show that the accuracy of the proposed method outperforms the accuracy of recent defense methods using adversarial example detection architecture, e.g., Carrara et al.'s method [15] and Feature Squeezing [16]. Such results can be actively used to design new defense architecture, analyze the characteristics of adversarial examples, and improve the existing defense methods. For example, we can analyze the characteristics of adversarial examples from the clustering results and selectively apply the defense methods suitable for each characteristic.
The contributions are summarized as follows: • We propose a novel defense method using adversarial example detection architecture. Different from the current defense methods using adversarial example detection architecture, which can classify the input data into only either legitimate one or adversarial one, the proposed method classifies the input data into multiple classes of data; • To the best of our knowledge, the existing defense methods including model retraining architecture, input transformation architecture and adversarial example detection architecture use the adversarial example itself for analysis or detection. Thus, this is the first work which approximates the adversarial perturbation of each adversarial example and uses it to detect adversarial examples; • From analysis results under various adversarial examples and application datasets, we show that the proposed method provides better accuracy than recent defense methods using adversarial example detection architecture, e.g., Carrara et al.'s method [15] and Feature Squeezing [16].
The rest of the paper is organized as follows. In Section 2, we overview well-known adversarial examples and recent defense methods using adversarial example detection architecture. In Section 3, we describe the overall operation and the details of the proposed method. In Section 4, we verify the effectiveness of the proposed method from various experimental results under different adversarial examples, different datasets, and so on. In Section 5, we discuss the limitations and directions for future work. Finally, we conclude this paper in Section 6.

Adversarial Examples
In this section, we summarize the characteristics of five well-known adversarial examples, which are used for experimental evaluation in this paper [5,8,9,13,19]. The Fast Gradient Sign Method (FGSM), introduced by Goodfellow et al., generates an adversarial example with a single update along the sign of the loss gradient [8]. The Basic Iterative Method (BIM) was introduced by Kurakin and Goodfellow [9]. Different from FGSM, which performs only one gradient update, BIM performs iterative gradient updates for fine optimization. Projected Gradient Descent (PGD) is another iterative method introduced by Madry et al. [5]. To perform fine optimization efficiently, PGD performs iterative gradient updates from a randomly selected initial point.
To generate adversarial examples with minimal perturbations, Moosavi-Dezfooli et al. introduced DeepFool, which performs an iterative linearization of the deep learning model [13]. In each iteration, DeepFool finds the nearest decision boundary from an input X and updates the adversarial perturbation to reach that decision boundary. To increase the attack success rate while generating adversarial examples with minimal perturbations, Carlini and Wagner introduced three adversarial examples based on different distance metrics, i.e., the L0, L2, and L∞ norms [19]. In this paper, we consider the L2 type of C&W as a representative method because it is the most frequently mentioned in other works [14].
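As a concrete illustration of the single-step update that FGSM performs, consider the following minimal NumPy sketch. The gradient here is a toy stand-in for the true loss gradient, and the variable names are ours, not from the original attack implementations:

```python
import numpy as np

def fgsm_perturb(x, grad, eps=0.1):
    """Single-step FGSM update: shift each feature by eps in the
    sign direction of the loss gradient, then clip back to [0, 1]."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)

x = np.array([0.2, 0.5, 0.9])       # toy input features
grad = np.array([0.3, -0.7, 0.0])   # toy loss gradient w.r.t. x
x_adv = fgsm_perturb(x, grad, eps=0.1)
```

Iterative variants such as BIM and PGD repeat this update with a smaller step size, with PGD additionally starting from a random point and projecting the result back into the allowed perturbation region.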

Defense Methods Using Adversarial Example Detection Architecture
In this section, we summarize some recent defense methods using adversarial example detection architecture.
Dathathri et al. [20] proposed a signature-based detection method against adversarial examples. To detect adversarial examples, they generated a NeuralFingerprinting (NFP) signature from the input data and checked the behavioral consistency of the target model with respect to the NFP. Carrara et al. [15] proposed a detection method based on the internal features of the deep learning model. After reducing the dimensionality of the features, they generated sequential vectors using a distance metric-based embedding technique and trained an LSTM-based detector. Wang et al. [18] proposed a model-agnostic approach to detect adversarial examples. They argued that there is an intrinsic difference in the logit semantic distribution between legitimate input data and adversarial examples. To capture differences in the distribution of logit sequences, they used an LSTM network as a detector. We summarize the characteristics of representative defense methods using adversarial example detection architecture in Table 1 for comparison. Table 1. Summary of the recent defense methods using adversarial example detection architecture. Most of the methods cannot classify the input data into multiple classes of data.

Reference | Binary Classification | Multi-Label Classification
Signature-based detector, S. Dathathri et al. [20] | Yes | No

Proposed Method
In this section, we overview the operation of the proposed method and describe each of its steps in detail, supported by specific examples.

Overall Operation
To detect adversarial examples while classifying the types of adversarial examples, the proposed method follows three steps: (1) Adversarial Perturbation Extraction; (2) Dimensionality Reduction; and (3) Clustering.
In Figure 2, a high-level illustration of the proposed method is given. In the Adversarial Perturbation Extraction step, the proposed method extracts the perturbations added to the input data by the adversary. Since it is not known whether the input data has been attacked, or which adversarial example was used, the proposed method first applies denoising techniques to the input data to obtain a denoised input. Then, the proposed method calculates the perturbations as the difference between the input data and the denoised input. In the Dimensionality Reduction step, the proposed method extracts the key features from the perturbations, which were extracted in the Adversarial Perturbation Extraction step, by reducing their dimensionality. In the Clustering step, the proposed method feeds the reduced features into a clustering model, which classifies them into multiple classes, i.e., legitimate input data and various types of adversarial examples.
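The three steps above can be sketched as a simple pipeline. This is a schematic sketch only: `denoise`, `reduce_dim`, and `cluster` are placeholders for the concrete components (binary/median filter, PCA/LDA, k-means/DBSCAN) described in the following subsections:

```python
def detect(x, denoise, reduce_dim, cluster):
    """Three-step pipeline of the proposed method (sketch):
    (1) approximate the adversarial perturbation as the difference
        between the input and its denoised version,
    (2) reduce the perturbation to a low-dimensional feature vector,
    (3) assign the features to a cluster (legitimate or attack type)."""
    perturbation = x - denoise(x)        # Step 1: perturbation extraction
    features = reduce_dim(perturbation)  # Step 2: dimensionality reduction
    return cluster(features)             # Step 3: clustering
```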

Adversarial Perturbation Extraction
Different from the previous defense methods using adversarial example detection architecture, which use the adversarial examples themselves for detection, the proposed method uses the adversarial perturbation that was added to the input data by the adversary. However, it is impossible to obtain the adversarial perturbation directly from the adversarial example because it is not known whether the input data has been attacked or which adversarial example was performed on it. Thus, the adversarial perturbation is approximated by using two denoising techniques, the binary filter and the median smoothing filter, which have shown good performance in other works [14,16].
For a gray-scale image dataset such as the MNIST dataset [27], the proposed method approximates the adversarial perturbation by using the binary filter, which can reduce redundancy while keeping the key features of gray-scale images. Since the binary filter perceives the adversarial perturbation as redundancy, it can remove the adversarial perturbation well. Specifically, given an input data X with pixel values scaled to [0, 1], the binary filter reduces the bit depth of X to an i-bit depth (1 ≤ i ≤ 7) as follows:

X_f = round(X · (2^i − 1)) / (2^i − 1),

where round(·) is a round function, which rounds to the nearest integer. We set i of the binary filter to 1, which showed the best performance for the MNIST dataset [16,28]. Some examples of the binary filter for the MNIST dataset under various adversarial examples are shown in Figure 3.
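The bit-depth reduction described above can be sketched in a few lines of NumPy (a minimal sketch assuming pixel values already scaled to [0, 1]):

```python
import numpy as np

def binary_filter(x, i=1):
    """Reduce the bit depth of x (pixel values in [0, 1]) to i bits;
    i = 1 binarizes the image, as used for the MNIST dataset."""
    levels = 2 ** i - 1
    return np.round(x * levels) / levels

pixels = np.array([0.1, 0.49, 0.51, 0.9])
squeezed = binary_filter(pixels, i=1)   # each pixel snaps to 0 or 1
```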
For a color image dataset such as the CIFAR-10 dataset [29], the proposed method approximates the adversarial perturbation by the median smoothing filter because it can remove the adversarial perturbation well while preserving the sharpness of edges in color images. To remove the adversarial perturbation, the median smoothing filter reduces the variation among pixels. Specifically, given an input data X, the filtered input X_f can be obtained as follows:

X_f[u, v] = median{X[u − k, v − k], ..., X[u + k, v + k]},

where [u, v] is the position in the input data X and median{·} is a median function, which returns the median value of the neighboring pixels within the filter window. After getting the filtered input from the two denoising techniques, the adversarial perturbation is approximated by an arithmetic subtraction between the input data and the filtered input. By using the extracted adversarial perturbations rather than the adversarial examples themselves, the proposed method can consider only the key features of the adversarial examples, without considering any key features of the legitimate input data.
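The median smoothing and subtraction steps can be sketched as follows. This is a NumPy-only sketch: `k` sets the neighborhood radius, and edge pixels are handled by replication, which is one of several reasonable border policies and not necessarily the one used in the paper's experiments:

```python
import numpy as np

def median_smooth(x, k=1):
    """Median filter over a (2k+1) x (2k+1) neighborhood,
    with edge replication at the image borders."""
    pad = np.pad(x, k, mode='edge')
    out = np.empty_like(x, dtype=float)
    for u in range(x.shape[0]):
        for v in range(x.shape[1]):
            out[u, v] = np.median(pad[u:u + 2 * k + 1, v:v + 2 * k + 1])
    return out

def extract_perturbation(x, k=1):
    """Approximate the adversarial perturbation as the difference
    between the input and its denoised (median-smoothed) version."""
    return x - median_smooth(x, k)
```

For a flat image with a single perturbed pixel, the extracted perturbation is concentrated at that pixel, while the unperturbed regions cancel out to zero.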

Dimensionality Reduction
Using the extracted perturbation directly to classify the types of adversarial examples is not efficient because the perturbation contains unnecessary values. For example, as shown in Figure 4, most pixels of the extracted adversarial perturbation are black, which means that such pixels do not affect the classification of the types of adversarial examples. Thus, the proposed method performs dimensionality reduction to extract the key features from the adversarial perturbation. Specifically, the proposed method uses two dimensionality reduction methods, i.e., Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA).
PCA is an unsupervised dimensionality reduction method which maximizes the variance of the data without considering class separation. By finding a few orthogonal linear combinations of the multivariate data with the largest variance, called Principal Components, PCA can reduce the dimension of the data. As one of the most commonly used dimensionality reduction methods in various fields, PCA has shown good feature extraction performance even for image data.
On the other hand, LDA is a supervised dimensionality reduction method which maximizes the variance between samples in different classes while minimizing the variance between samples in the same class. Different from PCA, which deals with the data in its entirety and searches for vectors that best describe the data, LDA deals directly with discrimination between classes and searches for vectors that best separate the classes. LDA can thus reduce the dimensionality with minimal loss of class-discriminative information. The detailed impact of each dimensionality reduction method on clustering performance is described in Section 4.2.2.
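The contrast between the two methods can be illustrated with scikit-learn. This is a toy sketch: the synthetic Gaussian features stand in for extracted perturbations of three hypothetical attack types and are not real attack data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# toy perturbation features: 3 "attack types", 40 samples each, 10-D
X = np.vstack([rng.normal(loc=m, scale=0.3, size=(40, 10))
               for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 40)

# PCA ignores y (unsupervised); LDA uses y to separate the classes
X_pca = PCA(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y).transform(X)
```

Note that LDA can produce at most (number of classes − 1) components, which is why reducing six classes (legitimate data plus five attack types) to a 5-dimensional feature vector, as done later in the paper, is a natural choice.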

Clustering
The goal of the proposed method is not only to detect adversarial examples but also to classify the types of adversarial examples. To achieve this, the proposed method uses a clustering model as a detector. When the extracted features F and the ground truth label l are given, the objective of the detector is to find a clustering model C(·) whose cluster assignments C(F) agree with l. In this paper, the proposed method uses two clustering models, i.e., the k-means algorithm and the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm.
The k-means algorithm is one of the simplest and most well-known clustering methods and has been used in a variety of application domains. To find clusters, the k-means algorithm groups data that are similar to each other in the entire dataset. The pseudo code of the k-means algorithm is illustrated in Algorithm 1. The k-means algorithm requires a parameter k, which represents the number of output clusters. k can be set according to the number of adversarial example types the defender wants to classify.

Algorithm 1 Pseudo code of k-means algorithm
Input: set of input data D_X, number of desired clusters k
Output: set of k clusters C
1 Randomly select k input data in D_X as initial centroids
2 while convergence criteria have not been met do
3     Assign each input data to the cluster with the closest centroid
4     Calculate the new mean for each cluster

The DBSCAN algorithm is a well-known density-based clustering method which is very effective at finding arbitrarily shaped clusters. As shown in Algorithm 2, the DBSCAN algorithm finds clusters by calculating the density of each data point with its neighboring data. The DBSCAN algorithm requires two parameters: (1) epsilon (P_eps), which represents how close points should be to each other to be contained in the same cluster; and (2) the minimum number of points (P_MinPts), which represents the minimum number of samples to be contained in one cluster. Different from the k-means algorithm, the DBSCAN algorithm does not need a parameter for the number of output clusters. In other words, the DBSCAN algorithm can learn the number of clusters automatically.
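Algorithm 1 can be sketched in NumPy as follows. To keep this toy version robust, the purely random initialization is replaced by a farthest-first variant that avoids empty clusters; this is our simplification, not the exact initialization described above:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means following Algorithm 1: choose initial
    centroids, then alternate assignment and mean update."""
    rng = np.random.default_rng(seed)
    # farthest-first initialization instead of k purely random picks
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(iters):
        # assign each input to the cluster with the closest centroid
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # calculate the new mean for each cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):   # convergence criterion
            break
        centroids = new
    return labels, centroids
```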

Algorithm 2 Pseudo code of the DBSCAN algorithm
Input: set of input data D_X, radius of data points P_eps, minimum number of data points to form a cluster P_MinPts
Output: set of clusters C
1 Randomly select an input data D in D_X
2 while ∀ input data D ∈ D_X do
3     Retrieve all points density-reachable from D with respect to P_eps and P_MinPts
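A minimal scikit-learn sketch of the DBSCAN step follows. The toy two-blob data and the eps and min_samples values are illustrative only, not the tuned values used in the experiments:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (30, 2)),   # dense blob A
               rng.normal(5.0, 0.2, (30, 2))])  # dense blob B
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_clusters = len(set(labels) - {-1})   # label -1 marks noise points
```

Unlike k-means, the number of clusters is inferred from the data here, which is why DBSCAN can in principle surface a previously unseen attack type as a new cluster.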

Operational Example
In this section, we show an example of how the proposed method works for the CIFAR-10 test dataset. Here, let us consider that the target model is the ResNet-34 model [30], and that the proposed method uses LDA and the DBSCAN algorithm as the dimensionality reduction method and the clustering model, respectively. Let us also consider a PGD attack applied to a horse sample. As shown in Figure 5, when the attacked horse sample is given as input data, the proposed method extracts the perturbation from the input data. Here, it is observed that most of the pixels have a value of 0 (black) and only a few pixels have non-zero values. Next, the proposed method extracts key features from the extracted adversarial perturbation by reducing the feature space. Specifically, the proposed method reduces the extracted adversarial perturbation of 32 × 32 × 3 dimensionality to a vector of 5 × 1 dimensionality. These extracted features of 5 × 1 dimensionality are used as input to a clustering model. As a result, it is observed that the horse sample attacked by PGD is clustered in the same cluster as the other PGD samples.
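The reduction from 32 × 32 × 3 = 3072 dimensions to a 5-dimensional feature vector can be sketched as follows. The synthetic Gaussians are stand-ins for the six classes (legitimate data plus five attack types), and LDA yields at most classes − 1 = 5 components:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# synthetic flattened "perturbations": 6 classes, 20 samples each
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(20, 3072))
               for m in range(6)])
y = np.repeat(np.arange(6), 20)

lda = LinearDiscriminantAnalysis(n_components=5).fit(X, y)
features = lda.transform(X)   # 3072-D inputs become 5-D feature vectors
```

These 5-dimensional vectors are what the clustering model receives in the last step of the pipeline.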

Evaluation Results
To show how efficient the proposed method is when classifying the types of adversarial examples, we measured the performance of the proposed method under various conditions including different adversarial examples.

Experimental Setup
In this section, we describe experimental environments including dataset, target model, evaluation metric, and so on.
Dataset: The experiments are conducted using the MNIST dataset [27] and the CIFAR-10 dataset [29]. The MNIST dataset is a gray-scale hand-written digits dataset and consists of 60,000 training images and 10,000 testing images corresponding to 10 classes. CIFAR-10 is a color image dataset and consists of 50,000 training images and 10,000 testing images corresponding to 10 classes. The adversarial examples are generated with the parameter settings of the cleverhans library [31] and some representative works [14,16].
Evaluation Metric: To evaluate the performance of the proposed method under different types of dimensionality reduction methods and clustering methods, we measure the V-measure and its two components, i.e., Homogeneity and Completeness.
Here, Homogeneity is a measure of how well a clustering model matches each cluster to each type of adversarial example. When an array of input data X and the ground truth label l are given, Homogeneity can be defined as:

h = 1 − H(l | C(X)) / H(l),

where H(·) and H(·|·) are the entropy and conditional entropy functions, respectively, and C(X) denotes the cluster assignments. As a measure of how well all data points of each type of adversarial example are contained in the same cluster, Completeness can be defined as:

c = 1 − H(C(X) | l) / H(C(X)).

V-measure is defined as the harmonic mean of Homogeneity and Completeness:

V = (2 · h · c) / (h + c).

Implementation Environment: The classification models are implemented using TensorFlow-gpu version 1.14.1 and Python version 3.6.9. Adversarial examples are generated by using the cleverhans software library [31], which provides standardized reference implementations of adversarial examples. The performance is measured on an Ubuntu 18.04.3 LTS machine with kernel version 5.3.0-62-generic, a 2.40 GHz Intel Xeon E5-2630 v3 CPU, a GeForce RTX 2080 Ti GPU, and 64 GB of memory.
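The three clustering metrics above are available in scikit-learn. The small sketch below (with illustrative toy labels) shows that they are invariant to a permutation of cluster ids, which is the behavior needed when cluster numbers carry no intrinsic meaning:

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

attack_types = [0, 0, 1, 1, 2, 2]   # ground-truth attack labels
cluster_ids  = [1, 1, 0, 0, 2, 2]   # a perfect clustering, relabeled
h = homogeneity_score(attack_types, cluster_ids)
c = completeness_score(attack_types, cluster_ids)
v = v_measure_score(attack_types, cluster_ids)
# all three scores are 1.0: the metrics only care about cluster
# membership, not which numeric id a cluster happens to receive
```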

Ablation Analysis
To verify the effectiveness of each step of the proposed method, we conducted an ablation analysis on each step. Specifically, we measured the clustering performance of the proposed method according to the combination of steps, using three evaluation metrics, i.e., Homogeneity, Completeness, and V-measure. Here, we considered LDA and the k-means algorithm as the dimensionality reduction method and the clustering method, respectively. The experimental results are summarized in Table 2. In Table 2, it is observed that the adversarial examples themselves are not suitable features for classifying the types of adversarial examples. For example, while the proposed method using adversarial examples showed 0.249 and 0.128 for clustering performance on average under the MNIST dataset and the CIFAR-10 dataset, respectively, the proposed method using adversarial perturbations showed 0.864 and 0.701 on average. It is also observed that the proposed method showed higher performance as more steps were performed. For the MNIST dataset, while the proposed method that performed steps 1 and 3 showed 0.627 for the clustering performance on average, the proposed method that performed steps 1, 2 and 3 showed 0.864 on average. For the CIFAR-10 dataset, while the proposed method that performed steps 1 and 3 showed 0.378 on average, the proposed method that performed steps 1, 2 and 3 showed 0.701 on average.
In Figures 6 and 7, we also show the visualized clustering results of the proposed method. As shown in Figures 6c and 7c, the proposed method can classify the types of adversarial examples when it performed steps 1, 2 and 3.

Influence of Dimensionality Reduction and Clustering Methods
To evaluate the proposed method under various combinations of dimensionality reduction methods and clustering methods, the clustering performance of the proposed method is measured using three evaluation metrics, i.e., Homogeneity, Completeness, and V-measure. To train the clustering model, five types of adversarial examples (FGSM, BIM, PGD, DeepFool, C&W) are used. Each type of adversarial example is generated using 1000 randomly selected data points from the MNIST test dataset and the CIFAR-10 test dataset, respectively. The k of the k-means algorithm is set to 5, and P_eps of the DBSCAN algorithm is set to 0.878 for the MNIST dataset and 0.446 for the CIFAR-10 dataset, respectively.
In Table 3, it is observed that different combinations of dimensionality reduction methods and clustering methods have different influences on the clustering performance. From the 'Average' column in Table 3, it is observed that the proposed method showed clustering performance from 0.764 to 0.865 under the MNIST dataset and from 0.513 to 0.716 under the CIFAR-10 dataset. Especially, the combination of PCA and the DBSCAN algorithm showed the lowest clustering performance on average under both the MNIST and CIFAR-10 datasets. On the other hand, the combination of LDA and the DBSCAN algorithm showed the highest clustering performance on average. It is also observed that LDA provides better feature extraction than PCA. For example, for the CIFAR-10 dataset, while the combinations of PCA with the k-means algorithm and with the DBSCAN algorithm showed 0.646 and 0.513 on average for the clustering performance, respectively, the combinations of LDA with the k-means algorithm and with the DBSCAN algorithm showed 0.701 and 0.716 on average, respectively.

We also measured the clustering performance according to the number of adversarial example types. For the CIFAR-10 dataset, as shown in Figure 9, the clustering performance of the proposed method began to decrease when the number of adversarial examples was three or more.

Comparison to Other Defense Methods Using Adversarial Example Detection Architecture
To compare the proposed method with other recent defense methods using adversarial example detection architecture, the accuracy, precision, recall and F1-score of the proposed method are compared with those of Carrara et al.'s method [15] and Feature Squeezing [16]. Since the source code of both methods is publicly available, we selected them for comparison with the proposed method. For Carrara et al.'s method, we used Euclidean distance or cosine similarity as the embedding function and used a Long Short-Term Memory (LSTM) network and a multi-layer perceptron (MLP) network as detectors. For Feature Squeezing, we used two squeezing methods (binary filter and median filter) and three distance metrics (L1, L2, and Kullback-Leibler divergence).
As observed from the 'Accuracy' through 'F1-score' columns in Table 5, the proposed method can classify the types of adversarial examples. Here, since the recent defense methods do not provide multi-label classification performance, we only measured the classification performance of the proposed method. For the MNIST dataset, the proposed method showed accuracy, precision, recall and F1-score of 66.33%, 66.29%, 66.66% and 66.56%, respectively. For the CIFAR-10 dataset, the proposed method showed accuracy, precision, recall and F1-score of 47.18%, 46.30%, 46.23% and 46.39%, respectively.

Discussion
In this section, the limitations and the directions of the future research are discussed. From the experimental results, it is observed that the proposed method can detect adversarial examples while classifying the types of adversarial examples. Although the proposed method showed better binary classification performances than the state-of-the-art defense methods using adversarial example detection architecture, the multi-label classification performance of the proposed method is relatively low compared to the binary classification performances. This is for the following reasons: (1) Inaccuracy in the extraction of adversarial perturbation; and (2) Homogeneity that decreases with the increasing number of adversarial examples. In other words, the high accuracy of adversarial perturbation extraction can help characterize the adversarial perturbation of each adversarial example. The high homogeneity can help the clustering model to match each cluster to each type of adversarial example. Therefore, there are several possible ways to improve the multi-label classification performance of the proposed method: (1) Designing a denoising method specialized for the extraction of adversarial perturbation, rather than the simple application of general denoising techniques, i.e., binary filter and median filter; (2) Applying techniques to improve the Homogeneity such as multi-stage clustering [32,33], growing self-organizing maps [34], and the fractional derivatives [35]. We will leave those kinds of improvements for future work.
In addition, the proposed method can be extended to large-scale application datasets such as the ImageNet dataset and the malware dataset by reflecting some considerations, such as the complexity of high-resolution data and the structured characteristics.

Conclusions
Adversarial example detection architecture has been actively studied because it can defend against adversarial examples without decreasing the accuracy for legitimate input data. While the previous methods belonging to this architecture can detect whether an adversarial example has occurred, they cannot identify which type of adversarial example has occurred. In this paper, we proposed an advanced defense method using adversarial example detection architecture, which not only detects adversarial examples but also classifies the types of adversarial examples. Specifically, after extracting key features from the adversarial perturbation, the proposed method feeds the key features into a clustering model. From evaluation results under various experimental conditions, we showed that the proposed method provided good clustering performance for the MNIST dataset and the CIFAR-10 dataset. We also observed that the proposed method showed better classification accuracy than recent defense methods using adversarial example detection architecture. From such results, we believe that the proposed method can be actively used to design new defense architectures, analyze the characteristics of adversarial examples, and improve existing defense methods.

Data Availability Statement:
The data presented in this study are openly available in reference numbers [27,29].

Conflicts of Interest:
The authors declare no conflict of interest.