Mathematics
  • Article
  • Open Access

16 February 2023

DDSG-GAN: Generative Adversarial Network with Dual Discriminators and Single Generator for Black-Box Attacks

Key Laboratory of Network and Information Security of Hebei Province, College of Computer & Cyber Security, Hebei Normal University, Shijiazhuang 050024, China
Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Advanced Mathematical Methods in Intelligent Multimedia: Security and Applications

Abstract

As one of the top ten security threats faced by artificial intelligence, adversarial attacks have prompted scholars to think deeply about them, from theory to practice. However, in the black-box attack scenario, how to improve the visual quality of an adversarial example (AE) and query the target model more efficiently still needs further exploration. This study uses a GAN architecture combined with a model-stealing attack to train surrogate models and generate high-quality AEs. We propose an image AE generation method based on a generative adversarial network with dual discriminators and a single generator (DDSG-GAN) and design a corresponding loss function for each model. The generator generates the adversarial perturbation, and the two discriminators constrain the perturbation from different aspects to ensure, respectively, the visual quality and the attack effect of the generated AE. We experiment extensively on the MNIST, CIFAR10, and Tiny-ImageNet datasets. The experimental results illustrate that our method can effectively use query feedback to generate AEs, which significantly reduces the number of queries to the target model while implementing effective attacks.

1. Introduction

With the emergence of deep neural networks (DNNs), the security issues of artificial intelligence (AI) have become increasingly prominent. Because deep learning technology is so widely applied, the security of DNNs has also been increasingly questioned. The existence of AEs can cause DNNs to produce disastrous consequences in many fields, such as traffic accidents in autonomous driving [1] and malicious code successfully escaping detection [2]. AEs are a major obstacle that machine learning systems, and AI more broadly, must overcome. Their existence not only makes the output of a model deviate greatly but can even make this deviation unavoidable, which indicates that machine learning models rely on unreliable features to maximize performance: if those features are disturbed, the model misclassifies. For example, FGSM [3] can cause a model that classifies the original image as a panda with a probability of 57.7% to classify its AE as a gibbon with a probability of 99.3%.
The vulnerability of DNNs to AEs has given rise to adversarial learning. On the one hand, studying AEs helps us understand the mechanism of adversarial attacks, develop corresponding defense technologies, and construct more robust deep learning models. On the other hand, the existence of AEs reveals a serious security threat to DNNs, and research on AEs provides a more comprehensive index for evaluating the robustness of DNNs.
In adversarial attacks, because attackers have different levels of access to the target model's information, they need to design AEs for different attack scenarios. Adversarial attacks fall into two categories: white-box and black-box attacks.
In white-box attacks, adversaries can acquire the structure and parameter information of the target model. Therefore, they can use the target model's gradient information to construct the AE. Black-box attacks are more challenging to implement: the attacker can only interact with the target model through its input, which increases the difficulty of constructing AEs but is more consistent with real-world attack scenarios.
Black-box attacks comprise query-based and transfer-based attacks. Although the former can achieve a good attack effect, its query complexity is high, the query results are not fully utilized by current attack methods, and large numbers of queries are easily detected and resisted by defense mechanisms. The latter avoids querying the target model, but its attack effect is not ideal.
Combining transfer-based and query-based attacks, we design a generative adversarial network (GAN) with dual discriminators and a single generator (DDSG-GAN) to generate AEs with better attack performance. We use the generator $G$ of DDSG-GAN to generate the adversarial perturbation; the trained discriminator $D_1$ can act as a surrogate model of the target model $T$, and the discriminator $D_2$ is used to distinguish whether an input image is original. We experimentally evaluate our method on the MNIST [4], CIFAR10 [5], and Tiny-ImageNet [6] datasets and compare it with state-of-the-art (SOTA) results. The experimental results demonstrate that the proposed method has a high attack success rate while greatly reducing the number of queries to the target model, and the AEs generated by our method have higher visual quality. The main contributions are the following:
(1)
This study presents a novel image AE generation method based on the GANs of dual discriminators. The generator G generates adversarial perturbation, and two discriminators constrain the generator in different aspects. The constraint of discriminator D 1 guarantees the success of the attack, and the constraint of discriminator D 2 ensures the visual quality of the generated AE.
(2)
This study designs a new method to train the surrogate model: we use both original images and AEs to train it. The training process contains two stages: pre-training and fine-tuning. To make full use of the query results on the AEs, we put them into a circular queue for subsequent training, which greatly reduces the number of queries to the target model and makes efficient use of the query results.
(3)
This study introduces a clipping mechanism so that the generated AEs are within the ϵ neighborhood of the original image.
The remainder of our paper is organized as follows. Section 2 introduces the related work of adversarial attacks. The proposed method of generating AE is described in Section 3. Section 4 demonstrates the effectiveness of the attack method through extensive experiments. Section 5 summarizes this paper.

3. Methodology

3.1. Preliminaries

3.1.1. Adversarial Examples and Adversarial Attack

An AE is obtained by modifying an original image in a human-imperceptible way so that a machine learning model misclassifies the modified image. For a victim image classification model $T$, we use $(x, y)$ as the original image–label pair. The goal of the adversarial attack is to generate an AE $\hat{x}$ that the target model $T$ misclassifies. For the untargeted attack setting, it can be formulated as follows:
$$\arg\max T(x) = y \quad \text{and} \quad \arg\max T(\hat{x}) \neq y, \quad \text{s.t.} \ \|\hat{x} - x\|_p \leq \epsilon. \tag{1}$$
For the targeted attack setting, it can be formulated as follows:
$$\arg\max T(x) = y \quad \text{and} \quad \arg\max T(\hat{x}) = t, \quad \text{s.t.} \ \|\hat{x} - x\|_p \leq \epsilon, \tag{2}$$
where $\|\cdot\|_p$ denotes the $l_p$ norm, $t$ is the target class in the targeted attack, and $\epsilon$ is the upper bound of the perturbation.
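To make the formulation concrete, the following is a minimal PyTorch sketch (not from the paper; the helper name `is_successful_attack` and the tensor shapes are our assumptions) that checks whether a candidate AE satisfies both the misclassification condition of (1)/(2) and the $l_p$ budget.

```python
import torch

def is_successful_attack(model, x, x_adv, y, eps, p=float("inf"), target=None):
    """Check the untargeted/targeted success condition and the l_p bound.

    model: classifier returning logits of shape (N, C)
    x, x_adv: original and adversarial batches of shape (N, ...)
    y: ground-truth labels; target: target labels for the targeted setting
    """
    with torch.no_grad():
        pred = model(x_adv).argmax(dim=1)          # argmax T(x_hat)
    # Perturbation size per example under the chosen l_p norm.
    delta = (x_adv - x).flatten(1)
    norm = delta.abs().max(dim=1).values if p == float("inf") else delta.norm(p=p, dim=1)
    within_budget = norm <= eps
    if target is None:                              # untargeted: any wrong label counts
        fooled = pred != y
    else:                                           # targeted: must hit the target class
        fooled = pred == target
    return fooled & within_budget
```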

3.1.2. Attack Scenarios

In this paper, we consider the adversarial attack in the black-box scenario. Query-based black-box attacks can be divided into decision-based attacks and score-based attacks. In this paper, we focus on a decision-based attack scenario.
(1)
Score-based attacks. In this scenario, the attacker does not know any structure or parameter information of the target model, but for any input, the adversary can acquire the classification confidence scores.
(2)
Decision-based attacks. Similar to the score-based scenario, the adversary does not know any structure or parameter information of the target model, but for any input, the attacker can acquire only the classification label, as sketched below.
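The difference between the two feedback types can be emulated as follows (a small sketch; the wrapper names `score_oracle` and `decision_oracle` are hypothetical): a score-based oracle exposes the full confidence vector, while a decision-based oracle reveals only the top-1 label.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_oracle(model, x):
    """Score-based feedback: full class-confidence vector for each query."""
    return F.softmax(model(x), dim=1)

@torch.no_grad()
def decision_oracle(model, x):
    """Decision-based feedback: only the top-1 label is revealed."""
    return model(x).argmax(dim=1)
```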

3.2. Model Architecture

In this section, we introduce the method of generating AEs based on a GAN with dual discriminators and a single generator (DDSG-GAN). DDSG-GAN uses generator $G$ to generate adversarial perturbations and uses discriminators $D_1$ and $D_2$ to constrain the generated perturbations; the trained discriminator $D_1$ can then be used as a surrogate model of the target model $T$. The overall structure of DDSG-GAN is shown in Figure 1. The input of generator $G$ is the original image $x$, and the output is the perturbation vector $\delta = G(x; \theta_g)$. Adding the perturbation vector to the original image and clipping the result yields the query sample $\hat{x}$. We input $x$ and $\hat{x}$ into the target model $T$ to acquire the outputs $T(x)$ and $T(\hat{x})$. Discriminator $D_1$ is trained with the image–output pairs $(x, T(x))$ and $(\hat{x}, T(\hat{x}))$, and discriminator $D_2$ is trained with the image–label pairs $(x, 1)$ and $(\hat{x}, 0)$.
Figure 1. The proposed DDSG-GAN model.
In DDSG-GAN, $T$ is the victim black-box image classification model. The generator $G$ generates the perturbation vector $\delta$ for the input image $x$ and adds $\delta$ to $x$; after the clipping operation, we obtain the query sample $\hat{x}$. The query result of $T$ is used to train discriminator $D_1$, and discriminator $D_2$ is used to identify whether the input is an original image. Both discriminators, $D_1$ and $D_2$, constrain the generated perturbations.
During training, the generator and the discriminators play a game with each other. In each iteration, the target model $T$ and the discriminators $D_1$ and $D_2$ compute prediction results for each input. The discriminator $D_1$ fits the target model according to the output of $T$. As the iterations proceed, the fitting ability of $D_1$ to $T$ is constantly enhanced, so the attack ability of generator $G$ against the target model $T$ continues to increase. At the same time, the discriminator $D_2$ becomes increasingly capable of classifying real and fake samples, so generator $G$ generates AEs closer to the original data distribution. In this way, discriminators $D_1$ and $D_2$ and generator $G$ keep playing against each other and improving.

3.2.1. The Training of Discriminator D 1

The input of generator $G$ is the original image $x$, and the output is the perturbation vector $\delta = G(x; \theta_g)$ for $x$. Adding the generated perturbation vector to $x$ gives the intermediate sample $x' = x + \delta$. To ensure that the generated sample is within the $\epsilon$ neighborhood of the original image, we clip $x'$ or $\delta$ to get the final query sample $\hat{x}$. In the $l_2$ norm attack,
$$\hat{x} = \mathrm{Clip}(x, x') = \begin{cases} x + \dfrac{\epsilon}{\|x' - x\|_p} \cdot (x' - x), & \|x' - x\|_p \geq \epsilon, \\ x', & \|x' - x\|_p < \epsilon, \end{cases} \tag{3}$$
where $\|x' - x\|_p$ denotes the $l_p$ norm between $x'$ and $x$, and $\epsilon$ is the upper bound of the perturbation.
In the $l_\infty$ norm attack,
$$\delta = \mathrm{clip}(\delta, \alpha_1, \alpha_2) = \begin{cases} \alpha_1, & \delta < \alpha_1, \\ \delta, & \alpha_1 \leq \delta \leq \alpha_2, \\ \alpha_2, & \delta > \alpha_2, \end{cases} \tag{4}$$
where $\alpha_1$ and $\alpha_2$ are the lower and upper bounds of clipping, respectively. The final query sample is $\hat{x} = x + \delta$.
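The two clipping rules can be implemented as follows; this is a minimal PyTorch sketch assuming batched image tensors, and the function names `clip_l2` and `clip_linf` are ours rather than the paper's.

```python
import torch

def clip_l2(x, x_prime, eps):
    """Project x_prime back onto the l2 ball of radius eps around x (Equation (3))."""
    diff = x_prime - x
    norm = diff.flatten(1).norm(p=2, dim=1).clamp_min(1e-12)
    norm = norm.view(-1, *([1] * (x.dim() - 1)))
    scale = torch.where(norm >= eps, eps / norm, torch.ones_like(norm))
    return x + diff * scale

def clip_linf(delta, alpha1, alpha2):
    """Clamp each perturbation component into [alpha1, alpha2] (Equation (4))."""
    return delta.clamp(min=alpha1, max=alpha2)
```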
The adversarial attack’s goal is to make the target model misclassify the AE. It can be formulated as follows:
$$T(\hat{x}) \neq y. \tag{5}$$
For the convenience of training, we convert (5) to maximize the following objective function:
$$\max_{\hat{x}} L(T(\hat{x}), y), \tag{6}$$
where $L(\cdot, \cdot)$ measures the difference between the output of target model $T$ and $y$.
Solving the optimization problem (6) requires continuously querying the target model $T$ to obtain $T(\hat{x})$. However, this makes the number of queries very large, which is easily detected and blocked by defense mechanisms. To reduce the number of queries to the target model, we train the discriminator $D_1$ as a surrogate model for $T$ so that queries to $T$ can be transferred to $D_1$, which greatly reduces the number of queries to the target model.
The training goal of the discriminator $D_1$ is to make it usable as a surrogate model that simulates the function of model $T$. To improve the fitting ability of $D_1$, we use both the original image $x$ and the generated query sample $\hat{x}$ to train $D_1$. The loss function for training the discriminator $D_1$ is as follows:
$$L_{D_1} = \beta_1 \times d(D_1(x; \theta_{d_1}), T(x)) + \beta_2 \times d(D_1(\hat{x}; \theta_{d_1}), T(\hat{x})), \tag{7}$$
where $T(x)$ and $T(\hat{x})$ are the query results obtained by inputting $x$ and $\hat{x}$ into the target model $T$, respectively, $\theta_{d_1}$ denotes the parameters of model $D_1$, $D_1(\hat{x}; \theta_{d_1})$ is the prediction of the discriminator $D_1$ for the query sample $\hat{x}$, and $D_1(x; \theta_{d_1})$ is the prediction of the discriminator $D_1$ for the original image $x$. $\beta_1$ and $\beta_2$ are weight factors used to control the relative importance of the two terms. In this paper, we set $\beta_1 = 2$ and $\beta_2 = 1$.
For the decision-based black-box attack, the loss function of D 1 can be formulated as follows:
$$L_{D_1} = \beta_1 \times \mathrm{CEL}(D_1(x; \theta_{d_1}), T(x)) + \beta_2 \times \mathrm{CEL}(D_1(\hat{x}; \theta_{d_1}), T(\hat{x})), \tag{8}$$
where $\mathrm{CEL}(a, b)$ denotes the cross-entropy function between $a$ and $b$.
For the score-based black-box attack, the adversary obtains, by querying $T$, the classification probability of each class. We therefore convert the $T(x)$ obtained by the query into the corresponding label value and substitute it into (8) to calculate the loss function of $D_1$ in this attack setting. Algorithm 1 presents the training procedure of $D_1$.
Algorithm 1 Training procedure of the discriminator $D_1$
Input: Training dataset $(x, T(x))$ and $(\hat{x}, T(\hat{x}))$, where $x$ is the original image and $\hat{x}$ is the sample after adding the perturbation; target model $T$; the discriminator $D_1$ and its parameters $\theta_{d_1}$; the generator $G$ and its parameters $\theta_g$; loss function $L(\cdot, \cdot)$ defined in Equation (7).
Parameters: Batch number $B$, learning rate $\lambda_1$, iterations $N$, weight factors $\beta_1$ and $\beta_2$, clipping lower bound $\alpha_1$ and upper bound $\alpha_2$.
Output: The trained discriminator $D_1$.
1: for $epoch \leftarrow 1$ to $N$ do
2:   for $b \leftarrow 1$ to $B$ do
3:     $\delta = G(x; \theta_g)$
4:     if $norm = 2$ do
5:       $x' = x + \delta$
6:       $\hat{x} = \mathrm{Clip}(x, x')$
7:     elif $norm = \infty$ do
8:       $\delta = \mathrm{clip}(\delta, \alpha_1, \alpha_2)$
9:       $\hat{x} = x + \delta$
10:    end if
11:    $\hat{x} \leftarrow \mathrm{clip}(\hat{x}, 0, 1)$
12:    $loss_{d_1} = \beta_1 \times d(D_1(x; \theta_{d_1}), T(x)) + \beta_2 \times d(D_1(\hat{x}; \theta_{d_1}), T(\hat{x}))$
13:    $\theta_{d_1} \leftarrow \theta_{d_1} - \lambda_1 \times \nabla_{\theta_{d_1}} loss_{d_1}$
14:  end for
15: end for
16: return $D_1$
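As a concrete illustration of one $D_1$ update in Algorithm 1 under the decision-based setting (Equation (8)), here is a hedged PyTorch sketch; the variable names, the optimizer handling, and the use of hard labels returned by the target model are our assumptions.

```python
import torch
import torch.nn.functional as F

def train_d1_step(d1, optimizer_d1, x, x_hat, t_labels_x, t_labels_xhat,
                  beta1=2.0, beta2=1.0):
    """One surrogate-training step for D1 (decision-based setting, Equation (8)).

    t_labels_x / t_labels_xhat are the hard labels returned by the target model T
    for the clean batch x and the perturbed batch x_hat, respectively.
    """
    loss = (beta1 * F.cross_entropy(d1(x), t_labels_x)
            + beta2 * F.cross_entropy(d1(x_hat), t_labels_xhat))
    optimizer_d1.zero_grad()
    loss.backward()
    optimizer_d1.step()
    return loss.item()
```

For the score-based setting described above, the same step applies after converting the returned confidence vectors to their argmax labels.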

3.2.2. The Training of Discriminator D 2

We train the discriminator $D_1$ as a surrogate model for $T$, so most of the queries to $T$ can be transferred to discriminator $D_1$. For the attack to be useful, the AEs must also remain close to $x$, so discriminator $D_2$ is set up to distinguish whether a sample comes from the original images: if it is an original image, the label is 1; if it is an AE, the label is 0. The objective function for training the discriminator $D_2$ is:
$$L_{D_2} = \mathbb{E}_{x \sim P_{data}(x)} \left[ \log\!\left(D_2(x; \theta_{d_2})\right) + \log\!\left(1 - D_2(\hat{x}; \theta_{d_2})\right) \right], \tag{9}$$
where $P_{data}(x)$ is the data distribution of the original image $x$, $\mathbb{E}$ denotes the expectation of the expression, $\theta_{d_2}$ denotes the parameters of model $D_2$, $D_2(x; \theta_{d_2})$ is the prediction of the discriminator $D_2$ for the original image $x$, and $D_2(\hat{x}; \theta_{d_2})$ is the prediction of the discriminator $D_2$ for the query sample $\hat{x}$.
The discriminator $D_2$ judges whether a sample is real or fake, and training the generator to fool $D_2$ drives the distribution of the generated AEs closer to that of the original images. Algorithm 2 presents the training procedure of $D_2$.
Algorithm 2 Training procedure of the discriminator $D_2$
Input: Training dataset $(x, 1)$ and $(\hat{x}, 0)$, where $x$ is the original image and $\hat{x}$ is the query sample; the discriminator $D_2$ and its parameters $\theta_{d_2}$; loss function $L(\cdot, \cdot)$ defined in Equation (9).
Parameters: Batch number $B$, learning rate $\lambda_2$, iterations $N$.
Output: The trained discriminator $D_2$.
1: for $epoch \leftarrow 1$ to $N$ do
2:   for $b \leftarrow 1$ to $B$ do
3:     $loss_{d_2} = \mathbb{E}_{x \sim P_{data}(x)} [\log(D_2(x; \theta_{d_2})) + \log(1 - D_2(\hat{x}; \theta_{d_2}))]$
4:     $\theta_{d_2} \leftarrow \theta_{d_2} + \lambda_2 \times \nabla_{\theta_{d_2}} loss_{d_2}$
5:   end for
6: end for
7: return $D_2$
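For completeness, a sketch of the $D_2$ update in Algorithm 2 (Equation (9)); we assume $D_2$ outputs a probability in (0, 1), e.g., its last layer is a sigmoid, and we implement the gradient ascent on $L_{D_2}$ by minimizing its negation, which is equivalent. The names are ours.

```python
import torch

def train_d2_step(d2, optimizer_d2, x, x_hat, eps=1e-8):
    """One update of D2 (Equation (9)): real images labeled 1, query samples labeled 0.

    Maximizing L_D2 is implemented as minimizing -L_D2.
    """
    real = d2(x).clamp(eps, 1 - eps)
    fake = d2(x_hat.detach()).clamp(eps, 1 - eps)   # detach: do not update G here
    loss = -(torch.log(real) + torch.log(1.0 - fake)).mean()
    optimizer_d2.zero_grad()
    loss.backward()
    optimizer_d2.step()
    return loss.item()
```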

3.2.3. The Training of Generator G

The input of generator $G$ is the original image $x$, and the output is the perturbation vector $\delta$ for $x$. To make the generated AE fool the target model $T$, we need to maximize the objective function (6). However, every update of generator $G$ would then require querying $T$, and the parameter information of the target model $T$ would be needed during backpropagation, which does not conform to the black-box attack setting. Therefore, we replace the target model $T$ with the discriminator $D_1$ and approximate (6) as follows:
$$\max_{\hat{x}} L(D_1(\hat{x}; \theta_{d_1}), y), \tag{10}$$
where $L(\cdot, \cdot)$ is the cross-entropy function. Since the output of $D_1$ has passed through a softmax, the denominator of (11) will not be 0, and (10) is equivalent to the following (11):
$$\min_{\hat{x}} \frac{1}{L(D_1(\hat{x}; \theta_{d_1}), y)}, \tag{11}$$
Thus, generator $G$'s loss function with respect to discriminator $D_1$ can be defined as follows:
$$L_{G\_D_1} = \frac{1}{L(D_1(\hat{x}; \theta_{d_1}), y)}. \tag{12}$$
To ensure that, while the attack succeeds, the generated AEs remain close to the distribution of the original images, the loss function of the generator $G$ with respect to the discriminator $D_2$ is defined as:
$$L_{G\_D_2} = \log\left[1 - D_2(\hat{x}; \theta_{d_2})\right]. \tag{13}$$
To obtain a high attack success rate, it is necessary to continuously input $\hat{x}$ into the target model $T$ and use the loss between the output $T(\hat{x})$ and the ground truth (untargeted attack) or the target class (targeted attack) to optimize generator $G$. The objective loss function of the attack can be formulated as follows:
$$L_{att\_score} = \begin{cases} \hat{y}_t - \hat{y}_T, & \text{if untargeted attack}, \\ \hat{y}_T - \hat{y}_t, & \text{if targeted attack}, \end{cases} \tag{14}$$
where $\hat{y}_t$ denotes the prediction probability of $T$ for the target class in the targeted attack or the prediction probability of $T$ for the real class in the untargeted attack, and $\hat{y}_T$ denotes the maximum value among the predicted probabilities of the other classes by $T$.
To reduce the number of queries and be more consistent with the black-box setting, we use discriminator $D_1$ instead of $T$ to optimize the training process. The objective loss function can be formulated as follows:
$$L_{att} = \begin{cases} \hat{y}_t - \hat{y}_{D_1}, & \text{if untargeted attack}, \\ \hat{y}_{D_1} - \hat{y}_t, & \text{if targeted attack}, \end{cases} \tag{15}$$
where $\hat{y}_t$ denotes the prediction probability of $D_1$ for the target class in the targeted attack or the prediction probability of $D_1$ for the real class in the untargeted attack, and $\hat{y}_{D_1}$ denotes the maximum value among the predicted probabilities of the other classes by $D_1$.
We train the generator $G$ by minimizing the following objective function:
$$L_G = \gamma_1 \times L_{G\_D_1} + \gamma_2 \times L_{G\_D_2} + \gamma_3 \times L_{att}, \tag{16}$$
where $\gamma_i$ $(i = 1, 2, 3)$ are the weight factors that control the relative importance of the three losses. $L_{G\_D_1}$ makes the generated AE gradually deceive discriminator $D_1$, $L_{G\_D_2}$ pushes the generated AEs closer to the real data distribution, and $L_{att}$ is the attack loss, whose optimization produces a better attack effect. In this paper, the generator $G$ and the discriminators $D_1$ and $D_2$ are obtained by solving the minimax problem $\min_G \min_{D_1} \max_{D_2} L_G$.
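The combined generator objective can be sketched in PyTorch as follows. This is our own illustration, not the paper's code: the weights $\gamma_i$ are placeholders, the cross-entropy is always taken against the ground truth $y$ as in (12), and the margin-style attack loss is written in the form consistent with minimizing $L_G$ as described above.

```python
import torch
import torch.nn.functional as F

def generator_loss(d1, d2, x_hat, y, target=None,
                   gamma1=1.0, gamma2=1.0, gamma3=1.0, eps=1e-8):
    """Combined generator objective L_G (Equation (16)); gamma_i are placeholders."""
    logits = d1(x_hat)                                    # D1 plays the surrogate of T
    probs = F.softmax(logits, dim=1)
    ce = F.cross_entropy(logits, y)
    loss_g_d1 = 1.0 / (ce + eps)                          # Equation (12)
    loss_g_d2 = torch.log(1.0 - d2(x_hat).clamp(eps, 1 - eps)).mean()  # Equation (13)

    # Margin-style attack loss (Equation (15)): push the true class down
    # (untargeted) or the target class up (targeted) relative to the best
    # competing class predicted by D1.
    if target is None:
        y_t = probs.gather(1, y.view(-1, 1)).squeeze(1)
        others = probs.scatter(1, y.view(-1, 1), 0.0)
        loss_att = (y_t - others.max(dim=1).values).mean()
    else:
        y_t = probs.gather(1, target.view(-1, 1)).squeeze(1)
        others = probs.scatter(1, target.view(-1, 1), 0.0)
        loss_att = (others.max(dim=1).values - y_t).mean()
    return gamma1 * loss_g_d1 + gamma2 * loss_g_d2 + gamma3 * loss_att
```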

3.2.4. Improved Model

As can be seen from the training of discriminator $D_1$, every training update of $D_1$ needs to query $T$. To reduce the number of queries to $T$ while preserving the fitting ability of $D_1$, we design a circular queue to store query results for the training of $D_1$ and divide the training process of $D_1$ into two stages: model pre-training and fine-tuning.
First, when the number of iterations satisfies $iter \leq n$, we set $\beta_1 = 3$ and $\beta_2 = 0$ and use $(x, T(x))$ to pre-train $D_1$ according to Equation (7). When $iter > n$ and $iter \bmod m = 0$, we add the query result $(\hat{x}, T(\hat{x}))$ of the current iteration to the circular queue $H$. Thus, when $iter > n$, we set $\beta_1 = 2$ and $\beta_2 = 1$ and fine-tune $D_1$ according to Equation (7), using both $(x, T(x))$ and the query results $(\hat{x}, T(\hat{x}))$ stored in the circular queue.
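The circular queue $H$ can be implemented with a bounded deque. The sketch below is our own: it stores the most recent $(\hat{x}, T(\hat{x}))$ query results (the oldest entries are overwritten once the queue is full) and samples a mini-batch for fine-tuning; the default maximum length follows the CIFAR10 setting of 50,001 reported in Section 4.3 and is otherwise an assumption.

```python
import random
from collections import deque

import torch

class QueryBuffer:
    """Circular queue H for (x_hat, T(x_hat)) pairs: once full, new entries
    overwrite the oldest ones."""

    def __init__(self, max_len=50_001):          # dataset-dependent; see Section 4
        self.buffer = deque(maxlen=max_len)

    def add(self, x_hat, t_labels):
        # Store per-example pairs so later mini-batches can be re-mixed freely.
        for xi, yi in zip(x_hat.detach().cpu(), t_labels.detach().cpu()):
            self.buffer.append((xi, yi))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, min(batch_size, len(self.buffer)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)
```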
Since we constantly use the query results of $T$ to train $D_1$ in each iteration, $D_1$ approximates $T$ closely. Therefore, the ultimate goal of generator $G$ can be converted into making the discriminator $D_1$ misclassify the AE: if an AE successfully leads $D_1$ to misclassify it, we can expect that, with high probability, it also fools the target model $T$. Hence, over the whole training process, while generating the adversarial perturbations we also train a surrogate model that closely simulates the target model, combining GANs and model-stealing attacks to improve the transferability of the AEs. Algorithm 3 presents the training procedure of the whole model.
Algorithm 3 Training procedure of the DDSG-GAN
Input: Target model $T$; generator $G$ and its parameters $\theta_g$; discriminator $D_1$ and its parameters $\theta_{d_1}$; discriminator $D_2$ and its parameters $\theta_{d_2}$; original image–label pair $(x, y)$; the learning rates $\eta_g$, $\eta_{d_1}$, and $\eta_{d_2}$.
Output: The trained generator $G$.
1: Initialize the models $G$, $D_1$, and $D_2$.
2: for $i \leftarrow 1$ to $N$ do
3:   for $j \leftarrow 1$ to $n_1$ do
4:     $\delta \leftarrow G(x; \theta_g)$
5:     if $norm = 2$ do
6:       $x' = x + \delta$
7:       $\hat{x} \leftarrow \mathrm{Clip}(x, x')$
8:     elif $norm = \infty$ do
9:       $\delta = \mathrm{clip}(\delta, \alpha_1, \alpha_2)$
10:      $\hat{x} \leftarrow x + \delta$
11:    end if
12:    $\hat{x} \leftarrow \mathrm{clip}(\hat{x}, 0, 1)$      ▷ query example
13:    if $i > n$ and $i \bmod m = 0$ do
14:      Input $\hat{x}$ into the target model $T$ to get the query result $T(\hat{x})$
15:      Add $(\hat{x}, T(\hat{x}))$ to the circular queue $H$
16:    end if
17:    if $i \leq n$ do         ▷ pre-training of $D_1$
18:      $L_{D_1} = d(D_1(x; \theta_{d_1}), T(x))$
19:    elif $i > n$ do         ▷ fine-tuning of $D_1$
20:      $L_{D_1} = \beta_1 \times d(D_1(x; \theta_{d_1}), T(x)) + \beta_2 \times d(D_1(\hat{x}; \theta_{d_1}), T(\hat{x}))$      ▷ $(\hat{x}, T(\hat{x}))$ is taken from the circular queue $H$
21:    end if
22:    $\theta_{d_1} \leftarrow \theta_{d_1} - \eta_{d_1} \nabla_{\theta_{d_1}} L_{D_1}$
23:  end for
24:  for $j \leftarrow 1$ to $n_2$ do
25:    $L_{D_2} = \mathbb{E}_{x \sim P_{data}(x)} [\log(D_2(x; \theta_{d_2})) + \log(1 - D_2(\hat{x}; \theta_{d_2}))]$
26:    $\theta_{d_2} \leftarrow \theta_{d_2} + \eta_{d_2} \nabla_{\theta_{d_2}} L_{D_2}$
27:  end for
28:  for $j \leftarrow 1$ to $n_3$ do
29:    $L_G = \gamma_1 \times L_{G\_D_1} + \gamma_2 \times L_{G\_D_2} + \gamma_3 \times L_{att}$
30:    $\theta_g \leftarrow \theta_g - \eta_g \nabla_{\theta_g} L_G$
31:  end for
32: end for
33: return $G$
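A condensed PyTorch-style sketch of the alternating schedule in Algorithm 3 is given below. It reuses the helper sketches introduced earlier in this section (`clip_l2`, `clip_linf`, `decision_oracle`, `train_d1_step`, `train_d2_step`, `generator_loss`, `QueryBuffer`); the choice $\alpha_1 = -\epsilon$, $\alpha_2 = \epsilon$ for the $l_\infty$ case and the pre-training weights $\beta_1 = 3$, $\beta_2 = 0$ follow the text, while everything else (argument names, schedules) is an assumption, not the authors' implementation.

```python
def train_ddsg_gan(generator, d1, d2, target_model, loader, buffer,
                   opt_g, opt_d1, opt_d2, *, N, n, m, n1=1, n2=1, n3=1,
                   eps=0.3, norm="linf"):
    """Alternating training of D1, D2, and G following Algorithm 3 (sketch)."""
    for i in range(1, N + 1):
        for x, y in loader:
            # --- build the query sample x_hat from the current generator ---
            delta = generator(x)
            if norm == "l2":
                x_hat = clip_l2(x, x + delta, eps)
            else:
                x_hat = x + clip_linf(delta, -eps, eps)   # assume alpha = +/- eps
            x_hat = x_hat.clamp(0.0, 1.0)

            t_x = decision_oracle(target_model, x)        # labels for clean images
            if i > n and i % m == 0:                      # occasional real queries
                buffer.add(x_hat, decision_oracle(target_model, x_hat))

            # --- D1: pre-training, then fine-tuning with buffered queries ---
            for _ in range(n1):
                if i <= n or len(buffer.buffer) == 0:
                    # pre-training uses only (x, T(x)); beta2 = 0 disables the
                    # perturbed-sample term
                    train_d1_step(d1, opt_d1, x, x, t_x, t_x, beta1=3.0, beta2=0.0)
                else:
                    xb, yb = buffer.sample(x.size(0))
                    train_d1_step(d1, opt_d1, x, xb, t_x, yb, beta1=2.0, beta2=1.0)

            # --- D2: real-vs-adversarial discriminator ---
            for _ in range(n2):
                train_d2_step(d2, opt_d2, x, x_hat)

            # --- G: minimize the combined loss L_G ---
            for _ in range(n3):
                delta = generator(x)
                if norm == "l2":
                    x_hat = clip_l2(x, x + delta, eps)
                else:
                    x_hat = x + clip_linf(delta, -eps, eps)
                x_hat = x_hat.clamp(0.0, 1.0)
                loss_g = generator_loss(d1, d2, x_hat, y)
                opt_g.zero_grad()
                loss_g.backward()
                opt_g.step()
    return generator
```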

3.2.5. Generate Adversarial Examples

Firstly, according to Algorithm 3, the adversary trains the generator $G$ for the target model $T$ under a specific attack setting. Secondly, we input the original image $x$ into the trained generator $G$ to obtain the corresponding perturbation vector $\delta$ and add $\delta$ to the original sample to get the initial AE $x' = x + \delta$. To ensure that the perturbation of the AE stays within a small range, we perform the corresponding clipping operation on $x'$ to obtain the AE $\hat{x}$: for an $l_2$ norm attack, the clipping operation follows formula (3); for an $l_\infty$ norm attack, it follows formula (4). Finally, the AE $\hat{x}$ is input into the corresponding target model $T$ to perform the attack.
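The inference step can be summarized in a short self-contained sketch (ours, not the paper's code; $\alpha_1 = -\epsilon$, $\alpha_2 = \epsilon$ is again an assumption for the $l_\infty$ case):

```python
import torch

@torch.no_grad()
def generate_adversarial(generator, x, eps, norm="l2"):
    """Turn original images into AEs with a trained generator (Section 3.2.5)."""
    delta = generator(x)                                     # perturbation from trained G
    if norm == "l2":
        dist = delta.flatten(1).norm(p=2, dim=1).clamp_min(1e-12)
        scale = torch.where(dist >= eps, eps / dist, torch.ones_like(dist))
        x_hat = x + delta * scale.view(-1, *([1] * (x.dim() - 1)))   # Equation (3)
    else:
        x_hat = x + delta.clamp(-eps, eps)                            # Equation (4)
    return x_hat.clamp(0.0, 1.0)
```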
Figure 2 shows the specific attack process of the MNIST dataset. As shown in Figure 2, after the training of DDSG-GAN, we input the original image x into the trained generator to make AE x ^ . Then, input x ^ into the corresponding target model to attack. The generator designed in this paper consists of an encoder and a decoder. The encoder is a 5-layer convolution network, and the decoder is a 3-layer convolution network. For different target models, DDSG-GAN will train different generators and get different attack results.
Figure 2. The attack procedure on MNIST dataset.

4. Experiment

4.1. Experiment Setting

In this section, we will introduce the specific details of the experiment, including datasets, target model architecture, method settings, and evaluation indicators.
Dataset: We evaluate the effectiveness of the proposed method through experimental results on MNIST, CIFAR10, and Tiny-ImageNet. For each dataset, we select images from its test set that the target model classifies correctly and use them as the evaluation set. The numbers of selected images are 1000, 1000, and 1600, respectively.
Attack scenario: We use a decision-based attack in the black-box attack setting to evaluate the proposed method. The attackers can acquire the output results of the target model but cannot obtain any structure and parameter information about the target model.
Target model architecture: In the $l_\infty$ norm attack on the MNIST dataset, we follow advGAN [12] and train three image classification models for attack testing. Models A and B are from [31], and model C is from [8]. In the $l_2$ norm attack, we train model D as the target model. The structures of these models are shown in Table 1.
Table 1. MNIST classification model.
For the CIFAR10 dataset, we perform an $l_\infty$ norm attack and, again following advGAN, train ResNet32 as the target model. For the Tiny-ImageNet dataset, we train a ResNet34 classification model as the target model and perform an $l_2$ norm attack.
DDSG-GAN model details: The DDSG-GAN model contains dual discriminators and a single generator. The generator consists of an encoder and a decoder. For the MNIST and CIFAR10 datasets, we design the same generator structure: the encoder is a 5-layer convolutional network, and the decoder is a 3-layer convolutional network (see Figure 2 for the specific generator structure). For Tiny-ImageNet, we add one convolutional layer to the encoder and the decoder, respectively. For the MNIST dataset, the discriminator $D_1$ is a 4-layer convolutional neural network; for CIFAR10, $D_1$ is a ResNet18 without pre-training; for Tiny-ImageNet, we use two types of discriminator $D_1$, a pre-trained ResNet18 and a pre-trained ResNet50. We design the same discriminator $D_2$ for all datasets: a binary classification model composed of a 4-layer convolutional network, which is used to distinguish whether a sample comes from the original images.
Method setting: Multiple classification models are trained for the MNIST, CIFAR10, and Tiny-ImageNet datasets. First, Algorithm 3 is used to train the generator $G$. The trained $G$ then generates the adversarial perturbation, which is added to the original sample, and the AE is obtained by the clipping operation. Finally, we use these AEs to attack the classification models. In the targeted attack, the target class is set to $t = (y + 1) \bmod C$, where $y$ is the ground truth and $C$ is the total number of categories.
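For clarity, the target-class rule is just a cyclic shift of the ground-truth label; a one-line helper (name ours) suffices:

```python
def target_class(y, num_classes):
    """Target label used in the targeted-attack experiments: t = (y + 1) mod C."""
    return (y + 1) % num_classes
```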
Evaluation indicators: (1) Attack success rate (ASR). In the untargeted attack, it is the proportion of AEs classified into any class other than the ground truth; in the targeted attack, it is the proportion of AEs classified into the specified target class. (2) The magnitude of the perturbation. We conduct attack experiments under the $l_2$ and $l_\infty$ norms and set the corresponding perturbation thresholds.

4.2. Experiments on MNIST

In this section, we use the $l_2$ and $l_\infty$ norms to perform targeted and untargeted attacks on MNIST. Table 2 shows the specific parameter settings. The untargeted attack aims to generate AEs that make the classification result of the target model differ from the ground truth; the targeted attack aims to generate AEs that make the classification result fall into the specified category. The experimental results are shown in Table 3, Table 4 and Table 5.
Table 2. Experimental parameter setting of MNIST.
Table 3. Training results of the surrogate model.
Table 4. Experimental results of the untargeted attack under the $l_\infty$ norm (ASR: attack success rate).
Table 5. Experimental results of the targeted attack under the $l_\infty$ norm.
First, we attack the target models under the $l_\infty$ norm. We train discriminator $D_1$ as a surrogate model of $T$ and calculate its classification accuracy and its similarity with the model $T$ (the proportion of inputs for which the surrogate model and the target model produce the same output) on the MNIST test set. The experimental results are shown in Table 3. The classification accuracy of the surrogate models and their similarity with the target models are close to or above 99%, indicating that the surrogate model we trained can take over the function of the target model.
In the $l_\infty$ norm attack, we set the maximum perturbation threshold $\epsilon = 0.30$ to evaluate the proposed approach. We compare DDSG-GAN with surrogate-model-based black-box attacks, DaST, and advGAN. The surrogate model is trained by two methods. The first trains the surrogate model according to [32], using 150 images from the test set as the original training set $S_0$, with the Jacobian augmentation parameter $\lambda = 1$ and 30 Jacobian augmentation iterations. The second uses the trained discriminator $D_1$ as the surrogate model and combines it with FGSM and PGD for the black-box attacks. We set an upper bound on the number of queries to the target model in the DaST method: for the MNIST dataset, the query budget of each image is set to 1000, so the total query upper bound of the DaST method is $6 \times 10^7$.
For the surrogate-model-based attack, we use the same DNN model as the surrogate model and attack the target model with the FGSM and PGD attack algorithms. To give the surrogate model trained by the first method a better attack effect, we set $\epsilon = 0.40$ for it, while the perturbation thresholds of the other methods are set to $\epsilon = 0.30$. Table 4 shows that our proposed method (DDSG-GAN) achieves an attack success rate of nearly 100%, which is much higher than the black-box attacks based on surrogate models and DaST. We also calculate the average number of queries to the target model: for the target models A, B, and C, the number of queries per image in the training set is 15, 20, and 28, respectively, which keeps the query cost low. Because the target model is unknown, black-box attacks based on the surrogate model have a low success rate. When $D_1$ is used as the surrogate model, compared with the surrogate model trained by [32], the attack success rate increases by 3.8% on average (4.7%, 2.4%, 4.3%) when combined with the FGSM algorithm and by 19.4% on average (22.2%, 12.2%, 23.9%) when combined with the PGD algorithm; the attack effect is significantly improved. This demonstrates that the surrogate model we trained can replace the target model to a large extent and that this method can also achieve a good attack effect.
Table 5 shows the results of the targeted attack under the $l_\infty$ norm, where we also compare with the advGAN method. The attack success rate of DDSG-GAN is 4.23% higher than advGAN on average (6.5%, 7.5%, 0.58%), three to four times that of DaST, and much higher than the surrogate-model-based black-box attack. For target models A, B, and C, the number of queries per image in the training set is 70, 75, and 109, respectively, which is also kept at a low level. When $D_1$ is used as the surrogate model, compared with the surrogate model trained by [32], the ASR increases by 7.17% on average (7%, 7.5%, 7%) when combined with the FGSM algorithm and by 31.17% on average (25.4%, 31.6%, 36.5%) when combined with the PGD algorithm; the attack effect is again significantly improved. In this attack setting, we visualize the AEs generated by DDSG-GAN on MNIST in Figure 3. The top row shows original samples of each class randomly selected from the training set, and the other rows show the AEs generated by DDSG-GAN for the corresponding target models. The probability that each AE is classified into the target class is shown below the image.
Figure 3. Visualization of the AE in targeted l 2 attack.
We also carried out an untargeted attack under the l 2 norm, and the results are shown in Table 6. In the l 2 norm attack, DDSG-GAN achieved comparable ASR and perturbation size to other attack methods but reduced the number of queries.
Table 6. Experimental results of untargeted attack under l 2 norm.

4.3. Experiments on CIFAR10 and Tiny-ImageNet

We perform untargeted and targeted attacks on CIFAR10 under the $l_\infty$ norm. Unlike the experimental parameter settings for MNIST, we set $m = n = 1$, $\eta_g = 0.00001$, and $\eta_{d_1} = 0.001$, and the maximum length of $H$ is set to 50,001. The target model of the attack is ResNet32, whose classification accuracy is 92.4%. In the targeted attack, the classification accuracy of the trained $D_1$ on the test set reaches 54.82%, and its similarity with the target model is 73.26%; the classification accuracy of the surrogate model trained by DaST is only 20.35%, so the accuracy of $D_1$ is 2.69 times as high. To verify the effectiveness of DDSG-GAN, we also compare it with DaST, advGAN, and the surrogate-model-based black-box attack on CIFAR10. The results are shown in Table 7.
Table 7. Attack results under the $l_\infty$ norm on CIFAR10.
Under both the targeted and the untargeted attack settings, we implement FGSM and PGD attacks based on the surrogate model. For FGSM, we set $\epsilon = 0.4$, as this is shown to be effective in [32]; for the other attack methods, we uniformly set the perturbation threshold to $\epsilon = 0.031$. We also set an upper bound on the number of queries to the target model in the DaST method on CIFAR10: the query budget of each image is set to 1000, so the total query upper bound of the DaST method is $5 \times 10^7$. As can be seen from Table 7, DDSG-GAN has an obvious advantage over the other attack methods. Compared with advGAN, the ASR of DDSG-GAN in the targeted attack is improved by 0.93%, and it is much higher than the surrogate-model-based black-box attack and DaST. At the same time, the surrogate model we trained also achieves a good fitting effect: in the untargeted attack (targeted attack), when $D_1$ is used as the surrogate model combined with the FGSM algorithm, the ASR is 4.9% (10.9%) higher than with the surrogate trained by [32], and when combined with the PGD algorithm, the ASR increases by 10% (8.1%). The attack effect is clearly improved. For the untargeted attack setting, a visualization of the AEs generated by DDSG-GAN is shown in Figure 4: Figure 4a shows original samples randomly selected from the training set, and Figure 4b shows the AEs generated by DDSG-GAN for the corresponding target model.
Figure 4. Visualization of AE generated by DDSG-GAN for attacking the ResNet32 on CIFAR10. (a) original samples randomly selected from the training set; (b) AE generated by DDSG-GAN for the corresponding target model.
We perform an untargeted attack on Tiny-ImageNet under the $l_2$ norm. Because the Tiny-ImageNet dataset is large, only about one-third of the training set, that is, 32,000 images, is randomly selected in each training iteration. We set $m = n = 1$, $\eta_g = 0.001$, $\eta_{d_1} = \eta_{d_2} = 0.0001$, and $\epsilon = 4.6$, and the maximum length of $H$ is set to 32,001. The pre-trained ResNet18 and ResNet50 are used as discriminators $D_1$; the classification accuracy of the trained $D_1$ on the test set is 52.3% and 45.8%, respectively. The results are shown in Table 8. As can be seen from Table 8, the more complex the surrogate model, the better the attack effect; therefore, to improve the attack effect, the complexity of the surrogate model can be appropriately increased.
Table 8. Attack results under l 2 norm on Tiny-ImageNet.

4.4. Model Analysis

As can be seen from the above experimental results, compared with the black-box attack based on the surrogate model (under the $l_\infty$ norm), DDSG-GAN has a clear advantage and a significantly higher attack success rate. In the black-box attack experiments based on the surrogate model, the surrogate model trained in this paper also attains a higher attack success rate. In the $l_2$ norm attack, the number of queries to the target model is greatly reduced while the success rate stays high. In addition, the attack effect of the model depends largely on the network architecture of the generator and the discriminators: when we use a fully connected neural network as the generator in Algorithm 3, the ASR of the untargeted attack is only 80%. Therefore, designing a better network architecture helps improve the attack ability of the model.

5. Conclusions

Based on the structure of GAN, we design an architecture for generating AEs with dual discriminators and a single generator, using the generator to produce the adversarial perturbation while the two discriminators constrain the generated perturbation from different aspects. This ensures a high attack success rate and low image distortion while keeping the query cost low. While training the generator, the discriminator $D_1$ gradually fits the target model and is finally obtained as a surrogate model that closely simulates it. In this way, $D_1$ combined with a white-box attack algorithm can carry out a surrogate-model-based black-box attack, and this attack reaches a high success rate, which shows that the surrogate model we trained works well. When training the discriminator $D_1$, we add a circular queue to save the query results, which makes efficient use of them and greatly reduces the query requirements. In future work, we will consider adding perturbations only in key regions to maintain the attack effect while reducing unnecessary image distortion, and we will consider broader datasets, such as ImageNet, to improve the universality of the method.

Author Contributions

Methodology, F.W., Z.M. and X.Z.; validation, Z.M. and Q.L.; formal analysis, F.W., Z.M. and C.W.; investigation, F.W. and Z.M.; data curation, Z.M. and Q.L.; writing—original draft preparation, F.W., Z.M. and X.Z.; writing—review and editing, F.W., C.W. and Q.L.; visualization, Z.M.; supervision, F.W. and C.W.; funding acquisition, F.W. and C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NSFC under Grant 61572170, Natural Science Foundation of Hebei Province under Grant F2021205004, Science and Technology Foundation Project of Hebei Normal University under Grant L2021K06, Science Foundation of Returned Overseas of Hebei Province Under Grant C2020342, and Key Science Foundation of Hebei Education Department under Grant ZD2021062.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/ (accessed on 10 January 2023). The CIFAR10 dataset is available at http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz (accessed on 10 January 2023). The Tiny-ImageNet dataset is available at http://cs231n.stanford.edu/tiny-imagenet-200.zip (accessed on 8 February 2023).

Acknowledgments

We would like to thank Yong Yang, Dongmei Zhao, and others for helping us check the details and providing us with valuable suggestions for this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. McAllister, R.; Gal, Y.; Kendall, A.; Van Der Wilk, M.; Shah, A. Concrete problems for autonomous vehicle safety: Advantages of bayesian deep learning. In Proceedings of the Twenty-Sixth International Joint Conferences on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 4745–4753. [Google Scholar]
  2. Grosse, K.; Papernot, N.; Manoharan, P.; Backes, M.; McDaniel, P. Adversarial perturbations against deep neural networks for malware classification. arXiv 2016, arXiv:1606.04435. [Google Scholar]
  3. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.2014. [Google Scholar]
  4. LeCun, Y. The Mnist Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 10 January 2023).
  5. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, CA, USA, 2009. [Google Scholar]
  6. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  7. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
  8. Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP), San Jose, CA, USA, 22–24 May 2017; pp. 39–57. [Google Scholar]
  9. Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. In Artificial Intelligence Safety and Security; Roman, V.Y., Ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018; pp. 99–112. [Google Scholar]
  10. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  11. Baluja, S.; Fischer, I. Adversarial Transformation Networks: Learning to Generate Adversarial Examples. arXiv 2017, arXiv:1703.09387. [Google Scholar]
  12. Xiao, C.; Li, B.; Zhu, J.Y.; He, W.; Liu, M.; Song, D. Generating Adversarial Examples with Adversarial Networks. arXiv 2018, arXiv:1801.02610. [Google Scholar]
  13. Moosavi-Dezfooli, S.M.; Fawzi, A.; Frossard, P. Deepfool: A Simple and Accurate Method to Fool Deep Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2574–2582. [Google Scholar]
  14. Su, J.; Vargas, D.V.; Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 2019, 23, 828–841. [Google Scholar] [CrossRef]
  15. Chen, P.Y.; Zhang, H.; Sharma, Y.; Yi, J.; Hsieh, C.J. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, TX, USA, 3 November 2017; pp. 15–26. [Google Scholar]
  16. Tu, C.C.; Ting, P.; Chen, P.Y.; Liu, S.; Zhang, H.; Yi, J.; Cheng, S.M. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 February 2019; pp. 742–749. [Google Scholar]
  17. Ilyas, A.; Engstrom, L.; Madry, A. Prior convictions: Black-box adversarial attacks with bandits and priors. arXiv 2018, arXiv:1807.07978. [Google Scholar]
  18. Guo, C.; Gardner, J.; You, Y.; Wilson, A.G.; Weinberger, K. Simple black-box adversarial attacks. In Proceedings of the International Conference on Machine Learning, Boca Raton, FL, USA, 16–19 December 2019; pp. 2484–2493. [Google Scholar]
  19. Yang, J.; Jiang, Y.; Huang, X.; Ni, B.; Zhao, C. Learning black-box attackers with transferable priors and query feedback. In Proceedings of the NeurIPS 2020, Advances in Neural Information Processing Systems 33, Beijing, China, 6 December 2020; pp. 12288–12299. [Google Scholar]
  20. Du, J.; Zhang, H.; Zhou, J.T.; Yang, Y.; Feng, J. Query efficient meta attack to deep neural networks. arXiv 2019, arXiv:1906.02398. [Google Scholar]
  21. Ma, C.; Chen, L.; Yong, J.H. Simulating unknown target models for query-efficient black-box attacks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11835–11844. [Google Scholar]
  22. Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P. Evasion attacks against machine learning at test time. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Prague, Czech Republic, 22–26 September 2013; pp. 387–402. [Google Scholar]
  23. Xie, C.; Zhang, Z.; Zhou, Y.; Bai, S.; Wang, J.; Ren, Z.; Yuille, A.L. Improving transferability of adversarial examples with input diversity. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2730–2739. [Google Scholar]
  24. Demontis, A.; Melis, M.; Pintor, M.; Matthew, J.; Biggio, B.; Alina, O.; Roli, F. Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks. In Proceedings of the 28th USENIX Security Symposium, Santa Clara, CA, USA, 14–16 August 2019; pp. 321–338. [Google Scholar]
  25. Kariyappa, S.; Prakash, A.; Qureshi, M.K. Maze: Data-free model stealing attack using zeroth-order gradient estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 18–24 June 2022; pp. 13814–13823. [Google Scholar]
  26. Wang, Y.; Li, J.; Liu, H.; Wang, Y.; Wu, Y.; Huang, F.; Ji, R. Black-box dissector: Towards erasing-based hard-label model stealing attack. In Proceedings of the 2021 European Conference on Computer Vision, Montreal, Canada, 11 October 2021; pp. 192–208. [Google Scholar]
  27. Yuan, X.; Ding, L.; Zhang, L.; Li, X.; Wu, D.O. ES attack: Model stealing against deep neural networks without data hurdles. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 1258–1270. [Google Scholar] [CrossRef]
  28. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  29. Zhao, Z.; Dua, D.; Singh, S. Generating Natural Adversarial Examples. arXiv 2017, arXiv:1710.11342. [Google Scholar]
  30. Zhou, M.; Wu, J.; Liu, Y.; Liu, S.; Zhu, C. Dast: Data-Free Substitute Training for Adversarial Attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 234–243. [Google Scholar]
  31. Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; McDaniel, P. Ensemble adversarial training: Attacks and defenses. arXiv 2017, arXiv:1705.07204. [Google Scholar]
  32. Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z.B.; Swami, A. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, Abu Dhabi, United Arab Emirates, 2–6 April 2017; pp. 506–519. [Google Scholar]
  33. Brendel, W.; Rauber, J.; Bethge, M. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv 2017, arXiv:1712.04248. [Google Scholar]
  34. Cheng, M.; Le, T.; Chen, P.Y.; Yi, J.; Zhang, H.; Hsieh, C.J. Query efficient hard-label black-box attack: An optimization based approach. arXiv 2018, arXiv:1807.04457. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
