1. Introduction
Deep neural networks (DNNs) have become the central approach in modern-day artificial intelligence (AI) research. They have attained superb performance in multifarious complex tasks and are behind fundamental breakthroughs in a variety of machine-learning tasks that were previously thought to be too difficult. Image classification, object detection, machine translation, and sentiment analysis are just a few examples of domains revolutionized by DNNs.
Despite their success, recent studies have shown that DNNs are vulnerable to adversarial attacks. A barely detectable change in an image, for example, can cause a misclassification in a well-trained DNN. Targeted adversarial examples can even evoke a misclassification of a specific class (e.g., misclassify a car as a cat). Researchers have demonstrated that adversarial attacks are successful in the real world and may be produced for data modalities beyond imaging, e.g., natural language and voice recognition [1,2,3,4]. DNNs’ vulnerability to adversarial attacks has raised concerns about applying these techniques to safety-critical applications.
To discover effective adversarial instances, most past work on adversarial attacks has employed gradient-based optimization [5,6,7,8,9]. Gradient computation can only be executed if the attacker is fully aware of the model architecture and weights. Thus, these approaches are only useful in a white-box scenario, where an attacker has complete access and control over a targeted DNN. Attacking real-world AI systems, however, might be far more arduous. The attacker must consider the difficulty of implementing adversarial instances in a black-box setting, in which no information about the network design, parameters, or training data is provided. In this situation, the attacker is exposed only to the classifier’s input-output pairs. In this context, a typical strategy has been to attack trained substitute networks and hope that the generated examples transfer to the target model [10]. The substantial mismatch between the substitute model and the target model, as well as the significant computational cost of training the substitute network, often renders this technique ineffective.
In our work we assume a real-world, black-box attack scenario, wherein a DNN’s input and output may be accessed but not its internal configuration. We focus on the scenario in which the targeted DNN is an image classifier—specifically, a convolutional neural network (CNN)—which accepts an image as input and outputs a probability score for each class.
Herein, we present an evolutionary, gradient-free optimization approach for generating adversarial instances. It is better suited to real-life scenarios, because one usually has no access to a model’s internals, including the gradients; thus, it is important to craft attacks that do not use gradients. Our proposed attack can deal with either constrained (an $\epsilon$ value bounds the norm of the allowed perturbation) or unconstrained (no constraint on the norm of the perturbation) problems, and focuses on constrained, untargeted attacks. We believe that our framework can be easily adapted to the targeted setting.
In the next section we review the literature on adversarial attacks.
Section 3 summarizes the threat model we assume for our proposed evolutionary attack algorithm. The algorithm itself—QuEry Attack (for Query-Efficient Evolutionary Attack)—is delineated in Section 4. The experiments conducted to test the method, along with results, are described in Section 5. We discuss our findings and present concluding remarks in Section 6.
Figure 1 shows examples of successful and unsuccessful instances of images generated by QuEry Attack, evaluated against ImageNet, CIFAR10, and MNIST.
2. Related Work
Adversarial attacks against DNNs have become an important research field in the last few years. For a comprehensive survey, we refer the reader to [11].
An important distinction is between ‘white-box’ and ‘black-box’ attacks. In white-box attacks, the attacker has knowledge of the attacked model’s internal structure and parameters, and exploits that knowledge. Such attacks are commonly performed using the model’s gradients [5,7,12].
Although white-box methods achieve good results, they usually do not represent a real-world scenario. More realistic is a black-box attack, where the attacker has no knowledge of the model’s structure and parameters. The attacker can only query the model—and act upon its outputs. We can further distinguish between the more common ‘light black-box’ attacks, where the model outputs the prediction’s confidence (which can be exploited), and ‘dark black-box’ attacks, where the model only outputs the final class prediction (e.g., [10]).
The first effective black-box attack traded a priori knowledge of the model for extensive runtime querying [10]. Using a large number of queries, it builds a substitute model and attacks it with traditional white-box methods. It exploits the transferability property, namely, that an attack which succeeds on one model will likely succeed on a similar—though not identical—model.
Other works used the more permissive ‘light black-box’ scenario, which can use the prediction’s confidence value. Some works estimate the gradient with this information and then use traditional gradient-based attacks [13].
All these black-box methods rely on gradients, and, thus, are sensitive to many defense methods that obscure gradients [14,15,16]. This has given rise to methods that do not rely on gradients at all, e.g., [17], which uses random search and is also query-efficient.
Instead of randomness, one can use evolutionary methods. In Evolutionary Algorithms (EAs), core concepts from evolutionary biology—inheritance, random variation, and selection—are harnessed in algorithms that are applied to complex computational problems. EA techniques have been shown to solve numerous difficult problems from widely diverse domains, and also to produce human-competitive machine intelligence [18]. EAs also present many important benefits over popular machine-learning methods, including [19]: less reliance on the existence of a known or discoverable gradient within the search space; ability to handle design problems, where the objective is to design new entities from scratch; fewer required a priori assumptions about the problem at hand; seamless integration of human expert knowledge; ability to solve problems where human expertise is very limited; support of interpretable solution representations; and support of multiple objectives.
The evolutionary method GenAttack is a targeted attack (thus not directly comparable to ours, which is untargeted) that uses a fitness function designed to increase the probability of the target class and decrease the probability of the other classes [20]. Its fitness function ignores the distance between the images. Interestingly, GenAttack uses fitness-proportionate selection, which is employed less often nowadays due to known problems. It uses an adaptive mutation rate to balance between exploration in early phases and exploitation in later phases.
Ref. [21] treated the adversarial problem as one of multi-objective optimization: minimize the class prediction’s score on the one hand, and the distance between the original image and the perturbed one on the other hand.
Another attack method changes a single pixel [22]. This method uses differential evolution (DE) [23] without crossover. However, it sometimes requires thousands of queries.
Ref. [24] also uses DE, but unlike the other evolutionary computation (EC) methods reviewed, it uses EC to approximate the gradients.
Unlike the above methods, which try to make the perturbation as small and as unnoticeable to the human eye as possible, [25] makes a small but noticeable change, which looks like a regular scratch (a similar approach in the domain of natural language processing creates sentences that do not make sense to a human reader [26]). The approach uses DE as well, and also Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [27]. Unlike most attacks, which use the $l_2$ or $l_\infty$ norms, this one is based on the $l_0$ norm.
In the EC methods seen so far, evolution is run against a single image, and each individual is a perturbation added to that image. Ref. [28], on the other hand, used the transferability property mentioned earlier to evolve a universal perturbation. An individual is an image mask that can be applied as a perturbation to any image.
Our extensive scrutiny of the literature and software repositories revealed that many authors compare their work to prior works that do not use the same threat model: there might be a mismatch in norms (e.g., $l_2$ vs. $l_\infty$), white box vs. black box, or other subtle differences. Moreover, having investigated numerous software repositories, we found that running the code of many papers is far from straightforward.
3. Threat Model
In the black-box attack setting, queries to the network are permitted but access to internal states is prohibited (e.g., executing backpropagation). Hence, our threat model, or scenario, is as follows:
- The attacker is unaware of the network’s design, settings, or training data.
- The attacker has no access to intermediate values in the target model.
- The attacker can only use a black-box function to query the target model.
Note that the above threat model determines the comparisons we perform, which focus on attacks that are:
- 1. Black-box,
- 2. Untargeted,
- 3. $l_\infty$-norm bounded.
We can consider a network model to be a function $f : \mathbb{R}^d \rightarrow \mathbb{R}^C$, where $d$ is the number of input features and $C$ is the number of classes. The $c$-th value $f_c(x)$ specifies the predicted score of classifying input image $x$ as class $c$. The classifier assigns class $\hat{y} = \arg\max_{c=1,\dots,C} f_c(x)$ to the input $x$.
A targeted attack aims to create an image that will be incorrectly classified into a given (incorrect) class. An untargeted attack aims to create an image that will be incorrectly classified into any class except the correct one. An image $x'$ is termed an adversarial example, with an $l_p$-norm bound of $\epsilon$, for $x$, if:
$$\arg\max_{c=1,\dots,C} f_c(x') \neq \hat{y} \quad \text{and} \quad \|x' - x\|_p \leq \epsilon .$$
To wit, the model should classify $x'$ incorrectly, while preserving similarity between $x$ and $x'$ under an $l_p$ distance metric.
We focus on a black-box, score-based attack, wherein the only information available from the target model is its raw output (logits).
Our suggested black-box approach may theoretically be used in conjunction with classic machine-learning models, with the same input–output relationship. Because DNNs have reached state-of-the-art performance in a variety of image tasks, we focus on them in this paper.
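To make this interface concrete, the following minimal Python/PyTorch sketch (our own illustration; the class and function names are hypothetical) captures the only access the attacker has: a scoring oracle plus an untargeted success check, with no gradients exposed.

```python
import torch

class BlackBoxModel:
    """Score-based black-box oracle: the attacker sees only input-output pairs."""

    def __init__(self, model: torch.nn.Module):
        self._model = model.eval()  # internals and gradients remain hidden from the attacker
        self.num_queries = 0        # every call counts against the query budget

    @torch.no_grad()
    def query(self, x: torch.Tensor) -> torch.Tensor:
        """Return raw class scores (logits) for a batch of images."""
        self.num_queries += x.shape[0]
        return self._model(x)

def is_adversarial(oracle: BlackBoxModel, x_adv: torch.Tensor, y_true: int) -> bool:
    """Untargeted success: the top-1 prediction differs from the true label."""
    logits = oracle.query(x_adv.unsqueeze(0))
    return int(logits.argmax(dim=1)) != y_true
```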
4. QuEry Attack
QuEry Attack is an evolutionary algorithm (EA) that explores a space of images, defined by a given input image and a given input model, in search of adversarial instances. It ultimately generates an attacking image for the given input image. Unlike white-box approaches, we make no assumptions about the targeted model, its architecture, dataset, or training procedure. We assume that we have an image $x$, which a black-box neural network, $f$, classifies by outputting a probability distribution over the set of classes, as stated in Section 3. The actual label $y$ is computed as $y = \arg\max_{c} f_c(x)$.
Our objective is to find a perturbed image, $x'$, of image $x$, such that $\|x' - x\|_\infty \leq \epsilon$, which causes the network to predict $y' = \arg\max_{c} f_c(x')$, such that $y' \neq y$. Finding $x'$ may be cast as a constrained optimization problem:
$$x' = \arg\min_{\|x' - x\|_\infty \leq \epsilon} \mathcal{L}(f(x'), y),$$
for a given loss function $\mathcal{L}$.
We use the loss $\mathcal{L}$ as the fitness function, defined in our case as:
$$\mathcal{L}(x') = f_y(x') - \max_{c \neq y} f_c(x') + \lambda \, \|x' - x\|_2 ,$$
where $x'$ is a perturbed image, $f_y(x')$ is the predicted score that $x'$ belongs to class $y$, and $\max_{c \neq y} f_c(x')$ is the highest predicted score that $x'$ belongs to a class $c \neq y$. In order to guarantee that the adversarial perturbation is as imperceptible as possible, we penalize the $l_2$ distortion of the perturbation by including a regularization component in the fitness function. We use $l_2$ regularization because we noticed that most of the evolved adversarial examples were on the edges of the $\epsilon$-ball, and we wanted to give precedence to examples that were closer to the original input. This penalization is determined by $\lambda$, the regularization strength, which we fixed across our experiments.
The ultimate goal is to minimize the fitness value: essentially, the lower the logit of the correct class, and the higher the maximum logit of the other classes, the better (lower) the fitness.
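As an illustration, a minimal NumPy sketch of this fitness computation might look as follows (our own helper, not reference code; lam stands for the regularization strength $\lambda$):

```python
import numpy as np

def fitness(logits: np.ndarray, y: int, x_adv: np.ndarray, x: np.ndarray, lam: float) -> float:
    """Lower is better: correct-class logit minus best other-class logit,
    plus an l2 penalty on the perturbation size."""
    other = np.delete(logits, y)        # logits of all classes except the true one
    margin = logits[y] - other.max()    # drops below 0 once the model is fooled
    l2_penalty = lam * np.linalg.norm((x_adv - x).ravel(), ord=2)
    return float(margin + l2_penalty)
```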
Algorithm 1 provides the pseudo-code of QuEry Attack. The original image $x$, along with a number of hyperparameters, is given as input to the algorithm. QuEry Attack generates an adversarial image $x'$, with the model classifying $x'$ as $y'$, such that $y' \neq y$ and $\|x' - x\|_\infty \leq \epsilon$.
The main goal of QuEry Attack is to produce a successful attack using as few queries to the model as possible. The maximal number of queries equals generation count ($G$) × population size ($N$).
4.1. Initialization
Initialization is crucial for optimization problems; e.g., in deep-learning training, gradient descent reaches a local minimum that is significantly determined by the initialization technique [29,30]. QuEry Attack generates an initial population of perturbed images by randomly selecting images from the edge of the sphere centered on the original image $x$ with radius $\epsilon$. This is accomplished by adding vertical stripes of width 1 along the image, with the color of each stripe sampled uniformly at random from $\{-\epsilon, +\epsilon\}$ per channel (i.e., the pixels of each stripe are shifted by either $-\epsilon$ or $+\epsilon$); in [17], they discovered that convolutional neural networks (CNNs) are especially vulnerable to such perturbations.
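A rough NumPy sketch of this initialization, under our reading that every width-1 column of every channel is shifted by $\pm\epsilon$ as in [17] (the function name and the [0, 1] pixel range are assumptions):

```python
import numpy as np

def stripes_init(x: np.ndarray, eps: float, rng: np.random.Generator) -> np.ndarray:
    """Start on the edge of the eps-ball: every width-1 vertical stripe (column)
    of every channel is shifted by either -eps or +eps, chosen at random."""
    h, w, c = x.shape                                    # image assumed as height x width x channels
    stripes = rng.choice([-eps, eps], size=(1, w, c))    # one value per column, per channel
    x_adv = x + np.broadcast_to(stripes, (h, w, c))
    x_adv = np.clip(x_adv, x - eps, x + eps)             # stay within the eps-ball
    return np.clip(x_adv, 0.0, 1.0)                      # assumed valid pixel range [0, 1]
```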
4.2. Mutation
Considering the use of (square-shaped) convolutional filters by convolutional neural networks, we used square-shaped perturbations. Specifically, we employed [17]’s technique for determining square size. Let $p$ be the proportion of elements to be perturbed for an image of shape $h \times w \times c$. The nearest positive integer to $\sqrt{p \cdot h \cdot w}$ determines the length of the square’s edge, with $p$ being a hyperparameter. We initialize $p$ to a fixed value, then halve it after {40, 200, 800, 4000, 8000, 16,000, 24,000, 32,000} queries, respectively (similar to [17]).
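The following NumPy sketch illustrates such a square mutation (our own illustration; the $\pm 2\epsilon$ value set for the square is an assumption carried over from [17], and the projection step assumes pixel values in [0, 1]):

```python
import numpy as np

def square_mutation(x_adv: np.ndarray, x: np.ndarray, eps: float, p: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Perturb one randomly placed square patch; p is the fraction of pixels to perturb."""
    h, w, c = x_adv.shape
    s = max(1, int(round(np.sqrt(p * h * w))))        # side length of the square
    r = int(rng.integers(0, h - s + 1))               # top-left corner, chosen uniformly
    t = int(rng.integers(0, w - s + 1))
    mutated = x_adv.copy()
    for i in range(c):                                # one constant shift per channel
        delta = rng.choice([-2 * eps, 2 * eps])       # assumed value set, following [17]
        mutated[r:r + s, t:t + s, i] += delta
    mutated = np.clip(mutated, x - eps, x + eps)      # project back onto the eps-ball around x
    return np.clip(mutated, 0.0, 1.0)
```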
4.3. Crossover
We experimented with both single-point and two-point crossover, eventually settling on the latter as it performed better. The operator works by flattening both (two-dimensional image) parents, randomly picking two indices, and then swapping the pixel segments between the two chosen indices.
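A minimal sketch of this operator on flattened images (illustrative names; not the authors’ reference code):

```python
import numpy as np

def two_point_crossover(parent1: np.ndarray, parent2: np.ndarray,
                        rng: np.random.Generator) -> tuple:
    """Swap the pixel segment lying between two randomly chosen indices."""
    shape = parent1.shape
    flat1, flat2 = parent1.ravel().copy(), parent2.ravel().copy()
    i, j = sorted(rng.choice(flat1.size, size=2, replace=False))
    flat1[i:j], flat2[i:j] = flat2[i:j].copy(), flat1[i:j].copy()
    return flat1.reshape(shape), flat2.reshape(shape)
```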
The EA then proceeds by evaluating the fitness of each individual, selecting parents, and performing crossover and mutation to generate the next generation. The latter is obtained by copying one elite individual from the current generation, with all other next-generation individuals derived through selection, crossover, and mutation. The process is repeated until a successful perturbation is found or until the termination condition is met.
A major advantage of QuEry Attack is its amenability to parallelization—due to being evolutionary—in contrast to most other adversarial, iterative (non-evolutionary) attacks in this field.
Algorithm 1 QuEry Attack

Input: original image x, true label y, black-box model f, perturbation bound ε, population size N, maximal generation count G, tournament size T
Output: adversarial image x′

# Main loop
1:  gen ← 0
2:  pop ← INIT()
3:  while not TERMINATION_CONDITION(pop, gen) do
4:      for x′ ∈ pop do
5:          compute fitness of x′ using Equation (4)
6:      new_pop ← ∅
7:      elite ← ELITISM(pop)
8:      add elite to new_pop
9:      for i ← 1 to (N − 1)/2 do
10:         parent1 ← SELECTION(pop)
11:         parent2 ← SELECTION(pop)
12:         o1, o2 ← CROSSOVER(parent1, parent2)
13:         o1 ← SQUARE_MUTATION(o1)
14:         o2 ← SQUARE_MUTATION(o2)
15:         add o1, o2 to new_pop
16:     pop ← new_pop
17:     gen ← gen + 1
18: end while
19: return best x′ from pop    # QuEry Attack’s final output

20: function INIT()
21:     pop ← ∅
22:     for i ← 1 to N do
23:         x′_i ← STRIPES_INIT(x)
24:         add x′_i to pop
25:     return pop

26: function ELITISM(pop)
27:     return best x′ from pop

28: function TERMINATION_CONDITION(pop, gen)
29:     if gen ≥ G then
30:         return true
31:     for x′ ∈ pop do
32:         y′ ← predicted label of x′
33:         if y′ ≠ y then
34:             return true
35:     return false

36: function SELECTION(pop)
37:     tour ← randomly and uniformly pick T individuals from pop
38:     return best x′ from tour

39: function STRIPES_INIT(x)
40:     for i ← 1 to c do    # c is the image’s number of channels; x′ is initialized to x
41:         stripe ← create a vertical stripe of width 1, randomly positioned, with random values
42:         x′ ← x′ + stripe
43:     x′ ← clip_ε(x′)    # clip_ε: clipping operator to ensure pixel values are within the ε-ball
44:     return x′

45: function SQUARE_MUTATION(x′)
46:     c ← number of channels
47:     f ← number of features (h × w)    # h: height, w: width
48:     s ← nearest positive integer to sqrt(p · f)    # side length of the square
49:     δ ← array of ones of size s × s × c
50:     r, t ← position of an s × s square within the image    # randomly and uniformly selected
51:     for i ← 1 to c do
52:         ρ ← value sampled uniformly at random from the given perturbation set
53:         δ[:, :, i] ← ρ · δ[:, :, i]
54:     x′[r:r+s, t:t+s, :] ← x′[r:r+s, t:t+s, :] + δ
55:     x′ ← clip_ε(x′)
56:     return x′

57: function CROSSOVER(x′1, x′2)
58:     flatten x′1 and x′2
59:     perform standard two-point crossover (as explained in the text), creating o1, o2
60:     reshape o1, o2 back to image shape
61:     return o1, o2
5. Experiments and Results
To evaluate QuEry Attack we set out to collect commonly used algorithms for comparative purposes. A somewhat disconcerting reality we then encountered involved our struggle to find good benchmarks and software for comparison purposes. Sadly, we found ourselves wasting many a day (which, alas, turned into weeks) trying to run buggy software, chasing down broken links, issuing GitHub issues, and so forth. Perhaps this is due in part to the field of adversarial attacks being young.
We evaluated QuEry Attack by executing experiments against three different pretrained image-classification models, taken from PyTorch [31]—Inception-v3 [32], ResNet-50 [33], and VGG-16-BN [34]—over three image datasets: ImageNet, CIFAR-10, and MNIST. We employed 200 randomly picked and correctly classified images from the test sets. For ImageNet, Inception-v3 has an accuracy of 78.8%, ResNet-50 has an accuracy of 76.1%, and VGG-16-BN has an accuracy of 73.3% (these are top-1 accuracy values; for ImageNet, top-5 accuracy values are also sometimes given, which in our case are: Inception-v3—94.4%, ResNet-50—92.8%, VGG-16-BN—91.5%). CIFAR-10 accuracy values are: Inception-v3—93.7%, ResNet-50—93.6%, and VGG-16-BN—94.0%. For the MNIST dataset we trained a convolutional neural network (CNN), whose architecture is delineated in Table 2, which attained 98.9% accuracy.
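For reference, the pretrained ImageNet models can be obtained directly from torchvision; a minimal sketch (newer torchvision versions prefer the weights= argument over pretrained=True):

```python
import torchvision.models as models

# Pretrained ImageNet classifiers used as attack targets (evaluation mode; no gradients are needed)
inception = models.inception_v3(pretrained=True).eval()
resnet = models.resnet50(pretrained=True).eval()
vgg = models.vgg16_bn(pretrained=True).eval()
```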
We chose to use the above models since they are among the most commonly used CNN architectures. QuEry Attack exploits the nature of convolution layers by using square mutations and stripes initialization, which were shown to be very effective [17]. Further, these architectures serve as a backbone for many other downstream tasks, such as object detection [35], semantic segmentation [36], and image captioning [37]. Recently, vision transformers have beaten CNNs in image-classification tasks [38], but they require huge amounts of data and resources that are not available to many (including us). Future work will be dedicated to expanding the attack to vision transformers.
All accuracy values are over test images. We used the Adversarial Robustness Toolbox (ART) [39] to evaluate QuEry Attack against other attacks. We restricted all attacking algorithms to a maximum of 42K queries to the model for MNIST and CIFAR10, and 84K queries for ImageNet. A query refers to a prediction supplied by the model for a given image. To make the most of the computational resources we had available, we prioritized actual, experimental runs over hyperparameter runs, so hyperparameters were chosen through limited trial and error. In the future, we plan to perform a more thorough hyperparameter sweep using Optuna [40]. The only hyperparameters we set were the population size $N$ and the tournament size $T$; these were used for all experiments reported herein. The number of generations ($G$) was derived from the query budget.
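The baseline attacks in ART operate on an estimator wrapper around the target model; a hedged sketch of such a wrapper is shown below (argument values are illustrative, not necessarily the exact settings used in our experiments):

```python
import torch
import torchvision.models as models
from art.estimators.classification import PyTorchClassifier

model = models.resnet50(pretrained=True).eval()
classifier = PyTorchClassifier(
    model=model,
    loss=torch.nn.CrossEntropyLoss(),
    input_shape=(3, 224, 224),   # illustrative ImageNet input shape
    nb_classes=1000,
    clip_values=(0.0, 1.0),
)
# classifier.predict(images) now serves as the score-returning black-box function
```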
AdversarialPSO [41] results were obtained by running the code in the GitHub repository referred to in their paper. Due to technical difficulties, it was run against the original models that this attack was designed for, namely, Inception-v3 for ImageNet, and the authors’ own trained networks for CIFAR-10 and MNIST. We duplicated these results in the table for all models.
5.1. Attacking Defenses
We show how QuEry Attack breaks a collection of defense strategies designed to boost the robustness of models against adversarial attacks.
5.1.1. Attacking Non-Differentiable Transformations
Gradient masking is achieved via non-differentiable input transformations, which rely on manipulating gradients to defeat gradient-based attackers [42,43]. Further, randomized transformations make it more difficult for the attacker to be certain of success. It is possible to foil such a defense by altering the defense module that performs gradient masking, but this is not an option within the black-box scenario. Herein, we investigated three non-differentiable transformations against QuEry Attack: JPEG compression, bit-depth reduction (also known as feature squeezing), and spatial smoothing. We show that QuEry Attack can defeat these input modifications, due to its gradient-free nature.
JPEG compression [44] tries to generate patterns in color values to minimize the amount of data that has to be captured, resulting in a smaller file size. Some color values are estimated to match those of surrounding pixels in order to produce these patterns. This compression means that slight imperfections in the quality of the image will not be as noticeable. The degree of compression may be tweaked, providing a customizable trade-off between image quality and storage space. An example of the different compression degrees is shown in Figure 2. The results in Table 3 were evaluated with a fixed image-quality setting.
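To make the defended pipeline concrete, here is a hedged sketch (using Pillow; the quality value is illustrative, not the setting used in our experiments) of JPEG compression applied as a preprocessing step in front of the black-box scoring function:

```python
import io
import numpy as np
from PIL import Image

def jpeg_compress(x: np.ndarray, quality: int = 75) -> np.ndarray:
    """Non-differentiable input transformation: re-encode the image as JPEG.
    Assumes x is an RGB image with values in [0, 1]; quality=75 is illustrative."""
    img = Image.fromarray((x * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0

def defended_query(model_query, x: np.ndarray) -> np.ndarray:
    """From the attacker's side nothing changes: only the scores of the
    transformed input are observed."""
    return model_query(jpeg_compress(x))
```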
Bit-depth reduction [45] can be done both by reducing the color depth of each pixel in an image and by using spatial smoothing to smooth out individual pixel discrepancies. By merging samples that correspond to many different feature vectors in the original space into a single sample, bit-depth reduction decreases the search space accessible to an adversary. An example of different bit-depth values is shown in Figure 3. The results in Table 3 were evaluated with a fixed bit depth.
The term “spatial smoothing” refers to the averaging of data points with their neighbors [46]. This has the effect of a low-pass filter, with high frequencies of the signal being eliminated from the data while low frequencies are enhanced. As a result, an image’s crisp “edges” are softened, and spatial correlation within the data becomes more prominent, as shown in Figure 4. Data averaging is determined according to a given window size. The results in Table 3 were evaluated with a fixed window size.
Our results are delineated in Table 3. For this experiment we used a total budget of 82K queries to the model, which was Inception-v3. For each given image, we first checked that it was correctly classified after applying the defense to the image; only then did we apply QuEry Attack to it. The different input values for the transformations were chosen such that applying them would not be destructive.
For these experiments we used the same query budget as in the previous experiments. For both CIFAR-10 and ImageNet, QuEry Attack has a high success rate against all non-differentiable transformations.
5.1.2. Attacking Robust Models
A model is considered robust when, even if some of the input variables are largely perturbed, it still makes correct predictions. Recently, several techniques have been proposed to render models more robust to adversarial attacks. One commonly used technique to improve model robustness is adversarial training, which integrates adversarial inputs—generated with other trained models—into the models’ training data. Adversarial training has proven to be one of the most successful defense mechanisms against adversarial attacks [47,48,49,50].
We conducted an experiment to see how well QuEry Attack performs on robust models. For CIFAR10 we used the robust model WideResNet-70-16 [51], wherein generative models trained only on the original training set were used to enhance adversarial robustness to $l_\infty$ norm-bounded perturbations. For ImageNet we used WideResNet-50-2 [52], which is a variant of ResNet wherein the depth of the network is decreased and the width of the network is increased. This is achieved through the use of wide residual blocks. Both of these top models were taken from the RobustBench repository [53]. We used the same 200 randomly selected images from our previous experiments, a budget of 84K queries for CIFAR10, and a budget of 126K queries for ImageNet. As seen in Table 4, QuEry Attack succeeds at breaking these strongly defended models.
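The robust models can be fetched through the RobustBench API; a sketch follows (the model_name string is a placeholder to be replaced with the exact leaderboard identifier of the corresponding model):

```python
from robustbench.utils import load_model

# The model_name below is a placeholder; the exact identifier should be taken
# from the corresponding RobustBench leaderboard entry.
robust_model = load_model(
    model_name="<leaderboard-entry-name>",
    dataset="cifar10",
    threat_model="Linf",
).eval()
```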
5.2. Transferability
An adversarial example for one model can often serve as an adversarial example for another model, even if the two models were trained on different datasets, using different techniques; this is known as transferability [54]. White-box attacks may overfit on the source model, as evidenced by the fact that black-box success rates for an attack are almost always lower than those of white-box attacks [7,13,55,56]. Herein, we checked the transferability of our proposed black-box attack on 200 ImageNet images correctly classified by both the source model and the target model, using different $\epsilon$ values. The results, summarized in Table 5, show a positive correlation between the $\epsilon$ values and the transferability success rate. We noted that attacks transfer better from ResNet-50 to VGG-16-BN models and surmise this is due to the fact that ResNet models were largely inspired by the philosophy of VGG models, wherein relatively small ($3 \times 3$) convolutional layers are used.
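A transferability check is straightforward to express; the sketch below (illustrative PyTorch code, our own helper) computes how often adversarial examples crafted against a source model also fool a target model:

```python
import torch

@torch.no_grad()
def transfer_success_rate(target: torch.nn.Module, x_adv: torch.Tensor,
                          y_true: torch.Tensor) -> float:
    """Fraction of adversarial examples (crafted against a source model) that
    also fool the given target model."""
    preds = target(x_adv).argmax(dim=1)
    return (preds != y_true).float().mean().item()
```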
6. Discussion and Concluding Remarks
We presented an evolutionary, score-based, black-box attack, showing its superiority in terms of ASR (attack success rate) and number of queries over previously published work. QuEry Attack is a strong and fast attack that employs a gradient-free optimization strategy. We tested QuEry Attack against MNIST, CIFAR10, and ImageNet models, comparing it to other commonly used algorithms. We evaluated QuEry Attack’s performance against non-differentiable transformations and robust models, and it proved to succeed in both scenarios.
As noted, we discovered that the software scene in adversarial attacks is a tad bit muddy. We encourage researchers to place executable code on public repositories—code that can be used with ease. Furthermore, we feel that the field lacks standard means of measuring and comparing results. We encourage the community to establish common baselines for these purposes.
We came to realize the importance of a strong initialization procedure. Although this is true of many optimization algorithms, it seems doubly so where adversarial optimization is concerned. Table 1 shows that successful attacks are sometimes found during initialization—the vertical-stripes initialization in particular proved highly potent—and even when they are not, the number of queries (and generations) is significantly curtailed.
Figure 5 and Figure 6 show that adversarial examples are barely distinguishable to the human eye. Clearly, neural networks function quite differently than humans, capturing entirely different features. More work is needed to create networks that are robust in a human sense.
We think that evolutionary algorithms are well suited to this kind of optimization problem, and our findings imply that evolution is a promising research avenue for developing gradient-free black-box attacks. Furthermore, evolution needs to be evaluated against a fully black-box model.
Evolution may also be a solution for rendering models more robust. In [57] it was shown that combining different activation functions can increase model accuracy; this approach might also be used for obtaining robustness.