Probabilistic Jacobian-based Saliency Maps Attacks

Machine learning models have achieved spectacular performance in various critical fields including intelligent monitoring, autonomous driving and malware detection. Robustness against adversarial attacks is therefore a key requirement for trusting these models. In particular, the Jacobian-based Saliency Map Attack (JSMA) is widely used to fool neural network classifiers. In this paper, we introduce Weighted JSMA (WJSMA) and Taylor JSMA (TJSMA), simple, faster and more efficient versions of JSMA. These attacks rely on new saliency maps involving the neural network Jacobian, its output probabilities and the input features. We demonstrate the advantages of WJSMA and TJSMA through two computer vision applications: 1) LeNet-5, a well-known neural network classifier (NNC), on the MNIST database and 2) a more challenging NNC on the CIFAR-10 dataset. We find that WJSMA and TJSMA significantly outperform JSMA in success rate, speed and average number of changed features. For instance, on LeNet-5 (with $100\%$ and $99.49\%$ accuracies on the training and test sets), WJSMA and TJSMA respectively exceed $97\%$ and $98.60\%$ in success rate for a maximum authorised distortion of $14.5\%$, outperforming JSMA by more than $9.5$ and $11$ percentage points. The new attacks are then used for defense, yielding models that are more robust than those trained against JSMA. Like JSMA, our attacks do not scale to large datasets such as IMAGENET; they nevertheless remain attractive for relatively small datasets like MNIST and CIFAR-10 and may be useful tools for future applications.


Introduction
Deep learning classifiers are used in a wide variety of situations, such as vision, speech recognition, financial fraud detection, malware detection, autonomous driving, defence, and more.
The ubiquity of deep learning algorithms in many applications, especially those that are critical such as autonomous driving [4,20] or that pertain to security and privacy [17,21], makes the study of attacks against them particularly useful. Indeed, it allows one first to identify possible flaws in the learned system and second to set up a defense strategy to improve its reliability.
Adversarial attacks are built upon the idea of adversarial samples. Given a classifier $N$ and an input $x$ with label $l$, an adversarial sample to $x$ is an input $x^*$ close to $x$ such that $\mathrm{label}(x^*) \neq l$. These attacks can be separated into two types, targeted and non-targeted, depending on whether $\mathrm{label}(x^*)$ is specified in advance or not.
In this paper, we focus on JSMA, a simple, reliable and intuitive targeted adversarial attack against machine learning classifiers. Although it does not scale to large datasets like IMAGENET [3], JSMA is still relevant on small datasets such as MNIST [10], CIFAR-10 [7] and Fashion-MNIST [25], on which it achieves good results [15,6,24]. Relying on its cleverhans implementation [16], JSMA is able to generate 9 adversarial samples on MNIST in only 2 seconds on a laptop with 2 CPU cores. The combination of good performance and speed makes JSMA attractive, although it is less efficient than the CW attack, which is 20 times slower [1]. In multiple other applications in cybersecurity, anomaly detection, intrusion detection and Reinforcement Learning involving small data, JSMA may be preferred over many approaches [18,2,19,11].
Before explaining our contribution, let us introduce some definitions and recall the principle of JSMA.
Neural network classifier (NNC). The goal of a NNC is to predict through a neural network which class an item $x$ belongs to, among a family of $K$ possible classes. It outputs a vector of probabilities $p(x) = (p_1(x), \dots, p_K(x))$ from which the label of $x$ is deduced as follows: $\mathrm{label}(x) = \operatorname{argmax}_k p_k(x)$.
Jacobian-based Saliency Map Attack (JSMA). To fool NNCs, this attack relies on the Jacobian matrix of outputs with respect to inputs. By analysing this matrix, one can deduce how the output probabilities behave given a slight modification of an input feature. Consider a NNC $N$ as before and denote by $F(x) = (F_1(x), \dots, F_K(x))$ the outputs of the second-to-last layer of $N$ (no longer probabilities, but related to the final output by applying a softmax layer). To craft an adversarial example from a given input $x$ with target label $t$, JSMA first computes the gradient $\nabla F(x)$. The next step is constructing a saliency map whose role is to select the most relevant component $i$ to perturb:

$$S[x,t][i] = \begin{cases} 0 & \text{if } \dfrac{\partial F_t}{\partial x_i}(x) < 0 \text{ or } \displaystyle\sum_{k \neq t} \dfrac{\partial F_k}{\partial x_i}(x) > 0, \\[2ex] \dfrac{\partial F_t}{\partial x_i}(x) \, \left| \displaystyle\sum_{k \neq t} \dfrac{\partial F_k}{\partial x_i}(x) \right| & \text{otherwise.} \end{cases} \tag{1}$$

Working with the $F_k$'s instead of the probabilities $p_k$ has been justified in [15] by the extreme variations introduced by the logistic regression. Then the algorithm selects the component $i_{\max} = \operatorname{argmax}_i S[x,t][i]$ and augments $x_{i_{\max}}$ with a default increase value $\theta$: $x_{i_{\max}} \leftarrow x_{i_{\max}} + \theta$, clipped to the domain of feature values. In a more advanced form, JSMA selects pairs of components $(i_{\max}, j_{\max})$ using doubly indexed saliency maps recalled later in the paper.
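As an illustration, the one-component selection step described above can be sketched in a few lines of NumPy. This is a minimal sketch and not the cleverhans implementation: the function names `jsma_saliency` and `jsma_step`, the toy Jacobian and the default domain $[0, 1]$ are assumptions made for the example. A feature receives a non-zero score only when increasing it raises $F_t$ and lowers the sum of the other $F_k$.

```python
import numpy as np

def jsma_saliency(jacobian, target):
    """One-component JSMA saliency map.

    jacobian: array of shape (K, n); entry [k, i] is dF_k/dx_i at the current input.
    target:   index t of the targeted class.
    """
    grad_target = jacobian[target]                    # dF_t/dx_i
    grad_others = jacobian.sum(axis=0) - grad_target  # sum over k != t of dF_k/dx_i
    # Admissible features raise F_t and lower the other outputs.
    admissible = (grad_target > 0) & (grad_others < 0)
    return np.where(admissible, grad_target * np.abs(grad_others), 0.0)

def jsma_step(x, jacobian, target, theta=1.0, x_min=0.0, x_max=1.0):
    """Increase the most salient feature by theta, clipped to [x_min, x_max]."""
    scores = jsma_saliency(jacobian, target)
    i_max = int(np.argmax(scores))  # falls back to index 0 if every score is zero
    x_new = x.copy()
    x_new[i_max] = np.clip(x_new[i_max] + theta, x_min, x_max)
    return x_new, i_max

# Toy Jacobian for K = 3 classes and n = 4 features (made-up numbers).
J = np.array([[ 0.5, -0.2,  0.1,  0.3],
              [-0.4,  0.6,  0.2, -0.1],
              [-0.3, -0.5, -0.4,  0.2]])
x = np.array([0.2, 0.1, 0.9, 0.0])
scores = jsma_saliency(J, target=1)
x_adv, i_max = jsma_step(x, J, target=1)
```

In a full attack the Jacobian would be recomputed by the network after every step; here it is fixed only to keep the example self-contained.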
Contributions. We introduce two new adversarial attacks: (1) Weighted JSMA (WJSMA): This attack follows the mechanism of JSMA but "rectifies" it by weighting gradients by the respective probabilities of classes. The advantage of this fine-tuning is to reduce the impact of gradients associated with small output probabilities. (2) Taylor JSMA (TJSMA): It takes into account the output probabilities as WJSMA does and additionally penalises the gradients by $\theta_{\max} - x_i$, where $x_i$ is the value of the input feature under consideration, to encourage the selection of input features that are not close to $\theta_{\max}$.
We give justifications of WJSMA and TJSMA and experimentally demonstrate that they give significantly better results than JSMA. We consider two illustrations, targeting the LeNet-5 [9] model on MNIST and a variant of All Convolutional Net [22] on CIFAR-10. Figures 1 and 2 show examples of targeted adversarial samples generated by the three attacks JSMA, WJSMA and TJSMA from an MNIST 0 image and a CIFAR-10 car image, respectively. At first glance, samples provided by WJSMA and TJSMA look less noisy and closer to the original images than those generated by JSMA.
In addition to attacks, we present an application to defense. It essentially demonstrates that defending against WJSMA or TJSMA makes the NNC more robust against JSMA while defending against JSMA has less impact on the performances of WJSMA and TJSMA.

Weighted Jacobian-based Saliency Map Attack (WJSMA)
This section presents WJSMA, the first contribution of the paper, its motivation and mathematical argumentation.
Motivating example. Assume a number of classes $K \geq 4$ and, for some input $x$, $p_1(x) = 0.5$, $p_2(x) = 0.49$, $p_3(x) = 0.01$ and $p_k(x) = 0$ for all $4 \leq k \leq K$. Consider the problem of generating an adversarial sample to $x$ with target label $t = 2$. In order to decrease $\sum_{k \neq 2} F_k(x)$, the first iteration step of JSMA relies on the gradients $\nabla F_k(x)$, $k \neq 2$. The main observation is that, as the probabilities $p_k(x) = 0$ for $4 \leq k \leq K$ are already at their minimal values, considering $\nabla F_k(x)$ for these values of $k$ in the search for $i_{\max}$ is unnecessary. In other words, by acting only on gradients of the second-to-last layer, JSMA does not take into account the crucial constraints on probabilities: $p_k(x) \geq 0$. Moreover, the possible decrease for $p_1(x)$ is high (up to 0.5) whereas $p_3(x)$, being already small, is hard to decrease further. In this situation, intuitively, instead of relying equally on $\nabla F_1(x)$ and $\nabla F_3(x)$, one would "bet more" on $\nabla F_1(x)$ than on $\nabla F_3(x)$.
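The motivating example can be made concrete with a small numerical sketch. The probabilities are those of the example with $K = 5$; the gradient values are made up for illustration, and weighting each gradient by its class probability is used here only to show the effect discussed above. Feature 0 mostly lowers $F_4$ and $F_5$, whose probabilities are already zero, while feature 1 lowers $F_1$, whose probability can still drop by up to 0.5.

```python
import numpy as np

# Probabilities of the motivating example with K = 5; target t = 2 is index 1.
p = np.array([0.5, 0.49, 0.01, 0.0, 0.0])
t = 1

# Made-up gradients dF_k/dx_i for two candidate features (columns 0 and 1).
grads = np.array([[-0.1, -1.0],   # class 1
                  [ 1.0,  1.0],   # class 2 (target)
                  [ 0.0,  0.0],   # class 3
                  [-1.0,  0.0],   # class 4
                  [-1.0,  0.0]])  # class 5

others = np.arange(len(p)) != t
unweighted = grads[others].sum(axis=0)                    # sum used by JSMA
weighted = (p[others, None] * grads[others]).sum(axis=0)  # probability-weighted sum

# Both features are admissible here (target gradient > 0, other sums < 0),
# so the scores reduce to target gradient times |sum over other classes|.
jsma_choice = int(np.argmax(grads[t] * np.abs(unweighted)))
weighted_choice = int(np.argmax(grads[t] * np.abs(weighted)))
```

With these numbers the unweighted score prefers feature 0, whereas the probability-weighted score prefers feature 1, which is the component one would intuitively "bet" on.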
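To address the previous issue, WJSMA relies on new saliency maps derived quite naturally from the classical log softmax reasoning. First, we compute the derivative:

$$\frac{\partial p_t}{\partial x_i}(x) = p_t(x) \frac{\partial \log p_t}{\partial x_i}(x) = \underbrace{p_t(x)\,\bigl(1 - p_t(x)\bigr)\, \frac{\partial F_t}{\partial x_i}(x)}_{A} \; - \; \underbrace{p_t(x) \sum_{k \neq t} p_k(x)\, \frac{\partial F_k}{\partial x_i}(x)}_{B},$$

with $t$ standing for the targeted class. This formula is separated as $A - B$, where $A$ only depends on the targeted class and $B$ depends on the other classes. To maximise this quantity, one can consider maximising $A$ and minimising $B$ independently by imposing the constraints $A > 0$ and $B < 0$. Note that, unlike JSMA, these constraints ensure that $\frac{\partial p_t}{\partial x_i}(x)$ remains positive. This allows us to introduce weighted saliency maps depending on one component as follows:

$$S_W[x,t][i] = \begin{cases} 0 & \text{if } \dfrac{\partial F_t}{\partial x_i}(x) < 0 \text{ or } \displaystyle\sum_{k \neq t} p_k(x) \dfrac{\partial F_k}{\partial x_i}(x) > 0, \\[2ex] \dfrac{\partial F_t}{\partial x_i}(x) \, \left| \displaystyle\sum_{k \neq t} p_k(x) \dfrac{\partial F_k}{\partial x_i}(x) \right| & \text{otherwise.} \end{cases}$$

Based on these maps, we present Algorithm 1, the first version of WJSMA that generates targeted adversarial samples.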
When the output $x^*$ of Algorithm 1 satisfies $\mathrm{label}(x^*) = t$, the attack is considered successful.
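The weighting idea can be checked numerically on a toy model. The sketch below assumes linear logits $F(x) = Wx$ (so that the Jacobian is simply $W$), a hypothetical helper `weighted_saliency`, and random made-up weights. It relies on the standard softmax identity $\frac{\partial p_t}{\partial x_i} = p_t \bigl( \frac{\partial F_t}{\partial x_i} - \sum_k p_k \frac{\partial F_k}{\partial x_i} \bigr)$, compares it with finite differences, and verifies that every feature with a positive weighted score does increase $p_t$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def weighted_saliency(jacobian, probs, target):
    """One-component weighted saliency map (gradients weighted by class probabilities)."""
    grad_target = jacobian[target]
    weighted_all = (probs[:, None] * jacobian).sum(axis=0)
    weighted_others = weighted_all - probs[target] * grad_target  # sum over k != t
    admissible = (grad_target > 0) & (weighted_others < 0)
    return np.where(admissible, grad_target * np.abs(weighted_others), 0.0)

# Toy linear model F(x) = W x with made-up random weights.
rng = np.random.default_rng(0)
K, n, t = 4, 6, 2
W = rng.normal(size=(K, n))
x = rng.uniform(size=n)
p = softmax(W @ x)
scores = weighted_saliency(W, p, t)

# Analytic derivative of p_t with respect to each feature (softmax identity).
dp_t = p[t] * (W[t] - p @ W)

# Forward finite differences as an independent check.
eps = 1e-6
fd = np.array([(softmax(W @ (x + eps * np.eye(n)[i]))[t] - p[t]) / eps
               for i in range(n)])
```

The two constraints are sufficient but not necessary for $\frac{\partial p_t}{\partial x_i} > 0$: a feature with a zero score may still increase $p_t$, it is simply not selected.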
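To relax the search for relevant components, and motivated by an application to computer vision, Papernot et al. [15] introduced saliency maps indexed by pairs of components. Their main observation is that the conditions required in (1) may be too severe for some applications, so that very few components satisfy them. By replicating the same one-component WJSMA reasoning, we introduce weighted versions of doubly indexed saliency maps $S_W[x,t][i,j]$ as follows:

$$S_W[x,t][i,j] = \begin{cases} 0 & \text{if } \displaystyle\sum_{a \in \{i,j\}} \dfrac{\partial F_t}{\partial x_a}(x) < 0 \text{ or } \displaystyle\sum_{a \in \{i,j\}} \sum_{k \neq t} p_k(x) \dfrac{\partial F_k}{\partial x_a}(x) > 0, \\[2ex] \left( \displaystyle\sum_{a \in \{i,j\}} \dfrac{\partial F_t}{\partial x_a}(x) \right) \left| \displaystyle\sum_{a \in \{i,j\}} \sum_{k \neq t} p_k(x) \dfrac{\partial F_k}{\partial x_a}(x) \right| & \text{otherwise.} \end{cases}$$

Based on these maps, we present Algorithm 2, the second version of WJSMA that generates targeted adversarial samples by operating on pairs of components.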
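A pairwise map of this kind can be computed for all pairs at once with NumPy broadcasting. The following is a minimal sketch under stated assumptions: the helper names `pairwise_weighted_saliency` and `best_pair` and the toy numbers are illustrative, and the map is taken to be the pairwise analogue of the one-component weighted map (sums over the two components replace single gradients). It also shows why pairs relax the search: feature 1 is not admissible on its own, yet the pair (0, 1) receives a positive score.

```python
import numpy as np

def pairwise_weighted_saliency(jacobian, probs, target):
    """Weighted saliency scores for every pair of features, as an (n, n) array."""
    grad_target = jacobian[target]                                   # shape (n,)
    weighted_others = ((probs[:, None] * jacobian).sum(axis=0)
                       - probs[target] * grad_target)                # sum over k != t
    alpha = grad_target[:, None] + grad_target[None, :]              # target term per pair
    beta = weighted_others[:, None] + weighted_others[None, :]       # other-classes term per pair
    scores = np.where((alpha > 0) & (beta < 0), alpha * np.abs(beta), 0.0)
    np.fill_diagonal(scores, 0.0)  # a pair must contain two distinct features
    return scores

def best_pair(scores):
    """Indices (i, j) of the highest-scoring pair."""
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j)

# Toy example: K = 3 classes, n = 3 features, target class index 1 (made-up numbers).
J = np.array([[-1.0,  0.5, -0.2],
              [ 0.8, -0.1,  0.3],
              [ 0.1, -0.6,  0.0]])
p = np.array([0.6, 0.3, 0.1])
pair_scores = pairwise_weighted_saliency(J, p, target=1)
i_max, j_max = best_pair(pair_scores)
```

The score array has $n^2$ entries, which is one reason why attacks of this family do not scale to inputs with very many features.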
In the two previous algorithms, the selected components are always augmented by a positive default value, i.e. features are increased. It is possible to deduce two versions of Algorithms 1 and 2 where relevant components are selected and then decreased according to a similar logic.

Taylor Jacobian-based Saliency Map Attack (TJSMA)
This section presents Taylor JSMA, the second contribution of this paper. The idea of this attack is to additionally penalise the choice of feature components that are close to the maximum feature value $\theta_{\max}$ and to favour components that are more distant from $\theta_{\max}$. As a motivating situation, assume two components $i$ and $j$ have the same WJSMA score and that $x_i$ is very close to $\theta_{\max}$, while $x_j$ is far enough from $\theta_{\max}$. In this case, searching for more impact, our saliency maps prefer $x_j$ over $x_i$. Concretely, we consider maximising two scores and, accordingly, introduce new saliency maps $S_T$ for one- and two-component attacks. We call these maps Taylor saliency maps because of the Taylor terms $(\theta_{\max} - x_a) \frac{\partial F_k}{\partial x_a}(x)$ they involve. One- and two-component TJSMA follow exactly Algorithms 1 and 2 with $S_W$ replaced by $S_T$.

Through Figures 3a and 3b, we observe that WJSMA and TJSMA decrease the predicted probability of the original class and increase that of the targeted class much sooner than JSMA. In this example, it is worth noting how TJSMA behaves like WJSMA until it is able to find a more vulnerable component that makes it converge much faster.
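The preference for features with more room to grow can be illustrated as follows. This is only one possible reading and not necessarily the exact maps of the paper: as an assumption, every gradient $\frac{\partial F_k}{\partial x_i}$ is replaced by the Taylor term $(\theta_{\max} - x_i) \frac{\partial F_k}{\partial x_i}$ before a probability-weighted one-component map is applied. The helper names and the toy numbers are hypothetical.

```python
import numpy as np

def weighted_saliency(jacobian, probs, target):
    """One-component weighted saliency map (gradients weighted by class probabilities)."""
    grad_target = jacobian[target]
    weighted_others = ((probs[:, None] * jacobian).sum(axis=0)
                       - probs[target] * grad_target)
    admissible = (grad_target > 0) & (weighted_others < 0)
    return np.where(admissible, grad_target * np.abs(weighted_others), 0.0)

def taylor_saliency(jacobian, probs, target, x, theta_max=1.0):
    """ASSUMPTION: scale each feature's gradients by its headroom theta_max - x_i,
    then apply the weighted map. This is an illustrative reading, not the paper's
    exact definition."""
    headroom = theta_max - x
    return weighted_saliency(jacobian * headroom[None, :], probs, target)

# Two features with identical gradients but different headroom (made-up numbers).
J = np.array([[-1.0, -1.0],
              [ 1.0,  1.0]])
p = np.array([0.7, 0.3])
x = np.array([0.95, 0.10])   # feature 0 is almost saturated

w_scores = weighted_saliency(J, p, target=1)
t_scores = taylor_saliency(J, p, target=1, x=x)
```

Both features tie under the weighted map, while the Taylor-style scores clearly favour feature 1, whose value can still increase by 0.9 before reaching $\theta_{\max}$.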

Experiments
In the following, we give attack and defense applications to illustrate the interest of WJSMA and TJSMA over JSMA. In doing so, we also compare WJSMA and TJSMA and report overall better results for TJSMA, although WJSMA outperforms TJSMA on a large share of samples. We use the following standard datasets: MNIST [10]. This dataset contains 70,000 28 × 28 greyscale images in 10 classes, divided into 60,000 training images and 10,000 test images. The possible classes are digits from 0 to 9.
We implement and train the LeNet-5 model on MNIST using a cleverhans model that optimises crafting adversarial examples. The number of epochs is fixed to 20, the batch size to 128 and the learning rate to 0.001, and the Adam optimizer is used. Training results in 100% accuracy on the training dataset and 99.49% accuracy on the test dataset.
DNN on CIFAR-10. For the second experiment, a more complex DNN is trained to reach a good performance on CIFAR-10, which is more challenging than MNIST. Its architecture is inspired by the All Convolutional model proposed in cleverhans and is described in the supplementary material. Likewise, this model is implemented and trained using cleverhans for 10 epochs, with a batch size of 128, a learning rate of 0.001 and the Adam optimizer. Training results in 99.96% accuracy on the training dataset and 83.81% accuracy on the test dataset.
To compare our results with [15], we use the original implementation of JSMA available in cleverhans. We have also adapted the code to WJSMA and TJSMA, obtaining fast implementations of these two attacks. We only test the attacks (i.e. Algorithm 2 in the three formats: original, weighted and Taylor) on samples that are correctly predicted by their respective neural networks. In this way, the attacks are applied to the whole training set and the 9,949 well-predicted images of the MNIST test dataset. Similarly, CIFAR-10 adversarial examples are crafted from the 9,995 well-predicted images among the first 10,000 training images and the 8,381 well-predicted test images.
To compare the three attacks, we rely on the notion of maximum distortion of adversarial samples, defined as the ratio of altered components to the total number of components. Following [15], we choose a maximum distortion of $\gamma = 14.5\%$ on the adversarial samples from MNIST, corresponding to $\text{maxIter} = \frac{784 \cdot \gamma}{2 \cdot 100}$ with $\gamma$ expressed in percent. On CIFAR-10, we fix $\gamma = 3.7\%$ in order to have the same maximum number of iterations for both experiments. This allows a comparison between the attacks in two different settings. Furthermore, for both experiments, we set $\theta = 1$ (note that $\theta_{\min} = 0$, $\theta_{\max} = 1$).
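The arithmetic behind these two values of $\gamma$ can be checked directly. The sketch below makes two assumptions that the text does not state: the iteration budget is rounded down, and CIFAR-10 inputs have $32 \times 32 \times 3 = 3072$ components. The division by 2 reflects that each iteration of the pairwise attack alters two components. The function names are illustrative.

```python
import math
import numpy as np

def max_iterations(n_features, gamma_percent):
    """Iteration budget for a maximum distortion gamma (in percent).

    Each iteration of the pairwise attack changes two features, hence the factor 2.
    ASSUMPTION: the budget is rounded down to an integer.
    """
    return math.floor(n_features * gamma_percent / (2 * 100))

def distortion_percent(x, x_adv):
    """Percentage of components of x that differ in x_adv."""
    return 100.0 * np.count_nonzero(x != x_adv) / x.size

mnist_iters = max_iterations(28 * 28, 14.5)      # 784 features
cifar_iters = max_iterations(32 * 32 * 3, 3.7)   # 3072 features (assumed input size)

# A sample that uses the whole MNIST budget stays under the 14.5% bound.
x = np.zeros(28 * 28)
x_adv = x.copy()
x_adv[: 2 * mnist_iters] = 1.0
used = distortion_percent(x, x_adv)
```

Under these assumptions both settings allow the same number of iterations, which is what makes the success rates of the two experiments comparable.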
On MNIST. Results in terms of success rate are quite remarkable: WJSMA and TJSMA respectively outperform JSMA by nearly 9.46 and 10.98 percentage points (pp) on the training set and by 9.46 and 11.34 pp on the test set. The gain in the average number of altered components exceeds 6 components for WJSMA and 9 components for TJSMA in both experiments.
On CIFAR-10. Similar results are obtained on this dataset. WJSMA and TJSMA outperform JSMA in success rate by nearly 9.74 and 11.23 pp on the training set and by more than 10 and 12 pp on the test set. For both training and test sets, we report better mean $L_0$ distances, with gains exceeding 7 features in all cases and reaching 10.14 features for TJSMA on the training set.
Dominance of the attacks. The next figures illustrate the (strict) dominance of the attacks for the two experiments. In these statistics, we do not count the samples for which TJSMA and WJSMA perform the same number of iterations, strictly less than JSMA. For both experiments, TJSMA has a noteworthy advantage over WJSMA and JSMA. The advantage of WJSMA over JSMA is also considerable. This shows that, in most cases, WJSMA and TJSMA craft better adversarial examples than JSMA, while being faster. Our results are actually better when directly comparing WJSMA or TJSMA with JSMA. As additional results, we give in the supplementary material the statistics for the pairwise dominance between the attacks. As might be expected, both WJSMA and TJSMA dominate JSMA, and TJSMA dominates WJSMA.

Run-time comparison. In order to have a meaningful speed comparison between the three attacks, we measured the run-time needed for each attack to successfully attack the first 1,000 test images of MNIST in the targeted mode. Results shown in Table 3 reveal that TJSMA and WJSMA are 1.41 and 1.28 times faster than JSMA. These performance tests were carried out on a machine equipped with an Intel Xeon 6126 processor and an Nvidia Tesla P100 graphics processor. Based on a previous analysis [1], TJSMA and WJSMA are at least 28 and 24 times faster than the $L_0$ CW attack. Note that for WJSMA and TJSMA, the additional computations of one iteration compared to JSMA are negligible (simple multiplications). Thus the difference in speed between the attacks is mainly due to the number of iterations for each attack.

Note that to compare the attacks, the adversarial samples were crafted one by one. In practice, it is possible to generate samples by batch. In this case, the algorithm stops when all samples are treated. Most of the time, with a large batch, the three attacks take approximately the same time to converge. For example, on the same machine as previously, with a batch size of 1,000, we were able to craft the same number of samples in about 250 s for all the attacks.

Defense
The objective of this section is to train neural networks in a way that the attacks fail as much as possible. One way of doing that is by adding adversarial samples crafted by JSMA, WJSMA and TJSMA to the training set. This way of training may imply a decrease in the model accuracy but adversarial examples will be more difficult to generate.
We experiment with this idea on MNIST with LeNet-5 in every possible configuration. To this end, 2,000 adversarial samples per class (20,000 more images in total), with distortion under 14.5%, crafted by either JSMA, WJSMA or TJSMA, are added to the original MNIST training set. Then, three distinct models are trained on these three augmented datasets. The models achieve an accuracy of roughly 99.9% on the training set and 99.3% on the test set, showing a slight loss compared to our previous MNIST model. Nevertheless, the obtained neural networks are more robust to the attacks, as shown in Table 4. Note that each experiment is made over the well-predicted samples of the test images. For each model and image, nine adversarial examples are generated by each of the three attacks. Overall, the attacks are less efficient on each of these models, compared to Table 1. The success rates drop by about 8 pp, whereas the number of iterations increases by approximately 26%. From the defender's point of view, networks trained against JSMA and TJSMA give the best performance. The JSMA-trained model provides the lowest success rates, while the TJSMA-trained network is more robust from the $L_0$ distance point of view. From the attacker's point of view, TJSMA remains the most efficient attack of the three regardless of the augmented dataset used.
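The data augmentation step can be sketched as follows. This is a minimal sketch under several assumptions that the text does not settle: "per class" is read as the original class of the attacked image, the target class is drawn at random, failed attacks are skipped, and adversarial samples keep the true label of the image they were crafted from. The callable `attack` and the stand-in `toy_attack` are hypothetical placeholders for JSMA, WJSMA or TJSMA.

```python
import numpy as np

def augment_with_adversarial(x_train, y_train, attack, per_class, n_classes, seed=0):
    """Append up to `per_class` adversarial samples per original class.

    attack(x, target) is a hypothetical callable returning an adversarial sample,
    or None when the attack fails within the distortion budget.
    ASSUMPTION: adversarial samples are labelled with the true class of their source.
    """
    rng = np.random.default_rng(seed)
    adv_x, adv_y = [], []
    for c in range(n_classes):
        made = 0
        for i in rng.permutation(np.flatnonzero(y_train == c)):
            if made == per_class:
                break
            # Pick a random target class different from the true class.
            target = int(rng.choice([k for k in range(n_classes) if k != c]))
            x_adv = attack(x_train[i], target)
            if x_adv is not None:      # keep successful attacks only
                adv_x.append(x_adv)
                adv_y.append(c)
                made += 1
    if not adv_x:
        return x_train, y_train
    return (np.concatenate([x_train, np.stack(adv_x)]),
            np.concatenate([y_train, np.array(adv_y, dtype=y_train.dtype)]))

def toy_attack(x, target):
    """Stand-in for a real attack: shifts every feature, ignoring the target."""
    return np.clip(x + 0.1, 0.0, 1.0)

x_train = np.random.default_rng(1).uniform(size=(30, 4))
y_train = np.repeat(np.arange(3), 10)
x_aug, y_aug = augment_with_adversarial(x_train, y_train, toy_attack,
                                        per_class=2, n_classes=3)
```

With `per_class=2000` and `n_classes=10` this would add 20,000 samples to the 60,000 MNIST training images, matching the sizes reported above.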

Avoid confusion
In this section, we argue that our results do not contradict [15]. First, we stress that we use a more performant LeNet-5 model than the one in [15] (which has 98.93% and 99.41% accuracies on the training and test sets). For completeness, we also generated a less performant model (with 99.34% and 98.94% accuracies on the training and test sets) and evaluated the three attacks on it using the first 1,000 MNIST test images. We obtain a 96.7% success rate for JSMA (very similar to [15]) and more than 99.5% for WJSMA and TJSMA. These results are also included in our experiments. Instead of presenting two models, we preferred to use the more performant one, as this makes the paper shorter and, moreover, better highlights our approach (giving us more advantage with respect to JSMA). Finally, we note that for both experiments, and contrary to [15] (see Appendix A in [15]), our results were obtained without simplifications of the model, which is an additional advantage of our attacks.

Conclusion
This paper has introduced WJSMA and TJSMA, new probabilistic variants of the JSMA adversarial attack. It has demonstrated that WJSMA and TJSMA significantly outperform JSMA on two standard DNNs on MNIST and CIFAR-10 after analysing more than 88,200 × 9 adversarial images. It has also demonstrated that defending against WJSMA and TJSMA is more advantageous than defending against JSMA. It is important to recall that our attacks are derived quite naturally from a classical log softmax reasoning and benefit from substantial investigations of doubly indexed saliency maps. Based on the analysis of 9,000 adversarial samples, WJSMA and TJSMA are at least 1.2 and 1.4 times faster than JSMA and accordingly at least 24 and 28 times faster than the $L_0$ CW attack. We believe these results are quite reassuring and make the new attacks promising tools for future applications. Finally, non-targeted versions of our attacks have not been discussed in this paper and may be the subject of future work and of comparison with existing approaches such as [23].

Supplementary material
Architectures of the DNNs.

Supplementary comments. Further analysis of the results on MNIST reveals that, even for examples where JSMA is better than WJSMA or TJSMA, on average fewer than 10 more components are changed by WJSMA or TJSMA, whereas JSMA changes on average more than 17 more components when it is dominated by WJSMA or TJSMA. A similar gap can be observed on CIFAR-10.