Perspectives on Adversarial Classification

Abstract: Adversarial classification (AC) is a major subfield within the increasingly important domain of adversarial machine learning (AML). So far, most approaches to AC have followed a classical game-theoretic framework. This requires common knowledge conditions that are untenable in the security settings typical of the AML realm. After reviewing such approaches, we present alternative perspectives on AC based on adversarial risk analysis.


Introduction
Classification is a major research area with important applications in security and cybersecurity, including fraud detection [1], phishing detection [2], terrorism [3] or cargo screening [4]. An increasing number of processes are being automated through classification algorithms; it is essential that these be robust if we are to trust key operations based on their output. State-of-the-art classifiers perform extraordinarily well on standard data, but they have been shown to be vulnerable to adversarial examples, that is, data instances targeted at fooling the underlying algorithms. Ref. [5] provides an excellent introduction from a policy perspective, pointing out the potentially enormous security impact that such attacks may have on content filtering, predictive policing or autonomous driving systems, to name but a few.
Most research in classification has focused on obtaining more accurate algorithms, largely ignoring the eventual presence of adversaries who actively manipulate data to fool the classifier in pursuit of a benefit. Consider spam detection: as classification algorithms are incorporated into this task, spammers learn how to evade them. Thus, rather than sending their spam messages in standard language, they slightly modify spam words (frequent in spam messages but not so much in legitimate ones), misspelling them or replacing them with synonyms; or they add good words (frequent in legitimate emails but not in spam ones) to fool the detection system.
Consequently, classification algorithms in critical AI-based systems must be robust against adversarial data manipulations. To this end, they have to take into account possible modifications of input data due to adversaries. The subfield of classification that seeks algorithms behaving robustly against adversarial perturbations is known as adversarial classification (AC) and was pioneered by [6]. Stemming from their work, the prevailing paradigm when modelling the confrontation between classification systems and adversaries has been game theory, see recent reviews [7,8]. This entails well-known common knowledge hypotheses [9,10], according to which agents share information about their beliefs and preferences. From a fundamental point of view, this is not sustainable in application areas such as security or cybersecurity, as participants try to conceal information.
After reviewing key developments in game-theoretic approaches to AC in Sections 2 and 3, we cover novel techniques based on adversarial risk analysis (ARA) in Sections 4 and 5 [11]. Their key advantage is that they do not assume the strong common knowledge hypotheses concerning belief and preference sharing made by standard game-theoretic approaches to AML. In this, we unify, expand and improve upon earlier work in [12,13]. Our focus will be on binary classification problems facing only exploratory attacks, defined as those influencing operational data but not training data. In addition, we restrict our attention to attacks affecting only malicious instances, known as integrity-violation attacks, the usual context in most security scenarios. We assume that the attacker will, at least, try to modify every malicious instance before the classifier actually observes it. Moreover, attacks will be assumed to be deterministic, in that we can predict for sure the result of their application over a given instance. Refs. [14,15] provide taxonomies of attacks against classifiers. We first consider approaches in which learning about the adversary is performed in the operational phase, studying how to robustify generative and discriminative classifiers against attacks. In certain applications, these could be very demanding from a computational perspective; for those cases, we present in Section 5 an approach in which adversarial aspects are incorporated in the training phase. Section 6 illustrates the proposed framework with spam detection and image processing examples.

Binary Classification Algorithms
In binary classification settings, an agent that we call classifier (C, she) may receive instances belonging to one of two possible classes denoted, in our context, as malicious (y = y_1) or innocent (y = y_2). Instances have features x ∈ R^d whose distribution informs about their class y. Most classification approaches can typically be broken down into two separate stages, the training and operational phases [16].
The first one is used to learn the distribution p_C(y|x), modelling the classifier's beliefs about the instance class y given its features x. Frequently, a distinction is introduced between generative and discriminative models. In the first case, models p_C(x|y) and p_C(y) are learnt from training data; based on them, p_C(y|x) is deduced through Bayes formula. Typical examples include Naive Bayes [17] and (conditional) variational autoencoders [18]. In discriminative cases, p_C(y|x) is directly learnt from data. Within these, an important group of methods uses a parameterised function f_β : R^d → R^2 so that the prediction is given through p_C(y|x, β) = softmax(f_β(x))[y]: when f_β(x) = βx, we recover the logistic regression model [19]; if f_β is a sequence of linear transformations alternating with certain nonlinear activation functions, we obtain a feed-forward neural network [16]. Learning then depends on the underlying methodology adopted.

• In frequentist approaches, training data D is typically used to construct a (possibly regularised) maximum likelihood estimate β̂, and p_C(y|β̂, x) is employed for classification. Parametric differentiable models are amenable to training with stochastic gradient descent (SGD) [20] using a minibatch of samples at each iteration. This facilitates, e.g., training deep neural networks on large amounts of high-dimensional data such as images or text [21].
• In Bayesian approaches, a prior p_C(β) is used to compute the posterior p_C(β|D), given the data D, and the predictive distribution is used to classify. Given current technology, in complex environments we are sometimes only able to approximate the posterior mode β̂, then using p_C(y|β̂, x).
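The discriminative parameterisation above can be illustrated with a short sketch; the parameter values below are purely hypothetical, chosen only to show the softmax construction and its logistic regression special case.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax.
    e = np.exp(v - v.max())
    return e / e.sum()

def p_class_given_x(x, beta):
    # p_C(y | x, beta) = softmax(f_beta(x)) with the linear choice
    # f_beta(x) = beta x, which recovers the logistic regression model.
    return softmax(beta @ x)

beta = np.array([[1.0, 0.5],     # scores for class y_1 (malicious)
                 [-1.0, -0.5]])  # scores for class y_2 (innocent)
x = np.array([2.0, 1.0])
probs = p_class_given_x(x, beta)  # a probability vector over both classes
```

Replacing the linear map by a stack of linear transformations with nonlinear activations would turn this into the feed-forward network case.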
In any case, and whatever the learning approach adopted, we shall use the notation p_C(y|x).
The second stage is operational. The agent makes class assignment decisions based on p_C(y|x). This may be formulated through the influence diagram (ID) [22] in Figure 1. In it, square nodes describe decisions; circle nodes, uncertainties; hexagonal nodes refer to the associated utilities. Arcs pointing to decision nodes are dashed and represent information available when the corresponding decisions are made. Arcs pointing to chance and value nodes suggest conditional dependence.
The classifier's guess y_c for an observed instance x provides her with a utility u_C(y_c, y_i) when the actual class is y_i, and she aims at maximising expected utility through

    y_c*(x) = arg max_{y_c} Σ_{i=1}^{2} u_C(y_c, y_i) p_C(y_i|x).    (2)

An important example of utility function is the 0-1 utility, u_C(y_c, y_i) = I(y_c = y_i), where I is the indicator function. This leads to deciding based on maximising the predictive probability of correct classification, arg max_{y_c} p_C(y_c|x). In general, the utility is characterised as a 2 × 2 matrix whose ij-th entry represents the utility that the classifier receives for classifying an instance of true label j as being of class i. This essentially balances the relative importance of false positives and false negatives. Consider the case of autonomous driving systems, in which proper classification of identified objects is of major importance. Misclassification errors increase the likelihood of incorrect forecasts of an object's behaviour and an inaccurate assessment of the environment. To minimise the risk of accidents, the system should be very sensitive and react safely when there is uncertainty about the situation. Regrettably, such systems are prone to false-positive emergency identifications that lead to unnecessary reactions.
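The expected utility rule with a general 2 × 2 utility matrix can be sketched as follows; the utility values are illustrative, chosen to penalise false negatives (missing a malicious instance) more than false positives.

```python
import numpy as np

# u_C as a 2x2 matrix: entry [c][i] is the utility of guessing class c
# when the true class is i (0 = malicious, 1 = innocent). Illustrative values.
U = np.array([[1.0, 0.2],
              [0.0, 1.0]])

def classify(p_malicious, U):
    # Choose y_c maximising sum_i u_C(y_c, y_i) p_C(y_i | x).
    p = np.array([p_malicious, 1.0 - p_malicious])
    return int(np.argmax(U @ p))

# The 0-1 utility (identity matrix) thresholds the posterior at 1/2,
# whereas U above lowers the threshold, flagging borderline instances.
d_sensitive = classify(0.45, U)          # 0: flagged as malicious
d_zero_one = classify(0.45, np.eye(2))   # 1: innocent under 0-1 utility
```

With U as above, the decision threshold moves from 1/2 down to 4/9, exactly the kind of sensitivity adjustment discussed for safety-critical systems.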
Recall, anyway, that there are classification techniques, such as those based on SVMs, that rather than breaking classification into training and operational stages, directly learn a function that maps features x into labels y. Should we wish to apply the methodologies described below to this type of classifiers, we would need to produce estimates of p_C(y|x) using their outputs. This can be achieved using calibration techniques such as the scaling approach of [23].

Attacks to Binary Classification Algorithms
Consider now another agent called adversary (A, he). He aims at fooling C, making her err in classifying instances so as to attain some benefit. A applies an attack a to the features x, leading to x′ = a(x), the actual observation received by C, who does not observe the originating instance x. For notational convenience, we sometimes write x = a⁻¹(x′). Upon observing x′, C needs to determine the instance class. As we next illustrate, an adversary unaware classifier may make gross mistakes if she classifies based on the features x′ instead of the original ones.
Attacks to spam detection systems. Let us focus on attacks to standard classifiers used in spam detection. Experiments are carried out with the UCI Spam Data Set [24]. This set contains data from 4601 emails, out of which 39.4% are spam. For classification purposes, we represent each email through 54 binary variables indicating the presence (1) or absence (0) of 54 designated words in a dictionary. Table 1 presents the performance of four standard classifiers (Naive Bayes, logistic regression, neural net and random forest) based on a 0-1 utility function, against tainted and untainted data. The neural network model is a two-layer one. The logistic regression is applied with L1 regularisation; this is equivalent to performing maximum a posteriori estimation in a logistic regression model with a Laplace prior [25]. Means and standard deviations of accuracies are estimated via repeated hold-out validation over ten repetitions [26]. Observe the important loss in accuracy of the four classifiers, showcasing a major degradation in performance of adversary unaware classifiers when facing attacks.
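The degradation mechanism can be reproduced on synthetic data; this is a toy sketch, not the UCI experiment: a Bernoulli Naive Bayes classifier is trained on binary word-presence features (with made-up word frequencies) and then evaluated against a good-word attack in which every spam message adds words frequent in legitimate mail.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 20
y = rng.random(n) < 0.4                      # True = spam
# First 10 features: "spam words"; last 10: "good words" (synthetic rates).
p_word = np.where(y[:, None],
                  np.r_[np.full(10, 0.7), np.full(10, 0.1)],
                  np.r_[np.full(10, 0.1), np.full(10, 0.7)])
X = (rng.random((n, d)) < p_word).astype(int)

def train_nb(X, y):
    # Bernoulli Naive Bayes with Laplace smoothing.
    prior = y.mean()
    t_spam = (X[y].sum(0) + 1) / (y.sum() + 2)
    t_ham = (X[~y].sum(0) + 1) / ((~y).sum() + 2)
    return prior, t_spam, t_ham

def predict(X, prior, ts, th):
    ll_s = (X * np.log(ts) + (1 - X) * np.log(1 - ts)).sum(1) + np.log(prior)
    ll_h = (X * np.log(th) + (1 - X) * np.log(1 - th)).sum(1) + np.log(1 - prior)
    return ll_s > ll_h

prior, ts, th = train_nb(X, y)
clean_acc = (predict(X, prior, ts, th) == y).mean()

X_att = X.copy()
X_att[y, 10:] = 1                            # good-word attack on spam only
attacked_acc = (predict(X_att, prior, ts, th) == y).mean()
# attacked_acc drops far below clean_acc: the classifier is adversary unaware.
```

Each inserted good word shifts the spam log-odds by roughly logit(0.1) − logit(0.7) ≈ −3, so ten of them overwhelm the evidence from the spam words.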

Adversarial Classification: Game-Theoretic Approaches
As exemplified, an adversary unaware classifier may be fooled into issuing wrong classifications leading to severe performance deterioration. Strategies to mitigate this problem are thus needed. These may be based on building models of the attacks likely to be undertaken by the adversaries and enhancing classification algorithms to be robust against such attacks.
For this, the ID describing the classification problem (Figure 1) is augmented to incorporate adversarial decisions, leading to a biagent influence diagram (BAID) [27], Figure 2. In it, grey nodes refer to elements solely affecting A's decision; white nodes, to issues solely pertaining to C's decision; striped nodes affect both agents' decisions. We only describe the new elements. First, the adversary's decision is represented through node a (the chosen attack). The impact of the data transformation over x implemented by A is described through node x′, the data actually observed by the classifier; the corresponding node is deterministic (double circle), as we assume deterministic attacks. Finally, the utility of A is represented with node u_A, with form u_A(y_c, y), when C says y_c and the actual label is y. We assume that attack implementation has negligible costs. As before, C aims at maximising her expected utility; A also aims at maximising his expected utility, trying to confuse the classifier (and, consequently, reducing her expected utility).

Figure 2. Adversarial classification as a bi-agent influence diagram.

Adversarial Classification: The Pioneering Model
Dalvi et al. [6] provided a pioneering approach to enhance classification algorithms when an adversary is present, calling it adversarial classification (AC). Because of its importance, we briefly review it, using our notation. The authors view the problem as a game between a classifier C and an adversary A, using the following forward myopic proposal.

1. C first assumes that the data is untainted and computes her optimal classifier through (2). Ref. [6] focuses on a utility-sensitive Naive Bayes algorithm [28].

2. Then, assuming that A has complete information about the classifier's elements (a common knowledge assumption) and that C is not aware of his presence, the authors compute A's optimal attack. To that end, they propose solving an integer programming problem (4), where X_C is the set of features the classifier uses for making her decision and X_i is the i-th feature, with original value x_i ∈ X_i, assumed to be discrete; δ_{i,x′_i} is a binary variable adopting value 1 when feature X_i is changed from x_i to x′_i, with C(x_i, x′_i) the cost of such a change. It is easy to see that problem (4) reflects the fact that the adversary tries to minimise the cost of modifying an instance, provided that such modification induces a change in the classification decision.

3. Subsequently, the classifier, assuming that A implements the previous attack (again a common knowledge assumption) and that the training data is untainted, deploys her optimal classifier against it: she chooses y_c maximising Σ_{i=1}^{2} u_C(y_c, y_i) p_C(y_i|x′), her posterior expected utility given that she observes the possibly modified instance x′. By Bayes formula, and up to normalisation, this is equivalent to maximising Σ_{i=1}^{2} u_C(y_c, y_i) p_C(x′|y_i) p_C(y_i). Estimating all these elements is straightforward, except for p_C(x′|y_1). Again, appealing to a common knowledge assumption, the authors assume that the classifier, who knows all the adversary's elements, can solve (4) and compute x′ = a(x) for each x the adversary may receive. Thus

    p_C(x′|y_1) = Σ_{x∈X} p_C(x′|x, y_1) p_C(x|y_1),    (5)

where X is the set of possible instances leading to the observed one and p_C(x′|x, y_1) = 1 if a(x) = x′ and 0 otherwise.
The procedure could continue for more stages. However, [6] considers it sufficient to use these three.
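The cost-minimising attack in step 2 can be sketched, for tiny discrete feature spaces, by exhaustive search. The toy classifier and unit costs below are hypothetical stand-ins for the Naive Bayes classifier and cost matrix C(x_i, x′_i) of [6], not their actual integer-programming solver.

```python
from itertools import product

def cheapest_evasion(x, classify, domains, cost):
    # Brute-force stand-in for problem (4): find the modified instance
    # of minimum total cost whose classification differs from that of x.
    y0 = classify(x)
    best, best_cost = None, float("inf")
    for cand in product(*domains):
        c = sum(cost(i, xi, ci)
                for i, (xi, ci) in enumerate(zip(x, cand)) if xi != ci)
        if c < best_cost and classify(cand) != y0:
            best, best_cost = cand, c
    return best, best_cost

classify = lambda x: sum(x) >= 2          # "malicious" if 2+ suspicious words
domains = [(0, 1)] * 3                    # each binary feature's domain X_i
cost = lambda i, old, new: 1.0            # unit cost per changed feature
best, best_cost = cheapest_evasion((1, 1, 1), classify, domains, cost)
# Two words must be removed to flip the decision, at total cost 2.
```

The exhaustive search is exponential in the number of features, which is precisely why [6] resort to an integer programming formulation.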
As presented (and the authors actually stress this in their paper), very strong common knowledge assumptions are made: each player's parameters are known to the other. Although standard in game theory, such assumptions are unrealistic in the security scenarios typical of AC.

Other Adversarial Classification Game-Theoretic Developments
In spite of this, stemming from [6], AC has been dominated by game-theoretic approaches, as reviewed in [29] or [30]. Subsequent attempts have focused on analysing attacks over classification algorithms and assessing their robustness against them, under various assumptions about the adversary. Regarding attacks, these have been classified as white box, when the adversary knows every aspect of the defender's system, such as the data used, the algorithms and the entire feature space; black box, which assume limited capabilities for the adversary, e.g., he is able to send membership queries to the classification system as in [31]; and, finally, grey box, which lie in between the previous ones, as in [32], where the adversary, who has no knowledge about the data and the algorithm used, seeks to pass off his malicious instances as innocent ones, thus assuming that he is able to estimate such instances and has knowledge about the feature space.
Of special importance in the AC field, mainly within the deep learning community, are the so-called adversarial examples [33], which may be formulated in game-theoretic terms as optimal attacks to a deployed classifier, requiring, in principle, precise knowledge about the model used by the classifier. To create such examples, A finds the best attack, leading to perturbed data instances obtained from solving the problem

    min_{‖δ‖ ≤ ε} c_A(h_θ(a(x)), y),

with a(x) = x + δ, a perturbation of the original data instance x; h_θ(x), the output of a predictive model with parameters θ; and c_A(h_θ(x), y), the adversary's cost when an instance x of class y is classified as being of class h_θ(x). This cost is usually taken to be −c_D(h_θ(x), y), where c_D is the defender's cost. The Fast Gradient Sign Method (FGSM) [33] and related attacks in the literature [34] assume that the attacker has precise knowledge of the underlying model and parameters of the involved classifier, which is debatable in most security settings.
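A minimal FGSM sketch on a logistic model illustrates the idea; the parameters and instance below are made up, not tied to any of the cited experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    # For the cross-entropy loss of h_theta(x) = sigmoid(w.x + b),
    # grad_x loss = (p - y) * w; FGSM perturbs x by eps in its sign.
    p = sigmoid(w @ x + b)
    return x + eps * np.sign((p - y) * w)

w, b = np.array([2.0, -1.0]), 0.0        # illustrative fitted parameters
x, y = np.array([0.5, -0.5]), 1.0        # a positive-class instance
x_adv = fgsm(x, y, w, b, eps=0.3)
# The attack lowers the model's confidence on the true class.
```

Note how the attack needs the gradient of the loss with respect to the input, i.e., white box access to w.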
A few methods have been proposed to robustify classification algorithms in adversarial settings. Most of them have focused on application-specific domains, as [35] on spam detection. Ref. [36] studies the impact of randomisation schemes over classifiers against adversarial attacks, proposing an optimal randomisation scheme as best defence. To date, adversarial training (AT) [37] is one of the most promising defence techniques: it trains the defender model using attacked samples, solving the problem

    min_θ Σ_{(x,y)∈D} max_{x′∈B(x)} c_D(h_θ(x′), y),

thus minimising the empirical risk of the model under worst case perturbations of the data D. AT can be formulated as a zero-sum game. The inner maximisation problem is solved through projected gradient descent (PGD) with iterations

    x^{t+1} = Π_{B(x)} ( x^t + α sign(∇_x c_D(h_θ(x^t), y)) ),

where Π is a projection operator ensuring that the perturbed input falls within an acceptable boundary B(x), and α is an intensity hyperparameter referring to the attack strength. After T PGD iterations, set a(x) = x^T and optimise with respect to θ. Other attacks that use gradient information include DeepFool [38], yet [37] argue that the PGD attack is the strongest one based only on gradient information from the target model. However, there is evidence that it does not suffice for a full defence in neural models, since it is possible to perform attacks using global optimisation routines, such as the one pixel attack from [39] or [40].
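The AT scheme can be sketched for a logistic model; the synthetic data and hyperparameters are illustrative assumptions. Each outer step fits θ on PGD-attacked versions of the training points.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd(x, y, w, b, eps, alpha, T):
    # Inner maximisation: x^{t+1} = Pi_{B(x)}(x^t + alpha * sign(grad)),
    # with B(x) an L_inf ball of radius eps around x.
    xt = x.copy()
    for _ in range(T):
        p = sigmoid(w @ xt + b)
        xt = np.clip(xt + alpha * np.sign((p - y) * w), x - eps, x + eps)
    return xt

def adversarial_training(X, Y, eps=0.2, alpha=0.05, T=5, lr=0.5, epochs=300):
    rng = np.random.default_rng(1)
    w, b = rng.normal(size=X.shape[1]), 0.0
    for _ in range(epochs):
        # Outer minimisation over theta, on the attacked samples.
        Xa = np.array([pgd(x, yl, w, b, eps, alpha, T) for x, yl in zip(X, Y)])
        P = sigmoid(Xa @ w + b)
        w -= lr * Xa.T @ (P - Y) / len(Y)
        b -= lr * (P - Y).mean()
    return w, b

X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
Y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = adversarial_training(X, Y)
# The trained model classifies the clean points correctly despite having
# been fit only on worst-case perturbations within the eps-ball.
```

The two nested loops make the zero-sum game structure explicit: PGD plays the attacker, the gradient step on θ plays the defender.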
Other approaches have focused on improving the game-theoretic model in [6]. However, to our knowledge, none has been able to overcome the above mentioned unrealistic common knowledge assumptions, as may be seen in recent reviews [7,8], which point out the importance of this issue. As an example, [41] use a Stackelberg game in which both players know each other's payoff functions. Only [42] have attempted to relax common knowledge assumptions in adversarial regression settings, reformulating the corresponding problem as a Bayesian game.

Adversarial Classification: Adversarial Risk Analysis Approaches
Given the above mentioned issue, we provide ARA solutions to AC. We focus first on modelling the adversary's problem in the operational phase. We present the classification problem faced by C as a Bayesian decision analysis problem in Figure 3, derived from Figure 2. In it, A's decision appears as random to the classifier, since she does not know how the adversary will attack the data. For notational convenience, when necessary we distinguish between random variables and realisations using upper and lower case letters, respectively; in particular, we denote by X the random variable referring to the original instance (before the attack) and by X′ that referring to the possibly attacked instance. ẑ will indicate an estimate of z.

An adversary unaware classifier would classify the observed instance x′ based on

    arg max_{y_c} Σ_{i=1}^{2} u_C(y_c, y_i) p_C(y_i | X = x′).    (6)

This leads to performance degradation, as reflected in Section 2.2. In contrast, an adversary aware classifier would use

    arg max_{y_c} Σ_{i=1}^{2} u_C(y_c, y_i) p_C(y_i | X′ = x′).    (7)

In the adversary unaware case, p_C(y_i | X = x′) is easily estimated using the training data. However, estimating the probabilities p_C(y_i | X′ = x′) is harder, as it entails modelling how the adversary will modify the original instance x, which is unobserved. Moreover, recall that common knowledge is not available, so we actually lack A's beliefs and preferences. ARA helps us in modelling our uncertainty about them. In doing so, robustness is typically enhanced. We discuss two strategies depending on whether we use generative or discriminative classifiers as base models.

The Case of Generative Classifiers
Suppose first that a generative classifier is required. As training data is clean by assumption, we can estimate p_C(y) (modelling the classifier's beliefs about the class distribution) and p_C(X = x|y) (modelling her beliefs about the feature distribution given the class when A is not present). In addition, assume that, when C observes X′ = x′, she can estimate the set X of original instances x potentially leading to the observed x′. As later discussed, in most applications this will typically be a very large set. When the feature space is endowed with a metric d, an approach to approximating X would be to consider X = {x : d(x, x′) < ρ} for a certain threshold ρ.
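For binary features under the Hamming distance, the set X can be enumerated explicitly; the sketch below (a hypothetical helper, not from the original paper) also shows why X explodes combinatorially as ρ grows.

```python
from itertools import combinations
from math import comb

def originating_set(x_obs, rho):
    # X = {x : d(x, x_obs) < rho} under the Hamming distance d:
    # all instances obtained by flipping fewer than rho bits of x_obs.
    d = len(x_obs)
    out = []
    for k in range(rho):
        for idx in combinations(range(d), k):
            x = list(x_obs)
            for i in idx:
                x[i] = 1 - x[i]
            out.append(tuple(x))
    return out

X_set = originating_set((1, 0, 1), rho=2)    # the instance plus 3 neighbours
# |X| = sum_{k < rho} C(d, k): for d = 54 binary word features, as in the
# spam example, and rho = 4 this is already 1 + 54 + 1431 + 24804 origins.
size_54 = sum(comb(54, k) for k in range(4))
```

This growth is what makes the exact sum over X computationally demanding in realistic feature spaces.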
Given the above, when observing x′, the classifier should choose the class with maximum posterior expected utility (7). Applying Bayes formula, and ignoring the denominator, which is irrelevant for optimisation purposes, she must find the class

    arg max_{y_c} Σ_{i=1}^{2} u_C(y_c, y_i) p_C(y_i) Σ_{x∈X} p_C(X′ = x′ | X = x, y_i) p_C(X = x | y_i).    (8)

In such a way, A's modifications are taken into account through the probabilities p_C(X′ = x′ | X = x, y). At this point, recall that the focus is restricted to integrity violation attacks. Then, p_C(X′ = x′ | X = x, y_2) = δ(x′ − x) and problem (8) becomes

    arg max_{y_c} [ u_C(y_c, y_1) p_C(y_1) Σ_{x∈X} p_C(X′ = x′ | X = x, y_1) p_C(X = x | y_1) + u_C(y_c, y_2) p_C(y_2) p_C(X = x′ | y_2) ].    (9)
Note that, should we assume full common knowledge, we would know A's beliefs and preferences and, therefore, would be able to solve his problem exactly: when A receives an instance x from class y_1, we could compute the transformed instance. In this case, p_C(X′ = x′ | X = x, y_1) would be 1 just for the x whose transformed instance coincides with that observed by the classifier, and 0 otherwise. Inserting this in (9), we would recover Dalvi's formulation (5). However, common knowledge about beliefs and preferences does not hold. Thus, when solving A's problem, we have to take into account our uncertainty about his elements and, given that he receives an instance x with label y_1, we will not be certain about the attacked output x′. This will be reflected in our estimate p_C(x′|x, y_1), which will not be 0 or 1 as in Dalvi's approach (stage 3). With this estimate, we would solve problem (9), summing p_C(x|y_1) over all possible originating instances, with each element weighted by p_C(x′|x, y_1).
To estimate these last distributions, we resort to A's problem, assuming that this agent aims at modifying x to maximise his expected utility by making C classify malicious instances as innocent. The decision problem faced by A is presented in Figure 4, derived from Figure 2. In it, C's decision appears as an uncertainty to A.
To solve A's problem, we need p_A(y_c*(x′)|x′), which models A's beliefs about C's decision when she observes x′. Let p be the probability p_A(y_c*(a(x)) = y_1 | a(x)) that A concedes to C saying that the instance is malicious when she observes x′ = a(x). Since A will have uncertainty about it, let us model its density using f_A(p | x′ = a(x)), with expectation p^A_{x′=a(x)}. Then, upon observing an instance x of class y_1, A would choose the data transformation maximising his expected utility:

    a*(x) = arg max_{z=a(x)} [ u_A(y_1, y_1) p^A_z + u_A(y_2, y_1) (1 − p^A_z) ],    (10)

where u_A(y_i, y_j) is the attacker's utility when the defender classifies an instance of class y_j as one of class y_i.

However, the classifier does not know the adversary's utilities u_A and probabilities p^A_{z=a(x)}. Let us model such uncertainty through a random utility function U_A and a random expectation P^A_{z=a(x)}. Then, we could solve for the random attack, optimising the random expected utility

    X*(x, y_1) ≡ A*(x, y_1) = arg max_z { [U_A(y_1, y_1) − U_A(y_2, y_1)] P^A_{z=a(x)} + U_A(y_2, y_1) }.

We then use such distribution and make (assuming that the set of attacks is discrete, and similarly in the continuous case) p_C(x′|x, y_1) = Pr(X*(x, y_1) = x′), which was the missing ingredient in problem (9). Observe that it could be the case that Pr(X*(x, y_1) = x) > 0, i.e., the attacker does not modify the instance.

Now, without loss of generality, we can associate utility 0 with the worst consequence and 1 with the best one, the other consequences having intermediate utilities. In A's problem, his best consequence holds when the classifier accepts a malicious instance as innocent (he has opportunities to continue with his operations), while the worst consequence appears when the defender stops an instance (he has wasted effort in a lost opportunity), other consequences being intermediate. Therefore, we adopt U_A(y_1, y_1) ∼ δ_0 and U_A(y_2, y_1) ∼ δ_1. Then, the Attacker's random optimal attack would be

    X*(x, y_1) = arg min_z P^A_{z=a(x)}.    (11)

Modelling P^A_{z=a(x)} is more delicate. It entails strategic thinking and could lead to a hierarchy of decision making problems, described in [43] in a simpler context. A heuristic to assess it is based on using the probability r = Pr_C(y_c*(z) = y_1 | z) that C assigns to the instance received being malicious, assuming that she observed z, with some uncertainty around it. As it is a probability, r ranges in [0, 1] and we could make P^A_{z=a(x)} ∼ βe(δ_1, δ_2), with mean δ_1/(δ_1 + δ_2) = r and variance (δ_1 δ_2)/[(δ_1 + δ_2)² (δ_1 + δ_2 + 1)] = var, as perceived. var has to be tuned depending on the amount of knowledge C has about A. Details on how to estimate r are problem dependent.
In general, to approximate p_C(x′|x, y_1), we use Monte Carlo (MC) simulation, drawing K samples P^{A,k}_z, k = 1, ..., K, from P^A_z, finding X^k(x, y_1) = arg min_z P^{A,k}_z and estimating p_C(x′|x, y_1) through the proportion of times in which the result of the random optimal attack coincides with the instance actually observed by the defender:

    p̂_C(x′|x, y_1) = (1/K) Σ_{k=1}^{K} I(X^k(x, y_1) = x′).    (12)

It is easy to prove, using arguments in [44], that (12) converges almost surely to p_C(x′|x, y_1). In this, and other MC approximations considered, recall that the sample sizes are essentially dictated by the required precision. Based on the Central Limit Theorem [45], MC sums approximate integrals with probabilistic bounds of order √(var/N), where N is the MC sample size. To obtain a variance estimate, we run a few iterations, estimate the variance, and then choose the required size based on such bounds.
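The MC scheme in (12), combined with the beta heuristic for the random expectations, can be sketched as follows; the candidate attack set and the r assessments are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def beta_params(mean, var):
    # Match a beta distribution's mean and variance (method of moments).
    s = mean * (1.0 - mean) / var - 1.0
    return mean * s, (1.0 - mean) * s

def estimate_attack_probs(r, var, K=20000):
    # r[z] = Pr_C(y_c*(z) = y_1 | z): how suspicious each candidate attacked
    # instance z looks to C. Draw K joint samples of the random expectations
    # and count how often each z attains the arg min, per (12).
    params = {z: beta_params(rz, var) for z, rz in r.items()}
    counts = dict.fromkeys(r, 0)
    for _ in range(K):
        draws = {z: rng.beta(*params[z]) for z in r}
        counts[min(draws, key=draws.get)] += 1
    return {z: c / K for z, c in counts.items()}

# Three candidate transformed instances; "z2" looks least suspicious to C.
r = {"z1": 0.9, "z2": 0.2, "z3": 0.6}
probs = estimate_attack_probs(r, var=0.01)
# probs concentrates on "z2", the attack minimising expected detection.
```

Increasing var spreads the estimated attack distribution over more candidates, reflecting C's greater uncertainty about A.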
Once we have an approach to estimate the required probabilities, we implement the scheme described through Algorithm 1, which reflects an initial training phase to estimate the classifier and an operational phase that performs the above once a (possibly perturbed) instance x′ is received by the classifier.

Algorithm 1 General adversarial risk analysis (ARA) procedure for AC. Generative
Input: Training data D, test instance x'. Output: A classification decision y*_c(x').

Training
    Train a generative classifier to estimate p_C(y) and p_C(x|y)
End Training
Operation
    Read x'.
    Estimate p_C(x'|x, y_1) for all x ∈ X.
    Solve (9) and output y*_c(x').
End Operation

The Case of Discriminative Classifiers
With discriminative classifiers, we cannot use the previous approach, as we lack an estimate of p_C(X = x|y). Alternatively, assume for the moment that the classifier knows the attack a that she has suffered and that it is invertible, in the sense that she may recover the original instance x = a^{-1}(x').
Then, rather than classifying based on (6), as an adversary unaware classifier would do, she should classify based on

arg max_{y_c} Σ_y u_C(y_c, y) p_C(y | x = a^{-1}(x')).

However, she actually has uncertainty about the attack a, which induces uncertainty about the originating instance x. Suppose we model our uncertainty about the origin x of the attack through a distribution p_C(X = x | X' = x') with support over the set X of reasonable originating features x. Now, marginalising out over all possible originating instances, the expected utility that the classifier would get for her classification decision y_c would be

ψ(y_c) = Σ_{x ∈ X} [ Σ_y u_C(y_c, y) p_C(y|x) ] p_C(x|x'),   (13)

and we would solve for y*_c(x') = arg max_{y_c} ψ(y_c).
Typically, the expected utilities (13) are approximated by MC using a sample {x_n}_{n=1}^N from p_C(x|x'). Algorithm 2 summarises a general procedure.
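As a sketch, the MC approximation of (13) amounts to averaging expected utilities over the posterior sample. The helper signatures here are assumptions for illustration: `p_y_given_x(x)` returns the estimated label probabilities and `utility(y_c, y)` the classifier's utility.

```python
def mc_expected_utilities(samples, p_y_given_x, utility, labels):
    """Approximate psi(y_c) = (1/N) sum_n sum_y u_C(y_c, y) p_C(y|x_n)
    and return the maximising decision.
    p_y_given_x(x) -> dict {label: prob}; utility(y_c, y) -> float."""
    N = len(samples)
    psi = {yc: sum(utility(yc, y) * p_y_given_x(x)[y]
                   for x in samples for y in labels) / N
           for yc in labels}
    return max(psi, key=psi.get), psi
```

With a 0-1 utility, the decision reduces to picking the label with the highest average posterior probability over the sampled originating instances.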

Algorithm 2 General ARA procedure for AC. Discriminative
Input: Monte Carlo size N, training data D, test instance x'. Output: A classification decision y*_c(x').

Training
Based on D, train a discriminative classifier to estimate p C (y|x).

End Training
Operation
    Read x'.
    Estimate X and p_C(x|x'), x ∈ X.
    Draw sample {x_n}_{n=1}^N from p_C(x|x').
    Approximate ψ(y_c) as in (13) and output y*_c(x') = arg max_{y_c} ψ(y_c).
End Operation
To implement this approach, we need to be able to estimate X and p_C(x|x') or, at least, sample from such distribution. A powerful approach samples from p_C(X|X' = x') by leveraging approximate Bayesian computation (ABC) techniques [46]. This requires being able to sample from p_C(X) and p_C(X'|X = x), which we address first.
Estimating p_C(x) is done using training data, untainted by assumption. For this, we can employ an implicit generative model, such as a generative adversarial network [47] or an energy-based model [48].

In turn, sampling from p_C(x'|x) entails strategic thinking. Notice first that

p_C(x'|x) = p_C(x'|x, y_1) p_C(y_1|x) + p_C(x'|x, y_2) p_C(y_2|x),

where the attacker leaves innocent instances untouched, so that p_C(x'|x, y_2) concentrates on x' = x. We easily generate samples from p_C(y|x), as we can estimate those probabilities based on training data as in Section 2.1. Then, we can obtain samples from p_C(x'|x) by sampling y ~ p_C(y|x) first; next, if y = y_2, return x or, otherwise, sample x' ~ p_C(x'|x, y_1). To sample from the distribution p_C(x'|x, y_1), we model the problem faced by the attacker when he receives instance x with label y_1. The attacker will maximise his expected utility by transforming instance x to the x' given by (10). Again, associating utility 0 with the attacker's worst consequence and 1 with the best one, and modelling our uncertainty about the attacker's estimates of p^A_{z=a(x)} through random probabilities P^A_{z=a(x)}, we would look for random optimal attacks X*(x, y_1) as in (11). By construction, if we sample p^A_{z=a(x)} ~ P^A_{z=a(x)} and solve the attacker's problem, the resulting x' is distributed according to p_C(x'|x, y_1), which was the last ingredient required.

With these two sampling procedures in place, we generate samples from p_C(X|X' = x') with ABC techniques. This entails generating x ~ p_C(x) and a simulated attack x'' ~ p_C(x'|x), and accepting x whenever φ(x'', x') < δ, where φ is a distance function defined in the space of features and δ is a tolerance parameter. The x generated in this way is distributed approximately according to the desired p_C(x|x'). However, the probability of generating samples for which φ(x'', x') < δ decreases with the dimension of x'. One possible solution replaces x' by s(x'), a set of summary statistics that capture the relevant information in x'; then, the acceptance criterion would be replaced by φ(s(x''), s(x')) < δ. The choice of summary statistics is problem specific.
We sketch the complete ABC sampling procedure in Algorithm 3, to be integrated within Algorithm 2 at its draw-sample step.

Algorithm 3 ABC scheme to sample from p C (x|x ) within Algorithm 2
Input: Observed instance x', data models p_C(x), p_C(y|x), P^A_z, family of summary statistics s, tolerance δ. Output: A sample approximately distributed according to p_C(x|x').

repeat
    Sample x ~ p_C(x).
    Sample y ~ p_C(y|x); if y = y_2, set x'' = x; otherwise, sample p^A ~ P^A and set x'' = X*(x, y_1).
until φ(s(x''), s(x')) < δ
Output x.
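A minimal rejection sketch of this ABC step follows; the function names and signatures are assumptions for illustration, with the prior sampler standing for p_C(x) and the attack sampler for p_C(x'|x).

```python
import random

def abc_sample(x_obs, sample_prior, sample_attack, phi, s, delta,
               max_tries=100000):
    """ABC rejection: propose x ~ p_C(x), simulate x'' ~ p_C(x'|x), and
    accept x when the summaries of x'' and the observed x_obs are close."""
    for _ in range(max_tries):
        x = sample_prior()          # candidate originating instance
        x_fake = sample_attack(x)   # simulated attacked version
        if phi(s(x_fake), s(x_obs)) < delta:
            return x
    raise RuntimeError("no acceptance; consider enlarging delta")
```

As a toy usage, with a uniform prior over {0, ..., 9}, an attack that adds 1, and an observed x' = 5, the only acceptable originating instance is 4.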

Scalable Adversarial Classifiers
The approach in Section 4 performs all relevant inference about the adversary during operations. This could be too expensive computationally, especially in applications that require fast predictions based on large scale deep models as motivated by the following image processing problem.
Attacks to neural-based classifiers. Section 3.2 discussed adversarial examples. This kind of attack may severely harm the performance of neural networks, such as those used in image classification tasks [49]. It has been shown that even simple one-pixel attacks can seriously affect performance [39]. As an example, consider a relatively simple deep neural network (a multilayer perceptron) [21], trained to predict the handwritten digits in the MNIST dataset [50]. In particular, we used a 2 layer feed-forward neural network with relu activations and a final softmax layer to compute the predictions over the 10 classes; this requires simple extensions from binary to multi-class classification. This network accurately predicts 99% of the digits. Figure 5 provides ten MNIST original samples (top row) and the corresponding images (bottom row) perturbed through FGSM, which are misclassified. For example, the original 0 (first column) is classified as such; however, the perturbed one is not classified as a 0 (more specifically, it is classified as an 8) even if it looks like a 0 to the human eye. Globally, accuracy gets reduced to around 50% (see Figure 6a, curve NONE far right), a very important performance degradation due to the attack. Section 6.3 continues this example to evaluate the robustification procedure we introduce in Section 5.

The computational difficulties entailed by the approach in Section 4 stem from two issues:

• Iteration over the set X of potentially originating instances. If no assumptions about attacks are made, this set grows rapidly. For instance, in the spam detection example in Section 2.2, let n be the number of words in the dictionary considered by C to undertake the classification (54 in our spam case). If we assume that the attacker modifies at most one word, the size of X is O(n); if he modifies at most two words, it is O(n^2), and so on; if he modifies at most n words, the number of possible adversarial manipulations (and thus the size of X) is 2^n. Even more extremely, in the image processing example we would have to deal with very high-dimensional data (the MNIST images above consist of 28 × 28 pixels, each taking 256 possible values depending on the grey level). This renders any enumeration over the set X totally unfeasible.
In order to tackle this issue, constraints over the attacks could be adopted, for example based on distances.
• Sampling from p_C(x|x'). In turn, a huge X renders sampling from this distribution inefficient. For instance, the ABC scheme in Algorithm 3 could converge very slowly, requiring careful tuning of the tolerance δ. In addition, such an algorithm demands an unconditional generative model p_C(x) to draw realistic samples from the data distribution. This could be computationally demanding, especially when dealing with high-dimensional data.
Therefore, as the key adversary modelling steps are taken during operations, the approach could be inefficient in applications requiring fast predictions.
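For reference, the FGSM perturbation mentioned above admits a one-line sketch, given the gradient of the loss with respect to the input; computing that gradient depends on the model and is assumed available here.

```python
import numpy as np

def fgsm(x, grad_loss_x, alpha):
    """FGSM: move each pixel by alpha in the direction that increases the
    loss, clipping to the valid [0, 1] intensity range."""
    return np.clip(x + alpha * np.sign(grad_loss_x), 0.0, 1.0)
```

The attack intensity α controls how visible the perturbation is; for MNIST, small values already degrade accuracy substantially, as discussed above.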

Protecting Differentiable Classifiers
Alternatively, learning about the adversary could be undertaken during the training phase, as presented now. This provides faster predictions during operations and avoids the expensive step of sampling from p_C(x|x'). A relatively weak assumption is made to achieve this: the model can be expressed in parametric probabilistic terms p_C(y|x, β), in a form that is differentiable in the parameters β. Mainstream models such as logistic regression or neural networks satisfy this condition, which brings in several benefits. First, we can train the model using stochastic gradient descent (SGD) or any of its recent variants, such as Adam [51], allowing scaling to both wide and tall datasets. Then, from SGD we can obtain the posterior distribution of the model by adding a suitable noise term, as in SG-MCMC samplers like stochastic gradient Langevin dynamics (SGLD) [52] or accelerated variants [53].
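A single SGLD iterate can be sketched as below; this is a generic sketch of the sampler, with illustrative names, where the gradient of the log posterior may be a minibatch estimate.

```python
import numpy as np

def sgld_step(beta, grad_log_post, eta, rng):
    """One SGLD iterate: beta + (eta/2) * grad log posterior + sqrt(eta) * noise.
    Iterating yields (approximate) samples from the posterior over beta."""
    beta = np.asarray(beta, dtype=float)
    noise = rng.normal(size=beta.shape)
    return beta + 0.5 * eta * grad_log_post(beta) + np.sqrt(eta) * noise
```

For example, with grad_log_post(β) = −β (a standard normal posterior), iterating from any starting point drifts the chain towards the high-probability region while the injected noise keeps it exploring.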
We require only sampling attacked instances from p C (x |x), an attacker model. Depending on the type of data, this model can come through a discrete optimisation problem (as in the attacks of Section 2.2) or a continuous optimisation problem (in which we typically resort to gradient information to obtain the most harmful perturbation as in FGSM). These attacks require white-box access to the defender model, which, as mentioned, is usually unrealistic in security. We mitigate this problem and add realism by incorporating two sources of uncertainties.
1. Defender uncertainty over the attacker model p_C(x'|x). The attacker modifies data in the operation phase. The defender has access only to training data D; therefore, she will have to simulate the attacker's actions using such training set. Now, uncertainty can also come from the adversarial perturbation chosen. If the model is also differentiable wrt the input x (as with continuous data such as images or audio), instead of computing a single, optimal and deterministic perturbation, as in AT, we use SGLD to sample adversarial examples from regions of high adversarial loss, adding a noise term to generate uncertainty. Thus, we employ iterates of the form

x_{t+1} = x_t + ε ∇_x ℓ(β, x_t, y) + √(2ε) ξ_t,  ξ_t ~ N(0, I),   (14)

for t = 1, ..., T, where ε is a step size and ℓ denotes the loss. We can also consider uncertainty over the hyperparameters, say taking ε from a rescaled Beta distribution (since it is unreasonable to consider too high or too low learning rates) and the number T of iterations from a Poisson distribution.

2. Attacker uncertainty over the defender model p_C(y|x, β). It is reasonable to assume that the specific model architecture and parameters are unknown to the attacker. To reflect his uncertainty, he will instead perform attacks over a model p_C(y|x, β) with uncertainty over the values of the model parameters β, with continuous support. This can be implemented through scalable Bayesian approaches in deep models: the defended model is trained using SGLD, obtaining posterior samples via the iteration

β_{t+1} = β_t + (η/2) ∇_β [log p(β_t) + N log p_C(y|x, β_t)] + √η ξ_t,  ξ_t ~ N(0, I),   (15)

with η a learning rate, and x sampled either from the set D (untainted) or using an attacker model as in the previous point. We sample maintaining a 1:1 proportion of clean and attacked data.

The previous approaches incorporate some uncertainty about the attacker's elements to sample from p_C(x'|x). A full ARA sampling from this distribution can be performed as well.
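A sketch of the noisy gradient ascent over inputs just described is given below, assuming the input-gradient of the loss is available as a function; names are illustrative.

```python
import numpy as np

def sgld_adversarial(x0, grad_loss_x, eps, T, rng):
    """Sample a perturbed input from regions of high adversarial loss via
    SGLD-style iterates: x <- x + eps * grad + sqrt(2*eps) * noise."""
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        x = x + eps * grad_loss_x(x) + np.sqrt(2 * eps) * rng.normal(size=x.shape)
    return x
```

Unlike a deterministic attack, repeated calls yield different perturbations concentrated around high-loss regions, which is precisely the source of defender-side uncertainty exploited during robust training.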
Recall that p_C(x'|x) models the classifier's uncertainty about the attack output when A receives instance x, which stems from her lack of knowledge about A's utilities and probabilities. Samples from p_C(x'|x) could be obtained as in Section 4.2, enabling explicit modelling of that uncertainty.
Algorithm 4 describes how to generate samples incorporating both types of uncertainty previously described. On the whole, it uses the first source to generate perturbations to robustly train the defender's model based on the second source.

Algorithm 4 Large scale ARA-robust training for AC
Input: Defender model p_C(y|x, β), attacker model p_C(x'|x). Output: A set of K particles {β_k}_{k=1}^K approximating the posterior distribution of the defender model learnt using ARA training.

for t = 1 to T do
    Sample x'_1, ..., x'_K ~ p_C(x'|x) with (14) or the approach in Section 4.2 (depending on continuous or discrete data).
    Update each particle β_k with an SGLD step (15), using a 1:1 mixture of clean and attacked instances.
end for
Very importantly, its outcome tends to be more robust than what we could achieve with just AT protection, since we incorporate some level of adversarial uncertainty. In the end, we collect K posterior samples {β_k}_{k=1}^K and compute predictive probabilities for a new sample x via marginalisation, p_C(y|x) = (1/K) Σ_{k=1}^K p_C(y|x, β_k), using this predictive probability to robustly classify the received instance.
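The final marginalisation step is straightforward; in this sketch, the per-particle predictive p_C(y|x, β_k) is assumed available as a function.

```python
import numpy as np

def robust_predictive(x, particles, p_y_given_x_beta):
    """Average the predictive distribution over the K posterior particles:
    p_C(y|x) = (1/K) * sum_k p_C(y|x, beta_k)."""
    return np.mean([p_y_given_x_beta(x, b) for b in particles], axis=0)
```

The resulting ensemble prediction is typically smoother than any single particle's, which is one intuition behind the regularising effect observed in the experiments.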

Case Study
We use the spam classification dataset from Section 2.2 and the MNIST dataset to illustrate the methods in Sections 4 and 5. As shown in Section 2.2, simple attacks such as good/bad word insertions are sufficient to critically affect the performance of spam detection algorithms. The small perturbations from Section 5 show that unprotected image classification systems can easily be fooled by an adversary.

ARA Defense in Spam Detection Problems
For the first batch of experiments, we use the same classifiers as in Section 5 and, as dataset, again the UCI Spam Data Set from that section. Once the models to be defended are trained, we perform attacks over the instances in the test set, solving problem (11) for each test spam email, with the uncertainty removed, as it is not present from the adversary's point of view. He would, however, have uncertainty about p^A_{z=a(x)}, as this quantity depends on the defender's decision. We test our ARA approach to AC against a worst case attacker who knows the true value of p_C(y|x) and estimates p^A_{z=a(x)} through (1/M) Σ_{n=1}^M p_C(y_1|x_n) for a sample {x_n}_{n=1}^M from p*(x|x'), a uniform distribution over the set of all instances at distance less than or equal to 2 from the observed x', using φ(x, x') = Σ_{i=1}^{54} |x_i − x'_i| as distance function. We employ M = 40 samples.
To model the uncertainty about p^A_z, we use beta distributions centred at the attacker's probability values, with variances chosen to guarantee that the density is concave in its support: they must be bounded from above by min{µ²(1−µ)/(1+µ), µ(1−µ)²/(2−µ)}, where µ is the corresponding mean. We set the variance to be 10% of this upper bound, thus reflecting a moderate lack of knowledge about the attacker's probability judgements.
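Concretely, the beta shape parameters can be recovered from the mean and this variance rule as follows (a sketch; the bound ensures both shape parameters exceed 1, hence the density is concave):

```python
def concave_beta_params(mu, frac=0.1):
    """Beta(a, b) with mean mu and variance set to `frac` times the largest
    variance compatible with a concave density (a > 1 and b > 1)."""
    upper = min(mu**2 * (1 - mu) / (1 + mu), mu * (1 - mu)**2 / (2 - mu))
    var = frac * upper
    s = mu * (1 - mu) / var - 1   # a + b, from var = mu(1-mu)/(a+b+1)
    return mu * s, (1 - mu) * s
```

For example, µ = 0.5 yields the symmetric Beta(14.5, 14.5), a concave density concentrated around 0.5.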
Our implementations of Algorithms 2 and 3 use as summary statistics s the 11 most relevant features, with relevance based on permutation importance [54], assessing how accuracy decreases when the values of a particular feature are randomly permuted. The ABC tolerance δ is set at 1. Finally, for each instance in the test set, we generate 20 samples from p_C(x|x'). Table 2 compares the ARA performance on tainted data with that of the raw classifiers (from Table 1). As can be seen, our approach largely heals the performance degradation of the four original classifiers, showcasing the benefits of explicitly modelling the attacker's behaviour in adversarial environments. Interestingly, in the case of the naïve Bayes classifier, our ARA approach even outperforms the classifier under raw untainted data (Table 1). This effect has also been observed in [12,33] for other algorithms and application areas; it is likely due to the presence of an adversary having a regularising effect, improving the original accuracy of the base algorithm and making it more robust.

Robustified Classifiers in Spam Detection Problems
We next evaluate the scalable approach of Section 5 with the two differentiable models among the previous ones: logistic regression and the neural network (with two hidden layers). As required, both models can be trained using SGD plus noise to obtain uncertainty estimates from the posterior, as in (15). Next, we attack the clean test set using the procedure in Section 2.2 and evaluate the performance of our robustification proposal. Since we are dealing with discrete attacks, we cannot use the uncertainty over attacks as in (14) and just resort to adding the attacker's uncertainty over the defender model as in (15). To perform classification with this model, we evaluate the Bayesian predictive distribution using 5 posterior samples obtained after SGLD iterations. Results are in Table 3, which includes as entries: Acc. Unt. (accuracy over untainted data); Acc. Tai. (accuracy over tainted data); and Acc. Rob. Taint. (accuracy over tainted data after our robustification adding uncertainties). Note that the first two columns do not coincide with those in Table 1, as we have changed the optimisers to variants of SGD to be amenable to the robustified procedure in Section 5.1. Observe that the proposed robustification process indeed protects differentiable classifiers, recovering from the degraded performance under attacked data. Moreover, in this example, the robustified classifiers achieve even higher accuracies than those attained by the original classifiers over clean data, due to the regularising effect mentioned above.

Robustified Classifiers in Image Classification Problems
Next, we show the performance of the scalable approach continuing with the digit recognition example from Section 5. This batch of experiments aims to show that the ARA-inspired defence can also scale to high-dimensional feature spaces and multiclass problems. The network architecture is that of Section 5, trained using SGD with momentum (0.5) for 5 epochs, with a learning rate of 0.01 and batch size of 32. The training set includes 50,000 digit images, and we report results over a 10,000 digit test set. As for the uncertainties from Section 5.1, we use both kinds. Figure 6 shows the security evaluation curves [7] for three different defences (none, AT and ARA) under two attacks at test time, FGSM and PGD. Such curves depict the accuracy of the defender model (y-axis) under different attack intensities α (x-axis). Note how the uncertainties introduced by the ARA training method substantially improve the robustness of the neural network under both attacks, with a greater degree of robustness as the attack intensities increase. This suggests that the proposed robustification framework scales well to both tall and wide datasets and that the introduction of uncertainties by the ARA approach is highly beneficial to increase the robustness of the defended models.

Discussion
Adversarial classification aims at enhancing classifiers to achieve robustness in the presence of adversarial examples, as usually encountered in many security applications. The pioneering work of [6] framed most later approaches to AC within the standard game theoretic paradigm, in spite of the unrealistic common knowledge assumptions about shared beliefs and preferences required, actually questioned even by those authors. After reviewing such approaches and analysing their assumptions, we have presented two formal probabilistic approaches for AC based on ARA that mitigate these strong common knowledge assumptions. They are general, in the sense that application-specific assumptions are kept to a minimum. We have presented the framework in two different forms. In Section 4, learning about the adversary is performed in the operational phase, with variants for generative and discriminative classifiers; in Section 5, adversarial aspects are incorporated in the training phase. Depending on the particular application, one of the frameworks could be preferred over the other. The first one allows us to make real time inference about the adversary, as it explicitly models his decision making process at operation time; its adaptability is better, as it does not need to be retrained every time we modify the adversary model. However, this comes at a high computational cost. In applications in which there is a computational bottleneck, the second approach may be preferable, with possible changes in the adversary's behaviour incorporated via retraining. This tension between the need to robustify algorithms against attacks (training phase, Section 5) and the fast adaptivity of attackers against defences (operational phase, Section 4) is well exemplified in the phishing detection domain, as discussed in [2].
Our framework may be extended in several ways. First, we could adapt the proposed approach to situations in which there is repeated play of the AC game, introducing the possibility of learning the adversarial utilities and probabilities in a Bayesian way. Learning over opponent types has been explored with success in reinforcement learning scenarios [55]; this could be extended to the classification setting. Besides the exploratory attacks considered here, attacks over the training data [56] may be relevant in certain contexts. In addition, the extension to the case of attacks to innocent instances (not just integrity violation ones) seems feasible using the scalable framework. Finally, we have restricted our attention to deterministic attacks, that is, a*(x, y_1) always leads to the same x'; extending our framework to deal with stochastic attacks would entail modelling p(x'|a*, x, y_1).
Additional work should be undertaken concerning the algorithmic aspects. In our approach we go through a simulation stage to forecast attacks and an optimisation stage to determine optimal classification. The whole process might be performed through a single stage, possibly based on augmented probability simulation [57].
We have also shown how the robustification procedure from Section 5 can be an efficient way to protect large-scale models, such as those trained using first-order methods. It is well-known that Bayesian marginalisation improves generalisation capabilities of flexible models since the ensemble helps in better exploring the posterior parameter space [58]. Our experiments suggest that this holds also in the domain of adversarial robustness. Thus, bridging the gap between large scale Bayesian methods and Game Theory, as done in the ARA framework, suggests a powerful way to develop principled defences. To this end, strategies to more efficiently explore the highly complex, multimodal posterior distributions of neural models constitute another line of further work.
Lastly, several application areas could greatly benefit from protecting their underlying ML models. Spam detectors have been the running example in this article. Malware and phishing detection are two crucial cybersecurity problems in which the data distribution of computer programs is constantly changing, driven by the attackers' interest in evading detectors. Finally, the machine learning algorithms underlying autonomous driving systems need to be robustified against perturbations to their visual processing models, and this could be performed through our approaches.