Detecting and Isolating Adversarial Attacks Using Characteristics of the Surrogate Model Framework

: The paper introduces a novel framework for detecting adversarial attacks on machine learning models that classify tabular data. Its purpose is to provide a robust method for the monitoring and continuous auditing of machine learning models for the purpose of detecting malicious data alterations. The core of the framework is based on building machine learning classiﬁers for the detection of attacks and its type that operate on diagnostic attributes. These diagnostic attributes are obtained not from the original model, but from the surrogate model that has been created by observation of the original model inputs and outputs. The paper presents building blocks for the framework and tests its power for the detection and isolation of attacks in selected scenarios utilizing known attacks and public machine learning data sets. The obtained results pave the road for further experiments and the goal of developing classiﬁers that can be integrated into real-world scenarios, bolstering the robustness of machine learning applications.


Introduction
The rise of widespread usage of machine learning (ML) has changed many areas of our lives by allowing the automation of processes based on real-time analysis of vast amounts of data [1]. But, as the use of ML grows, so do concerns about its safety and trustworthiness. Historically, the approach to trustworthy AI has been characterized by efforts to promote transparency and interpretability of ML models, which involve understanding the decision-making processes of ML models, enabling humans to comprehend and trust their outputs [2]. Concurrently, studies have underscored the importance of robustness against adversarial attacks (AA) on ML models. In these scenarios, malevolent actors introduce intentional perturbations, aimed at causing unintended behavior of ML models [3,4]. As a response to these threats, a whole new field of adversarial attack detection and prevention has arisen [5][6][7].

Work Motivation
The motivation for our work is to verify the practical usefulness of a new method for detecting adversarial attacks on ML models, thus increasing the robustness of ML applications. The specific method used came from the realization that: • There are no well-established AA detection methods that attempt to analyze machine learning models during their operations that can work in a black-box scenario. • Rough sets theory-based methods [8,9] can be used to approximate the decision boundaries of a classifier model, thus allowing for the reconstruction of a model with no direct access, creating a surrogate model that can be then analyzed with full access to it. • Most AA detection techniques focus on image processing models [10], with tabular data techniques receiving attention only in recent years [11].
Our end goal is to create a robust framework that can be used in real-world applications for the continuous monitoring of machine learning models, thus increasing the safety and trustworthiness of machine learning applications in everyday scenarios.
Within this work, we assume a black-box scenario, which we define as a lack of access to internal knowledge of the machine learning model operations, neither its architecture nor its internal states. Instead, interested parties can only observe the inputs and outputs of machine learning model operations and need to drive their conclusions on the machine learning model characteristics from these observations. While this scenario is often assumed for the attacking party, defense mechanisms often assume full knowledge of the observed model, which is not true in many real-world scenarios.

Adversarial Attacks
Since the beginning of the adversarial machine learning (AML) research field [12,13], many attempts have been made to create taxonomies of attacks [10,14]. On the basic plane, adversarial machine learning attacks can be categorized broadly into several types based on various criteria such as the attacker's knowledge, the attacker's capabilities, and the attack's target.
The most commonly used types include the following. (1) White-Box vs. Black-Box Attacks: In a white-box attack, the adversary has complete knowledge of the model, including its architecture and parameters. This allows for the creation of sophisticated and highly effective adversarial examples. In contrast, a black-box attack assumes limited knowledge of the model, with the adversary potentially only having access to input-output pairs. The transferability of adversarial examples is often exploited in black-box attacks [4].
(2) Evasion vs. Poisoning Attacks: Evasion attacks occur at the test phase where adversarial examples are crafted to mislead the model's prediction. This is often achieved by adding carefully designed noise to the input [13]. In poisoning attacks, the training phase is targeted where the adversary manipulates the training data to compromise the learning process itself, leading to incorrect models being learned [15].
(3) Targeted vs. Non-Targeted Attacks: In a targeted attack, the goal of the adversary is to cause a specific misclassification, e.g., making a model classify a stop sign as a speed limit sign. Non-targeted attacks aim to cause any misclassification, without a specific target in mind [5]. (4) Exploratory vs. Causative Attacks: This classification overlaps somehow with the Evasion vs. Poisoning dichotomy, but it was historically the first classification type. In exploratory attacks, the adversary exploits a model's weaknesses after being trained. In contrast, causative attacks involve influencing the learning process to introduce specific vulnerabilities that can be exploited later [16].
Each type of attack requires different detection and defense strategies, emphasizing the importance of understanding the specific adversarial landscape when building robust machine learning models. In this work, we focus on detecting exploratory (evasion) attacks in black-box scenarios, targeting models processing tabular data [17][18][19].

Adversarial Defenses
Defenses against adversarial attacks can be divided into two groups: detection and preventive methods.
The main detection methods reported in the literature include: • Statistical Tests: These involve conducting statistical analyses on model inputs to identify irregularities or anomalies indicative of adversarial manipulation. For instance, a technique based on the L1-norm to detect adversarial attacks was proposed by Grosse et al. [20,21].
• Reconstruction-based Methods: These techniques rely on reconstructing the input and comparing it with the original input to identify adversarial perturbations. One approach is to use autoencoder-based methods for input reconstruction [22]. This technique can be part of a larger detection framework, such as [23].  [23].
Another stream of work is related to creating methods of implementing innate robustness against adversarial machine learning attacks into machine learning models themselves.
These methods include: •  [24]. • Certified Defenses: These methods provide mathematical guarantees of a model's robustness against adversarial attacks. Rather than solely relying on empirical assessments of model performance, these defenses offer a theoretical underpinning to ensure robustness, thus contributing to a more rigorous and reliable model defense. The core concept is to mathematically certify a region around each data point within which the model's prediction remains consistent, hence making it robust against adversarial perturbations [26].
We strive to enhance this list by presenting a new method useful in black-box scenarios on tabular data.

Methods
In this section, we presented the workflow of the whole analysis, the definition of analyzed adversarial attacks, and the detection algorithm. Figure 1 shows a diagram with the overall analysis workflow.  We started by training the machine learning model and obtaining predictions for the new data set using this model. Based on provided data and the training data set, approximator learning is conducted, which ends with diagnostic attributes calculations. Simultaneously, an adversarial attack on the data can be executed, which results in obtaining perturbed data set with model predictions. Utilization of trained approximator model allows us to calculate diagnostic attributes for new (perturbed) incoming data. The application of this workflow for many data sources and different attacks results in the acquisition of such information (original diagnostic attributes and perturbed diagnostic attributes), which, after aggregation, are used for training models for the detection and isolation of adversarial attacks.

Attacks Detected
We have tested the method against three different types of adversarial attacks: Hop-SkipJump, PermuteAttack, and ZOO. These attacks have been chosen based on the following common characteristics: • Academic renown of the original publication and subsequent publications exploring each specific attack; • Suitability to operate on tabular data; • Attempt to hide their operations-perturbances introduced are minimally required for a successful attack; • Having an available implementation code base.
The HopSkipJump attack, also known as the Decision-Based Boundary attack, is an adversarial attack that manipulates both the local and global aspects of the decision boundary of the model under attack, thus causing misclassification while minimizing the perturbation on the original input [17].
It is an iterative, decision-based attack, meaning that it only requires access to the model's output decisions (e.g., classification labels) rather than full access to the model's internal workings or gradients; it can function in a black-box environment where access to the model specifics is limited.
The attack algorithm consists of three main steps: • Initialization (Hop): A starting point for the adversarial example is identified, which lies on the opposite side of the decision boundary compared to the target input. • Binary Search (Skip): A binary search strategy is implemented to bring the adversarial example closer to the decision boundary without crossing it. • Gradient Estimation (Jump): By making small perturbations to the adversarial example and observing the model's outputs, an approximation of the gradient at the decision boundary is estimated. This gradient information is then utilized to make a controlled jump that slightly crosses the decision boundary, making the adversarial example more effective.
The PermuteAttack, described in [18], is an adversarial example generation method capable of handling tabular data including discrete and categorical variables. The method utilizes a gradient-free optimization approach founded on genetic algorithms, which applies permutations to a random selection of features while ensuring the resultant values adhere to acceptable data set ranges. As a result, the output contains counterfactual data points that, though altered from the original data points, can circumvent certain anomaly detection methods. The resulting adversarial examples can serve as analytical tools for assessing the robustness of the attacked model.
The Zeroth Order Optimization (ZOO) adversarial attack, introduced in [19], is a blackbox attack method that utilizes zeroth-order (derivative-free) optimization to construct adversarial examples. The ZOO attack is designed to be applicable even when the attacker only has access to the model's output (like probabilities of classes) and does not know about the model's architecture or parameters.
In the ZOO attack, the attacker starts with a legitimate sample and incrementally perturbs it by approximating the gradients of the loss function. These approximations are obtained through numerical methods such as finite difference methods, and they do not require access to the actual gradients of the loss function.
The attacker then uses the estimated gradients to iteratively modify the input, typically via a method similar to the Fast Gradient Sign Method (FGSM), a popular attack method in white-box settings. This process continues until the modified input misleads the model into making an incorrect prediction.
This attack method demonstrated that efficient and effective black-box attacks are possible, even without direct access to gradient information. As a result, it underscored the importance of considering black-box attack scenarios when developing defenses for machine learning models.

Approximator Learning
The paper [27] describes the detailed workflow associated with model approximation and diagnostic attributes; in this section we just briefly call the most important assumptions. The main idea underlying this approach is to create a surrogate model [28,29] based on predictions of the origin model M(x). We expect that a surrogate model will mimic the behavior of the original model M, which we do not have access to. This assumption is due to the business environment, in which access to the original model is difficult, while our method requires only access to the training data. In addition, an approach based solely on predictions is model-agnostic and universal. The quality of a surrogate model is monitored by the Cohen kappa score calculated between the original predictions and the approximated predictions, and it should be above 0.9 for good properties. To approximate these predictions, we use the rough sets theory [30] and create a surrogate model using an ensemble of approximate reducts AR [31]. An approximate reduct is an irreducible subset of attributes that is sufficient to express almost the same information about the decisions as the whole attribute set. In this context, we distinguish two data sets: diagnosed (new) D = (X D , d D ), which is used to create a surrogate model, and training R = (X R , d R ), which is used to train the original model. In our case, the decision values for which we construct the approximate reducts correspond to predictions of the diagnosed model, i.e., M(X D ), not the actual ground truth target values. The result of the above-mentioned steps is the trained approximator model M, composed of a set of approximate reducts R = {AR 1 , . . . , AR i , . . . , AR k } and approximated prediction for instances of data M(x). The construction of the approximator depends on the selection of two hyper-parameters: an representing the approximation threshold for reducts, and a number of reducts in the ensemble. Since the aim of the procedure is to find the appropriate approximation of the model's predictions, we use the grid search to tune the hyper-parameter settings. We also assume that there is access to the decision attribute (target) in the data set, which will be denoted as d. For d(x), we will denote its value for an instance x. Moreover, q M (x) is the estimation of decision class probabilities made by the diagnosed model M and q M (x) is the approximation of decision class probability predictions.
The next step is defining the neighborhood N(x) for each observation in the diagnosed data set. We want to identify instances that are similar with regard to predictions made by M, and we define a notion of neighborhood based on the prediction process of the model approximator M. In particular, the neighborhood N AR (x) for a diagnosed instance x with regard to a single reduct AR ∈ R is defined as a subset of instances from X R that belong to the same indiscernibility class, i.e., [x] AR . The final neighborhood is the sum of neighborhoods computed for all reducts in the ensemble. Thus, for each instance in the diagnosed data set, we have information about similar instances in the training data set. Then, it is possible to compare the distribution of targets, predictions, and approximate predictions between given observations in the diagnosed data set and similar observations from the training data set, which were chosen for the neighborhood. Such a defined neighborhood for a given observation from a diagnosed data set in extreme cases can contain all observations from the training data set or none of them. The neighborhood is the basis for calculating diagnostic attributes, which are listed below: This attribute is calculated as 1 where d x is the ground truth target class of x , and q d x is the approximation of its probability. • Targets diversity in the neighborhood-measures the diversity of targets in the neighborhood of the diagnosed instance in comparison to the diversity of targets calculated on the whole diagnosed data set. It is calculated as h(p, p(N(x))), where p is the prior probability distribution of decision classes and p(N(x)) is the distribution of decision classes in the neighborhood of x. • Approximations diversity in the neighborhood-measures the diversity of approximations in the neighborhood of the diagnosed instance in comparison to the diversity of approximations calculated on the whole diagnosed data set. It refers to h(p, q(N(x))), where q(N(x)) is the distribution of the approximated predictions in N(x). • Uncertainty-we define a normalized entropy of a classification distribution q(x) as , where l is a number of classes. In a case when the prior is not uniform, we need to transform it by scaling the simplex space. Let us take the distribution norm of q(x) with respect to p as |q( . It measures the distribution q(x) with the units of the prior distribution. Let us define a p-simplex off-centering as [OC p (q i (x))] i = q i (x) Then we obtain the prior-offcentered normalized entropy OCE p (q(x)) = H(OC p (q(x))), which, for the sake of simplicity, we will denote as Unc(x), which is the prediction uncertainty of model M calculated for a diagnosed instance x.
• Neighborhood size-the number of instances in the neighborhood of the diagnosed instance.
Diagnostic attributes are a standardized format of comparing and diagnosing diverse data sets regardless of the number of classes or attributes.

Experiments
This section presents the results of the conducted study, in which we applied the workflow presented in Figure 1 to a variety of real data sets. Firstly, three different model architectures were fitted to each data set, and then three different adversarial attacks were conducted. Based on the obtained predictions, we prepared surrogate models and calculated the diagnostics attributes. In the next step, they were used to prepare two kinds of classifiers: for attack detection, where we wanted to predict the occurrence of an adversarial attack without distinguishing its type, and for attack isolation, where we wanted to predict also the type of the conducted adversarial attack.

Data Preparation
We obtained 22 data sets with classification tasks from OpenML. The list of data sets with basic characteristics is presented in Table 1. Each data set was divided into training and diagnostic parts, with the diagnosis data set including at least 100 observations. We fit a logistic regression model, a support vector machine, and XGBoost to each data set. Following that, three adversarial attacks were launched against the diagnosed part of each data set. All calculations were conducted in Python using the Adversarial Robustness Toolbox package [32] and Permute Attack repository [18].
In the next step, we calculated diagnostic attributes for each data set and attack based on proper surrogate models. To create the surrogate models, we utilized an ensemble of 2500 approximate reducts with equal to 0.05. Figure 2 shows the distribution of obtained Cohen kappa scores between the original prediction and the approximated predictions made by the surrogate model; for every data set and model type, the quality condition of the surrogate model was met. All scores are above 0.9 and the median of these values is equal to 1. The result of approximator learning is that the data set contains values of eight diagnostic attributes for each instance of the diagnosis data set and attack. To provide suitable data for training a classifier, we aggregated these attributes within cross-sections defined by data set, model, and attack. For each diagnostic attribute, a set of descriptive statistics was calculated: mean, minimum, maximum, lower quartile, median, upper quartile, and range. For those aggregates, we also added some data set characteristics such as the number of observations, the number of classes in the classification problem, and the balanced accuracy calculated based on predictions and targets. There were a total of 59 features analyzed. Table 2 presents the first 15 rows and 7 columns of the aggregated data set utilized for attack detection and isolation models estimation. For the attack detection problem, column attack was replaced by a binary variable equal to 0 for org value, meaning lack of attack, and 1 otherwise. .00 · · · · · · · · · · · · · · · · · · · · · · · · · · ·

Scenarios
For assessment of the effectiveness of diagnostic attributes in attack detection and isolation, we designed a set of experiments described in Table 3. Table 3. Considered scenarios.

Scenario Name
Description Application 10-fold cross-validation Cross-validation stratified by variable of interest detection, isolation one-data-set-out One data set is treated as a test data set and the rest of data is a training data set detection, isolation one-model-out Data for one model type is treated as a test data set and the rest of data is a training data set detection, isolation one-attack-out Data for one attack type is treated as a test data set and the rest of data is a training data set detection The first column of Table 3 contains the scenario name, the second provides a short description of the main idea behind the given scenario, and the third column presents information when the scenario is applicable. Each scenario except the last will be realized for both types of task-detection of attack and isolation of attack type. The last scenario is applicable only for detection because we aim to teach a model to recognize attacks, whose characteristics were not included in the training data. In each scenario, we trained random forest and XGBoost models with hyper-parameters optimization.

Attack Detection
In the first experiment, we aimed to train classifiers that distinguish only two classesno attack versus any attack. Table 4 presents the results of the trained models. Regardless of model architecture, it is possible to identify an attack using diagnostic attributes. The highest balanced accuracy is obtained for a one-attack-out scenario in both classifiers. It shows that we are able to recognize the fact of attack even for previously unseen attacks. The lowest performance is observed for a one-data-set-out scenario with a random forest classifier; for a new data set, it was more difficult to identify the attack than e.g., a new model type.
For each scenario and classifier, we extracted feature importance and created the ranking of these variables-higher rank means higher importance. We calculated the correlation coefficients between these ranks and scenarios. The results show that the order of features is highly correlated within classifier groups. For example, the correlation between the feature importance ranks between 10-fold cross-validation and one-data-setout scenarios for the random forest classifier is equal to 0.97. The correlation coefficient for XGB and RF is rather small, varying from 0.19 to 0.30. It shows that completely different variables are significant in attack identification for analyzed classifiers. Figure 3 shows the top 20 most important variables for each classifier.  The main role in both cases has a balanced accuracy calculated on the test data set. The second and third places contain an uncertainty diagnostic attribute for the random forest classifier, while for XGBoost there is the neighborhood size diagnostic attribute. The most frequent variable in the top 20 for RF is prediction consistency with targets in neighborhood, while for XGB it is neighborhood size. It shows that each diagnostic attribute contains valuable information that helps distinguish the occurrence of an attack in data.

Attack Isolation
In another experiment, we trained random forest and XGBoost classifiers to distinguish attack types. The decision variable has four classes-no attack, the HopSkipJump attack, the PermuteAttack, and the ZOO attack. Table 5 shows the results of each scenario. We can see that attack isolation has poorer performance than attack detection. Using the XGBoost model architecture, we obtain higher balanced accuracy values than for random forest. The highest values are observed for the one-data-set-out scenario. This suggests that distinguishing the attack type for the new data set is easier than for the new model type.
The analysis of feature importance ranks shows that, similar to those obtained for attack detection, they are highly correlated within the classifier type. The correlation coefficient between scenarios for the same classifier varies from 0.67 to 0.99. However, correlation values between the results of scenarios realized using RF and XGB are slightly higher than those presented in the previous section and are in the range of 0.41-0.51. Figure 4 presents the mean rank for the top 20 features used in analyzed classifiers.

Target consistency with targets in neighborhood Q75
Target consistency with approximations in neighborhood Q75

Approximations diversity in neighborhood Q50
Prediction consistency with targets in neighborhood Q25  In the case of attack detection as well, the most important feature is balanced accuracy, calculated on the attacked data set. In subsequent places, we can observe variables connected with uncertainty measures. In the presented top 20 most influential features, the random forest classifier uses statistics calculated based on five diagnostic attributes, while the XGBoost classifier uses eight diagnostic attributes.

Conclusions and Future Works
In this paper, we tested two ML models' architecture to build classifiers for adversarial attack detection and isolation. We designed a set of scenarios that utilize a leave-one-out and cross-validation methodology to investigate whether it is possible to distinguish the occurrence of the attack and the attack type based on diagnostic attributes.
The results show that the method works and is suitable for attack detection-it is possible to build a classifier, operating on diagnostic attributes, that can detect adversarial attacks with a balanced accuracy greater than 0.9. This method works also for detecting attacks that were not included in the training data set.
For attack isolation, we obtained poorer performance of classifiers-the balanced accuracy varied from 0.66 to 0.80. The conducted research indicated the most important features used by trained classifiers: the balanced accuracy of the diagnosed data set, uncertainty, and neighborhood size.
Future works can be divided into three distinctive work streams: • Increasing the quality of classifiers, by training them on larger representations of known attack methods and data sets. New data sets and new adversarial examples will become available as a direct result of implementation activities performed by the QED Software company and will focus on real-life examples of attacks and data in attack-prone environments. • Testing different hierarchical approaches to the classifier construction-chaining attack detection classifiers with attack identification ones. • Combining the resulting classifiers with external input sources explaining changes in data characteristics. This includes methods for concept drift detection, anomaly detection, and expert reasoning.