We need to design a threshold learning method that mitigates the bias induced in the threshold estimate. In typical linear regression, the empirical loss is the sum, over all training data points, of a per-point loss applied to the regression error, i.e., the difference between the i-th training data point (the RUC in our case) and a given candidate estimate; this difference is formally called the residual or regression error. A poisoning attack changes the magnitude of the poisoned training points, which in turn perturbs the regression errors and thereby biases the threshold learned through regression. To achieve the above, we first provide theoretical insight into robust learning under poisoning attacks, including M-estimation. Second, we justify the choice of M-estimators by demonstrating their theoretical relationship with poisoning attack strength. Third, we propose a robust algorithm for learning thresholds from the latent space.
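As a point of reference, the generic M-estimation form of this empirical loss, written with illustrative symbols (x_i for the i-th RUC training sample, θ for the candidate estimate, ρ for the per-point loss; the paper's own notation may differ), is sketched below:

```latex
% Illustrative notation only; the paper's own symbols may differ.
% e_i is the regression error of training point x_i against candidate estimate \theta,
% and \rho(\cdot) is the per-point loss (squared error in ordinary least squares).
e_i = x_i - \theta,
\qquad
L(\theta) = \sum_{i=1}^{N} \rho(e_i),
\qquad
\hat{\theta} = \arg\min_{\theta} L(\theta).
```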
5.1. Theoretical Intuition on Robust Learning Estimator
In the subsequent analysis, we explain how to determine a suitable loss function tailored to the specific characteristics of the attack, and the rationale behind it. During the training phase, attackers can inject random attacks with different values of the attack strength and the attack scale. This manifests as varying degrees of perturbation in the latent space. These different poisoning attacks lead to changes in the regression errors of the points in the latent space. The perturbation in the latent space depends on the product of the margin of the false data (the attack strength) and the attack scale, resulting in a specific perturbation value in the latent space, as shown in Section 4.2. In general, increasing either quantity leads to a larger perturbation. The distribution of regression-error samples, contaminated by a given margin, is defined as follows:
where the first component refers to a normal (or ideal) distribution, typically representing the assumed parametric model of the regression errors, and G represents an arbitrary symmetric distribution for the contaminated errors that captures deviations from the normal model. The contamination margin significantly influences the perturbed latent values, which in turn result in perturbed regression errors, introducing variations and deviations from the assumed parametric model (i.e., the normal distribution in regression). This influence is reflected in the distribution G, which captures the contaminated errors. As the margin increases, the contamination in the regression errors becomes more pronounced. On the other hand, the complementary weight has a notably smaller influence: it accounts for the portion of perturbed errors that still follows the normal distribution, i.e., the portion of training data that remains unpoisoned. Although present, these errors have a relatively minor impact compared to those captured by the distribution G.
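For reference, this construction follows the standard gross-error (ε-contamination) model; with illustrative symbols (ε for the contamination margin, Φ for the nominal normal model, G for the arbitrary symmetric contaminating distribution), it reads:

```latex
% Standard gross-error (epsilon-contamination) mixture; symbols are illustrative.
F(e) = (1 - \varepsilon)\,\Phi(e) + \varepsilon\,G(e), \qquad 0 \le \varepsilon < 1 .
```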
Small Strength Poisoning Attacks: Figure 2 depicts the two probability distributions originating from the two distinct sets (poisoned and non-poisoned) of regression errors. The yellow line in Figure 2 corresponds to the perturbed regression errors arising from the poisoned latent values represented by G. Specifically, the poisoning attack we implemented had a very small attack strength while the attack scale was kept fixed. After performing a Kolmogorov–Smirnov test [17,18], we found that the contaminated distribution G most closely fits a double exponential distribution. However, note that the above holds only when the attack strength is relatively small, given the same value of the attack scale.
In contrast, the green line in Figure 2 represents the distribution of benign regression errors originating from the benign latent values. The Kolmogorov–Smirnov test revealed that this distribution most closely fits the Gaussian distribution.
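As an illustration of this step, the sketch below (not the authors' code; variable names are ours, and the candidate families follow the K-S findings reported above) shows how one could fit each candidate family to a set of regression errors and compare the resulting Kolmogorov–Smirnov statistics:

```python
# A minimal sketch, assuming the regression errors are available as a 1-D array.
# Smaller KS statistic = closer fit to the fitted parametric family.
import numpy as np
from scipy import stats

def best_fit_family(errors):
    """Fit candidate families to the errors and return their KS statistics."""
    results = {}
    # Gaussian: the assumed model for benign regression errors
    loc, scale = stats.norm.fit(errors)
    results["normal"] = stats.kstest(errors, "norm", args=(loc, scale)).statistic
    # Double exponential (Laplace): observed for small-strength poisoning
    loc, scale = stats.laplace.fit(errors)
    results["laplace"] = stats.kstest(errors, "laplace", args=(loc, scale)).statistic
    # Student t: observed for large-strength poisoning
    df, loc, scale = stats.t.fit(errors)
    results["student_t"] = stats.kstest(errors, "t", args=(df, loc, scale)).statistic
    return results

# Example with synthetic errors; in the paper these come from the RUC latent space.
rng = np.random.default_rng(0)
benign_errors = rng.normal(0.0, 1.0, size=500)
print(best_fit_family(benign_errors))
```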
This distribution P is a composite distribution in which the central region is approximated by a mixture of normal distributions, while the tail section resembles a double exponential distribution (G). The compound distribution also depends on the scale parameter of the input regression errors and on a constant that acts as a separating point: points smaller than this separating point come from the unperturbed true Gaussian distribution, while points greater than it follow a double exponential distribution. As mentioned in [19], in the most general case, guessing the least favorable distribution allows for hypothesis testing on compound distributions, as in our case. Hence, we can rewrite Equation (7) as a least favorable distribution as follows:
where the location and scale parameters are those of the least favorable distribution of the regression errors, the first part represents the pdf of a normal distribution, and the second part represents the pdf of a double exponential distribution. Without loss of generality, Equation (8) is the density function of the least favorable distribution.
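For reference, the classical form of Huber's least favorable density, written here with illustrative symbols (ε for the contamination margin, σ for the scale, and k for the point separating the Gaussian center from the double exponential tails; the paper's notation may differ), is:

```latex
% Huber's least favorable density (standard form); symbols and constants are illustrative.
f(e) =
\begin{cases}
\dfrac{1-\varepsilon}{\sqrt{2\pi}\,\sigma}\,
  \exp\!\left(-\dfrac{e^{2}}{2\sigma^{2}}\right), & |e| \le k\sigma, \\[1.2em]
\dfrac{1-\varepsilon}{\sqrt{2\pi}\,\sigma}\,
  \exp\!\left(\dfrac{k^{2}}{2} - \dfrac{k\,|e|}{\sigma}\right), & |e| > k\sigma .
\end{cases}
```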
Following the theory of least favorable distributions, the pdf is expressed not in terms of the raw regression error, but in terms of a standardized input. An estimator is known as "minimax" when it performs best in the worst-case scenario, which is exactly what is needed under poisoning attacks. A minimax estimator should be a Bayes estimator with respect to a least favorable distribution [20]. To demonstrate this, we use Equation (8) to find the best estimator under small poisoning attacks.
In maximum likelihood estimation (MLE), the likelihood function is taken over all N data points. Writing the density of Equation (8) as a function of the location and scale parameters, the likelihood of the least favorable distribution is given by
We know that maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function. With some algebra and derivation, it can be shown that the negative natural logarithm of Equation (9) is given by
In general, the parameter estimate via MLE is found by solving the following empirical risk minimization problem:
In generalized MLE, the estimated location parameter of the least favorable distribution in Equation (8) minimizes the negative log-likelihood function. Therefore, minimizing the negative log-likelihood is equivalent to setting its derivative with respect to the location parameter equal to zero. Through algebraic derivation (see Appendix A), it can be shown that this derivative equals the Influence Function (IF) of the least favorable distribution.
Influence Functions describe the rate at which a specific loss function (used for machine learning) changes with perturbations in the input training data. Since first-order derivatives capture the rate of change of a function with respect to one of its inputs, the IF is, in our case, the first-order partial derivative of the relevant loss function with respect to the training input of the regression model. Consequently, if we can prove that, under poisoning attacks of varying strengths, the Influence Function values of a certain estimator are on average lower than those of other estimators, then that estimator is theoretically the best choice regardless of the application. Following Appendix A, by integrating this Influence Function (equivalently, the summation in Equation (11)), we obtain the following result:
The above equation, in fact, matches the Huber estimator [21] from the family of M-estimators [22]. Thus, we have shown that the Huber estimator is the optimal M-estimator for the location parameter of the least favorable distribution when the contamination stems from smaller-strength poisoning attacks, and it should be used if the poisoning attacks are known to be small in strength.
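For completeness, the standard Huber loss and its first derivative (the score, or influence-shaping, function), with k an illustrative tuning constant, are:

```latex
% Standard Huber loss and score function; k is a tuning constant
% (a common illustrative choice is k ~ 1.345 times the error scale).
\rho_{H}(e) =
\begin{cases}
\tfrac{1}{2} e^{2}, & |e| \le k, \\
k\,|e| - \tfrac{1}{2}k^{2}, & |e| > k,
\end{cases}
\qquad
\psi_{H}(e) = \rho_{H}'(e) =
\begin{cases}
e, & |e| \le k, \\
k\,\operatorname{sign}(e), & |e| > k .
\end{cases}
```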
Large Strength Poisoning Attacks: Figure 3 shows a contaminated distribution of regression errors resulting from a large data poisoning attack that we implemented with an attack strength seven times larger than that of the small poisoning attack, compared to the benign regression-error distribution.
In Figure 3, the green line depicts the distribution of regression errors corresponding to the benign portion of the training dataset; it closely approximates a normal distribution, as verified via the K-S test. On the other hand, the yellow line depicts the distribution of regression errors corresponding to the attacked portion of the training dataset from the RUC latent space, where the poisoning attack has a large strength. Performing a K-S test on the yellow perturbed distribution, we found that it fits closely with the Student t-distribution. This disparity requires us to treat the variability introduced by larger attacks, and the corresponding regression-error distributions, differently from the case of smaller poisoning attacks.
In the least favorable distribution, the two components of the compound distribution, the normal core and G, are distinctly separated by the separating point; i.e., each exhibits a distinctly different shape. In contrast, the parametric form of the t-distribution seamlessly blends both components into a single distribution with heavy tails [17,23].
The prominent impact of this larger poisoning attack strength is the emergence of a different type of contaminated distribution, represented in Equation (13). We can represent the set of contaminated distributions as an inclusive range arising from various attack types, each introducing a different level of contamination, ranging from small to large perturbations. This set is defined as follows:
The probability density function (pdf) of the Student t-distribution, with an auxiliary scale correction, can be expressed as follows:

Here, the quantities appearing in the density are the gamma function, the degrees of freedom, the regression error, and the scale parameter of the t-distribution. Thus, the likelihood function for the t-distribution, considering all N training data points, is
Since maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function, taking the negative natural logarithm of Equation (15) yields
In general, the parameter estimate via MLE, just like in the earlier case, is found by solving the following empirical risk minimization problem:
In our problem, we deal with a single statistically variable parameter, and we sample residual values within a window. In the t-distribution, smaller values of the degrees of freedom result in heavier tails, while larger values make it resemble a Gaussian distribution. Considering these factors, choosing a smaller value for the degrees of freedom is appropriate for larger attacks, as shown in Figure 3. Hence, substituting this choice into Equation (16), and using the algebra and derivation shown in Appendix B, taking the derivative of Equation (16) with respect to the location parameter and then integrating, we obtain the following solution according to M-estimation theory:
The above equation, in fact, matches the so-called Cauchy–Lorentz Loss Function from the class of M-estimator functions, which explains why we choose this loss function for resilient threshold learning [
22].
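For completeness, the standard Cauchy–Lorentz loss and its derivative, with c an illustrative scale constant, are given below; note that this loss is, up to constants, the negative log-likelihood of a Student t-distribution with one degree of freedom, consistent with choosing a small number of degrees of freedom for large attacks:

```latex
% Standard Cauchy-Lorentz (Lorentzian) loss and score function; c is an illustrative scale constant.
\rho_{C}(e) = \frac{c^{2}}{2}\,\ln\!\left(1 + \frac{e^{2}}{c^{2}}\right),
\qquad
\psi_{C}(e) = \rho_{C}'(e) = \frac{e}{1 + e^{2}/c^{2}} .
```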
Significance: The significance of the above analysis is that mitigation via resilient learning in anomaly detection can be carried out by implementing different kinds of attack strategies, inspecting the nature of the resulting contaminated distribution, and deriving the corresponding optimal estimator that produces the minimax estimate under such attacks.
5.2. Robust Learning Algorithms for Detection Threshold
While the previous section was dedicated to theoretically identifying the best loss function for learning under different types of attacks, this section uses the identified estimators to construct resilient threshold learning algorithms for time series anomaly detectors that remain robust in the presence of data poisoning attacks during training. Specifically, we combine the benefits of quantile-weighted regression with the robust Cauchy and Huber loss functions for learning the thresholds.
This results in two new threshold learning algorithms: (a) Quantile-Weighted Regression with Huber Loss (Algorithm 2), and (b) Quantile-Weighted Regression with Cauchy Loss (Algorithm 3). In our experiments, we compare these with classical (unweighted/non-quantile) regression with Cauchy loss and unweighted regression with Huber loss, with pseudo-code provided in the preliminary version of this work [
9].
Algorithm 2: Quantile-Weighted Regression with Huber Loss
For the algorithms, we provide the pseudocode for learning the upper threshold, since the same process applies to learning the lower threshold; the algorithm remains identical except that it runs with the corresponding inputs and quantile parameter.

For optimal learning of the upper threshold, the inputs are the training samples and a search space of candidate parameter values for each algorithm. We begin by explaining Algorithm 2, which uses the Huber loss and a modified quantile-weighted regression (to handle heteroskedasticity). The regression error is defined as the difference between a training sample and the candidate threshold. Quantile regression applies different weights to the regression error depending on whether the candidate parameter fit is greater or less than the input point. This approach is necessary to reduce false alarms, since the distribution of samples is not Gaussian, which violates the assumptions of ordinary least squares regression.

We select a candidate for the upper threshold and compare it against each point in the training set. If the candidate exceeds the point, we assign the regression error one weight and calculate the corresponding weighted regression error; otherwise, we assign it the complementary weight. To embed the Huber estimator, each weighted error is then mapped to a loss value according to the Huber loss, depending on whether the weighted error is smaller or larger than the Huber tuning constant. This procedure yields a loss value for each training data point for a fixed candidate.

The empirical loss for a single candidate is the sum of these per-point loss values over all entries in the training set. Each candidate generates a unique empirical loss, and the final optimal estimate is the candidate that minimizes it, as shown in the last line of Algorithm 2.
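The sketch below illustrates this procedure; it is not the paper's exact pseudocode, and the names (q for the quantile weight, k for the Huber constant, candidates for the search grid) are illustrative choices of ours:

```python
# A minimal sketch of Algorithm 2-style learning: quantile-weighted regression
# with the Huber loss, solved by grid search over candidate upper thresholds.
import numpy as np

def huber_loss(e, k=1.345):
    """Standard Huber loss applied to a (weighted) regression error e."""
    a = np.abs(e)
    return np.where(a <= k, 0.5 * e**2, k * a - 0.5 * k**2)

def learn_upper_threshold(samples, candidates, q=0.95, k=1.345):
    """Return the candidate threshold minimizing the quantile-weighted Huber loss."""
    samples = np.asarray(samples)
    best_tau, best_loss = None, np.inf
    for tau in candidates:
        e = samples - tau                      # regression errors for this candidate
        # Asymmetric (quantile) weights: errors on one side of the candidate are
        # penalized by q, the other side by (1 - q), pushing tau toward an upper quantile.
        w = np.where(e > 0, q, 1.0 - q)
        loss = np.sum(huber_loss(w * e, k))    # empirical loss for this candidate
        if loss < best_loss:
            best_tau, best_loss = tau, loss
    return best_tau

# Usage with synthetic latent-space samples (illustrative only).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1000)
grid = np.linspace(x.min(), x.max(), 200)
print(learn_upper_threshold(x, grid, q=0.95))
```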
In a similar manner, Algorithm 3 uses the Cauchy loss instead of the Huber loss, while keeping everything else the same. The only difference is how the weighted regression error is transformed into a loss value by the Cauchy loss. The resulting optimal solution indicates that the Cauchy loss function was used for learning the upper threshold.
Algorithm 3: Quantile-Weighted Regression with Cauchy Loss
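A corresponding sketch for the Cauchy variant is given below; it mirrors the Huber sketch above, only the mapping from weighted error to loss changes, and the scale constant c is an illustrative value rather than the paper's:

```python
# A minimal sketch of the Algorithm 3-style variant with the Cauchy-Lorentz loss.
import numpy as np

def cauchy_loss(e, c=2.385):
    """Cauchy-Lorentz loss applied to a (weighted) regression error e."""
    return 0.5 * c**2 * np.log1p((e / c) ** 2)

def learn_upper_threshold_cauchy(samples, candidates, q=0.95, c=2.385):
    """Grid search for the candidate minimizing the quantile-weighted Cauchy loss."""
    samples = np.asarray(samples)
    candidates = np.asarray(candidates)
    e = samples[None, :] - candidates[:, None]     # errors, one row per candidate
    w = np.where(e > 0, q, 1.0 - q)                # asymmetric quantile weights
    losses = cauchy_loss(w * e, c).sum(axis=1)     # empirical loss per candidate
    return candidates[np.argmin(losses)]
```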
When no asymmetric weights are applied to the regression errors, Algorithms 2 and 3 reduce to their classical (unweighted) counterparts for the Huber and Cauchy losses, respectively, using similar logic. In each case, we learn only the bias parameter without a slope, since we want a fixed, time-independent threshold for flagging an anomaly.
5.4. Theoretical Results and Explainability
This section is dedicated to proving the correctness of the experimental results. Papers often implement a robust learning approach and report its performance, but that alone does not establish the correctness of the experimental observations, because experiments and code may contain biases or bugs that alter the conclusions. It is therefore important to provide theoretical proofs that corroborate the experiments and their conclusions.
In the previous section, we showed the relationship between poisoning attacks and their influence on the distribution of regression errors, and how this should inform the choice of loss function. Now, we theoretically prove which loss function provides better mitigation, and why, in the presence of poisoning attacks with varying degrees of perturbation; this addresses an important aspect of machine learning, namely explainability. In the experimental section, we show that the theoretical results match the experimental results, which corroborates the correctness of the experimental findings.
Consider N regression error data points and let the hyperparameter of the loss function act as a cutoff. The regression error data points are classified into two categories based on this hyperparameter: among the N data points, those that exceed the cutoff are categorized as "large errors", while those that fall between zero and the cutoff are categorized as "small errors".
Consequently, we can express this situation as follows: out of the N data points, m have regression errors greater than the cutoff, and the remaining N − m data points have errors within the interval from zero to the cutoff. Thus, a perturbed error is either greater than the cutoff or lies within the range from zero to the cutoff. It is important to observe that, in the context of RSL attacks, adversaries introduce perturbations of varying strength, either small or large, into the training data points. This leads to distinct sets of regression errors following different distributions, as described by Equation (13). In reality, knowing the type and characteristics of an attack a priori is not straightforward, so the choice of the best loss function remains unclear. To compare the performance of different loss functions, we present an analysis based on Influence Functions (IFs) and mathematical induction.
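In illustrative notation (e_i for the regression errors and k for the loss hyperparameter acting as the cutoff; the paper's own symbols may differ), this partition reads:

```latex
% Illustrative notation only: the N errors split into m large and N-m small errors.
\underbrace{\{\, e_i : |e_i| > k \,\}}_{m \ \text{large errors}}
\;\cup\;
\underbrace{\{\, e_i : 0 \le |e_i| \le k \,\}}_{N-m \ \text{small errors}},
\qquad i = 1, \dots, N .
```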
Comparing the Influence Functions (IFs) of the Cauchy and Huber estimators, the estimator with the smaller IF under perturbations is less affected by attacks and thus more robust. To maintain broad applicability, we make an initial assumption about the regression error data points, which we then progressively extend and validate.
Suppose that, for positive regression errors and positive loss hyperparameters, the following inequality holds true:

then the IF of the Cauchy estimator is smaller than that of the Huber estimator and, hence, Cauchy provides more mitigation than Huber. In contrast, if the inequality is reversed, the Huber loss provides more mitigation than Cauchy. This comparison indicates that the effectiveness of these robust loss functions in mitigating the influence of outliers depends on their relative Influence Functions.
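The intuition behind this comparison can be seen from the standard score (influence-shaping) functions of the two estimators, written with the illustrative symbols used earlier:

```latex
% Standard score functions of the two estimators; symbols are illustrative.
\psi_{C}(e) = \frac{e}{1 + e^{2}/c^{2}},
\qquad
\psi_{H}(e) = \min(|e|, k)\,\operatorname{sign}(e).
% For |e| much larger than c the Cauchy score redescends toward zero, whereas the
% Huber score stays at the constant magnitude k; for small |e| both are roughly linear in e.
```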
This assumption states that any perturbation in an individual data point has a smaller effect on the Influence Function (IF) of Cauchy than on that of Huber. In Equation (20), we can distinguish between two categories of regression errors, small and large. This differentiation can be expressed as follows:
If the regression error exceeds the cutoff, i.e., a big residual:

If the regression error lies within the cutoff, i.e., a small residual:
We employ a direct inductive proof in this analysis. We start by examining a single data point and determine whether the inequality in Equation (20) holds. Then, we generalize it to multiple points based on the conditions of our problem.
Consider a single positive regression error. Depending on whether it is large or small, we provide a direct proof using Equations (21) and (22); a positive error can fall into either category, large or small.
Case 1 considers the scenario where the residual is large. We begin by proving the inequality for a single point using Equation (21), and then generalize the argument to multiple points. Case 2 addresses the case of small errors. We similarly start with a single point and extend the proof to multiple points using Equation (22). Finally, Case 3 considers the setting where both large and small errors occur simultaneously (which is the practical case) and proves which influence function yields a lower value. Thus, the complete set of cases is as follows:
Case 1: a big regression error, i.e., the error exceeds the cutoff. Then:

The left-hand side is always negative under the stated conditions, while the right-hand side is positive. Thus, our assumption (that the IF of Cauchy is smaller than that of Huber) holds for large regression errors.
For the case of a large regression error, we show through proof by contradiction that the opposite statement (that the IF of Cauchy is at least that of Huber) is false.

However, the last inequality is logically impossible, because a squared term cannot be smaller than a negative quantity. Assuming the opposite therefore leads to a mathematical contradiction. Thus, the assumed inequality holds for large regression errors, and we have proved that the Cauchy loss function mitigates the impact of poisoning more effectively than the Huber loss for the bigger residuals that occur under poisoning.
Generalization to Multiple Big Regression Errors: Now, suppose there are a total of m large errors, each exceeding the cutoff. Given these m large errors, we have m inequalities of the form
Applying the generalized Triangle Inequality leads to the following:

This inequality is valid because the total value on the left-hand side is smaller than the total value on the right-hand side: each individual term on the left is bounded by the generalized Triangle Inequality, which effectively bounds the summation when compared with the sum of the signs of the larger errors on the right.
Case 2: a small regression error, i.e., the error lies within the cutoff. Then we can write

Our demonstration has shown that, when dealing with small errors, the IF of Cauchy is no larger than that of Huber, indicating that the Cauchy loss function is either better than or equal to Huber under small poisoning attacks.
Generalization to Multiple Small Regression Errors: Now, consider the case of the remaining N − m small regression errors, each lying within the cutoff. Given a total of N − m small errors, we have N − m inequalities of the form

Applying the generalized Triangle Inequality, we combine the above inequalities and derive a new inequality. For all of these small errors, the following holds true:
The above indicates that, in the presence of very small attacks, both the Cauchy and Huber loss functions provide equally effective mitigation in predicting the detection thresholds.
Case 3: Mixed Regression Errors (Large and Small). We consider the scenario where both large and small errors are present simultaneously (the real-world case) and prove which influence function yields a lower value.
Previously, we showed that Equation (13) represents the contaminated distributions resulting from various attack types. In real-world scenarios, whether the contamination results from a large or a small poisoning attack, it causes both small and large perturbations in the regression errors; the key difference is the ratio of small to large errors. For instance, a large attack strength leads to a higher number of large regression errors than small ones. Thus, in real-world scenarios, we need to combine inequalities (23) and (24) using a similar technique that sums the two sides of these inequalities. As a result, Equation (20) holds for all small and large regression errors, i.e.,
The analysis reveals that the Cauchy loss has a lower IF than the Huber loss when both small and large errors are present due to attacks and benign fluctuations. Based on this, it is explainable that the Cauchy loss function is more effective for the greedy case, where the attacker seeks to launch larger data poisoning strengths to quickly inflict maximum damage. However, the experiments also reveal a trade-off, related to the low base-rate probability of a poisoning attack occurring, that raises a deeper scientific question.