Low-Cost Active Anomaly Detection with Switching Latency

Consider the problem of detecting anomalies among multiple stochastic processes. Each anomaly incurs a cost per unit time until it is identified. Due to resource constraints, the decision-maker can select only one process to probe at a time and obtains a noisy observation from it. Each observation, as well as each switch across processes, incurs a certain time delay. Our objective is to find a sequential inference strategy that minimizes the expected cumulative cost incurred by all the anomalies during the entire detection procedure under error constraints. We develop a deterministic policy that solves the problem within the framework of the active hypothesis testing model. We prove that the proposed algorithm is asymptotically optimal in terms of minimizing the expected cumulative cost when the ratio of the single-switching delay to the single-observation delay is much smaller than the declaration threshold, and is order-optimal when the ratio is comparable to the threshold. Not only is the proposed policy optimal in the asymptotic regime, but numerical simulations also demonstrate its excellent performance in the finite regime.


Introduction
Consider the problem of detecting a fixed number of anomalies among multiple processes. Each process may be in a normal or an abnormal state (the processes in an abnormal state are the anomalies, and our basic goal is to find all of them), and the state of each process does not change during the detection procedure. A process in an abnormal state incurs a certain cost per unit time, while the normal processes do not incur any cost. Once the state of an abnormal process is identified, it stops incurring costs. At each time, only one process can be probed, and a noisy observation is obtained from it. Each probe takes a certain time, and switching across processes also introduces a certain time delay. Since abnormal processes are identified at different times, the duration for which each abnormal process continues incurring costs also differs. We take the cumulative cost incurred by all the abnormal processes during the entire detection procedure as the performance measure. Our objective is to find an active inference strategy, consisting of a selection rule governing which process to probe at each time, a stopping rule on when to terminate the detection procedure, and a decision rule on the final detection outcome, that minimizes the expected cumulative cost under reliability constraints.
The problem considered here can find applications in attack and intrusion localization in a cyber network, anomaly detection in sensor networks, event detection in power systems, and more. For example, consider detecting attacked components (which can be seen as anomalies) in a cyber network consisting of multiple components (such as routers and paths). An intrusion detection system (IDS) monitors the traffic over the components to detect Denial of Service (DoS) attacks (such attacks aim to flood a component with useless traffic so as to make it unavailable to its intended users until it is identified and repaired). Since the attacked components are flooded by massive useless traffic and cannot work as normal, the per-unit-time costs of the abnormal components are the corresponding expected normal traffic (packets per unit time) over the component. Until it is found and repaired, an abnormal component continues incurring costs. Due to resource constraints, only one component can be probed at each time. Each probe and each switch across components bring a certain time delay. The objective is to minimize the expected cumulative cost incurred by all the attacked components during the entire detection procedure; in other words, to minimize the expected number of cumulative failed packet transmissions in the system during the DoS attack.

Related Work
The anomaly detection problem studied in this paper can be regarded as a variation of the Sequential Hypothesis Testing (SHT) problem [1] for a single process. Differing from the classic hypothesis testing problem with fixed sample size, SHT allows the decision-maker to stop sampling from the process based on the gathered observations, which reduces the sample size without compromising the detection performance. The SHT problem was pioneered in [1] by Wald, who established the Sequential Probability Ratio Test (SPRT) for simple binary hypothesis testing and showed its optimality in terms of minimizing the expected sample size under error probability constraints. Various extensions to M-ary or composite hypothesis testing for a single process were studied in [2][3][4], where asymptotically optimal performance can be obtained as the error constraints approach zero. Recent applications and modifications of the SPRT can be found in [5][6][7][8], where the SPRT was modified based on fuzzy hypotheses [5], or was used to identify and diagnose latent leakage faults in pneumatic units [6], track multiple targets [7], or determine software reliability [8].
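As a point of reference, Wald's SPRT for a simple binary hypothesis can be sketched as follows. This is only a minimal illustration of the classic test, not the policy proposed in this paper; the Gaussian observation model and the error targets are illustrative assumptions.

```python
import math
import random

def sprt(sample, llr, alpha=0.01, beta=0.01, max_steps=10**6):
    """Wald's SPRT: accumulate log-likelihood ratios until a threshold is crossed.

    sample() draws one observation; llr(y) returns log(g(y)/f(y)).
    Returns (decision, n): decision 1 accepts H1 (distribution g), 0 accepts H0 (f).
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 above this threshold
    lower = math.log(beta / (1 - alpha))   # accept H0 below this threshold
    s = 0.0
    for n in range(1, max_steps + 1):
        s += llr(sample())
        if s >= upper:
            return 1, n
        if s <= lower:
            return 0, n
    return (1 if s > 0 else 0), max_steps  # forced decision (rarely reached)

# Illustrative Gaussian example: f = N(0, 1) vs g = N(1, 1).
# The LLR of N(1,1) against N(0,1) simplifies to y - 1/2.
random.seed(0)
decision, n = sprt(lambda: random.gauss(1.0, 1.0), lambda y: y - 0.5)
print(decision, n)  # data comes from g, so the decision is usually 1
```

The expected sample size here is roughly the threshold divided by the LLR drift (about log(99)/0.5 ≈ 9 observations), which is the kind of first-order behavior the asymptotic analysis in this paper relies on.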
Since the decision-maker can choose which process to probe at each time based on the past observations, the anomaly detection problem is connected to the sequential design of experiments problem studied by Chernoff in 1959 [9], now known as the Active Hypothesis Testing (AHT) problem. Compared with the SPRT, where the observation action under each hypothesis is fixed, active hypothesis testing allows the decision-maker to choose a different experiment to conduct at each time and infer the state of the probed process. The test developed by Chernoff (referred to as the Chernoff test) is a randomized test that generates random actions based on historical observations. The actions, and consequently the increments of the log-likelihood-ratio test statistic, thus become independent and identically distributed over time, allowing asymptotic analysis of the stopping time on the test statistic. Chernoff's work has been extended in various directions. It should be mentioned that Bessler [10] generalized Chernoff's work to general multiple hypothesis testing. Naghshvar and Javidi in [11][12][13][14][15] studied this sequential problem from the perspective of minimizing a Bayesian risk accounting for both the detection error and the detection delay. Specifically, in [11], they established lower bounds for the Bayesian risk which characterize the fundamental limits on the maximum achievable information acquisition rate and the optimal reliability. They further investigated the roles of information utility [13], performance bounds [14], and sequentiality and adaptivity [15] in the active hypothesis testing problem separately. Recently, K. Cohen and Q. Zhao et al. [16][17][18][19][20] studied the AHT problem for detecting anomalous processes among a fixed number of processes. In these studies, the aim was to find a sequential search strategy for the anomaly detection problem that minimizes the expected detection time subject to an error probability constraint.
Specifically, in [16], in contrast to the randomized policies advocated in the Chernoff test, they introduced a simple deterministic policy which offers the same asymptotic optimality, yet with significant performance gain in the finite regime and a considerable reduction in implementation complexity. In [17,18], Wang and Cohen considered the AHT model for the anomaly detection scenario in which there are a few anomalous processes among a large number of processes, based on a binary-tree-structured detection space. They further considered the anomaly detection problem in the scenario where the processes are heterogeneous [19], and the composite hypothesis case in which the observations follow a common distribution with an unknown parameter [20].
The above studies focused on minimizing the expected sample size, while we focus on minimizing the expected cumulative cost. When there is only one anomaly, minimizing the cumulative cost is equivalent to minimizing the sample size, because there is one and only one process incurring cost per unit time during the entire detection procedure. With multiple processes, however, minimizing the expected sample size with the existing methods is no longer sufficient, since an abnormal process stops incurring cost once its state is identified, so the per-unit-time cost incurred by all the unidentified anomalies changes over time. The anomaly detection problem with the objective of minimizing the cumulative observation cost was formulated by K. Cohen and Q. Zhao in [21], where they studied this problem under the restriction that switching across processes is allowed only when the state of the probed process is declared. They focused on the scenario where the number of anomalies is unknown, developed optimal index algorithms, and demonstrated their strong performance in terms of minimizing the expected cumulative observation cost through numerical simulations. Furthermore, they studied this problem in [22] without the switching restriction and developed index-type algorithms which were proven to be asymptotically optimal as the error constraints approach zero. Then, A. Gurevich [23] tackled a general nonlinear cost setting and developed an effective algorithm.
However, the above studies do not consider the switching cost. Incorporating the switching cost into the objective function is motivated by many applications. For example, in many search tasks, relocating the search region using rescue ships or aircraft incurs a considerable cost in terms of energy or delay. Another example is medical diagnostics, where frequent and fast switching across drugs and medical procedures may carry high risks and side effects. Moreover, in power systems, switching across components to probe also incurs a certain time delay and requires a switching cost. As for the AHT problem with switching costs, there are few studies in the literature. Vaidhiyan developed a modified Chernoff test (referred to as the Sluggish policy) in [24] and introduced a switching parameter η that determines the switching probability, which can significantly reduce the number of switchings. They claimed that the Sluggish policy approaches the asymptotic performance of the Chernoff test as η → 0. In our previous work [25,26], we applied the AHT model to the single-target anomaly detection problem. We proposed a low-complexity deterministic policy (referred to as the DBS policy) which was shown to be asymptotically optimal and offers significant performance improvement in the finite regime. However, it should be noted that all of these works approached the AHT problem from the perspective of minimizing the sum of the sample size and the switching cost and did not consider the expected cumulative cost. Thus, the anomaly detection problem with the objective of minimizing the expected cumulative cost while accounting for the switching cost is first formulated and studied in this paper. To intuitively compare our work with recent existing works, we show the differences in Table 1 according to the different objective functions and whether the switching costs are considered.

Contributions
In this paper, we consider the problem with the objective of minimizing the expected cumulative cost, which takes into account not only the detection cost but also the switching cost. We are thus facing a new problem that requires a trade-off between the cost per unit time, the sample size, and the switching across processes. We propose a deterministic policy (the DMSC policy) which makes the number of switchings as small as possible while ensuring the asymptotic optimality of the expected cumulative cost incurred by observations. Furthermore, we focus on the scenario where the number of anomalies is known in advance. In this scenario, to detect all the anomalies, the decision-maker can directly identify all the abnormal processes and make a declaration; or it can exclude all the normal processes and thus obtain the correct information about the anomalies. The proposed algorithm therefore partitions the problem into two cases. In one case, the processes most likely to be abnormal are probed with the highest priority; in the other case, the decision-maker probes the processes that are likely to be normal and eliminates them one by one. We analyze the asymptotic performance of the DMSC policy for different ratios of the single-switching delay to the single-observation delay as the error constraint approaches zero. We conclude that the proposed DMSC policy is asymptotically optimal in terms of minimizing the expected cumulative cost when the ratio of the single-switching delay to the single-observation delay is much smaller than the declaration threshold, and is order-optimal when the ratio is comparable to the threshold. Its strong performance in the finite regime is demonstrated in the simulation section.

Organization
In this paper, Section 2 describes the system model and problem formulation for the anomaly detection problem. In Section 3 we propose the deterministic low-cost DMSC policy and analyze its asymptotic performance as the error constraint approaches zero. In Section 4 we provide numerical results to illustrate the performance of the proposed policy as compared with other existing policies in the finite regime. Section 5 concludes the paper.

System Model
Consider the problem of detecting L abnormal processes among M processes. Each process may be in a normal state (denoted by H 0 ) or an abnormal state (denoted by H 1 ). Process m is abnormal with a priori probability π m , and is in a normal state with a priori probability 1 − π m . Let H 0 = {m : 1 ≤ m ≤ M, process m is normal} and H 1 = {m : 1 ≤ m ≤ M, process m is abnormal} be the sets of normal and abnormal processes, respectively. Each abnormal process m incurs a cost c m (0 < c m < ∞) per unit time until its state has been identified, while the normal processes do not incur any cost. We focus on the case where c m is the same for every abnormal process; in other words, c m = c holds for all abnormal processes.
The detection procedure starts at time n = 0 and the decision-maker can choose one process to probe at each time. Let φ(n) ∈ {1, 2, · · · , M} be the selection rule indicating which process is chosen to be probed at time n. The time series of selection rules can be denoted by φ = (φ(n), n ≥ 0). When process m is probed at time n, an observation y m (n) is obtained, independent over time. If process m is in a normal state, the observation y m (n) follows distribution f (y); if process m is in an abnormal state, the observation y m (n) follows distribution g(y). In this paper, we focus on the case where the distributions f (y) and g(y) are completely known. In practice, we can obtain the distribution f (y) by observing the normal processes and analyzing the observation data. As for the distribution g(y) of anomalies, it can be roughly judged based on experience. For example, consider the anomaly detection problem in a cyber network. An IDS monitors the traffic over the normal components and takes the packet size values that have arrived in a given interval as statistics. Through data analysis, we can determine that the observations from normal processes follow a certain distribution. Furthermore, it is known that DoS attacks aim to flood the target component with massive useless traffic, so the observations of a process under a DoS attack will follow a distribution with a large mean. To detect the anomalies, we can assume that the observations from abnormal processes follow a prior distribution of interest. The distribution can then be considered known.
We define the stopping time τ m , which is counted from the beginning of the entire detection procedure, as the time when the state of process m is declared and the decision-maker stops probing process m. The vector of stopping times for the M processes can be denoted by τ = (τ 1 , ..., τ M ). We define a stage of the detection procedure as the period of time between the stopping times of two processes whose states are successively declared. Besides, the stopping time of the entire detection procedure is denoted by τ end , at which point all the abnormal processes have been identified. We define δ m ∈ {0, 1} as the decision rule that the decision-maker uses to declare the state of process m at stopping time τ m . We have δ m = 0 if the decision-maker declares that process m is in a normal state, while δ m = 1 if the decision-maker declares that process m is in an abnormal state. Let N m be the random sample size required to declare the state of process m. Let δ end denote the final decision rule that the decision-maker uses to declare the set of anomalies at the stopping time τ end .
The abnormal process m continues incurring costs until its state is identified at the corresponding stopping time τ m . Thus, the expected cumulative cost incurred by all the abnormal processes during the entire detection procedure is given by

E(C|Γ) = E( ∑ m∈H 1 c m · τ m ),

where Γ = (τ, δ, φ) denotes an admissible policy for the sequential anomaly detection problem based on the active hypothesis testing model. The objective is to find a policy Γ that minimizes the expected cumulative cost subject to the error constraint for each process:

minimize Γ E(C|Γ) subject to P m e ≤ α for all 1 ≤ m ≤ M.

It should be noted that P m e = max(P FA m , P MD m ), where P FA m and P MD m denote the false-alarm and miss-detection error probabilities for process m, respectively. Besides, let C * = inf Γ E(C|Γ) denote the infimum achievable by any policy. We further simplify the objective function of the expected cumulative cost. During the entire detection procedure, in addition to the time spent observing the processes, it also takes time to switch across processes. Let N d τ m and N s τ m denote the total sample size and the total number of switchings by the stopping time τ m , respectively. We set each observation to 1 unit of time and each switching to s units of time. Then, the stopping time τ m for process m can be written as

τ m = N d τ m + s · N s τ m . (4)
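To make the bookkeeping concrete, the decomposition of the stopping time into observation and switching delays, and the resulting cumulative cost, can be sketched as follows. This is a toy numerical illustration with made-up observation and switching counts, not part of the model itself.

```python
def stopping_time(n_obs, n_switch, s):
    """tau_m = (total observations by tau_m) * 1 + (total switchings by tau_m) * s."""
    return n_obs + s * n_switch

def cumulative_cost(taus, costs, abnormal):
    """C = sum over abnormal processes m of c_m * tau_m."""
    return sum(costs[m] * taus[m] for m in abnormal)

# Toy example: 5 processes, processes 2 and 3 abnormal, c = 1, switch delay s = 2.
s = 2
# Hypothetical observation/switch counts accumulated by each declaration time.
taus = {2: stopping_time(30, 3, s), 3: stopping_time(55, 5, s)}
costs = {2: 1.0, 3: 1.0}
print(cumulative_cost(taus, costs, abnormal=[2, 3]))  # (30 + 6) + (55 + 10) = 101.0
```

The example makes visible why switching matters: for s = 2, the 8 hypothetical switchings add 16 units of cost-bearing time on top of the 85 observation slots.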

Notations
Since only one process can be probed at a time, let 1 m (n) be the indicator function of whether process m is probed at time n. 1 m (n) = 1 if process m is probed at time n, and 1 m (n) = 0 otherwise.
We use the log-likelihood ratio (LLR), a statistic used in classical sequential hypothesis testing [1], as the decision statistic between the two hypotheses. The LLR of process m at time n is denoted as

l m (n) = log( g(y m (n)) / f (y m (n)) ),

which reflects which of the two distributions a given observation fits better. The sum log-likelihood ratio (sum-LLRs) of process m at time n is given by

S m (n) = ∑ t≤n l m (t) · 1 m (t), (6)

which combines the observations of process m over multiple times and can be regarded as a score of whether the process is in an abnormal state.
Let D(g|| f ) and D( f ||g) denote the KL divergences between the two distributions g and f , which are given by

D(g|| f ) = ∫ g(y) log( g(y)/ f (y) ) dy,  D( f ||g) = ∫ f (y) log( f (y)/g(y) ) dy.

If process m is in an abnormal state, the sum-LLRs S m (n) of process m is a random walk with positive expected increment E m (l m (n)) = D(g|| f ) > 0, while S j (n) of a process j in a normal state is a random walk with negative expected increment E j (l j (n)) = −D( f ||g) < 0. Thus, when the sum-LLRs of process m is sufficiently large, we can declare with sufficient accuracy that process m is in an abnormal state. Similarly, when the sum-LLRs of process j is sufficiently small, we can declare its state as normal.
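The drift of the sum-LLR random walk under each hypothesis can be checked numerically. The sketch below assumes a Gaussian observation model (f = N(0,1), g = N(1,1)) purely for illustration; the paper's model allows any known pair f, g.

```python
import random

def llr_gauss(y, mu_f=0.0, mu_g=1.0, sigma=1.0):
    """l(y) = log g(y) - log f(y) for Gaussians with common variance."""
    return ((y - mu_f) ** 2 - (y - mu_g) ** 2) / (2 * sigma ** 2)

def sum_llr(observations):
    """S(n): running sum of per-observation LLRs for one process."""
    return sum(llr_gauss(y) for y in observations)

# For equal-variance Gaussians, D(g||f) = D(f||g) = (mu_g - mu_f)^2 / (2 sigma^2) = 0.5.
random.seed(1)
abnormal_obs = [random.gauss(1.0, 1.0) for _ in range(2000)]  # drawn from g
normal_obs = [random.gauss(0.0, 1.0) for _ in range(2000)]    # drawn from f

drift_abnormal = sum_llr(abnormal_obs) / len(abnormal_obs)  # ~ +D(g||f) = +0.5
drift_normal = sum_llr(normal_obs) / len(normal_obs)        # ~ -D(f||g) = -0.5
print(drift_abnormal, drift_normal)
```

The positive drift for abnormal processes and negative drift for normal ones is exactly what makes threshold crossing a reliable declaration rule.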

The DMSC Policy
In this section, we propose a deterministic anomaly detection policy and analyze its sample and switching complexity and the expected cumulative cost in the asymptotic regime as the error constraint α approaches zero.

The DMSC Policy
Consider the problem of detecting L abnormal processes among M processes, where L and M are known. The decision-maker can locate all the anomalies either by directly selecting and identifying all L abnormal processes, or by excluding all M − L normal processes. Thus, the proposed DMSC policy partitions the problem space into two cases by comparing the asymptotic cumulative costs of the two approaches, derived at the end of this subsection; the comparison involves an offset term that depends on M, L, and the ratio s.

In Case I, the DMSC policy aims at identifying the L abnormal processes sequentially. It should be noticed that, in this case, probing the process with the highest sum-LLRs leads to a rapid collection of sufficient information to declare the state of that process as abnormal, since the abnormal processes tend to have higher sum-LLRs than the normal ones and the process with the highest sum-LLRs tends not to change after a period of time; this keeps the cost incurred by observations and switchings as small as possible. Let A(n) denote the set of processes whose states have been declared as abnormal at time n. The selection rule is

φ(n) = m 1 (n),

where m 1 (n) = arg max m ∉ A(n) S m (n) is the index of the process with the highest sum-LLRs among all the processes whose states have not been declared at time n. Following Wald's SPRT [1], the stopping rule τ m and decision rule δ m are obtained by comparing the sum-LLRs S m (n) with a threshold at each time n: the decision-maker stops probing process m and declares δ m = 1 at the first time at which S m (n) ≥ B, where the threshold B is determined such that the error constraint P m e ≤ α is satisfied and, by simplifying the computation of Wald's approximation [1], can be expressed as

B = log(1/α). (13)

In this case, the stopping time τ end of the entire detection procedure is the time at which |A(n)| = L, and the final decision rule is δ end = A(τ end ), which means we have completed the search for the L abnormal processes. In Case II, the decision-maker chooses the method of excluding the M − L normal processes to locate the L abnormal processes with a lower expected cumulative cost.
Let B(n) denote the set of processes whose states have been declared as normal at time n. The selection rule is then φ(n) = m −1 (n), where m −1 (n) = arg min m ∉ B(n) S m (n) is the index of the process with the smallest sum-LLRs among all the processes whose states have not been declared at time n; the decision-maker stops probing process m and declares δ m = 0 at the first time at which S m (n) ≤ −B. In this case, the stopping time τ end of the entire detection procedure is the time at which |B(n)| = M − L, and the final decision rule is δ end = M \ B(τ end ), where M = {1, 2, · · · , M} is the set of all processes. The DMSC policy for anomaly detection problems in the AHT framework is shown in Algorithm 1.

Algorithm 1: The deterministic (DMSC) policy.
Input: the observations y, the distributions f and g, the number of processes M, the number of anomalies L, the ratio s, the error constraint α
Output: the set of anomalies δ end
1: Initialize the sum-LLRs of the processes: S m (0) = 0 for m ∈ {1, 2, ..., M};
2: Calculate the declaration threshold B according to (13);
3: Calculate the case-partition statistic according to (8);
4: if Case I holds then
5:   while the number of declared processes < L do
6:     Probe the process with the highest sum-LLRs: φ(n) = m 1 (n);
7:     Obtain an observation y φ(n) (n) from the process φ(n);
8:     Update S φ(n) (n) based on the last observation according to (6);
9:     if S φ(n) (n) ≥ B then
10:      Declare the process φ(n) as abnormal;
11:    end
12:  end
13:  Declare the L declared processes as abnormal.
14: else
15:  while the number of declared processes < M − L do
16:    Probe the process with the lowest sum-LLRs: φ(n) = m −1 (n);
17:    Obtain an observation y φ(n) (n) from the process φ(n);
18:    Update S φ(n) (n) based on the last observation according to (6);
19:    if S φ(n) (n) ≤ −B then
20:      Declare the process φ(n) as normal;
21:    end
22:  end
23:  Declare the remaining L processes as abnormal.
24: end
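Case I of the policy can be sketched in executable form as follows. This is a simplified simulation for illustration only: the case partition is assumed already resolved in favor of Case I, switching is counted whenever the probed index changes, and the Gaussian observation model (f = N(0,1), g = N(1,1)) is an assumption rather than part of the paper's model.

```python
import math
import random

def dmsc_case1(M, L, abnormal, B, s, mu_g=1.0, sigma=1.0):
    """Simplified Case I of the DMSC policy: repeatedly probe the undeclared
    process with the highest sum-LLR and declare it abnormal once its sum-LLR
    reaches B. Returns (declared_set, n_observations, n_switchings)."""
    S = [0.0] * M            # sum-LLRs
    declared = set()         # processes declared abnormal
    n_obs = n_switch = 0
    last = None              # last probed process, to count switchings
    while len(declared) < L:
        # select the undeclared process with the highest sum-LLR
        probe = max((m for m in range(M) if m not in declared), key=lambda m: S[m])
        if last is not None and probe != last:
            n_switch += 1
        last = probe
        # draw an observation: g = N(mu_g, sigma) if abnormal, f = N(0, sigma) if normal
        mean = mu_g if probe in abnormal else 0.0
        y = random.gauss(mean, sigma)
        # LLR increment for Gaussian f vs g with common variance
        S[probe] += (y ** 2 - (y - mu_g) ** 2) / (2 * sigma ** 2)
        n_obs += 1
        if S[probe] >= B:
            declared.add(probe)
            last = None      # a switch necessarily follows a declaration
    return declared, n_obs, n_switch

random.seed(2)
B = math.log(1 / 0.001)      # threshold for error constraint alpha = 0.001
declared, n_obs, n_switch = dmsc_case1(M=5, L=2, abnormal={2, 4}, B=B, s=1)
print(declared, n_obs, n_switch)
```

In a typical run the two anomalies are found after roughly 2·B/D(g||f) ≈ 28 anomaly observations plus a short initial exploration phase, matching the stage structure analyzed below.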
For Case I, the process with the highest sum-LLRs, m 1 (n), is selected at each given time n and declared as abnormal once its sum-LLRs exceeds the threshold B. The asymptotic sample size N m required to declare the state of one process is expected to be B/D(g|| f ), since the probed process is in an abnormal state with high probability and its corresponding sum-LLRs is a random walk with positive expected increment D(g|| f ). Once the sum-LLRs of the probed process exceeds the threshold B, the decision-maker stops taking observations from this process and declares it as abnormal; this point in time is a stopping time, and the first stage terminates. The decision-maker then chooses the updated process m 1 (n) to probe and applies the same procedure to the remaining M − 1 processes. This repeats until all L abnormal processes have been declared, at which point the entire detection procedure terminates. Thus, the entire detection procedure can be divided into L stages, where the observation time of each stage is expected to be B/D(g|| f ) and the i-th stage runs from the (i − 1)-th stopping time to the i-th stopping time. It should be noted that at the beginning of the i-th stage, i − 1 processes have been declared as abnormal, and at each time of the i-th stage, L − i + 1 unidentified abnormal processes incur cost simultaneously. The cumulative observation cost incurred by the unidentified abnormal processes during the i-th stage is expected to be c · (L − i + 1)B/D(g|| f ). Thus, the cumulative cost caused by observations during the detection procedure approaches c · L(L+1)B / (2D(g|| f )). From another perspective, the cumulative observation time of the i-th declared process is iB/D(g|| f ); summing over i, we again conclude that the cumulative observation cost can be approximately expressed as c · L(L+1)B / (2D(g|| f )). Besides, there will inevitably be switching across processes during the initialization of each stage, which results in additional delays.
In Case I, we use c · s · L(L−1)/2 to approximate the cost caused by the switchings.
For Case II, the process with the lowest sum-LLRs, m −1 (n), is selected at each given time n and declared as normal once its sum-LLRs drops below the threshold −B. The asymptotic sample size N m required to declare the state of one process approaches B/D( f ||g), since the probed process is in a normal state with high probability and its corresponding sum-LLRs is a random walk with negative expected increment −D( f ||g). The same procedure is then applied to the remaining M − 1 processes until all the M − L normal processes have been declared as normal, and the entire procedure then terminates with the remaining L processes declared as abnormal. Unlike Case I, where the number of unidentified abnormal processes decreases as the stage index increases, in each stage of Case II there are L abnormal processes incurring costs. Thus, the cumulative cost caused by observations approaches c · L(M − L)B/D( f ||g). Similarly, the switching cost can be expressed as c · s · L(M − L). Therefore, the decision-maker chooses the strategy that minimizes the asymptotic cumulative cost caused by observations and switchings, i.e., it compares

c · L(L+1)B / (2D(g|| f )) + c · s · L(L−1)/2 (Case I) with c · L(M − L)B/D( f ||g) + c · s · L(M − L) (Case II).

Thus, by formula transformation, the problem space is partitioned into the two cases shown in (8).
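The case selection follows directly from the two asymptotic cost approximations derived above. A minimal sketch, with illustrative parameter values:

```python
def asymptotic_cost_case1(M, L, B, s, c, D_gf):
    """Identify the L anomalies directly: observation cost + switching cost."""
    return c * L * (L + 1) * B / (2 * D_gf) + c * s * L * (L - 1) / 2

def asymptotic_cost_case2(M, L, B, s, c, D_fg):
    """Exclude the M - L normal processes instead."""
    return c * L * (M - L) * B / D_fg + c * s * L * (M - L)

def choose_case(M, L, B, s, c, D_gf, D_fg):
    c1 = asymptotic_cost_case1(M, L, B, s, c, D_gf)
    c2 = asymptotic_cost_case2(M, L, B, s, c, D_fg)
    return 1 if c1 <= c2 else 2

# A few anomalies among many processes favors Case I (43 vs 540 here)...
print(choose_case(M=20, L=2, B=7.0, s=1.0, c=1.0, D_gf=0.5, D_fg=0.5))  # 1
# ...while many anomalies among few processes favors Case II (146 vs 60 here).
print(choose_case(M=5, L=4, B=7.0, s=1.0, c=1.0, D_gf=0.5, D_fg=0.5))   # 2
```

The comparison captures the intuition: confirming L anomalies takes roughly L(L+1)/2 stage-costs, whereas eliminating M − L normal processes keeps all L anomalies "on the clock" for every stage.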

Example
We illustrate the two cases of the DMSC policy in Figure 1 with an example that detects L = 2 abnormal processes (assume that process 3 and process 4 are abnormal) among M = 5 processes. For Case I, the decision-maker probes the process m 1 (n) and terminates the detection procedure once the number of processes declared as abnormal reaches L. At the beginning of the detection procedure, since the sum-LLRs of all processes are zero, a process is randomly selected to probe at n = 0. Suppose that the decision-maker selects process 1. Since process 1 is in a normal state, the sum-LLRs of process 1 will decrease with high probability. Thus, at n = 1, when the first observation is finished, the order of the sum-LLRs of all processes becomes {2, 3, 4, 5, 1}. Since switching across processes takes a certain time delay s, the decision-maker randomly chooses a process to probe at n = 1 + s. Suppose that process 2 is selected. Similarly, the sum-LLRs of process 2 will also decrease since process 2 is also in a normal state, and the order of the sum-LLRs of all processes may become {3, 4, 5, 1, 2}. Then, the decision-maker chooses a process to probe at n = 3 + 2s. Assume that the decision-maker selects process 3. Since process 3 is in an abnormal state, the sum-LLRs of process 3 increases and ranks first with high probability, i.e., m 1 (n) = 3 and S m 1 (n) (n) > 0. From then on, the decision-maker will continue to probe process 3 with high probability until S m 1 (n) (n) ≥ B. Then the decision-maker will stop taking observations from process 3 and declare it as abnormal. At this time, stage 1 of the entire detection procedure is finished. This procedure is then repeated until the state of the other abnormal process is declared, at which point the entire detection procedure terminates.
For Case II, the decision-maker probes process m −1 (n) at each given time. Suppose process 5 is randomly selected to probe at the beginning of the detection procedure. Since process 5 is in a normal state, the sum-LLRs of process 5 will decrease and be the lowest among these processes. Thereafter, the decision-maker will continue probing process 5 until S m −1 (n) (n) < −B, and then stop taking observations from process 5 and declare it as normal. Repeat this operation until all normal processes are declared. Then, the entire detection procedure terminates with the remaining processes declared as abnormal.

Performance Analysis
We now analyze the asymptotic performance of the DMSC policy in terms of the expected cumulative cost as the error constraint α → 0. Note that, according to (13), the error constraint approaching zero implies that the threshold B → ∞ and the sample size required to declare the state of each process approaches infinity. In other words, the detection procedure lasts arbitrarily long.
According to (4), the expected cumulative cost E(C|Γ) can be written as the summation of the cost caused by observations,

E(C d |Γ) = E( ∑ m∈H 1 c m · N d τ m ), (22)

and the cost caused by the switchings,

E(C s |Γ) = E( ∑ m∈H 1 c m · s · N s τ m ). (23)

Thus, it is necessary to analyze the total numbers of observations and switchings. In our proposed policy, the states of the L abnormal processes or the M − L normal processes are declared one by one, which splits the entire detection procedure into L or M − L stages. Thus, one important part of the analysis of the DMSC policy is analyzing both the observation times and the switching times of each stage for the two cases (the detailed analysis can be found in Appendix A.2).
Besides the number of observations and the number of switchings, the ratio s is also an important parameter for the objective function of the expected cumulative cost. It should be noted that s determines the relative contribution of a single switching, compared to a single observation, to the cumulative cost. According to the order of the ratio of s to B, we analyze the optimality of the proposed policy in the following two scenarios: Scenario 1, where s = o(B), and Scenario 2, where s = Ω(B). Note that B is the threshold related to the error probability α and determines the expected sample size required to declare the state of the processes, where the expected sample size is B/D(g|| f ) for the abnormal processes and B/D( f ||g) for the normal processes. Based on (22) and (23), the ratio of s to B is connected with the ratio of the cumulative switching cost to the cumulative observation cost.
Next, we consider the above two scenarios and provide the performance analyses of the DMSC policy on the expected cumulative cost. Firstly, we focus on Scenario 1 and prove that the DMSC is asymptotically optimal in this scenario as α → 0. Then, similarly, we prove that the DMSC policy is order-optimal in Scenario 2 in the asymptotic regime.

Performance Analysis for Scenario 1
Here, we assume that the parameter s satisfies s = o(B) as α → 0. In this subsection, we prove that the DMSC policy is asymptotically optimal in this scenario.
In other words, the expected cumulative cost of the DMSC policy is asymptotically optimal in α.
Proof. The proof idea is briefly provided here. In Appendix A.1, we show that the objective function on the expected cumulative cost has an asymptotic lower bound and gives an expression for the asymptotic lower bound. Then, in Appendix A.2, we upper bound the expected cumulative cost of the DMSC policy by establishing the upper bounds of the number of observations and switchings of each stage. In addition, we show that if s = o(B), the expected cumulative cost under DMSC policy approaches the asymptotic lower bound as α → 0, which completes the proof. The detailed proof is given in Appendix A.3.

Performance Analysis for Scenario 2
Here, we focus on the scenario where the parameter s satisfies s = Ω(B) as α → 0. In this subsection, we define order-optimality and present Theorem 2, which shows that the DMSC policy is order-optimal in this scenario.

Definition 2. We say that a policy Γ is order-optimal if the ratio of its expected cumulative cost to the asymptotic lower bound is bounded by a positive constant as α → 0.
In other words, the expected cumulative cost of the DMSC policy is order-optimal in α.
Proof. The proof for this scenario is detailed in Appendix A.4, following an approach similar to that outlined in the proof of Theorem 1.
Since s/B does not vanish as α → 0, the offset (M, L) cannot be ignored in this scenario, and the analysis is more complicated than that of Scenario 1. We show that the ratio of the expected cumulative cost of the DMSC policy to the asymptotic lower bound is larger than 1 but bounded by a positive constant as α → 0, which completes the proof for Scenario 2.

Numerical Results
In this section, we present numerical examples to illustrate the performance of the proposed DMSC policy as compared to the Random-selection Sequential Probability Ratio Test policy (R-SPRT policy) and the CL-πcN policy [22] in the finite regime.
R-SPRT Policy: The random SPRT (R-SPRT) policy has the lowest switching cost: a series of SPRTs is performed in random order, and a switching occurs only when the state of the currently probed process is identified. Once all the abnormal processes or all the normal processes have been declared, the R-SPRT detection procedure terminates.
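For concreteness, the per-process test that R-SPRT applies in random order can be sketched as a minimal SPRT with symmetric thresholds ±B. The function name and interface below are ours, not the paper's:

```python
def sprt(samples, llr, B):
    """Sequential probability ratio test on a stream of observations.

    Accumulates the log-likelihood ratio llr(x) of each observation and
    declares 'abnormal' once the sum crosses +B, 'normal' once it
    crosses -B. Returns (decision, number_of_observations)."""
    S, n = 0.0, 0
    for x in samples:
        n += 1
        S += llr(x)
        if S >= B:
            return "abnormal", n
        if S <= -B:
            return "normal", n
    return "undecided", n  # stream exhausted before a boundary was hit
```

Under R-SPRT, this routine is run on the processes one by one, so a switching occurs only after each declaration, which is why its switching cost is the lowest among all policies.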
CL-πcN policy: The CL-πcN policy was proposed to solve an anomaly detection problem with a similar objective function, aiming to minimize the expected cumulative observation cost without accounting for the switching cost. It was shown to be asymptotically optimal with respect to its own objective function.
In this paper, we focus on the scenario where the number of abnormal processes and the total number of processes are both known. In addition, we consider the case where the distribution under each hypothesis is completely known: f ∼ Poi(λ f ) and g ∼ Poi(λ g ), which means that the observations from the normal processes follow the Poisson distribution with parameter λ f and the observations from the abnormal processes follow the Poisson distribution with parameter λ g . The KL divergences can be expressed in closed form as D( f ||g) = λ f log(λ f /λ g ) + λ g − λ f and D(g|| f ) = λ g log(λ g /λ f ) + λ f − λ g .
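The closed-form KL divergence between Poisson distributions, D(Poi(λ_p) || Poi(λ_q)) = λ_p log(λ_p/λ_q) + λ_q − λ_p, can be computed directly; a small sketch using the simulation parameters from the text:

```python
import math

def kl_poisson(lam_p: float, lam_q: float) -> float:
    """Closed-form KL divergence D(Poi(lam_p) || Poi(lam_q))."""
    return lam_p * math.log(lam_p / lam_q) + lam_q - lam_p

# Parameters used in the simulations: lambda_f = 2, lambda_g = 0.01.
D_fg = kl_poisson(2.0, 0.01)  # D(f||g), drives declaring normal processes
D_gf = kl_poisson(0.01, 2.0)  # D(g||f), drives declaring abnormal processes
```

With these parameters D(f||g) is much larger than D(g||f), so the expected sample size B/D(f||g) for excluding a normal process is much smaller than B/D(g||f) for confirming an anomaly.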
Let C * LB be the asymptotic lower bound on the expected cumulative cost as α → 0, given in Lemma A1, and define the relative cost L Γ = (E(C Γ ) − C * LB )/C * LB of the expected cumulative cost under policy Γ, as compared to the asymptotic lower bound. Following Theorem 1 and Theorem 2, we expect L DMSC to approach 0 in Scenario 1 and to be bounded by a constant in Scenario 2 as α → 0.
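The relative cost reported in the figures can be sketched as follows (the helper name is ours; it simply normalizes the excess cost by the lower bound):

```python
def relative_cost(expected_cost: float, lower_bound: float) -> float:
    """Relative cost L_Gamma = (E[C_Gamma] - C*_LB) / C*_LB of a policy
    with respect to the asymptotic lower bound C*_LB."""
    return (expected_cost - lower_bound) / lower_bound
```

By Theorems 1 and 2, this quantity should approach 0 for the DMSC policy in Scenario 1 and remain bounded by a positive constant in Scenario 2.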

Scenario 1: s = o(B)
In this part, we compare the performances of the DMSC, R-SPRT, and CL-πcN policies for M = 5 and L = 2 in Scenario 1, where s = o(B) as α → 0.
First, we consider the case where s is fixed, i.e., the ratio of the single-switching delay to the single-observation delay is fixed; we set s = 2. Since B ≈ log 1−α α , in this case s/B → 0 as α → 0, i.e., s = o(B) holds. Note that since s is fixed, the value of the offset (M, L) changes as the threshold B increases, which leads to the switch between Case I and Case II in the DMSC policy. We set λ f = 2, λ g = 0.01 and c = 1, and average the results over 5 · 10 4 trials. The results are presented in Figure 2; in particular, Figure 2b shows the relative cost under the DMSC policy and the other policies as compared to the asymptotic lower bound. It can be seen that the DMSC policy significantly outperforms the others in the finite regime for all values of B, and that the relative cost of the DMSC policy approaches 0 as α → 0, which is consistent with the conclusion of Theorem 1 that the DMSC policy is asymptotically optimal as α → 0. Besides, the performance of the R-SPRT policy is worse than that of the DMSC policy but better than that of the CL-πcN policy.
Then, we consider the case where s/(αB) is fixed, so that s = o(B) still holds. In this case, the corresponding condition on M − L holds for each B in the simulation, and the DMSC policy selects process m 1 (n) at each given time n. We again average the results over 5 · 10 4 trials. The relative costs under the DMSC policy and the other policies as compared to the asymptotic lower bound are presented as a function of the threshold B in Figure 3. It is clear that the performance of the DMSC policy is superior to that of the others.

Scenario 2: s = Ω(B)
In this part, we compare the performances of the DMSC, R-SPRT, and CL-πcN policies for M = 5 and L = 2 in Scenario 2, where s = Ω(B) as α → 0.
We consider the case where s/B = 2, which implies that the order of the offset (M, L) (defined in (9)) cannot be ignored as α → 0. The relative costs under the three policies as compared to the asymptotic lower bound are presented as a function of the threshold B in Figure 4.

Discussion
In this section, we presented numerical results to illustrate the performance of the proposed DMSC policy as compared to the R-SPRT policy, whose number of switchings is the smallest among all policies, and the CL-πcN policy, which has been proven in [22] to be asymptotically optimal in terms of minimizing the cumulative observation cost. We compared these three algorithms in Scenario 1 and Scenario 2, respectively.
Consider first Scenario 1, where s = o(B) as α → 0, and the case where s is fixed. As shown in Figure 2b, the DMSC policy outperforms the other policies, and the performance of the R-SPRT policy is worse than that of the DMSC policy but better than that of the CL-πcN policy. The relative cost of the DMSC policy approaches 0 as the threshold B increases, which is consistent with the conclusion of Theorem 1 that the DMSC policy is asymptotically optimal as α → 0.
Under the R-SPRT policy, the decision-maker performs an SPRT for each process in random order and cannot switch to another process before the state of the currently probed process is declared. Thus, the expected number of switchings of R-SPRT is fixed and smaller than that of any other policy. However, due to this switching constraint, the decision-maker must probe one process until its stopping rule is reached, so the expected number of observations and the cumulative observation cost of the R-SPRT policy are larger than those of the other policies. In this scenario, the ratio of the single-switching delay to the single-observation delay is fixed, meaning that the single-switching cost is comparable with the single-observation cost. When the threshold B is small, the expected sample size required to declare the state of a process is small, and the negative effect of the switching constraint of the R-SPRT policy is not very significant; the relative cost of the R-SPRT policy is then only slightly larger than that of the DMSC policy. As the threshold B increases, the expected sample size increases, and the cumulative observation cost of the R-SPRT policy grows faster than that of the others. As a result, the relative cost of R-SPRT increases with B, and the gap between the relative costs of the R-SPRT and DMSC policies widens.
Under the CL-πcN policy, at each given time n the decision-maker calculates the index γ m (n) = π m (n)c m for each process m. At each non-specific time, the process with the largest index is probed, while at specific times that become exponentially sparse over time, the processes are probed in a round-robin fashion, which increases the number of switchings and the cumulative switching cost. Thus, in this scenario where the single-switching delay is comparable with the single-observation delay (s = 2), the relative cost of the CL-πcN policy is the highest among the compared policies, even though the CL-πcN policy is proved to be asymptotically optimal in terms of minimizing the cumulative observation cost [22].
Then, we considered the case where s/(αB) is fixed. It should be noted that αB ≈ −α log α → 0 as α → 0. In this simulation, s/(αB) is fixed and s/B = 2α, so s = 2αB → 0 as α → 0. In other words, the ratio of the single-switching delay to the single-observation delay approaches 0, and the objective of minimizing the cumulative cost reduces to minimizing the cumulative observation cost. Thus, the R-SPRT policy performs worst in this scenario, since its switching constraint forces the decision-maker to spend a long time probing each process. On the contrary, benefiting from the small impact of switching on the cumulative cost, the CL-πcN policy performs better than in the previous simulation. Due to the round-robin procedure, the switching cost of the CL-πcN policy is larger than that of the DMSC policy, and its observation cost is slightly larger as well. Since the ratio s approaches 0, the relative cost of the CL-πcN policy is only slightly higher than that of the DMSC policy.
In conclusion, in Scenario 1 where s = o(B) as α → 0, the DMSC policy outperforms the other policies. Combined with Theorem 1, which proves that the DMSC policy is asymptotically optimal in this scenario, we can conclude that the DMSC policy is the optimal algorithm for this anomaly detection problem in this scenario.
Now consider Scenario 2, where s = Ω(B) as α → 0. As shown in Figure 4, the DMSC policy outperforms the CL-πcN policy but underperforms the R-SPRT policy. In this scenario, the ratio s is sufficiently large that the cost incurred by an additional switch is higher than the cost of observing and declaring a non-target process. Since switching occurs only when the state of a process is identified, the number of switchings of R-SPRT is the smallest among all policies, and R-SPRT benefits accordingly. Meanwhile, the switching cost of the DMSC policy is much higher than that of R-SPRT, making its performance on the expected cumulative cost inferior. As for the CL-πcN policy, its switching cost is much higher than that of the DMSC policy, so its relative cost is the highest among the three algorithms. Nevertheless, the DMSC policy is order-optimal in terms of the expected cumulative cost in this scenario where s = Ω(B) as α → 0.
In conclusion, since the ratio s is sufficiently large, every switch makes the cumulative cost much higher. In this scenario, the best strategy is to minimize switching, i.e., to perform an SPRT for each process one by one. Thus, the R-SPRT policy is the best-performing algorithm in this scenario, since its number of switchings is the lowest.

Conclusions
In this paper, we considered the problem of detecting anomalous processes among a fixed number of processes within the framework of the AHT model. The decision-maker can choose only one process to probe at each given time. If the processes probed at any two consecutive observations differ, a switching across processes occurs. Both observations and switchings incur time delays. Each abnormal process incurs a cost per unit time until its state is identified, and normal processes do not incur any cost. Our objective is to find an active inference policy that minimizes the expected cumulative cost incurred by all the abnormal processes during the entire detection procedure.
To the best of our knowledge, this paper is the first to formulate the anomaly detection problem with the objective of minimizing the expected cumulative cost while accounting for switching costs. For this problem, we proposed a deterministic policy, referred to as the DMSC policy, that partitions the problem into two cases: one in which the decision-maker directly selects and identifies all abnormal processes, and the other in which all normal processes are excluded and the abnormal processes are thereby located. Furthermore, we proved that the proposed algorithm is asymptotically optimal (respectively, order-optimal) under the error constraints when the ratio of the single-switching delay to the single-observation delay is much smaller than (respectively, comparable with) the threshold that determines the expected sample size required to declare the states of the processes. We also provided numerical simulations showing the strong performance of the DMSC policy in the finite regime; the numerical results further illustrate the optimality of the DMSC policy in certain scenarios.
However, the optimality of our proposed algorithm relies on the assumption that all abnormal processes incur the same constant cost per unit time until their states are identified. Moreover, we focused on the scenario where the number of anomalies is known and the distributions of normal and abnormal processes are completely known. If any of these conditions is not met, the optimality of the proposed DMSC policy may no longer hold.
In the future, we would like to study this anomaly detection problem in scenarios where different anomalous processes incur different per-unit-time costs, or where these costs change over time. Besides, we expect that the problem can be extended to the situation where the number of anomalies or the distribution under each hypothesis is unknown, which may be more advantageous in some practical scenarios.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs of Theorems
Appendix A.1. The Asymptotic Lower Bound on the Expected Cumulative Cost
In this subsection, we establish the asymptotic lower bound on the expected cumulative cost that can be achieved by any policy.
Lemma A1 (The Asymptotic Lower Bound on the Expected Cumulative Cost). Let C * LB denote the asymptotic lower bound on the expected cumulative cost. Then C * LB is the minimum of the costs of the two declaration strategies characterized in the proof below, where B is a function of α with B ≈ log 1−α α .
Proof. According to the objective function in (4), the cumulative cost consists of the cost incurred by observations and the cost incurred by switchings, denoted E(C d ) and E(C s ), respectively. It should be noted that once an abnormal process is identified, it stops incurring cost.
When the total number of processes M and the number of abnormal processes L are known, there are two ways to declare the states of the processes: one is to directly declare the L abnormal processes as abnormal, and the other is to declare the M − L normal processes as normal one by one and declare the remaining L processes as abnormal.
When the states of the abnormal processes are declared directly, it is intuitive that probing normal processes before declaring the states of the abnormal processes increases the observation cost. Hence, to establish the lower bound on the expected cumulative cost, we assume that all the abnormal processes are tested before the normal ones, and we apply ( [22], Lemma 1), where an asymptotic lower bound on the expected cumulative cost considering only the observation cost was established under the same assumption. Furthermore, any additional switching also increases the cost. Thus, we assume that each process is probed continuously and that switching is not performed until the state of the currently probed process is declared, so at least L − 1 switchings occur during the entire detection procedure, which lower bounds the cost incurred by the switchings. Combining the two bounds, the expected cumulative cost in this situation is lower bounded as in (A5) as α → 0.
When declaring the normal processes one by one, we assume that all the normal processes are tested before the abnormal ones and that switching across processes is performed only when the state of the currently probed process is declared. Since lim α→0 E(N m |H 0 ) → B/D( f ||g) [1], and the number of abnormal processes incurring costs remains constant during this procedure (because none of the abnormal processes is identified), we obtain the bound in (A6) as α → 0. Combining (A5) and (A6) completes the proof of Lemma A1.
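Under the simplifying assumptions of the proof (anomalies probed first with one switch per declaration, or all normal processes excluded first), the two candidate bounds can be sketched numerically as follows. The coefficients here are our own back-of-the-envelope reading of the argument, not the paper's exact expressions (A5) and (A6):

```python
def cost_direct(L: int, B: float, D_gf: float, s: float, c: float = 1.0) -> float:
    """Declare the L anomalies directly: the k-th anomaly is identified
    after roughly k*B/D(g||f) observations plus k-1 switches, and pays
    cost c per unit time until then (lower-order terms ignored)."""
    return sum(c * (k * B / D_gf + (k - 1) * s) for k in range(1, L + 1))

def cost_elimination(M: int, L: int, B: float, D_fg: float,
                     s: float, c: float = 1.0) -> float:
    """Exclude the M-L normal processes one by one: all L anomalies keep
    paying during the whole elimination phase of roughly (M-L)*B/D(f||g)
    observations and M-L-1 switches."""
    return c * L * ((M - L) * B / D_fg + (M - L - 1) * s)

def lower_bound_sketch(M, L, B, D_gf, D_fg, s, c=1.0):
    """The cheaper of the two declaration strategies."""
    return min(cost_direct(L, B, D_gf, s, c),
               cost_elimination(M, L, B, D_fg, s, c))
```

Which strategy is cheaper depends on M, L, the two KL divergences, and the switching ratio s, which is why the lower bound takes a minimum over both.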

Appendix A.2. The Expected Cumulative Cost of DMSC Policy
In the DMSC policy, the detection procedure is divided into L or M − L stages, depending on whether the decision-maker chooses to probe the process with the highest sum-LLR, m 1 (n), or the process with the lowest sum-LLR, m −1 (n). Here, we analyze the asymptotic performance of the DMSC policy on the expected cumulative cost at each stage; summing the expected costs of the stages yields the expected cumulative cost of the entire detection procedure. Motivated by (4), we analyze the number of observations and switchings at each stage to obtain the expected cumulative cost.
Let N k d and N k s denote the number of observations and the number of switchings during the k-th stage, respectively. Note that, differing from the subscript m of N d τ m and N s τ m , which indexes the process, the superscript k of N k d and N k s indexes the stage. Figure A1 visually shows the relationship between these parameters. For example, consider the anomaly detection problem in a system consisting of M = 3 processes (denoted processes 1, 2, 3) with L = 2 abnormal processes (say, processes 1 and 3 are abnormal). We assume that the decision-maker chooses m 1 (n) to probe and that, once the number of processes declared abnormal reaches L, the detection procedure terminates. Suppose that the decision-maker identifies the abnormal processes in the order {1, 3}. Hence, the entire detection procedure is divided into two stages. Stage 1 runs from the beginning of the detection procedure n = 0 to the stopping time τ 1 of process 1, where τ 1 = N d τ 1 + sN s τ 1 and the duration of stage 1 is N 1 d + sN 1 s . Similarly, stage 2 runs from the stopping time τ 1 to the stopping time τ 3 , which can be expressed as τ 3 = N d τ 3 + sN s τ 3 , where N d τ 3 and N s τ 3 are the sums of the numbers of observations and switchings over the two stages, respectively. The duration of stage 2 is N 2 d + sN 2 s . In addition, once the state of an abnormal process is identified, the corresponding process stops incurring cost. Thus, the cost incurred by process 1 is cτ 1 and the cost incurred by process 3 is cτ 3 , while process 2 does not incur any cost because it is in a normal state. In conclusion, the cumulative cost incurred during the entire detection procedure is cτ 1 + cτ 3 .
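The stage decomposition in the example above can be sketched as follows (helper name ours): each stage contributes its duration N k d + sN k s to the running stopping time, and the anomaly identified at the end of stage k pays cost up to that time.

```python
def cumulative_cost(stage_obs, stage_switch, s, c=1.0):
    """Cumulative cost when one anomaly is identified at the end of each
    stage. stage_obs[k] and stage_switch[k] are the numbers of
    observations and switchings in stage k+1."""
    total, tau = 0.0, 0.0
    for n_d, n_s in zip(stage_obs, stage_switch):
        tau += n_d + s * n_s   # stopping time of the anomaly declared now
        total += c * tau       # it has paid cost c per unit time until tau
    return total
```

For the M = 3, L = 2 example, `cumulative_cost([N1_d, N2_d], [N1_s, N2_s], s)` returns c·τ1 + c·τ3, since τ3 accumulates both stage durations.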
Then, we upper bound the expected cumulative cost by analyzing the number of observations and switchings at each stage for Case I and Case II. Since the conditions of Case II in the DBS policy and the DMSC policy are the same, we can apply ( [26], Lemma 5 and Lemma 6) to our model in Case II to obtain (A7), and there exists T s ∈ (0, ∞) such that (A8) holds. However, it is not feasible to apply (A7) and (A8) directly to Case I of the DMSC policy, since the condition of Case I of DMSC is more complicated. Therefore, it remains to upper bound the numbers of observations and switchings for Case I of the DMSC policy.
For Case I, we upper bound the number of observations N k d and the number of switchings N k s of each stage by analyzing three last passage times of each stage k under the assumption that the ratio s is equal to zero; the obtained upper bounds remain applicable when s is greater than zero. Under the assumption that s = 0, these three passage times are defined as T k 1 , T k 2 , T k 3 , where T k 1 is the last passage time after which the process with the highest sum-LLR is one of the abnormal processes for all n ≥ T k 1 at stage k; T k 2 is the time after which the process with the highest sum-LLR no longer changes for all n ≥ T k 2 ; and T k 3 is the time by which sufficient information for declaring the process m 1 (n) as abnormal has been gathered, in other words, the time after which the sum-LLR of the process m 1 (n) satisfies S m 1 (n) (n) ≥ B for all n ≥ T k 3 . Hence, according to the definition of T k 2 , the number of switchings N k s of a stage is upper bounded by T k 2 , i.e., N k s ≤ T k 2 , since switchings occur only during the period from the beginning of the stage to T k 2 . In addition, the number of observations is upper bounded by T k 3 according to the definition of T k 3 . It should be noted that T k 1 , T k 2 , T k 3 are not stopping times, because they depend on the future and the decision-maker does not know whether they have arrived, since the abnormal processes are unknown. Furthermore, these three last passage times are defined under the condition that sampling is not stopped. In the following analysis, we say implemented indefinitely to indicate that the decision-maker probes the processes indefinitely.
According to the definitions, T k 1 is the earliest time such that process m 1 (n) is an abnormal process for all n ≥ T k 1 in each stage. In addition, it has been proved that T k 1 is sufficiently small with high probability [26, Lemma 2]: there exist D > 0 and γ > 0 such that P(T k 1 > n) ≤ De −γn , as stated in (A9). Furthermore, T k 2 is the earliest time such that process m 1 (n) no longer changes for all n ≥ T k 2 . We introduce n k 2 ≜ T k 2 − T k 1 , which denotes the total amount of time between T k 1 and T k 2 . The following lemma shows that P(n k 2 > n) decreases exponentially with n.
Lemma A2. Assume that the DMSC policy is in Case I and is implemented indefinitely. Then, there exist D > 0 and γ > 0 such that P(n k 2 > n) ≤ De −γn for each k = 1, 2, ..., L.

Proof. Define l̃ as in the detailed derivation; bounding it yields P(n k 2 > n) ≤ De −γn , which completes the proof.
It should be noted that switchings across processes occur only before T k 2 (where T k 2 is defined under the assumption that s = 0); therefore, the number of switchings at each stage satisfies N k s ≤ T k 2 = T k 1 + n k 2 . Then, based on (A9) and Lemma A2, the number of switchings in Case I of DMSC is upper bounded by applying ( [26], Lemma 4 and Lemma 5) (where the upper bound on the number of switchings is established) to our model. In conclusion, there exists a constant T s ∈ (0, ∞) such that the stated bound on the number of switchings holds, where B ≈ log 1−α α ≈ − log α as α → 0; similarly, there exists T s ∈ (0, ∞) bounding the number of observations. As a result, according to (4), the expected cumulative cost of the DMSC policy can be upper bounded, which completes the proof of the order-optimality.