Optimal Security Protection Strategy Selection Model Based on Q-Learning Particle Swarm Optimization

With the rapid development of Industrial Internet of Things technology, industrial control systems (ICS) face increasing security threats, which may lead to serious risks and extensive damage. It is therefore particularly important to construct efficient, robust, and low-cost protection strategies for ICS. However, constructing an objective function for the optimal security protection strategy that considers both the security risk and the protection cost, and finding its optimal solution, are significant challenges. In this paper, we propose an optimal security protection strategy selection model and develop an optimization framework based on Q-Learning particle swarm optimization (QLPSO). The model performs security risk assessment of the ICS by introducing the protection strategy into the Bayesian attack graph. QLPSO adopts Q-Learning to mitigate the local-optimum trapping, insufficient diversity, and low precision of the standard PSO algorithm. Simulations are performed on a water distribution ICS, and the results verify the validity and feasibility of our proposed model and the QLPSO algorithm.


Introduction
An industrial control system refers to the equipment, systems, networks, and controllers applied in industrial production, whose main functions are to operate, control, and assist industrial automation production [1][2][3]. ICS is mainly divided into four categories: Supervisory Control and Data Acquisition (SCADA) [4], Distributed Control System (DCS) [5], Programmable Logic Controller (PLC) [6], and Process Control System (PCS) [7]. ICS is widely used in electric power, nuclear power plants, petroleum, and other industries [8,9]. Due to the increasingly close connection between ICS and the Internet, ICS often faces complex network attacks, which pose a huge threat to the economy and public safety. In recent years, cyber-attack incidents against ICS have occurred frequently, such as the Iranian "Stuxnet" incident in 2010 [10] and the Ukraine power grid cyber-attack in 2015 [11], causing serious property damage. However, there are essential differences between ICS and traditional IT systems, which make traditional IT security protection strategies, such as access control and firewalls, unsuitable for the protection of ICS [12][13][14][15].
Recently, a hot research topic in ICS protection has been active protection, which implements protection strategies based on a risk assessment of the current security state of the ICS. Shameli-Sendi et al. proposed a retroactive response method based on an adaptive and cost-sensitive model [16]. This method takes into account the effectiveness of the applied response; however, it does not meet the security requirements of ICS well. K. Deb et al. proposed the non-dominated sorting genetic algorithm, which can keep the diversity of solutions at a high level [17]. Genetic algorithms generate defense and attack strategies, while fitness functions are used to infer dominant strategies. This method can maintain high accuracy, but it tends to fall into local optima and may fail to find the global optimal solution. Granadillo et al. proposed a geometric model to select the best response combination based on the response Return on Investment (RORI) index [18]. However, this approach ignores attack modeling, which is critical because cyber attacks are becoming more destructive and response time is the determining factor. Miehling et al. used a partially observable Markov decision process (POMDP) to model the defense problem [19]. Their goal is to provide the best dynamic defense strategy against ongoing cyber attacks on protected systems. However, this approach uses only a single metric, the deployment cost of attack or defense operations, and cannot quantify the problem well.
In the stage of implementing protection strategies, each protection strategy is accompanied with costs. How to balance the benefits and costs of protection under the constraint of limited resources is a typical optimization problem. Constructing such a reasonable optimization problem and finding its optimal solution are the main challenges.
In this paper, we propose an optimal security protection strategy selection model based on QLPSO. In order to evaluate the security situation of ICS, we introduce the protection strategy into the Bayesian attack graph, and calculate the probability of each attribute node being attacked to obtain the risk value. When choosing the optimal protection strategy, the general PSO algorithm often falls into a local optimum. To solve this problem, QLPSO is proposed, which updates the parameters of PSO algorithm through Q-Learning. Finally, we verify the validity and feasibility of our model and the QLPSO algorithm for a water distribution ICS.

Related Work
A lot of research has been done on constructing optimal security protection strategies for various complex systems. Jaquith proposed security metrics, such as attack cost, defense implementation cost, attack impact, and operation cost, to define the factors of the optimal solution [20]. However, this approach lacks a specific and commonly used measurement system to reliably evaluate countermeasures. S. Bandyopadhyay et al. formulated single-objective and multi-objective optimization problems to determine the optimal strategy [21]. Nevertheless, they did not discuss how to find the optimal strategy. Poolsappasit et al. proposed a multi-index quantitative analysis method based on cost and benefit, and calculated the optimal security protection strategy through a genetic algorithm [22], which, however, also tends to fall into local optima. Yigit et al. developed a network defense algorithm under limited budget conditions [23]. However, the algorithm only considers the minimum cost, does not consider the attack benefit, and lacks a comprehensive measurement. Lei et al. developed a Markov game-based strategy selection mechanism to balance the defensive revenue and the network service quality [24]. However, the method has a high time cost. Herold et al. defined a response selection approach based on user-defined cost metrics to counteract security incidents in complex network environments [25]. However, this method does not meet ICS security requirements well because it ignores the balance between security risks and protection costs. S. A. Butler proposed a multi-attribute risk assessment framework that introduces several complex metrics, such as the total security control cost and the attack strategy cost [26,27]. Roy et al. proposed a cost-effective countermeasure system for cyber attacks, using various functions to form the objective function [28]. Viduto et al. proposed a new risk assessment and optimization model to solve the problem of security countermeasure selection, in which the total cost of security control and the total risk level of the system together constitute the objective function [29]. In this approach, historical databases are used without consideration for first-experienced or zero-day attacks, exposing the system to undue risk. R. Dewri proposed a Genetic Algorithm (GA) and S. Wang et al. proposed Ant Colony Optimization (ACO); both utilize single-objective and multi-objective cost functions to select the optimal set of strategies [30,31]. Similarly, the indicators used in these methods are inconsistent with globally accepted standards. B. Kordy et al. proposed a game selection method combining ADTree modeling with integer linear programming [32,33]. The only response considered in this approach is to fix the vulnerabilities, and the many aspects (risk, cost, budget constraints, etc.) that define the optimal response plan are not considered. Speicher et al. proposed a Stackelberg programming algorithm that models the strategy selection problem as a two-player game in which the defender applies mitigation strategies to minimize the probability of a successful attack, while the attacker tries to counter them and maximize the chances of successfully executing the attack [34]. However, to improve efficiency, this method only considers the critical attack path. Zenitani et al. proposed a cost-benefit optimization algorithm that can replace multi-objective optimization algorithms and obtains the optimal solution through a series of iterations [35]. However, this method has only been tested on a few networks, and its performance in practice cannot be guaranteed.
The above-mentioned research focuses mainly on balancing benefits and costs and on constructing the objective function; the algorithm for selecting the optimal protection strategy has not been discussed in depth. In addition, there are other problems, such as the use of indicators that are inconsistent with globally accepted standards. Based on the above research, this paper proposes an optimal security protection strategy selection model based on QLPSO. The model mainly answers the following questions: (1) how to evaluate the security risk of the ICS; (2) how to determine the objective function according to the security risk of the ICS and the protection cost; (3) how to choose the optimal protection strategy.

Model Framework
Our model can be divided into three modules: ICS security risk assessment, construction of objective function, and optimal protection strategy selection. The framework of the model is shown in Figure 1. (1) ICS security risk assessment: First, we build a Bayesian attack graph based on the network configuration and asset information of the ICS [36], then calculate the exploit success rate of each edge of the attack graph, and construct the local conditional probability distribution (LCPD) table according to the exploit success rate. Finally, the prior probabilities of all attribute nodes being attacked are calculated [37]; (2) Construction of the objective function: Firstly, we construct all possible protection strategies based on the network configuration and asset information of the ICS, then quantify the attack benefit and protection cost, and finally construct the objective function based on both the attack benefit and protection cost; (3) Optimal protection strategy selection: First, we design the Q-Learning particle swarm optimization algorithm (QLPSO), then solve the objective function using the QLPSO, and finally find the optimal protection strategy.

Definition of Bayesian Attack Graph
The Bayesian attack graph is a directed acyclic graph, defined as a four-tuple (S, E, A, P), where:
(1) S = {S 1 , S 2 , . . . , S n } is the set of all attribute nodes of the attack graph.
(2) E = {. . . , E ij , . . .} is the set of all directed edges of the attack graph, where E ij has two end nodes S i and S j , with S i the parent node and S j the child node.
(3) A = {A 1 , A 2 , . . . , A n } is the set of atomic attacks. A i = 1 means that the attack has been launched; otherwise, A i = 0.
(4) P = {P(S 1 ), P(S 2 ), . . . , P(S n )} is the set of the probabilities that the attribute nodes can be attacked, where P(S i ) indicates the probability of attribute node S i being successfully attacked.
The Bayesian attack graph is established by combining the network configuration and asset information, and can be directly constructed using the MulVAL tool [38].

Calculation of Success Probability of Vulnerability Exploitation
We adopt the common vulnerability scoring system (CVSS) [39] to calculate the probability of successful exploitation of each vulnerability:

p(v i ) = 2 × AV × AC × AU,

where AV, AC, and AU are the CVSS exploitability indicators, and v i represents the exploit between the current node and its parent node. The specific scores of the CVSS indicators are shown in Table 1. AV is the access vector (attack route) value: the higher the AV, the more remotely the attacker can launch an attack. AC is the access complexity value: the higher the AC score, the lower the attack complexity for the attacker. AU is the authentication value: the higher the AU score, the fewer authentication steps the attacker needs to launch an attack.
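As an illustration, the exploit success probability can be computed from the CVSS v2 exploitability metric values. The metric values below are the standard CVSS v2 ones; the 2× scaling, common in Bayesian attack graph papers, is our assumption about the paper's exact formula and Table 1.

```python
# Hedged sketch: exploit success probability from CVSS v2 exploitability
# metrics. Metric values are the standard CVSS v2 ones; the 2x scaling is
# an assumption about the paper's formula.
AV = {"local": 0.395, "adjacent": 0.646, "network": 1.0}   # access vector
AC = {"high": 0.35, "medium": 0.61, "low": 0.71}           # access complexity
AU = {"multiple": 0.45, "single": 0.56, "none": 0.704}     # authentication

def exploit_probability(av: str, ac: str, au: str) -> float:
    """p(v_i) = 2 * AV * AC * AU, clipped to [0, 1]."""
    return min(1.0, 2.0 * AV[av] * AC[ac] * AU[au])

# A remotely exploitable, low-complexity, no-authentication vulnerability
# yields a probability close to 1:
p = exploit_probability("network", "low", "none")
```

A hard-to-exploit local vulnerability requiring multiple authentications scores far lower, which matches the intuition behind the AV, AC, and AU indicators.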

Calculation of Local Conditional Probability Distribution (LCPD)
Conditional probability [40] represents the successful exploitation probability of an attribute node under the influence of its parent node set, denoted by p(S i | Par(S i )), where Par(S i ) refers to the set of parent nodes of S i . The relationship between an attribute node and its parent nodes is shown in Figure 2. d j represents the type of each attribute node, which is either AND or OR. When d j = AND, attribute node S i is exploited only when all its parent nodes are exploited:

p(S i | Par(S i )) = 0, if ∃ S j ∈ Par(S i ), S j = 0; ∏ S j ∈Par(S i ) p(v j ), otherwise.

When d j = OR, attribute node S i can be exploited when any of its parent nodes is exploited:

p(S i | Par(S i )) = 0, if ∀ S j ∈ Par(S i ), S j = 0; 1 − ∏ S j =1 (1 − p(v j )), otherwise.

Prior Probability Calculation
The prior probability of an attribute node S i refers to the probability that the attribute node S i can be reached under static conditions. It is expressed as the joint probability of the attribute node S i and all attribute nodes in the path to reach the attribute node S i .
The prior probability of attribute node S i is calculated as follows. Once the LCPD tables have been computed, the prior probability of each attribute node equals its conditional probability (determined by its AND or OR type) multiplied by the prior probabilities of its parent nodes:

P(S i ) = p(S i | Par(S i )) × ∏ S j ∈Par(S i ) P(S j ).
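Over the whole attack graph, the priors can then be propagated in topological order. The sketch below assumes one exploit probability per node and the AND/OR combination rules described above; the graph is a hypothetical example, not the paper's case study.

```python
from math import prod

def prior_probabilities(parents: dict, node_type: dict, p_v: dict) -> dict:
    """Prior P(S_i) for each node of a DAG listed in topological order (sketch).

    parents[n] lists the parent nodes of n; p_v[n] = p(v_n) is the success
    probability of the exploit leading to n (one exploit per node assumed).
    """
    P = {}
    for n, par in parents.items():
        if not par:                       # entry node of the attacker
            P[n] = p_v[n]
        elif node_type[n] == "AND":       # all parents must be reached
            P[n] = p_v[n] * prod(P[j] for j in par)
        else:                             # OR: any reached parent suffices
            P[n] = 1.0 - prod(1.0 - p_v[n] * P[j] for j in par)
    return P

# Tiny hypothetical graph: a -> b (OR), {a, b} -> c (AND)
parents = {"a": [], "b": ["a"], "c": ["a", "b"]}
types = {"a": "OR", "b": "OR", "c": "AND"}
P = prior_probabilities(parents, types, {"a": 0.8, "b": 0.5, "c": 0.5})
# P["b"] = 1 - (1 - 0.5*0.8) = 0.4; P["c"] = 0.5 * 0.8 * 0.4 = 0.16
```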

Protection Strategy
M is a set of protection strategies, denoted as M = {M 1 , M 2 , . . . , M n }, where M i is a protection strategy that can be applied to attribute node S i to reduce its risk of being attacked. M i = 1 represents that the protection strategy is enabled; otherwise, M i = 0. When a protection strategy is enabled, the exploit success probability of the corresponding attribute node is reduced, that is,

p(v i | M i = 1) = p(v i ) × (1 − P i ),

where P i quantifies the impact of the strategy on vulnerability exploitation (see Table 7). As a consequence, the prior probabilities recomputed from these reduced exploit probabilities also decrease. Implementing protection strategies will inevitably incur protection costs. The protection cost is represented by COST = {COST 1 , COST 2 , . . . , COST n }, where COST i represents the cost of implementing the protection strategy M i . COST i is defined as [41]:

COST i = ω i × value,

where ω i is the normalized weight of the protection strategies, and value represents the value of the asset. Therefore, the total protection cost of implementing the protection strategies M = {M 1 , M 2 , . . . , M n } is given by

C(M) = ∑ i=1..n M i × COST i .
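A minimal sketch of the cost side, assuming COST_i = ω_i × value and that an enabled strategy scales the exploit probability by (1 − P_i); both forms are our reading of the definitions above, not a verbatim implementation of the paper's equations.

```python
def adjusted_exploit_prob(p_v: float, enabled: int, impact: float) -> float:
    """Reduced exploit probability p(v_i | M_i); `impact` plays the role of P_i."""
    return p_v * (1.0 - impact) if enabled else p_v

def protection_cost(M: list, omega: list, value: float) -> float:
    """Total cost C(M): sum over enabled strategies of COST_i = omega_i * value."""
    return sum(m * w * value for m, w in zip(M, omega))

protection_cost([1, 0, 1], [0.5, 0.2, 0.3], 100.0)  # 50 + 0 + 30 = 80.0
```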

Attack Benefit
The attack benefit of attribute node S i is expressed as AG(S i ), which refers to the benefit obtained by successfully attacking attribute node S i . AG(S i ) can be calculated as

AG(S i ) = P(S i ) × value,

where value is the value of the asset associated with S i . In addition, the attack benefit of attribute node S i under the protection strategies M is given by

AG(S i | M) = P(S i | M) × value,

where P(S i | M) is the prior probability computed with the reduced exploit probabilities. Therefore, the total attack benefit under the protection strategies M is obtained by summing the benefits of all attribute nodes:

AG(M) = ∑ i=1..n AG(S i | M).

Attack Benefit-Protection Cost Objective Function
Under the above definitions of attack benefit and protection cost, our objective is to minimize both the attack benefit and the protection cost. Therefore, the objective function can be expressed as

min F(M) = δ × AG(M) + (1 − δ) × C(M), subject to C(M) < B,

where δ and 1 − δ are the preference weights of the attack benefit and the protection cost, respectively, 0 ≤ δ ≤ 1, and B is the constraint on the total protection cost.
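Putting the two quantities together, the fitness evaluated by the optimizer can be sketched as follows. The infinite-penalty treatment of budget violations is our implementation choice, and the benefit/cost functions in the example are hypothetical.

```python
def fitness(M, attack_benefit, cost, delta=0.5, budget=float("inf")):
    """F(M) = delta * AG(M) + (1 - delta) * C(M), minimized subject to C(M) < B."""
    c = cost(M)
    if c >= budget:
        return float("inf")      # infeasible strategy set: worst possible fitness
    return delta * attack_benefit(M) + (1.0 - delta) * c

# Hypothetical benefit/cost functions for a 2-strategy system:
f = fitness([1, 1],
            attack_benefit=lambda M: 100.0 - 30.0 * sum(M),  # protection lowers AG
            cost=lambda M: 10.0 * sum(M))
# f = 0.5 * 40 + 0.5 * 20 = 30.0
```

Setting δ closer to 1 favors reducing the attack benefit even at high cost; δ closer to 0 favors cheap strategy sets.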

Particle Swarm Optimization
In 1995, Kennedy and Eberhart proposed the PSO algorithm, which is motivated by imitating the foraging activities of bird flocks, for solving single-objective optimization problems [42]. The state of each particle i in the swarm has two fundamental properties, velocity V i and position X i , where

V i = (V i1 , V i2 , . . . , V iD ), X i = (X i1 , X i2 , . . . , X iD ), i = 1, 2, . . . , N.

Among them, V ij represents the velocity of the ith particle in the jth dimension, X ij represents the position of the ith particle in the jth dimension, D is the dimension of the search space, and N is the swarm size.
All particles have memory, which enables them to remember their local optimal position P. At the same time, the P of all particles is shared across the whole population, and the best P is regarded as the global optimal position Gp reached by the population. Based on the obtained P and Gp, the velocity and position of every particle in the population are updated using the following equations [43]:

V t+1 id = weight × V t id + Lp × rand(0, 1) × (P t id − X t id ) + Lb × rand(0, 1) × (Gp t d − X t id ),
X t+1 id = X t id + V t+1 id .
Among them, weight is the inertia weight, Lp is the self-learning factor, Lb is the global learning factor, and rand(0, 1) is a random number in [0, 1]. V t id represents the dth dimensional velocity of the ith particle in the tth generation. P t id indicates the local optimal position of the ith particle in the dth dimension in the tth generation. Gp t d represents the global optimal position in the dth dimension in the tth generation. X t id represents the dth dimensional position of the ith particle in the tth generation.
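The velocity and position updates can be sketched as follows; the parameter values are illustrative defaults, not the settings of the paper's Table 4.

```python
import random

def pso_step(X, V, P, Gp, weight=0.7, Lp=1.5, Lb=1.5):
    """One generation of PSO updates; X, V, P are N x D lists, Gp has D entries."""
    for i in range(len(X)):
        for d in range(len(X[i])):
            V[i][d] = (weight * V[i][d]
                       + Lp * random.random() * (P[i][d] - X[i][d])   # self-learning
                       + Lb * random.random() * (Gp[d] - X[i][d]))    # social learning
            X[i][d] += V[i][d]

# One particle in one dimension, pulled toward its best (= global best) at 1.0:
X, V, P, Gp = [[0.0]], [[0.0]], [[1.0]], [1.0]
pso_step(X, V, P, Gp)  # X[0][0] moves toward 1.0
```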

Q-Learning
Q-Learning is a learning method proposed by Watkins in 1989 [44]. In Q-Learning, an agent learns in a 'trial and error' way: the reward obtained by interacting with the environment guides its behavior, and the goal is for the agent to obtain the maximum cumulative reward.
Q-Learning has four key elements: the Q table, state, action, and reward. The process of Q-Learning is as follows [45]: the agent selects the action with the largest Q value from the Q table according to the current state. After the action is executed, the state changes, and a reward is determined according to the quality of the state transition. The Q table is then updated according to the current state, the next state, the action, and the reward.
The Q table is updated as follows [46]:

Q(st t , at t ) ← Q(st t , at t ) + α × [R(st t , at t ) + γ × max at Q(st t+1 , at) − Q(st t , at t )],

where st t is the current state and st t+1 is the next state; at t is the action taken in the current state, and at t+1 is the action taken in the next state. α and γ are the learning rate and the discount factor, respectively. R(st t , at t ) is the reward generated by taking action at t in state st t , and max at Q(st t+1 , at) is the maximum Q value over all actions in state st t+1 . The model of Q-Learning is shown in Figure 3.
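The update rule can be sketched with the Q table stored as a dictionary keyed by (state, action) pairs; unvisited pairs default to a Q value of 0.

```python
def q_update(Q, st, at, reward, st_next, actions, alpha=0.1, gamma=0.9):
    """Q(st, at) += alpha * (R + gamma * max_a Q(st_next, a) - Q(st, at))."""
    best_next = max(Q.get((st_next, a), 0.0) for a in actions)
    old = Q.get((st, at), 0.0)
    Q[(st, at)] = old + alpha * (reward + gamma * best_next - old)

Q = {}
q_update(Q, st="s0", at="a0", reward=1.0, st_next="s1", actions=["a0", "a1"])
# Q[("s0", "a0")] = 0 + 0.1 * (1.0 + 0.9 * 0 - 0) = 0.1
```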

Q-Learning Particle Swarm Optimization (QLPSO)
The PSO algorithm can be directly used to solve the optimization objective function in Equation (13). However, PSO often falls into local optima due to its fixed parameters. In this paper, we consider the QLPSO algorithm, which updates the parameters of the PSO algorithm through Q-Learning to avoid the local optimum problem [47].
The state, action, Q table, and reward are also the core elements of the QLPSO algorithm, as shown in Figure 4.

(1) State
Unlike PSO, which has only one state, QLPSO has two states: the objective space state and the decision space state. The decision space state considers the positional relationship between a particle and the global optimal particle, while the objective space state considers the relationship between the fitness of the particle and the fitness of the global optimal particle.
The decision space state has four sub-states: DFarthest, DFarther, DNearer, and DNearest. They represent the relative magnitude of the Euclidean distance between the particle and the global optimal position Gp compared with the size of the search space. The objective space state also has four sub-states: the largest fitness difference, the larger fitness difference, the smaller fitness difference, and the smallest fitness difference. They represent the relative magnitude of the difference between the particle's fitness and the global optimal fitness, compared with the global best-worst fitness difference. In this article, we only need to consider the fitness difference between two solutions.
The specific information of the decision space state and the objective space state is shown in Tables 2 and 3 [48]. In Table 2, Rd is the Euclidean distance between a particle and the global optimal particle Gp, and R is the range of the decision space search. In Table 3 (objective space state), R f is the relative fitness, i.e., the fitness difference between a particle and the global optimal particle Gp, and F is the difference between the fitness of the global best particle and the fitness of the global worst particle.
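The two state discretizations can be sketched as below; the quarter-of-range thresholds are our assumption about the bins in Tables 2 and 3, which the paper defines precisely.

```python
def decision_space_state(rd: float, R: float) -> str:
    """Bin the particle-to-Gp Euclidean distance rd against the search range R.
    The 0.25/0.50/0.75 thresholds are assumed, not taken from Table 2."""
    ratio = rd / R
    if ratio > 0.75:
        return "DFarthest"
    if ratio > 0.50:
        return "DFarther"
    if ratio > 0.25:
        return "DNearer"
    return "DNearest"

def objective_space_state(rf: float, F: float) -> str:
    """Bin the fitness gap rf to Gp against the best-worst fitness spread F."""
    ratio = rf / F
    if ratio > 0.75:
        return "largest difference"
    if ratio > 0.50:
        return "larger difference"
    if ratio > 0.25:
        return "smaller difference"
    return "smallest difference"

decision_space_state(0.9, 1.0)   # "DFarthest"
objective_space_state(0.1, 1.0)  # "smallest difference"
```

The pair (decision space state, objective space state) then serves as the combined state used to index the Q table.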
(2) Action: There are four types of actions, which correspond to different parameter settings of the particle swarm: weight, Lp, and Lb. Different values of weight, Lp, and Lb affect the exploration behavior of a particle. The larger the weight, the stronger the global exploration ability and the weaker the local search ability; conversely, the smaller the weight, the weaker the global exploration ability and the stronger the local search ability. The larger Lp is, the stronger the global exploration ability; the larger Lb is, the stronger the convergence ability of the particles [49]. The detailed parameter settings are shown in Table 4. (3) Q table: The action selection process based on the Q table is shown in Figure 5. As shown in the figure, we first determine the objective space state and the decision space state, such as (the closest distance, the smallest fitness difference). Then, according to these two states, the action with the largest Q value for that state is selected. (4) Reward: After an action is taken, if the fitness value becomes worse, the agent is punished; if the fitness value becomes better, it is rewarded. The reward function defined in this paper is as follows:

R = R f (state) − R f (state + 1), (20)

where R f (state) and R f (state + 1) represent the fitness values of the current state and the next state, respectively. The flowchart of the QLPSO algorithm is shown in Figure 6 and described as follows.
(1) Initialize the population and the Q table; (2) determine the state of each particle according to its position in the objective space and the decision space; (3) determine the action (parameter setting) of the particle using the Q table; (4) update the particle according to the parameters determined in the previous step; (5) update the Q table according to the reward function; (6) repeat these steps for all particles in each generation until the maximum number of iterations is reached. The QLPSO algorithm is shown below (Algorithm 1). The time complexity and space complexity of the proposed algorithm are O(PN × maxIterations) and O(PN), respectively, where PN is the population size and maxIterations is the maximum number of iterations.

Experimental Scenario
The experimental scenario in this paper is a water distribution system, as shown in Figure 7. There are three PLCs in the water distribution system, which control the water pump, the pipeline, and the water tank in the virtual system, respectively. The platform can be used for industrial control security attack and defense drills, security risk assessment, and other related experiments [50].

Generating a Bayesian Attack Graph
Asset and vulnerability information is obtained by scanning the assets, as shown in Figure 8. The assets consist of three PLCs, each of which has the same vulnerability, CVE-1999-0517. At the same time, there is some additional information for each PLC. The attack graph of the water distribution system is generated using MulVAL; it is shown in Figure 9 and Table 5.

Attack Benefits and Protection Costs
The prior probabilities of all attribute nodes are calculated through the LCPD tables, as shown in Table 6. The quantification standard for the cost index of protection operations is shown in Table 7 [34]. In Table 7, P i refers to the impact of the implemented strategy on vulnerability exploitation. The specific operations, costs, and other details of the protection strategies are shown in Table 8.

Experimental Results
In this paper, we assume that the attack benefit is as important as the protection cost. The fitness function parameter δ is therefore set to 0.5, and no upper limit is placed on the protection budget B.
The experimental results are shown in Figure 10. The blue curve is the experimental result without protection strategies. The fitness value stays at 546.0160 and the attack benefit is very large, which means that without protection strategies, an attacker can very easily launch attacks and obtain a large attack benefit. The red curve is the experimental result of the ordinary particle swarm. Its fitness value stays at 430.0326, and the sum of attack benefit and protection cost is still quite large, indicating that the optimal protection strategy has not been obtained. The yellow curve is the experimental result of the Q-Learning particle swarm. Its fitness value stabilizes at 327.9708, and the sum of attack benefit and protection cost reaches the minimum. The experiments show that the Q-Learning particle swarm algorithm found the optimal protection strategy, which proves that the strategy obtained by the QLPSO algorithm has a good protection effect and minimizes the sum of attack benefit and protection cost. Therefore, the water distribution system is well protected by the strategy selected through QLPSO.

Conclusions
Constructing the optimal security protection strategy is of great significance to the ICS. Since the traditional PSO algorithm tends to fall into local optimum when choosing the optimal protection strategy, in this paper, we propose an optimal security protection strategy selection model and Q-Learning particle swarm optimization algorithm. The algorithm can easily select the optimal protection strategy and the experimental results verify the feasibility and effectiveness of our model and the QLPSO algorithm.