Network Attack Path Selection and Evaluation Based on Q-Learning

Abstract: As the coupling between information systems and physical power grids becomes closer, various types of cyber attacks have increased the operational risks of a power cyber-physical system (CPS). In order to evaluate this risk effectively, this paper proposes a method for cross-domain propagation analysis of power CPS risk based on reinforcement learning. First, a Fuzzy Petri Net (FPN) is used to establish an attack model, and Q-Learning is improved through the FPN. The attack gain is defined from the attacker's point of view to obtain the best attack path. On this basis, a quantitative indicator of information-physical cross-domain spreading risk is put forward to analyze the impact of cyber attacks on the real-time operation of the power grid. Finally, a simulation based on the Institute of Electrical and Electronics Engineers (IEEE) 14-node power distribution system verifies the effectiveness of the proposed risk assessment method.
communication with the DMZ domain and the control center; and (3) the DMZ domain and control center can only communicate with workstations. Before quantitative modeling, the following assumptions are made about the attacker's capabilities: (1) the attacker understands Direct-Attached Storage (DAS) and has the latest DAS vulnerability information, and (2) the attacker can deliberately and effectively use social engineering to attack.


Introduction
The smart grid is a typical cyber-physical system (CPS), which uses intelligent terminals such as massive sensors and advanced metering equipment to realize remote monitoring, control, and protection of the grid [1,2]. Advanced Persistent Threat (APT) attacks exploit the openness and easy accessibility of smart terminals to invade, carry out multi-step attacks through network security vulnerabilities, and finally enter the main station and disrupt power production on a large scale [3]. For example, Ukraine suffered the "Black Energy" attack in 2015 [4], and Israel experienced a power outage in 2009 [5]. Therefore, predicting the path of a multi-step attack and analyzing the cross-layer risk of the power CPS under attack will help ensure the safe and stable operation of the power CPS [6].
When attacking, the attacker tends to choose the path with low attack cost and high attack profit, which is the best attack path. The purpose of best attack path discovery is to analyze attacking behavior with alert correlation technology, reveal the hidden logic, construct attack scenarios, and then infer the attacker's subsequent attack steps, providing important evidence for the active defense of network security [7]. It has been an important method of dealing with multi-step attacks [8]. Current research is usually static analysis based on experience. Earlier work used a forward search strategy to find hidden attack paths. Reference [9] adopted heuristic search algorithms to generate attack graphs. Reference [10] proposed pruning attack graph branches and using a greedy search strategy to find the attack path. Reference [11] combined an attack tree with a genetic algorithm to solve the optimal attack path problem. However, the above-mentioned methods have high computational complexity and are difficult to apply to large-scale networks. At the same time, they cannot reflect the influence of different attackers on path selection.
In order to solve the above problems, the Q-Learning algorithm is introduced to discover the best attack path. Q-Learning is a reinforcement learning algorithm.

Attack Model
Cyber attacks are complex and random. FPN combines the ability of Petri nets to describe concurrent events graphically with the fuzzy inference ability of fuzzy systems. It can express the transition process concisely and clearly, avoiding the problem of state space explosion. Its place credibility and transition credibility can well represent the process of network attacks and the ambiguity of the attacks. Since these characteristics of FPN meet the modeling needs, this study uses FPN to establish a network attack model. The model is a four-tuple M = {H, T, α, µ}:
(1) H = {h1, h2, h3, ..., hn} is a finite set of places h, which represent the hosts of the information system in the model;
(2) T = {t1, t2, t3, ..., tm} is a finite set of transitions t, which represent the exploitable vulnerabilities of the system hosts in the model;
(3) α represents the risk value caused by the system host represented by a place after it is invaded, that is, the threat index; and
(4) µ: T × H → (1,10) represents the confidence of the transition rule, that is, the probability that a certain transition is triggered. In the network attack model, it represents the complexity of the attack process. The higher the attack complexity, the lower the possibility of being attacked. Attack complexity is affected by many factors, such as attack tools and attacker experience. Its value is given according to the Common Vulnerability Scoring System (CVSS) [19].
This method uses the FPN place to represent the information system host, and uses transitions to represent the exploitable vulnerabilities of the system. This method makes the complex network attack processes more concise and intuitive while reasonably considering the actual network attack. In addition, the concept of the FPN place's credibility and transition confidence is also used, which can well represent the ambiguity and uncertainty of the network attack process and its impact. Moreover, it can make subsequent analysis more reasonable and effective.
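The four-tuple can be held in a small data structure. The following is a minimal sketch, not the authors' implementation: the class name, field names, host names, and all numeric values are hypothetical and chosen only for illustration. It stores each transition together with the pair of hosts it connects and checks that every µ lies in the open interval (1,10) stated above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FPNAttackModel:
    """Illustrative container for the four-tuple M = {H, T, alpha, mu}."""
    places: List[str]                       # H: hosts of the information system
    transitions: Dict[str, Tuple[str, str]]  # T: vulnerability -> (source host, target host)
    alpha: Dict[str, float]                 # threat index of each host
    mu: Dict[str, float]                    # attack complexity of each transition

    def __post_init__(self) -> None:
        for name, (src, dst) in self.transitions.items():
            if src not in self.places or dst not in self.places:
                raise ValueError(f"transition {name} refers to an unknown host")
            if not 1 < self.mu[name] < 10:
                raise ValueError(f"mu of {name} must lie in the open interval (1, 10)")

    def outgoing(self, host: str) -> List[str]:
        """Vulnerabilities an attacker located on `host` can exploit next."""
        return [t for t, (src, _) in self.transitions.items() if src == host]


# Hypothetical three-host example; the numbers are not taken from the paper.
model = FPNAttackModel(
    places=["h1", "h2", "h3"],
    transitions={"t12": ("h1", "h2"), "t13": ("h1", "h3"), "t23": ("h2", "h3")},
    alpha={"h1": 2.0, "h2": 5.0, "h3": 8.0},
    mu={"t12": 3.0, "t13": 8.0, "t23": 4.0},
)
print(model.outgoing("h1"))
```

Keeping the source and target host with each transition makes it straightforward to enumerate the next available attack steps from any host.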

FPN-Q Learning Algorithm to Determine the Best Attack Path
This algorithm applies Q-Learning to the network attack model established in the Attack Model section and uses the parameters of FPN to improve it in two ways: (1) The transition confidence µ of FPN is used to define the attack cost of a single-step attack, and the place credibility α of FPN is used to define the attack reward, that is, the threat of each attack to the system. The algorithm takes the attacker's point of view and combines the attack cost and the attack reward into an attack gain indicator. Attackers tend to choose attack paths with low attack costs but high threats to the system; that is, the path with the highest attack gain is the best attack path.
(2) As described in the Attack Model section, µ can well represent the randomness of the attacker's selection of vulnerabilities. The algorithm uses µ to optimize the exploration process of Q-Learning, which accelerates the convergence of the Q function without changing the final result. The algorithm is divided into two stages: the learning stage and the attack stage.

Learning Stage
In the traditional Q-Learning algorithm, the agent selects an action to act on the unknown environment during each iteration. After the environment receives the action, it generates a reinforcement signal (reward or punishment) and feeds it back to the agent. The agent chooses the next action based on the reinforcement signal and the new state of the environment. The principle of action selection is to increase the probability of receiving a positive reward. Through continuous learning and trial and error, the agent finds the optimal action control strategy and obtains the cumulative return. It is worth noting that the traditional Q-Learning algorithm selects actions randomly with equal probability in the exploration phase. However, in a network attack, the attacker has mastered all or part of the information about the security vulnerabilities of the information system before the attack. Therefore, the attack path is selected based on experience rather than randomly with equal probability. The probability of each vulnerability being selected is related to the attack complexity: the higher the attack complexity, the smaller the chance of being selected. Therefore, this paper introduces the transition confidence µ of the FPN model to optimize the exploration process of Q-Learning. In the exploration phase, the probability that vulnerability j of host i is exploited is denoted p_ij and is given by Formula (1), where n represents the number of exploitable vulnerabilities of host i, and µ_ij is the attack complexity of exploitable vulnerability j of host i, with value range (1,10).
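Formula (1) itself is not reproduced in this text, so the sketch below assumes one simple form that matches the description: each vulnerability is weighted by the reciprocal of its attack complexity, and the weights are normalized over the n vulnerabilities of the host. This inverse-proportional rule, the function names, and the example values are assumptions for illustration and may differ from the paper's formula.

```python
import random
from typing import Dict


def exploration_probabilities(mu_i: Dict[str, float]) -> Dict[str, float]:
    """Assumed form: p_ij = (1 / mu_ij) / sum_k (1 / mu_ik)."""
    weights = {j: 1.0 / mu for j, mu in mu_i.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_vulnerability(mu_i: Dict[str, float], rng: random.Random) -> str:
    """Sample one vulnerability of host i according to the assumed p_ij."""
    probs = exploration_probabilities(mu_i)
    names = list(probs)
    return rng.choices(names, weights=[probs[j] for j in names], k=1)[0]


# Hypothetical host with two vulnerabilities of complexity 2 and 8.
probs = exploration_probabilities({"t12": 2.0, "t13": 8.0})
print(probs)  # the low-complexity vulnerability gets the larger share
print(choose_vulnerability({"t12": 2.0, "t13": 8.0}, random.Random(0)))
```

With equal complexities this reduces to the uniform exploration of traditional Q-Learning, which is consistent with the claim that only the exploration process changes.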
In the attack model, the attack cost is related to the attack complexity of the security vulnerability. Generally, the more complex the attack, the higher the attack cost. Therefore, this paper defines the attack complexity of a security vulnerability as its attack cost. The attack reward is the threat to the information system caused by the network attack, which is related to the nature of the vulnerability itself. According to the FPN attack model established in Section 2.1, the following definitions are given. The initial single-step attack gain is the threat to the system caused by the attacker invading host j through host i before the start of the learning process, expressed by the reward function g_ij, where r_ij is the single-step attack reward, α_i is the threat value to the network caused by the intrusion of host i, and α_j is the threat value to the network caused by the intrusion of host j.
The cumulative single-step attack gain is the attack gain obtained by the attacker from invading host j through host i after multiple rounds of intrusion learning, denoted by Q(h_i, t_ij, h_j) and updated according to Formula (4),
where β represents the learning factor, γ represents the discount factor between delayed return and immediate return, T_j represents the set of vulnerabilities available for the next attack after the intrusion of host j, and max_{t_jk ∈ T_j} Q(h_j, t_jk, h_k) represents the maximum gain that the attacker can obtain in the next attack after invading host j. Based on the above definitions, the attacker is regarded as the agent of the Q-Learning algorithm, and the information system is regarded as the environment that gives feedback on the attacker's attack behavior. The basic idea of the learning process is as follows: the agent starts from the initial intrusion node obtained by scanning the network environment, then selects one of the currently exploitable system vulnerabilities according to Formula (1), and finally updates the cumulative single-step attack gain of this attack according to Formula (4). The attacker treats each attack path to the target host as one scenario of learning and repeats it until the Q value of each optional intrusion step reaches its maximum and converges. The specific learning process is shown in Algorithm 1.
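Formula (4) and Algorithm 1 are not reproduced in this text. The sketch below therefore uses the standard tabular Q-Learning update, Q ← Q + β(g + γ·max Q_next − Q), written with the symbols defined above, and explores with weights proportional to 1/µ. Both the update form and the weighting are assumptions, and the four-host graph, the gains g, and the complexities µ are hypothetical values.

```python
import random
from typing import Dict, List, Tuple

Edges = Dict[str, List[Tuple[str, str]]]  # host -> [(vulnerability, next host)]


def learn(edges: Edges, gain: Dict[str, float], mu: Dict[str, float],
          start: str, target: str, beta: float = 0.5, gamma: float = 0.8,
          episodes: int = 200, seed: int = 0) -> Dict[str, float]:
    """Learning stage: one episode is one attack path from start to target."""
    rng = random.Random(seed)
    q = {t: 0.0 for outs in edges.values() for t, _ in outs}
    for _ in range(episodes):
        host = start
        while host != target and edges.get(host):
            outs = edges[host]
            # Exploration biased toward low attack complexity (assumed 1/mu rule).
            weights = [1.0 / mu[t] for t, _ in outs]
            t, nxt = rng.choices(outs, weights=weights, k=1)[0]
            # Best gain available for the next attack after invading nxt.
            best_next = max((q[t2] for t2, _ in edges.get(nxt, [])), default=0.0)
            # Standard tabular update (assumed to correspond to Formula (4)).
            q[t] += beta * (gain[t] + gamma * best_next - q[t])
            host = nxt
    return q


# Hypothetical acyclic attack graph with target host h4.
edges = {"h1": [("t12", "h2"), ("t13", "h3")],
         "h2": [("t24", "h4")],
         "h3": [("t34", "h4")]}
gain = {"t12": 1.0, "t13": 2.0, "t24": 6.0, "t34": 2.0}
mu = {"t12": 3.0, "t13": 3.0, "t24": 4.0, "t34": 2.0}
q = learn(edges, gain, mu, start="h1", target="h4")
print(q)
```

In this example the first step through h2 has the smaller immediate gain but the larger learned Q value, because the discounted gain of the following step is propagated back to it.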

Attack Stage
The basic idea of the attack stage is as follows: after learning, the attacker selects, at each step of the attack, the host vulnerability with the highest cumulative single-step gain, until the target host is invaded. The optimal attack path and its attack gain G for the multi-step attack are thus obtained. The algorithm is shown in Algorithm 2.
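The greedy selection described above can be sketched as follows. Algorithm 2 is not reproduced in this text, so this is only an illustration: the function name, the graph, and the Q values are hypothetical, and the sketch returns the path without computing G, because the way G is aggregated from the single-step gains is not stated here.

```python
from typing import Dict, List, Tuple

Edges = Dict[str, List[Tuple[str, str]]]  # host -> [(vulnerability, next host)]


def best_attack_path(edges: Edges, q: Dict[str, float],
                     start: str, target: str) -> Tuple[List[str], List[str]]:
    """Attack stage: follow the vulnerability with the highest learned Q value."""
    hosts, steps, host = [start], [], start
    visited = {start}
    while host != target:
        # Skip hosts already on the path so the walk cannot loop forever.
        outs = [(t, nxt) for t, nxt in edges.get(host, []) if nxt not in visited]
        if not outs:
            raise ValueError(f"target {target} is not reachable from {host}")
        t, nxt = max(outs, key=lambda edge: q[edge[0]])
        steps.append(t)
        hosts.append(nxt)
        visited.add(nxt)
        host = nxt
    return hosts, steps


# Hypothetical graph and Q table (for example, the output of a learning stage).
edges = {"h1": [("t12", "h2"), ("t13", "h3")],
         "h2": [("t24", "h4")],
         "h3": [("t34", "h4")]}
q_table = {"t12": 5.8, "t13": 3.6, "t24": 6.0, "t34": 2.0}
hosts, steps = best_attack_path(edges, q_table, start="h1", target="h4")
print(hosts, steps)
```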

Information-Physical Cross-Layer Risk Spread Model
In order to evaluate the impact of the best attack path on the power CPS, the following information-physical cross-layer risk propagation model is established. The power CPS is a multi-dimensional heterogeneous system that fully integrates the physical network and the information network of the power system [20]. Through the coordination of computing equipment, sensor equipment, communication equipment, physical equipment, etc., the overall operating performance of the power system is optimized. The CPS structure under the power Internet of Things is shown in Figure 1. It can be divided into three levels from bottom to top: user load, terminal sensing, and control and decision. The physical system and the information system realize the interaction between information flow and energy flow through intelligent terminals. On the one hand, the terminal equipment collects the electricity consumption data and equipment status data of different power users, which are used by the control center to analyze the operation status of the power system and formulate appropriate control plans [21]. On the other hand, the terminals receive commands from the control center to regulate primary electrical equipment, such as increasing or decreasing generator output, adjusting transformer taps, etc. [22].

Unlike traditional cascading failures in the physical grid, information-physical dual-network cascading failures caused by cyber attacks have more serious consequences [23,24]. Smart terminal equipment is the only channel through which security risks in the information space can spread to the power space. Therefore, attackers currently tend to choose widely distributed smart terminals as access points. After obtaining permission, they use scanning software to discover the security vulnerabilities that can be exploited on each host of the system.
Eventually, the attackers carry out a multi-level invasion and, after entering the control center, modify the configuration files or business data on the server.
This causes the control center to perceive the current state of the physical power grid incorrectly [25] and affects the operation of the physical power grid through wrong control commands. This paper mainly studies attacks that tamper with system data in this situation.

Security Risk Assessment of Electric Power CPS under Cyber Attack
This paper selects the load control function for risk assessment. Its propagation process is as follows: (1) The attacker uses a certain strategy to launch an attack through the smart terminal and enter the control center, and then randomly or deliberately tampers with the business data according to the attacker's knowledge of the physical power grid, so that the load of some physical nodes appears to exceed the predetermined quota. (2) The control center considers that the load on the node exceeds the capacity and judges that the node is faulty. Therefore, the control center performs load reduction according to Formulas (5)-(10) with the goal of minimizing load loss, and issues a control command to cut off part of the load of the node and its neighboring nodes to ensure the safe and stable operation of the system.
In these formulas, I is the total load loss of the physical system, N is the number of load shedding nodes in the physical system, and Ls_i is the load loss of node i. At the same time, the power flow constraints of the distribution network and the observable and controllable nodes are considered. In the constraints, P_i and Q_i are the active power and reactive power of node i, respectively; s(i) is the set of nodes connected to node i; G_ii and B_ii are the self-conductance and self-susceptance of node i, respectively; G_ij and B_ij are the conductance and susceptance between nodes i and j, respectively; U_i and U_j are the voltages of nodes i and j, respectively; θ_ij is the phase angle difference between nodes i and j; U_min and U_max are the lower and upper limits of the voltage of node i, respectively; and I_min and I_max are the lower and upper limits of the line current.
In addition, PG_i is the power generation of the controllable power generation equipment connected to node i, PG_i^min and PG_i^max are the lower and upper limits of the generator's power generation capacity, and PD_i is the load of node i.
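Formulas (5)-(10) are not reproduced in this text. The parts that follow directly from the symbol definitions above can be written as the sketch below: the objective of minimizing the total load loss and the three pairs of limits. The symbol I_ij for the current of the line between nodes i and j is an assumption introduced here. The power flow balance equations involving P_i, Q_i, G, B, θ_ij, and PD_i are omitted because their exact form cannot be recovered from the text.

```latex
\begin{aligned}
\min \; I &= \sum_{i=1}^{N} Ls_i \\
\text{s.t.} \quad U_{\min} &\le U_i \le U_{\max}, \\
I_{\min} &\le I_{ij} \le I_{\max}, \\
PG_i^{\min} &\le PG_i \le PG_i^{\max}.
\end{aligned}
```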
(3) The intelligent terminal adjusts the load of the physical system according to the wrong instruction issued by the control system. (4) Each node of the physical power grid adjusts its load according to the control command, and some nodes lose load that would be served in normal operation. As a result, the power flow of the physical grid changes, and new business data are transmitted to the control center.
CPS security risks under data attacks are related to the threat value of the attack path to the information system and to the consequences of the cascading failures it causes. Section 2.4 discusses the attack gain G of the best attack path, that is, the threat to the information system caused by the attack through this path. Based on the above, the risk of the CPS under data attack is defined using the following quantities. R_i represents the risk of the power CPS when the load data of node i are tampered with. G represents the threat to the information system when an attacker launches an attack on the network through a predetermined path. p_i represents the probability that the data of node i are modified, which is related to the attacker's familiarity with the operation of the system. Ls_ij is the load loss of node j caused by the load change of node i. Load_j represents the original load of node j. ls_ij represents the load loss rate of node j caused by the load change of node i. ls_i indicates the load loss rate of the entire system caused by tampering with the load data of node i, which is the sum of the load loss rates of all nodes.
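The risk formula itself is not reproduced in this text. The sketch below follows the definitions above for the loss rates, ls_ij = Ls_ij / Load_j and ls_i = Σ_j ls_ij, and then assumes that the risk is the product R_i = G · p_i · ls_i. The product form, the function names, and every number in the example are assumptions for illustration only.

```python
from typing import Dict


def system_loss_rate(loss_caused_by_i: Dict[int, float],
                     original_load: Dict[int, float]) -> float:
    """ls_i: sum over nodes j of ls_ij = Ls_ij / Load_j."""
    return sum(loss / original_load[j] for j, loss in loss_caused_by_i.items())


def node_risk(attack_gain: float, p_i: float, ls_i: float) -> float:
    """Assumed form R_i = G * p_i * ls_i (not confirmed by the text)."""
    return attack_gain * p_i * ls_i


# Hypothetical case: tampering with node 6 sheds 0.1 MW at nodes 4 and 8.
original_load = {4: 0.5, 8: 0.4}           # Load_j in MW
loss_caused_by_6 = {4: 0.1, 8: 0.1}        # Ls_6j in MW
ls_6 = system_loss_rate(loss_caused_by_6, original_load)
r_6 = node_risk(attack_gain=10.0, p_i=0.5, ls_i=ls_6)
print(ls_6, r_6)
```

Under this assumed form, a larger p_i for a node with a high loss rate raises its risk directly, which matches the qualitative comparison of the two attack modes made in the experiments.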

Establishment of Simulation Environment
In order to verify the feasibility and effectiveness of the proposed algorithm, the Supervisory Control And Data Acquisition (SCADA) power distribution system was selected to establish a network attack model based on FPN, as shown in Figure 2. Depending on the ability of the attacker, attacks can be launched on different system hosts. The two most common modes are:
(1) The attacker uses the power distribution terminal equipment as the access point to invade the DMZ area, and then continuously exploits the system's security vulnerabilities to escalate privileges until entering the control center application server, where deliberate tampering with business data causes greater losses to the system. In this mode, the attacker's target is H8.
(2) The attacker uses the power distribution terminal equipment as the access point to invade the DMZ area and continuously exploits the system's security vulnerabilities to escalate privileges until entering the operator's Human Machine Interface (HMI). The business data are tampered with randomly on the HMI side, because the attacker does not have detailed physical power grid parameters and data. In this mode, the attacker's target is H4.
At the same time, in order to analyze the risks caused by network attacks to the power CPS, the IEEE14-node system shown in Figure 3 is selected as the experimental model. Nodes 1, 2, 5, and 7 of the system are distributed power sources, and nodes 4, 8, and 13 are important load nodes. The total power generation capacity of the distributed power generation is 3.7 MW, and the sum of the power requirements of each load is 3.19 MW.

Experimental Results-Security Analysis of the Information Layer
The attack gain index in this paper is based on the attack reward and the attack cost, so the relationship among the three should be studied first. There are 30 attack paths in attack mode 1 and five attack paths in attack mode 2. Figure 4 (its abscissa is the attack path number) shows the attack gain, attack reward, and attack cost of each attack path in the two attack modes. It can be seen that the trend of the attack gain is roughly the same as that of the attack reward, but the attack cost reduces the attack gain considerably. When the attack reward is small, the attack gain may even be smaller than the attack cost (path 5 of attack mode 2), which is unfavorable for the attacker. In order to further illustrate the relationship among the three, scatter plots for the two attack modes are drawn in Figures 5 and 6. Figures 5a and 6a show that the attack gain is closely related to the attack reward, and a high attack reward brings a high attack gain. Figures 5b,c and 6b,c show that high attack rewards and attack gains often require high attack costs. However, the path with the highest attack cost does not obtain the highest attack reward or the highest attack gain. This is due to the different nature of each security vulnerability. The attack complexity of the security vulnerability that poses the greatest threat to the system may not be high. For example, CVE-2004-0893 (Common Vulnerabilities and Exposures) is a user privilege escalation vulnerability whose threat index according to the CVSS reaches 7.2 (the highest is 10), while its attack complexity is only moderate. Therefore, there is an optimal attack path that reduces the cost of the attack while obtaining a higher attack reward. On this path, the attacker obtains the maximum attack gain.
Appl. Sci. 2021, 11, x FOR PEER REVIEW
[Figure panels: (a) Attack mode 1; (b) Attack mode 2]
In the two attack modes, the attacker uses the FPN-Q Learning algorithm to perform attack learning, as shown by the solid line in Figure 7. It can be seen that the attack gain obtained by invading the control center (attack mode 1) is much greater than that obtained without entering the control center (attack mode 2). This is because once an information host in the control center is compromised, the security threat to the system is greater. Therefore, it is necessary to strengthen the monitoring and protection of the control center hosts.

Experimental Results-Algorithm Comparison
(1) The algorithm proposed in this paper is compared with random attacks and selective attacks. Taking attack mode 1 as an example, the attack gains obtained over 100 attacks are shown in Figure 7a (its abscissa is the experiment number). It can be seen from the figure that, compared with random attacks, selective attacks are more likely to follow the best attack path and obtain the greatest attack gain, but the algorithm proposed in this paper shows obvious advantages: after learning 15 times, it finds the best attack path and obtains the maximum attack gain. (2) The algorithm proposed in this paper is compared with the traditional Q-Learning algorithm. The traditional Q-Learning algorithm is used to perform path learning and attack gain calculation for attack mode 1 and attack mode 2, and the results of the two methods are compared in Figure 7b. It can be seen from the figure that the attack gain obtained by the improved algorithm is consistent with that of the traditional algorithm, indicating that the improvement does not change the final result. In terms of learning speed, however, the algorithm proposed in this paper finds the best attack path faster. Especially for the more complex attack mode (mode 1), the traditional algorithm needs 30 rounds of learning to obtain the best gain and converge, while the algorithm in this paper needs only 15. The learning efficiency is roughly doubled.

Experimental Results-Cross-Layer Risk Communication Analysis
Assuming that the attacker modifies the load of only a single node each time and that the offset is 1 MW, the ls_i of each node is obtained as shown in Figure 8. It can be seen from the figure that, for the same load shedding amount, the load loss rate caused by changing the data of nodes No. 6 and No. 9 is higher. Therefore, these two nodes are defined as high-risk nodes.

Assuming that the attacker modifies the load of only a single node each time, with an offset of 1 MW, the ls_i of each node is obtained, as shown in Figure 8. The figure shows that, for the same load shedding amount, changing the data of nodes No. 6 and No. 9 causes a higher load loss rate. These two nodes are therefore defined as high-risk nodes.

In mode 1, the attacker enters the control master station and learns the operating state of the power grid, so the probability p_i of tampering with the data of nodes with a high load loss rate is higher. In mode 2, the attacker does not enter the control center and does not understand the operation of the power grid. Only business data can be modified, and only at random, so every physical node has the same probability of having its data modified. The risks faced by each node in the two attack modes are shown in Table 1.

Table 1 shows that the high-risk nodes (No. 6 and No. 9) face significantly higher risks in attack mode 1 than in attack mode 2. In mode 1 the attacker knows the grid's operating state, so the probability p_i of high-risk node data being tampered with is greater and that of ordinary node data is smaller. The opposite holds in mode 2: without knowledge of the system's operation, the attacker tampers with node data at random and p_i is the same for every node. The probability of data tampering at ordinary nodes is therefore greater in mode 2 than in mode 1, and the risk they face increases accordingly. On the whole, however, mode 1 poses a greater risk to the system than mode 2, because the attacker in mode 1 has more power grid operating data and can change the operating data in a targeted manner.
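The qualitative comparison between the two modes can be checked with a small numeric sketch. It rests on two assumptions that are not stated in this section: that the risk of node i is the product p_i × ls_i, and that in mode 1 the attacker targets nodes in proportion to ls_i. The ls_i values below are made up for illustration and are not the values of Figure 8 or Table 1.

```python
# Hypothetical ls_i values per node; illustrative only, not read from Figure 8.
LOAD_LOSS_RATE = {4: 0.02, 5: 0.03, 6: 0.08, 9: 0.07, 10: 0.02}


def tamper_probabilities(ls, targeted):
    """Return p_i for each node.

    targeted=True  (mode 1): assumed proportional to ls_i.
    targeted=False (mode 2): uniform, since the attacker tampers at random.
    """
    if targeted:
        total = sum(ls.values())
        return {node: value / total for node, value in ls.items()}
    return {node: 1.0 / len(ls) for node in ls}


def node_risk(ls, targeted):
    """Assumed risk indicator: risk_i = p_i * ls_i."""
    p = tamper_probabilities(ls, targeted)
    return {node: p[node] * ls[node] for node in ls}


if __name__ == "__main__":
    mode1 = node_risk(LOAD_LOSS_RATE, targeted=True)
    mode2 = node_risk(LOAD_LOSS_RATE, targeted=False)
    for node in LOAD_LOSS_RATE:
        print(node, round(mode1[node], 4), round(mode2[node], 4))
    print("total", round(sum(mode1.values()), 4), round(sum(mode2.values()), 4))
```

Under these assumptions the sketch reproduces the pattern described above: nodes 6 and 9 carry more risk in mode 1, the remaining nodes carry more risk in mode 2, and the total risk is larger in mode 1.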
From the above analysis, the risks faced by the power CPS are closely related to the attacker's attack mode. Therefore, for information systems, the protection of confidential data and control centers needs to be strengthened. For physical nodes, high-risk nodes should be protected and the power load should be allocated reasonably, so as to minimize the risk to the power CPS when the information system is invaded.
As can be seen from the above table, attack mode 1 poses a greater risk to the system than attack mode 2. This is because the attacker in mode 1 has more power grid operation data and can change the operation data in a targeted manner, causing greater risk to the system. The table also shows that nodes that are otherwise ordinary (such as No. 6 and No. 9) face greater operational risk under the same load shedding amount. The reason is that these nodes have less output and are not located in important positions of the system, so they are less difficult for the attacker to attack. Therefore, it is necessary to strengthen the protection of these nodes to avoid major losses to the power CPS.

Conclusions
This research uses the fuzzy reasoning ability of FPN to improve the Q-Learning algorithm and, at the same time, uses Q-Learning to compensate for FPN's inability to self-learn, thereby finding the most vulnerable path in the network system (the route along which the attacker obtains the highest gain). Compared with traditional methods, this algorithm saves computing resources and better reflects how differences between attackers and attack targets affect the network. The experimental results show that the method is accurate, which supports the study of defense measures against network attacks. In addition, this paper establishes an information-physical risk propagation model to evaluate the risks that different attack modes bring to the operation of the power grid. This makes the research applicable to the identification and protection of key nodes of the CPS, cascading failure analysis, the transmission of confidential data in management, and other fields.