Article

Optimal Deception Asset Deployment in Cybersecurity: A Nash Q-Learning Approach in Multi-Agent Stochastic Games

Institute of Information Technology, PLA Information Engineering University, Zhengzhou 450002, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(1), 357; https://doi.org/10.3390/app14010357
Submission received: 20 November 2023 / Revised: 26 December 2023 / Accepted: 28 December 2023 / Published: 30 December 2023
(This article belongs to the Special Issue Information Security and Cryptography)

Abstract

In the face of increasingly intricate network structures and a multitude of security threats, cyber deception defenders often employ deception assets to safeguard critical real assets. However, when confronting intranet lateral-movement attackers in the cyber kill chain, deception asset deployment suffers from a lack of dynamics, an inability to make real-time decisions, and a failure to account for dynamic changes in the attacker's strategy. To address these issues, this study introduces a maze pathfinding model tailored to the lateral-movement context, in which the defender tries to locate the attacker so that deception assets can be deployed accurately for interception. The attack–defense process is modeled as a multi-agent stochastic game, and, after comparison with a random action policy and the Minimax-Q algorithm, Nash Q-learning is chosen to solve for the deception asset deployment strategy. Extensive simulation tests reveal that the proposed model exhibits good convergence properties. Moreover, the average defense success rate surpasses 70%, attesting to the model's efficacy.

1. Introduction

The scale of networks has expanded exponentially due to rapid advances in modern science and technology. While the growth of the Internet offers immense benefits, it also introduces significant security threats. The proliferation of innovative attack methodologies and tools means that zero-day vulnerabilities are increasingly prevalent, leading to recurrent cybersecurity breaches and substantial losses across various sectors [1,2,3].
Attackers possess inherent advantages in the cyber battleground: (1) proactivity: they can actively scan and infiltrate target systems to assess potential vulnerabilities; (2) capability: they can tailor their attack techniques to the identified system weaknesses; and (3) cost efficiency: with minimal investment, they can exploit system flaws and gain access to valuable data.
The cyber kill chain, which consists of multiple attack stages [4,5], highlights the challenges faced by defense systems that are largely passive. Conventional defense mechanisms, such as access control and intrusion detection systems, are often inadequate against the advanced threats that the cyber kill chain represents. There is a pressing need for innovative defensive techniques that can shift the balance in this asymmetrical warfare.
As a defense technology that can act throughout the entire cyber kill chain, active defense offers a promising avenue to rectify this asymmetry, empowering defenders with proactive countermeasures against a broad spectrum of threats. Cyber active defense technologies fall mainly into two categories: (1) moving target defense (MTD) and (2) cyber deception. Although both are active defense technologies, they have different emphases. MTD primarily shifts the attack surface, for example by altering system port configurations or IP addresses, thwarting attackers by feeding them obsolete or incorrect information during their reconnaissance phase [6,7,8]. Cyber deception defense instead injects false information during the reconnaissance stage of an attack to mislead or confuse the attacker [9,10,11,12], inducing the attacker to take actions that benefit the defender. Honeypot systems [13,14], for example, attract potential attackers, waste their attack resources, and collect information about their tools and strategies. Compared with MTD, cyber deception defense offers strong concealment: deception assets are difficult for attackers to discover, can detect and locate attacks in time, can be deployed at a lower cost, and allow the defender to learn information about the attacker.
As depicted in Figure 1, after an initial vulnerability scan, attackers devise a targeted strategy, breach an entry node, and initiate lateral movements within an intranet to locate high-value targets. This lateral movement involves maneuvers to access valuable nodes by leveraging internal phishing, compromising shared content, or exploiting backup authentication credentials. Defenders, in contrast, focus on deploying honeypots and other deceptive assets to inhibit attackers from pinpointing high-value targets. The effective deployment of these deception assets, particularly in dynamically changing environments, is a crucial research area in deception defense.
However, deception asset deployment faces the challenges of a lack of dynamics, an inability to make real-time decisions, and a failure to consider dynamic changes in the attacker's strategy. To address these problems, this paper applies game theory to the cyber deception asset deployment problem and uses a multi-agent reinforcement learning method to locate the attacker and solve for the optimal deployment strategy. Our contributions include:
(1) Modeling the Intranet Scenario: Addressing the challenges of current deception asset deployment, we cast the attacker’s lateral movement and deception asset deployment as a non-cooperative multi-agent stochastic game. We adopt the Nash Q-learning algorithm to solve the equilibrium strategy and ascertain the best deception asset deployment approach.
(2) Introducing the Maze Model: For the attacker’s lateral movement in the intranet and the process of the defender’s deployment of deception assets, we present a maze pathfinding model to encapsulate the attack–defense process. The attacker aims to locate valuable maze nodes, whereas the defender seeks to pinpoint the attacker’s location and deploy deception assets accordingly.
(3) Validation via Simulation: To evaluate our model, we set up a Kubernetes cluster in a cloud environment for simulations. Our results validate the model’s convergence and defensive prowess. Furthermore, by comparing with another multi-agent reinforcement learning algorithm, we corroborate the superior efficacy of the Nash Q-learning algorithm.
This paper is structured as follows: Section 2 delves into related works in cyber deception defense. Section 3 describes the attacker and defender models and the maze paradigm. Section 4 presents the stochastic game model and defines attack–defense payoffs. Section 5 outlines the multi-agent reinforcement learning framework and the deception asset deployment algorithm. Experimental settings and results are detailed in Section 6. Finally, Section 7 summarizes our work.

2. Related Work

Effectively applying cyber deception defense to combat attacks and enhance system security is a pivotal research challenge in this field. This section provides an overview of the seminal works in this domain.
Deception asset deployment has historically lacked dynamics. To address this, game theory [15] has been integrated into the research as a theoretical framework for decision-making. It assists defenders in the dynamic deployment of deception assets, a subject that has been the focus of numerous studies.
La et al. [16] proposed a two-player Bayesian attack–defense deception game in IoT networks, using honeypots as the deceptive defense mechanism in a game of imperfect information. They studied Bayesian equilibria in one-shot and repeated games, respectively, and showed that beyond a certain attack frequency both parties mainly adopt deception as their strategy. Çeker et al. [17] proposed a deception-based defense mechanism to mitigate Denial of Service (DoS) attacks, in which the defender deploys honeypots to attract the attacker and obtain attack information; they used signaling games with perfect Bayesian equilibria to model the interaction between defender and attacker. Pawlick et al. [18] developed a honeypot-based defense system based on a signaling game in which the attacker is able to detect the honeypot. Extending the cheap-talk game (a signaling game between sender and receiver with no signaling cost), they studied a signaling game with deception detection, in which the sender (the defender) transmits a signal and the receiver (the attacker) can detect the deception with a certain probability; the results show that the attacker's ability to detect deception does not necessarily reduce the defender's utility. Basak et al. [19] used cyber deception tools to quickly identify attacker types so that better defense strategies could be adopted, the attacker type being reflected in the actions and goals chosen when planning the attack. They modeled the interaction as a multi-stage Stackelberg game in which the defender is the leader, who commits to a strategy while anticipating the attacker's response, and the attacker is the follower, who chooses his own strategy after observing the leader's. Anwar et al. [20] proposed a scalable algorithm for allocating honeypots over an attack graph and formulated a two-person zero-sum strategic game between the network defender and an attacker. They investigated the trade-off between security cost and deception reward for the defender and analytically characterized the Nash equilibrium defense strategies.
However, the above studies do not consider real-time decision-making for deception asset deployment. Reinforcement learning [21] addresses this gap: a reinforcement learning defender can adjust the deception asset deployment strategy in real time according to the system state.
Al Amin et al. [22] introduced an online design and placement method for network decoys that accounts for the impact of defender behavior on attacker strategy and tactics in order to balance the usability–security tradeoff. The defender maintains beliefs over security states and models the resulting decisions as a partially observable Markov decision process (POMDP). An online deception algorithm selects actions based on observed attacker behavior; the defender's belief is driven by monitoring attacker progress through a network-based intrusion detection system (NIDS), prompting actions aimed at luring attackers toward decoy nodes. Wang et al. [23] determined the optimal deployment strategy for deception assets such as honeypots. They developed a Q-learning algorithm that intelligently deploys deception assets and adapts to changes in the network security state; by analyzing the attacker's strategy under uncertainty and the defender's strategy over multiple deployment locations, they modeled the attacker–defender game and used reinforcement learning with Q-learning training to determine the most effective deployment strategy. Huang et al. [24] applied an infinite-horizon semi-Markov decision process (SMDP) to describe the random transitions and sojourn times of attackers in honeynets, quantified the reward–risk tradeoff, and designed adaptive long-term engagement policies exhibiting risk aversion, cost effectiveness, and time efficiency. Their numerical results show that the proposed adaptive interaction strategy quickly attracts attackers to the target honeypot and holds them long enough to obtain valuable threat information while keeping the penetration probability low; reinforcement learning is then applied to the SMDP to achieve fast and robust convergence to the optimal strategy and value. Zhang et al. [25] introduced a game-theoretic framework for the interaction between a defender who uses limited Security Resources (SRs) to harden the network and an attacker who follows a multi-stage attack plan. They constructed the model as a maze and represented the attacker's possible plans with attack graphs; simulation results showed that their reinforcement learning-based SR allocation is feasible and efficient.
However, none of the above studies considers an attacker whose strategy changes in real time. To address deception deployment under a dynamically adapting attacker, researchers have recently begun to adopt multi-agent models, but such studies remain scarce. This paper therefore studies intranet deception asset deployment within a framework that combines game theory and multi-agent reinforcement learning. Unlike previous works, we model the attacker's lateral movement in the intranet and the defender's defense strategy as a maze pathfinding model following [25], adopt a stochastic game [26] to describe the dynamic interaction between attacker and defender, and use the multi-agent reinforcement learning algorithm Nash Q-learning [27,28] to find the Nash equilibrium of the game and thereby solve for the deception asset deployment strategy.

3. System Model

When considering the deployment of deception assets for defensive measures, the foremost question is about their optimal placement for maximum defensive advantage. While one might intuitively think that positioning these assets near high-value nodes would yield better defense, the intricacies of network structures complicate this. Such networks may inadvertently expand the attack surface around these valuable nodes, presenting multiple access paths. Given the considerable cost of deploying these assets, it is not feasible to aim for a ubiquitous deployment for flawless defense. Against advanced persistent attackers who meticulously chart all access paths to high-value nodes to select the most rewarding attack route, the defender’s challenge is to intercept these intruders within the internal network. In this section, we formulate this problem through a maze pathfinding model. Here, the attacker must locate the target node within the maze, while the defender’s task is to spot the attacker and strategically place deception assets accordingly.

3.1. Attacker Model

The prime objective for the attacker is to breach high-value nodes within the Intranet. Originating from an external network, the attacker initiates their assault from a singular entry point. Once they gain control over a specific service node that provides external services, they use it as a stepping stone for lateral movements within the internal network, as illustrated in Figure 2. Leveraging this initial node, the attacker can gather information about adjacent nodes. However, due to their limited understanding of the system, they cannot pinpoint the direct route to the target node. Thus, we posit that the attacker cannot access the entire Intranet’s topology at once and is restricted from making jumps across nodes. Their movement is constrained to lateral transitions from one node to the subsequent one, continuing this pattern until they either locate the target node or get detected by the defender.
Furthermore, we base our model on the assumption that the attacker is a rational agent, striving to maximize attack gains within a set number of moves without being detected. However, the attacker cannot identify deception assets and is therefore vulnerable to any decoys set up by the defender. An attack is deemed successful if the attacker locates the target node, and a failure if the attacker falls for a honeypot, which grants the defender insight into the attack strategy.

3.2. Defender Model

The goal of the defender is to prevent the attacker from reaching the target node, mainly through two means: disrupting the attack path and attack discovery. Disrupting the attack path means that the defender uses deception assets to imitate high-value real assets, attracting the attacker to interact with them and deviate from the attack path. Attack discovery focuses on detecting the intruder, by deploying these deception assets, before the target node can be compromised. In our model, the emphasis is on the latter strategy.
In scenarios with multiple high-value nodes in the internal network, the defender cannot predict the attacker's specific target. Consequently, the defender adopts a methodical approach, inspecting internal network nodes sequentially to pinpoint the attacker's location and deploy deception assets effectively, ensuring that the attacker interacts with these decoys and the defense objective is fulfilled. We assume that the defender cannot directly locate the attacker; he needs to explore the intranet environment, collect vulnerability information, and search for the attacker. The defender starts the investigation from the defense node, can investigate only one node per step, and cannot jump across nodes. A step denotes one complete interaction between the agents and the environment: the agents select and execute actions, the environment returns rewards, and the environment state changes.

3.3. Construction of Maze Model

Inspired by the attack graph model, we model the intranet node topology as a maze M, shown in Figure 3. We also limit the number of neighbours a node can have to reduce the probability of attack surface exploitation. The maze is modeled as a 5 × 5 grid, with an index value assigned to each cell, and the cells are classified into five classes. The red cell is the entry node E, the initial node of the attacker's lateral movement in the internal network; in a real network an entry node can be a web server, an IoT device, or any other node directly reachable from the external network, and we take the attacker's breach of the entry node as the start of the attack. The blue cell is the defense node D, the initial node of the defender's intranet scanning and positioning; it is generally the management node in the intranet, which is capable of detecting scanning and locating the intranet attacker. The green cell is the target node T, that is, a high-value node in the internal network, typically a database, file server, or mail server. The gray cells are offline nodes O; we assume the internal network contains nodes that cannot be reached over internal links due to downtime, maintenance, or other reasons and cannot be recovered in the short term, acting like the walls of the maze. The white cells are intermediate nodes I; these nodes have low value, and the attacker can enter them freely and use them as stepping stones toward the target node. The maze model can be expressed as M = (E, D, T, I, O), where E = {0}, D = {4}, T = {24}, O = {3, 17}, and I = {1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, 23}.
Each node i has a set of valid actions, denoted VA_i, 0 ≤ i ≤ 24, which describes the connectivity of node i. Although a valid action set could in principle be unrestricted, we limit the valid actions of each node for the sake of security, because the more neighbors a node has, the higher the probability of attack surface exploitation. To fit the proposed maze model, the base valid action set is VA_i = {Up, Down, Left, Right}, which is adjusted according to each node's situation. For example, node 0 lies on the boundary, so its valid actions are only Down and Right, expressed as VA_0 = {Down, Right}. When an offline node is adjacent, the corresponding action is removed; for example, for node 7 the valid action set is VA_7 = {Up, Down, Right}. The valid action sets of the remaining nodes are obtained in the same way.
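As a concrete illustration, the following Python sketch (our own code, not the authors' implementation) builds the 5 × 5 grid with row-major indexing and offline nodes O = {3, 17} as described above, and derives a valid action set for each node; the names are illustrative.

```python
# Minimal sketch of the maze M = (E, D, T, I, O) and the valid action sets VA_i.
# Row-major 5 x 5 indexing is assumed; names are illustrative, not from the paper.
ROWS, COLS = 5, 5
ENTRY, DEFENSE, TARGET, OFFLINE = {0}, {4}, {24}, {3, 17}
INTERMEDIATE = set(range(ROWS * COLS)) - ENTRY - DEFENSE - TARGET - OFFLINE

# Each action is a (row, column) offset on the grid.
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

def valid_actions(node: int) -> list[str]:
    """Return VA_i: actions that stay on the grid and do not lead to an offline node."""
    r, c = divmod(node, COLS)
    actions = []
    for name, (dr, dc) in MOVES.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS and nr * COLS + nc not in OFFLINE:
            actions.append(name)
    return actions

assert valid_actions(0) == ["Down", "Right"]   # VA_0, matching the example in the text
```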
In the process of lateral movement within the intranet, attackers incur certain costs. These costs arise predominantly from activities such as information gathering and deploying attack tools. Nonetheless, these expenses can be somewhat mitigated by exploiting the value of intermediary nodes. Similarly, defenders also bear costs when investigating the intranet. The cost is mainly generated by system vulnerability scanning, malicious program detection, and other tools.
The attacker and defender will only get a positive payoff if they reach their goals, but because of the overhead, both players expect to be able to reach their goals in the shortest number of steps. Since the attacker does not know the topology of the entire intranet, he needs to try to find the shortest path from the entry node to the destination node through continuous probing and movement. In the model shown in Figure 3, there are obviously multiple paths from the entry node to the target node, and also there exist multiple interception paths for the defender.
The path is represented in the maze as shown in Figure 4.
The analysis shows that there are Nash equilibrium paths for both attacker and defender in the maze model. Let NEP denote the set of Nash equilibrium paths, P_a the attacker's attack path, and P_d the defender's interception path; then:
$$NEP = \left\{ \{P_a = (0, \ldots, 7, \ldots, 24),\ P_d = (4, 9, 8, 7)\},\ \{P_a = (0, \ldots, 12, \ldots, 24),\ P_d = (4, 9, \ldots, 12)\},\ \{P_a = (0, \ldots, 22, 23, 24),\ P_d = (4, 9, \ldots, 22)\} \right\}$$
For the defender, there are three best interception points: node 7 requires the fewest steps, node 12 the second fewest, and node 22 the most. The actual interception node depends on the attacker's chosen attack path.
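To make the interception analysis concrete, the short breadth-first-search sketch below (our own illustrative code, assuming the row-major 5 × 5 indexing and offline nodes {3, 17} defined in Section 3.3) computes the minimum number of moves the attacker (from entry node 0) and the defender (from defense node 4) need to reach each candidate interception node:

```python
from collections import deque

ROWS, COLS, OFFLINE = 5, 5, {3, 17}

def neighbors(node: int):
    """Yield the reachable grid neighbors of a node (offline nodes are skipped)."""
    r, c = divmod(node, COLS)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS and nr * COLS + nc not in OFFLINE:
            yield nr * COLS + nc

def bfs_distances(source: int) -> dict[int, int]:
    """Minimum number of moves from `source` to every reachable node."""
    dist, queue = {source: 0}, deque([source])
    while queue:
        u = queue.popleft()
        for v in neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

attacker_dist = bfs_distances(0)   # from entry node E = {0}
defender_dist = bfs_distances(4)   # from defense node D = {4}

for node in (7, 12, 22):           # candidate interception points from the NEP analysis
    print(node, attacker_dist[node], defender_dist[node])
# Prints: 7 3 3, 12 4 4, 22 6 6 -- the defender needs no more moves than the attacker
# to reach each candidate node, which is why these nodes can serve as interception points.
```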

4. Construction of Attack–Defense Game Model

4.1. Stochastic Game Model

In this study, both the attacker and the defender are modeled as agents, so a single-agent model is insufficient for our requirements; we therefore use a stochastic game to model the attack–defense process. Stochastic games extend the single-agent Markov decision process to multi-agent environments [29]. Within this context, we introduce a non-cooperative game environment comprising an attacker and a defender, modeled as a two-agent stochastic game MASG = (N, S, A, R, γ, p), defined as follows:
(1) N = (a, d) represents the players of the game, where a denotes the attacker and d the defender.
(2) S represents the state space. Let S_a and S_d denote the position indices of the attacker and defender, respectively; the system state at step t is then expressed as S_t = (S_a, S_d).
(3) A = (A_a, A_d) represents the action spaces of the participants, where A_a is the action space of the attacker and A_d that of the defender. The action space is the set of actions a player can choose from. At the beginning of every step, each player chooses an action from its own action space; the chosen actions form a joint action that interacts with the environment, after which each player receives a reward evaluating its chosen action.
(4) R = (R_a, R_d) represents the reward set of the participants, where R_a is the reward of the attacker and R_d the reward of the defender.
(5) γ is the discount factor, which describes the weight of rewards in future stages relative to the current stage. When making decisions, the decision-maker considers payoffs in future stages as one of the decision criteria, but they are not as important as the payoff in the current stage; the discount factor captures this relative importance.
(6) p is the state transition probability, which is determined by the agents' mixed-policy actions and the system state.
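For readers who prefer code, the following Python sketch shows one way to represent the tuple MASG = (N, S, A, R, γ, p) as a data structure; the field names and types are our own choices and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

State = Tuple[int, int]        # (attacker position index, defender position index)
JointAction = Tuple[int, int]  # (attacker action, defender action); 0..3 = Up/Down/Left/Right

@dataclass
class MASG:
    """Container for the two-agent stochastic game MASG = (N, S, A, R, gamma, p)."""
    players: Tuple[str, str] = ("attacker", "defender")               # N
    actions: Dict[str, Tuple[int, ...]] = field(
        default_factory=lambda: {"attacker": (0, 1, 2, 3),
                                 "defender": (0, 1, 2, 3)})           # A = (A_a, A_d)
    rewards: Dict[str, Callable[[State, JointAction, State], float]] = field(
        default_factory=dict)                                         # R = (R_a, R_d)
    gamma: float = 0.99                                               # discount factor
    # S is the set of joint position indices on the maze; p is deterministic here:
    # the next state follows directly from the joint action, so it is left implicit.
```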

4.2. Quantification of Attack and Defense Benefits

The goal of the attacker and defender is to find the strategy that maximizes the discounted payoff, and we quantify the payoff function of the attacker and defender in this subsection.
For the attacker, each lateral move requires probing the surrounding nodes and applying an attack technique, which incurs an attack cost denoted by AC. After a successful lateral movement, the attacker obtains valuable information from the compromised node and thus a certain payoff, where V_i denotes the value of node i. If the attacker is discovered by the defender, the attack fails, the attacker can launch no further attacks, and the defender obtains all the attack information; in this case the attacker pays a large cost, denoted by AF. The attacker's payoff at step t is therefore as follows.
$$R_a^t = \begin{cases} V_i - AC & \text{if the move succeeds} \\ -AF & \text{if the attacker is captured} \end{cases}$$
For the defender, the investigation starts from the defense node and proceeds through the internal network nodes one by one, incurring a tool cost denoted by DC. When the defender inspects a new node, it obtains certain security information about that node and hence a certain gain, denoted by DG. When the defender finds the attacker on an intranet node, the defense goal is achieved and the defender obtains a large security gain, denoted by SG. However, if the attacker reaches the target node before being found, the defense fails and the attacker acquires a large amount of high-value information, causing the defender a huge security loss, denoted by DF. The defender's payoff at step t is therefore as follows.
$$R_d^t = \begin{cases} DG - DC & \text{if the move succeeds} \\ SG & \text{if the attacker is captured} \\ -DF & \text{if the defense fails} \end{cases}$$
If the node values do not sufficiently compensate for the attack costs, the attacker must reduce the number of lateral movements. In addition, the attacker must elude detection by the defender and ensure that the defender acquires no attack-related information, thereby avoiding the penalty AF.
Similarly, a rational defender keeps moving only if the security information gained offsets the tool cost DC; otherwise, the defender reduces the number of inspected nodes, finds the attacker as soon as possible, and prevents him from reaching the target node, thereby obtaining the security gain SG and avoiding the loss DF.
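The payoff structure above can be expressed compactly in code. In the sketch below (our own illustration), the numeric constants are placeholders; the concrete values used in the experiments are given in Section 6.2.

```python
# Illustrative constants; the symbols follow Section 4.2, the values are placeholders.
AC, AF = 5.0, 100.0                        # attacker: per-move cost, penalty when captured
DC, DG, SG, DF = 3.0, 2.0, 100.0, 100.0    # defender: move cost, info gain, capture gain, failure loss

def attacker_reward(outcome: str, node_value: float = 0.0) -> float:
    """R_a^t: V_i - AC on a successful move, -AF when captured."""
    return node_value - AC if outcome == "move" else -AF

def defender_reward(outcome: str) -> float:
    """R_d^t: DG - DC on a successful move, SG on capture, -DF when the defense fails."""
    return {"move": DG - DC, "capture": SG, "failed": -DF}[outcome]
```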

5. Game Model Solution

Based on the attack–defense game model proposed in Section 4, this section presents the multi-agent reinforcement learning method used to solve for the deception asset deployment strategy.

5.1. Multi-Agent Reinforcement Learning Representations

A multi-agent system is a set of autonomous, interacting entities that share a common environment in which they individually perceive and take actions based on their action policy [30]. As shown in Figure 5, the reinforcement learning agent refines its understanding through constant interaction with this dynamic environment. In each step, the agent perceives the state of the environment and takes actions. The actions of multiple agents form a joint action to transform the environment to a new state and obtain certain rewards to evaluate the quality of actions. The goal of the agent is to maximize the cumulative reward during the interaction.
In this section, the multi-agent reinforcement learning elements are set as follows:
(1) State: According to the model proposed in Section 3, both the attacker and the defender can enter every node except the offline nodes, so each side has 23 possible position indices and there are 529 possible system states. The position indices of the attacker and defender in the maze define the system state at step t, which is shared by both sides and denoted S_t = (S_t^a, S_t^d). The system state changes with the actions taken by the two sides: the state at step t + 1 is determined by the actions taken at step t. Letting v_a and v_d denote the change in position index caused by the attacker's and the defender's actions, respectively, we have S_{t+1} = (S_t^a + v_a, S_t^d + v_d) (a code sketch of this encoding follows this list).
(2) Actions: Since both sides accomplish their goals by moving in the same environment, we give them the same action set. According to the maze model, the action set is A_a = A_d = {Up, Down, Left, Right}. Each side selects an action from this set according to its policy, but must check whether the selected action is valid before moving; that is, an agent cannot choose an action that is invalid for the node it currently occupies.
(3) Reward: In a multi-agent environment, the reward obtained through interaction with the environment depends on the actions taken by all agents. Specifically, in non-cooperative settings, different agents earn different rewards for the same joint action. In this paper, both the attacker and defender incur a cost for each move they make, which inherently restricts their mobility. The attacker's goal is to identify the most efficient route to the target node, minimizing steps to decrease the likelihood of detection by the defender. The defender needs to balance the moving cost against the number of moving steps: more movement provides a more comprehensive investigation and thus boosts the chances of identifying the attacker, but it also escalates costs. We set the rewards for both agents as given in Equations (1) and (2).
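A minimal sketch of this state and action encoding follows (our own code; treating an invalid action as "stay in place" is our assumption, since the text only states that such a move fails).

```python
ROWS, COLS, OFFLINE = 5, 5, {3, 17}
# 0 = Up, 1 = Down, 2 = Left, 3 = Right, as (row, column) offsets.
ACTION_OFFSETS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def apply_action(position: int, action: int) -> int:
    """Return the new position index, or the old one if the action is invalid."""
    r, c = divmod(position, COLS)
    dr, dc = ACTION_OFFSETS[action]
    nr, nc = r + dr, c + dc
    nxt = nr * COLS + nc
    return nxt if 0 <= nr < ROWS and 0 <= nc < COLS and nxt not in OFFLINE else position

def step(state: tuple[int, int], joint_action: tuple[int, int]) -> tuple[int, int]:
    """S_{t+1} = (S_t^a + v_a, S_t^d + v_d): both agents move simultaneously."""
    (s_a, s_d), (a_a, a_d) = state, joint_action
    return apply_action(s_a, a_a), apply_action(s_d, a_d)

# 23 reachable positions per agent give 23 * 23 = 529 joint states.
positions = [i for i in range(ROWS * COLS) if i not in OFFLINE]
state_index = {(sa, sd): k for k, (sa, sd) in
               enumerate((sa, sd) for sa in positions for sd in positions)}
assert len(state_index) == 529
```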

5.2. Nash Q-Learning Based Multi-Agent Deception Deployment Solution Algorithm

In this paper, the Nash Q-learning algorithm is used to solve the multi-agent stochastic game model. Nash Q-learning extends Q-learning from a single-agent setting to non-cooperative multi-agent environments and must therefore consider the joint action of multiple agents. For a system with n agents, the Q-function of each individual agent changes from Q(s, a) to Q(s, a^1, …, a^n). The authors of [30] considered this extended Q-function and adopted the concept of Nash equilibrium as the solution concept, defining the Nash Q-value as the expected sum of discounted rewards when all agents follow the specified Nash equilibrium strategy from the next period onward. This differs from the single-agent case, where future rewards depend only on the agent's own optimal policy. The Nash Q-function of agent i is the sum of agent i's current reward and its future rewards when all agents follow the joint Nash equilibrium strategy, expressed as follows.
$$Q_*^i(s, a^1, \ldots, a^n) = r^i(s, a^1, \ldots, a^n) + \gamma \sum_{s' \in S} p(s' \mid s, a^1, \ldots, a^n)\, \nu^i(s', \pi_*^1, \ldots, \pi_*^n)$$
where r^i(s, a^1, …, a^n) is the one-step reward obtained by agent i when the joint action (a^1, …, a^n) is taken in state s, γ is the discount factor, p(s' | s, a^1, …, a^n) is the state transition probability, (π_*^1, …, π_*^n) is the joint Nash equilibrium strategy, and ν^i(s', π_*^1, …, π_*^n) is the total discounted payoff of agent i starting from state s' when all agents follow the Nash equilibrium strategy.
The agents initialize Q at step 0, usually as Q_0^i(s, a^1, …, a^n) = 0 for all s ∈ S, a^1 ∈ A^1, …, a^n ∈ A^n. At step t, agent i observes the current state, executes its action, and then observes its own reward, the actions and rewards of the other agents, and the new state s'. A Nash equilibrium (π^1(s'), …, π^n(s')) is then computed for the stage game (Q_t^1(s'), …, Q_t^n(s')). Let α_t denote the learning rate at step t; the learning rate determines how strongly the newly observed information overrides the old Q-value. A learning rate of 0 means the agent learns nothing (only the old information matters), while a learning rate of 1 means that only the newly discovered information matters. Agent i then updates its Q-value according to Equation (4).
$$Q_{t+1}^i(s, a^1, \ldots, a^n) = (1 - \alpha_t)\, Q_t^i(s, a^1, \ldots, a^n) + \alpha_t \left[ r_t^i + \gamma\, \mathit{NashQ}_t^i(s') \right]$$
where:
$$\mathit{NashQ}_t^i(s') = \pi^1(s') \cdots \pi^n(s') \cdot Q_t^i(s')$$
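For the two-agent case in this paper, Equations (4) and (5) can be sketched in a few lines of Python. The snippet below is our own illustration: it uses the third-party nashpy library to compute a mixed equilibrium of the 4 × 4 stage game, since the paper does not state which equilibrium solver it uses, and it stores each agent's Q-table as a dictionary mapping a state to a 4 × 4 NumPy matrix (own action × opponent action).

```python
import numpy as np
import nashpy as nash   # third-party bimatrix-game solver (illustrative choice)

def nash_values(q_att: np.ndarray, q_def: np.ndarray) -> tuple[float, float]:
    """NashQ^a(s') and NashQ^d(s'): pi_a . Q^i(s') . pi_d under a stage-game equilibrium."""
    pi_a, pi_d = next(nash.Game(q_att, q_def).support_enumeration())
    return float(pi_a @ q_att @ pi_d), float(pi_a @ q_def @ pi_d)

def nash_q_update(Q_att, Q_def, s, a_a, a_d, r_a, r_d, s_next, alpha=0.001, gamma=0.99):
    """Equation (4): Q_{t+1}^i = (1 - alpha) Q_t^i + alpha [r^i + gamma NashQ_t^i(s')]."""
    nash_a, nash_d = nash_values(Q_att[s_next], Q_def[s_next])
    Q_att[s][a_a, a_d] = (1 - alpha) * Q_att[s][a_a, a_d] + alpha * (r_a + gamma * nash_a)
    Q_def[s][a_a, a_d] = (1 - alpha) * Q_def[s][a_a, a_d] + alpha * (r_d + gamma * nash_d)
```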
According to the model we proposed in the previous section, the adaptation of the Nash Q-learning algorithm is as in Algorithm 1:
Algorithm 1 Nash Q-learning for the stochastic game in the maze pathfinding model.
1: Set max_episode = Y, max_step = X                 ▷ Maximum number of training episodes and maximum steps per episode
2: Set t = 0 and S_0 = (S_0^a, S_0^d)                ▷ Initialize the step counter and the system state
3: Set A_a = A_d = {Up, Down, Left, Right}           ▷ Initialize the action spaces
4: Set R_a and R_d                                   ▷ Initialize the reward functions of the attacker and defender
5: for all S = (S^a, S^d) and a^i ∈ A^i, i ∈ N do
6:     Set Q_t^i(S, a^a, a^d) = 0                    ▷ Initialize the Q-values
7: end for
8: Initialize α_t, γ and ε                           ▷ Learning rate, discount factor and action-policy hyperparameter
9: while y ≤ Y do
10:     while t ≤ X do
11:         Compute π^i(S)                           ▷ Compute the equilibrium strategy
12:         Sample x ∈ (0, 1]                        ▷ Select the action policy
13:         if x < ε then
14:             Set action policy = random           ▷ Choose an action randomly
15:         else
16:             Set action policy = greedy           ▷ Choose the action that brings the maximum reward
17:         end if
18:         Choose action a_t^i according to the action policy
19:         Observe r_t^i, the other agent's action, and S_{t+1} = S'
20:         Update Q_t^i according to Equation (4)
21:         t = t + 1
22:         if done then
23:             Break                                ▷ The attacker reaches the target or the defender finds the attacker
24:         end if
25:     end while
26:     y = y + 1
27: end while
We adapt the model based on the original Nash Q-learning algorithm, set the multi-agent reinforcement learning parameters, and train the model.

6. Experiment

6.1. Description of Simulation Environment

To validate the efficacy of our proposed model, we carried out simulation experiments in a cloud environment. We employed Kubernetes, a widely used container orchestration tool, given its proficiency in streamlining the deployment, scheduling, and elastic scaling of applications. As illustrated in Figure 6, our Kubernetes cluster comprises a master node equipped with a 16-core processor, 16 GB of memory, and 1 TB of storage, alongside three worker nodes, each with an 8-core processor, 8 GB of memory, and 1 TB of storage.
The master node's primary responsibility is orchestrating the creation and termination of pods, while the actual pod runtime environment resides on the worker nodes, which communicate seamlessly with one another. Using the master node, we deployed a variety of applications across distinct pods, including web services, FTP services, and databases; the interconnectivity between these pods is depicted in Figure 6. For the purposes of our experiment, the pod running the MySQL database service was designated as the target node, with all other application pods serving as intermediate nodes. One web service pod was chosen as the attacker's entry node, and an additional pod was created through the master node as the defense node, on which the defense tools are deployed. The attacker initiates lateral movement from the entry-node pod to the remaining pods in search of the MySQL service pod, while the defender sets out from the defense node, aiming to pinpoint the attacker's exact location. The gray nodes depicted in Figure 6 represent offline pods, which are inactive and unable to offer services or communicate with their counterparts. Comprehensive details of the software and hardware configurations used in our simulation experiment are given in Table 1.

6.2. Simulation Parameter Setting

Based on the experimental environment described in Section 6.1, and in order to solve for the optimal equilibrium paths of the attack and defense sides, this paper adopts the multi-agent reinforcement learning algorithm based on Nash Q-learning; the simulation parameter settings are shown in Table 2.
We set the reward function of the attacker and defender as follows:
$$R_a^t = \begin{cases} -3 & \text{if the move succeeds} \\ 50 & \text{if the target is reached} \\ -100 & \text{if the attacker is captured} \end{cases}$$
$$R_d^t = \begin{cases} -1 & \text{if the move succeeds} \\ 50 & \text{if the attacker is captured} \\ -100 & \text{if the defense fails} \end{cases}$$
The initial state of the system is S_0 = (S_0^a, S_0^d) = (0, 4), since the attacker's starting position index is 0, the defender's starting position index is 4, and the target node's position index is 24, denoted S_0^a = 0, S_0^d = 4, and Goal = 24, respectively. As shown in Table 2, the action set of both sides is A_a = A_d = [0, 1, 2, 3], representing the four actions Up, Down, Left, and Right. When the maze model was constructed in Section 3.3, each position in the maze was given a valid action set; if the action chosen by either side is not in the valid action set of its current position, the move fails, so both sides gradually learn to avoid invalid actions during training. The behavior strategy of both the attacker and the defender is ε-greedy with ε = 0.1: the agent chooses an action randomly with probability 0.1 and the greedy action with probability 0.9. In other words, the agent explores the system environment with probability 0.1 and chooses the locally optimal policy in the current state (which may not yet have been trained to optimality) with probability 0.9. At the beginning of training, an agent will keep choosing the same action until a better policy is explored. It is expected that, as training proceeds, the number of moves required to end a round of the game will fluctuate slightly around the number of steps of the equilibrium paths of the two players.
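The ε-greedy behavior strategy can be sketched as follows (our own illustration; in particular, evaluating the greedy choice against the opponent's current stage-game equilibrium strategy is our assumption, since the text only specifies choosing the action with the maximum reward).

```python
import random
import numpy as np

def choose_action(q_state: np.ndarray, pi_opponent: np.ndarray,
                  valid: list[int], epsilon: float = 0.1) -> int:
    """epsilon-greedy behavior policy over the actions 0..3 (Up, Down, Left, Right).

    q_state: the agent's 4 x 4 Q-matrix for the current state (own action x opponent action).
    pi_opponent: the opponent's mixed strategy from the current stage-game equilibrium.
    valid: the valid action set of the node the agent currently occupies.
    """
    if random.random() < epsilon:                       # explore with probability epsilon
        return random.choice(valid)
    expected = q_state @ pi_opponent                    # expected value of each own action
    expected[[a for a in range(4) if a not in valid]] = -np.inf   # mask invalid actions
    return int(np.argmax(expected))
```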

6.3. Simulation Experiment and Result Analysis

6.3.1. Algorithm Comparison

To verify the effect of the Nash Q-learning algorithm on solving the model, we also applied the Minimax-Q algorithm [31] to the same model, again with an ε-greedy action policy, ε = 0.1, and otherwise identical parameters; the result is shown in Figure 7. In addition, we conducted experiments in which both the attacker and defender adopt a purely random strategy, that is, ε = 1; the result is shown in Figure 8.
Our analysis reveals that the Minimax-Q algorithm typically requires more steps to finish an episode. The primary reason is Minimax-Q's exclusive focus on actions that maximize the agent's Q-value in its current state, without accounting for a balanced strategy, so agents may get trapped in locally optimal policies and repeat largely futile actions, extending the number of steps needed to complete an episode. Figure 7 underscores this point: the step curve is noticeably erratic and shows no discernible convergence even after 500 training episodes, suggesting that the agents fail to settle on an equilibrium path or stabilize in a convergent state. By contrast, Figure 8 shows that agents employing a random strategy exhibit a more consistent oscillation in the step curve, with the steps required to finish an episode typically capped at 200. Compared to Minimax-Q, the random strategy has the advantage that the agent does not get trapped in local optima; instead, it chooses actions freely from its action set, leading to a lower step count.
Finally, we trained the Nash Q-learning algorithm with the same parameters for 500 episodes; the results are shown in Figure 9. The figure shows that during the initial phase of multi-agent training, the number of steps needed to finish an episode fluctuated significantly, indicating that neither the attacker nor the defender had yet discerned their respective equilibrium paths and that both moved almost arbitrarily within the maze until the attacker reached the target or was intercepted by the defender. As training progressed, the variation in steps began to stabilize, although occasional spikes in step counts remained. These anomalies arise in episodes where the random samples repeatedly fall within the [0.1, 1) range, so the agent keeps exploiting a locally optimal strategy and wanders back and forth in the maze until a random move lets it explore other paths.
By the latter stages of training, the step count variations became minor and tended to oscillate around a consistent mean, suggestive of both players aligning with their equilibrium paths and the training converging to a stable state. Analyzing the accumulated data revealed that defenders primarily intercepted attackers at the three nodes illustrated in Figure 4b. Nonetheless, inherent randomness occasionally diverted the attacker and defender from their balanced strategies, sometimes culminating in the defender’s inability to intercept or necessitating an extended number of steps for successful interception.
In conclusion, the Nash Q-learning algorithm performs better than Minimax-Q and the random strategy in solving the maze model. We therefore chose Nash Q-learning to solve for the deception asset deployment strategy.

6.3.2. Result Analysis

Convergence Analysis

To verify the influence of different values of ε on convergence, we set ε to 0.15, 0.2, 0.25, and 0.3, respectively, and, to reduce training time, set max_episode to 300; the experimental results are shown in Figure 10. The analysis shows that as ε increases, the probability of exploring the system environment also increases and the probability of falling into a local optimum decreases, which reduces the oscillation of the step curve. However, once the equilibrium path has been explored in the later stage of training, a larger ε causes the step curve to oscillate over a wider range around the equilibrium path, so the convergence is relatively poor. For a better convergence effect, we therefore keep ε = 0.1.
To verify the convergence of the model, we performed several experiments with the same parameters; the results of two of them are shown in Figure 11. In both experiments, the attack and defense sides fluctuate only slightly around the equilibrium path NEP_1 = {P_a = (0, 1, 2, 7, …, 24), P_d = (4, 9, 8, 7)}. Compared with the experiment shown in Figure 11a, the experiment shown in Figure 11b converges better: it oscillates strongly only in the early stage of training and stays on the equilibrium path almost all of the rest of the time, which indicates that our proposed model has good convergence.

Optimal Strategy

As shown in Figure 9, across the 500 episodes, the attacker reached the target node 143 times, whereas the defender successfully intercepted the attacker in 357 instances, resulting in a defense success probability of 71.4%. Obviously, the defender has a high defense success probability.
To further verify the defense effect of the model, we ran 10 simulation experiments each with Nash Q-learning (ε = 0.1), Minimax-Q (ε = 0.1), and the random policy; the resulting defense success probability (dsp) curves are shown in Figure 12. The average defense success probabilities of the three methods are 75.1%, 89.1%, and 79.8%, respectively. Minimax-Q performs better than the other two on this metric because its agents do not consider a Nash equilibrium strategy and simply choose the action with the maximum reward; since the attacker has only one target node whereas the defender can intercept him at any reachable node, the dsp turns out higher than for the other two methods. However, Minimax-Q converges poorly and the random policy does not converge at all, and neither of them reaches the Nash equilibrium. We therefore use Nash Q-learning not only because it converges well but also because it achieves a good dsp, which further confirms the defensive effect of the model. Analysis of the data shows that the optimal equilibrium path of the two sides is NEP_1 = {P_a = (0, …, 7, …, 24), P_d = (4, 9, 8, 7)}, as shown in Figure 13. The attacker obtains the maximum profit on P_a = (0, …, 7, …, 24), which is also the shortest path, while the defender obtains the maximum profit and the shortest interception path P_d = (4, 9, 8, 7) when the attacker takes that route.
Furthermore, we give a comparison with existing deception asset deployment methods, as shown in Table 3.
From the above comparisons, we conclude that, among these methods, ours is the only one that addresses all three challenges: the lack of dynamics, the inability to make real-time decisions, and the failure to consider dynamic changes in the attacker's strategy. Moreover, our approach is effective in defending against intranet attackers.

7. Summary

In the complex landscape of network attacks, this research models lateral movement in the attack chain as pathfinding within a maze. In this model, an attacker navigates toward a desired node to fulfill their malicious intent, while the defender strives to pinpoint the attacker's location and leverages deception assets that mimic high-value targets to divert and deceive the threat, ultimately fulfilling the defensive strategy. To address the challenge of optimal deception asset deployment, the attack–defense process is modeled as a multi-agent stochastic game. Comparative analysis with Minimax-Q and a random strategy highlighted the superior capabilities of the Nash Q-learning algorithm in our context, so Nash Q-learning emerges as a robust solution to this model. Empirical evidence from extensive simulations underscores the model's good convergence properties. Notably, across various tests, the defender achieved an average success rate exceeding 70%, underscoring the model's defensive efficacy.
As we look ahead, there is ample room to expand upon this foundational research. The current model, albeit effective, operates on a relatively constrained scale, simulating a singular attacker and a sole point of entry. Future research endeavors will pivot towards more expansive clusters, contemplating the dynamics of multiple attackers and multiple entry points.

Author Contributions

F.C. and W.H. developed the main idea and helped revise the manuscript. G.K. designed the main methods and the simulation experiment and wrote the paper. X.Y. and G.C. helped revise the manuscript. S.Z. helped build the experiment environment. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was supported by the National Key Research and Development Program of China (2021YFB1006200) and Major Science and Technology Project of Henan Province in China (221100211200).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code of the maze environment is available at https://github.com/Nero-K/maze_env (accessed on 3 November 2023).

Acknowledgments

The authors would like to thank the editor and the reviewers for their constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. IT Giant Bitmarck Shuts Down Customer, Internal Systems after Cyberattack. Available online: https://www.theregister.com/2023/05/01/bitmarck_data_breach (accessed on 1 May 2023).
  2. Maritime Giant DNV Says 1000 Ships Affected by Ransomware Attack. Available online: https://techcrunch.com/2023/01/18/dnv-norway-shipping-ransomware (accessed on 18 January 2023).
  3. Portuguese Water Utility Attacked by LockBit. Available online: https://www.scmagazine.com/brief/portuguese-water-utility-attacked-by-lockbit (accessed on 22 February 2023).
  4. Yadav, T.; Rao, A.M. Technical aspects of cyber kill chain. In Proceedings of the Third International Symposium on Security in Computing and Communications (SSCC’15), SSCC 2015, Kochi, India, 10–13 August 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 438–452. [Google Scholar]
  5. Bahrami, P.N.; Dehghantanha, A.; Dargahi, T.; Parizi, R.M.; Choo, K.-K.R.; Javadi, H.H. Cyber kill chain-based taxonomy of advanced persistent threat actors: Analogy of tactics, techniques, and procedures. J. Inf. Process. Syst. 2019, 15, 865–889. [Google Scholar]
  6. Lei, C.; Zhang, H.-Q.; Tan, J.-L.; Zhang, Y.-C.; Liu, X.-H. Moving target defense techniques: A survey. Secur. Commun. Netw. 2018, 2018, 3759626. [Google Scholar] [CrossRef]
  7. Sengupta, S.; Chowdhary, A.; Sabur, A.; Alshamrani, A.; Huang, D.; Kambhampati, S. A survey of moving target defenses for network security. IEEE Commun. Surv. Tutorials 2020, 22, 1909–1941. [Google Scholar] [CrossRef]
  8. Navas, R.E.; Cuppens, F.; Cuppens, N.B.; Toutain, L.; Papadopoulos, G.Z. Mtd, where art thou? A systematic review of moving target defense techniques for iot. IEEE Internet Things J. 2020, 8, 7818–7832. [Google Scholar] [CrossRef]
  9. Lu, Z.; Wang, C.; Zhao, S. Cyber Deception for Computer and Network Security: Survey and Challenges. arXiv 2020, arXiv:2007.14497. [Google Scholar]
  10. Wang, C.; Lu, Z. Cyber deception: Overview and the road ahead. IEEE Secur. Priv. 2018, 16, 80–85. [Google Scholar] [CrossRef]
  11. Cranford, E.A.; Gonzalez, C.; Aggarwal, P.; Tambe, M.; Cooney, S.; Lebiere, C. Towards a cognitive theory of cyber deception. Cogn. Sci. 2021, 45, e13013. [Google Scholar] [CrossRef] [PubMed]
  12. Zhang, L.; Thing, V.L. Three decades of deception techniques in active cyber defense-retrospect and outlook. Comput. Secur. 2021, 106, 102288. [Google Scholar] [CrossRef]
  13. Paxton, N.C.; Jang, D.-I.; Russell, S.; Ahn, G.-J.; Moskowitz, I.S.; Hyden, P. Utilizing network science and honeynets for software induced cyber incident analysis. In Proceedings of the 2015 48th Hawaii International Conference on System Sciences, Kauai, HI, USA, 5–8 January 2015; pp. 5244–5252. [Google Scholar]
  14. Franco, J.; Aris, A.; Canberk, B.; Uluagac, A.S. A survey of honeypots and honeynets for internet of things, industrial internet of things, and cyber-physical systems. IEEE Commun. Surv. Tutor. 2021, 23, 2351–2383. [Google Scholar] [CrossRef]
  15. Chi, C.; Wang, Y.; Tong, X.; Siddula, M.; Cai, Z. Game theory in internet of things: A survey. IEEE Internet Things J. 2021, 9, 12125–12146. [Google Scholar] [CrossRef]
  16. La, Q.D.; Quek, T.Q.; Lee, J.; Jin, S.; Zhu, H. Deceptive attack and defense game in honeypot-enabled networks for the internet of things. IEEE Internet Things J. 2016, 3, 1025–1035. [Google Scholar] [CrossRef]
  17. Çeker, H.; Zhuang, J.; Upadhyaya, S.; La, Q.D.; Soong, B.-H. Deception-based game theoretical approach to mitigate dos attacks. In Decision and Game Theory for Security, Proceedings of the 7th International Conference, GameSec 2016, New York, NY, USA, 2–4 November 2016; Proceedings 7; Springer: Berlin/Heidelberg, Germany, 2016; pp. 18–38. [Google Scholar]
  18. Pawlick, J.; Zhu, Q. Deception by design: Evidence-based signaling games for network defense. arXiv 2015, arXiv:1503.05458. [Google Scholar]
  19. Basak, A.; Kamhoua, C.; Venkatesan, S.; Gutierrez, M.; Anwar, A.H.; Kiekintveld, C. Identifying stealthy attackers in a game theoretic framework using deception. In Decision and Game Theory for Security, Proceedings of the 10th International Conference, GameSec 2019, Stockholm, Sweden, 30 October–1 November 2019; Proceedings 10; Springer: Berlin/Heidelberg, Germany, 2019; pp. 21–32. [Google Scholar]
  20. Anwar, A.H.; Kamhoua, C.; Leslie, N. Honeypot allocation over attack graphs in cyber deception games. In Proceedings of the 2020 International Conference on Computing, Networking and Communications (ICNC), Big Island, HI, USA, 17–20 February 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 502–506. [Google Scholar]
  21. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  22. Amin, M.A.R.A.; Shetty, S.; Njilla, L.; Tosh, D.K.; Kamhoua, C. Online cyber deception system using partially observable monte-carlo planning framework. In Security and Privacy in Communication Networks, Proceedings of the 15th EAI International Conference, SecureComm 2019, Orlando, FL, USA, 23–25 October 2019; Proceedings, Part II 15; Springer: Berlin/Heidelberg, Germany, 2019; pp. 205–223. [Google Scholar]
  23. Wang, S.; Pei, Q.; Wang, J.; Tang, G.; Zhang, Y.; Liu, X. An intelligent deployment policy for deception resources based on reinforcement learning. IEEE Access 2020, 8, 35792–35804. [Google Scholar] [CrossRef]
  24. Huang, L.; Zhu, Q. Adaptive honeypot engagement through reinforcement learning of semi-markov decision processes. In Decision and Game Theory for Security, Proceedings of the 10th International Conference, GameSec 2019, Stockholm, Sweden, 30 October–1 November 2019; Proceedings 10; Springer: Berlin/Heidelberg, Germany, 2019; pp. 196–216. [Google Scholar]
  25. Zhang, H.; Liu, H.; Liang, J.; Li, T.; Geng, L.; Liu, Y.; Chen, S. Defense against advanced persistent threats: Optimal network security hardening using multi-stage maze network game. In Proceedings of the 2020 IEEE Symposium on Computers and Communications (ISCC), Rennes, France, 7–10 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
  26. Horák, K.; Bošanskỳ, B. Solving partially observable stochastic games with public observations. Proc. Aaai Conf. Artif. Intell. 2019, 33, 2029–2036. [Google Scholar] [CrossRef]
  27. Hu, J.; Wellman, M.P. Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 2003, 4, 1039–1069. [Google Scholar]
  28. Vamvoudakis, K.G. Non-Zero Sum Nash Q-Learning for Unknown Deterministic Continuous-Time Linear Systems. Automatica 2015, 61, 274–281. Available online: https://www.sciencedirect.com/science/article/pii/S000510981500343X (accessed on 5 September 2015). [CrossRef]
  29. Wei, C.-Y.; Hong, Y.-T.; Lu, C.-J. Online reinforcement learning in stochastic games. Adv. Neural Inf. Process. Syst. 2017, 30, 4994–5004. [Google Scholar]
  30. Zhang, K.; Yang, Z.; Başar, T. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Handbook of Reinforcement Learning and Control; Springer: Cham, Switzerland, 2021; pp. 321–384. [Google Scholar]
  31. Zhu, Y.; Zhao, D. Online minimax q network learning for two-player zero-sum markov games. IEEE Trans. Neural Netw. Learn. Syst. 2020, 33, 1228–1241. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Attack–defense scenario.
Figure 2. Lateral movement.
Figure 3. Maze model.
Figure 4. Path diagram.
Figure 5. Multi-agent reinforcement learning framework.
Figure 6. Simulation environment.
Figure 7. Training results using Minimax-Q policy.
Figure 8. Training results using random policy.
Figure 9. Training result using Nash Q-learning.
Figure 10. Effect of different ε.
Figure 11. Convergence verification. (a) Additional experiment 1 for ε = 0.1. (b) Additional experiment 2 for ε = 0.1.
Figure 12. Convergence verification.
Figure 13. Optimal equilibrium path.
Table 1. Simulation Experiment Configurations.

Item              Configuration
OS                Windows 10 Professional
CPU               Intel Core i7-8700 @ 3.20 GHz
GPU               Intel(R) UHD Graphics 630
Memory            24 GB RAM
Python edition    3.9.11
Platform          PyCharm Community Edition
Table 2. Simulation experiment parameter settings.

Simulation Parameter     Configuration
Players' Action Set      [0, 1, 2, 3]
Max Episode              500
Max Step                 10,000
Hyperparameter ε         0.1
Learning Rate α          0.001
Discount Factor γ        0.99
Table 3. Comparison.

Paper             Model                 Method                         Result
Reference [20]    Attack graph          Game theory                    Nash equilibrium deployment strategy
Reference [25]    Maze + attack graph   MDP + single-agent RL          Nash equilibrium deployment strategy
Reference [23]    Attack graph          Single-agent RL                Optimal deployment policy
Ours              Maze                  Game theory + multi-agent RL   Optimal deployment location