Article

Modelling and Intelligent Decision of Partially Observable Penetration Testing for System Security Verification

1 College of Computer Science, Beijing University of Technology, Beijing 100124, China
2 China Electronics Standardization Institute, Beijing 100007, China
* Author to whom correspondence should be addressed.
Systems 2024, 12(12), 546; https://doi.org/10.3390/systems12120546
Submission received: 22 October 2024 / Revised: 22 November 2024 / Accepted: 26 November 2024 / Published: 9 December 2024
(This article belongs to the Special Issue Advanced Model-Based Systems Engineering)

Abstract

As network systems become larger and more complex, there is an increasing focus on how to verify the security of systems that are at risk of being attacked. Automated penetration testing is one of the effective ways to achieve this. Uncertainty caused by adversarial relationships and the “fog of war” is an unavoidable problem in penetration testing research. However, related methods have largely focused on the uncertainty of state transitions in the penetration testing process, and have generally ignored the uncertainty caused by partially observable conditions. To address this new uncertainty introduced by partially observable conditions, we model the penetration testing process as a partially observable Markov decision process (POMDP) and propose an intelligent penetration testing decision method compatible with it. We experimentally validate the impact of partially observable conditions on penetration testing. The experimental results show that our method can effectively mitigate the negative impact of partially observable conditions on penetration testing decision. It also exhibits good scalability as the size of the target network increases.

1. Introduction

As network systems become larger and more complex, there is an increasing focus on how to verify the security of systems that are at risk of being attacked. Automated penetration testing [1] is one of the effective ways to achieve this. Traditional Penetration Testing (PT) relies on manual approaches that become impractical as systems grow in size and complexity. Automated penetration testing technologies simulate real attackers with attack strategies and use various algorithmic models to automate the penetration of target networks, significantly reducing testing costs and increasing penetration efficiency [2,3].
Due to the adversarial relationship between attackers and defenders, an accurate view of the whole situation is often unavailable, causing classical decision algorithms to fail. Studying modelling and decision methods under partially observable conditions in such adversarial environments is therefore of critical importance [4].
Studies on penetration testing decision typically model the penetration testing process as a Markov decision process (MDP) [5], a traditional mathematical tool for formalizing process dynamics and state transition uncertainty. However, the MDP does not adequately account for partially observable problems.
To address this issue, we model the penetration testing process as a partially observable Markov decision process (POMDP) [6]. In addition to process dynamics and state transition uncertainty, we extend the modelling of observational locality and observational uncertainty, thereby strengthening the formal expression of partially observable problems. Correspondingly, a compatible intelligent decision method is proposed that combines deep reinforcement learning [7] and recurrent neural networks [8]. By exploiting the characteristics of recurrent neural networks in handling sequential data and the advantages of deep reinforcement learning in exploring and learning unknown systems, the method can effectively mitigate the negative impact of partially observable conditions on penetration testing decision.
The main contributions of this paper are as follows:
  • Formalising the penetration testing process as a partially observable Markov decision process, allowing observational locality and observational uncertainty to be explored and studied in adversarial scenarios. This approach is more consistent with the realities of the “fog of war” and with the subjective, observation-based decision process in intelligent decision-making.
  • To address the new uncertainty challenges posed by partially observable conditions, an intelligent penetration path decision method combining deep reinforcement learning and recurrent neural networks is proposed. This method enhances the ability to learn from sequential attack experience under partially observable conditions.

2. Related Works

Automated penetration testing is a significant application of artificial intelligence technology in the field of cybersecurity [9]. Reinforcement learning-based methods can simulate real-world attack and defence uncertainties by assigning success probabilities to action execution, making them a crucial research direction in this domain. Schwartz et al. [10] designed a lightweight network attack simulator, NASim, providing a benchmark platform for network attack–defence simulation testing, and validated the effectiveness of fundamental reinforcement learning algorithms such as the Deep Q-Network (DQN) in penetration path discovery. Zhou et al. [11] proposed an improved reinforcement learning algorithm, Noisy-Double-Dueling DQN, to enhance the convergence speed of DQN in path discovery problems. Nguyen et al. [12] introduced an A2C algorithm with a dual-agent architecture, with the two agents responsible for path discovery and host exploitation, respectively. Zeng et al. [13] proposed a hierarchical reinforcement learning algorithm that addresses the separate handling of path discovery and host exploitation.
Methods based on reinforcement learning often formalise the penetration testing process as a Markov decision process (MDP). While these methods describe the dynamics of the penetration testing process and the uncertainty of state transitions, they do not effectively model the observational locality and observational uncertainty of the network attack–defence “fog of war”. Some studies have therefore focused on modelling partially observable conditions in penetration testing [14]. Sarraute et al. [15] incorporated the information gathering stage into the penetration path generation process, achieving automated penetration testing of individual hosts for the first time. Shmaryahu et al. [16] modelled penetration testing as a partially observable episodic problem and designed an episodic planning tree algorithm to plan penetration paths. However, these methods provide neither a detailed analysis nor a general model of partially observable problems, and they do not consider integration with intelligent methods to support automated testing.
Based on the above studies, we further investigate the impact of partially observable conditions on penetration testing. We analyse and model the partially observable problems in penetration testing from the two aspects of observational locality and observational uncertainty, and integrate them with an MDP. Based on this, we propose an intelligent decision method to enhance the learning ability of sequential attack experience under partially observable conditions.

3. Problem Description and Theoretical Analysis

Figure 1 shows a typical penetration testing scenario, which we use as a case study to illustrate the problem addressed in this paper. The target network consists of four subnets, each containing several hosts, including sensitive hosts. The role of the PT decision method is to decide, based on state observation, the attack target at each step and to plan the optimal attack path that maximises the attack reward. In this scenario, the goal is to intelligently obtain the highest privileges on hosts (2,0) and (4,0) with the fewest attack steps, while avoiding host (3,2) so as not to trigger the negative reward associated with the honeypot [17].
The MDP is typically modelled as a quadruple ⟨S, A, R, P⟩, where S denotes the set of penetration states observed by the agent, such as network topology, host information, and vulnerability details. A is the set of attack actions available to the PT agent, such as network scanning, host scanning, vulnerability exploitation, and privilege escalation. R is the reward function, and R(s) gives the reward for each penetration state; for example, the reward for obtaining the highest privileges on host (2,0) is 100, and the reward for attempting to attack host (3,2) is −100. P denotes the state transition function P(s, a, s′) = Pr(s′ | s, a), typically associated with the success rate of the attack actions. The goal of the MDP is to select the optimal policy a_t = π(s_t) to optimise the long-term cumulative reward G(s_0) for the current state, as in (1).
G(s_0) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, R(s_{t+1})\, P\big(s_t, \pi(s_t), s_{t+1}\big)\right] \quad (1)
where γ ∈ (0, 1) denotes the discount factor, used to balance the importance of current rewards against future rewards.
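For illustration, the discounted return in (1) can be estimated by Monte Carlo sampling of episodes. The following minimal Python sketch assumes a per-step reward sequence has already been collected; the example rewards are placeholders, not values from the paper.

def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for one sampled episode (Monte Carlo estimate of G(s0))."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# Hypothetical 4-step episode: three actions with costs, then root on a sensitive host.
episode_rewards = [-1, -1, -3, 100]
print(discounted_return(episode_rewards, gamma=0.9))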
The MDP formalises the process dynamics and state transition uncertainty of penetration testing, but assumes accurate observability, i.e., the state observed by the PT agent matches the actual state: S^(obs) = S^(act). However, the existence of adversarial relationships and the “fog of war” makes this assumption impractical. On the one hand, the PT agent cannot directly obtain a global perspective, but gradually explores the topology and host information as it moves laterally through the penetration. On the other hand, the application of moving target defence and deceptive defence methods may also introduce inaccuracies in the observed state.
Therefore, in this paper we make a more realistic assumption: S^(obs) ≠ S^(act), indicating that the PT agent is subject to observational locality and observational uncertainty, which means that the observed state is not only partial but may also deviate from the actual state.
In this paper, we model the penetration testing process as a partially observable Markov decision process (POMDP), taking into account the observational locality and observational uncertainty of the PT agent. The actual penetration process still rests on the MDP quadruple, but the set of attack actions A is replaced by the intelligent agent model 𝒜, giving ⟨S, 𝒜, R, P⟩. The intelligent agent model is further defined as 𝒜 = (O, A, π), where O is the observation function, O: s^(act) → s^(obs), which describes the observational locality and observational uncertainty; A denotes the set of available attack actions for the PT agent; and π denotes the decision strategy of the PT agent based on the observed state, π: s^(obs) → a. The goal is to select the optimal strategy a_t = π(s_t^(obs)) based on the partially observed state s_t^(obs) of the PT agent, rather than the actual state s_t^(act), in order to optimise the long-term cumulative reward G(s_0) for the current state, as in (2).
G(s_0) = \mathbb{E}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, R(s_{t+1})\, P\big(s_t, \pi(s_t^{(obs)}), s_{t+1}\big) \,\middle|\, O: s_t \mapsto s_t^{(obs)}\right] \quad (2)
In our model, the actual state s_t^(act) still evolves according to objective attack actions and an objective state transition function. However, the observation and decision components of the PT agent are treated independently: decisions are made on observed states s_t^(obs), which may differ from the actual states. This aligns more closely with actual “fog-of-war” scenarios and better simulates the subjective, observation-based decision process, which is the main innovation of this paper.
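The separation between actual state, observation, and decision can be summarised in a short interaction-loop sketch. Here env, observe, and policy are hypothetical stand-ins for the transition dynamics P, the observation function O, and the strategy π; they are not part of the paper's implementation.

def run_episode(env, observe, policy, gamma=0.9, max_steps=100):
    """POMDP interaction loop: the environment evolves on the actual state s_act,
    while the PT agent only ever sees s_obs = O(s_act) and decides on it."""
    s_act = env.reset()                  # actual state (hidden from the agent)
    g, done, t = 0.0, False, 0
    while not done and t < max_steps:
        s_obs = observe(s_act)           # O: s_act -> s_obs (local, possibly perturbed)
        a = policy(s_obs)                # decision is based on the observation only
        s_act, r, done = env.step(a)     # objective transition via P(s, a, s')
        g += (gamma ** t) * r
        t += 1
    return g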

4. Methodology

4.1. POMDP Modelling

4.1.1. State Modelling

Typically, related studies have modelled the state space at different stages of the penetration process. However, the penetration stage is a state concept defined from a global perspective, which contradicts the partially observable conditions. Therefore, we propose a new state modelling approach that describes the penetration state based on the feedback obtained from each action of the PT agent. In the penetration process, the primary feedback consists of the target host information and its changes. Thus, we model the penetration state s_t^(act) from the information of each host and derive the observation state s_t^(obs) for the PT agent. The information for each host includes topological location, compromise information, operating system details, service details, and process details. Different types of information are encoded and identified using one-hot vectors. The dimensions of the encoding are determined by the estimated size and complexity of the specific scenario. In the scenario shown in Figure 1, where the total number of subnets is 4, the corresponding field dimension can be set to any value of at least 4. Larger dimensions leave more capacity for growth, but an accurate estimate improves modelling accuracy and efficiency.
Using the scenario in Figure 1 as an example, the modelling of host information is as follows:
  • Topological location information: Topological location information includes subnet identifiers and host identifiers. For the scenario in Figure 1, the capacity of the subnet identifier field can be set to 4 and the capacity of the host identifier field can be set to 5. Thus, the dimension of the topological location field is 9, with the first 4 dimensions identifying the subnet in which the target host is located, and the remaining 5 dimensions identifying the location of the host within the subnet. For example, the topological location encoding for the honeypot host (3,2) is (0010 00100).
  • Compromised state information: Compromised state information is encoded by a 6-dimensional vector, including whether compromised, accessible, discovered, compromise reward, discovery reward, and compromised privileges (user or root).
  • Operating system, service, and process information: These three categories of information are closely related to vulnerabilities and attack methods. In the scenario shown in Figure 1, the target hosts are configured with 2 types of operating systems, 3 types of services, and 2 types of processes, as shown in Table 1. Based on the requirements of a particular scenario, this can be extended to more detailed types, such as adding software version numbers to distinguish Windows 7 and Windows 10 as two different operating system types, with a correspondingly higher-dimensional operating system field. This supports a more detailed mapping of vulnerabilities and attack methods; Common Vulnerabilities and Exposures (CVE) [18] and Common Attack Pattern Enumeration and Classification (CAPEC) [19] can be introduced to identify these mappings. Using the honeypot host (3,2) as an example, the operating system is encoded as (01), the service is encoded as (101), and the process is encoded as (01). A sketch of this host-information encoding is given after this list.
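The following Python sketch illustrates the host-information encoding above, using the field sizes of the Figure 1 scenario (4 subnets, 5 hosts per subnet, a 6-dimensional compromised-state field, 2 operating systems, 3 services, 2 processes). The function names and the index ordering of the service and process fields are assumptions chosen to reproduce the examples in the text.

import numpy as np

def one_hot(index, size):
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def encode_host(subnet, host, os_id, service_ids, process_ids, compromised=None):
    """Concatenate the per-host fields described in Section 4.1.1 into one vector."""
    if compromised is None:
        compromised = np.zeros(6, dtype=np.float32)                 # 6-dim compromised-state field
    topo = np.concatenate([one_hot(subnet, 4), one_hot(host, 5)])   # 9-dim topological location
    os_vec = one_hot(os_id, 2)                                      # 2-dim OS field
    svc_vec = np.zeros(3, dtype=np.float32); svc_vec[service_ids] = 1.0    # 3-dim service field
    proc_vec = np.zeros(2, dtype=np.float32); proc_vec[process_ids] = 1.0  # 2-dim process field
    return np.concatenate([topo, compromised, os_vec, svc_vec, proc_vec])

# Honeypot host (3,2): third subnet (index 2), third host slot (index 2), Windows,
# services {FTP, HTTP}, process Daclsvc. The index ordering of the service and process
# fields is an assumption chosen to reproduce the (101) and (01) examples in the text.
h = encode_host(subnet=2, host=2, os_id=1, service_ids=[0, 2], process_ids=[1])
print(h.shape)   # (22,) = 9 + 6 + 2 + 3 + 2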

4.1.2. Action and Reward Modelling

To simulate a realistic observation-based penetration process, we assume that both topology and host information must be obtained through scanning or through feedback from attack actions. Therefore, in addition to the usual vulnerability exploitation (VE) and privilege escalation (PE) actions specific to each vulnerability, we also model four types of scanning actions: (1) subnet scan (subnet_scan) to discover all hosts in the subnet; (2) operating system scan (os_scan) to obtain the operating system type of the target host; (3) service scan (service_scan) to obtain the service type of the target host; and (4) process scan (process_scan) to obtain information about the processes of the target host. By performing the scan actions, the PT agent acquires the corresponding host information as an observed state.
The VE and PE actions must be performed based on their specific prerequisites. In addition, a success probability is set according to the Common Vulnerability Scoring System (CVSS) [20] to simulate the uncertainty of attacks in reality. Among the CVSS classification criteria, the Attack Complexity (AC) index assesses the difficulty of exploiting a vulnerability. AC takes three values: high, medium, and low. Medium and low complexity indicate that a vulnerability is relatively easy to attack and requires few conditions to exploit, while a high-complexity vulnerability requires more preconditions and considerably more preparation by the attacker. Therefore, to describe the uncertainty of vulnerability exploitation, we follow [21] in setting the success probabilities of vulnerability exploitation. By configuring different VE and PE actions, the PT agent is modelled with different attack capabilities, as exemplified in Table 2. The results of attack actions are simulated by updating the compromised state information of the target host.
The reward is modelled in three aspects. First, there is a reward for obtaining the highest privilege (root) on each host, set to 100 for the sensitive hosts (2,0) and (4,0) and 0 for all other hosts. Second, each attack action has a cost: scanning and PE actions have a cost of 1, while the VE actions E_SSH, E_FTP, and E_HTTP have costs of 3, 1, and 2, respectively. Finally, some preset decoy traps (such as honeypot nodes) disclose attack information, which translates into a penalty for the attacker; for example, there is a penalty of −100 if the attacker tries to exploit vulnerabilities on the honeypot host (3,2).
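These reward terms can be collected into a single step-reward function. The sketch below uses the values stated above; the dictionary layout and function signature are illustrative, not the paper's implementation.

HOST_REWARDS = {(2, 0): 100, (4, 0): 100}        # root on sensitive hosts
HONEYPOT_PENALTY = {(3, 2): -100}                # attempting to exploit the honeypot
ACTION_COSTS = {"subnet_scan": 1, "os_scan": 1, "service_scan": 1, "process_scan": 1,
                "P_Tomcat": 1, "P_Daclsvc": 1,        # PE actions
                "E_SSH": 3, "E_FTP": 1, "E_HTTP": 2}  # VE actions

def step_reward(action, target, gained_root, triggered_honeypot):
    """Per-step reward: action cost plus root reward and/or honeypot penalty."""
    r = -ACTION_COSTS.get(action, 1)
    if gained_root:
        r += HOST_REWARDS.get(target, 0)
    if triggered_honeypot:
        r += HONEYPOT_PENALTY.get(target, 0)
    return r

print(step_reward("E_FTP", (3, 2), gained_root=False, triggered_honeypot=True))  # -101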

4.1.3. Partially Observable Condition Modelling

For partially observable conditions, we define the observational locality and the observational uncertainty:
  • Observational locality. Observational locality means that the information the PT agent obtains about the target network or hosts is related only to the current action, without any assumption of a global perspective. With this in mind, we define the information acquisition capability of each attack action type by local observation items h; each attack action can only obtain the information about the target host associated with the corresponding local observation item. For example, the privilege escalation action PE, depending on whether the escalation succeeds, encodes new host information with an updated compromised privileges field and adds it to the observation trajectory Tr. The subnet scan action subnet_scan, on the other hand, may add several newly discovered items h, depending on how many hosts are discovered. That is, each POMDP state s contains a collection of n local observation items h, and the observation trajectory Tr is recorded sequentially.
    Since we use one-hot encoding for host information (see Section 4.1.1), the merging of multiple local observation items does not cause confusion. This modelling approach helps to shield the influence of global information on the dimensions of observation states, making the algorithm more scalable.
  • Observational uncertainty. Observational uncertainty refers to the fact that the observations obtained by the PT agent may not be accurate. In this paper, we model this by introducing random changes to the target fields of the local observation items h. A Partially Observable Module (PO Module) is designed to introduce random changes to certain fields of the observation state with a certain probability, O(ρ): s^(act) → s^(obs). The pseudocode for the PO Module is shown in Algorithm 1. We define a field-specific random change strategy for each type of attack action; for example, random changes are introduced to the compromised privileges field for the privilege escalation action PE, and to the topological location field for the subnet scan action subnet_scan.
Algorithm 1 Partially Observable Module
Input: Action a, Actual state s^(act), Observational Uncertainty Factor ρ
Output: Observation state s^(obs)
s^(obs) = s^(act)
if Random(0,1) < ρ then
    switch (a)
        case a = subnet_scan:   randomly change the s^(obs).host field
        case a = os_scan:       randomly change the s^(obs).OS field
        case a = service_scan:  randomly change the s^(obs).service field
        case a = process_scan:  randomly change the s^(obs).process field
        case a ∈ VE ∪ PE:       randomly change the s^(obs).privileges field
        default:                break
    end switch
end if
return s^(obs)
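For concreteness, the PO Module can be sketched in Python as below. It assumes the observation state is a dictionary of named one-hot fields; the field names, the action-to-field mapping table, and the bit-flip style of the random change are illustrative assumptions.

import copy
import random

ACTION_FIELD = {"subnet_scan": "host", "os_scan": "os", "service_scan": "service",
                "process_scan": "process", "VE": "privileges", "PE": "privileges"}

def po_module(action, s_act, rho):
    """Algorithm 1 sketch: derive an observation from the actual state and, with
    probability rho, randomly change the field associated with the action type."""
    s_obs = copy.deepcopy(s_act)
    if random.random() < rho:
        field = ACTION_FIELD.get(action)
        if field is not None:
            vec = s_obs[field]                 # a list of 0/1 values for that field
            i = random.randrange(len(vec))
            vec[i] = 1 - vec[i]                # random change: flip one element
    return s_obs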

4.2. PO−IPPD Method

We design an intelligent penetration path decision method based on the DQN algorithm framework and extend it to better suit partially observable conditions, named the Partial Observation-based Intelligent Penetration Path Decision (PO−IPPD) method. We extend DQN by incorporating Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) structures into the target and evaluation networks. This modification allows the method to take trajectory information from previous time steps into account when making decisions, using historical experience to compensate for the uncertainty of current information. The PO−IPPD framework is illustrated in Figure 2.
The actual state s^(act), processed by the Partially Observable Module (see Algorithm 1), serves as the observed state s^(obs) for the PT agent. Trajectory information is recorded and stored in the experience pool. During training, random samples are collected, and Q(s′^(obs), a′ | θ⁻) and Q(s^(obs), a | θ) are computed using the Q_target and Q_valuation networks, respectively, where θ and θ⁻ denote the weight parameters of the two networks. Following the Bellman equation, the loss function is defined as follows:
L(\theta) = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q\big(s'^{(obs)}, a' \mid \theta^{-}\big) - Q\big(s^{(obs)}, a \mid \theta\big)\right)^{2}\right] \quad (3)
The update of θ is implemented through stochastic gradient descent:
\nabla_{\theta} L(\theta) = \mathbb{E}\!\left[\left(r + \gamma \max_{a'} Q\big(s'^{(obs)}, a' \mid \theta^{-}\big) - Q\big(s^{(obs)}, a \mid \theta\big)\right) \nabla_{\theta} Q\big(s^{(obs)}, a \mid \theta\big)\right] \quad (4)
The structure of the Q_target and Q_valuation networks designed in this paper is shown in Figure 3. The network is divided into two parts: (1) a GRU branch that processes the historical trajectory, taking all previous states s_0, s_1, …, s_t of the sampled time step t as input, passing them through a convolutional structure, and then feeding them into the GRU network; and (2) an LSTM branch that processes the current trajectory, taking the most recent n states s_{t−n}, s_{t−n+1}, …, s_t as input, passing them through a convolutional structure, and then feeding them into the LSTM network. The outputs of the two recurrent branches are concatenated and fed into a fully connected structure, producing the Q-values for each action. The simultaneous consideration of the full historical trajectory and the current trajectory is inspired by the work of [22], which emphasises the importance of jointly understanding historical experience and analysing the current state when global observational conditions are lacking and only partial observations are available.
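A PyTorch sketch of this two-branch structure, using the layer sizes listed in Table 3, is given below. How the full-history and recent-n trajectory windows are batched, and the use of each branch's final hidden state, are assumptions of this sketch rather than details confirmed by the paper.

import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Conv structure from Table 3: maps one state vector to a 32-dim embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))

    def forward(self, x):                       # x: (batch, seq_len, state_dim)
        b, t, d = x.shape
        z = self.net(x.reshape(b * t, 1, d))    # (batch*seq_len, 32, 1)
        return z.reshape(b, t, 32)              # (batch, seq_len, 32)

class POIPPDNet(nn.Module):
    """Two-branch Q-network: GRU over the full history, LSTM over the last n steps."""
    def __init__(self, n_actions):
        super().__init__()
        self.enc = StateEncoder()
        self.gru = nn.GRU(32, 64, batch_first=True)
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, hist_traj, recent_traj):
        _, h_gru = self.gru(self.enc(hist_traj))            # h_gru: (1, batch, 64)
        _, (h_lstm, _) = self.lstm(self.enc(recent_traj))   # h_lstm: (1, batch, 64)
        feat = torch.cat([h_gru[-1], h_lstm[-1]], dim=-1)   # (batch, 128)
        return self.fc(feat)                                # Q-value per action

# Example shapes: 22-dim states, a 12-step history window, and a 3-step recent window.
q = POIPPDNet(n_actions=20)(torch.randn(4, 12, 22), torch.randn(4, 3, 22))
print(q.shape)   # torch.Size([4, 20])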
Because the entire historical trajectory of the sampled state s_t must be retrieved for each sample, the experience replay pool must identify the episode information of each record, which allows all historical trajectories corresponding to a sampled record to be retrieved at the same time. To achieve this, we extend the traditional DQN experience tuple ⟨s^(obs), a, s′^(obs), r⟩ with a Boolean indicator done that represents whether the penetration was successful. The trajectory experience is thus recorded as a quintuple ⟨s^(obs), a, s′^(obs), r, done⟩ and stored in the experience replay pool. The pseudocode for the memory function of the experience replay pool is given in Algorithm 2, the pseudocode for the sampling function in Algorithm 3, and the training algorithm of the PO−IPPD method in Algorithm 4.
Algorithm 2 Memory_Func
Input: Experience Replay Pool M, Memory Pointer ptr, Observation state s^(obs), Action a, Next Observation state s′^(obs), Reward r, Success Indicator done
Output: Experience Replay Pool M
M.s_buf[ptr] = s^(obs)
M.a_buf[ptr] = a
M.s′_buf[ptr] = s′^(obs)
M.r_buf[ptr] = r
M.done_buf[ptr] = done
ptr = (ptr + 1) % M.capacity
return M
Algorithm 3 Sampling_Func
Input: Experience Replay Pool M, Sampling Size batch_size
Output: Observation state sample batch_s^(obs), Action sample batch_a, Next Observation state sample batch_s′^(obs), Reward sample batch_r, Success Indicator sample batch_done
Initialize batch_s^(obs), batch_a, batch_s′^(obs), batch_r, batch_done, s_tr, a_tr, s′_tr, r_tr, d_tr
for batch = 1 to batch_size do
    Idx = Randint(0, M.size)
    while not M.done_buf[Idx] do
        s_tr.append(M.s_buf[Idx])
        a_tr.append(M.a_buf[Idx])
        s′_tr.append(M.s′_buf[Idx])
        r_tr.append(M.r_buf[Idx])
        d_tr.append(M.done_buf[Idx])
        Idx = Idx − 1
    end while
    batch_s^(obs).append(s_tr)
    batch_a.append(a_tr)
    batch_s′^(obs).append(s′_tr)
    batch_r.append(r_tr)
    batch_done.append(d_tr)
end for
return batch_s^(obs), batch_a, batch_s′^(obs), batch_r, batch_done
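The following Python sketch mirrors Algorithms 2 and 3 as an episode-aware replay pool. The list-based circular buffer and the backward walk to the previous episode boundary (a record with done = True) follow the pseudocode; ignoring wrap-around at the buffer boundary is a simplifying assumption of this sketch.

import random

class TrajectoryReplayPool:
    """Replay pool storing quintuples (s_obs, a, s_obs_next, r, done), Algorithms 2-3."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = []
        self.ptr = 0

    def memory(self, s_obs, a, s_obs_next, r, done):       # Algorithm 2 (Memory_Func)
        record = (s_obs, a, s_obs_next, r, done)
        if len(self.buf) < self.capacity:
            self.buf.append(record)
        else:
            self.buf[self.ptr] = record
        self.ptr = (self.ptr + 1) % self.capacity

    def sample(self, batch_size):                          # Algorithm 3 (Sampling_Func)
        batch = []
        for _ in range(batch_size):
            idx = random.randrange(len(self.buf))
            traj = []
            while idx >= 0 and not self.buf[idx][4]:       # walk back to episode boundary
                traj.append(self.buf[idx])
                idx -= 1
            traj.reverse()                                 # restore chronological order
            batch.append(traj)                             # one trajectory per sample
        return batch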
Algorithm 4 PO−IPPD training algorithm
Input: Experience Replay Pool M, Training Episodes N, Exploration Rate ε, Target Network Q_θT, Evaluation Network Q_θV
Output: Target Network Q_θT
Initialize Target Network Q_θT, Evaluation Network Q_θV, s^(obs)_tr
for episode = 1 to N do
    s^(obs) = Initial state
    s^(obs)_tr.append(s^(obs))
    while not done do
        a = ε-Greedy(s^(obs)_tr, Q_θV, ε)
        s′^(obs), r, done = step(a)
        Memory_Func(s^(obs), a, s′^(obs), r, done)
        Sampling_Func and train θV in Evaluation Network Q_θV
        Update θT in Target Network Q_θT from θV in Evaluation Network Q_θV periodically
        s^(obs) = s′^(obs)
    end while
end for
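To make the training step concrete, the sketch below combines the ε-greedy action selection of Algorithm 4 with the loss of Equation (3). Here q_eval and q_target are assumed to be POIPPDNet-style networks as sketched above, and the batching of trajectory windows into tensors is assumed to be handled by the caller; both are assumptions of this sketch, not the paper's implementation.

import torch
import torch.nn.functional as F

def epsilon_greedy(q_eval, hist, recent, n_actions, eps):
    """Epsilon-greedy action selection over the Q-values of the evaluation network."""
    if torch.rand(1).item() < eps:
        return torch.randint(n_actions, (1,)).item()
    with torch.no_grad():
        return q_eval(hist, recent).argmax(dim=1).item()

def train_step(q_eval, q_target, optimizer, batch, gamma=0.9):
    """One gradient step on the loss of Eq. (3); batch tensors are pre-assembled."""
    hist, recent, a, r, hist_next, recent_next, done = batch
    q_sa = q_eval(hist, recent).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_target(hist_next, recent_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next      # zero bootstrap at episode end
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()                                     # stochastic gradient step, Eq. (4)
    optimizer.step()
    return loss.item()

# Periodic target update ("Update theta_T ... periodically" in Algorithm 4):
# q_target.load_state_dict(q_eval.state_dict())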

5. Experiment and Discussion

To validate the performance of our model and decision method, we focus our experiments on answering the following three Research Questions (RQs):
  • RQ1: Does the lack of global observation capability indeed have a negative impact on penetration testing, and does the proposed PO−IPPD method have the ability to mitigate this negative impact?
  • RQ2: Does the proposed PO−IPPD have scalability within different network scales?
  • RQ3: How can we evaluate the impact and contribution of each component in the model on PO−IPPD performance?

5.1. Experiment Environment

The experiments use NASim [10] as a test platform. In the experiments of this paper, the following improvements are made to NASim to adapt it to the analysis of observational locality and observational uncertainty:
  • The addition of a Partial Observation Module to simulate observational locality and observational uncertainty as described in Section 4.1.3 and Algorithm 1.
  • Modifications are made to the experience replay pool, as in Algorithms 2–4, adding trajectory experience records for the PO−IPPD method.
The network and hyperparameter configurations for the scenario in Figure 1 are shown in Table 3. During the training process, each scenario is trained for 10,000 episodes, and average values are recorded every 500 episodes for evaluation (a small sketch of this rolling-average computation is given after the list below). The designed evaluation metrics are as follows:
  • Average Reward (Reward): The total reward value is calculated for each episode to assess the economic cost of penetration, including the sum of all action rewards and action costs. The average is calculated every 500 episodes.
  • Average Steps (Steps): The number of attack steps required for successful penetration in each episode is calculated to assess the time cost of penetration. Similarly, the average is calculated every 500 episodes.
  • Average Reward per Step (Reward/Step): The average reward per step is calculated to assess the average penetration efficiency. It is derived from the average episode reward (Reward) and the average episode steps (Steps).
  • Average Loss (Loss): The value of the loss function (Formula (3)) in each episode is calculated to assess training progress and convergence. Similarly, the average is calculated every 500 episodes.
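A minimal sketch of the 500-episode rolling averages and the derived Reward/Step metric; the array names and the randomly generated example logs are purely illustrative.

import numpy as np

def windowed_averages(per_episode_values, window=500):
    """Average a per-episode log over consecutive windows of `window` episodes."""
    v = np.asarray(per_episode_values, dtype=float)
    n = (len(v) // window) * window
    return v[:n].reshape(-1, window).mean(axis=1)

# Hypothetical logs over 10,000 training episodes.
rewards = np.random.uniform(100, 200, size=10_000)
steps = np.random.randint(8, 30, size=10_000)
avg_reward = windowed_averages(rewards)      # 20 points, one per 500 episodes
avg_steps = windowed_averages(steps)
reward_per_step = avg_reward / avg_steps     # average penetration efficiency (Reward/Step)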

5.2. Functional Effectiveness Experiment for Answering RQ1

The DQN algorithm has been widely used in research on intelligent penetration path decision [23,24]. We adopt two benchmark methods: the DQN algorithm with global observation capability (labelled FO-DQN) and the DQN algorithm with partial observation capability (labelled PO-DQN); global observation capability is the only difference between them. Compared with our PO−IPPD method, FO-DQN lacks the GRU and LSTM structures but has global observation capability, with S^(obs) = S^(act); PO-DQN also lacks the GRU and LSTM structures and is subject to the same observational locality and observational uncertainty as PO−IPPD, with S^(obs) ≠ S^(act). This experiment validates the functional effectiveness of the PO−IPPD method from two aspects:
  • By comparing the performance of FO-DQN and PO-DQN, we can verify whether the lack of global observation capability indeed has a negative impact on penetration testing.
  • By comparing the performance of PO-DQN and PO−IPPD, we can verify whether our enhancements to DQN effectively mitigate the negative impact of partially observable conditions on penetration testing decision.
The experiment is performed using the training parameters set in Section 5.1, under the scenario shown in Figure 1. The experimental results are shown in Figure 4, Figure 5 and Figure 6 for comparison.
Comparing the performance of FO-DQN and PO-DQN in the three figures, it is clear that the lack of global observation capability does indeed have a negative impact on penetration testing. As shown in Figure 4 and Figure 5, FO-DQN with global observation capability converges around 1000 episodes and shows stable training results: the average Reward from 7000 to 10,000 episodes is 185.9 and the average Steps is 8.14, very close to the theoretical optimal values in this scenario (186, 8). In contrast, PO-DQN, which lacks global observation, not only converges more slowly but also shows poorer stability of training results, with significant fluctuations even after 7000 episodes; this is influenced by the observational locality and observational uncertainty settings in the experiment. Furthermore, the average Reward from 7000 to 10,000 episodes for PO-DQN is 172.7 and the average Steps is 16.3, both significantly worse than FO-DQN. The average reward per step (Reward/Step), shown in Figure 6 as a metric of penetration efficiency, indicates that PO-DQN is more than 50% lower than FO-DQN.
The performance of PO-DQN and PO−IPPD is further compared in the three figures. As shown in Figure 4 and Figure 5, PO−IPPD converges at around 2000–2500 episodes. Although convergence is slightly slower than for FO-DQN and PO-DQN, the training effect and result stability are significantly better than PO-DQN and approach FO-DQN with global observation capability. From 7000 to 10,000 episodes, the average Reward of PO−IPPD is 183, which is 1.56% lower than FO-DQN and 6.40% higher than PO-DQN. The average Steps for PO−IPPD is 9.7, which is 19.16% higher than FO-DQN and 40.49% lower than PO-DQN. As shown in Figure 6, the average Reward/Step of PO−IPPD is improved by 75.93% compared with PO-DQN.

5.3. Scalability Experiment for Answering RQ2

The scalability experiment contrasts and analyses the variation in Reward in scenarios with different numbers of hosts and subnets, in order to validate the performance of PO−IPPD at different target network scales. Three target network scenarios are set up, as detailed in Table 4. These scenarios differ only in the number of hosts and subnets; all other parameters and settings follow the scenario shown in Figure 1: two sensitive hosts (each with a reward of 100) and one honeypot host (with a reward of −100) are placed on different subnets. Scenario 1 and Scenario 2 have the same number of subnets but different numbers of hosts, to compare the effect of the number of hosts on PO−IPPD. Scenario 2 and Scenario 3 have the same number of hosts but different numbers of subnets, to compare the effect of the number of subnets on PO−IPPD.
Figure 7 illustrates the variation in the average Reward with training episodes for PO−IPPD in three comparative scenarios. It can be observed that increasing the number of hosts and subnets not only leads to significant differences in the initial Reward, but also results in a decrease in the convergence speed. This is related to the number of potential penetration paths that need to be explored during the training process. An increase in the number of hosts and subnets implies an increase in the number of potential penetration paths. Nevertheless, all three scenarios show relatively good convergence near the theoretical optimum of 200, indicating the robustness of PO−IPPD.

5.4. Ablation Experiment for Answering RQ3

To compare and analyse the contribution and impact of each major component of the PO−IPPD model on performance, as well as the influence of network scale on the training process and convergence, three models are set up for each scenario: PO−IPPD with the GRU removed (labelled LSTM+DQN), PO−IPPD with the LSTM removed (labelled GRU+DQN), and the complete PO−IPPD model (labelled LSTM+GRU+DQN).
The results of the ablation experiments are shown in Figure 8. As shown in Figure 8a, when the network scale is relatively small, all three models converge well within 2000–2500 episodes and remain relatively stable. Figure 8b shows that as the network scale increases, more episodes are required to train the models, and differences in performance among the three models gradually emerge. GRU+DQN converges faster than the other two models, but its stability after convergence is worse and its converged loss value is higher. LSTM+DQN converges more slowly and fluctuates during convergence; its convergence range is better than that of GRU+DQN, but it is still less stable. Although PO−IPPD requires more training (about 4000 episodes) than in Scenario 1, its convergence speed and stability are better than those of GRU+DQN and LSTM+DQN. In Figure 8c the differences between the three models are even more obvious: in terms of convergence speed, GRU+DQN is the best, PO−IPPD second, and LSTM+DQN the worst; in terms of convergence range and stability, PO−IPPD is the best, LSTM+DQN second, and GRU+DQN the worst.
This analysis shows that, for the partially observable penetration testing studied in this paper, GRU has an advantage in training speed due to its simpler structure and can capture the dependency patterns in small-scale networks more quickly. In contrast, LSTM has a relatively complex structure with finer control over information processing and memory; although its training cost is higher, it has advantages in convergence range and convergence stability. The proposed PO−IPPD method uses GRU and LSTM to process different types of historical data: although its training speed lies between those of GRU and LSTM, its convergence range and convergence stability are improved.

6. Conclusions

Adversarial relationships and the “fog of war” make partially observable conditions an unavoidable challenge in the study of automated penetration testing. With this in mind, we formalise the penetration testing process as a partially observable Markov decision process, enabling the exploration and study of observational locality and observational uncertainty in adversarial scenarios, and we experimentally validate the impact of partially observable conditions on penetration testing. On this basis, an intelligent penetration testing decision method (PO−IPPD) that uses historical experience to compensate for the uncertainty of current information is proposed, and its functionality and scalability are experimentally validated. Future research directions include uncertainty problems in security validation, for example penetration behaviour trajectory imputation, prediction, and intent analysis based on partial observations.

Author Contributions

Conceptualisation, X.L. and Y.Z.; methodology, X.L. and Y.Z.; software, X.L. and W.L.; validation, W.L. and W.G.; writing—original draft preparation, X.L.; writing—review and editing, Y.Z. and W.G.; visualisation, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alhamed, M.; Rahman, M.M.H. A Systematic Literature Review on Penetration Testing in Networks: Future Research Directions. Appl. Sci. 2023, 13, 6986. [Google Scholar] [CrossRef]
  2. Greco, C.; Fortino, G.; Crispo, B.; Choo, K.-K.R. AI-enabled IoT penetration testing: State-of-the-art and research challenges. Enterp. Inf. Syst. 2023, 17, 2130014. [Google Scholar] [CrossRef]
  3. Ghanem, M.C.; Chen, T.M.; Nepomuceno, E.G. Hierarchical reinforcement learning for efficient and effective automated penetration testing of large networks. J. Intell. Inf. Syst. 2023, 60, 281–303. [Google Scholar] [CrossRef]
  4. Zhang, Y.; Zhou, T.; Zhu, J.; Wang, Q. Domain-Independent Intelligent Planning Technology and Its Application to Automated Penetration Testing Oriented Attack Path Discovery. J. Electron. Inf. Technol. 2020, 42, 2095–2107. [Google Scholar]
  5. Bellman, R. A Markovian decision process. J. Math. Mech. 1957, 6, 679–684. [Google Scholar] [CrossRef]
  6. Spaan, M.T.J. Partially observable Markov decision processes. In Reinforcement Learning: State-of-the-Art; Springer: Berlin/Heidelberg, Germany, 2012; pp. 387–414. [Google Scholar]
  7. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  8. Ghojogh, B.; Ghodsi, A. Recurrent neural networks and long short-term memory networks: Tutorial and survey. arXiv 2023, arXiv:2304.11461. [Google Scholar]
  9. Wang, Y.; Li, Y.; Xiong, X.; Zhang, J. DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data. Secur. Commun. Netw. 2023, 2023, 5834434. [Google Scholar] [CrossRef]
  10. Schwartz, J.; Kurniawati, H. Autonomous penetration testing using reinforcement learning. arXiv 2019, arXiv:1905.05965. [Google Scholar]
  11. Zhou, S.; Liu, J.; Zhong, X.; Lu, C. Intelligent Penetration Testing Path Discovery Based on Deep Reinforcement Learning. Comput. Sci. 2021, 48, 40–46. [Google Scholar]
  12. Nguyen, H.V.; Teerakanok, S.; Inomata, A.; Uehara, T. The Proposal of Double Agent Architecture using Actor-critic Algorithm for Penetration Testing. In Proceedings of the 7th International Conference on Information Systems Security and Privacy, Vienna, Austria, 11–13 February 2021; pp. 440–449. [Google Scholar]
  13. Zeng, Q.; Zhang, G.; Xing, C.; Song, L. Intelligent Attack Path Discovery Based on Hierarchical Reinforcement Learning. Comput. Sci. 2023, 50, 308–316. [Google Scholar]
  14. Schwartz, J.; Kurniawati, H.; El-Mahassni, E. POMDP + information-decay: Incorporating defender’s behaviour in autonomous penetration testing. In Proceedings of the International Conference on Automated Planning and Scheduling, Nancy, France, 14–19 June 2020; Volume 30, pp. 235–243. [Google Scholar]
  15. Sarraute, C.; Buffet, O.; Hoffmann, J. Penetration testing == POMDP solving? arXiv 2013, arXiv:1306.4714. [Google Scholar]
  16. Shmaryahu, D.; Shani, G.; Hoffmann, J.; Steinmetz, M. Partially observable contingent planning for penetration testing. In Proceedings of the IWAISe: First International Workshop on Artificial Intelligence in Security, Melbourne, Australia, 20 August 2017; Volume 33. [Google Scholar]
  17. Mairh, A.; Barik, D.; Verma, K.; Jena, D. Honeypot in network security: A survey. In Proceedings of the 2011 International Conference on Communication, Computing & Security, Odisha, India, 12–14 February 2011; pp. 600–605. [Google Scholar]
  18. Common Vulnerabilities and Exposures. Available online: https://cve.mitre.org/ (accessed on 28 November 2024).
  19. Common Attack Pattern Enumeration and Classification. Available online: https://capec.mitre.org/ (accessed on 28 November 2024).
  20. Common Vulnerability Scoring System. Available online: https://www.first.org/cvss/v4.0/specification-document (accessed on 28 November 2024).
  21. Backes, M.; Hoffmann, J.; Kunnemann, R.; Speicher, P.; Steinmetz, M. Simulated Penetration Testing and Mitigation Analysis. arXiv 2017, arXiv:1705.05088. [Google Scholar]
  22. Rabinowitz, N.; Perbet, F.; Song, F.; Zhang, C.; Eslami, S.M.A.; Botvinick, M. Machine theory of mind. In Proceedings of the International Conference on Machine Learning. PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4218–4227. [Google Scholar]
  23. Chowdhary, A.; Huang, D.; Mahendran, J.S.; Romo, D.; Deng, Y.; Sabur, A. Autonomous security analysis and penetration testing. In Proceedings of the 2020 16th International Conference on Mobility, Sensing and Networking (MSN), Tokyo, Japan, 17–19 December 2020; pp. 508–515. [Google Scholar]
  24. Zhou, S.; Liu, J.; Hou, D.; Zhong, X.; Zhang, Y. Autonomous penetration testing based on improved deep q-network. Appl. Sci. 2021, 11, 8823. [Google Scholar] [CrossRef]
Figure 1. Example of penetration testing.
Figure 2. PO−IPPD framework.
Figure 3. PO−IPPD network structure.
Figure 4. Variation in Reward with training.
Figure 5. Variation in Steps with training.
Figure 6. Variation in Reward/Step with training.
Figure 7. Variation in Reward in comparative scenarios.
Figure 8. Variation in Loss in comparative scenarios.
Table 1. Host configuration in Figure 1.

Host         | OS      | Service   | Process
(1,0)        | Linux   | HTTP      | —
(2,0), (4,0) | Linux   | SSH, FTP  | Tomcat
(3,0), (3,3) | Windows | FTP       | —
(3,1), (3,2) | Windows | FTP, HTTP | Daclsvc
(3,4)        | Windows | FTP       | Daclsvc
Table 2. Attack capability configuration in Figure 1.

Type | Name      | Prerequisites (OS / Service / Process) | Results (Prob / Privileges)
VE   | E_SSH     | Linux / SSH / —                        | 0.9 / User
VE   | E_FTP     | Windows / FTP / —                      | 0.6 / User
VE   | E_HTTP    | — / HTTP / —                           | 0.9 / User
PE   | P_Tomcat  | Linux / — / Tomcat                     | 1 / Root
PE   | P_Daclsvc | Windows / — / Daclsvc                  | 1 / Root
Table 3. Basic experiment configurations.

Type            | Settings
Conv Structure  | CONV1: Kernel_size=3; Stride=1; Outchannels=16
                | CONV2: Kernel_size=3; Stride=1; Outchannels=32
                | AdaptiveAvgPool1d: output_size=1
GRU             | Input=32; Hidden_size=64
LSTM            | Input=32; Hidden_size=64
FC Structure    | FC1: In_features=128, Out_features=64
                | FC2: In_features=64, Out_features=action_space
Learn Rate      | 0.0001
Batch_size      | 64
M.size          | 100,000
Discount factor | 0.9
Method capacity | 4-dimension subnet field; 5-dimension host field
Table 4. Scalability experiment scenarios.

Scenarios  | Hosts | Subnets | Reward
Scenario 1 | 8     | 4       | 100 × 2
Scenario 2 | 16    | 4       | 100 × 2
Scenario 3 | 16    | 8       | 100 × 2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
