Network Defense Strategy Selection with Reinforcement Learning and Pareto Optimization

Abstract: Improving network security is a difficult problem that requires balancing several goals, such as defense cost and the need for network efficiency, in order to achieve proper results. In this paper, we devise a method of modeling a network attack as a zero-sum multi-objective game and attempt to find the best defense against such an attack. We combine Pareto optimization and Q-learning methods to determine the most harmful attacks and, consequently, to find the best defense against those attacks. The results should help network administrators in search of a hands-on method of improving network security.


Introduction
In an increasingly connected world, network security is a pressing issue facing many organizations. Modern networks consist of a myriad of devices sharing data through a variety of services and each with a private data store that they wish to keep safe [1,2]. On top of this, many devices are connected to other networks or the Internet, exposing them to threats from outside attackers, including compromising network availability, damaging devices, and stealing data.
Given the complexity of the network and the range of possible damages, many network administrators have turned to game theory to model network vulnerabilities and determine an optimal defense strategy [3]. Game models allow us to predict, to a degree, the results of defensive actions, which offers an enormous advantage over simple heuristics. Modification of the model allows for a quick response to new attacker strategies. Filtering network security decisions through a game-based model helps to allocate limited resources, calculate risks, and automate network defense [4].
Xiao et al. [5] surveyed the ongoing usage of game theory in cyber security and identified a variety of applications, each adapted to different security scenarios. Consistent in their approach is the idea of modeling network defense as a two-player game between an attacker and a defender. Kordy proved that attack trees can be extended to include defender actions [6], creating attack-defense trees, which are essentially a representation of actions in a two-player zero-sum game [7].
Given the advantages of game modeling, the security industry would benefit from research in determining an optimal defense strategy in network security games. This article attempts to find an optimal defense strategy for a two-player zero-sum game. We use an information-incomplete model with simultaneous game play and multiple objectives. This model is a more realistic representation of actual network security, as the defender needs to balance different goals during an attack. Specifically, we attempt to automate attacker and defender behavior and determine a rapid method to find an optimal defense strategy.
Current applications of game theory models tend to focus on finding the Nash equilibrium for single-objective symmetric games. Our proposal offers a major improvement by using multi-objective reinforcement learning to ensure continuous improvement. This approach, while not unique, is scarcely researched and has not been applied to network security games. Examples outside of network security include Kamis and Gomaa implementing a multi-objective learning algorithm for traffic signal control [8] and van Moffaert [9] describing methods for scalarized reinforcement learning. We believe that there is a clear need for this approach in network security games.
We further improve the basic Q-learning approach by combining it with a multi-objective-specific optimization algorithm, namely Pareto optimization. This method has been used in network games by Eisenstadt [10] to show that Pareto optimization can be used to remove non-optimal strategies. The advantages of the Pareto approach are not undisputed, as Ikeda et al. [11] doubt the effectiveness of using Pareto domination to find optimal solutions. This article shows that using Pareto optimization actually improves the reinforcement learning approach to find an optimal defense strategy for a network security game.
There is a plethora of research addressing issues in the context of networking using game-theoretic models: bandwidth allocation [12][13][14][15], flow control [16][17][18], price modeling [19,20], and routing [21,22]. Methodological analysis of decision-making in the context of security has paved the way for using game theory to model the interactions of agents in security problems [23,24]. Game theory provides mathematical tools for the decision-maker to choose optimal strategies from an enormous search space of what-if scenarios [25]. There has been a growing body of research on game-theoretic approaches to network security problems [24].
Roy et al. [26] and Manshaei et al. [4] have conducted comprehensive surveys of the field showing that game theory serves as an excellent tool for modeling security in computer networks, since securing a computer network involves an administrator trying to defend the network while an attacker tries to harm the network.
Roy categorized game-theoretic network security models into six groups [26]. They are (1) the perfect information game: all the players are aware of the events that have already taken place; (2) the complete information game: all the players know the strategies and payoffs of the other players; (3) the Bayesian game: the strategies and payoffs of others are unknown, and Bayesian probability is used to predict them; (4) the static game: a one-shot game where all of the players play at once; (5) the dynamic game: a game with more than one stage in which players consider the subsequent stages; and (6) the stochastic game: a game involving probabilistic transitions between stages.
Sallhammar [27] has proposed a stochastic approach to study vulnerabilities and a system's security and dependability in certain environments. The model provides real-time assessment. Unlike ours, the proposed model is more inclined toward security risk assessment, dependability, and trustworthiness of the system. In addition, their model assumes that vulnerabilities are unknown, whereas in our work the possible attack and defense scenarios are known to both agents.
Liu et al. [28] analyzed the interaction between attacking and defending nodes in ad-hoc wireless networks using a Bayesian game approach. The proposed hybrid Bayesian-approach Intrusion Detection System (IDS) saves significant energy while reducing the damage inflicted by the attacker. In this model, the stage games are considered Bayesian static, since the defender's belief update requires constant observation of the attacker's action at each stage. Karim et al. [29] have proposed a similar model that uses a collaborative method for IDS. The dynamic nature of this multi-stage game is similar to our model. In contrast to their model, we analyze the best defense strategy for a defender in a heterogeneous network.
In this paper, we study the behavior of an agent (the defender) who tries to maximize three goals: the cost difference, the number of uncompromised data nodes, and network availability. The agents iteratively play the game, and stage transitions use a probability function. Thus, in this article, we focus on a stochastic game with multiple objectives. In multi-objective mathematical programming there is no single optimal solution [30]. Since all objective functions are simultaneously optimized, the decision maker searches for the most preferable solution, instead of a single optimal solution [31].

Pareto Optimization
The administrator has a limited amount of resources, and achieving complete security is unattainable [32,33]. Thus, a realistic approach is to find an optimal strategy that maximizes the performance of the network with a minimum cost despite known vulnerabilities [34][35][36][37][38][39][40][41][42][43][44][45][46]. Similarly, the attacker has a limited amount of resources. The knowledge of possible actions for both players allows each participant to discover the other's potential actions and their best counter-actions [37]. Thus, both participants are forced to maximize their utility from multiple goal perspectives [38]. Pareto optimality is a state of resource allocation that addresses the issue of the cost/benefit trade-off. In a Pareto optimal state, no goal can be further improved without making the other goals worse off.
Gueye et al. [32] have proposed a model for evaluating vulnerabilities by quantifying the corresponding Pareto optima using a supply-demand flow model. The derived vulnerability-to-attack metric reflects the cost of a failed link, as well as the attacker's willingness to attack a link.
The Pareto front separates the feasible region from the infeasible region [39]. The optimal operating point is computed by considering a network utility function. In this model, the vulnerability-to-attack metric depends on network topology, which makes it unsuitable for fully connected networks [40].
Guzmán and Bula proposed a bio-inspired metaheuristic hybrid model that addresses the bi-objective network design problem. It combines the Improved Strength Pareto Evolutionary Algorithm (SPEA2) and the Bacterial Chemotaxis Multi-objective Optimization Algorithm (BCMOA) [41]. Its Pareto-optimal point approximations are faster than those of SPEA2 and Binary-BCMOA; however, this method is not applicable to our model, since we include more than two objectives.
In recent years, multi-objective reinforcement learning has emerged as a growing area of research. Fang et al. [42] used reinforcement learning to enhance the security scheme of cognitive radio networks by helping secondary users to learn and detect malicious attacks and nodes. The nodes then adapt and dynamically reconfigure their operating parameters.

Reinforcement Learning
Reinforcement learning (RL) is a machine learning technique used by software agents to determine a reward-maximizing strategy for a given environment [43]. Attack-defense games that use RL provide users with the capability to analyze hundreds of possible attack scenarios and methods for finding optimal defense strategies with predicted outcomes [26]. The technique is derived from behavioral psychology, where agents are encouraged to repeat past beneficial behavior and discouraged from repeating harmful behavior [44,45]. This article uses a Q-learning implementation of reinforcement learning. As such, it is best to describe the technique using the specifics of Q-learning.
Lye et al. [37] proposed a two-player stochastic, general-sum game with a limited set of attack scenarios. The game model consists of a seven-tuple of network states, actions, a state transition function, a reward function, and a discount factor, resulting in roughly 16 million possible states. However, their work only considered 18 network states. The game topology consists of only four nodes with four links. This model is inefficient if a large number of nodes is introduced to the network. One of the key disadvantages of the model is that it uses the full state space, making it inefficient. The model has some resemblance to ours, as we employ reinforcement learning as the state transition probability calculation method. In this paper, we address a similar situation with a much larger set of network nodes.
RL has the problem of a combinatorially explosive state space [46]. In order to learn Q-values, each agent (attacker or defender) is required to store Q-values in a table. Each entry in the Q-table represents a state reward for an action by one agent (the attacker) paired with a corresponding action of the other agent (the defender). Thus, the action space is exponential in the number of agents. In RL, the agent learns by trial and error, by repeatedly going through the stages. The iterative process of taking an action and obtaining a reward allows the agent to build a perception of the dynamic environment.
Several methods have been proposed to reduce the size of the state space, and most methods involve scalarizing the reward function and learning with the resulting single-objective function [47]. Girgin et al. [48] have proposed a method to address the issue by constructing a tree to locate common action sequences of possible states for possible optimal policies. Subtasks that have almost identical solutions are considered similar. The repeated smaller subtasks that have a hierarchical relationship among them are combined into a common action sequence. The intuition behind this method is that, in most realistic and complex domains, the task that an agent tries to solve is composed of various subtasks and has a hierarchical structure formed by the relations between them [49,50], and the tree structure with common action sequences prevents agents from going through inefficient repetition.
Tuyls et al. [51] used a decision tree to overcome the problem. The tree is created online, allowing the agent to shift attention to other areas of the state space. Nowe and Verbeek's approach to limiting the state space is to force the agents to only model agents relevant to themselves. Inspired by biological systems, Defaweux et al. [52] have proposed a similar system to address the product space problem, in which they only take a niche combination of agents that have a relevant impact on the reward function. Those that have little impact on the reward function are discarded.
Another method to resolve the issue of the combinatorial explosion of the state space is generalization. The goal is to have a smaller number of states (a subset), which is used to approximate a much larger set of states. This method speeds up learning and reduces the storage space required for lookup tables. However, the quality of this method depends on the accuracy of the function approximation, which takes examples from target function mappings and attempts to generalize to the entire function [53].
In our proposed model, we address the issue of a large state space by using Pareto optimization. The Pareto sets (Pareto fronts) contain all the optimal solutions for a given defense action. Instead of taking all the actions as inputs for the Q-learning function, we take a subset of the actions (the Pareto front of the player's actions), improving the speed of learning. The sets of dominated actions are eliminated, leaving only the optimal Pareto fronts. The purpose is to let agents learn from the superior set of actions and make the learning process faster.

Simple Single Objective Q-Learning
Q-learning is a model-free approach to find an optimal state-action policy for a given Markov process, meaning that if the system finds itself in a specific state, the conditional probability distribution of future states is independent of the sequence of events or states that preceded it. Q-learning uses dynamic programming to allow a software agent to repeatedly examine specific state-action pairs and determine if they are beneficial, and thus worth repeating, or harmful, and thus should be avoided [43].
Specifically, we assume that a software agent is moving through a finite state space and can, in a specific state, perform one of a finite set of associated actions, the result of which may or may not be a state change. We also assume that each state-action pair, i.e., the action performed by the agent while in the specific state, has an associated reward. Finally, we assume that the probability that a specific state-action pair results in a given state change is independent of the agent's previous actions or states, i.e., the agent's movement is a Markov process. Given these assumptions, it is possible to use Q-learning to determine a movement policy that maximizes the software agent's cumulative reward.
Defining the environment described in the previous paragraph means that at a given time t, our agent will be in state x_t ∈ S and be able to perform action a_t ∈ A, which should result in a reward r_t and a probability of moving to state y ∈ S. The probability that the environment at time t + 1 will have changed to state y is defined as:

P[x_{t+1} = y | x_t = x, a_t = a] = P_{x,y}(a)    (1)

Note that because this is a Markov process, the probabilities are time independent. The probability that action a at state x will result in state y is the same at t = 0 and at t = 1000.
The action that the agent will take is determined by its policy π, such that:

a_t = π(x_t)

Note that, just like the probabilities in Equation (1), the policy is also time independent. Rewards are equally time independent, meaning we can rewrite the reward at time t as a function of the state and the policy-dictated action:

r_t = R_{x_t}(π(x_t))

The goal of the Q-learning algorithm is to determine an optimal policy π* that will result in a maximum value for ∑_t r_t. For this we need to determine both the value of the current reward and the potential future rewards. We define the combination of those two as the value of a state:

V^π(x) = R_x(π(x)) + γ ∑_{y∈S} P_{x,y}(π(x)) V^π(y)

Here, V^π(x) is the value of state x when using policy π, and γ is a discount factor, used to exponentially discount the value of future states. We can now use dynamic programming to determine the optimal policy by finding the action that maximizes the state value at each step:

V*(x) = max_a [ R_x(a) + γ ∑_{y∈S} P_{x,y}(a) V*(y) ]

The optimal policy π* says that at state x, the agent should perform the action a that results in the maximum possible state value V*(x), assuming that the agent continues with policy π*. An example of a Markov process with policy-specific state values is given in Figure 1. If the values for γ, R_x(a), and P_{x,y}(a) are known, this can easily be implemented as a simple iterative function.
This iterative approach requires a Q-value to compare single-action changes to the policy. The Q-value for an action a given a specific state x is defined as:

Q(x, a) = R_x(a) + γ ∑_{y∈S} P_{x,y}(a) V*(y)

By iteratively updating the Q-value for each state-action pair, we can slowly determine the best action for each state, giving us:

V*(x) = max_a Q(x, a)

As previously described, updating the Q-values requires a simple iterative function. Assume that all state-action pairs have a pre-defined initial Q-value Q(x_0, a_0). The specific steps for an agent using Q-learning are visible in Figure 2 and defined as:
- Determine the current state x_t;
- Choose an action a_t, either by exploring (with a probability of ε) or exploiting (with a probability of 1 − ε) the current Q-values;
- Receive a reward r_t for having performed the action;
- Observe the resulting state x_{t+1};
- Adjust the Q-value using a learning rate α, according to:

Q(x_t, a_t) ← (1 − α) Q(x_t, a_t) + α (r_t + γ max_a Q(x_{t+1}, a))

Choosing an action by exploration just means randomly selecting one of the possible actions. Choosing an action by exploitation means selecting the action with the highest Q-value [54].
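For illustration, the steps above can be sketched as a small tabular Q-learning loop. This is a minimal single-objective sketch; the `actions` and `step` functions stand in for an arbitrary environment and are not part of our game model:

```python
import random

def q_learning(states, actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.2):
    """Tabular Q-learning with epsilon-greedy exploration.

    states:  list of states; actions(x): valid actions in state x;
    step(x, a): returns (reward, next_state)."""
    Q = {(x, a): 0.0 for x in states for a in actions(x)}
    for _ in range(episodes):
        x = random.choice(states)              # start from a random state
        for _ in range(100):                   # bounded episode length
            acts = actions(x)
            if random.random() < epsilon:      # explore: random action
                a = random.choice(acts)
            else:                              # exploit: highest Q-value
                a = max(acts, key=lambda act: Q[(x, act)])
            r, y = step(x, a)
            best_next = max(Q[(y, b)] for b in actions(y))
            # Q(x_t, a_t) <- (1 - alpha) Q(x_t, a_t) + alpha (r_t + gamma max_a Q(x_{t+1}, a))
            Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * best_next)
            x = y
    return Q
```

On a toy two-state environment where only one action yields a reward, the learned Q-values quickly favor that action.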
The Q-learning process can, of course, be expanded to multiple dimensions [55]. The primary change is that a state-action pair results in a reward vector instead of a scalar reward. The second change is that the maximization function for Q(x_{t+1}, a) becomes a multi-objective optimization problem. A solution may be optimal in one dimension, but not in another. We can use the previously described Pareto optimization approach to determine the optimal solution for this problem.

Pareto Optimization
In this article, we intend to combine Q-learning with Pareto optimization. The goal of Pareto optimization is to remove solutions that are objectively inferior from the set of possible solutions. A solution taken from the remaining set is more likely to be the optimal solution for the optimization problem. An objectively inferior solution is called a dominated solution.
A reward R_x(a) dominating another reward R_x(a′) means that R_x(a) is superior or equal to R_x(a′) in every dimension of the solution space, and there exists at least one dimension in which R_x(a) is better than R_x(a′). In the context of a maximization function, superior means greater than or equal to, while, for a minimization function, superior means less than or equal to [56].
Pareto dominance for minimization is defined as:

R_x(a) ≻ R_x(a′) ⟺ ∀i: R_x(a)_i ≤ R_x(a′)_i ∧ ∃j: R_x(a)_j < R_x(a′)_j

Pareto dominance for maximization is defined as:

R_x(a) ≻ R_x(a′) ⟺ ∀i: R_x(a)_i ≥ R_x(a′)_i ∧ ∃j: R_x(a)_j > R_x(a′)_j

Pareto set dominance describes a situation where, for a set X and a set Y, X is said to dominate Y if and only if X contains a vector x that dominates all vectors in set Y. Pareto set dominance for minimization is defined as:

X ≻ Y ⟺ ∃x ∈ X: ∀y ∈ Y: x ≻ y
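The dominance relations above translate directly into code. The following is a minimal sketch; the function names are illustrative and not part of our implementation:

```python
def dominates(u, v, maximize=True):
    """True if reward vector u Pareto-dominates v: u is at least as good
    as v in every dimension and strictly better in at least one."""
    better = (lambda a, b: a >= b) if maximize else (lambda a, b: a <= b)
    strictly = (lambda a, b: a > b) if maximize else (lambda a, b: a < b)
    return (all(better(a, b) for a, b in zip(u, v)) and
            any(strictly(a, b) for a, b in zip(u, v)))

def pareto_front(rewards, maximize=True):
    """Keep only the non-dominated reward vectors (the Pareto front)."""
    return [u for u in rewards
            if not any(dominates(v, u, maximize) for v in rewards)]
```

For example, for maximization, `pareto_front([(3, 1), (2, 2), (1, 1)])` keeps (3, 1) and (2, 2) and discards the dominated (1, 1).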

Pareto Optimization in Q-Learning
Pareto optimization is a good method to improve the basic Q-learning approach. Standard Q-learning obliges the software agent to explore an action and estimate the value of its future states before determining whether or not the action is worth performing. This means that the Q-learning approach requires extensive exploration before providing a useful policy [57].
As such, Q-learning would benefit from a mechanism that objectively removes inferior actions from the set of potential state-action pairs. Pareto optimization is well suited to provide just that. By identifying and removing dominated actions before the Q-learning algorithm chooses its action, we should speed up the rate at which the Q-learning approach arrives at an optimal solution.
We also extend the basic Q-learning approach by allowing multiple player actions at the same time. This means that each player will run a separate software agent, optimizing for that player's actions. Pareto optimization already allows for multiple players moving at the same time. The combined Pareto-Q-learning approach is shown in Figure 3.
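As a sketch of how the combination works, dominated actions can be filtered out before the exploration/exploitation choice. The reward and Q-value accessors below are illustrative assumptions (a scalarized Q-value is assumed for the exploitation step), not our actual implementation:

```python
import random

def dominates(u, v):
    """Maximization: u Pareto-dominates v component-wise."""
    return (all(a >= b for a, b in zip(u, v)) and
            any(a > b for a, b in zip(u, v)))

def pareto_actions(actions, reward_of):
    """Keep only actions whose reward vectors are not Pareto-dominated."""
    return [a for a in actions
            if not any(dominates(reward_of(b), reward_of(a)) for b in actions)]

def choose_action(actions, reward_of, q_of, epsilon=0.2):
    """Restrict the Q-learning choice to the Pareto front of the action set."""
    front = pareto_actions(actions, reward_of)
    if random.random() < epsilon:
        return random.choice(front)   # explore, but only within the front
    return max(front, key=q_of)       # exploit the best (scalar) Q-value
```

A dominated action is never presented to the Q-learning step, which is the intended speed-up: the agent only spends exploration effort on the superior set of actions.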


Game Definition
The network attacker and defender interaction can be modelled as an information-incomplete two-player zero-sum game. Using such a model, we will teach our software agent to determine the optimal action when confronted with an actual network attack [37]. The specific network interaction we wish to model is described as the following game:

Game = <G, S, A, D, P, R, γ>

Here, G is the network; S is the set of all possible states the game can be in; A is the set of possible actions that an attacker can perform on a node; D is the set of possible actions that a defender can perform on a node; P is the state transition function; R is the reward function defined in Equation (3); and γ is a discount factor. RV refers to the solution space for the reward function.
A network G consists of a set of nodes connected by edges. Each node represents a hardware device, which runs several services, can be infected by a virus, and contains data. The value/weight of the services is given by β, the value of keeping the node virus-free is υ, and the value of the data on the node is ω. These values can differ per node. An edge represents the connection between two nodes. If two nodes are running the same service, they will use a connecting edge to communicate.
A game state s_i ∈ S is a combination of the network state and each player's resource points. The network state is the state of all the nodes, i.e., which services are running on each node, whether it is infected by a virus, and whether or not its data has been stolen, together with the state of all the edges, identified by service availability. Resource points are points that players can spend to perform an action.
The players can only perform actions on the nodes, so the network state is simply the state of the nodes. Let n_i = <b, v, w> be the state of node i. In the node state, b is a vector showing which services are running on the node (1 = running and 0 = down), v indicates whether or not the node is virus-free (1 = virus-free and 0 = infected), and w indicates whether or not the data is secure (1 = secure and 0 = data is stolen).
In this game, we introduce the concept of resource points. From existing research, we know it is common to assume that each action taken by a player has an associated cost. To keep track of costs, we start a game by giving each player a fixed set of resource points. The resource points tell us if a player can afford to take a certain action, e.g., if the cost of an action exceeds the available resource points, the player cannot take that action. If a player runs out of resource points, the game ends. Let p be a vector containing the players' resource points: p ∈ ℕ².
The final definition of a game state is <n, p> ∈ S. Figure 4 shows a game state with compromised services in n_1, n_2, and n_3 and a virus installed in n_2.
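The state definitions above can be captured in a small data structure. The following sketch uses illustrative field names and an arbitrary two-service, three-node network; it is not the encoding used in our experiments:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Node:
    b: List[int]   # services on this node: 1 = running, 0 = down
    v: int = 1     # 1 = virus-free, 0 = infected
    w: int = 1     # 1 = data secure, 0 = data stolen

@dataclass
class GameState:
    nodes: List[Node]     # the network state n
    p: Tuple[int, int]    # resource points (attacker, defender)

# A hypothetical state: one service down on each node, a virus on n_2,
# and 40 resource points per player (all values chosen for illustration).
state = GameState(
    nodes=[Node(b=[0, 1]), Node(b=[0, 1], v=0), Node(b=[0, 1])],
    p=(40, 40),
)
```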


Game Rules
This game uses the following actions on a node for the attacker and defender (the cost of each action is shown in parentheses):

A = {compromise_service (6), install_virus (12), steal_data (8), nothing (0)}
D = {restore_service (12), remove_virus (12), nothing (0)}

A player can perform any of the above actions on a single node. The actual action sets defined in Equations (16) and (17) are a subset of the combinations of one of the above actions and the nodes in the network, assuming that n is the set of node states as defined in Equation (13). The subset contains all the valid actions for the current state.
The game is played as follows: at a given time t, the game is in state s_t ∈ S. If all the data is stolen, the game ends. If not, check the players' resource points. If one of the players has enough resource points to do anything other than nothing, they can perform an action. The game ends as soon as one of the following ending conditions is met:
- The attacker brings down all the services (network availability is zero);
- The attacker manages to steal all the data or infect all the nodes;
- Either the attacker or the defender can no longer make a move.

Performing an action:
The attacker chooses an action from A_s and performs it; then, the defender chooses an action from D_s and performs it. The game determines the next state s_{t+1} according to the transition function P. The rewards for that state transition are calculated, and the goals are updated with the rewards. The game transitions to state s_{t+1} and time t + 1. See Figure 5 for a state transition.
Transitioning to a new state: transitioning to a new state involves three things: (1) modify the network state according to the performed actions; (2) subtract the cost of the performed action from the corresponding player's resource points; (3) subtract one resource point for each compromised service and one resource point for each installed virus from the attacker's resource points; this is a cost for keeping the nodes compromised.
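The three transition steps can be sketched as follows. The dictionary-based node encoding and the action triples are assumptions made for illustration, not our actual implementation:

```python
COST = {'compromise_service': 6, 'install_virus': 12, 'steal_data': 8,
        'restore_service': 12, 'remove_virus': 12, 'nothing': 0}

def transition(nodes, points, attack, defend):
    """Apply one round of actions and return the next network state.

    nodes:  list of dicts {'b': [1, 0, ...], 'v': 0/1, 'w': 0/1}
    points: [attacker_points, defender_points]
    attack/defend: (action_name, node_index, service_index)"""
    for player, (name, i, s) in ((0, attack), (1, defend)):
        points[player] -= COST[name]          # (2) pay the action cost
        if name == 'nothing':
            continue
        node = nodes[i]                       # (1) modify the network state
        if name == 'compromise_service':
            node['b'][s] = 0
        elif name == 'install_virus':
            node['v'] = 0
        elif name == 'steal_data':
            node['w'] = 0
        elif name == 'restore_service':
            node['b'][s] = 1
        elif name == 'remove_virus':
            node['v'] = 1
    # (3) upkeep: one attacker point per compromised service and per virus
    upkeep = (sum(n['b'].count(0) for n in nodes) +
              sum(1 for n in nodes if n['v'] == 0))
    points[0] -= upkeep
    return nodes, points
```

For instance, compromising a service costs the attacker its action cost (6) plus one upkeep point for the now-compromised service.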
Valid actions:

- compromise_service: an attacker can compromise any service that is not yet compromised.
- install_virus: an attacker can install a virus on any node that is currently virus-free.
- steal_data: an attacker can steal data from any node that has a virus installed.
- restore_service: a defender can restore any service that is compromised.
- remove_virus: a defender can remove a virus from any node that has a virus installed.
- nothing: either the attacker or the defender can choose to do nothing.
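The validity rules above amount to a filter over (action, node) pairs. The sketch below assumes a hypothetical state layout in which each node carries boolean flags; the paper's actual integer-array encoding differs:

```python
def valid_actions(state, player):
    """Enumerate valid (action, node) pairs for a player.

    `state` is assumed to map each node id to a dict with boolean flags
    `compromised`, `virus`, and `data_stolen` (an illustrative layout,
    not the paper's exact encoding).
    """
    actions = [("nothing", None)]  # always valid for either player
    for node, s in state.items():
        if player == "attacker":
            if not s["compromised"]:
                actions.append(("compromise_service", node))
            if not s["virus"]:
                actions.append(("install_virus", node))
            # Assumption: stealing already-stolen data is not useful.
            if s["virus"] and not s["data_stolen"]:
                actions.append(("steal_data", node))
        else:  # defender
            if s["compromised"]:
                actions.append(("restore_service", node))
            if s["virus"]:
                actions.append(("remove_virus", node))
    return actions
```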

Goal Definition
This is a zero-sum game; thus, the goals for the attacker and defender are the inverse of each other, meaning that if the defender aims to maximize a score, the attacker aims to minimize it. The game score at a given time t, where the game is in state s_t ∈ S, is defined as:

- f(1) = score for the network availability;
- f(2) = score for the node security;
- f(3) = score for the defense strategy cost effectiveness.

The score for network availability is defined as the sum of the value of communicating services.
Keeping the nodes secure means keeping them virus-free and making sure the attacker did not steal the data. Let n_i be the state of node i in the network, as defined in Equation (13). An effective defense strategy means that the defender expended as few resource points as possible while defending the network. The score for the defense strategy cost effectiveness is the difference in resource points between the defender and the attacker.
The reward function maps a combination of the specific state s_t and a player's action to a reward. This mapping of a state-action pair to a reward is the r_t used in the Q-learning algorithm. We want to incentivize the players to keep the network in their desired state for as long as possible and to use their resource points efficiently. This means we can use the cumulative state scores of f(1) and f(2) and the final score of f(3). An example of the reward calculation for a single transition is visible in Figure 5.
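The three objective scores can be sketched as follows. The exact forms and weights live in the paper's equations; the node layout and field names here are illustrative assumptions:

```python
import numpy as np

def score_vector(nodes, defender_points, attacker_points):
    """Illustrative 3-objective score vector (f1, f2, f3) for a state.

    `nodes` is a list of dicts with a `weight` and boolean flags
    `service_up`, `virus`, `data_stolen` (an assumed layout).
    """
    # f1: network availability - sum of the value of communicating services.
    f1 = sum(n["weight"] for n in nodes if n["service_up"])
    # f2: node security - value of nodes that are virus-free and unstolen.
    f2 = sum(n["weight"] for n in nodes
             if not n["virus"] and not n["data_stolen"])
    # f3: cost effectiveness - resource-point difference, defender minus attacker.
    f3 = defender_points - attacker_points
    return np.array([f1, f2, f3], dtype=float)
```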

Experiment
In this section, we use an example to show how the Pareto-Q-Learning implementation can help a network defender create an optimal policy for network defense. First, we describe the game conditions and implementation details; then we discuss the results of the implementation. In the discussion, we compare the Pareto-Q-Learning approach to a random defense strategy and a Q-learning-only approach, and show that the Pareto-Q-Learning approach is superior to both alternatives.

Network Definition
The goal of this experiment is to show that the Pareto-Q-learning approach is an effective method of determining an optimal defense strategy. More specifically, the approach (a) outperforms random selection; (b) outperforms a non-Pareto learning approach; and (c) works well under different network configurations. The experiment reaches that goal by answering three questions:

1. Does the Pareto-Q-learning approach lead to an optimal solution for the defined network security game? The experiment should show that a policy determined by the Pareto-Q-learning approach performs better than an arbitrary (random) policy selection.

2. Does the Pareto-Q-learning approach provide a superior solution to a simple Q-learning approach? The experiment should show that Pareto-Q-learning reaches an optimal solution faster than the simple Q-learning algorithm.

3. Does Pareto-Q-learning provide a solution for different computer network configurations? As indicated in the introduction and related work sections, the Pareto-Q-learning approach should be applicable to any computer network that conforms to the game definition described in the theory section. The experiment should show that the Pareto-Q-learning approach can find an optimal solution in several different network configurations.

Experiment Planning
The Pareto-Q-learning approach was applied to 16 different network configurations, to determine how well the approach holds up across variant networks. The configurations were chosen to emulate real networks as much as possible and to allow for a game with at least three moves.

A node in the network has three services, can be infected by one virus, and has one data store. Ten percent of the possible services have been turned off (meaning that the service cannot be contacted, but also cannot be compromised). Likewise, 10% of the nodes do not have data stores and 10% of nodes cannot be infected by a virus.

Each network consists of a set of high-value nodes (servers, Network Attached Storage, etc.) and low-value nodes (client machines, peripheral devices, etc.). The weights of high- and low-value nodes are randomly assigned within the ranges defined in Table 1. The costs of actions are the same as those defined in the model and are also given in Table 1, as is the ratio of high-value nodes to low-value nodes. Variety in the networks was achieved by using different numbers of nodes, game points, and sparsity values; the choices are given in Table 1. To show the improvement of the policy over time, the resulting reward vectors have been normalized and turned into scalars using the scalarization values given in Table 1. The defined nodes are randomly connected to each other until the network hits the number of edges specified by the sparsity, which is |edges| = sparsity × |nodes| × (|nodes| − 1). The resulting random network is used as the starting state for a game; note that each configuration has a different starting state. Figure 6 shows an example of what an actual network may look like. Note that this figure contains just 10 nodes, and that there exists a path between all the nodes of the graph, but the graph is not fully connected.
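The edge-count rule above can be sketched as a small generator. This is an illustrative reconstruction, not the paper's code; it assumes undirected edges and a sparsity low enough (e.g., the paper's 0.05) that the target edge count is reachable:

```python
import random

def random_network(num_nodes, sparsity, seed=0):
    """Randomly connect nodes until |edges| = sparsity * n * (n - 1).

    Assumes sparsity is small enough that the target does not exceed
    the number of possible undirected edges, n * (n - 1) / 2.
    """
    rng = random.Random(seed)
    target = int(sparsity * num_nodes * (num_nodes - 1))
    edges = set()
    while len(edges) < target:
        a, b = rng.sample(range(num_nodes), 2)  # two distinct nodes
        edges.add((min(a, b), max(a, b)))       # canonical undirected edge
    return edges
```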

Experiment Operation
Each configuration of the game was run 200 times, with ε starting at 0 and increasing to 0.8 by the 150th run. The starting state of a configuration was constant for each iteration. Each configuration was run for three different algorithms, namely: (1) Random strategy: the defender uses a random strategy; (2) Basic Q-learning: the defender uses the basic Q-learning algorithm; (3) Pareto-Q-learning: the defender uses the Pareto-Q-learning algorithm.
The experiment was implemented in Python, using the NumPy library to keep track of the state vectors and Q-table vectors. Rather than saving the state as a pair of vectors of vectors, we chose to roll out the state into a single integer array, with 0 representing a service/virus/data store that has been compromised and 1 representing one that the defender has restored.
Figure 7 shows a sequence diagram of the program. The program consists of one game object acting as the controller of a state object. The state object holds the current state, the network definition, and the logic to calculate Pareto fronts; the game object holds the logic to calculate the Q-values needed for the Q-learning.

During a single game, the game object retrieves the valid actions from the state, uses the Q-values to choose an action, updates the state, and updates the Q-values, until the game ends. Once the game ends, it resets the state object to the initial state and starts a new game. It repeats this 200 times.
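The ε schedule and action selection described above can be sketched as follows. Here ε is the *exploitation* probability (random actions are taken when rand() > ε, as stated in the text); the linear shape of the schedule is an assumption, since the paper does not specify it:

```python
import random

def epsilon(run, final=0.8, final_run=150):
    """Exploitation probability: 0 at run 0, rising to `final` by
    `final_run` and constant afterwards (linear shape assumed)."""
    return min(final, final * run / final_run)

def choose_action(actions, q_values, eps, rng=random):
    """Explore with a random action when rand() > eps; otherwise
    exploit the action with the highest (scalarized) Q-value."""
    if rng.random() > eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))
```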


Experiment Result Definition
Each game iteration produces a final score, which is the cumulative reward of that game.The goal of the defender, as defined in the game model section, is to maximize that cumulative reward.
Note that the game has three separate objectives; thus, the reward is a 3D vector. An optimal defense policy will have a higher cumulative reward than a non-optimal defense policy.

To address the stated research questions, the results of the experiments should show that, over time, the Pareto-Q-learning approach leads to higher cumulative rewards. By plotting the cumulative rewards of games played over time, we can show just that.

Each configuration of the game was run 200 times. The cumulative reward of each game was averaged over two iterations, creating 100 data points for analysis. The averaging to 100 points was done in order to decrease the variability between the results. While learning, the Pareto-Q-learning algorithm will, from time to time, select a random action (when rand() > ε). These random choices are a feature of Q-learning, as visible in Figures 2 and 3, and are necessary for the exploration of the state space. However, they can lead to very different rewards and, thus, increase the variability of the results.
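The pairwise averaging of the 200 per-game rewards into 100 data points is a one-liner with NumPy (shown here for scalarized rewards; the helper name is illustrative):

```python
import numpy as np

def average_pairs(rewards):
    """Average cumulative rewards over consecutive pairs of games,
    halving the number of data points (e.g., 200 games -> 100 points)."""
    r = np.asarray(rewards, dtype=float)
    return r.reshape(-1, 2).mean(axis=1)
```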


Results and Analysis
The results show cumulative rewards over time.If the Pareto-Q-learning leads to an optimal solution, the cumulative rewards should increase over time in all three dimensions.In order to validate that this is the result of the approach and not the game definition, the Pareto-Q-learning rewards should be compared to a base case of arbitrary (random) actions and a simple Q-learning approach (i.e., the possible action space is not limited by Pareto optimization).
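The Pareto pre-screening referred to here keeps only non-dominated reward vectors before Q-learning explores the corresponding actions. A minimal dominance check (a straightforward O(n²) sketch, not the paper's implementation):

```python
import numpy as np

def pareto_front(vectors):
    """Indices of non-dominated reward vectors (maximization).

    A vector is dominated if another vector is >= in every objective
    and > in at least one; dominated actions are pruned from the set
    the Q-learning algorithm may explore.
    """
    v = np.asarray(vectors, dtype=float)
    keep = []
    for i in range(len(v)):
        dominated = any(
            np.all(v[j] >= v[i]) and np.any(v[j] > v[i])
            for j in range(len(v)) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep
```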
A visual representation of the network with a configuration of 100 nodes and 0.05 sparsity is given in Figures 8-10, and the results show improvement over time. Figure 8 shows the network availability (f(1)), Figure 9 shows the node security (f(2)) over the same games, and Figure 10 shows the defense strategy cost effectiveness (f(3)) over the same games. The configuration for these figures is derived from Table 1: 100 nodes, a sparsity of 0.05, and both the defender and attacker starting with 250 game points. The weights of the nodes are randomly generated within the limits described in Table 1.

We implemented the game using Python version 3.6 (2017) and the NumPy library (2017); Python is free software by the Python Software Foundation, and NumPy is the fundamental package for scientific computing with Python. The experiment was run on a MacBook Pro 2016 model (macOS Sierra version 10.12.6) with an Intel Core i5 (dual-core 3.1 GHz) CPU and 8 GB of RAM, using the machine's internal SSD for storing data.

According to the experimental configuration in Section 4.3, the algorithm performs 200 games per configuration. Calculating the Pareto fronts leads to greater execution time in large action spaces; however, increasing the efficiency of the Pareto algorithm is not the goal of this paper. Moreover, a short number of games matches the actual situation and helps the defender prevent attacks as soon as possible in practical application. We therefore focus on the number of game moves it takes to arrive at an optimal strategy instead of the actual running time. We believe that the immediate outperformance of the Pareto-Q-learning algorithm is due to the elimination of clearly inferior actions from the set of possible actions that the Q-learning algorithm can explore. The result is a rapid determination of an optimal defense strategy.

To determine if the approach works in different network conditions, we graphed the scalarized cumulative reward vector over time for the 16 different configurations described in the experiment planning section. The horizontal axis represents the running time in the corresponding network configuration, and the vertical axis represents the corresponding cumulative reward function value; the higher the reward function value, the better the game result. Figure 13 shows that for each implemented configuration the Pareto-Q-learning approach leads to an increase in the cumulative reward over time.
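The scalarization of the 3D cumulative reward vectors can be sketched as a normalize-then-weight step. The min-max normalization shown here is an assumption; the actual scalarization values are given in Table 1:

```python
import numpy as np

def scalarize(reward_vectors, weights):
    """Normalize each objective to [0, 1] over the run, then apply a
    weighted sum to obtain one scalar per game."""
    r = np.asarray(reward_vectors, dtype=float)
    lo, hi = r.min(axis=0), r.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return ((r - lo) / span) @ np.asarray(weights, dtype=float)
```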

One disadvantage of the Pareto-Q-learning approach is that it does not completely solve the problem of a large action space. As is visible from Figure 13, an increase in either the sparsity or the number of nodes (i.e., an increase in the action space) effectively decreases the learning rate. The Pareto-Q-learning approach, as an improvement of simple Q-learning, will eventually encounter the same large-action-space problems as Q-learning itself.

Another disadvantage is the high variance in strategy selection. We believe this is the result of the stochastic nature of Q-learning in the exploration phase: by randomly selecting strategies, the results vary wildly. The independent nature of the goals exacerbates this problem, as an accidental move that highly rewards one goal may lead to a policy that pursues that goal at the cost of the others. This is visible in the uniform distribution ignoring the "do-nothing" strategy, which has a low chance of being selected from the very large action space.

Conclusions and Future Work
In this article, we presented an application of reinforcement learning, specifically Q-learning, to determine an optimal defense policy for computer networks.We enhanced the basic Q-learning approach by expanding it to multiple objectives and pre-screening possible actions with Pareto optimization.The results of our experiment show that basic Q-learning is a slow approach for network security analysis, but the process can be made significantly more efficient by applying Pareto optimization to the defense strategy selection process.
The work presented in this paper has several limitations. Our game model consists of nodes that are directly connected to each other; however, most real-world networks contain devices that facilitate connections between other devices, such as routers and switches. In addition, in our model the attacker chooses random actions. In real computer networks, attackers do not behave this way: even if an attacker initially performs random attacks blindly, he will change his strategy over time. Therefore, it seems difficult to use these approaches to counteract a real-time attack. The model does not consider that the attacker has a defined goal, nor do we explicitly predict attacker behavior; instead, we focus solely on the defender.
We feel that the presented experiment functions best as an academic exercise, which shows the advantages offered by combining Pareto optimization and reinforcement learning.The method can be used for systematic monitoring of a network or a set of computers.Network defenders can run an offline system analysis based on the model, and use the results as a reference for planning defense strategies.This is especially useful for networks with a large number of nodes.
While the current model provides us with an automated defense selection mechanism, the results were still plagued by a rather high variance in the game scores. As indicated in the results analysis, we assume this to be the result of the algorithm's inability to balance the divergent goals. We suggest that future work focus on choosing an optimal strategy from independent, and seemingly equally important, goals and applying that to Pareto set domination.

Another direction for future work is implementing this approach for an information system facing an emulated distributed denial of service attack. We could then test, independently of processor time delay, the propagation of the attack and the reaction of the defender.

Figure 1. Example of Q-learning in a Markov Process.


Figure 4. Example of a game state.

Figure 5. Game state transition and reward calculation.

Figure 12. Comparison of execution time between algorithms.


Figure 13. Pareto-Q-Learning over time for different configurations.


Table 1. Configuration of Game Model.