Cyber Insurance Ratemaking: A Graph Mining Approach

: Cyber insurance ratemaking (CIRM) is a procedure used to set rates (or prices) for cyber insurance products provided by insurance companies. Rate estimation is a critical issue for cyber insurance products. This problem arises because of the unavailability of actuarial data and the uncertainty of normative standards of cyber risk. Most cyber risk analyses do not consider the connection between Information Communication and Technology (ICT) sources. Recently, a cyber risk model was developed that considered the network structure. However, the analysis of this model remains limited to an unweighted network. To address this issue, we propose using a graph mining approach (GMA) to CIRM, which can be applied to obtain fair and competitive prices based on weighted network characteristics. This study differs from previous studies in that it adds the GMA to CIRM and uses communication models to explain the frequency of communications as weights in the network. We used the heterogeneous generalized susceptible-infectious-susceptible model to accommodate different infection rates. Our approach adds up to the existing method because it considers the communication frequency and GMA in CIRM. This approach results in heterogeneous premiums. Additionally, GMA can choose more active communications to reﬂect high communications contribution in the premiums or rates. This contribution is not found when the infection rates are the same. Based on our experimental results, it is apparent that this method can produce more reasonable and competitive prices than other methods. The prices obtained with GMA and communication factors are lower than those obtained without GMA and communication factors.


Introduction
In 2020, the World Economic Forum placed cyberattacks in the top 10 categories of risks over the next ten years (World Economic Forum 2020). The rapid development of information and communication technology (ICT) has led to the need for proper risk management. Cyber insurance is one option that can protect ICT sources from losses incurred due to cyberattacks (Bodin et al. 2018;Camillo 2017). Empirical studies have shown the adequacy of the use of cyber insurance to manage cyber risks (Biener et al. 2015). Cyber insurance ratemaking (CIRM) is one of cyber insurance products' main problems (Marotta et al. 2017). The uncertainty of cyber risk factors has caused companies to price cyber insurance products conservatively at high prices. Several existing risk models have been built without paying attention to network structure and computer information (Bohme and Schwartz 2010). The connectedness of ICT sources in the network and the transmission process occurring via connections between them are two distinctive characteristics of cyber risks. Therefore, the analysis of cyber threats must consider the network structure and the characteristics of computer information.
To characterize the structure, we propose using a graph mining approach (GMA) for CIRM on weighted networks. The weights describe the communication frequency (the number of connections made while sending and receiving information) in a computer communication network (Chou 1975). Several studies of network traffic and cyberattacks have shown a relationship between them (Almutairi et al. 2020;Pimenta Rodrigues et al. 2017;Wang and Jones 2020). Consequently, graph mining serves to acquire groups with intense communication. GMA has three stages. Stage 1 comprises the steps used to generate a weighted network based on probability distribution information. Miller and Childers (2012) treated the arrival of messages (packets) to a node as an example of a random process in the computer communication network and modeled it following the Poisson process. In this study, we modeled the number of connections using a probability distribution with two perspectives. These are node-and link-based models. The node-based model uses the analogy of the weighted co-purchase product network formation for market basket analysis (Kim et al. 2012;Raeder and Chawla 2011;Videla-Cavieres and Ríos 2014). The link-based model directly treats the weights on the links as random variables.
Stage 2 is the process of obtaining communication characteristics using the GMA (Zhang et al. 2011). This stage comprises three parts-namely, community detection, threshold setup, and a filtering process. Community detection is used for identifying structural similarity (Boobalan et al. 2016;Chang et al. 2017;Karatas and Sahin 2018;Remy et al. 2018). Especially in the spread of viruses, community detection can be used to find groups with more dense contacts than inter-group contacts . In every community, some nodes or links are rarely used. Threshold setting and filtering processes are used to determine the communication threshold. Nodes and links that are lower than the threshold are not involved in the CIRM simulation. The company can choose the level of risk desired based on the proportion of the threshold in each community. Nodes that do not meet the threshold are not covered by insurance. Hence, the insurance rate can be adjusted according to the company's capabilities.
In stage 3, a Monte Carlo simulation is conducted to evaluate the network security level and calculate losses in each filtered community. We use the heterogeneous generalized susceptible-infectious-susceptible (HG-SIS) model by Ottaviano et al. (2018Ottaviano et al. ( , 2019 as the basis of the simulation to include the effects of communication weight. The HG-SIS model is an extension of the ε-SIS model (Van Mieghem 2014;Van Mieghem and Cator 2012). The ε-SIS model has been used as the basis for previous CIRM simulations in the unweighted network (Xu and Hua 2019). Our previous study also extended the compartmental SIS model (Kermack and McKendrick 1991) to estimate cyber risk using the average degree for several particular network topologies (Indratno and Antonio 2019). This model was not used because it could not detect the individual status of the microlevel perspective simulations.
The main research objective of the GMA is to obtain a more appropriate and more competitive insurance rate (or price) by characterizing the network structure. This approach adjusts rates based on more active communication to overcome overpricing issues. The main contributions of this paper are as follows: • incorporating the effects of communication intensity in the CIRM process. • developing GMA procedures to homogenize high communications for CIRM in weighted networks. • extending the CIRM simulation with different link infection rates (according to communication intensity) using the HG-SIS model. • applying the GMA procedure for CIRM in two networks: a hybrid network and a random network. Then, the results are compared with those obtained without GMA cases.
• None of the works used GMA for selecting risk, especially from a network model perspective. We propose the use of GMA before the CIRM simulation. • The last three studies used epidemic models with similar infection rates. We adjusted the link infection rate based on the communication weights using the HG-SIS model. • Some works included network structures, but none used weighted networks. We considered the communication weight factor in the CIRM process. • Only one work included the communication effect-that is, the average arrival traffic per attack. However, there is none from the network model perspective. We consider the communication effect (the frequency of communication) from the network model perspective.

Graph Mining Approach
In this section, we explain the GMA used to study the structure of weighted networks for CIRM. Figure 1 shows the proposed approach used for the ratemaking process in cyber insurance products. This method comprises three stages: generating a weighted network, selecting risks using graph mining, and processing the ratemaking simulation. The methodology and algorithms at each stage are described in this section. Stage 1 requires information on the frequency of communications in the network or the distribution of communication frequencies. Two models used for the study of communication frequency are the node-based model and the link-based model. In a node-based model, we use a co-purchase product network formation analogy. This model involves two random variables: the number of communications and the number of communicating nodes. The link-based model assumes the number of communication on the link to be a random variable. Synthesis data during the contract period can be obtained through the random communication process simulation based on statistical distribution assumptions.
In stage 2, the graph mining procedure is conducted on the weighted network formed in stage 1. Community detection on the weighted network is used to classify risks. In each community, threshold settings and filtering processes are conducted to select links with a high level of communication. Connections with small contacts are not involved in rate simulation.
The simulation of CIRM is conducted in communities that have been filtered in stage 3. Before carrying out the simulation, the infection rate for each link, which was initially the same, is adjusted to the communication frequency of each link. Therefore, we obtain different infection rates for each connection. The Monte Carlo simulation for capturing dynamic transmission is conducted using the HG-SIS model, with heterogeneous infection rates (different for each link). Insurance rates are finally obtained by calculating the total loss and using the standard deviation premium principle.

Connectivity Models
To build a communication network, we need a model to represent random events in the network. We consider the analogy of basket market analysis to create an interconnected network using the number of communications in every link. Then, we build the weighted network based on the number of co-purchase products derived from the data of each transaction in the market basket analysis using the network. Each transaction includes several co-purchase products. This analogy can be used to build a communication model on a computer network by assuming transactions as communications, and each transmission can involve several nodes or computers.
The communication calculation between two nodes produces a weighted communication network based on the node or computer that communicates in a specific transaction. The weights are the frequency or quantity of communications. This approach is called the node-based model and is explained in Figure 2. Suppose that a company has four nodes (computers). In one day, there are three communications where C1 denotes the first communication, C2 denotes the second communication, and C3 denotes the third communication. The first communication involves Node 1, Node 2, and Node 4. In the second communication, Node 2, Node 3, and Node 4 send data. The interaction between Node 3 and Node 4 occurs during the third communication. Let X denote a random variable representing the number of communications that follow a discrete distribution with a probability mass function p X = P(X = c) . Also, let Y is a random variable representing the number of nodes that communicate in each communication, which follows a discrete distribution with the probability mass function p Y = P(Y = n). Both are independent, and they can follow binomial distributions, Poisson distributions, or negative binomial distributions. If the random variables X and Y are drawn from Poisson distributions, they indicate the number of communications or nodes within a specific time. If both have binomial distributions, they represent the number of successful communications or nodes linked successfully. They might also be interpreted as the number of successful communications. When the total number of communication failures are known, they exhibit negative binomial distributions.
A formula for the number of connected pairs of nodes in a day is ( n 2 )c. For this, we assume a company that has 100 computer units. Consequently, the maximum number of nodes connected to the communication is also 100. We assume that the company's computer network can accommodate up to a thousand contacts per day. Figure 3 shows the effect of c and n on the total communication that occurs in all links. If given a value around the mean of the random variable Y equal to n, then the change in the value of c will have a linear relationship to the total number of communications in a day (see Figure 3a). Meanwhile, the value of n and the total number of communications have a nonlinear relationship if a value around the mean of the random variable X equals c (see Figure 3b). The algorithm used for the node-based model is given by Algorithm 1.
(a) Change of c for given n.
(b) Change of n for given c. Algorithm 1: Network construction simulation using a node-based model.
Input: Network topology, the number of computers N, time T, p X , and p Y . Generate link list/set L. Generate N sampling probabilities from U ∼ [0, 1]. for i = 1 to T do Generate the number of communication c from p X . for j = 1 to c do Generate the number of nodes in each communication n from p Y . Select at random n node from node-set based on its sampling probability. Construct link list l uv for each co-node u and v. Match l uv with k ∈ L. for k in L do Calculate the accumulation of match link. end end return communication wight vector. end Construct a weighted network until T. Output: communication weight.

Link-Based Model
A simpler model that is also used as a communication quantity model treats the number of communications on each link as a random process. This model is called the link-based model. Let Z k denote a random variable of the number of communications in each link k with the probability mass function p l = P(Z k = l). The difference with the node-based model is building a weighted network process where the weight of each link is the number of communications through the link. Thus, this model only uses one random variable. Following the previous model, Z k is an identical binomial random variable, a Poisson random variable, or a negative binomial random variable, implying the same distribution for every k. Table 3 explains the probability mass function for each distribution of Z k .

No. Distribution
Parameter All distributions depend on the value of the parameter. Suppose a network has links. Thus, Z k , k = 1, 2, · · · , are the independent and identically distributed random variables. The distribution of the total communication in a network is the sum of Z k equal to Z = Z 1 + Z 2 + · · · + Z . The following properties show the distribution of the number of communications in the network Z for each distribution of Z k in Table 3 using the characteristics function.
Proposition 1 (Dekking et al. (2005)). Let Z k for k = 1, 2, · · · , denote a random variable for the number of communication in k-th link with an independent and identically Poisson distribution and parameter λ for every k. The distribution of the total number of communications Z = Z 1 + Z 2 + · · · + Z in a network is a Poisson distribution with the parameter λ .
Proposition 2 (Dekking et al. (2005)). Let Z k for k = 1, 2, · · · , denote a random variable for the number of communications in the k-th link with an independent and identically binomial distribution and parameters m and p for every k. The distribution of the total number of communications Z = Z 1 + Z 2 + · · · + Z in a network is a binomial distribution with the parameters m and p.
Proposition 3 (Dekking et al. (2005)). Let Z k for k = 1, 2, · · · , denote a random variable for the number of communications in the k-th link with an independent and identically negative binomial distribution and the parameter r,ρ for every k. The distribution of the total number of communications Z = Z 1 + Z 2 + · · · + Z in a network is a negative binomial distribution with the parameter r ,ρ.
Algorithm 2 provides the rule used for creating weights using a link-based model. First, we generate a random number of communications in a network from the given distribution. Then, we choose an active link for each contact based on its probability. The link weight denotes how much the link is selected as an active link. This step runs until time T. We used the beta distribution as a sampling probability for the algorithm to accommodate different communication patterns rather than assuming the patterns to be uniform.

Community Detection
The structural properties of node interactions in its group are the output of community detection procedures. Community detection algorithms are widely used in the fields of computer science and mathematics. Community detection for communication networks has been used to find efficient mobile networks (Nguyen et al. 2014). Community detection is also used to obtain the similarity of communication structures in mobile phone communication networks (Blondel et al. 2008). After we obtain a graph with communication weight in a computer network, the next step is to find the similarity of its structure through community detection methods.
Algorithm 2: Network construction simulation using link-based model.
Input: Network topology, the number of computers N, time T, the distribution of Z. Generate link list/set L. Generate sampling probability from Beta(a, b). for i = 1 to T do Generate a total frequency of communication z in the network based on the distribution of Z. Select at random z link from link set based on its sampling probability.
Calculate the accumulation of each selected link. end return communication wight vector end Construct a weighted network until T time Output: communication weight There are two major groups of community detection algorithms: disjoint community detection algorithms and overlapping community detection algorithms (Javed et al. 2018). In this case, we want a node to only be in one risk group so that the detection of the selected communities results in disjoint communities. One of the disjoint community detection algorithms is the modularity-based algorithm. Suppose that G = (G 1 , G 2 , · · · G ζ ) is a disjoint sequence of ζ subgraphs and the number of subgraphs ζ is assumed to be unknown. The number of subgraphs should be determined using an algorithm for maximizing the modularity function Q, where Q is equal to where ij is the total number of links that have one end node in the community i and the other in community j, while the term a i is the total number of edges that connect to nodes in the community i (Newman and Girvan 2004). Three algorithms can be used to solve this problem. First, modularity-based algorithms use a heuristic searching method to approximate the optimization problem, called extremal optimization Percus (2001a, 2001b). Second, spectral optimization (Chen et al. 2014;Newman 2006;Newman and Girvan 2004) uses spectral information from matrix data, eigenvalues, and eigenvectors to maximization the modularity. Finally, greedy optimization (Blondel et al. 2008;Clauset et al. 2004;Danon et al. 2006;Newman and Girvan 2004) runs the modularity optimization of the largest number of communities (treating every node as a community).
Because of the time complexity issue, the greedy algorithm from Blondel et al. (2008), which we can call the Louvain algorithm, is chosen for our problem. A communication network is a weighted network where the link weight is the communication between two nodes. We need a modularity function for a weighted network (Newman 2004). For a weighted network, the modularity function Q W can be defined as: where w ij is denoted as the weight of nodes i and j, k i and k j are the cumulative weight of the link between nodes i and j, and m is the total weight of a network. G i and G j are denoted as the community locations of nodes i and j, respectively, and δ(.) is the Kronecker delta function.
Afterwards, we need to define the threshold setup and network filter methodology.

Threshold Setup
We should define the threshold of η to extract a strong connection. The threshold is created because the entire co-product network contains high degrees and spurious edges (Videla-Cavieres and Ríos 2014). We also assume that the connections between the computers in the computer communication network have high degrees (similar to the fully connected network topology) and contains spurious connections. Then, it is necessary to determine η. This methodology find a new graph that meets the criteria that remove the weight of all edges w uv , u, v = 1, 2, · · · , N lower than η, w uv < η.
Although there is no standard method for this step, several ways have been used to determine η of the co-purchase network. The threshold η can be chosen from a specific constant value even though it cannot apply to all networks. Alternatively, η is selected from the average weight on the network or using the top three heavy edge thresholds (tthet) (Videla-Cavieres and Ríos 2014). We consider applying this method for our weighted computer communication network and adding other two criteria, namely, the top four heavy edge thresholds (tfhet) and the top five heavy edge thresholds (tvhet), as a comparison.
Suppose the descending ordered set of m weighted links in the network is W = {ω 1 , ω 2 , · · · , ω m }. Therefore, tthet, tfhet, and tvhet are defined as the average of the top three weights, the top four weights, and the top five weights according to the following term : where ω 1 is the heaviest edge weight and ω 2 to ω 5 are the second to fifth heaviest edge weights, respectively.

Network Filter
We also use network filter methodology (Videla-Cavieres and Ríos 2014) for filtering our communication network. This filtering step can make a new graph more flexible than only using a threshold set-up methodology. Sometimes only a little edge weight will satisfy the minimum threshold. The filtering step is the proportion or percentage of tthet, tfhet, and tvhet. Consider every 5% of thresholds tthet, tfhet, and tvhet until 100% arep = {0.05, 0.1, 0.15, · · · , 0.95, 1}. A new set of thresholds is found, with 20 members for each.
Using dot product betweenp and thresholds, the set of filters can be written as: With this methodology, we can find the limit of weight that makes the network structure represent each risk not only for high-risk categories but also for medium-and low-risk categories.

HG-SIS Model Ratemaking
After obtaining communication, the next step is to simulate CIRM for the graphs that have been received. Previous studies have used homogeneous infection rates and recovery rates for ratemaking. These rates can vary. Figure 4 illustrates the processes that occur at each node by using the SIS, ε-SIS, H-SIS, and HG-SIS models. All models are worked as node-level models. In the SIS model (Van Mieghem 2014;Van Mieghem et al. 2009), each node can have a vulnerable or infected status at a rate that remains assumed to be the same. In this model, the infection occurs because of contact with other infected nodes (see Figure 4a). The ε-SIS model (Van Mieghem and Cator 2012) generalization adds the source of infection through self-inflicted access to malicious sites, opening e-mails containing worms, downloading files that are inserted with malicious software, etc. (see Figure 4b).
Cyber insurance pricing uses a homogeneous ε-SIS model (Fahrenwaldt et al. 2018;Xu and Hua 2019). Heterogeneous models have also been described for the case of the spread of computer viruses at different infection rates. Heterogeneous SIS (H-SIS) allows the rate of each link to be different (see Figure 4c) Ottaviano et al. (2018Ottaviano et al. ( , 2019. The H-SIS model enables infection rates that depend on the type of connection between the two nodes. This model is more realistic and has a broader scope. We

HG-SIS Model
Let us consider a network represented by a graph G = (N , L), where N is the nodeset, and L is the link set (Diestel 2017). Computer viruses spread in this network through links. Graph G is a loopless graph that does not accommodate a connection to itself. Representation graph G as an undirected graph is based on the assumption that each node can send and receive data, or this attack is called a two-way attack. Graph G is a weighted graph where the weight of the link (u, v) ∈ L for u, v ∈ N is given by w uv and w uu = 0 for every u ∈ N because G is an undirected graph. Weights in this network are the number of communication on each link obtained from the node-based or link-based models.
Suppose that β uv is the infection rate for connection types between u and v for u, v ∈ N . Because the network type does not have a loop, there is no connection to itself or β uu = 0 for every u ∈ L. Given that G is an undirected graph, the infection rate matrix B = [β uv ], for u, v = 1, 2, · · · N, is a symmetric matrix. At the node, v recovery and infection can occur depending on the type of computer, δ v and ε v . The infection, recovery, and self-infection process follows the Poisson process, where the infection rate is β uv , the recovery rate is δ v , and the self-infection rate is ε v . Hence, the time to infection for node u due to an attack from infected node v is an exponential random variable with mean β uv . Next, the time to recovery for node v is an exponential random variable with mean δ v . The time to self-infection for the v node is an exponential random variable with mean ε v . They follow a homogeneous Poisson process, which is a Poisson process that does not depend on time. However, links or nodes have non-identical (heterogeneous) rates.
The infection rate is a function of the communication weight in the network. Assume that for communication weights, there are a maximum and a minimum infection rate so that ∀(u, v) ∈ E, min(β uv ) < β uv < max(β uv ). The relationship between the network weight and infection rate is given by the positive sigmoid function where β > δ β and k > 0. The infection rate matrix B = [β uv ] becomes: where β > 0 and δ β > 0. The average communication weight is given byw = ∑ u,v w uv /2m and the growth rate of function is given by k = 1/σ, where: and m is the number of links |L|. The following proposition describes the characteristics of the function regarding the infection rate.
Proposition 4. For a function defined by a positive sigmoid function, the infection rate β uv satisfies the following properties: • max(β uv ) = β and min(β uv ) = δ β • If w uv →w and σ > 0, then β uv = β+δ β 2 • If w uv → ∞ and σ > 0, then β uv = β • If w uv → 0,w 0 and σ > 0, then β uv = δ β The full proof of Proposition 4 can be found in Appendix A. Figure 5 shows a transformation function f (w uv ) for β uv . In practice, the upper and lower limits of the infection rate (β and δ β ) can be determined by the upper and lower limits of the confidence interval of the point estimator for the link infection rate:β.
Let I v (t) be the random variable that explains the status of node v, where I v (t) ∈ {0, 1}. If at time t, node v is infected, then I v (t) = 1 with probability p v (t) = P(I v (t) = 1). If node v is vulnerable at time t, then I v (t) = 0 with probability 1 − p v (t) = P(I v (t) = 0). The transition probabilities of node v, which is p v,xy (t) = P(I v (t + h) = y|I v (t) = x), of the HG-SIS model can be written as follows: Clearly, Consider the conditional probability of infected node v at time t + h: . By the law of total expectation (Ross 2019) and the same perspective as the SIS model (Van Mieghem 2014), we can obtain: The dynamic equation for the infection probability of the HG-SIS model can be driven using N-intertwined mean-field approximation (NIMFA) (Van Mieghem 2014) as follows: The other approximation uses the upper bound of infection probability. Cator and Mieghem (Cator and Van Mieghem 2014) showed that: where X i (t) and X j (t) are non-negatively correlated for all finite graphs. This result leads us to find the upper bound of infection probability (Xu and Hua 2019), which was previously found for the ε-SIS model.
for u, v = 1, 2, · · · , N, then, for the HG-SIS model, the upper bound for infection probabilities is given by: We can show that the result of Theorem 1 can be obtained in the same way as Xu and Hua (2019) (see Appendix B). The difference between this theorem and the previous theorem (Xu and Hua 2019) is a generalization of the adjacency matrix to the link-infection rate matrix. Note that the stationary probability of NIMFA is the upper bound for the SIS model, which is: Proposition 5. The stationary distribution for the infection probability of the HG-SIS model using NIMFA is given by: Proof of Proposition 5 is given in Appendix C.

Ratemaking
A rate is the total losses or price per unit of exposure (Michael and Rejda 2017). Exposure is a quantity that corresponds to the risk of the policyholder (Parodi 2014). Prices comprise pure premiums used to pay total losses and loading factors as price adjustments for expanding sales and company profits. We consider the standard deviation premium principle (Tse 2009) for pricing the premium, which is: where P is the premium, S is the total loss, and θ is the loading factor. The exposure has a criterion that is proportional to the expected loss. If the exposure increases, the loss expectation also increases. Consider the communication weight on the link (u, v), which is w uv ; the total weight on the network is chosen as the exposure factor-i.e., e = ∑ u,v w uv 2 . The cyber insurance rate for the whole network during a time [0, t] is proportional to: where P is the premium and e is the exposure factor. Conversely, the rate for each node is given by: where Rate v is the rate of node v, P v is the premium using the standard deviation premium principle of node v, and e v is the exposure of node v. Now, we define the total loss using the same perspective with two losses factors (Xu and Hua 2019). Let L v(i) denote the i-th loss of node v caused by infection (stolen information; destroyed data; unauthorized use of an asset; exposed personal data, passwords, or records, etc.). Also, let R v(i) denote the i-th loss of node v caused by the time needed for the system recovery system downtime, where it cannot work as usual to obtain profit. Both of these are modeled by the cost functions µ v (L v(i) ) and ξ v (R v(i) ). The commutative loss for node v to time t is given as follows: where for node v, the total number of infections during (0, t] is given by M v (t). The total network loss up to time t is a summation of each node loss that is equal to: Therefore, the rate for each node v can be determined using the concept in Equation (22) for substitution with a total loss for node v in Equation (23).
Assume that the losses caused by infection L v follow a generalized beta distribution with the following density function: wherew v is the scale parameter that explains the initial wealth or information resources of node v, a, b, c > 0 are shape parameters, and B is the beta function. The loss profile of this model is explained in Figure 6. The generalized beta distribution was selected as a loss distribution because it has a loss value that does not exceed the computer wealthw v . This distribution is also a flexible distribution that can cover all profiles of losses depending on the selected parameter. Figure 6a shows several types of loss profiles that depend on the selected parameter of the loss profile collected at small, center, and large values for the scale parameterw v = 1000. The scaling distribution forw v = 2000 is given by Figure 6b, which uses the same shape parameter values and gives the same profile with a different variability. Consider the linear cost function for the loss caused by infection and the loss caused by the time taken to recover are defined as: where α, α 1 , α 2 are rates related to the infection, initial wealth, and recovery process. We used Algorithm 3 to simulate the total loss. The algorithm is a slight modification of the algorithm created by Xu and Hua (2019).
Algorithm 3: Simulation of cyber security risk with different infection rates Input: Infection rate matrix B, intial status, the number of simulations n s , contract period T, secure node set . for i = 1 to n s do while t < T do Calculate the number of infected nodesM. Generate random time-to-recovery r 1 , r 2 , · · · , rM from exp(δ).
for v in secure nodes do Determine the infected neighbors of node v, j 1 , · · · , j d v where d v is the number of infected neighbors of node v. Generate random time-to-infection y v j 1 , y v j 2 , · · · , y v j dv based on link Determine time for the first event t 1 = min{r 1 , r 2 , · · · , rM, y v j 1 , y v j 2 , · · · , y v j dv , z v }.
if infection occurs then change status from 0 to 1 and calculate the loss. else change status from 1 to 0 and calculate the loss. end end return t, network status, the loss for every node. end Calculate insurance rate using exposure (network weight) for T time contract. Output: dynamic of the network status, total loss, and insurance rate for every node.

Experimental Results and Discussion
Suppose that two companies have two different types of network topologies. There are three divisions in the first company with 50 computers connected to the complete network topology and connected by one bridge. The second company has 150 units of the computer using a network topology that follows a random network (van der Hofstad 2016). Figure 7 shows both of the topologies.
They want to ensure their computer and provide data related to the number of communications in the network. The number of communications can be modelled by the node-based model or the link-based model given in Section 2. It appears that the first topology has three communities based on their structure. The second topology is generated from a random network with N = 150 and p = 0.1, where N is the number of nodes and p is the probability. The process of generating random graphs is an evolutionary process that starts with N isolated node. Then, the process develops with a successful link that exceeds the value of p. In these networks, procedures are performed using the following steps: • Generating weighted graphs. In practice, link weight can be formed by communication data. Then, we can fit and simulate the distribution of communication using Algorithms 1 or 2 to predict future risk. • Finding risk group. To reduce the size of the network, we separate the network into some communities using community detection. • Threshold setting and filtering. Some of the activities or areas of contact between nodes are small. This step excludes the low communication in each community. • Ratemaking. A simulation for infection risk is carried out for every community after the threshold setting and filtering processes. Then, the total premium or rate is the total premium or rate in every community.
The results are described and discussed in the following subsections.

Generating Weighted Networks
Weighted network modelling in this section uses models in Section 3, specifically node-based models using Algorithm 1 and link-based models using Algorithm 2. Consider the computer network of the first company and the second company-they are as shown in Figure 7. First, the node-based model requires the distribution of the number of communication p c in a day and the number of nodes involved in each communication. Both companies want a one-year contract period of T = 365. Let the number of communications and the number of nodes involved in each communication follow the Poisson distribution with mean λ c and λ n . The values of λ c and λ n affect the communication weight distribution in links. Assume that, on average, there are λ c = 400 communications in the network and that each communication involves λ n = 20 nodes on average each day. Figure 8 shows the distributions of each topology. Based on these results, the first network with 3678 links gives 9,369,070 total communications in the network. Conversely, the second network with 1126 links provides 2,756,208 total communications in the network. Although both networks have the same number of nodes equal to 150, the number of links dramatically affects the total number of communications in the network. Figure 8 also shows a high number of spurious connections or connections with small weights. This result indicates that the model represents many real cases, where not all nodes communicate with high intensity, and some are even connected with sporadic communication.  Table 4 explains descriptive statistics for both networks. These results provide similar mean values of weights for the first and second networks. The mean value of the first network is 2547.33, and the mean is 2447.79 for the second network. The maximum weights for each network are 11,154 and 10,773, with minimum values of 0 and 7, respectively. The standard deviation of weight in the second network is equal to 2352.23 and more generous than the standard deviation of weight in the first network equals 2162.83. Next, we introduce the process of generating weighted networks by a link-based model with the procedure given in Algorithm 2. This algorithm requires the number of links and the distribution of Z k , k = 1, 2, · · · , . The distribution is assumed to be selected from the three distributions given in Table 3. This algorithm uses the beta distribution as a sampling probability because this distribution is in the (0, 1) interval. The model can adjust parameters a, b so that there are many spurious weighted links. If we use a uniform distribution as we did before, the frequency of each weight will be similar.
We select the parameters used for the sampling probability distribution of a = 1 and b = 4. From the information given in Table 5, there are 1 = 3678 links in the first network and 2 = 1126 in the second network. Suppose that the average number of communication per link per day is 20. Thus, for the Poisson distribution we select λ = 20, for the binomial distribution we select n = 100 and p = 0.2, and for the negative binomial distribution we choose r = 60 andρ = 0.25. Figure 9 shows the results for each distribution in each link. During one year, 26.8 million communications took place in the first network, and 8 million communications took place in the second network based on simulations conducted using a link-based model. This result is due to the number of links in the first network being three times the number of links in the second network. An identical link weight distribution is obtained, which implies that two networks have the same average communication on the link for each distribution. The descriptive statistics in Table 5 show how close the central tendency is and measure of dispersion for each distribution in both networks. Thus, the models can always obtain a distribution of communication weights on the network with much spurious weight. Both the node-based and link-based models can be used to model the number of connections in each link. In the next section, we consider the results of the node-based model, as shown in Figure 8.

Finding Risk Group
After the previous step, including modelling network weight, community detection is obtained by maximizing the modularity for weighted networks using the Louvain algorithm provided in Section 3. Figure 10 explains the results of community detection, where there are three communities in the first network and six communities in the second network. The same colour indicates that the nodes are in the same community. Different colours imply that the nodes are in various communities. As discussed earlier, the first network comprises three communities, with each group connected to a complete network.
(a) The first network with three communities.
(b) The second network with six communities. Figure 10. Community detection of the weighted network for the first and the second company using the Louvain algorithm.
Next, assume that each group or community or subgraph is mutually exclusive. Figure 11 describes each of the subnetworks from the first and second networks. The maximum modularity is 0.666 for the first network and 0.210 for the second community. The three subnetworks in the first network have the same nodes-i.e., 50 nodes per community. Conversely, six subnetworks in the second network have different numbers of nodes-i.e., 30,18,28,27,24, and 23 sequentially from the first to the sixth community.  (b) Subgraphs of the second network. Figure 11. Subgraphs or subnetworks of the first and second networks.

Threshold Setup and Filtering Process
We want to eliminate the spurious link-for example, a link with a weight equal to 0, which causes the infection rate to be 0. In this step, removing the link is achieved using the threshold setup and filtering process methodology. Considering the thresholds in Equations (5) and (6), Table 6 provides the results of the threshold for each community in the first and second networks. Every community in every network has a threshold near the maximum value. It means that only a few links meet the thresholds. The threshold in Table 6 gives decreasing values for tthet, tfhet, and tvhet. Thus, by increasing the average threshold, we can obtain a more relaxed threshold value, allowing the inclusion of more nodes. Based on these results, we choose to use tvhet to provide a more flexible threshold and use filtering processes to adjust the number of links and the number of nodes in this case. For the filtering step, we consider the proportion of the thresholds that have been predetermined.
f ilters =p · tvhet (28) wherep is a proportion set {0.05, 0.1, · · · , 0.95, 1}. Figure 12 explains the relationship between the number of nodes and the number of edges. If the proportion selection of the filter is large, fewer nodes and links are involved in it. Therefore, it is necessary to consider selecting a good filter to delete the spurious link but not too many to eliminate the connection in the network. The relationship between the number of nodes and the number of links in the first network has the same pattern, although the community in the second network has a different pattern. This pattern is dependent on the link structure and its weight in each community, where the three communities in the first network have a similar structure.

Infection Characteristics and Ratemaking
Let us consider three communities in the first network and six communities in the second network. In each community, we build a weight matrix and use the functions in Equation (8) to show the upper bounds of the infection probability using Theorem 1 and premium simulation using Algorithm 3 for all nodes. As an adjustment, the weight is divided by 365 to obtain the average number of communications per day. Furthermore, β uv for u, v = 1, 2, · · · , N is applied using a positive sigmoid function. Consider (β, δ β , δ v , ε v ) = (0.02, 0.01, 0.05, 1), for v = 1, 2, · · · , N; this parameter is chosen based on the assumption that the average time taken for one node to infect its neighbours via the link is between 50 and 100 days and n s = 1000. The average time taken until one node becomes infected with self infection is 20 days. The average repair time is one day. β and δ β are the maximum and minimum infection rates for all nodes.
Moreover, consider the risk profile in Figure 6; assume that the w v parameter for computer unit wealth is 2000; and follow a red profile pattern where most of the losses occur between 0 and 1000 with parameters a = 2, b = 5, and c = 0.7. Suppose the parameter value for the linear cost function (α, α 1 , α 2 ) equals (0.01, 5 × 10 −6 , 2 × 10 −5 ). To demonstrate the significance of the results, we consider several conditions. These conditions are: • Full network without GMA (without community detection and filtering). • With GMA (using community detection and filtering). In this case, the percentages of the filter are 0% (no filter), 5%, 10%, 15%, and 20%. We set the maximum percentage to 20% to avoid too many links not being considered in the simulation, which would lead to underestimation.
The three conditions were carried out at homogeneous (ε-SIS) and heterogeneous infection rates (IH-SIS). Homogeneous cases used an infection rate of 0.02. Figures 13 and 14 show the upper bound of stationary infection probabilities and the premiums of nodes for cases without and with GMA. Additionally, Figures 15 and 16 depict the total premium and covered nodes in each scenario. These four figures can help explain some of the impacts on the upper bound, premium, and total premium.     (2) with GMA (using filter 0% (No Filter), 5%, 10%, 15% and 20%.

Filter Selection and Community Detection Effects
We discuss the effect of selecting a percentage of tvhet on communities in the first and second networks. Based on Figure 12, the number of nodes and links for each filter percentage is significantly reduced. The stationary probability given by Equation (20) can be used to approximate rates or premiums. We consider five filter percentages-namely, 0%, 5%, 10%, 15%, and 20%. Theorem 1 and Proposition 5 are used to obtain the effect of the filtering process on the upper bound of the stationary infection probability. Figures 13-15 show the results of the filtering process for the upper bound, the premium estimation, and the total premium for the first and second network. The findings for the upper bound of the infection probabilities using Theorem 1 follow the same pattern as the results of the premium simulation. As a result, this upper bound of infection probabilities can be used to approximate the premium. In the first network, all the results showed significant decreases in the upper bound, the premium, and the total premium. These results indicate that although the first network density is high, many nodes were not actively communicating. Conversely, the drop in upper bound, premium and total premium for each filter percentage is not statistically significant in the second network, with an extremely low density. By selecting this risk, we can provide more realistic premiums or rates. The first network is more than three times denser (3678 links) than the second network (1126 links). The filter's effect on the upper bound, premium and total premium in a low-density network is not very visible.
Both networks produce intriguing results when a 0% filter is used (no filter). The premium is calculated in this situation just by identifying the community. The modularity of the first network is 0.666, while the modularity of the second network is 0.210. As a result, the first network produces extremely comparable results for the full network (no community detection and no filter). Meanwhile, the second network delivers a significant reduction. This is because networks with low modularity eliminate a large number of connections between communities. Thus, community detection is recommended for networks with high modularity (more than 0.6). Figure 16 shows the percentage reduction achieved against no filter (filter 0%) and nodes covered for every case. The decrease in premium occurred due to a decrease in the number of covered nodes. On the 20% filter, only 128 (first network) and 120 (second network) are covered. The decrease in premium was faster than the decrease in covered nodes. The selection of a large filter can lead to underestimation. However, our approach could identify risk and allow policyholders to adjust the number of nodes covered (% filter) based on their capacity to pay premiums.

Different Infection Rate and Communication Effects
We compare the model used by Xu and Hua (2019)-namely, ε-SIS and the model with different link infection rates: HG-SIS. Figure 4 shows the difference between the two models. Considering Figures 13-15, the results obtained for the upper bound, premium, and total premium with heterogeneous infection rates give lower premiums than homogeneous ones. As with the scenario that included filtering, the heterogeneous infection rate was more significant in the high-density network (first network) than in the low-density network (second network). According to Figure 16, the heterogeneous infection rate lowered the total premium by 35.6% in the first network and 22.77% in the second network without filtering (0% filter). Xu and Hua (2019) and  obtained results showing that the insurance premiums they produce are greatly influenced by the degree of the node. Thus, the weight factor or communication frequency is not considered in this model. To illustrate the importance of these findings, we plot the relationship between the total communication weight of neighbours and the premiums provided by the model with homogeneous and heterogeneous infection rates. Figure 17 shows the results obtained for the first network, while Figure 18 shows the results obtained for the second network. In the first network, both full and unfiltered scenarios with homogenous infection rates demonstrate a lack of connection between communication weights and premiums, where correlation coefficients of ρ = 0.057 and ρ = −0.021 with p-values > 0.05. The relationship is seen in the 5-20% filter with the results ρ > 0.5 and p-value < 0.05. However, this relationship seems to be affected only by degrees because ∑ u w uv is also the sum of the degree. All cases demonstrated significant findings for the first network with heterogeneous infection rates, with a ρ > 0.9 and a p-value < 0.05.
The premiums of the second network effectively lead to the same conclusion. For homogenous infection rates, the premium demonstrates no relationship between the premiums and communication weights (ρ < 0.3) in any of the cases. Meanwhile, the premium based on heterogeneous infection rates produced significant results in all cases with ρ > 5.9, although the results were lower than the first network. Therefore, the model developed has some highly appealing outcomes. The heterogeneous model could handle risk based on the intensity of contact or communication in the network.

GMA Effects on Microlevel
At the micro-level (node level), we compare the premium or rate obtained without GMA (full and homogenous) to the premium or rate obtained with GMA (community detection, filtering, and heterogeneous infection rate) for fifteen selected nodes. To display the most comprehensive comparison, we chose to use the 20% filter. Table 7 represents the premium calculation simulation results without GMA and with GMA for the 15 selected nodes.
In the first network without GMA, it appears that each node has a mean infection in the interval 63-68, with a principal value of 49 degrees. The premiums show almost the same results of approximately 25 in one currency unit. At node 2 (N2) in the first network (see Table 7A), GMA succeeded in reducing the premium estimate to 6.827 currency units with a 20% filter. Using this method, the premium for Node 2 is diminished by 73.6% compared to the premium without GMA. The previous procedure was conducted in uniform network conditions. GMA considers active communications to acquire risk groups to offer lower prices. The effect of rate adjustment is seen at node 2 (N2). N2 has four neighbours (degree is equal to 4), and the premium is 6.827. Meanwhile, node 142 (N142), with 24 degrees, has a premium of 6.795. Thus, the degree is no longer an influential factor affecting premiums. Figure 17. Relationship between the total weight of the neighbors and their premium (homogeneous and heterogeneous) for two cases: (1) without GMA (Full), and (2) with GMA (using filter 0% (No Filter), 5%, 10%, 15%, and 20% in the first network. ρ is the Pearson correlation coefficient and p is the probability value. Figure 18. Relationship between the total weight of neighbors and their premium (homogeneous and heterogeneous) for two cases: (1) without GMA (Full), and (2) with GMA (using filter 0% (No Filter), 5%, 10%, 15%, and 20% in the second network. ρ is the Pearson correlation coefficient and p is the probability value. In Table 7B, the second network was constructed from random networks. Thus, the degree of each node is different. Although the degrees are different, the premium calculation performed without GMA shows that the 15 nodes are between 21 and 27. The degree effect was also seen in the non-GMA results, where nodes with high degrees had more mean infections. For cases without GMA, node 24 (N24) had a mean infection of 27.150 and a premium of 10.526 in currency units with 23 degrees. The premium obtained for N23 was successfully reduced by 37.9% using GMA.
This result also supports the previous outcome (Figures 13-16), showing that the first network achieved a considerable reduction in the total premium or total loss. These results show that by using a 20% filter, the contraction that occurred reached 70% or more. Filter selection is highly dependent on network density. For dense networks, a 20% filter is too high. However, in low-density networks, the 20% filter yields reasonable improvements. The results of the second network also improve the total premiums. Thus, this method has the potential to be developed and evaluated as a method for adjusting cyber insurance premiums using a network structure to obtain premiums or rates that are genuinely appropriate (not overpriced).

Conclusions
We propose the use of a GMA for CIRM in this study. CIRM performed using GMA has three stages. In stage 1, a network is built with communication weight. In stage 2, community detection and filtering are carried out as the essence of GMA. Stage 3 simulates the premium or rate. The experiments were carried out using two types of networks: a hybrid network and a random network with 150 nodes. The proportion of filters applied during the filtering operation substantially affects the premium or rate outcomes. According to the premium calculation, network density substantially affects the filter's efficiency and the heterogeneous infection rate. Low-density networks tend to produce fewer improvements and vice versa. Community detection is advised if the network's modularity is sufficiently strong (more than 0.6).
Comparisons of the ε-SIS and HG-SIS models show very significant levels of premium reduction. This result is more visible in the network with a high density (first network) than in the second network. The relationship between the communication weights and premiums shows that the proposed model successfully accommodates the communication factor. The correlation between the premium obtained using a heterogeneous infection rate and the weight of the communication was much higher than that obtained for the premium with a homogeneous infection rate. The experimental comparison of the total loss and premium for the first and second networks shows that the GMA results are lower than those obtained without GMA. Consequently, GMA can be developed and evaluated to reduce insurance rates based on the characteristics of communication networks.
This study is still limited to two network characteristics-namely, communication weight and network density. In the future, other network characteristics should be explored. Additionally, this approach disregards the importance of cybersecurity expertise and internal threats posed by employees. Macro-level models, such as those suggested by Xu and Hua (2019) must also consider these variables. Network analyses such as centrality measure degrees, random-walk betweenness, shortest-path betweenness, and farness (Christley et al. 2005) can be considered for the identification of high-risk nodes in future studies. The average degree is also a network size that greatly affects the epidemic thresholds (Kim et al. 2021). In large networks, simulations face complex computational time problems. We suggest using the SIS process simulation approach with the Gillespie Algorithm (Indratno and Antonio 2019; Kiss et al. 2017), which is one of the cornerstones of analysing dynamical processes in complex networks in future studies. Individual-level epidemic models or other agent-based models that can explain the process of computer virus infection still require exploration.

Institutional Review Board Statement: Not applicable
Informed Consent Statement: Not applicable Data Availability Statement: Data sharing not applicable to this article as no datasets were generated or analysed during the current study.