1. Introduction
With borderless networks now reaching nearly every terminal in the world, it is crucial to maintain the security of digital assets and to identify vulnerable data residencies in time. Industries, companies, and organizations have increasingly suffered from cyber breaches, which have posed serious risks to their business operations over the last decades. For instance, on 3 July 2021 a ransomware attack paralyzed at least 200 U.S. companies via Kaseya, a globally used software supplier (
BBC News 2021). It was a colossal and devastating supply chain attack with the potential to spread to businesses of any size or scale through cloud-service providers. Several pieces of federal legislation (e.g.,
Data Security and Breach Notification Act 2015 and
Data Accountability and Trust Act 2019) have been introduced in the U.S. to enhance cyber security and data protection. The Federal Bureau of Investigation (FBI) set up the Internet Crime Complaint Center (IC3) in 2000 (FBI 2000) as a trustworthy source of information on cyber criminal activities, which it combats through criminal and cyber investigative work. In 2020, IC3 received a total of 791,790 cyber crime complaints from the American public with reported losses exceeding USD 4.1 billion, a 69% increase in total complaints and about a 20% increase in losses from 2019. From 2016 to 2020, IC3 received over two million complaints, reporting nearly USD 13.3 billion in total losses (Internet Crime Report 2020). Those complaints address a wide array of Internet scams affecting victims across the globe. Recently, IBM Security published the Cost of a Data Breach Report 2021 (IBM 2021), which analyzed 537 real data breaches across 17 countries and different industries. Data breaches refer to unauthorized access to and manipulation of exposed confidential data (information). The report shows a 10% increase in the average total cost of a breach incident from 2020 to 2021, a difference of USD 1.07 million, with remote work being a key factor in causing data breaches, and a 10.3% increase in the average per-record cost of a data breach over the same period. This increasing trend in breach frequency and average cost raises the importance of cyber insurance for businesses and organizations to protect themselves against data breach losses and liabilities. A recent industry survey by Rudolph (2022) indicates that cyber/network risk has been listed as number one or two among the top five notable emerging risks in its 2018–2021 surveys.
Cyber insurance is emerging as an important tool to protect organizations against future cyber breach losses and its institutional pillars are progressively evolving and reinforcing one another (
Kshetri 2020). By analyzing the U.S. cyber insurance market,
Xie et al. (
2020) find that professional surplus insurers and insurers with surplus insurer affiliation demonstrate a competitive advantage in cyber insurance participation. According to an NAIC report (
NAIC 2020), U.S.-domiciled insurers writing cyber coverage had USD 2.75 billion of direct written premium in 2020 (an increase of 21.7% and 35.7% from 2019 and 2018, respectively). The top 20 groups in the cyber insurance market reported an average direct loss ratio of 66.9%, up from 44.6% in 2019 and 35.3% in 2018. The report also points out that changes in cyber insurance loss ratios are driven not by premium growth but by growth in claim frequency and severity, implying the significance of cyber insurance policy design.
Cyber risk has become an increasingly important research topic in many disciplines. Recently,
Eling (
2020) presents a comprehensive review of the academic literature on cyber risk and cyber insurance in actuarial science and related business fields, including economics, finance, risk management, and insurance. Here, we briefly review recent research in the actuarial science literature on modeling and analyzing data breach related cyber risks.
Maillart and Sornette (
2010) reveal an explosive growth in data breach incidents up to July 2006 and a stable rate thereafter.
Wheatley et al. (
2016) focus on the so-called extreme risk of personal data breaches by detecting and modeling the maximum breach sizes and show that the rate of large breach events has been stable for the U.S. firms under their study.
Edwards et al. (
2016) find that the daily frequency of breaches can be well described by a negative binomial distribution.
Eling and Loperfido (
2017) implement frequency analyses on different levels of breach types and entities through multidimensional scaling and multiple factor analysis for contingency tables, while
Eling and Jung (
2018) extend the former work by implementing pair copula constructions (PCC) and Gaussian copulas to deal with the asymmetric dependence of monthly losses (total number of records breached) in two cross-sectional settings.
Fahrenwaldt et al. (
2018) develop a mathematical (network) model of insured losses incurred from infectious cyber threats and introduce a new polynomial approximation of claims together with a mean-field approach that allows computing aggregate expected losses and pricing cyber insurance products.
Jevtić and Lanchier (
2020) propose a structural model of aggregate cyber loss distribution for small and medium-sized enterprises under the assumption of a tree-based local area network (LAN) topology.
Schnell (
2020) shows that frequently used actuarial dependence models, such as copulas, and frequency distributions, such as the Poisson distribution, would underestimate the strength and non-linearity of the dependence.
The purpose of this paper is to provide predictive analytics of cyber incident frequency based on historical data, aiming to help insurance companies examine, price, and manage their cyber-related insurance risks. This analysis may also be used by organizations as a reference for balancing their prevention costs against premiums according to their entity types and locations. We make use of related factors from cyber breach data and perform Bayesian regression under a generalized linear mixed model (GLMM). The GLMM is one of the most useful structures in modern statistics, allowing many complications to be handled within the linear model framework (
McCulloch 2006). In the actuarial science literature,
Antonio and Beirlant (
2007) use GLMMs for the modeling of longitudinal data and discuss model estimation and inference under the Bayesian framework. Recently,
Jeong et al. (
2021) study the dependent frequency and severity model under the GLMM framework, where the aggregate loss is expressed as a product of the number of claims (frequency) and the average claim amount (severity) given the frequency. The GLMM has also been used in studying the credibility models; see, for example,
Antonio and Beirlant (
2007) and
Garrido and Zhou (
2009). Generally, a generalized regression model is used to describe the within-group heterogeneity of observations, and a sampling model is used to describe the group-specific regression parameters. A GLMM handles both by accommodating non-normally distributed responses, specifying a non-linear link function between the response mean and the regressors, and allowing group-specific correlations in the data.
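For concreteness, the general GLMM structure described in this paragraph can be written in the following standard form; the notation here is generic and not necessarily the specific parameterization adopted later in this paper:
$$g\big(\mathrm{E}[y_{ij}\mid \boldsymbol{b}_j]\big) = \boldsymbol{x}_{ij}^{\top}\boldsymbol{\beta} + \boldsymbol{z}_{ij}^{\top}\boldsymbol{b}_j, \qquad \boldsymbol{b}_j \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{D}),$$
where $y_{ij}$ is the response for observation $i$ in group $j$, $g$ is a (possibly non-linear) link function, $\boldsymbol{x}_{ij}$ and $\boldsymbol{z}_{ij}$ are the fixed- and random-effect design vectors, $\boldsymbol{\beta}$ denotes the fixed effects, and the group-specific random effects $\boldsymbol{b}_j$ with covariance matrix $\boldsymbol{D}$ induce within-group correlation.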
The dataset we examine in this paper is from Privacy Rights Clearinghouse (PRC) (
PRC 2019). It is primarily grant-supported and serves individuals in the United States. This repository keeps records of data breaches that expose individuals to identity theft, as well as breaches that qualify for disclosure under state laws. The chronology includes the type of breach, the type of organization, the name of the company and its physical location, the date of the incident, and the number of records breached. It is the largest and most extensive dataset that is publicly available and has been investigated by several research papers from various perspectives. Below are notable studies based on this dataset.
Edwards et al. (
2016) develop Bayesian generalized linear models to investigate trends in data breaches.
Eling and Loperfido (
2017) investigate this dataset under the statistical and actuarial framework; multidimensional scaling and goodness-of-fit tests are used to analyze the distribution of data breach information.
Eling and Jung (
2018) propose methods for modeling cross-sectional dependence of data breach losses; copula models are implemented to identify the dependence structure between monthly loss events (frequency and severity).
Carfora and Orlando (
2019) propose an estimation of value at risk (VaR) and tail value at risk (TVaR).
Xu et al. (
2018) model hacking breach incident inter-arrival times and breach sizes by stochastic processes and propose data-driven time series approaches to model the complex patterns exhibited by the data. Recently,
Farkas et al. (
2021) present a method for cyber claim analysis based on regression trees to identify criteria for claim classification and evaluation, and
Bessy-Roland et al. (
2021) propose a multivariate Hawkes framework for modeling and predicting cyber attacks frequency.
In this study, we propose a Bayesian negative binomial GLMM (NB-GLMM) for the quarterly cyber incidents recorded by PRC. Quarter-specific effects constitute one source of the random-effect variation captured by the quarterly hierarchical panel data, while regression on covariate predictors captures within-quarter heterogeneity. Moreover, GLMMs outperform the generalized linear model (GLM) by revealing features of the random effects distribution and allowing subject-specific predictions based on measured characteristics and observed values among different groups. While most studies on modeling cyber risk related dependencies in the literature are geared toward cross-sectional dependence using copulas (see, for example, Eling and Jung (2018) and Schnell (2020), and references therein), our approach models the dependence between the frequency and severity under the widely known generalized linear framework, which excels in interpreting the directional effect of features, along with the GLMM that deals with hierarchical effects and dependent variables using general design matrices (McCulloch and Searle 2004). The Bayesian approach and the Markov chain Monte Carlo (MCMC) method are utilized to obtain posterior distributions of the parameters of interest. Specifically, the hierarchical structure of our Bayesian NB-GLMM requires a Metropolis–Gibbs (M-G) sampling scheme for the regression mean related parameters and conditional maximum likelihood estimation of the dispersion parameter.
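As a rough illustration of the kind of Metropolis-within-Gibbs update involved, the following Python sketch shows a single random-walk Metropolis step for one quarter's coefficient vector under a negative binomial likelihood with log link. All names (`beta_j`, `X_j`, `y_j`, `theta`, `Sigma`, `r`) and the proposal scale are illustrative assumptions rather than the exact scheme of this paper; in the full sampler this step would alternate with Gibbs updates of the hyper-parameters and a conditional maximum likelihood update of the dispersion parameter.

```python
import numpy as np
from scipy.stats import nbinom, multivariate_normal

def nb_loglik(beta_j, X_j, y_j, r):
    """Negative binomial log-likelihood with log link: mu = exp(X_j @ beta_j)."""
    mu = np.exp(X_j @ beta_j)
    # scipy's nbinom uses (r, p) with mean r*(1-p)/p, so p = r / (r + mu)
    return nbinom.logpmf(y_j, r, r / (r + mu)).sum()

def metropolis_step(beta_j, X_j, y_j, r, theta, Sigma, prop_sd=0.05, rng=None):
    """One random-walk Metropolis update of a quarter-specific coefficient vector."""
    rng = rng or np.random.default_rng()
    proposal = beta_j + prop_sd * rng.standard_normal(beta_j.shape)
    # log posterior = NB log-likelihood + multivariate normal prior N(theta, Sigma)
    def log_post(b):
        return nb_loglik(b, X_j, y_j, r) + multivariate_normal.logpdf(b, theta, Sigma)
    log_accept = log_post(proposal) - log_post(beta_j)
    return proposal if np.log(rng.uniform()) < log_accept else beta_j
```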
The significant findings of our study are the following. (1) It is effective to use the NB-GLMM for analyzing the number of data breach incidents with uniquely identified risk factors such as the type of breach, type of organization, and location. (2) It is practical to include in our model the notable correlation detected between the number of cyber incidents and the average severity (the number of records breached), as well as the time trend effects on cyber incidents. (3) It is efficient to use sophisticated estimation techniques for our analysis, including the Bayesian approach, the MCMC method, Gibbs sampling, and the Metropolis–Hastings algorithm. (4) Using the frequency–severity technique, it is feasible to use our predictive results for pricing cyber insurance products with coverage modifications.
Our contributions to the related research areas can be described as follows. In modeling the dependence between the frequency and severity of cyber risks, we investigate the use of average severity as one of the subject-specific covariates in the GLMM regression process. Meanwhile, we model time trend effects as a group-specific factor in order to explain the change in data breach incidents over time. Besides examining fixed effects, we adopt the MCMC method to extract random effects on several explanatory variables. We estimate the parameters of the GLMM under the NB distribution with a non-constant scale parameter by combining maximum likelihood estimation with the MCMC method. We add to the existing literature the implementation of our proposed estimation procedure in the actuarial context, which may be of interest to researchers and practitioners in related fields.
The rest of this paper is structured as follows. In
Section 2, we introduce our database and present empirical data analysis.
Section 3 presents the NB-GLMM for our breach data and the parameter inference under the Bayesian framework.
Section 4 shows the MCMC implementation and inference of the posterior distributions of the parameters, followed by a simulation study and a cross-validation test against a testing dataset to assess model performance in
Section 5. Model applications in industry risk mitigation and premium calculations are discussed and illustrated in
Section 6. Finally, in
Section 7, we conclude and provide further discussion on aggregating total claim costs.
5. Simulation Study and Validation Test
We design a simulation study to verify the accuracy and effectiveness of the parameter estimation and the model predictability. The analysis presented in this section should provide support for the proposed NB-GLMM. The simulation model is established in accordance with the assumptions and design scheme of our analytical model. For demonstration purposes, this simulation study uses the same multivariate normal distribution estimated in
Section 4.2. Given the sets of coefficients from multivariate normal distribution, we can generate target variable counts from generalized linear relationships. True values of model parameters are taken from
Table 5 and
Appendix A. According to the hierarchical structure, we first draw 69 quarter-specific coefficient vectors from a 6-dimensional multivariate normal model with the mean vector and covariance matrix estimated previously; together with the posterior means of the remaining parameters, they constitute 69 sets of independent quarter coefficients. Multiplying the 69 sets of coefficients by the constructed covariates using (7) leads to 69 logarithmic means of the negative binomial distribution. Combining those mean parameters with the dispersion parameters we estimated previously, we generate 16 observations on uniquely identified combinations for each quarter, resulting in a total of 1104 observations. In this way we make sure that the simulated data follow the same patterns as the experimental data. The new dataset of 1104 observations is generated using the MCMC estimates obtained on the original dataset. Taking these observations as one dataset, we further generate 100 datasets following the same algorithm. The simulated datasets are then investigated under the same procedure as presented in
Section 3.2. The hyper-parameters are estimated using the MCMC and M-G methodologies, as well as maximum likelihood estimation under the Bayesian framework. Here, the MCMC analyses utilize the same prior distributions, and the starting values are the same as those obtained from the empirical estimation.
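A minimal sketch of this data-generating step is given below. The hyper-parameter values (`theta_hat`, `Sigma_hat`), the dispersion parameter (`r_hat`), and the 16-row covariate design (`X`) are placeholders; in the paper the true values are those taken from Table 5 and Appendix A.

```python
import numpy as np

rng = np.random.default_rng(2023)

n_quarters, n_cells, p = 69, 16, 6            # 69 quarters x 16 covariate combinations = 1104 obs
theta_hat = np.zeros(p)                       # placeholder posterior mean (Table 5 in the paper)
Sigma_hat = np.eye(p) * 0.1                   # placeholder posterior covariance (Appendix A)
r_hat = 2.0                                   # placeholder dispersion parameter
X = rng.binomial(1, 0.5, size=(n_cells, p)).astype(float)  # placeholder design matrix

def simulate_one_dataset():
    """Generate one simulated dataset of 69 x 16 = 1104 negative binomial counts."""
    # Draw 69 independent quarter-specific coefficient vectors from the 6-dim MVN
    betas = rng.multivariate_normal(theta_hat, Sigma_hat, size=n_quarters)
    mu = np.exp(X @ betas.T)                  # (16, 69) matrix of NB means via the log link
    p_nb = r_hat / (r_hat + mu)               # numpy NB parameterization: mean = r*(1-p)/p
    return rng.negative_binomial(r_hat, p_nb) # 16 counts per quarter, 1104 in total

datasets = [simulate_one_dataset() for _ in range(100)]   # 100 replicated datasets
```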
The estimated posterior means of coefficient parameters and the relative differences (errors) between the true and estimated values obtained under our modeling and estimation procedures are displayed in
Table 6, where the relative error is calculated by dividing the difference between the estimated value and its corresponding true value by the true value (used for the simulation). As seen from Table 6, the differences between the true values and the estimated posterior means, illustrated by the relative errors, are all relatively small, implying that the estimated posteriors are centered compactly around their true values. In addition, for all the estimated results from our simulation study, the true values fall within the corresponding confidence intervals. All of this implies that our estimation algorithm is effective and that the estimation results are satisfactory in terms of accuracy.
To examine the model predictability and its accuracy under our GLMM settings, we employ a 5-fold cross-validation procedure to obtain an objective evaluation of the prediction performance. Cross-validation was first applied when evaluating the use of a linear regression equation for predicting a criterion variable (
Mosier 1951). It provides a more realistic estimate of the model generalization error by repeating cross-validations on the same dataset with large calibration/training samples and small validation/test samples. In particular, we randomly divide the dataset into five folds 10 times; four folds are used to train the GLMM and the remaining one is used to compare the predicted values with the actual ones. The performance on the test datasets should be similar to that on the training datasets. Our purpose in conducting cross-validation is to ensure that our model has not over-fitted the training dataset and that it performs well on the test dataset. To verify our GLMM's prediction accuracy, we also fit Poisson and NB regression models to the training dataset. The root mean squared error (RMSE) metric is taken as the summary fit statistic, which quantifies how well our GLMM fits the dataset. A relatively low RMSE indicates that our proposed GLMM is well calibrated. RMSE values are calculated by
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$
where $n$ is the number of tested observations, $y_i$ is the $i$th actual target value, and $\hat{y}_i$ is the $i$th predicted value based on the trained model.
Table 7 gives summary fit statistics for the Poisson regression, NB regression, and NB-GLMM on the training and test datasets. We first compare the training-set RMSEs for model accuracy: the prediction accuracy of the three models is compared on the same training set as measured by RMSE, and the lowest training RMSE of the GLMM implies that it has the highest prediction accuracy. We then compare the GLMM RMSEs between the training set and the test set to check for over-fitting. According to our cross-validation results, the training-set mean RMSE reported in Table 7 can be interpreted as the average deviation between the 69 predicted quarterly counts and the actual quarterly ones.
The RMSE of the test dataset is close to that of the training dataset, which means that our model is not over-fitted. A somewhat higher RMSE on the test dataset is expected, since the model is built from the training dataset. Given that the two RMSEs differ little, there is no evidence that our GLMM is over-fitted. These two relatively low RMSE values also show that our GLMM achieves the best accuracy for frequency count prediction among the tested models.
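The following sketch illustrates the kind of cross-validated RMSE comparison summarized in Table 7 for the two benchmark GLMs. The design matrix, response, and NB dispersion value are placeholders; the GLMM column of Table 7 would instead come from posterior predictive means of the fitted NB-GLMM.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import KFold

def cv_rmse(X, y, family, n_splits=5, seed=0):
    """Mean training/test RMSE of a GLM over a K-fold split."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    train_rmse, test_rmse = [], []
    for train_idx, test_idx in kf.split(X):
        res = sm.GLM(y[train_idx], X[train_idx], family=family).fit()
        train_rmse.append(np.sqrt(np.mean((y[train_idx] - res.predict(X[train_idx])) ** 2)))
        test_rmse.append(np.sqrt(np.mean((y[test_idx] - res.predict(X[test_idx])) ** 2)))
    return np.mean(train_rmse), np.mean(test_rmse)

# X: design matrix with intercept, y: quarterly breach counts (placeholders here)
rng = np.random.default_rng(1)
X = sm.add_constant(rng.binomial(1, 0.5, size=(1104, 6)).astype(float))
y = rng.poisson(3.0, size=1104).astype(float)

print("Poisson GLM (train, test RMSE):", cv_rmse(X, y, sm.families.Poisson()))
print("NB GLM      (train, test RMSE):", cv_rmse(X, y, sm.families.NegativeBinomial(alpha=1.0)))
```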
6. Practical Implications
In this section, we discuss the potential applications and practical implications of our modeling results in cyber risk mitigation and management. We have proposed an NB-GLMM with group-specific fixed effects and among-group random effects on several featured variables, including the type of breach, the type of organization and its geographical location, and the associated average severity caused by data breaches under these uniquely identified features. We also consider the impact of the time trend on breach frequencies. In general, this study can increase awareness of the importance of analyzing the growth trends of cyber incident frequency among sub-characteristic groups. We discuss below the impact of our modeling and predictive analytic approaches in relation to cyber risks from both the perspective of the organization (potential insured) and the insurance company (insurer), as well as other important stakeholders such as corporate information technology (IT) and data security officers, and data scientists.
From the perspective of organizations, our results provide quantitative insights for organizations with different entity types and locations, encouraging firms to adopt new techniques and technologies for managing the cyber-related risks they face.
Gordon and Loeb (
2002) present an economic model that can be used to determine the optimal amount to invest to protect a given set of information. The model takes into consideration the vulnerability of the information to a security breach and the potential loss it may cause. Given a company's physical and geographical characteristics, our GLMM is able to predict its estimated quarterly data breach frequency so that the firm can decide whether to retain the risk or to seek risk transfer in order to mitigate it.
Mazzoccoli and Naldi (
2020) propose an integrated cyber risk management strategy that combines insurance and security investments and investigate whether it can be used to reduce overall security expenses. The optimal investment for their proposed mixed strategy is derived under several insurance policies. This type of risk management strategy could also consider the risk over a specified time horizon; our model can provide effective predictive guidance for managing cyber risks with respect to data breach incidents occurring within a quarterly time interval. Organizations could act on our findings when they put cyber risk management into practice.
In some cases, managing cyber risks through internal controls would be impractical or too costly, especially when organizations face a high frequency of breach incidents. Consequently, organizations may seek insurance coverage as an alternative means to transfer their cyber-related risks. Reducing cyber risk exposures by purchasing insurance also has the advantage of reducing the capital that must be allocated to cyber risk management. In general, cyber insurance combined with adequate security system investments should allow organizations to better manage their cyber-related risks.
Young et al. (
2016) present a framework that incorporates insurance industry operating principles to support quantitative estimates of cyber-related risk and the implementation of mitigation strategies.
From the perspective of insurance companies, besides the incentives for organizations to increase cyber insurance purchases, our results also encourage insurers to consider how much premium to collect, since they expect to be paid adequately for accepting the risk. Current pricing of cyber insurance is based on expert models rather than on historical data. An empirical approach to identifying and evaluating potential exposure measures is important but challenging due to the current scarcity of reliable, representative, and publicly available loss experience for cyber insurance. This paper circumvents this limitation by illustrating how to utilize the available full exposure data to obtain a quantitative idea of cyber premium pricing. We present a methodology to rigorously classify insureds into different risk levels. Our modeling results can ease one of the problems that cyber risk insurers face, the disparity in premiums across different characteristic groups, by forecasting loss frequency for different characteristic segmentations. Geographical area is one of the most well-established and widely used rating variables, whereas business type is considered one of the primary drivers of cyber claims experience.
Ideally, the cyber insurance rating system should consider various rate components, such as business type and geographic location in our model, when calculating the overall premium charged for cyber risks. The portion of the total premium that varies by risk characteristics, shown as a function of the base rate and rate differentials, is referred to as a variable premium (
Werner and Modlin 2010). Our work can be directly applied in setting variable premium factors by using the posterior frequency distributions for different risk characteristic segments. The premium $P$ under the standard deviation premium principle (Tse 2009) for pricing the variable premium, for example, is given by
$$P = \mathrm{E}[S] + \alpha\,\sqrt{\mathrm{Var}(S)},$$
where $S$ is the aggregate total loss and $\alpha$ is the loading factor. To calculate the premium $P$ in this case, the first two moments of the distribution of $S$ need to be determined. We use a quarter as our investigation window, which is the same as our NB-GLMM frequency time interval. The severity portion that we use to calculate the aggregate quarterly loss is based on the latest three years of quarterly average loss amounts (number of records breached), for the purpose of simplicity. Using the posterior frequency distributions on the characteristic segments obtained in
Section 4.2, we generate a set of 16 aggregate loss distributions in total, one for each level combination. By using the frequency–severity technique, the aggregate quarterly loss distribution of $S$ can be obtained. We then apply the log–log model proposed by Jacobs (2014) (also used in Eling and Loperfido (2017) to estimate prices for cyber insurance policies) to convert the number of records breached into its corresponding dollar loss amount.
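To make the workflow concrete, the following sketch assembles a simulated aggregate quarterly loss distribution for one characteristic segment and applies the standard deviation premium principle. The frequency draws, the average severity per incident, the log–log conversion coefficients, and the loading factor are all placeholder assumptions for illustration; in practice the frequency draws come from the posterior distributions of Section 4.2 and the conversion coefficients from the fitted log–log model.

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder posterior-predictive frequency draws for one segment (quarterly incident counts)
freq_draws = rng.negative_binomial(2.0, 0.4, size=10_000)

avg_records_per_incident = 50_000.0   # placeholder: latest three-year quarterly average severity

def records_to_dollars(records, a=7.5, b=0.75):
    """Placeholder log-log conversion: ln(cost) = a + b * ln(records); coefficients are illustrative."""
    records = np.maximum(records, 1.0)            # guard against log of zero
    return np.exp(a + b * np.log(records))

# Frequency-severity technique: aggregate quarterly records, then convert to dollar losses
agg_records = freq_draws * avg_records_per_incident
S = records_to_dollars(agg_records)               # simulated aggregate quarterly loss in USD

# Standard deviation premium principle with a placeholder loading factor
alpha = 0.25
premium = S.mean() + alpha * S.std()
print(f"E[S] = {S.mean():,.0f}, sd(S) = {S.std():,.0f}, premium = {premium:,.0f}")
```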
Let $Y^L$ be the insurance payment per loss with policy limit $u$ and deductible $d$; that is, $Y^L = \min\{(X-d)_+,\, u\}$ for a ground-up loss $X$. The $k$th moment of $Y^L$ can be calculated by
$$\mathrm{E}\big[(Y^L)^k\big] = \int_d^{d+u} (x-d)^k f_X(x)\,dx + u^k\,\big[1-F_X(d+u)\big]$$
(Klugman et al. 2012). The mean and variance of $Y^L$, $\mathrm{E}[Y^L]$ and $\mathrm{Var}(Y^L)$, can be determined by (8) using bootstrap samples from the set of posterior distributions of the coefficients.
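As a quick numerical illustration of applying a deductible and policy limit, the following sketch computes the first two moments of the per-loss payment by Monte Carlo under the convention $Y^L = \min\{(X-d)_+, u\}$, with a placeholder lognormal ground-up loss; in the paper these moments are instead obtained from (8) with bootstrap samples of the posterior coefficients.

```python
import numpy as np

rng = np.random.default_rng(11)

d, u = 10_000.0, 1_000_000.0                            # deductible and policy limit (USD)
X = rng.lognormal(mean=11.0, sigma=2.0, size=500_000)   # placeholder ground-up losses

# Per-loss payment: apply the deductible, then cap the payment at the policy limit
Y = np.minimum(np.maximum(X - d, 0.0), u)

mean_Y = Y.mean()                                       # E[Y^L]
var_Y = Y.var()                                         # Var(Y^L) = E[(Y^L)^2] - (E[Y^L])^2
print(f"E[Y^L] = {mean_Y:,.0f}, Var(Y^L) = {var_Y:,.3e}")
```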
Table 8 lists predictions of the next 3 quarters' aggregate monetary loss for internal breach types in two representative geographical regions, Northeast and West, with or without a deductible (of USD 10,000) and/or a policy limit (of USD 1 million).
Based on these results, we have several interesting findings from different perspectives. First, there is a significant difference in loss amounts between the Northeast and West regions: the estimated loss amount for the Northeast region ranges from USD 197,891 to USD 2,283,023, whereas that for the West ranges from USD 1,408,541 to USD 14,661,661. Second, non-business organizations face much higher cyber risks than business organizations, as their estimated losses differ by more than a factor of 10. Furthermore, having a deductible makes little difference in covered cyber losses, as nearly the same estimated loss amounts are observed with and without a deductible of USD 10,000. Finally, setting a maximum coverage amount reduces covered cyber losses far more for non-business organizations than for business organizations. These insights are worth considering when setting premium rates and designing insurance products in order to cover a limited amount of risk with a sufficient amount of premium. These quantitative insights provide relative rate differential information for setting adjusted manual rates in premium pricing. Insurance companies are able to maintain higher solvency under differentiated pricing than under non-differentiated pricing (
Pal et al. 2017).
In addition to providing a better idea of how to define risk classes, this paper illustrates how to work with currently available data and update the model components and parameters as cyber-related data are collected over time. Our model decomposes the risk effects on cyber breach frequencies into fixed effects and random effects based on classified characteristics, average severity, and non-linear time trend effects. Bayesian statistics are particularly useful for simulating from the posterior distribution of the number of incidents (claims) in a future quarterly time period given the risk characteristics. Due to the nature of the Bayesian methodology, some of the assumptions, such as the polynomial time trend, and some parameter choices may be updated in the future once suitable data become available. Moreover, individual features of the model can be refined or replaced to incorporate properties of given internal datasets without changing the overall model structure. Such updates and modifications enable our model to serve as a precise predictor of data breach frequencies.
7. Conclusions
This paper develops a statistical model for cyber breach frequencies that considers not only characteristics such as risk profile, location, and industry, but also average loss sizes and time effects. It provides an effective and comprehensive modeling approach for predictive analytics due to its consideration of dependent and correlated risk aspects. We believe that our study makes an important and novel contribution to the actuarial literature in that our NB-GLMM for cyber breach frequencies jointly considers risk category, company census, severity dependence, and time trend effects in quantifying and predicting the quarterly number of data breach incidents, a fundamental quantity for appropriately setting manual rates.
The study of cyber risks is important for insurance companies in mitigating and managing their risks, given that the functioning of the insurance business is a complex process. In this view, our study is of practical value for insurance companies, since considering the most dangerous risks for each business entity allows relevant information security measures to be formed for the company. Enterprises need to take several measures in dealing with cyber risks: basing operations on statistical modeling in the actuarial analysis process, ensuring the balance and adequacy of tariffs in the pricing process, and adjusting premium rates in insurance marketing. Our research results can be used as a differential indicator across organization types and geographical locations. In addition, our study can also be useful for data security officers, data scientists, and other potential corporate stakeholders to better understand the impact of cyber risks on business operations.
Another important aspect of this study is the use of the publicly available PRC data to develop actuarial approaches for quantifying cyber loss frequencies. However, the quality of the available data, and whether the data represent cyber risks well in general, also constitute a limitation of this paper. The fact that firms do not reveal details concerning security breaches reduces data accuracy, and the lack of voluntary reporting of cyber breaches leads to data inadequacy. Moreover, Privacy Rights Clearinghouse stopped updating its chronology of breach incidents in 2019, which causes data inconsistency in the time trend. The availability of high-quality data, such as policy or claim databases, in the future would open up new research opportunities. Our model is flexible and can be modified to accommodate the features of a new dataset and the purpose of prediction.
Despite these limitations, the proposed NB-GLMM makes a notable methodological contribution to the cyber insurance area, as it provides a theoretically sound modeling perspective on frequency quantification and a practical statistical framework that practitioners can customize and update based on their predictive needs. In the next step of our research, we will analyze zero-inflated, heavy-tailed severity (the number of records breached due to breach incidents and the corresponding monetary losses incurred) using finite mixture models and extend the analysis using extreme value theory. Together with the GLMM frequency predictive model, we can simulate aggregate full insurance losses for given characteristics. Moreover, we will use a numerical approach to test the predicted overall aggregate claim amounts under different factor combinations in any projection period in order to characterize premiums. For instance, pure technical insurance premiums can be expressed as a VaR or TVaR metric and computed from the loss distribution of each risk category. Lastly, this two-part severity–frequency actuarial quantification method seeks to overcome some of the above-mentioned data limitations, such as inadequacy and inconsistency.