Alleviating Class Imbalance in Actuarial Applications Using Generative Adversarial Networks

Abstract: To build adequate predictive models, a substantial amount of data is desirable. However, when expanding to new or unexplored territories, this required level of information is rarely available. To build such models, actuaries often have to: procure data from local providers, use limited unsuitable industry and public research, or rely on extrapolations from other better-known markets. Another common pathology when applying machine learning techniques in actuarial domains is the prevalence of imbalanced classes, where risk events of interest, such as mortality and fraud, are under-represented in data. In this work, we show how an implicit model using the Generative Adversarial Network (GAN) can alleviate these problems through the generation of adequate-quality data from very limited or highly imbalanced samples. We provide an introduction to GANs and how they are used to synthesize data that accurately enhance the data resolution of very infrequent events and improve model robustness. Overall, we show a significant superiority of GANs for boosting predictive models when compared to competing approaches on benchmark data sets. This work offers numerous contributions to actuaries with applications to, inter alia, new sample creation, data augmentation, boosting predictive models, anomaly detection, and missing data imputation.


Background
Gaining an advantage in competitive markets through offering suitably tailored products to customers relies on building and maintaining adequate predictive models. To build these models, a substantial amount of data and a sizeable number of records are desirable. However, when expanding to new or unexplored markets, that level of information is rarely available. To build such models, actuarial firms often have to procure data from local providers, use limited unsuitable industry and public research, or rely on extrapolations from other better-known markets.
In this work, we show how an implicit model using the Generative Adversarial Network (GAN) Goodfellow et al. (2014) can alleviate this problem through the generation of adequate-quality data, even from very limited samples or difficult domains, thus handling class imbalance.
A GAN is an example of a generative model that is used to create new samples from a latent noise space. A generative model describes how a data set is generated in terms of a probabilistic model. This generative model p_model mimics the training data distribution p_data as closely as possible. If this is achieved, then we can sample from p_model to generate realistic samples that appear to have been drawn from p_data. We are satisfied if our model can also generate diverse samples that are suitably different from the training data. In some cases, the model can be estimated explicitly, and sometimes it can generate samples implicitly. Other models are capable of doing both.
GANs were proposed in a seminal paper by Goodfellow et al. (2014). GANs are useful for learning the structure of the data and can generate new samples without explicitly postulating the model Goodfellow et al. (2014). They are known to be better than other generative models due to the quality of samples they generate. GANs have been highly successful in computer vision Brock et al. (2018); Karras et al. (2019); Zhu et al. (2017); Vondrick et al. (2016), music generation Yang et al. (2017), text generation Reed et al. (2016), missing data imputation Li et al. (2019); Shang et al. (2017); Yoon et al. (2018), time series generation Esteban et al. (2017); Yoon et al. (2019); Fu et al. (2019), and data augmentation Antoniou et al. (2017); Fiore et al. (2019); Mariani et al. (2018); Mottini et al. (2018); Park et al. (2018), with remarkable results, but their application to the actuarial discipline remains largely unexplored.
In this work, we discuss how and where GANs can be used by actuaries. Additionally, we provide an experiment showing how GANs can be used to boost imbalanced samples in actuarial data sets and improve actuarial models.

Aims and Objectives
In this paper, we explain what GANs are and how they can be used to synthesize data in order to accurately enhance very infrequent events, alleviate class imbalance and create better prediction models. This work also provides theoretical and practical applications of GANs.
We apply a popular GAN architecture to a typical problem resembling an actuarial use case on benchmark data sets using Python Python Software Foundation (2017). Overall, we show a significant superiority of GANs for predictive models and stochastic simulations compared to current approaches. Specifically, this paper covers the following aims and objectives:
• a deep overview of generative models and why GANs produce better-quality samples than other generative models;
• an overview of GANs with practical applications in a number of areas, with emphasis on actuarial use; and
• a practical example of a popular GAN used for alleviating class imbalance, data augmentation, and improving predictive models.

Contribution
This work provides thorough theoretical, empirical and practical applications of GANs, with possible leverage in actuarial science for inter alia new sample creation, data augmentation, boosting actuarial models, anomaly detection, missing data imputation, time series simulations and projections in life insurance, short-term insurance, health and care, banking, investment, enterprise risk management, and other non-traditional actuarial areas, such as telecommunications, economics, medicine, engineering, and other wider fields.
For example, actuaries build pricing models in order to determine competitive premiums that customers should pay to be provided adequate insurance coverage O'Malley et al. (2005). These pricing models are dependent on risk events, such as mortality, morbidity, and lapse, which need to be estimated using an adequate and accurate model. However, these risk events are often under-represented in data.
In this work, we show how a GAN could be used to alleviate this problem through the generation of adequate quality data from limited or highly imbalanced samples. Essentially, we show that synthetic data generated using GANs can augment imbalanced data sets, leading to significantly higher predictive power of possible actuarial models fitted after.

Structure of the Paper
The rest of the paper is organized as follows. Section 2 describes the problem of class imbalance and its common solutions. Section 3 reviews the literature on generative models, with particular emphasis on GANs, while Section 4 covers GAN applications, especially for actuarial adoption. Section 5 describes the methodology followed. Section 6 outlines the example experiments conducted. Section 7 presents the results, and Section 8 discusses the results, while Section 9 gives conclusions, limitations, and possible future work.

Definition
Whilst machine learning (ML) has gained significant prevalence in the past few decades, class imbalance, limited data sets, and missing data remain pervasive problems Chawla et al. (2002); Fernández et al. (2018); Longadge and Dongre (2013). These issues occur due to the nature of the data space, data collection costs, data limitations, new markets, and absolute rarity.
In binary classification problems, class imbalance occurs when one of the classes has overwhelmingly more instances than the other. ML classifiers tend to have accuracy skewed towards the majority class when the data are imbalanced Chawla (2009); Fernández et al. (2018). This is problematic, as misclassifying a minority-class instance can incur significantly higher misclassification costs than misclassifying a majority-class instance Chawla et al. (2002). Class imbalance arises because ML classifiers do not necessarily take unequal class distributions into account. This problem causes significant and unexpected performance degradation for most classifiers.

Techniques to Alleviate Class Imbalance
Techniques as shown in Figure 1 exist to alleviate class imbalance, and these techniques include re-sampling, algorithmic-level solutions, cost-sensitive learning, ensembles, and generative models.

Re-Sampling
Re-sampling techniques modify the training data so that the distribution of the classes is evenly balanced, by either under-sampling the majority class or over-sampling the minority class. Over-sampling has been used more frequently than under-sampling, since under-sampling eliminates important information in the majority class.
Hybrid sampling techniques combine over-sampling with data cleaning techniques, informed under-sampling techniques, or greedy-filtering approaches Batista et al. (2004), thereby eliminating redundant and noisy instances, boosting the predictive accuracy of models trained after.
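As a minimal illustration of the re-sampling idea above, the following sketch balances a binary data set by randomly duplicating minority-class rows (the function name and toy data are our own, for illustration only):

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority-class rows (sampled with replacement) until
    both classes have the same number of instances."""
    rng = random.Random(seed)
    minority = [(x, t) for x, t in zip(X, y) if t == minority_label]
    majority = [(x, t) for x, t in zip(X, y) if t != minority_label]
    # draw extra minority copies with replacement to match the majority count
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    Xb, yb = zip(*balanced)
    return list(Xb), list(yb)

# 9 majority (label 0) rows vs 3 minority (label 1) rows
X = [[i] for i in range(12)]
y = [0] * 9 + [1] * 3
Xb, yb = random_oversample(X, y, minority_label=1)
```

After re-sampling, both classes contribute 9 rows, but note that the duplicated rows carry no new information, which motivates the synthetic approaches below.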

Synthetic Sampling
A pioneering and popular method to alleviate class imbalance has been the Synthetic Minority Over-sampling TEchnique (SMOTE) Chawla et al. (2002). However, SMOTE suffers from over-fitting, over-lapping classes, and noisy examples; it does not rely on the true probability distribution and alters the original distribution of the minority classes, which may not be desirable Batista et al. (2004); Ganganwar (2012); Longadge and Dongre (2013).
There have been few empirical reviews which compare and synthesize SMOTE and its density-based variants Gao et al. (2014). There have been few approaches which create synthetic samples by sampling implicitly from the minority class distribution. Current density-based approaches may be subjective as they need to pre-specify the format and structure of the minority class distribution Das et al. (2015); Zhang and Li (2014). Generative models offer a significant alternative, yet these models have not been thoroughly explored in imbalanced learning. In this work, we show how a popular implicit generative model can be used to handle class imbalance and rival SMOTE.
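The core SMOTE mechanism can be sketched in a few lines: for each synthetic case, pick a minority instance, choose one of its k nearest minority neighbours, and interpolate between the two. This is a simplified sketch of the heuristic, not the full algorithm of Chawla et al. (2002):

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Minimal SMOTE sketch: interpolate between a minority point and
    one of its k nearest minority neighbours at a random position."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist2(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between x and nn
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nn)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority)
```

Because every synthetic point lies on a line segment between two existing minority points, SMOTE can only fill in the convex hull of the minority class; it cannot learn the underlying distribution, which is exactly the gap generative models address.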

Ensembles
Ensembling increases a classifier's accuracy by training on different over-sampled data sets or with different algorithms and combining the outputs into a single outcome. These approaches tend to improve the results of re-sampling techniques Wang and Yao (2009); Chawla et al. (2003). However, they can take a long time to compute and still do not address the true data distribution.

Other Methods
Algorithmic-level solutions modify the ML classifier to adjust for the presence of class imbalance in the data. Cost-sensitive learning incorporates mis-classification costs in the evaluation metric Ganganwar (2012). This approach is more computationally efficient than data-level solutions He and Garcia (2008). However, mis-classification costs are often unknown and difficult to set, making this method less popular than sampling techniques López et al. (2013). SMOTE and its variants remain the most studied and widely used solutions, with generative models slowly being adopted in alleviating class imbalance Fernández et al. (2018); Fiore et al. (2019); Gao et al. (2014). Generative models are described in detail in Section 3.
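The cost-sensitive idea can be made concrete with a small sketch. Here a missed minority case (false negative) is penalised ten times more than a false alarm; the 10:1 ratio is an illustrative assumption of ours, and choosing it in practice is exactly the difficulty noted above:

```python
def expected_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Average mis-classification cost: false negatives (missed minority
    cases) cost cost_fn, false positives cost cost_fp."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += cost_fn
        elif t == 0 and p == 1:
            cost += cost_fp
    return cost / len(y_true)

# A classifier that always predicts the majority class looks accurate
# (90% here) yet incurs a high expected cost on the rare class.
y_true = [0] * 9 + [1]
naive = [0] * 10
cost = expected_cost(y_true, naive)  # one false negative: 10 / 10 = 1.0
```

This is why plain accuracy is a misleading metric under class imbalance: the cost-weighted view exposes the naive majority classifier that accuracy rewards.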

Generative Models
This research is concerned with handling class imbalance through generative modeling. Other approaches exist, such as synthetic sampling; however, these approaches do not take into account the underlying structure of the data distribution and often lead to over-fitting and over-lapping cases Gao et al. (2014). Generative models are flexible models capable of learning the data distribution and sampling from this data distribution, thereby creating new synthetic cases. In this section, we review generative models and explain why GANs are of better quality than other deep generative models.

Definition
Given a data set with observations X, we assume that X has been generated from an unknown true probability distribution p_data. A generative model p_model mimics p_data as closely as possible. If this is achieved, then we can sample from p_model to generate realistic samples that appear to have been drawn from p_data.
We are satisfied if our model can also generate diverse samples that are suitably different from X. In some cases, the model can be estimated explicitly, and sometimes it can generate samples implicitly. Other models are capable of doing both. GANs provide no explicit estimate of the density but are capable of generating new data from it. Goodfellow (2016) provides a taxonomy of common deep generative models, shown in Table 1, divided into implicit and explicit models. GANs are designed to remedy most of the disadvantages that come with explicit models and other Markov chain models.

Explicit Models
Explicit models specify or approximate a parameterized log-likelihood representation of the data Goodfellow et al. (2014). Parameters are then estimated and learned from the data; this requires maximum likelihood estimation, which integrates over the entire data space and may be intractable Li et al. (2015). The available approximation techniques may not always yield the best results, as some of them rely on Markov chains, which are time-consuming Goodfellow et al. (2014).

FVBNs
An FVBN decomposes the probability density of the training data p_model(x) into a product of one-dimensional conditional probability distributions. This model outputs a probability for each possible value if x is discrete and outputs the parameters of a simple distribution if x is continuous. Using the fitted model, sampling is done one step at a time, conditioned on all previous steps Goodfellow et al. (2014).
The problem with these models is their computational complexity, as they need to generate one point at a time. Other problems include poor learned representations, over-emphasizing details over global structure, and not closely reflecting the true generation process Goodfellow (2016). Moreover, these models have been more useful for image synthesis than for structured data sets, such as tabular data Van den Oord et al. (2016). GANs, by contrast, generate new samples in parallel, yielding greater speed of generation Goodfellow (2016); Li et al. (2015).

Non-Linear ICA
Non-linear ICA defines a continuous non-linear transformation between the data space and a lower-dimensional latent space. The data distribution p_model is obtained by transforming a latent variable z with tractable density p_z through a transformation g. The challenge in ICA is finding tractable latent distributions and suitable transformations, and these are limited Goodfellow et al. (2016). GANs are known to have fewer restrictions than these models Goodfellow et al. (2014); Bengio et al. (2014); Goodfellow (2016).

Variational Autoencoders
VAEs, along with FVBNs and GANs, are three of the most popular approaches for sample generation. VAEs are an extension of AEs Bellinger et al. (2016); Larsen et al. (2015); Rezende et al. (2014). An AE learns useful representations of the data by encoding X into a compressed latent space z using q(z|x) and then decoding z back into X using p(x|z), minimizing the reconstruction error between the original data and the reconstructed data Bellinger et al. (2016). A VAE maximizes the following evidence lower bound:

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) ‖ p(z)). (1)

Unlike auto-regressive models, VAEs are normally easy to run in parallel during training and inference Goodfellow et al. (2016); Larsen et al. (2015); Rezende et al. (2014). Conversely, they are normally harder to optimize than auto-regressive models Goodfellow et al. (2016); Makhzani et al. (2015). The encoder converts the input to latent space representations through the mean and variance, and samples can be created from the learned representation. VAEs have been criticized for generating blurry samples and having intractable likelihoods Goodfellow et al. (2016); Salimans et al. (2016).

Boltzmann Machines
Boltzmann machines rely on the use of Markov chains to model p_model(x) and to sample from it Ackley et al. (1985); Hinton (2002); Salakhutdinov and Hinton (2009). A Markov chain is a process that is used to generate samples by repeatedly drawing from a transition operator Geyer (1992). A Boltzmann machine is an energy-based model defined as

p_model(x) = exp(−E(x)) / Z,

where E(x) is an energy function, and Z is a normalizing factor ensuring that p_model(x) sums to one Ackley et al. (1985); Goodfellow et al. (2016). These methods include restricted Boltzmann machines (RBMs) Ackley et al. (1985) and deep belief networks (DBNs) Hinton and Salakhutdinov (2006). DBNs and RBMs are generative stochastic neural networks that can estimate a probability distribution Ackley et al. (1985). Samples are obtained through MCMC runs to convergence, and this can be very expensive Li et al. (2015). These models pioneered deep generative modeling around 2006, but they are now rarely used because of poor scalability to higher-dimensional problems and high computational costs Goodfellow et al. (2016).
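On a tiny discrete state space the Boltzmann density above can be computed exactly, which makes the role of Z concrete. The toy energy function below is our own illustration; in realistic models the sum defining Z is intractable, which is precisely why MCMC sampling is needed:

```python
import itertools
import math

def boltzmann_probs(energy, states):
    """Exact Boltzmann distribution p(x) = exp(-E(x)) / Z over an
    enumerable state space (feasible only because the space is tiny)."""
    unnorm = {s: math.exp(-energy(s)) for s in states}
    Z = sum(unnorm.values())  # the normalizing factor
    return {s: u / Z for s, u in unnorm.items()}

# Toy energy: number of mismatched neighbouring bits in a 3-bit state,
# so uniform states (000, 111) have the lowest energy.
states = list(itertools.product([0, 1], repeat=3))
energy = lambda s: sum(abs(a - b) for a, b in zip(s, s[1:]))
p = boltzmann_probs(energy, states)
```

Low-energy states receive the highest probability, and the probabilities sum to one by construction; doubling the number of bits squares the size of the sum for Z, showing how quickly exact normalization becomes infeasible.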

Implicit Models
Implicit models learn to model the true distribution and define a stochastic procedure to directly generate new data from a latent space. These models can be trained without needing an explicit density function to be learned or defined. Some of these models, such as the Generative Stochastic Network (GSN) Bengio et al. (2014), involve MCMC methods, which impose greater computational cost and often fail to scale to higher-dimensional spaces Goodfellow et al. (2016). GANs Goodfellow et al. (2014) and Generative Moment Matching Networks (GMMNs) Li et al. (2015) are among the few implicit probabilistic models capable of sampling in parallel and in a single step.
GANs were designed to remedy most of the issues with explicit and some implicit models. GANs are known to parallelize sample generation, have fewer restrictions, subjectively thought to produce better samples, make no use of Markov chain or maximum likelihood estimation, and assume no variational bounds and no distributional assumptions Arjovsky et al. (2017); Goodfellow et al. (2016). As a result, GANs have generated a lot of interest since 2014, with wide applications in many areas.

GANs
GANs were originally invented in a landmark paper by Ian Goodfellow in 2014 Goodfellow et al. (2014). The setup of the framework uses an adversarial process to estimate the parameters of two artificial neural network (ANN) Rumelhart et al. (1986) models by iteratively and concomitantly training a discriminator (D) and a generator (G), as shown in Figure 2.
Through multiple cycles of generation and discrimination, both networks train each other, while simultaneously trying to outwit each other Goodfellow et al. (2014); Mariani et al. (2018); Odena et al. (2017); Zhu et al. (2017). GANs have two adversarial ANNs:
• G picks z from the prior latent space Z and then generates samples from this distribution using an ANN;
• D receives generated samples from G and the true data examples, and it must distinguish between the two for authenticity.
Both D and G are ANNs which play a zero-sum game, where G learns to produce realistic-looking samples and D learns to get better at discriminating between the generated samples and the true data. Once G is trained to optimality, it can create new samples and augment the training data set. GANs can sample in parallel better than other generative models, have fewer restrictions on the generator function, make no use of Markov chains, assume no variational bounds, unlike VAEs, and produce subjectively better quality samples than other generative models Arjovsky et al. (2017).
Whilst GANs are gaining popularity in many applications, they have notable issues. GANs are notoriously difficult to train and to evaluate, their likelihood cannot easily be computed, and they suffer from the vanishing gradient problem, mode collapse, boundary distortion, and over-fitting. Mode collapse occurs when many latent noise values z are mapped to the same data point x, leading to a lack of diversity in the samples that are created, i.e., under-fitting. The vanishing gradient problem occurs when D becomes perfect during training without giving G the chance to improve. As a result, GANs may fail to converge, thereby leading to poor generated samples Arjovsky et al. (2017).
For GAN reviews, Creswell et al. (2018); Hitawala (2018); Hong et al. (2019) provide comparative overviews. Lucic et al. (2018) conduct an in-depth study on GANs and note no significant performance differences among the GANs studied. There are over 300 GAN variants, and it is impossible to review all of them. In this work, we are interested in exploring GAN applications and showing their potential to actuaries, especially for alleviating class imbalance, data augmentation, and improving the predictive ability of actuarial models.
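The zero-sum game played by D and G corresponds to the value function introduced by Goodfellow et al. (2014):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Here D maximizes V by assigning high probability D(x) to real data and low probability D(G(z)) to generated samples, while G minimizes V by producing samples that D cannot distinguish from draws of p_data; at the game's optimum, p_model matches p_data and D outputs 1/2 everywhere.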

GMMNs
GMMNs minimize the maximum mean discrepancy (MMD) between the moments of p_data and p_model and are known to be simpler than other generative models Li et al. (2015). Moment matching evaluates, through the MMD, whether the moments of the model distribution p_model(x) match those of the data distribution p_data(x). This approach is similar to GANs in terms of training, except that it uses a different loss function, which leads to faster sampling. However, GMMNs have received less attention than GANs and VAEs, limiting their sample generation scheme Arjovsky et al. (2017); Goodfellow et al. (2016); Hitawala (2018).

Summary
There are a number of deep generative models for synthetic sample generation. Some of the models are explicit with an intractable likelihood and inference. Some models are only approximate and generate blurry samples. Other methods do not sample in parallel, are complex, and rely on Markov chains, which are time-consuming. GANs are attractive as they do not make any explicit density estimation, and they remedy most of these issues. GANs have generated extremely good examples in many domains. Section 4 reviews these GAN applications.

Applications of GANs
The most successful applications of GANs are in computer vision, but there have been applications in other domains, as well. In this section, we focus on the applications where there is some actuarial use.

Data Augmentation
The availability of sufficient data in many domains is a necessity, especially where predictive models are needed to make business decisions. Such models are built on adequate training data for better generalization and meaningful accuracy Goodfellow et al. (2016). In this work, we are interested in adopting GANs for new sample creation and data augmentation in order to boost predictive models by supplementing training data sets with new samples that are learned from the real data distribution in an adversarial manner. Data augmentation is a procedure to create synthetic cases to augment the training data and increase its size, especially for those data points that are lacking. This is where GANs shine: the ability to create new samples and adequate data sets Goodfellow et al. (2014); Fiore et al. (2019).
There are two main strategies to check whether augmentation has actually helped. First, we can train our model on synthetic data and check how well it performs on real samples. Second, we can train our model on real data for some classification task and only afterwards check how well it performs on generated data. If the model performs well in both cases, samples from the generative model can safely be added to the real data and the model retrained, with a gain in performance to be expected.
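The two validation strategies above can be sketched end-to-end. As stand-ins for a trained GAN and a downstream model (both our own illustrative assumptions), the "generator" below draws from Gaussians near the class means and the classifier is a simple nearest-centroid rule:

```python
import random

def centroid_classifier(X, y):
    """Fit a nearest-centroid classifier; return a predict function."""
    by_class = {}
    for x, t in zip(X, y):
        by_class.setdefault(t, []).append(x)
    centroids = {t: [sum(c) / len(c) for c in zip(*rows)]
                 for t, rows in by_class.items()}

    def predict(x):
        return min(centroids, key=lambda t: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[t])))
    return predict

def accuracy(predict, X, y):
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

rng = random.Random(0)
make = lambda mu, n: [[rng.gauss(mu, 0.3), rng.gauss(mu, 0.3)] for _ in range(n)]
real_X, real_y = make(0, 20) + make(2, 20), [0] * 20 + [1] * 20
# Stand-in for GAN output: fresh samples drawn near the same class means.
fake_X, fake_y = make(0, 20) + make(2, 20), [0] * 20 + [1] * 20

# Strategy 1: train on synthetic, test on real.
tstr = accuracy(centroid_classifier(fake_X, fake_y), real_X, real_y)
# Strategy 2: train on real, test on synthetic.
trts = accuracy(centroid_classifier(real_X, real_y), fake_X, fake_y)
```

High accuracy in both directions suggests the synthetic distribution is close to the real one, so the synthetic samples can be merged into the training set; a large gap in either direction signals that the generator has missed part of the data distribution.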
Recently, a number of papers have applied GANs to augment various data sets, with remarkable results on the performance of the predictive models applied after. Similarly, GANs can be used to augment actuarial data sets and boost actuarial models, making them more accurate and less biased. In this work, we demonstrate how this can be done for a number of data sets, described in Section 6.

Anomaly Detection
Anomaly detection is the identification of rare items, events, or observations which raise suspicions by differing significantly from the majority of the data. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance, or health care. The importance of anomaly detection is due to the fact that anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. There are a number of these methods, such as clustering-based, classification-task, nearest neighbor, spectral, or statistical, but most of them have rather strong assumptions and long training times.
The main generative models, such as the VAE and the GAN, consist of two parts. A VAE has an encoder and a decoder, where the encoder models the distribution and the decoder reconstructs from it Larsen et al. (2015). A GAN consists of a generator and a discriminator, where the generator models the distribution and the discriminator judges whether its output is close to the training data Goodfellow et al. (2014). The two are similar in this respect: each has a modeling part and a judging part (in a VAE, reconstruction can be considered a kind of judgement).
The modeling part is supposed to learn the data distribution. What happens to the judging part if we give it a sample not drawn from the training distribution? For a well-trained GAN, the discriminator will output a value close to 0, and the reconstruction error of a VAE will be higher than its average on the training data Akcay et al. (2018). An unsupervised anomaly detector can then easily be trained and evaluated, and it can be strengthened further with statistical distances if desired.
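The thresholding step described above can be sketched concretely. Here the "judging part" is stubbed by a squared error around the training mean (our own stand-in for a VAE's reconstruction error or a discriminator score), with the threshold set at a chosen quantile of the training errors:

```python
import random

def fit_detector(train, quantile=0.95):
    """Unsupervised anomaly-detector sketch: flag samples whose
    reconstruction-style error exceeds the training-error quantile."""
    mean = [sum(c) / len(c) for c in zip(*train)]
    err = lambda x: sum((a - b) ** 2 for a, b in zip(x, mean))
    errs = sorted(err(x) for x in train)
    threshold = errs[int(quantile * (len(errs) - 1))]
    return lambda x: err(x) > threshold  # True -> flagged as anomaly

rng = random.Random(1)
train = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
is_anomaly = fit_detector(train)
```

Swapping the mean-based error for a VAE's actual reconstruction error, or 1 minus a trained discriminator's output, yields the GAN/VAE anomaly detectors the text describes; the training set never needs a single labeled anomaly.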
In medicine, Schlegl et al. (2017) propose AnoGAN for anomaly detection in medical images, learning the characteristics of healthy data so that lesions stand out as anomalies. Akcay et al. (2018) present GANomaly for semi-supervised anomaly detection, noting a significant improvement in detecting anomalies on various data sets. These methods can be leveraged for potential applications in fraud detection, lapse prediction, and claiming likelihood in insurance. A GAN leveraged for anomaly detection can rival other anomaly detection techniques.

Time Series
Suppose we wanted to simulate the evolution of a stock price for some particular asset using traditional simulations, such as Monte Carlo. We would need to estimate the mean and volatility of the returns using past price evolution and then simulate new prices under the assumption that the returns follow a Gaussian distribution with the estimated parameters. However, this normality assumption may not hold in practice, where tail events tend to occur with higher probability than the Gaussian distribution predicts. We could change our assumption, say, to a Student's t distribution, but neither would that assumption completely describe reality. GANs are capable of replicating the price evolution without making any model assumptions.
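The classical Monte Carlo benchmark described above can be sketched as geometric Brownian motion with Gaussian log-returns (the parameter values below are illustrative assumptions; a GAN generator would replace the Gaussian draw with learned samples):

```python
import math
import random

def simulate_gbm(s0, mu, sigma, n_steps, n_paths, dt=1 / 252, seed=0):
    """Monte Carlo paths under geometric Brownian motion:
    S_{t+dt} = S_t * exp((mu - sigma^2/2) dt + sigma sqrt(dt) Z)."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        s, path = s0, [s0]
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)  # the Gaussian return assumption
            s *= math.exp((mu - 0.5 * sigma ** 2) * dt
                          + sigma * math.sqrt(dt) * z)
            path.append(s)
        paths.append(path)
    return paths

# one year of daily steps, 500 simulated price paths
paths = simulate_gbm(s0=100.0, mu=0.05, sigma=0.2, n_steps=252, n_paths=500)
```

The single line drawing `z` encodes the entire distributional assumption; a time series GAN replaces that draw with samples from a learned return distribution, capturing fat tails without specifying them.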
Time series and stochastic processes are widely used by financial firms for risk management, financial projections, stock prediction, extreme event monitoring, and monetary policy making Fu et al. (2019). Traditionally, autoregressive time series models, exponential smoothing and their variants, and, more recently, deep learning, have been introduced and intensively studied and applied for time series data Esteban et al. (2017).
However, most of these models depend strongly on model assumptions and model parameter estimation and are thus less effective at capturing complex dynamics. Immediate actuarial uses leveraging these GANs are stochastic simulations, capital modeling, mortality projections, reserving, asset and liability management, solvency projection, and other time series generation tasks. GANs can be used to rival Monte Carlo or stochastic simulations without any distributional assumptions. In insurance, mortality forecasting is an important actuarial task. Typically, mortality forecasting models, such as Lee-Carter Lee and Carter (1992), are used, but these make strong mathematical assumptions which need to be validated by the data. Time series GANs could potentially be used to simulate and project mortality rates into the future, potentially competing with existing models.

Privacy Preservation
The data of many companies can be secretive, confidential, or sensitive. Sometimes we need to share it with third parties, such as consultants or researchers. If we want to share a general idea of our data, including its most important patterns, details, and shapes, we can use GANs directly to sample examples of our data to share with other people without sharing identifiable features. This way, we do not share any exact confidential data, just data that look very much like them.
Privacy-preservation GANs are capable of accomplishing this task Beaulieu-Jones et al. (2019). In actuarial valuation models where model points are used to determine the amount of money to hold for an individual/groups, such GANs may be useful for the creation of synthetic samples to be fed into the valuation model, without needing the details of any policy. In particular, GANs can be used to share synthesized data and make them publicly available, increasing the scope for actuarial research, collaboration, and comparisons.

Missing Data Imputation
Missing data causes an issue in analysis, as most standard data analytic methods are not designed for missing data. Techniques such as single imputation (SI) and multiple imputation (MI) Rubin (2004) exist, but there is no consensus on which MI method is superior, even though MI is known to be better than SI Schafer and Olsen (1998).
Generative Adversarial Imputation Net (GAIN) Yoon et al. (2018) provides an alternative generative modeling approach to create new cases that can be used to impute missing information. View Imputation via GANs (VIGAN) Shang et al. (2017) deals with data that are collected from heterogeneous sources, resulting in multi-view or multi-modal data sets where missing data occurs in a number of these sources. These methods were shown to be better than SI/MI methods, thereby improving the effectiveness of ML algorithms trained after. These GANs can be used to impute missing data points in experience investigations and assumption setting in both short-term and life insurance when conducting valuation or pricing, increasing the number of data points available for boosting the predictive power of models built after.
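For contrast with the GAN-based imputers above, the SI baseline they are compared against can be sketched in a few lines: replace each missing value with its column mean over the observed entries (the toy data and function name are our own illustration):

```python
def mean_impute(rows):
    """Single-imputation baseline: fill each missing cell (None)
    with the mean of the observed values in its column."""
    cols = list(zip(*rows))
    means = [sum(v for v in c if v is not None) /
             max(1, sum(v is not None for v in c)) for c in cols]
    return [[m if v is None else v for v, m in zip(r, means)] for r in rows]

rows = [[1.0, 10.0], [3.0, None], [None, 30.0]]
completed = mean_impute(rows)  # missing cells become 2.0 and 20.0
```

Mean imputation ignores the relationships between features and shrinks variance, whereas GAIN-style models generate plausible values conditioned on the observed features, which is why they boost downstream models more effectively.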

Semi-Supervised Learning
The purpose of a Semi-Supervised GAN (SGAN) is to train the discriminator into a classifier which can achieve superior classification accuracy from as few labeled examples as possible Sricharan et al. (2017), thereby reducing the dependency of classification tasks on enormous labeled data sets. It has been shown that an SGAN generalizes from a small number of training examples much better than a comparable, fully-supervised classifier Chongxuan et al. (2017); Liu et al. (2019); Miyato et al. (2018). This has been lauded as the most useful GAN application with good performance with a small number of labels on data sets Odena (2016); Salimans et al. (2016).
For imbalanced data sets, such as mortality, morbidity, fraud, lapses, extreme events, large claims, and sub-standard risks, SGAN may offer a superior alternative predictive model compared to ML models which require significant training data for improved accuracy. Typically, one has to deal with imbalanced classes either through synthetic sample generation using some heuristic method, such as SMOTE, cost sensitive adjustment to the evaluation metric, or adding uncertainty margins, which can be subjective. Through the training of an SGAN, it is possible to have a sample generative scheme whilst having a classifier, as well. This has tremendous advantages over many generative and ML models.

Domain Adaptation
It is quite possible that the training data used to learn a classifier has a different distribution from the data which is used for testing. This results in degradation of the classifier performance and highlights the problem known as domain adaptation Hong et al. (2018). In domain adaptation, the training and test data are from similar but different distributions. This area has become interesting for GANs in the past few years.
These methods include CycleGAN Zhu et al. (2017), Discover GAN (DiscoGAN) Kim et al. (2017), DualGAN Yi et al. (2017), and StarGAN Choi et al. (2018), which can be used for multiple domains. With these methods, one can transfer an algorithm learned from a different data set to a new one and achieve similar performance. Such approaches are also able to learn representation adaptation, i.e., feature representations for which a discriminator cannot differentiate the domain they belong to Tzeng et al. (2017). By using synthetic data and domain adaptation, the number of real-world examples needed to achieve a given level of performance is reduced significantly, utilizing only randomly generated simulated points Hoffman et al. (2017). Domain adaptation can learn transfers between different domains by synthesizing different data sets. This can be useful for combining public data sets or other market data with internal company data in actuarial firms.

Summary
Given the above taxonomy of GAN applications, Table 2 depicts specific actuarial areas where GANs can be useful.
To our knowledge, there have been limited applications of GANs in actuarial areas, such as insurance, health care, banking, investment, and enterprise risk management. This is compounded by the fact that GANs have been highly successful in computer vision, with less emphasis on tabular data sets. However, there have been recent applications of GANs to other tabular data sets, such as airline passengers Mottini et al. (2018) and medical records Armanious et al. (2018).

Product Design, Propensity and Customer Behavior
Create wider and more model points; boost propensity models with more data per cell, leading to better, more accurate models.

Actuarial Models
Experience monitoring and experience rates derived using a large credible data set. Boost models using data augmentation, semi-supervised learning, missing data imputation and domain adaptation for pricing, assumption setting, anomaly detection, risk estimation, time series and attention prediction in insurance, reinsurance, banking, investment, healthcare, and enterprise risk management.

Projections
Network modeling by looking at driving dependencies rather than correlation assumptions, i.e., using generative models. More strategic, flexible, decision-based models that respond to the environment. More GAN-based time series models driven by the environment. Enhanced solvency projection models and stress tests based on rich data sets.

Reserving
Make projections more predictive through a sufficiently large and credible data set at all model points, i.e., accurate assumptions per risk cell with smaller margins.

Surplus Distribution
More granular individual information from alternative data sources through leveraging generative models.

Investment Strategy
Granular data for asset/liability modeling, i.e., use GANs to simulate scenarios that depend entirely on the adopted investment strategy and boosting the model. Enhanced market risk monitoring. Improvements to portfolio optimization.

Data Cleaning
Reduce errors; fill in gaps using imputation; increase the sample size; query other data sets and verify patterns using Cycle GANs.

Research
Make actuarial data sets more publicly available through synthesized data generated by GANs, boosting industry data. This is helpful for creating accurate and more up-to-date standard tables and encouraging actuarial research.

External Data Sources
Leverage other data sets through combining multiple data sets. For example, DualGAN or CycleGAN can be leveraged to learn a representation that encompasses different data sets.
GANs can equally be adopted or leveraged for similar tasks to boost limited actuarial data sets and improve actuarial models, especially in areas where models are needed to make business decisions. Examples of actuarial applications with limited data and the class imbalance problem include claim frequency modeling, claim amount estimation, lapse prediction, fraud detection, mortality/morbidity rate estimation, catastrophe modeling, extreme event models, and risk estimation. Leveraging GANs to increase the data size of these data sets could lead to better actuarial models. In particular, GANs could allow less reliance on stochastic simulations based on subjective distributions and reduce the subjective margins used.

Methodology
This section describes in detail the theoretical operation of GANs, their challenges, and tricks to improve their training. Throughout this paper, it is assumed that both GAN networks are implemented with ANNs. For comparative purposes, we also implement a popular synthetic data generative mechanism using Synthetic Minority Over-sampling Technique (SMOTE) Chawla et al. (2002).

SMOTE
This section describes the theoretical operation of SMOTE for comparative purposes with the GAN applied in this work. SMOTE creates new synthetic cases by linearly interpolating between two nearest-neighbor (NN) instances of the minority class. Chawla et al. (2002) show that SMOTE improves the effectiveness of ML classifiers compared to random over-sampling and under-sampling approaches. Over time, SMOTE has become the default method for synthetic sample generation and has proven popular among researchers, becoming a cornerstone of imbalanced learning Fernández et al. (2018).
Considering a random minority instance x, a new instance s is generated by considering its k nearest neighbors (k-NNs), found using the Euclidean distance metric. First, an instance y is selected at random from the k-NNs of x. Then, a new synthetic minority instance s is generated as follows:

s = x + α (y − x),

where α is randomly generated from the Uniform distribution on [0, 1]. SMOTE's parameters are the value of k and the number of minority cases to generate. These parameters can be tuned to ensure an optimal metric is achieved. SMOTE is the benchmark method for addressing class imbalance in binary classification problems.
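The interpolation rule above can be sketched directly in numpy; the following is a minimal illustration of SMOTE sampling (the function name and parameter defaults are ours, not from the R imbalance library used later):

```python
import numpy as np

def smote_sample(minority, k=5, n_new=100, seed=0):
    """Generate n_new synthetic samples via s = x + alpha * (y - x), where y is
    a random one of x's k nearest minority-class neighbors (Euclidean metric)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, float)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # each point's k-NN indices
    synthetic = np.empty((n_new, X.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X))                # random minority instance x
        y = X[rng.choice(nn[i])]                # random neighbor y among its k-NNs
        alpha = rng.uniform()                   # interpolation weight in [0, 1]
        synthetic[j] = X[i] + alpha * (y - X[i])
    return synthetic
```

Because each synthetic point lies on a segment between two existing minority points, SMOTE never generates samples outside the convex hull of the minority class.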

Vanilla GAN
This section describes the original GAN formulation, called MiniMax GAN (MM-GAN). This is the baseline model on which all other GAN variants are based.

The Discriminator
The discriminator (D) receives generated samples from a generator G and true data examples from p_data(x), and must distinguish between the two for authenticity through a deep ANN Goodfellow et al. (2014). The resulting output D_θd(x) for an input x is the probability of x being sampled from p_data(x) instead of p_g, where p_g is the implicit distribution defined by G. The vector θ_d represents the learned parameters of D. The discriminator's goal is to yield D(x) near 1 for x ∼ p_data and D(G(z)) closer to 0 for z ∼ p_z(z), using the sigmoid function in the output layer. This is achieved by maximizing D's objective over θ_d:

max_{θ_d} E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].

The Generator
The generator (G) randomly picks a sample z from the prior latent space defined by p_z(z) and then generates samples from this distribution using an ANN. This deep ANN must learn the parameters θ_g that, given an input z ∼ p_z(z), produce the output G_θg(z). G is trained to fool D, i.e., to make D's output for a fake/generated sample, D(G(z)), closer to 1. The parameters of G are learned by minimizing G's loss over θ_g:

min_{θ_g} E_{z∼p_z(z)}[log(1 − D(G(z)))].

GAN Loss
Combining the losses for D and G, GANs solve the following minimax game in alternate steps through Gradient Descent (GD) Ruder (2016):

min_{θ_g} max_{θ_d} V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].

These losses for D and G are the original formulation proposed by Goodfellow in 2014, called minimax GAN (MM-GAN). Since we are minimizing over θ_g and maximizing over θ_d, the training of GANs alternates between GD on G and gradient ascent on D Goodfellow et al. (2016). Typically, for every training step of G, D is trained k times, although the optimal choice of k is debated among researchers. This is shown in Algorithm 1.
Remark 1. Gradient-based updates on the networks can be accomplished using one of the GD optimizers. Typically, Stochastic GD (SGD) with Momentum Qian (1999) for D, and Root Mean Square propagation (RMSprop) Hinton and Tieleman (2012) or Adaptive Moment estimation (Adam) Kingma and Ba (2014) for G, tend to work well in practice Goodfellow et al. (2014); Radford et al. (2015).
Algorithm 1. MM-GAN training. For each training iteration:
• For k steps:
– Sample a mini-batch of m noise samples {z^(1), . . . , z^(m)} from the noise prior p_z(z).
– Sample a mini-batch of m examples {x^(1), . . . , x^(m)} from the training data distribution p_data(x).
– Update the discriminator D by ascending its stochastic gradient on these mini-batches:
∇_{θ_d} (1/m) Σ_{i=1}^{m} [log D(x^(i)) + log(1 − D(G(z^(i))))].
• Sample a mini-batch of m noise samples {z^(1), . . . , z^(m)} from the noise prior p_z(z).
• Update the generator by descending its stochastic gradient computed on this mini-batch:
∇_{θ_g} (1/m) Σ_{i=1}^{m} log(1 − D(G(z^(i)))).
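For intuition, the empirical MM-GAN objectives can be evaluated directly from mini-batch discriminator outputs; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def mm_gan_losses(d_real, d_fake):
    """Empirical MM-GAN objectives from discriminator outputs in (0, 1).

    d_real: D(x) on a mini-batch of real samples.
    d_fake: D(G(z)) on a mini-batch of generated samples."""
    d_real, d_fake = np.asarray(d_real, float), np.asarray(d_fake, float)
    # D ascends this quantity; it approaches 0 as D(x) -> 1 and D(G(z)) -> 0.
    d_objective = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
    # G descends mean log(1 - D(G(z))); it decreases as D(G(z)) -> 1.
    g_loss = np.mean(np.log(1.0 - d_fake))
    return float(d_objective), float(g_loss)

# A near-perfect discriminator pushes its objective toward 0:
d_obj, g_loss = mm_gan_losses([0.99, 0.98], [0.02, 0.01])
```

An undecided discriminator (all outputs at 0.5) yields a strictly lower objective, which is what the k inner ascent steps on D are trying to escape.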

Non-Saturating GAN
While the above loss function is useful for theoretical results, unfortunately, it does not work well in practice: there are challenges in getting the GAN to converge, stabilizing its training, and obtaining diverse samples Arjovsky et al. (2017); Mirza and Osindero (2014); Radford et al. (2015); Salimans et al. (2016). In practice, rather than training G with the above loss, to provide better gradients in early training, Goodfellow et al. (2014) suggest maximizing the following objective for G instead:

max_{θ_g} E_{z∼p_z(z)}[log D(G(z))].

This version of GAN is called the non-saturating GAN (NS-GAN) and is typically used as the benchmark in most studies and in practice. With this new loss function, we alternate between gradient ascent on D and gradient ascent on G. Algorithm 1 is based on the original MM-GAN formulation; however, it can easily be tweaked to represent NS-GAN.

Optimal Solution
Theoretically, it can be shown that the GAN zero-sum game in Equation (6) attains its global optimum when p_g = p_data Radford et al. (2015). The optimal discriminator D*_G(x) for a fixed G is:

D*_G(x) = p_data(x) / (p_data(x) + p_g(x)).
Assuming that D is perfectly trained, substituting D*_G(x) into Equation (6) for G's loss gives rise to the Jensen-Shannon (JS) divergence Lin (1991). The JS divergence can be written as a function of the Kullback-Leibler (KL) divergence Kullback (1997); Kullback and Leibler (1951).
Definition 1. The KL divergence between two probability distributions p_data and p_g is defined as

KL(p_data ‖ p_g) = E_{x∼p_data}[log(p_data(x)/p_g(x))].

Definition 2. The JS divergence between two probability distributions p_data and p_g is defined as

JS(p_data, p_g) = (1/2) KL(p_data ‖ (p_data + p_g)/2) + (1/2) KL(p_g ‖ (p_data + p_g)/2).
If we substitute D*_G(x) into Equation (6), G's loss becomes

L(G, D*_G) = 2 JS(p_data, p_g) − log 4,

so the minimum loss for G, −log 4, is reached if and only if p_g = p_data. This equation tells us that, when D has no capacity limitation and is optimal, the GAN loss function measures the similarity between p_data and p_g using the JS divergence. However, although the above provides a nice theoretical result, in practice, D is rarely ever fully optimal when optimizing G Goodfellow et al. (2014). Thus, alternative GAN architectures have been proposed to fix this issue and to get closer to optimality. Below, we describe what causes this failure to converge and how to fix it.
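The KL and JS divergences in Definitions 1 and 2 can be checked numerically for discrete distributions; a small numpy sketch (function names are ours):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions p and q (Definition 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """JS divergence as symmetrized KL against the mixture (Definition 2)."""
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]
js(p, p)  # identical distributions: divergence 0
js(p, q)  # positive, bounded above by log 2
```

Unlike KL, the JS divergence is symmetric and always finite, since both arguments are compared against their mixture rather than against each other directly.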

Challenges with GANs
GANs are notoriously difficult to train and evaluate: the likelihood cannot easily be computed, and they suffer from the vanishing gradient problem, mode collapse, boundary distortion, and over-fitting Arjovsky et al. (2017).

Mode Collapse
Mode collapse is when many latent noise values z are mapped to the same data point x, leading to a lack of diversity in the samples that are created, i.e., under-fitting. This is regarded as the most significant problem with GANs Manisha and Gujar (2018). Many studies have sought to fix it in varied contexts.

Vanishing Gradient
This occurs when D becomes perfect in its training without giving G the chance to improve. As a result, GANs may fail to converge, thereby leading to poor generated samples Arjovsky et al. (2017).

Improved GAN Training
There are many GAN architectures which avoid the problems that come with the vanilla GAN. We briefly describe some of the most common and popular GAN solutions. Salimans et al. (2016) look at ways to improve GANs (called hacks), while other authors propose variants to the vanilla GAN by changing the cost function, adding gradient penalties (GPs), adding labels, avoiding over-fitting, and finding better ways of optimizing GANs. Given the vast number of taxonomies, we are not able to cover all of them but only discuss the most popular and those subsequently used in this work.

Conditional GANs
The first extension of GAN was the conditional GAN (cGAN), which gives the generator the label Y in the latent space, making it class-conditional Mirza and Osindero (2014). Most GAN variants can be modified to include cGAN. cGAN makes it possible to create diversified samples and to force G to create samples of a specific class, thereby alleviating the mode collapse problem.

Deep Convolutional GAN
Until the introduction of the deep convolutional GAN (DCGAN) Radford et al. (2015), training GANs was still unstable. DCGANs provide some further tricks using convolutional and deconvolutional layers. Given that DCGANs use convolutional NNs, which are typically used for images, we do not review this architecture in detail, as our main focus is on tabular data. Despite this, the structure of the DCGAN is very useful in providing stable training for most GANs Lucic et al. (2018).

Loss Variants
There are a number of GAN architectures which change the loss function to improve GAN training and stability. The loss function for GAN measures the similarity between p_data and p_g using the JS divergence. Unfortunately, JS tends not to be smooth enough to ensure stable training Hong et al. (2018). Thus, we describe WGAN, which has become the most widely used GAN architecture since DCGAN.

WGAN
This section describes WGAN and its improved training using WGAN-GP.

Wasserstein Distance
An Integral Probability Metric (IPM) uses a critic function f belonging to an arbitrary function class and measures the maximal distance between two distributions over that class Hitawala (2018). Among the IPMs, the Wasserstein distance is the most common and widely used metric Manisha and Gujar (2018).
Informally, the Earth mover (EM) distance Rubner et al. (2000) W(p_data, p_g) measures the minimal changes needed to transform p_g into p_data. More formally, the EM distance between two probability distributions p_data and p_g is:

W(p_data, p_g) = inf_{γ∈Π(p_data, p_g)} E_{(x,y)∼γ}[‖x − y‖],    (11)

where Π(p_data, p_g) represents the set of all joint probability distributions whose marginal distributions are, respectively, p_data(x) and p_g(x). Precisely, γ(x, y) is a transport plan, i.e., the percentage of mass that should be moved from x to y to transform p_g into p_data. The infimum in Equation (11) is intractable, as it is tricky to exhaust all the elements of Π(p_data, p_g) Arjovsky et al. (2017). This is solved using the Kantorovich-Rubinstein duality:

W(p_data, p_g) = sup_{‖f‖_L ≤ 1} E_{x∼p_data}[f(x)] − E_{x∼p_g}[f(x)],

where the supremum is taken over 1-Lipschitz functions f. A function f is 1-Lipschitz if, for all x_1 and x_2, |f(x_1) − f(x_2)| ≤ ‖x_1 − x_2‖.

The Critic
In WGAN, D's output is no longer a probability but can be any real number; for this reason, D is typically called the critic. The WGAN critic tries to maximize the difference between its predictions for real samples and generated samples, with real samples scoring higher. Arjovsky et al. (2017) force the critic to be 1-Lipschitz continuous for the loss function to work well:

max_{f∈W} E_{x∼p_data}[f(x)] − E_{z∼p_z}[f(G(z))],

where W is the set of 1-Lipschitz continuous functions. Typically, to enforce the Lipschitz constraint, the critic weights w are clipped to lie within a small range, usually [−0.01, 0.01], after each training batch Arjovsky et al. (2017); Gulrajani et al. (2017). The critic is trained to convergence so that the gradients of G are accurate; this removes the need to balance the training of G and D, by simply training D several times between G's updates to ensure it is close to convergence. Typically, 5 critic updates per generator update are used Arjovsky et al. (2017). WGAN uses RMSProp with a small learning rate and no momentum Arjovsky et al. (2017); however, Adam may also be used, as it is a combination of RMSProp with Momentum.
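For intuition about the distance the critic approximates: in one dimension, with equal-size samples, the Wasserstein-1 (EM) distance reduces to the mean absolute difference of the sorted samples, since the optimal transport plan matches order statistics. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def w1_empirical(x, y):
    """Wasserstein-1 (EM) distance between two equal-size 1-D samples.

    In 1-D, the optimal transport plan matches sorted samples, so W1 is
    the mean absolute difference of the order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "equal sample sizes assumed for simplicity"
    return float(np.mean(np.abs(x - y)))

w1_empirical([0.0, 1.0], [0.0, 1.0])  # identical samples: distance 0
w1_empirical([0.0, 1.0], [2.0, 3.0])  # shifting all mass by 2 costs 2
```

Unlike the JS divergence, this distance varies smoothly as one sample is shifted relative to the other, which is the property that gives WGAN its more informative gradients.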

Improved WGAN Training
Even though WGAN has been shown to stabilize GAN training, it does not generalize well to deeper networks due to weight clipping, which tends to localize most parameters at −0.01 and 0.01 Gulrajani et al. (2017); Manisha and Gujar (2018). This effect dramatically reduces the modeling capacity of D. Gulrajani et al. (2017) amend WGAN through the addition of a gradient penalty (GP) to the loss function, resulting in WGAN-GP. In total, three changes are made to the WGAN critic to convert it to WGAN-GP: include a GP in the loss function; do not clip critic weights; and do not use batch normalization layers in the critic. WGAN-GP is defined using the following loss function:

L = E_{x̃∼p_g}[D(x̃)] − E_{x∼p_data}[D(x)] + λ E_{x̂∼p_x̂}[(‖∇_x̂ D(x̂)‖_2 − 1)^2],

where x̂ is sampled uniformly along straight lines between pairs of points sampled from p_data and p_g, and λ is the GP coefficient. Gulrajani et al. (2017) show a better distribution of learned parameters compared to WGAN, and this method has become the default in most GAN loss variants. We adopt the conditional version of WGAN-GP, called WCGAN-GP, as an alternative to current actuarial/statistical approaches for synthetic sample generation. Once WGAN-GP is trained to convergence, G can be used to create new samples by feeding it the latent space Z.
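The gradient penalty term can be illustrated in closed form for a linear critic D(x) = w·x + b, whose gradient with respect to its input is the constant vector w; the following numpy sketch (function name and defaults are ours) computes the penalty only, not the full WGAN-GP training loop, which requires automatic differentiation:

```python
import numpy as np

def gradient_penalty_linear(w, x_real, x_fake, lam=10.0, n_interp=64, seed=0):
    """WGAN-GP penalty lam * E[(||grad_xhat D(xhat)||_2 - 1)^2] for a linear
    critic D(x) = w @ x + b, whose input gradient is w everywhere."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(size=(n_interp, 1))
    # x_hat is sampled uniformly along straight lines between real and fake
    # points; for a linear critic the gradient at every x_hat is simply w.
    x_hat = alpha * x_real + (1.0 - alpha) * x_fake
    grad_norms = np.full(len(x_hat), np.linalg.norm(w))
    return float(lam * np.mean((grad_norms - 1.0) ** 2))

# A unit-norm gradient incurs (essentially) zero penalty; doubling the
# gradient norm costs lam * (2 - 1)^2.
```

The penalty thus pushes the critic toward unit gradient norm along the interpolation lines, a soft version of the 1-Lipschitz constraint that weight clipping enforces crudely.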

Experiments
This section outlines the experiments conducted, showing a popular GAN application for data augmentation and boosting predictive models. We compare WCGAN-GP with SMOTE. This exercise can be similarly adopted for any actuarial modeling problem, such as mortality, morbidity, medical segmentation, credit risk, extreme events, regression, Value-at-Risk, and anomaly detection in insurance, investment, banking, and healthcare.

Data Sets
We considered 5 publicly available imbalanced data sets from the Center for Machine Learning and Intelligent Systems database Dua and Graff (2017). The data sets are described below and shown in Table 3 (imbalanced data sets used in the experiments).

Credit Card Fraud
This data set contains credit card transactions Pozzolo (2015). It is highly imbalanced, with 492 fraudulent transactions out of a total of 284,807 transactions, representing a mere 0.172% of fraud cases. The data set contains 31 anonymized features (Time, Amount, V0, V1, . . . , V28) and a Class indicator showing 1 for fraudulent and 0 for non-fraudulent cases. All the variables are numeric.

Pima Indians Diabetes
This data set contains the prediction of the onset of diabetes within 5 years in Pima Indians given some medical details, representing 34.90% of diabetic cases out of a total of 768 samples Smith et al. (1988). There are 8 independent variables.

German Credit Scoring
This data comes from the German credit scoring from the Machine Learning Repository Dua and Graff (2017). There are 1000 observations with 20 independent variables. The dependent variable is the evaluation of customer's current credit status, which indicates whether a borrower's credit risk is good or bad.

Breast Cancer Wisconsin
This data represents the characteristics of cell nuclei present in a digitized image of a breast mass Street et al. (1993). The data is used to predict whether the cancer is benign or malignant, with 37.25% being malignant samples from a total of 569 cases.

Glass Identification
This data set determines whether a glass type is float or not in terms of its oxide content Evett and Spiehler (1987). Float glass types make up 32.71% of a total of 214 cases.

Scaling the Data
Many ML methods expect features to be on the same scale to avoid the dominance of certain variables, which can affect the accuracy of specific models Ioffe and Szegedy (2015); Mitchell (2006). Normalization re-scales the data to the range [0, 1], while standardization centers the data distribution to N(0, 1). We adopt normalization as it does not assume any specific distribution, which can potentially speed up convergence Goodfellow et al. (2016); Mitchell (2006).

Train-Test Split
ML models are usually trained on one portion of the data and tested on unseen data. Two approaches to splitting the data are cross-validation (CV) and the train-test split Friedman et al. (2001). CV divides the data into K subsets, which can lack sufficient credibility and can result in higher variability of predictions if the data size is too small Friedman et al. (2001). The train-test split, however, allows a larger subset of the data to be used for estimating model coefficients and gives more reasonable results Mitchell (2006).
Existing literature typically uses a 70-90% train-test split, especially if the data is large. This technique is simple, easy to understand, and widely used, despite sometimes giving noisy estimates Friedman et al. (2001); Goodfellow et al. (2016); Mitchell (2006). CV is typically used to optimize the parameters of a classifier. This work adopts 75% training data and 25% testing data. Other train-test splits are possible; however, we leave this for future work.

SMOTE Implementation
Over-sampling is performed on the 75% training data using the R imbalance library. The R imbalance library contains functions for performing SMOTE and other variants. 1 The two parameters to tune are the number of neighbors and the over-sampling rate. We kept the over-sampling rate the same to ensure balanced class distributions within each data set. Using SMOTE, we create additional synthetic cases to augment the above training data sets.
We varied the number of k-NNs for each data set to ensure optimal parameters are chosen through a 10-fold CV. This was done through a grid search scheme, with values of k-NN ranging from 1 to 15, optimized using the Area under the Precision-Recall Curve (AUPRC). The best parameter values for each data set are shown in Table 4 below.

Data Set                   Value of k-NN
Credit Card Fraud          6
Pima Indians Diabetes      9
Glass Identification       10
German Credit Scoring      12
Breast Cancer Wisconsin    10

GAN Implementation
Given its popularity and wide use, WGAN is adopted for alternative synthetic sample generation. Specifically, we adopt the conditional version of WGAN with GP; thus, we use WCGAN-GP Gulrajani et al. (2017); Mirza and Osindero (2014). Below, we describe how parameters are chosen and results generated.

Software
Due to its simplicity and faster computations, the high-level Keras François (2015) library with the Tensorflow back-end is chosen to implement WCGAN-GP. 2 This is trained using all minority cases of each data set.

The Generator
This section describes how the parameters for G are chosen. The random noise for z is generated from N(0, 1) with 100 dimensions. This is based on GAN hacks which suggest to sample from a spherical distribution Salimans et al. (2016).
Rectified Linear Unit (ReLU) Glorot et al. (2011) is adopted in the hidden layers Salimans et al. (2016). For G's output layer, the hyperbolic tangent (tanh) is adopted. No dropout or batch normalization is applied in the hidden layers, following the advice of Gulrajani et al. (2017) for WGAN-GP.
The layers of G are ordered in ascending size. For simplicity, after a number of iterations, 3 hidden layers were chosen for each data set: 128 units in the first layer, 256 in the second, and 512 in the third. These layers worked well in the experiments conducted. The output layer had the dimension of the data as its number of units.
Weights are initialized using the He initialization method He et al. (2015). Adam is used to optimize the weights of G Radford et al. (2015); Salimans et al. (2016). For Adam, we used default values with β 1 = 0.5 and β 2 = 0.9 for G Kingma and Ba (2014).
We used a batch size of 128 when optimizing the gradients, for faster training Ioffe and Szegedy (2015). The initial learning rate η for G was fixed at 0.00004. The number of epochs was set to 5000, at which point the GAN training was found to be stable.

The Critic
Leaky ReLU with a negative slope of 0.2 is adopted Glorot et al. (2011); Radford et al. (2015). Similarly to the generator, 3 hidden layers were used, arranged in descending size: 512 units in the first layer, 256 in the second, and 128 in the last. The critic outputs a single value using a linear function Arjovsky et al. (2017). Adam was used with the default parameters in Keras François (2015), as per Table 5. Critic weights were also initialized using the He method, and the same batch size as in the generator was used. We pre-trained the critic 100 times at each adversarial training step Arjovsky et al. (2017), which ensures convergence at each step before G is updated. We used WGAN with a GP with the default values as per the original paper Gulrajani et al. (2017); the GP coefficient was left unchanged at 10. We call this model WGAN-GP. We found that, after 5000 epochs, the losses plateaued and did not change much.
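The generator and critic architectures described above can be sketched in Keras; a minimal sketch under the stated settings (layer widths, activations, He initialization, noise distribution, and Adam parameters from the text; data_dim is a placeholder, and the WGAN-GP training loop itself is omitted):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_generator(latent_dim=100, data_dim=30):
    """Generator: ascending hidden layers 128-256-512 (ReLU), tanh output."""
    return keras.Sequential([
        layers.Dense(128, activation="relu", kernel_initializer="he_normal"),
        layers.Dense(256, activation="relu", kernel_initializer="he_normal"),
        layers.Dense(512, activation="relu", kernel_initializer="he_normal"),
        layers.Dense(data_dim, activation="tanh"),
    ])

def build_critic(data_dim=30):
    """Critic: descending hidden layers 512-256-128 (LeakyReLU 0.2), linear score."""
    return keras.Sequential([
        layers.Dense(512, kernel_initializer="he_normal"),
        layers.LeakyReLU(0.2),
        layers.Dense(256, kernel_initializer="he_normal"),
        layers.LeakyReLU(0.2),
        layers.Dense(128, kernel_initializer="he_normal"),
        layers.LeakyReLU(0.2),
        layers.Dense(1),  # unbounded critic score, not a probability
    ])

# Generator optimizer settings from the text: eta = 0.00004, beta_1 = 0.5, beta_2 = 0.9
g_optimizer = keras.optimizers.Adam(learning_rate=4e-5, beta_1=0.5, beta_2=0.9)

# Latent noise sampled from N(0, 1) with 100 dimensions, batch size 128
z = np.random.normal(size=(128, 100)).astype("float32")
```

Because the critic ends in a plain linear unit, its output is a real-valued score, consistent with the Wasserstein formulation rather than a sigmoid probability.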

Labels
Typically, to boost faster training and fix mode collapse, additional information can be incorporated in both G and D using cGAN Mirza and Osindero (2014). We used the conditional version of WGAN-GP where class labels were added to the minority cases. To accomplish this, clustering was done on the minority cases in order to induce class labels on the training data.
We explored a number of common mechanisms considering k-means clustering Hartigan and Wong (1979), Agglomerative Hierarchical Clustering (AHC) Voorhees (1986), Hierarchical DBSCAN Ester et al. (1996), and t-distributed Stochastic Neighborhood Embedding (t-SNE) Maaten and Hinton (2008). The details of these algorithms are beyond the scope of this work.
Due to its wide use and simplicity, we adopted k-means clustering with 2 clusters for each data set. This yielded labels that could be fed into G and D to condition the generated samples. We call the final model WCGAN-GP after incorporating these class labels into the training.

Training WGAN-GP
Figure 4 presents the experiments of training WCGAN with GP. For comparative purposes, using similar parameters, we show the quality of samples generated by WCGAN 3 with GP, WGAN, cGAN, and the non-saturating GAN on the credit card fraud data. We consider two combinations of the features for illustrative purposes, up to 5000 epochs. The results show the superiority of samples generated by WCGAN with GP. There is a clear mode collapse problem with the vanilla GAN and cGAN. WGAN and WCGAN with GP produce better samples. There are clear damped oscillations and unstable losses for GAN and cGAN, whereas the Wasserstein GANs exhibit stable training and losses, especially after 1000 iterations, where training seems to settle and stabilize. Figure 5 shows the critic loss for each epoch, where, after 1000 epochs, the loss starts to plateau. Thus, we decided to stop the training after 5000 epochs. We repeated this experiment for each data set and adopted WCGAN with GP after 5000 epochs as the model to use for synthetic sample generation.

Generating Synthetic Samples
Once the WCGAN with GP is trained to 5000 epochs, the learned generator distribution is used to create more synthetic samples by feeding it the number of samples to output.

Figure 5. Difference between generated and real data critic loss.

Logistic Regression
For simplicity, and given its wide use among actuaries, a Logistic Regression (LR) McCullagh (1984) model is trained using Python 3.7 Python Software Foundation (2017) on both the imbalanced training data and the over-sampled data sets to predict the likelihood of each minority case using this equation:

h_θ(x^(d)) = 1 / (1 + e^{−(θ_0 + θ_1 X_1 + · · · + θ_d X_d)}),

where h_θ(x^(d)) is the probability of the given minority case, the θ_i's are the coefficients estimated using SGD, X_i is the i-th feature, and d is the number of features included in the LR model. The coefficients are estimated by minimizing a loss function through SGD. Typically, classification is such that, when h_θ(x^(d)) ≥ 50% for an instance, the minority case is assigned and, otherwise, the majority case.
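The scoring rule above can be sketched in numpy (function names are ours; the coefficient estimation via SGD is omitted):

```python
import numpy as np

def lr_predict_proba(theta, X):
    """h_theta(x) = 1 / (1 + exp(-(theta_0 + theta_1 x_1 + ... + theta_d x_d)))."""
    theta = np.asarray(theta, float)
    X = np.asarray(X, float)
    z = theta[0] + X @ theta[1:]  # intercept plus linear predictor
    return 1.0 / (1.0 + np.exp(-z))

def lr_classify(theta, X, threshold=0.5):
    """Assign the minority class (1) when h_theta(x) >= threshold."""
    return (lr_predict_proba(theta, X) >= threshold).astype(int)

# With all coefficients zero, every case gets probability 0.5
p = lr_predict_proba([0.0, 0.0], [[1.0]])
```

Lowering the threshold below 50% trades Precision for Recall, which is the lever the threshold-dependent metrics in the next section vary.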

Evaluation
The confusion matrix returns a report showing how predicted classes on unseen test data using the LR model compare to actual observed classes, as depicted in Table 6. True Negative (TN) is the number of majority cases that were correctly classified as such. False Positive (FP) is the number of majority cases that were incorrectly classified as minority. True Positive (TP) is the number of minority cases that were correctly classified as minority. False Negative (FN) is the number of minority cases that were incorrectly classified as majority. Using these definitions, Table 7 presents the most well known evaluation metrics for binary problems.

Metric      Formula
Accuracy    (TP + TN) / (TP + TN + FP + FN)
Precision   TP / (TP + FP)
Recall      TP / (TP + FN)
F1-Score    2 × Precision × Recall / (Precision + Recall)

Precision is the ability of the LR model not to label a minority case that is actually majority. Recall is the ability of the LR model to find all minority cases. F1-Score is the harmonic mean of Precision and Recall He and Garcia (2008), putting equal weight on both. Accuracy can be misleading and inappropriate when there are imbalanced classes, as it may be biased towards majority cases Chawla et al. (2002); Ganganwar (2012); He and Garcia (2008); thus, we do not rely on it in this work. Accuracy, Precision, Recall, and F1-Score should be close to 100% for an LR model to do well on the testing data. However, these scores are influenced by the threshold used to decide between the two binary classes.
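The metrics in Table 7 follow directly from the confusion-matrix counts of Table 6; a minimal sketch (the function name is ours):

```python
def binary_metrics(tp, tn, fp, fn):
    """Evaluation metrics from the confusion-matrix counts of Table 6."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)                        # TP / (TP + FP)
    recall = tp / (tp + fn)                           # TP / (TP + FN)
    f1 = 2.0 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Example: 90 TP, 900 TN, 10 FP, 10 FN
acc, prec, rec, f1 = binary_metrics(90, 900, 10, 10)
```

Note that in this example accuracy is dominated by the 900 majority true negatives, illustrating why Precision, Recall, and F1-Score are preferred for imbalanced classes.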
The Receiver Operating Characteristic (ROC) curve Bradley (1997); Hanley and McNeil (1982) measures a classifier's performance on a test set over different decision thresholds by plotting the TP rate (Recall) against the FP rate. The Area under the Curve (AUC) measures the performance of the LR model trained on both imbalanced and over-sampled data sets and tested on unseen data, with values close to 100% considered excellent performance Bekkar et al. (2013); Hanley and McNeil (1982). We also compute the Precision-Recall curve and the Area Under the Precision-Recall Curve (AUPRC) to get a weighted score. The method that gives the highest score is better.

Statistical Hypothesis Testing
The Friedman test Friedman (1937), followed by a post-hoc Nemenyi test Nemenyi (1962), is performed to verify statistically significant differences between WCGAN-GP and SMOTE.

Friedman Test
The Friedman test is a non-parametric ranking test to determine whether SMOTE and WCGAN-GP methods perform similarly in mean performance rankings based on the measures above, when normality does not hold Friedman (1937).
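This work runs the test in R via the PMCMR package (described below); for illustration, an equivalent Friedman test is available in Python through scipy (the scores here are illustrative placeholders, not the paper's results):

```python
from scipy.stats import friedmanchisquare

# Illustrative AUPRC-style scores across the five data sets (placeholder
# values, not the paper's results) for Baseline, SMOTE, and WCGAN-GP.
baseline = [0.70, 0.61, 0.55, 0.48, 0.90]
smote    = [0.72, 0.60, 0.58, 0.50, 0.91]
wcgan_gp = [0.78, 0.66, 0.63, 0.55, 0.95]

stat, p_value = friedmanchisquare(baseline, smote, wcgan_gp)
# Reject the null hypothesis of equal mean ranks when p_value < 0.05;
# a post-hoc test (e.g., Nemenyi) then locates which pairs differ.
```

Each data set acts as a block, so the test ranks the methods within each data set before comparing mean ranks, which is why no normality assumption is needed.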

Post-Hoc Nemenyi Test
If the null hypothesis is rejected, a post-hoc test can be applied. The post-hoc Nemenyi test evaluates pairwise comparisons between the over-sampling methods when the Friedman test suggests a difference in performance Nemenyi (1962); Pohlert (2014). We adopt WCGAN-GP as the control method.

Implementation
Both tests are conducted using the Pairwise Multiple Comparison Ranks Package (PMCMR) Pohlert (2014) available in R. We assume statistical significance of the alternative hypothesis at p-values < 0.05. In other words, we fail to reject the null hypothesis when the resulting p-value is higher than 0.05, suggesting that there is no difference between SMOTE and WCGAN-GP.

Results
This section presents the results of all the LR models applied on the baseline and over-sampled data sets, with Precision, Recall, F1-Score, AUC, and AUPRC computed on the same unseen test data. Table 8 presents the evaluation metrics (based on the testing set) of the LR model applied on the baseline and over-sampled data sets for a default threshold of 50%. Bold indicates the algorithm that performs best for that data set, i.e., a higher score on that metric. Figure 6 shows the average performance across all data sets for each evaluation metric. In general, SMOTE improves Recall at the expense of a lower Precision, resulting in a lower F1-Score than the Baseline results. As a result of a much lower Precision for SMOTE, its AUPRC is penalized and lower than both Baseline and WCGAN-GP. SMOTE compromises Precision significantly, whereas WCGAN-GP improves Recall while not significantly penalizing Precision.

Comparisons
Overall, WCGAN-GP shows a higher F1-Score. Thus, using the default threshold, WCGAN-GP performs best on F1-Score, followed by Baseline, with SMOTE last (on average). The lower Precision of SMOTE may be due to its strict assumed probability distributions and the possible creation of overlapping and noisy samples Bellinger et al. (2015); Das et al. (2015); Gao et al. (2014); Mathew et al. (2015); Zhang and Li (2014). While the single-threshold results on Precision, Recall, and F1-Score are useful, they do not give the entire picture across different thresholds Bekkar et al. (2013).
Since AUC and AUPRC are computed over varied thresholds, these metrics are typically preferred over single-threshold measures such as Precision, Recall, and F1-Score Bekkar et al. (2013); Ganganwar (2012); López et al. (2013). Moreover, since we are also comparing the above results with the Baseline model, and single-threshold metrics are affected by class imbalance He and Garcia (2008), we rely on the AUC and AUPRC.

AUC
The ROC curve represents the trade-off between the true positive rate (Recall) and the false positive rate, while the AUC is the area under the ROC curve Bekkar et al. (2013). SMOTE reports higher AUC values than the Baseline. WCGAN-GP is better on 3 of the 5 data sets, the exceptions being the credit card fraud and diabetes data sets. Overall, the average AUC values of WCGAN-GP and SMOTE are similar. This result conflicts with the AUPRC scores, where WCGAN-GP shows a clear superiority over SMOTE.
Whilst AUC may be useful, it does not consider Precision, which may be the most informative metric for minority cases. AUC may be affected by skewed data sets and the data distribution He and Garcia (2008). ROC curves are appropriate when the data are balanced, whereas Precision-Recall curves are appropriate for imbalanced data sets Bekkar et al. (2013); He and Garcia (2008). AUC tends to provide a more optimistic view than AUPRC He and Garcia (2008).
In general, an algorithm that dominates in AUC may not necessarily dominate in the AUPRC space He and Garcia (2008). Saito and Rehmsmeier (2015) suggest that the Precision-Recall curve and AUPRC are more informative than the ROC curve and AUC. Since we are also comparing with the Baseline, which is imbalanced, ROC and AUC may be inappropriate; thus, AUPRC provides a sensible measure across all methods.
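A small sketch can illustrate this point: under heavy class imbalance, AUC can look flattering while AUPRC stays low. The data below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Hypothetical, heavily imbalanced test set: 1% minority class
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A weak scorer: minority scores only slightly higher than majority
scores = rng.normal(0.0, 1.0, 1000)
scores[:10] += 1.0

auc   = roc_auc_score(y_true, scores)
auprc = average_precision_score(y_true, scores)

# Under imbalance, AUC typically reads far more optimistically than AUPRC
print(f"AUC = {auc:.3f}, AUPRC = {auprc:.3f}")
```

The gap between the two numbers is exactly the "overly optimistic view" of AUC discussed above.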

AUPRC
AUPRC shares the threshold-independent character of the AUC while remaining informative under class imbalance; thus, for the purposes of this work, we rely more on AUPRC than on AUC He and Garcia (2008); Saito and Rehmsmeier (2015).
Overall, WCGAN-GP shows clear improvements over SMOTE. WCGAN-GP scores highest on AUPRC, suggesting that this algorithm performs best across many thresholds and all the data sets used. On average, SMOTE does not provide superior predictive performance over the Baseline across the metrics. Below, we provide conclusive evidence on the statistical significance of these AUPRC results. Table 9 shows the results of the Friedman test applied to AUPRC to verify the statistical significance of WCGAN-GP's advantage over SMOTE. There is enough evidence at the 5% significance level to reject the null hypothesis on 3 of the data sets (the exceptions being German credit scoring and glass identification), suggesting that the over-sampling methods do not perform similarly.

Statistical Hypothesis Testing
Since the null hypothesis was rejected for 3 of the data sets, a post-hoc Nemenyi test was applied to determine pairwise comparisons, with WCGAN-GP as the control method. Table 10 shows the results of the post-hoc test. These results confirm the significant superiority of WCGAN-GP over SMOTE, as all the p-values are less than 0.05 for the 3 data sets where Friedman's test suggested a difference. They also confirm the findings shown in Figure 6 and Table 8, where the average performance on both AUC and AUPRC was lower for SMOTE than for WCGAN-GP. In general, WCGAN-GP provides statistically significantly better performance on 3 of the 5 data sets.

Discussion
Overall, SMOTE improves the AUC/AUPRC when applied to the imbalanced data sets but significantly penalizes Precision, leading to a lower AUPRC on 2 of the data sets used. SMOTE samples synthetic points along line segments joining minority instances, using the Euclidean distance. This interpolation may stray into majority-class regions, thus creating noisy examples and overlapping cases Han et al. (2005). SMOTE is not based on the true distribution of the minority class data Gao et al. (2014). The poor performance of SMOTE (especially on Precision for the credit card fraud data set) may be attributed to these effects. Overall, SMOTE alters the data distribution, as observed in the significant compromise on Precision and the generally lower F1-Score, AUPRC, and AUC values.
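The interpolation mechanism described above can be sketched as follows; this is a minimal illustration of SMOTE's line-segment sampling among minority instances, not the full algorithm (no majority-class handling or feature scaling), and all names here are our own.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(minority: np.ndarray, k: int = 3, n_new: int = 5) -> np.ndarray:
    """Minimal SMOTE sketch: interpolate along segments joining a minority
    instance and one of its k nearest minority neighbours (Euclidean)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # k nearest neighbours of x within the minority class (excluding x itself)
        dists = np.linalg.norm(minority - x, axis=1)
        nn_idx = np.argsort(dists)[1:k + 1]
        neighbour = minority[rng.choice(nn_idx)]
        u = rng.random()  # uniform gap along the segment, in [0, 1)
        synthetic.append(x + u * (neighbour - x))
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sample(minority)
print(new_points)  # each row lies on a segment between two minority points
```

Because every synthetic point lies on a segment between existing minority points, the method can only fill in the convex region around them, which is the source of the overlap and noise issues cited above.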
Other SMOTE variants, such as density-based approaches, are meant to address the above weaknesses Bellinger et al. (2015); Gao et al. (2014). However, they make strict assumptions about the structure and distribution of the minority class data. SMOTE was the quickest method to over-sample; WCGAN-GP requires significant pre-training of both the critic and the generator.
GANs are well known for their training instability and computational demands Creswell et al. (2018); Lucic et al. (2018); thus, they have expensive run-times. However, recent variants, such as WGAN and WGAN-GP, mitigate this with more stable training. The quality of the generated samples may justify the longer training times. In this study, the GANs reached stable training even for small samples, such as the credit card fraud cases. This means that GANs may still be used even for smaller data sets, given enough training capacity.
Using WCGAN-GP to over-sample minority cases provided the best performance on the AUPRC and, on 3 of the 5 data sets, on the AUC. GANs do not make explicit assumptions about the probability distribution of the minority class data. Prior studies similarly report superior GAN performance over SMOTE for image data sets. There is significant potential to create new samples using GANs, leveraging them to augment limited actuarial data sets.
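The augmentation workflow can be sketched as follows; here a Gaussian draw stands in for a trained WCGAN-GP generator, so the code only illustrates how synthetic minority rows would be appended before fitting the downstream classifier, not the generator itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical imbalanced training set: 200 majority, 10 minority instances
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(2.0, 1.0, size=(10, 2))

# Stand-in for generator output: in the paper's setting these rows would
# come from a trained WCGAN-GP generator, not a Gaussian draw
X_synth = rng.normal(2.0, 1.0, size=(190, 2))

# Append synthetic minority rows so the classes are balanced
X = np.vstack([X_maj, X_min, X_synth])
y = np.concatenate([np.zeros(200), np.ones(10), np.ones(190)])

# Fit the downstream classifier on the augmented data
clf = LogisticRegression().fit(X, y)
print("class balance after augmentation:", y.mean())  # 0.5
```

Evaluation must still use the original, untouched test set, as was done throughout the experiments above.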
Given the current surge of interest in GANs, optimizing and training them is becoming more straightforward, with many implementations available in Keras, PyTorch, and TensorFlow. Thus, running times for GANs need not be an obstacle, enabling GANs to provide a superior over-sampling approach for supplementing imbalanced data sets. As GANs have grown in popularity, their limitations have been addressed substantially.
However, there are still open challenges for GANs. GANs rely on the generated examples being completely differentiable with respect to the generative parameters; as a result, GANs cannot produce discrete data directly. Another key challenge is the evaluation of GANs after training, even though measures exist to assess the quality of the generated results.
Research on GANs grows each year. Practitioners may need to add GANs to their toolkit, as this will significantly improve their models and aid decision-making; the field will be characterized by advancements in deep learning, maturing training processes, and open acceptance and wide use in commercial applications.

Implications for Actuaries
Given the superiority of GANs over other generative models and their wide applications, there is scope for actuarial use. The most obvious use is data augmentation and boosting predictive models used for assumption setting, propensity modeling, pricing, reserving, capital, and solvency projections, as demonstrated in the experiments conducted. Using synthetic data sets created through GANs could allow actuaries to share salient features of their data without sharing the full data set, enabling actuarial data sets to be more widely available for public use and research purposes.
Given the surge in marketing and social promotions, info-graphics are the main ingredient of social media marketing. GANs can help marketers and designers in the creative process. Other applications include anomaly detection, joint distribution learning, discriminative modeling, semi-supervised learning, domain adaptation, attention prediction, data manipulation, missing data imputation, time series generation, privacy preservation, and computer vision.
These GAN applications have potential for actuarial adoption in insurance, banking, health and care, and other non-traditional areas, as GANs have been shown to provide better alternatives to current approaches. Research on GANs grows each year, and actuaries may need to add GANs to their toolkit, as this will significantly improve their models and aid decision-making.

Conclusions
Gaining an advantage in competitive markets through offering suitably tailored products to customers relies on building and maintaining adequate predictive models. To build these models, a substantial amount of data and a sizeable number of records are desirable. However, when expanding to new or unexplored markets, that level of information is rarely available. As a result, actuarial firms have to procure data from local providers, purchase reinsurance from a re-insurer, use limited and often unsuitable industry and public research, or rely on extrapolations from other better-known markets. In this work, we showed how an implicit model using GANs can alleviate this problem through the generation of adequate-quality data, even from very limited small samples, from difficult domains, or without alignment.
This example is a classic data augmentation application of GANs, in which we showed their superiority over SMOTE and their improvement on the original results. SMOTE improved the classification performance; however, SMOTE is not based on the true underlying minority class distribution. Density-based SMOTE variants attempt to remedy this issue, but they too are not based on the true data distribution, as they make strong assumptions about the data.
Using WCGAN-GP, it is possible to create synthetic cases implicitly, and this turned out to offer a significant improvement over SMOTE. This work comprehensively reviews GAN theory and applications across a number of domains, with possible adoption for actuarial use. These applications have scope within actuarial science, and actuaries can add them to their toolkit to aid predictive models.

Summary of Applications
In our opinion, the future of GANs will be characterized by open acceptance of GANs and their applications by the research community and by their use in commercial applications. Given their impressive results and the advancement of deep learning techniques, we expect wider use of GANs. The training instability of GANs should diminish as training techniques mature, with new methods being invented at a rapid pace.
There are potential applications of GANs for actuarial use in insurance, health and care, banking, investment, and enterprise risk management, including, inter alia, new sample creation, data augmentation, boosting predictive models, anomaly detection, semi-supervised learning, attention prediction, time series generation, and missing data imputation. In conclusion, GANs have the potential to boost actuarial models and support better business decisions.

Limitations and Future Research
We repeated the training and testing of each over-sampling method 30 times to minimize stochastic effects; this sample size can be increased for more robustness. Alternatively, a bootstrapping approach can be applied to better understand the distributional attributes of the model errors.
Given that we used a single train-test split, changing the split could potentially change the results. Since we assessed the significance of the results using 30 repeated samples, this adds some comfort to the outcome. However, the use of other data splitting methods, such as bootstrapping, cross-validation, and different train-test splits, could enhance this work further. Investigating different splitting methods for different data sets may provide further research work.
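The bootstrapping idea mentioned above could be sketched as follows; the 30 run scores here are simulated placeholders, not the experiment's actual metrics.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical AUPRC scores from 30 repeated train-test runs (illustrative)
scores = rng.normal(loc=0.70, scale=0.03, size=30)

# Bootstrap: resample the 30 run scores with replacement many times
boot_means = np.array([
    rng.choice(scores, size=scores.size, replace=True).mean()
    for _ in range(5000)
])

# 95% percentile confidence interval for the mean AUPRC
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {scores.mean():.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

A narrow interval would indicate that the ranking of the over-sampling methods is robust to the stochastic effects of repeated training.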
Our future work includes extensively comparing current traditional actuarial approaches, such as stochastic simulations and pricing models, against each GAN approach in each domain. Time series GANs have been gaining interest in the past few years. An interesting research area is using recurrent conditional GANs to simulate and project mortality, compared with the traditional Lee-Carter model and its variants. Possible future research directions to improve this work include: • Application of the same techniques to other data sets, especially complex data sets that include small disjuncts, overlapping classes, mixed data types, and multiple classes, and particularly actuarial data sets. • Consideration of other ML algorithms, to show which ML technique is best for which data set and domain. • Empirical comparison of these results with other tabular data sets where GANs were applied. • Implementation and leveraging of GANs in R or Python for actuarial use.