A Combined Neural Network Approach for the Prediction of Admission Rates Related to Respiratory Diseases

: In this paper, we investigated rates of admission to hospitals (or other health facilities) due to respiratory diseases in a United States working population and their dependence on a number of demographic and health insurance-related factors. We employed neural network (NN) modelling methodology, including a combined actuarial neural network (CANN) approach, and model admission numbers by embedding Poisson and negative binomial count regression models. The aim is to explore the gains in predictive power obtained with the use of NN-based models, when compared to commonly used count regression models, in the context of a large real data set in the area of healthcare insurance. We used nagging predictors, averaging over random calibrations of the NN-based models, to provide more accurate predictions based on a single run, and also employed a k -fold validation process to obtain reliable comparisons between different models. Bias regularisation methods were also developed, aiming at addressing bias issues that are common when fitting NN models. The results demonstrate that NN-based models, with a negative binomial distributional assumption, provide improved predictive performance. This can be important in real data applications, where accurate prediction can drive both personalised and policy-level


Introduction
Chronic respiratory diseases (CRDs) remain one of the main causes of morbidity and mortality worldwide.The World Health Organization (WHO) reports that hundreds of millions of people worldwide suffer from preventable CRDs, and these diseases cause around four million deaths each year (Bousquet et al. 2007).Asthma, chronic obstructive pulmonary disease (COPD), and occupational lung disease are some of the major diseases classified under preventable CRDs.As per the WHO, 262 million people have asthma globally, and over 3 million people die each year from COPD, which accounts for nearly 6% of deaths around the world (WHO 2022b).The recent COVID-19 pandemic has once again put CRDs in the limelight due to the notion that people with pre-existing respiratory conditions are at a high risk for COVID-related health complications and death.Although some studies, such as Aveyard et al. (2021), support this, owing to the lack of data, the extent of higher risks from COVID for individuals with CRDs is still unclear (WHO 2022a).Most CRDs are preventable through early detection and intervention, which require precise identification and understanding of the key risk factors and their association with the incidence and prevalence of these diseases.Previous studies, such as Doney et al. (2014) and CDC (2012), have explored the prevalence of COPD among the US population, whereas Blanc et al. (2019) focused on the relationship between workplace exposure and respiratory diseases.
Ever since generalised linear models (GLMs) were introduced by Nelder and Wedderburn (1972), they have been at the forefront of predictive modelling in all areas of research.Details regarding numerous applications of GLMs for tackling insurance-related actuarial problems in both life and non-life sectors are described in Haberman and Renshaw (1996), De Jong and Heller (2008), Frees (2009), and Ohlsson and Johansson (2010).The advancement in computational capabilities and the development of software, such as Bayesian inference using Gibbs sampling (BUGS), have further facilitated the Bayesian implementation of GLMs.For instance, in an insurance-related context of morbidity studies, Ozkok et al. (2014) adapted Bayesian methodologies for modelling claim diagnosis rates of critical illness; more recently, Arık et al. (2021) conducted a population study for assessing the socioeconomic disparities in cancer incidence and mortality in England.
Improvements in the ease of access to fast computing over the last few decades have paved the way for the extensive adaptation of machine learning approaches in many research and practice areas.The same pattern has been observed in the insurance field, particularly in recent years.There is an increasing trend in applying machine learning approaches for addressing various insurance-specific problems.Blier-Wong et al. (2020) provided a detailed description of recent adaptions of different machine learning approaches in the actuarial field, particularly for pricing and reserving.Ferrario et al. (2020) detailed the adaptation of neural network regression models for modelling claim rates using the French motor third-party liability (MTPL) insurance data set.The approach facilitates the embedding of traditional regression models for count data into a neural network framework using a class of feed-forward neural network (FFNN) models.For the rest of the paper, we refer to the FFNN and neural network (NN) indifferently.The approach was further extended by Schelldorfer and Wuthrich (2019) to develop the combined actuarial neural network (CANN) approach (see Section 3.3 for details), which showed superior performance in comparison to the previous NN model.Tzougas and Li (2021) added to the approach by developing both NN and CANN models under a negative binomial distributional assumption.
In this work, we developed an ensemble of statistical predictive models that can accurately predict admission rates to hospitals (or other health facilities) related to respiratory diseases in a US population.In what follows, we outline the main contributions of the paper.

•
First, we investigated the efficiency of the Poisson and negative binomial CANN models for predicting admission rates related to respiratory diseases in a United States (US) working population.In particular, we began by considering Poisson NN models, including a CANN model, and developed modifications based on early stopping and dropout techniques, which improve their performances.Subsequently, motivated by the suitability of the negative binomial distribution when data are overdispersed, we also developed negative binomial NNs and compared their predictive performances to those of the Poisson models.NN-based models were trained by minimising the corresponding deviance loss functions and compared using the testing data loss.Models under the negative binomial distributional assumption led to superior forecasting performances.The same result was obtained when we eventually compared the Poisson CANN to the negative binomial CANN.Furthermore, it is worth noting that while machine learning approaches and CANN models, under both Poisson and negative binomial distributional assumptions, have been explored for data-driven applications in the field of non-life insurance, to the best of our knowledge, this is the first paper that considered employing such methods for morbidity modelling in an insurance context.• Second, we considered the bias-regularised version of the negative binomial NN and CANN models by modifying the intercept of their output layers following the approach of Wüthrich (2020).Additionally, we also addressed bias issues on a population level by extracting the last hidden layer of the NN and CANN models, fitting the corresponding negative binomial regression models and, therefore, controlling the portfolio bias by adjusting the intercept.
• Third, following the setup of Richman and Wüthrich (2020), we determined a nagging predictor in the case of the negative binomial NN and CANN models for taking advantage of the randomness of neural network calibrations to provide more stable predictions than those under a single neural network run.• Finally, for providing reliable comparisons between the performance of the regression, NN, and CANN models under the negative binomial assumption, k-fold validation was carried out to allow us to evaluate the model's predictive ability when different data configurations were considered for training and prediction purposes.
The rest of this article is as follows: in the next section, we present a description of the data used in this paper and we provide details of the exploratory analysis that was carried out for summarising the main data characteristics.A number of data considerations, prior to undertaking any modelling approaches, are also discussed.In Section 3, we provide a detailed description of the regression, NN, and CANN models that were used for modelling the admission rates.In Section 4, we present hyperparameter tuning and determine the approach that was followed for fitting the network-based models.Various model improvement approaches, such as regularisation, dropout, and early stopping, were adopted for the network models, and are presented in Section 5.In Section 6, a comparison of the predictive performance of the competing models under Poisson and negative binomial distributional assumptions is presented, along with bias-regularised versions and a nagging predictor for the negative binomial NN and CANN models.Furthermore, an evaluation of the performance of the negative binomial regression, NN, and CANN models via k-fold validation is carried out in Section 7. Finally, concluding remarks are provided in Section 8.

Data
The admission data for respiratory diseases were constructed using the Commercial Claims and Encounters Database of Merative MarketScan Research Databases provided by Merative (previously called IBM Watson Health).The data set contains individual-level information regarding admissions to hospitals and other health facilities, along with the claims and encounter data over time, linked with patient information and service provider details.For categorizing the admissions with respect to the primary cause of admission, the International Statistical Classification of Diseases and Related Health Problems 10th Revision-Clinical Modification (ICD-10-CM) groupings were used (see Table A1).The data set was constructed by combining the admission information and the enrollment details for the year 2016, and the details of the different variables in the data set constructed are given in Table 1.
The ENROLID is a unique identifier assigned to each individual.The EMPREL variable specifies the individual's relationship to the primary beneficiary/employee.The different plan types defined by the PLANTYP variable vary in terms of their characteristics, such as incentives for using specific service providers, deductibles, copay, etc. . . .For more details regarding the different plan types, see Table A3.The geographical variable EGEOLOC gives more granular information regarding the primary beneficiary's residence location than the REGION variable.The UR variable was created using the Metropolitan Statistical Area (MSA) variable by assigning value 1 (rural) if the primary beneficiary resides in a non-MSA or rural area and value 2 (urban) if the primary beneficiary resides in MSAs.The EECLASS and EESTATU variables give information regarding the employment of the individual.The DATATYP indicates whether the individual's health plan operates as a fee-for-service plan or a capitation plan.The difference is that the fee-for-service works on a reimbursement basis while the encounter record arises from fully or partially capitated managed care plans.The prepaid capitation amount paid by the employer or the health plan to the service provider could either be on an individual basis or a bulk basis.The details of the primary beneficiary are assigned to the spouse and other dependents as well.Prior to undertaking any modelling, several data considerations and feature preprocessing were undertaken, details of which are described in the following subsections.

Data Description
The admissions records of an individual within two days were treated as a single admission.Data variables with a high proportion of missing data (e.g., INDSTRY) were excluded.For the other variables, only complete cases were considered as the proportion of missingness was less than 2%.Variables, such as HLTHPLAN, DATATYP, and REGION were excluded due to the high level of relationship with other variables.The HLTHPLAN variable had a high level of association with EECLASS and EESTATU variables.The employment-related information, particularly the employment classification, was only available if the data came from the employer (HLTHPLAN = 0).Hence, for all the individuals whose data came from health plans, the EECLASS was unknown (category 9).The association between PLANTYP and DATATYP arises since the value of the DATATYP variable is 2 (Encounter) only for individuals under the Health Maintenance Organization plan (PLANTYP = 4) and Capitated (Cap) or Partially Capitated (PartCap) Point-of-Service plan (PLANTYP = 7).The EGEOLOC variable was chosen over the REGION variable as it gives more granular information regarding an individual's location of residence.
After the above-mentioned data considerations, the data for the age range 30-65 comprises 2,050,100 records, with 1,902,840 unique individuals and 100,212 admissions due to various diseases.The age range selection of 30 to 65 was used as the focus is on the working population and 30-65 seemed to be a reasonable choice for working age.The circumstances of some individuals changed over the year and, hence, their risk characteristics changed as well.Multiple records with different risk characteristics exist for those individuals for the corresponding period.This results in having more records than the number of unique individuals.Although multiple records might exist for some individuals, the sum of the exposure over different records for any given individual is less than or equal to one year.The distribution of admissions over the different ICD chapters is shown in Figure 1.A1).
Among these 100,212 admissions, 4525 were related to respiratory diseases (ICD chapter 10), and some individuals had more than one admission.The most prominent cause of admission was pregnancy and childbirth (ICD chapter 15), followed by diseases related to the musculoskeletal system and connective tissue (ICD chapter 13), diseases of the circulatory system (ICD chapter 9), and diseases of the digestive system (ICD chapter 11).Admission frequencies due to respiratory diseases are given in Table 2, and the age-wise crude rates of admission for different categories of the variables such as SEX, UR, and EECLASS are shown in Figure 2.  As unusually high numbers of individual admissions are often the result of common healthcare data practices, the ceiling for the number of admissions for an individual was set as five in the current analysis.Hence, the final data set involves 4513 admissions related to respiratory diseases.In order to determine whether to use EXPOSURE as an offset variable, an exploratory analysis similar to the one mentioned in Ferrario et al. (2020) was carried out.The records were categorised using an additional 'Exposure group' variable defined by E k = k−1 10 , k 10 with k = 1, . . .10, which indicated that a significant proportion of the 2,050,100 records belonged to group E 10 with exposure e i ∈ (0.9, 1].A group-wise empirical frequency was also calculated by fk = where Y i represent the number of admissions for the ith record (see Table 3).Even though the empirical frequencies of exposure groups give no evidence of suspecting the non-linearity of exposure, to identify any potential relationship between the exposure and other features, the distribution of exposure over other variables was also investigated (see Figure 3).Although the analysis indicates a slight variation in the exposure composition over different levels of some variables, the evidence is not strong enough to consider alternative treatments of the exposure variable.For instance, for the EESTATU variable, unlike the other categories, for category six (Comprehensive Omnibus Budget Reconciliation Act (COBRA) continues), the exposure seems to be evenly distributed.COBRA allows employees and their dependents to continue with a health plan provided by an employer for a particular period even after the cessation of their employment with that employer.Individuals with EESTATU six (COBRA) may have reached the end of that specified period sometime during the year of the study (2016).Although this could be a case of right censoring, we lack sufficient information to confirm this.Hence, it was decided to proceed with treating exposure as an offset variable.

Data Pre-Processing
The full data set was randomly split into a learning set D and a testing set T .The learning set and testing set were created using a 90:10 split, comprising 1,845,090 and 205,010 records, respectively.All models were fitted using the learning data set, and the performances of the models were evaluated on the testing set under the assumption that the underlying modelling assumptions held for both sets of data.
As discussed by Ferrario et al. (2020), the proper working of the gradient descent methods (GDMs) used for neural network model fitting required the different feature components to be on an identical scale.In order to adjust the scaling, for the continuous and binary variables, such as AGE, SEX, and UR, a min-max scaler was adopted, which transformed the variable to a scale of [−1, 1] using the formula: where m j and M j represent the minimum and maximum values of variable x j .For the binary variables SEX and UR, the values would have been replaced by either −1 or 1.For all remaining categorical variables, since they were nominal in nature, a dummy encoding was used.Under dummy encoding, a categorical variable with l levels was represented using a (l − 1) dimensional feature vector, with a reference category and value 1 being used to identify the actual level for a data record (see Equation ( 2)).An alternative option was to use one-hot encoding, which would have resulted in an (l) dimensional feature vector without a reference level (Equation ( 3)).For a categorical variable x j , with l categories c 1 , c 2 , .., c l , treating c l as the reference level, dummy encoding is given by whereas one-hot encoding would take the form As both approaches increase the dimension of the feature space, for neural network models, data embedding was implemented using an embedding layer, which facilitated a lower dimensional representation of the categorical variables.The embedding layer approach was proposed by Bengio et al. (2000) and was adopted in an insurance context by Richman and Wüthrich (2021).The embedding layer maps a categorical variable x j with l levels to a low dimensional real-valued vector of dimension v ((v < l) (i.e., f : x j → R v ).The value of v, which is treated as a hyperparameter, needs to be decided, taking into consideration that it impacts the complexity of the model.In the current context, we chose v = 2 for creating the embedding layers for categorical variables (see Figure 4).

Regression Models
We first consider a Poisson GLM and a negative binomial regression model, for modelling count data (e.g., Frees 2009;Hardin et al. 2007).For the Poisson GLM, we assume that for i = 1, . . ., n, n being the number of records in the learning data set D, admission numbers, Y i , follow a Poisson distribution with where the mean (µ Pois i = λ i e i ) depends on the policyholder's characteristics x i through λ i = exp(β x i ), and the exposure e i .By choosing the logarithmic link function, which is, in fact, with the canonical link function for the Poisson GLM, we have a predictor of the form where X ⊂ R q is the feature space with x i = (x i,1 , . . ., x i,q ) giving the feature information.o i = log(e i ) is the offset term, and β = (β 11 , .., β q1 ) is the unknown vector of coefficients to be estimated.The β, x i represent the inner products of vectors β and x i and are equivalent to β x i ; both notations are used interchangeably in this paper.
Similarly, for the negative binomial regression model, admission numbers, Y i , i = 1, . . ., n, are assumed to follow a negative binomial distribution with a dispersion parameter φ > 0: For a logarithmic link function, the predictor µ NB i has the form

Neural Network Model
In what follows, we consider the basic feed-forward neural network (FFNN) model.Generally, a FFNN comprises an input layer, one or more hidden layers, and an output layer.The feature space X makes up the input layer, and each of the hidden layers comprises a set number of neurons.The output from a given hidden layer acts as the input for the next layer.The output from a neuron depends on the linear combination of the output from the previous layer and the choice of activation function assigned to the layer that it is part of (see Section 4 for more details on activation function).The number of hidden layers d ∈ N is treated as a hyperparameter and is also referred to as the depth of the network.The last layer of the architecture, which is connected to the last hidden layer, is the output layer.In a neural network architecture, each layer is a function of the previous layer (see LeCun et al. (2015) and Ferrario et al. (2020) for more details).The mth hidden layer z (m) , 1 ≤ m ≤ d with dimension q m ∈ N can be defined as where the neurons z with ψ : R → R, being the activation function and β (m) j ∈ R q m−1 the network parameters.
In addition to the first intercept component, z (m) depends on β ∈ R q m −1 .For q 0 = q, with q being the dimension of the feature space X , the network parameter A diagrammatic representation of a feed-forward neural network with an embedding layer and three hidden layers with 20,15,10 neurons in each layer, are shown in Figure 4.
Under a neural network regression model, the predictors µ Pois and µ NB of the traditional Poisson and negative binomial regression models are replaced by the neural network predictors µ PoisNN and µ NBNN .To illustrate the structure of the network-based model, we will refer to the models under the Poisson distributional assumption.For an FFNN model with depth d under the Poisson assumption, the predictor is of the form for i = 1, . . ., n, where β (d+1) ∈ R q d are the weights that map the neurons of the last hidden layer z d to the output layer R + .

CANN
As mentioned earlier, the CANN model has an additional regression function nested into the model predictor using a skip connection.Under the Poisson distributional assumption, the model predictor of the CANN model with depth d ∈ N has the form with i = 1, . . ., n and parameter vector β = β Pois , β (d+1) ∈ R q 0 +r .The β Pois vector represents the parameters associated with the skip connection.The three terms on the right-hand side of Equation ( 13) represent the offset, regression function/skip connection, and the network function, respectively.A schematic representation of a sample CANN model with three hidden layers and 20,15,10 neurons in each layer is shown in Figure 5.As shown in Figure 5, feature space X is directly linked to the output layer.Different variants of CANN exist depending on whether the weights in the regression part are updated or not whilst training the model (Schelldorfer and Wuthrich 2019).Owing to the ease of implementation and following the literature, we focus on the variant in which the weights of the regression component are kept fixed as the iterated weighted least squares (IWLS) estimate βPois from the regression model.This particular variant of the CANN can be implemented by replacing the offset term o i in Equation ( 13) with The approach can be enacted indifferently under both Poisson and negative binomial distribution assumptions by adjusting the likelihood function under each model and altering the sample code given in Listing A1.For the models under the Poisson distribution assumption, with y i being the response variable and µ i the mean (µ Pois i for regression, µ PoisNN i for FFNN and µ PoisCANN i for CANN), the log-likelihood is of the form Likewise, for the negative binomial models, under the NB2 parameterisation (Hardin et al. 2007), the log-likelihood is given by The details of the fitting of the different models discussed so far are discussed in detail in the next section.

Model Fitting
The different data considerations and exploratory analysis described in the earlier sections, as well as the development of models described in the previous section, were carried out using the programming language R using RStudio IDE (R Core Team 2021; RStudio Team 2021).The Poisson and negative binomial regression models were fitted using the glm() function in the stats package and the gamlss() function in the gamlss package.The glm() uses the IWLS method whereas gamlss() function uses the Rigby and Stasinopoulos (RS) algorithm, for estimating the model coefficients β (R Core Team 2021; Rigby and Stasinopoulos 2005).The two main packages utilised for developing the neural-network-based models are keras and TensorFlow packages, details of which can be found in the respective manuals Allaire and Chollet (2021) and Allaire and Tang (2021).The important snippets of code developed for implementing NN and CANN models, as well as the different model improvement approaches, are given in the Appendix A.

Hyperparameters
The development of the network-based models was carried out in different stages.The first step of constructing the models involves determining the hyperparameters, such as the number of hidden layers, choice of activation function, and the gradient descent method (GDM) used for model training.In terms of the aspects mentioned above, following earlier work by Ferrario et al. (2020) in a similar context, the following assumptions were made: • Number of hidden layers: the number of hidden layers was kept at three.• Activation function: for hidden layers, the hyperbolic tangent function, ψ(x) = tanh(x), was used.Any alternate non-linear activation function would work.The motivation behind a non-linear activation function is that a non-linear activation function allows for a non-linear model space, reducing the number of nodes needed and allowing the network to automatically capture the interaction effect of different features.For the output layer, an exponential function was used, which is the inverse of the link function (g(.) = ln()) and, therefore, is in line with the underlying distributional assumption.

•
Gradient descent method: the neural network training utilises a gradient descent optimisation algorithm for estimating the model weights.Ferrario et al. (2020) compared different GDMs in terms of performance and identified the Nesterov-accelerated adaptive moment estimation (Nadam) method as performing better compared to other similar methods.Hence, we also adapted the Nadam as the choice of GDM.
An overview of the different GDMs is given in Ruder ( 2016), and additional details regarding the 'Nadam' method could be found in Dozat (2016).

•
Validation set: the training of neural network models requires further splitting the learning set into a training set, D (−) , and a validation data set, V.The validation data set is used as the evaluation set during the iterative process for estimating the model weights.In other words, V tracks possible overfitting of the model to D (−) .For the network-based models discussed here, and an 80:20 split was used for the training and validation data sets.Once the training is complete, the final performance of the fitted model is assessed using the testing set.

•
Loss function: the loss function is the objective function that the GDM algorithm minimises in order to estimate the model weights (Goodfellow et al. 2016).Numerous options exist in terms of the choice of the loss function.For instance, mean squared error (MSE), mean absolute error (MAE), and deviance loss are some of the popular choices of loss functions used in a regression problem.For our context, we adapted deviance loss as the loss function.The motivation behind the particular choice is that minimising the deviance loss is equivalent to maximising the corresponding log-likelihood function, which gives the MLE.The deviance loss is defined as the difference between the log-likelihood of the saturated or full model and the fitted model, and for a data set A the Poisson deviance loss is given by where μi denotes the fitted mean and y i the observed number of admissions for i = 1, . . ., n.Similarly, for models under the negative binomial distributional assumption, the deviance loss function has the form: The GDM estimates the model weights by minimising the deviance loss L D for the learning set, and the performance of the thus fitted models can be compared using the deviance loss L T for the testing set.As described earlier, the GDM updates model weights with an improved choice in an iterative manner.The iterative updation of model weights under the GDM could be represented as where [t] indicates the algorithmic step, and ρ t+1 > 0 gives the learning rate and D (−) s are the mini-batches or batches (see next section for more details on batches).The learning rate determines the size of each step and influences the speed of movement towards the optimal model weights.Since the primary focus of this work is the predictive modelling of admission rates, more details relating to the functionalities of the neural network model training are not discussed further (see, e.g., Russell 2010; Goodfellow et al. 2016).

Batch Size and Epochs
In addition to the model choices discussed above, the other main model attributes that need to be determined are the number of neurons in each layer, batch size, and epochs.Due to the computational burden of considering a large data set at once, during the training of a neural network, the data in the training set D (−)  The ideal choice of batch size must be determined in conjunction with the epochs as it determines the total number of GDM steps undertaken during model training, which impacts the model performance.Due to the size of the data in this work, we determined the batch size and epochs using trial and error.Two approaches, outlined below, were undertaken to determine a reasonable choice of batch size, epochs, and network architecture.The two approaches were considered using a Poisson NN with seven different model architectures of varying complexity.For all seven architectures, three hidden layers were used (100,75,50), with (75,50,25), (50,35,25), (35,25,20), (25,20,15), (20,15,10), and (15,10,5) neurons in each of the three hidden layers.The choice of the batch size epoch and network architecture thus identified from the exercise was then adopted for Poisson CANN and the network-based models under negative binomial distribution assumption.
Approach 1: batch size is varied, keeping the epochs fixed; then, for the batch size giving the best performance, different epochs are considered.

•
Step 1: Initially, for different model architectures of varying complexities, different batch sizes were considered, keeping the number of epochs fixed to 1000.All considered models involve three layers, with a different number of nodes in each layer.The different model architectures were fitted using batch sizes of 10,000, 30,000, 50,000, 75,000, 100,000, 175,000, 250,000, 500,000, and 750,000.The performances of the models were compared using the testing (out-of-sample) loss L T , i.e., the deviance loss (Equation ( 17)) under the testing data set.The results are shown in Tables A4 and A5 and illustrated in Figure 6.The tables also show the learning (in-sample) deviance loss, L D , and the portfolio average, i.e., the average fitted mean μ for the full data set under the considered models.All models, irrespective of their level of complexity, performed well with a batch size of 175,000.As anticipated, complex models had a higher testing loss with smaller batch sizes due to over-fitting.In general, the testing loss presented a decreasing trend for all considered models as the batch size increased from 10,000 to 175,000.For batch sizes greater than 175,000, both testing loss and learning loss for simpler models started to rise.This indicates under-fitting and shows that for batch sizes greater than 175,000, the complexity of simpler models with fewer neurons in the hidden layers is insufficient to fit the data effectively.Hence, this analysis suggested choosing a batch size of 175,000.Nevertheless, as all models had a comparable testing loss for batch size 175,000, three of the simpler models NN (25,20,15), NN (20,15,10), and NN (15,10,5) were considered further for identifying the optimal number of epochs.

•
Step 2: in order to find the optimal number of epochs, the NN (25,20,15), NN (20,15,10), and NN (15,10,5) models were fit using different choices of epochs 100, 250, 500, 1000, 1500, and 2000 , keeping the batch size fixed at 175,000.The results of these models are given in Table 4.For the models considered in step 2, the testing loss was lowest when the number of epochs was 1500.Hence, the combination of the batch size equal to 175,000 and 1500 epochs was deemed optimal under this approach.
Approach 2: epochs number is varied, keeping the batch size fixed; then, for the epochs number giving the best performance, different batch sizes are considered.The same steps as those followed in Approach 1 were carried out under this approach as well but in an alternate order.Model architectures, batch sizes, and numbers of epochs are also the same as in Approach 1. • Step 1: we now initially alter the number of epochs, keeping the batch size fixed at 30,000.The results of this step are given in Table A6.For all considered model architectures, except for NN (100,75,50), the testing loss was lower than that for the Poisson regression model, when the number of epochs was 250.Moreover, the testing loss was lowest for all considered model architectures when the number of epochs was 250 (see Figure 7).Hence, the number of epochs was chosen as 250.• Step 2: For all model architectures other than (100,75,50), different batch sizes were considered with the number of epochs fixed at 250.The results are given in Tables A7 and A8 and illustrated in Figure 8.With a batch size of 30,000, all models, except for NN (15,10,5), had testing losses lower than that for the Poisson regression model.When the batch size increased to 50,000, the NN (50,35,25) also had a similar testing loss.However, a batch size of 30,000 was chosen, as all models performed well under this choice.The combination of a batch size of 30,000 and 250 epochs was deemed optimal under the second approach.

Comparison of Approaches
The two approaches yielded different combinations of batch sizes and epochs.Under Approach 1, the best batch size and epoch number combination was (175,000, 1500), whereas Approach 2 identified (30,000, 250) as the best combination.In order to compare these combinations, results from 50 separate calibrations with different starting points for GDM were considered for the NN (30,25,20), NN (25,20,15), NN (20,15,10), and NN (15,10,5) models (see Figure 9).The motivation behind considering different calibrations is the inherent randomness in the results of neural network models.Several aspects of the neural network model fitting contribute to this randomness and are discussed in detail in Section 6.2.Different calibrations were implemented by altering the seed value for the random number generator, which determines the initial value of model weights under GDM (see Section 6.2 for more details).The graphs clearly show that the combination of a batch size of 30,000 and 250 epochs performs better than a batch size of 175,000 and 1500 epochs, in terms of testing loss.The results also indicate that both (25,20,10) and (20,15,10) architectures have similar predictive performances.Hence, we consider both these architectures while investigating additional model improvements discussed in the following section.

Approaches for Preventing Over-Fitting
One of the most significant aspects that need to be addressed while training a neuralnetwork model is the overfitting of the model to the learning set, D, which may potentially affect the predictive performance of the model.Three of the most commonly used approaches for preventing over-fitting were considered here, i.e., regularisation, early stopping, and dropout.For comparing the different improvement approaches, Poisson NN models with (25,20,10) and (20,15,10) architectures were used.

Regularisation
Under this approach, a penalty function is considered for the loss function, controlled by a regularisation parameter that is added to the network parameters.Following (Ferrario et al. 2020), the modified loss function for the Poisson NN is given by where η is the regularisation parameter and θ − is the subset of network parameters considered for regularisation; ||θ − || p is the p -norm and gives ∑ j∈θ − |θ j | p 1/p .Ridge regularisation (p = 2) was selected over LASSO (least absolute shrinkage and selection operator regression) regularisation (p = 1), as the former penalises the parameters depending upon their values and not on the same scale (Ferrario et al. 2020).Following the literature, Hastie et al. (2009) and Ferrario et al. (2020), regularisation was applied to all network parameters except for the intercepts and the last output layer.The main criticism that regu-larisation faces is that it is heavily influenced by the choice of the regularisation parameter.For each of the (25,20,10) and (20,15,10) architectures, four choices of the regularisation parameter η were considered (η = 10 −1 , η = 10 −3 , η = 10 −5 and η = 10 −8 ) and the results are shown in Table 5 (see Listing A2 for the sample code).

Early Stopping
Generally, a neural network model starts to over-fit after a particular number of epochs.This is related to work in Section 4.2 on identifying the ideal number of epochs for a given batch size.The logic behind the early-stopping approach is to identify the ideal number of epochs above which the model starts to over-fit.In other words, the aim is to identify the number of epochs beyond which the validation loss starts to increase since an increase in validation loss indicates over-fitting of the model.Numerous ways of implementing early stopping exist, the details of which are discussed by Prechelt (1998).Our implementation employs a callback approach, which initially lets the model train for a large epoch.When the training is over, the model weights that gave the lowest value for the loss function on the validation set, are retrieved.
The early-stopping approach was implemented for both (25,20,15) and (20,15,10) architectures with a batch size of 30,000 and 1000 epochs.As anticipated, the models with early stopping performed better (see Table 6).For both model architectures, the ideal number of epochs that gave the best validation loss was around 250 (231 for (25,20,15) and 272 for (20,15,10)).Both models gave similar results as that of the already identified choice of 30,000 batch size and 250 epochs.The results indicate that the early-stopping approach is desirable over the version without early stopping (see Listing A3 for sample code).In the current context, since the early stopping was implemented using the callback approach, this overcomes the hurdle of identifying the optimal number of epochs and allows to have an arbitrarily large number of epochs (e.g., 500 epochs).((20,15,10), and (25,20,15) architectures) with and without early stopping.

Epochs Learning Loss
Testing Loss Portfolio Average NN (25,20,15) early stopped 2.6654 2.5456 0.0025 NN (25,20,15) 1000 epochs 2.5635 2.6704 0.0028 NN (20,15,10) early stopped 2.6623 2.5403 0.0025 NN (20,15,10) 1000 epochs 2.5865 2.6216 0.0027 5. 1.3. Dropout Under this approach, for each step of the gradient descent (50 GDM steps within each of the 250 epochs), each neuron is dropped with a probability p, independently of other neurons creating a thinned network.The model weights are shared among the different thinned networks considered and make up the final unthinned neural net.In other words, for a GDM step involving thinned network, the gradient of the weights of the dropped neurons is zero Srivastava et al. (2014).See Figure 10 for a sample representation of the dropout process for a NN (20,15,10) architecture.A fixed dropout rate was used for neurons in all three layers.Dropout rates of 1%, 2%, 5%, and 10% were considered for both model architectures.The results are shown in Table 7.When the dropout rate was 2%, the testing loss for the (25,20,15) architecture was similar to that of the implementation without dropout, whereas for the (20,15,10) architecture, the testing loss decreased (see Listing A4 for sample code).The portfolio average for all the models with dropout is slightly different from the data indicating bias at the portfolio level (see Section 6.1 for more details on the portfolio level bias of NN models and approaches for addressing it).Since the dropout rate of 2% performed better in comparison to 1%, 5%, and 10%, the former was adopted for further comparison where the randomness in the model results is also taken into consideration (see Section 5.1.4).((20,15,10) and (25,20,15) architecture) with dropout rates of 0%, 1%, 2%, 5% and, 10%.The model improvement approaches discussed above are subject to the inherent randomness in neural network training.Hence, for comparing the different approaches, 50 different calibrations were considered under each approach for both (25,20,15) and (20,15,10) architectures.The results from these different runs are illustrated in Figure 11.Comparisons among these approaches indicated that the early stopping and dropout approaches significantly improved predictive performance.Since both approaches can be applied on the same run, it was decided to adopt both simultaneously as well.For the (25,20,15) architecture, the combination of dropout and early-stopping approaches had a similar performance as that of the two approaches when applied individually.In contrast, the (20,15,10) architecture showed improvement in the predictive performance when both approaches were applied simultaneously (see Figure 11).Hence, it was decided to proceed by adopting the combined improvement approach (early stopping and dropout).

Negative Binomial Neural Network Models
A comparison was carried out between the different models under Poisson and negative binomial distribution assumptions.For the network-based models, a single calibration with the combined improvement approach (early stopping and dropout) was considered to directly compare with the regression models (see Table 8).For the networkbased models, (25,20,15) and (20,15,10) architectures were used with a batch size of 30,000 and 500 epochs for training the model.The models under the negative binomial distribution assumption show better predictive performance than those under the Poisson assumption.The main difference between the models under Poisson and negative binomial assumption is the additional dispersion parameter φ in the negative binomial distribution.The dispersion parameter is not trained as part of the neural network model fitting and is considered separately.For training the network-based models, the dispersion parameter determined from the negative binomial regression model is used as the initial value, and the parameter is adjusted once separately after the first round of model training.The adjustment factor used to update the dispersion parameter is the ratio between the testing loss of the best neural network model and the regression model.Once the dispersion parameter is modified by multiplying with the adjustment factor, the model is freshly trained using the new value of the dispersion parameter.
From the results shown in Table 8, it is evident that the models under the negative binomial distribution assumption have lower testing loss indicating a much better predictive performance compared to models under the Poisson assumption.It is true for NN and CANN models of 25,20,15 and 20,15,10 architectures.The results also show that the CANN model performed better than the NN model under the negative binomial assumption for both 25,20,15, and 20,15,10 architectures.The same is true for 20,15,10 architecture under the Poisson assumption.For the 25,20,15 architecture, the Poisson NN model had a slightly lower testing loss than the Poisson CANN model.The portfolio average from the different network-based models was slightly different compared to the actual data.This points toward the network-based model's failure to balance property at the portfolio level and is discussed in the section below.

Bias Regularisation
One main criticism faced by neural network models is that the balance property fails to hold on a population level: although the model gives accurate results on granular individual-level data, unbiasedness (or the balance property) fails on a portfolio level.In actuarial applications, this presents an important concern as the model can potentially lead to substantial mispricing at a population level.The root cause is the limited number of steps in gradient descent algorithms, which may restrict the parameter estimates from reaching the critical points of the Poisson deviance loss function Wüthrich (2020).For models under the negative binomial distributional assumption, the adopted log link function (which is not the canonical link function for the negative binomial distribution), can also contribute to bias in the results (Hilbe 2011).
A common bias regularisation approach is to adjust the intercept β (d+1) 0 in the linear function from the last layer of the neural network, which can be implemented by multiplying the results with a constant c given by where μ is the mean of the predicted values μi and μ is the mean of the observed admission numbers y i (Tzougas and Li 2021).This will ensure that the means of the modified predicted values are the same as the means of the observed values y i , such as As an alternative approach, the GLM regularisation method was also considered.Under this approach, a GLM is added after the last hidden layer in the neural network.In essence, the neural network acts as a pre-process and the output from the last hidden layer is used to fit a GLM to predict the response.The last hidden layer z (d) is considered a learned representation of the feature space X created by the network which could be viewed as a new augmented data set : i = 1, ..n) for modelling admission counts with n being the number of records in the original data set (Ferrario et al. 2020).In other words, the GLM produces IWLS estimates for the weights in the last layer, thereby ensuring unbiasedness and balance.The sample code for implementing this approach is given in Listing A5.For further discussion regarding the portfolio level unbalance of network results and details regarding the different bias regularisation approaches, we refer to Wüthrich (2021) and Wüthrich (2020).The GLM bias regularisation approach was considered using NN (20,15,10) and CANN (20,15,10) models under the negative binomial distribution assumption.This presents room for further potential improvement due to the choice of the log link function in the regression implemented for bias regularisation.Hence, the simple bias regularisation approach was also adopted, together with the regression bias regularisation method.In particular, for both the NN and CANN models, the bias regularisation was implemented by extracting their last hidden layers and feeding them into the corresponding negative binomial regression models for which their intercepts were adjusted in order to control the portfolio bias.Moreover, not that the bias regularisation approaches were applied in addition to the model improvement approaches of early stopping and dropout, with a batch size of 30,000 and 500 epochs.Table 9 shows the performance of the different models with and without bias-regularisation.Bias-regularisation has a clear impact, with the portfolio average of the relevant methods being the same as that of the observed data.(25,20,15) w/bias regu 1.0442 1.0115 0.0027

Nagging Predictor
As mentioned in Section 4.2.1, one of the main issues associated with neural network models is that results can vary among repeated runs.Aspects that can bring out this randomness are discussed in detail by Richman and Wüthrich (2020), and can include: These aspects are influenced by the choice of a random seed value for running the associated algorithms, and different seed values can potentially give slightly different results.Although the differences might be small, they should not be ignored in the context of applications of incidence rate models.The nagging predictor proposed by Richman and Wüthrich (2020) acts as a sensible approach to tackle this.The approach acts similar to the traditional bagging and aggregating approach (bagging), but without re-sampling.In other words, the aggregation occurs over the network, i.e., over different calibrations (seed values) rather than by re-sampling.The data composition in terms of the split between learning and testing data remains the same for all the calibrations, whereas the aspects of randomness, such as the ones mentioned above, vary.For the ith observation in the out-of-sample data, the nagging predictor is given by μ where µ (t) i is the predictor obtained for the ith observation with the tth network calibration (e.g., as given in Equation ( 12)) and M is the number of calibrations (seed) values considered.The testing (out-of-sample) loss for the nagging predictors is given by: where n is the number of observations in the testing data set and δ(Y i , μ(M) i ) represents the unit deviance.The nagging predictor approach was applied on top of the already identified improvement approaches (early stopping and dropout) and bias regularisation.The nagging predictors for M = 1, 2, 3 . . ., 50 were calculated for both the NN and CANN models under the negative binomial distributional assumption.Due to computational limitations, we only considered a (20,15,10) architecture for the NN and CANN models, and the results are shown in Table 10 and illustrated in Figure 12.
Table 10.Testing loss, learning loss, and portfolio average of the nagging predictor (M = 50) for the NN (20,15,10) and CANN (20,15,10)  The results suggest that the nagging predictor under the CANN approach performs better than the NN approach in terms of testing loss.Moreover, under both the NN and CANN approaches, as M increases, the testing loss starts to converge to a stabilised value, demonstrating a reduction in the variability of the prediction outcomes.

K-Fold Validation
In this section, we address the issue of potential variations in the NN model results, arising from the choice to split the data into learning and testing sets.The nagging predictor, presented in Section 6.2, does not account for this variability as it considers only those sources of randomness that arise once this split is done.Here, we consider a k-fold crossvalidation approach to analyse the impact of the learning/testing split on the results and compare the performance of different models.Under the k-fold cross-validation approach introduced by Geisser (1975), the full data set is initially split into k roughly equal sets.The models under consideration are then trained using the k − 1 set and validated/evaluated using the remaining set.The process is then repeated k times, altering the choice of validation set (Jung (2018)).In the current context, The value of k was set as k = 10 to maintain the 90:10 split of learning and testing data.Here, we used crossvalidation as a model selection procedure as discussed by Arlot and Celisse (2010) and not for training the model.The different models were compared using the average deviance loss value from the 10 folds (e.g., as in Equation ( 17)).As the models under the negative binomial distribution assumption demonstrated better predictive performances in our earlier analysis, compared to those under the Poisson assumption, only the former were considered in the k-fold validation.A nagging predictor (M = 25), together with the model improvement approaches discussed earlier (early stopping and dropout), was considered here for the network-based models.Bias regularisation was also applied, implementing both approaches as described in Section 6.1.The average testing and learning loss over the ten different folds are given in Table 11.The results from the cross-validation indicate that the NN performed better than the regression and CANN models.

Concluding Remarks
In this research, we developed an ensemble of models for predicting the rate of admissions related to respiratory diseases in an insured US population.The results indicate that the neural network-based models have better predictive performances compared to traditional GLM-type models.A potential reason for this may be the ability of neural network models to capture possible interactions of non-multiplicative types between the different features.Although, in principle, these interactions could also be captured in GLMtype models, they must be identified and specified explicitly within the model.Adapting traditional approaches based on methods, such as step-wise Akaike information criterion (AIC) or Bayesian information criterion (BIC) for identifying relevant interactions, can be tedious and time-consuming for complex and large data sets, as in this case.Our kfold cross-validation indicated that under an underlying negative binomial distributional assumption, the NN models gave better predictive performances, as determined by the testing data loss, than the GLM-type and CANN models.The better performances of the NN models compared to the CANN models could be partly due to the latter not involving training processes for the regression parts of the models.
The comprehensive nature of the developed models allows them to be extended with ease for modelling admission rates for other diseases, or other data in similar contexts and applications.The additional model improvement approaches, such as early stopping and dropout, further improved the predictive performances of the models.The nagging predictor addresses the inherent randomness in neural network results, and the adapted bias regularisation approach effectively resolved the population-level bias in the model results.
A potentially interesting line of further research is to develop zero-inflated versions of the NN-based models considered in this study, motivated by a high concentration of zeros in data of this nature.Furthermore, the development of hyperparameter tuning algorithms that can determine a range of different hyperparameters under a systematic search approach would be beneficial, especially in the context of large volumes of data involving high computational costs.

Figure 1 .
Figure 1.Distribution of admissions over different ICD chapters (details about chapters in TableA1).

Figure 2 .
Figure 2. Age-wise crude rates of admission due to respiratory diseases: (a) male, female; (b) urban and rural areas; (c) categories of EECLASS.

Figure 4 .
Figure 4.A sample representation of a feed-forward neural network with embedding layers and three hidden layers with 20,15,10 neurons in each layer.

Figure 5 .
Figure 5.An illustration of a sample CANN model with three hidden layers and 20,15,10 neurons in each layer.
are considered in smaller batches (D (−) 1 , . . ., D (−) S ) created randomly, having approximately the same size b ∈ N .The batch size b refers to the size of the smaller batches created.Epochs give the number of times that the full learning data set is iteratively considered during training(Ferrario et al. 2020).

Figure 6 .
Figure 6.Performance of the different models under the initial step of approach 1, under varying batch sizes: (a) testing loss for the Poisson NN models under different architectures.The red line shows the testing loss for the Poisson regression model; (b) change in the testing loss for the Poisson NN models under different architectures as the batch size increases.

Figure 7 .
Figure 7. Change in the testing (out-of-sample) loss for Poisson neural network models with different architectures as epochs increase.

Figure 8 .
Figure 8. Change in the deviance testing (out-of-sample) loss for the Poisson neural network models with different architectures as batch size increases.

Figure 9 .
Figure 9. Performance from 50 different calibrations for Poisson NN models under different architectures on testing data set, for the best combination of batch sizes and epochs identified in Approach 1 (175,000, 1500) and Approach 2 (30,000, 250).The horizontal red line shows the deviance loss value for the Poisson regression model.

Figure 10 .
Figure 10.An illustration of dropout process for a sample NN model with three hidden layers and 20,15,10 neurons in each layer.The gray crossed-out circles in each layer represent the neurons that were randomly dropped with probability p.

Figure 11 .
Figure 11.Performance from 50 different calibrations of the Poisson NN model under (25,20,15) and (20,15,10) architectures on testing and learning data sets for different model improvement approaches.

•
The split of the learning data into training and validation sets; • The split of the training data into mini-batches; • Model initialisation.

Table 1 .
Description of variables in the admission data set.

Table 2 .
Frequency of number of admissions related to respiratory diseases for individuals under the age range 30-65.

Table 3 .
Relative number of records and empirical frequency for each exposure group.

Table 6 .
The testing loss, learning loss, and portfolio average of the Poisson neural network models

Table 7 .
Testing loss and learning loss of the Poisson neural network models

Table 8 .
Testing loss, learning loss, portfolio average of regression, and network-based models under the Poisson and negative binomial distributional assumptions.

Table 9 .
Testing loss, learning loss, and portfolio average of regression and network-based models under negative binomial distributional assumptions with and without bias regularisation.

Table 11 .
Average of testing loss, learning loss, and portfolio average of regression and network-based models under negative binomial distributional assumptions from the 10 different folds.

Table A3 .
Details regarding the characteristics of different plan types.

Table A4 .
The testing loss, learning loss, and portfolio average of Poisson regression model and the Poisson neural network models with

Table A5 .
The testing loss, learning loss, and portfolio average of the Poisson neural network models with (100,75,50),

Table A7 .
The testing loss, learning loss, and portfolio average of the Poisson neural network models with different architectures for 10,000, 30,000, 50,000, 75,000, 100,000, and 175,000 batch sizes.