Dirichlet Process Log Skew-Normal Mixture with a Missing-at-Random Covariate in Insurance Claim Analysis
Abstract
1. Introduction
 RQ1. If additional unobservable heterogeneity is introduced by the inclusion of covariates, then what is the best method to capture the within-cluster heterogeneity in modeling the total losses, compared with several conventional approaches?
 RQ2. If additional estimation bias results from the use of incomplete covariates under missing-at-random (MAR) conditions, then what is the best way to increase the imputation efficiency, compared with several conventional approaches?
 RQ3. If an individual loss is distributed with log-normal densities, then what is the best way to approximate the sum of the log-normal outcome variables, compared with several conventional approaches?
2. Discussion on the Research Questions and Related Work
2.1. Can the Dirichlet Process Capture the Heterogeneity and Bias? RQ1 and RQ2
2.2. Can a Log Skew-Normal Mixture Approximate the Log-Normal Convolution? RQ3
2.3. Our Contributions and Paper Outline
3. Model: DP Log Skew-Normal Mixture for ${S}_{h}\mid \mathit{X}$
3.1. Background
 ${\mathit{\theta}}_{j}$: the parameters of the outcome variable defined by cluster j.
 ${w}_{j}$: the parameters of the covariates defined by cluster j.
3.2. Model Formulation with Discrete and Continuous Clusters
3.3. Modeling ${S}_{h}\mid {\mathit{X}}_{h}$ with a Complete-Case Covariate
 Stage 1.
 Cluster membership update:
 Step I.
 Let ${s}_{h}$ denote the cluster index $j=1,2,\cdots ,J$ for observation h. First, the cluster membership j is initialized by a clustering method such as hierarchical or k-means clustering. This provides an initial clustering of the data $({S}_{h},{\mathit{X}}_{h})$ as well as the initial number of clusters.
 Step II.
 Next, with the parameters sampled from the DPM prior ${G}_{0}$ described in Section 4.1 and the conditional probability term $p({s}_{h}\mid {s}_{-h})$ on lines 6 and 9 in Algorithm A2 for the observation assignment, the probabilities of the selected observation h being in each current discrete cluster and in the proposed continuous cluster are computed, respectively. (Using such a nonparametric prior for the development of a new continuous cluster allows the shape of the cluster to be driven by the data.) Note that the term $p({s}_{h}\mid {s}_{-h})$ is known as the Chinese Restaurant process probability (see Blei and Frazier 2011), given by$$p({s}_{h}\mid {s}_{-h})=\begin{cases}c\cdot {\displaystyle \frac{{n}_{j}^{-h}}{\alpha +H-1}}, & \text{for } h \text{ entering the existing cluster: } {s}_{h}=j,\\[6pt] c\cdot {\displaystyle \frac{\alpha }{\alpha +H-1}}, & \text{for } h \text{ entering the new cluster: } {s}_{h}=J+1.\end{cases}$$
 Step III.
 Lastly, the new cluster membership is determined and updated via the Polya urn scheme, using a multinomial draw based on the resulting cluster probabilities. This is briefly illustrated in Figure 2, which also shows how the cluster weighting components ${\mathit{\omega}}_{j}^{*},{\mathit{\omega}}_{J+1}^{*}$ in Equations (6a) and (6b) are constructed.
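The assignment in Steps II and III can be sketched compactly. The snippet below is an illustrative sketch only, not the paper's implementation: the likelihood values, the cluster counts, and all numeric settings are made-up placeholders, and the normalization of the probability vector plays the role of the constant c.

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_assignment_probs(counts_minus_h, alpha, H, lik_existing, lik_new):
    """Chinese Restaurant process assignment probabilities for one observation h.

    counts_minus_h : n_j^{-h}, cluster sizes with observation h removed
    lik_existing   : likelihood of observation h under each existing cluster
    lik_new        : marginal likelihood of observation h under the base measure G0
    """
    p_existing = counts_minus_h / (alpha + H - 1.0) * lik_existing
    p_new = alpha / (alpha + H - 1.0) * lik_new
    p = np.append(p_existing, p_new)
    return p / p.sum()  # normalizing plays the role of the constant c

# toy example: three existing clusters among H = 11 observations
probs = crp_assignment_probs(np.array([4.0, 3.0, 3.0]), alpha=1.0, H=11,
                             lik_existing=np.array([0.20, 0.05, 0.01]),
                             lik_new=0.03)
s_h = int(rng.choice(len(probs), p=probs))  # multinomial draw -> updated membership
```

The last index of `probs` corresponds to the proposed continuous cluster $J+1$; drawing it would open a new cluster.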
 Stage 2.
 Parameter update:
 Once all observations have been assigned to particular clusters $j=1,2,\cdots ,J$ at a given iteration of the Gibbs sampling, the parameters of interest—$\alpha $ and ${\mathit{\theta}}_{j},{\mathit{w}}_{j}$—for each cluster are updated, given the new cluster membership. This is accomplished using the posterior densities $p(\alpha \mid J)$, $p(\mathit{\theta}\mid {S}_{h},{\mathit{X}}_{h})$, and $p(\mathit{w}\mid {\mathit{X}}_{h})$, in which ${S}_{h},{\mathit{X}}_{h}$ represent all observations in cluster j. The forms of the prior and posterior densities from lines 17 to 23 in Algorithm A2 that are used to simulate the parameters $\{{\alpha}^{*},{\mathit{\theta}}_{j}^{*},{\mathit{w}}_{j}^{*}\}$ are detailed in Appendix A.
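For the precision $\alpha$, the posterior $p(\alpha \mid J)$ takes a gamma-mixture form, consistent with the auxiliary quantities $\eta$ and ${\pi}_{\eta}$ listed in the Variable Definitions. Below is a hedged sketch of the standard Escobar–West auxiliary-variable update; the hyperparameters `g0`, `h0` and all numeric values are placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_alpha(alpha, J, H, g0, h0, rng):
    """One Escobar-West auxiliary-variable draw of the DP precision alpha,
    given J occupied clusters, H observations, and an alpha ~ Gamma(g0, h0)
    prior (rate parameterization)."""
    eta = rng.beta(alpha + 1.0, H)                    # auxiliary variable eta
    odds = (g0 + J - 1.0) / (H * (h0 - np.log(eta)))  # pi_eta / (1 - pi_eta)
    pi_eta = odds / (1.0 + odds)                      # gamma-mixture weight
    shape = g0 + J if rng.random() < pi_eta else g0 + J - 1.0
    return rng.gamma(shape, 1.0 / (h0 - np.log(eta)))  # numpy uses the scale form

alpha, draws = 1.0, []
for _ in range(2000):
    alpha = update_alpha(alpha, J=5, H=200, g0=2.0, h0=1.0, rng=rng)
    draws.append(alpha)
```

A larger posterior draw of $\alpha$ makes the creation of new clusters more likely at the next membership update, matching the role of $\alpha$ in the Chinese Restaurant process probability.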
3.4. Modeling ${S}_{h}\mid {\mathit{X}}_{h}$ with the MAR Covariate
 (a)
 Adding an imputation step in the parameter update stage: The missing covariate affects the parameter ($\mathit{\theta},\mathit{w}$) update. For the covariate parameters ${\mathit{w}}_{j}=\{{\pi}_{j},{\mu}_{j},{\tau}_{j}^{2}\}$, only the observations h without the missing covariate are used for updating. If a cluster has no observations with complete data for that covariate, a draw from the prior distribution for $\{{\pi}_{j},{\mu}_{j},{\tau}_{j}^{2}\}$ is used instead. For the outcome parameters ${\mathit{\theta}}_{j}=\{{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j}\}$, however, we must first impute values for the missing covariate ${x}_{1h}$ for all observations h within cluster j. Since we have already defined a full joint model—$f({S}_{h}\mid {\mathit{X}}_{h},{\mathit{\theta}}_{j})\cdot f({\mathit{X}}_{h}\mid {\mathit{w}}_{j})$—in Section 3.2, we can obtain draws for the MAR covariate ${x}_{1h}$ from the imputation model$${f}_{Bern}({x}_{1h}\mid {S}_{h},{x}_{2h},{\mathit{\theta}}_{j},{\mathit{w}}_{j})\propto f({S}_{h}\mid {\mathit{X}}_{h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j})\cdot {f}_{Bern}({x}_{1h}\mid {\pi}_{j})$$
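As a concrete illustration of the imputation model above: the full conditional of a missing binary covariate is Bernoulli, with success probability obtained by evaluating the outcome likelihood at ${x}_{1h}=1$ and ${x}_{1h}=0$ and normalizing. The sketch below assumes a zero-free log skew-normal outcome density and placeholder parameter values; none of the numbers come from the paper's fitted model.

```python
import math

import numpy as np

rng = np.random.default_rng(2)

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logsn_pdf(s, mean_log, sigma, xi):
    """Log skew-normal density of a positive outcome s (illustrative helper)."""
    z = (math.log(s) - mean_log) / sigma
    return 2.0 / (sigma * s) * norm_pdf(z) * norm_cdf(xi * z)

def impute_x1(s_h, x2h, beta, sigma, xi, pi_j, rng):
    """Draw the missing binary covariate from its full conditional,
    f(x1 | S, x2) proportional to f(S | x1, x2, theta) * Bern(x1 | pi_j)."""
    lik1 = logsn_pdf(s_h, beta[0] + beta[1] + beta[2] * x2h, sigma, xi) * pi_j
    lik0 = logsn_pdf(s_h, beta[0] + beta[2] * x2h, sigma, xi) * (1.0 - pi_j)
    p1 = lik1 / (lik0 + lik1)  # normalized Bernoulli success probability
    return int(rng.random() < p1), p1

x1_draw, p1 = impute_x1(s_h=1500.0, x2h=math.log(2.0e4),
                        beta=(1.0, 0.5, 0.6), sigma=1.2, xi=2.0,
                        pi_j=0.3, rng=rng)
```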
 (b)
 Adding a re-clustering step in the cluster membership update stage: To calculate each cluster probability after the parameter updates, the algorithm redefines the two main components: (1) the covariate model and (2) the outcome model. For the covariate model $f({\mathit{X}}_{h}\mid {\mathit{w}}_{j})$, we set this equal to the density functions of only those covariates with complete data for observation h. Assuming that ${\mathit{X}}_{h}=\{{x}_{1h},{x}_{2h}\}$ and the covariate ${\mathit{x}}_{1}$ is missing for observation h, we drop ${x}_{1h}$ and use only ${x}_{2h}$ in the covariate model:$$f({\mathit{X}}_{h}\mid {\mathit{w}}_{j})={f}_{N}({x}_{2h}\mid {\mathit{w}}_{2j})$$This is the refined covariate model for cluster j with observation h, where the data in ${\mathit{x}}_{1}$ are not available. For the outcome model $f({S}_{h}\mid {\mathit{X}}_{h},{\mathit{\theta}}_{j})$, the algorithm simply takes the imputation model in Equation (8) for observation h and integrates out the missing covariate ${x}_{1h}$. This reduces the additional variance introduced by the imputations. In other words, as the covariate ${\mathit{x}}_{1}$ is missing for observation h, this missing covariate can be removed from the ${\mathit{X}}_{h}$ term being conditioned on. Therefore, the refined outcome model is$$f({S}_{h}\mid {x}_{2h},{\mathit{\theta}}_{j})\propto \int f({S}_{h}\mid {\mathit{X}}_{h},{\mathit{\theta}}_{j})\cdot {f}_{Bern}({x}_{1h}\mid {\mathit{w}}_{1j})\,d{x}_{1h}$$The same process is performed for each observation with missing data and for each combination of missing covariates. Hence, using Equations (9) and (10), the cluster probabilities and the predictive distribution can be obtained as illustrated in Step III in Figure 4.
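For a single binary MAR covariate, the integral above reduces to a two-term sum over ${x}_{1h}\in \{0,1\}$. The sketch below uses an illustrative log skew-normal outcome density with placeholder parameters (the zero-outcome δ component is omitted) and numerically verifies that the marginalized outcome model is still a proper density.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logsn_pdf(s, mean_log, sigma, xi):
    """Log skew-normal density of a positive outcome s (illustrative helper)."""
    z = (math.log(s) - mean_log) / sigma
    return 2.0 / (sigma * s) * norm_pdf(z) * norm_cdf(xi * z)

def outcome_marginal(s, x2, beta, sigma, xi, pi_j):
    """Marginalized outcome model for a binary missing covariate: the integral
    over x1 reduces to a two-term sum weighted by the Bernoulli probabilities."""
    return sum(logsn_pdf(s, beta[0] + beta[1] * x1 + beta[2] * x2, sigma, xi)
               * (pi_j if x1 == 1 else 1.0 - pi_j)
               for x1 in (0, 1))

# sanity check: the marginalized density still integrates to 1 over s > 0
beta, sigma, xi, pi_j, x2 = (1.0, 0.5, 0.6), 1.2, 2.0, 0.3, 1.5
us = [-10.0 + 20.0 * k / 20000 for k in range(20001)]  # grid on u = log s
du = us[1] - us[0]
total = sum(outcome_marginal(math.exp(u), x2, beta, sigma, xi, pi_j) * math.exp(u)
            for u in us) * du
```

Integrating in $u=\log s$ (with the Jacobian $e^{u}$) keeps the Riemann sum numerically stable across the heavy right tail.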
 (c)
 Re-updating the parameters: The cluster probability computation is followed by parameter re-estimation for each cluster, illustrated via the diagram in Figure 5. This follows the same idea as the parameter ($\mathit{\theta},\mathit{w}$) update discussed in Section 3.3.
3.5. Gibbs Sampler Modification in Detail for the MAR Covariate
 (a)
 In line 6, with the presence of a missing covariate ${x}_{1h}$, the modification of the cluster probability for the observation $({S}_{h},{x}_{2h})$, where ${x}_{1h}$ is missing, that belongs to the discrete cluster j can be made as follows:$$P({s}_{h}=j)=p({s}_{h}\mid {s}_{-h})\cdot f({x}_{2h}\mid {\mu}_{j},{\tau}_{j}^{2})\cdot f({S}_{h}\mid {x}_{2h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j})$$
 (b)
 In line 9, with the presence of a missing covariate ${x}_{1h}$, the modification of the cluster probability for the observation $({S}_{h},{x}_{2h})$, where ${x}_{1h}$ is missing, that belongs to the continuous cluster $J+1$ can be made as follows:$$P({s}_{h}=J+1)=p({s}_{h}\mid {s}_{-h})\cdot {f}_{0}({x}_{2h})\cdot {f}_{0}({S}_{h}\mid {x}_{2h})$$
 (c)
 In line 22, with the presence of a missing covariate ${x}_{1h}$, the imputation should be made before simulating the parameter ${\mathit{\theta}}_{j}^{*}$, as follows:$$\begin{cases}\left\{\begin{array}{l}\text{First, sample } {x}_{1h}\sim f({S}_{h}\mid {\mathit{X}}_{h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j})\cdot {f}_{Bern}({x}_{1h}\mid {\pi}_{j}).\\ \text{Then, sample } {\mathit{\theta}}_{j}^{*} \text{ from the posterior } p(\mathit{\theta}\mid {S}_{h},{\mathit{X}}_{h}).\end{array}\right. & \text{if } {x}_{1h} \text{ is missing,}\\ \text{Sample } {\mathit{\theta}}_{j}^{*} \text{ from the posterior } p(\mathit{\theta}\mid {S}_{h},{\mathit{X}}_{h}), & \text{otherwise.}\end{cases}$$The imputation model formulation above was discussed in Section 3.4.
4. Bayesian Inference for ${S}_{h}\mid {\mathit{X}}_{h}$ with the MAR Covariate
4.1. Parameter Models and the MAR Covariate
4.2. Data Models and the MAR Covariate
 (a)
 Covariate model for the discrete cluster $f({\mathit{X}}_{h}\mid {\mathit{w}}_{j})$: Focusing on the scenario where ${\mathit{x}}_{1}$ is binary, ${\mathit{x}}_{2}$ is Gaussian, and the only covariate with missingness is ${x}_{1h}$, we simply drop the covariate ${x}_{1h}$ to develop the covariate model for the discrete cluster. For instance, when computing the covariate probability term for the hth observation in cluster j, the covariate model $f({x}_{1h},{x}_{2h}\mid {\pi}_{j},{\mu}_{j},{\tau}_{j}^{2})$ simply becomes $f({x}_{2h}\mid {\mu}_{j},{\tau}_{j}^{2})$ due to the missingness of ${x}_{1h}$. As ${\mathit{x}}_{2}$ is assumed to be normally distributed as defined in Equation (1), its probability term is$$f({x}_{2h}\mid {\mu}_{j},{\tau}_{j}^{2})=\frac{1}{\sqrt{2\pi {\tau}_{j}^{2}}}\exp\left\{-\frac{{({x}_{2h}-{\mu}_{j})}^{2}}{2{\tau}_{j}^{2}}\right\},$$whereas, with complete data, the joint covariate model is$$f({x}_{1h},{x}_{2h}\mid {\pi}_{j},{\mu}_{j},{\tau}_{j}^{2})={\pi}_{j}^{{x}_{1h}}{(1-{\pi}_{j})}^{1-{x}_{1h}}\cdot \frac{1}{\sqrt{2\pi {\tau}_{j}^{2}}}\exp\left\{-\frac{{({x}_{2h}-{\mu}_{j})}^{2}}{2{\tau}_{j}^{2}}\right\}.$$
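A minimal sketch of this drop-the-missing-factor rule, with placeholder parameter values: the complete-data density is the Bernoulli factor times the Gaussian factor, and dropping the Bernoulli factor recovers the missing-covariate version.

```python
import math

def covariate_density(x1, x2, pi_j, mu_j, tau2_j):
    """f(X_h | w_j): a Bernoulli factor for x1 times a Gaussian factor for x2.
    If x1 is None (missing at random), the Bernoulli factor is simply dropped."""
    gauss = (math.exp(-((x2 - mu_j) ** 2) / (2.0 * tau2_j))
             / math.sqrt(2.0 * math.pi * tau2_j))
    if x1 is None:
        return gauss
    return pi_j ** x1 * (1.0 - pi_j) ** (1 - x1) * gauss

d_complete = covariate_density(1, 0.3, pi_j=0.4, mu_j=0.0, tau2_j=1.0)
d_missing = covariate_density(None, 0.3, pi_j=0.4, mu_j=0.0, tau2_j=1.0)
```

With $x_{1h}=1$ observed, the complete-data density is exactly ${\pi}_{j}$ times the missing-covariate density, as the factorized form implies.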
 (b)
 Covariate model for the continuous cluster ${f}_{0}({\mathit{X}}_{h})$: If the binary covariate ${x}_{1h}$ is missing, then by the same logic, we drop the covariate ${x}_{1h}$ for the continuous cluster. However, using Equation (4), the covariate model for the continuous cluster integrates out the relevant parameters simulated from the Dirichlet process prior ${G}_{0}$ as follows:$${f}_{0}({x}_{2h})=\int f({x}_{2h}\mid \mu ,{\tau}^{2})\,d{G}_{0}(\mu ,{\tau}^{2})=\int f({x}_{2h}\mid \mu ,{\tau}^{2})\cdot p(\mu \mid {\tau}^{2})\cdot p({\tau}^{2})\,d\mu \,d{\tau}^{2}=\frac{{\gamma}_{0}^{{e}_{0}}\,\Gamma ({e}_{0}+1/2)}{2\sqrt{\pi}\,\Gamma ({e}_{0})}{\left({\gamma}_{0}+\frac{{({x}_{2h}-{\mu}_{0})}^{2}}{4}\right)}^{-({e}_{0}+1/2)}$$$${f}_{0}({x}_{1h},{x}_{2h})=\int f({x}_{1h},{x}_{2h}\mid \pi ,\mu ,{\tau}^{2})\cdot p(\pi )\cdot p(\mu \mid {\tau}^{2})\cdot p({\tau}^{2})\,d\pi \,d\mu \,d{\tau}^{2}=\frac{\mathit{B}({x}_{1h}+{c}_{0},\,1-{x}_{1h}+{d}_{0})}{\mathit{B}({c}_{0},{d}_{0})}\cdot \frac{{\gamma}_{0}^{{e}_{0}}\,\Gamma ({e}_{0}+1/2)}{2\sqrt{\pi}\,\Gamma ({e}_{0})}{\left({\gamma}_{0}+\frac{{({x}_{2h}-{\mu}_{0})}^{2}}{4}\right)}^{-({e}_{0}+1/2)}$$The derivation of the distributions above is provided in Appendix C.3.
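As a numerical sanity check on the closed form for ${f}_{0}({x}_{2h})$: integrating out $\mu \mid {\tau}^{2}\sim N({\mu}_{0},{\tau}^{2})$ and ${\tau}^{2}\sim \mathrm{InverseGamma}({e}_{0},{\gamma}_{0})$ yields a Student-t-type density, and the leading constant makes it integrate to one. The hyperparameter values below are arbitrary placeholders.

```python
import math

import numpy as np

def f0_x2(x, mu0, e0, gamma0):
    """Closed-form marginal f0(x2) when mu | tau^2 ~ N(mu0, tau^2) and
    tau^2 ~ InverseGamma(e0, gamma0) are integrated out."""
    const = (gamma0 ** e0 * math.gamma(e0 + 0.5)
             / (2.0 * math.sqrt(math.pi) * math.gamma(e0)))
    return const * (gamma0 + (x - mu0) ** 2 / 4.0) ** (-(e0 + 0.5))

# sanity check: the closed form is a proper density in x2 (integrates to one)
grid = np.linspace(-200.0, 200.0, 400001)
dx = grid[1] - grid[0]
total = float(np.sum(f0_x2(grid, mu0=0.0, e0=3.0, gamma0=2.0)) * dx)
```

The tails decay polynomially (like $|x|^{-(2{e}_{0}+1)}$), so the wide grid above captures essentially all of the mass for these hyperparameters.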
 (c)
 Outcome model for the discrete cluster $f({S}_{h}\mid {\mathit{X}}_{h},{\mathit{\theta}}_{j})$: In developing the outcome model, as with the parameter model case discussed in Section 4.1 and Appendix C.2, the covariate must be complete beforehand. With all missing data in ${x}_{1h}$ imputed, the outcome model for the discrete cluster is obtained by marginalizing the joint $f({S}_{h},{x}_{1h}\mid {x}_{2h},{\mathit{\theta}}_{j},{\pi}_{j})$ over the MAR covariate ${x}_{1h}$, which yields the log skew-normal mixture$$\begin{aligned} f({S}_{h}\mid {x}_{2h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j}) &= \sum _{{x}_{1h}=0}^{1}f({S}_{h}\mid {x}_{1h},{x}_{2h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j})\cdot f({x}_{1h}\mid {\pi}_{j})\\ &= f({S}_{h},\,{x}_{1h}=1\mid {x}_{2h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j},{\pi}_{j})+f({S}_{h},\,{x}_{1h}=0\mid {x}_{2h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j},{\pi}_{j})\\ &= \left\{\delta ({\mathit{X}}_{h}^{T}{\tilde{\mathit{\beta}}}_{j})\,\mathbb{1}({S}_{h}=0)+\left[1-\delta ({\mathit{X}}_{h}^{T}{\tilde{\mathit{\beta}}}_{j})\right]\cdot \frac{2}{{\sigma}_{j}{S}_{h}}\cdot \varphi \!\left(\frac{\log {S}_{h}-({\beta}_{j0}+{\beta}_{j1}+{\beta}_{j2}{x}_{2h})}{{\sigma}_{j}}\right)\cdot \Phi \!\left({\xi}_{j}\,\frac{\log {S}_{h}-({\beta}_{j0}+{\beta}_{j1}+{\beta}_{j2}{x}_{2h})}{{\sigma}_{j}}\right)\right\}\cdot {\pi}_{j}\\ &\quad +\left\{\delta ({\mathit{X}}_{h}^{T}{\tilde{\mathit{\beta}}}_{j})\,\mathbb{1}({S}_{h}=0)+\left[1-\delta ({\mathit{X}}_{h}^{T}{\tilde{\mathit{\beta}}}_{j})\right]\cdot \frac{2}{{\sigma}_{j}{S}_{h}}\cdot \varphi \!\left(\frac{\log {S}_{h}-({\beta}_{j0}+{\beta}_{j2}{x}_{2h})}{{\sigma}_{j}}\right)\cdot \Phi \!\left({\xi}_{j}\,\frac{\log {S}_{h}-({\beta}_{j0}+{\beta}_{j2}{x}_{2h})}{{\sigma}_{j}}\right)\right\}\cdot (1-{\pi}_{j}), \end{aligned}$$where each conditional component is$$f({S}_{h}\mid {x}_{1h},{x}_{2h},{\mathit{\beta}}_{j},{\sigma}_{j}^{2},{\xi}_{j},{\tilde{\mathit{\beta}}}_{j})=\delta ({\mathit{X}}_{h}^{T}{\tilde{\mathit{\beta}}}_{j})\,\mathbb{1}({S}_{h}=0)+\left[1-\delta ({\mathit{X}}_{h}^{T}{\tilde{\mathit{\beta}}}_{j})\right]\cdot \frac{2}{{\sigma}_{j}{S}_{h}}\cdot \varphi \!\left(\frac{\log {S}_{h}-({\beta}_{j0}+{\beta}_{j1}{x}_{1h}+{\beta}_{j2}{x}_{2h})}{{\sigma}_{j}}\right)\cdot \Phi \!\left({\xi}_{j}\,\frac{\log {S}_{h}-({\beta}_{j0}+{\beta}_{j1}{x}_{1h}+{\beta}_{j2}{x}_{2h})}{{\sigma}_{j}}\right).$$
 (d)
 Outcome model for the continuous cluster ${f}_{0}({S}_{h}\mid {\mathit{X}}_{h})$: Once the missing covariate ${\mathit{x}}_{1}$ is fully imputed and the outcome model is marginalized over the MAR covariate ${x}_{1h}$, the outcome model ${f}_{0}({S}_{h}\mid {x}_{2h})$ for the continuous cluster can be computed by integrating out the relevant parameters using Equation (4):$${f}_{0}({S}_{h}\mid {x}_{2h})=\int f({S}_{h}\mid {x}_{2h},\mathit{\beta},{\sigma}^{2},\xi ,\tilde{\mathit{\beta}})\cdot p(\mathit{\beta})\cdot p({\sigma}^{2})\cdot p(\xi )\cdot p(\tilde{\mathit{\beta}})\,d\mathit{\beta}\,d{\sigma}^{2}\,d\xi \,d\tilde{\mathit{\beta}}$$However, this integral is generally too complicated to compute analytically. Instead, we can integrate out the parameters using Monte Carlo integration. For example, we can perform the following steps for each $h=1,\cdots ,H$:
 (i)
 Sample $\mathit{\beta},{\sigma}^{2},\xi ,\tilde{\mathit{\beta}}$ from the DP prior densities ${G}_{0}$ specified previously;
 (ii)
 Evaluate the conditional density $f({S}_{h}\mid {x}_{2h},\mathit{\beta},{\sigma}^{2},\xi ,\tilde{\mathit{\beta}})$ at these sampled parameter values (since the parameters are drawn from ${G}_{0}$, the prior densities must not be multiplied in again);
 (iii)
 Repeat the above steps many times, recording each output;
 (iv)
 Divide the sum of all recorded values by the number of Monte Carlo samples; this average is the approximate integral.
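Steps (i)–(iv) can be sketched as follows. The prior densities here are illustrative stand-ins for the ${G}_{0}$ components of Section 4.1 (they are not the paper's actual hyperpriors), and the zero-outcome δ component is omitted for brevity.

```python
import math

import numpy as np

rng = np.random.default_rng(3)

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logsn_pdf(s, mean_log, sigma, xi):
    """Log skew-normal density of a positive outcome s (illustrative helper)."""
    z = (math.log(s) - mean_log) / sigma
    return 2.0 / (sigma * s) * norm_pdf(z) * norm_cdf(xi * z)

def f0_outcome_mc(s_h, x2h, n_draws, rng):
    """Monte Carlo estimate of f0(S | x2) = integral of f(S | x2, theta) dG0:
    draw theta from the prior, evaluate the likelihood, and average."""
    total = 0.0
    for _ in range(n_draws):
        beta0, beta2 = rng.normal(0.0, 1.0, size=2)  # coefficients ~ N(0, 1)
        sigma2 = 1.0 / rng.gamma(2.0, 1.0)           # sigma^2 ~ InverseGamma(2, 1)
        xi = rng.standard_t(4)                       # skewness ~ Student-t(nu0 = 4)
        total += logsn_pdf(s_h, beta0 + beta2 * x2h, math.sqrt(sigma2), xi)
    return total / n_draws

est = f0_outcome_mc(s_h=50.0, x2h=1.0, n_draws=2000, rng=rng)
```

Note that the prior enters only through the sampling of the parameters; each recorded value is the likelihood alone, and the average converges to the integral as the number of draws grows.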
5. Empirical Study
5.1. Data
5.2. Three Competitor Models and Evaluation
5.3. Result with International General Insurance Liability Data
5.4. Result with LGPIF Data
6. Discussion
6.1. Research Questions
6.2. Future Work
 (a)
 Dimensionality: First, in our analysis, we used only two covariates (binary and continuous) for simplicity. Hence, more complex data should be considered. As the number of covariates grows, the number of likelihood components (covariate models) describing the covariates grows, which shrinks the cluster weights. Therefore, using more covariates might enhance the sensitivity and accuracy of the cluster membership creation. However, it can also introduce more noise or hidden structures that render the resulting predictive distributions unstable. In this sense, further research on the problem of high-dimensional covariates in the DPM framework would be worthwhile.
 (b)
 Measurement error: Second, although our focus in this article was the MAR covariate, mismeasured covariates are an equally significant challenge that impairs proper model development in insurance practice. For example, Aggarwal et al. (2016) pointed out that “model risk” mainly arises from missingness and measurement error in variables, leading to flawed risk assessments and decision making. Thus, further investigation is needed into the specialized construction of the DPM Gibbs sampler for mismeasured covariates, aiming to prevent the issue of model risk.
 (c)
 Sum of the log skew-normal: Third, as an extension to the approximation of total losses ${S}_{h}$ (the sum of individual losses) for a policy, we recommend researching ways to approximate the sum of total losses $\tilde{S}$ across entire policies. In other words, we pose the following question: “How do we approximate the sum of log skew-normal random variables?” From the perspective of an executive or an entrepreneur whose concern is the total cash flow of the firm, nothing might be more important than the accurate estimation of the sum of total losses in order to identify the insolvency risk or to make important business decisions.
 (d)
 Scalability: Lastly, we suggest investigating the scalability of the posterior simulation with our DPM Gibbs sampler. As shown in our empirical study on the PnCdemand dataset, our DPM framework produced reliable estimates with relatively small sample sizes ($n\le 160$). This was because our DPM framework actively utilized significant prior knowledge in posterior inference rather than heavily relying on the actual features of the data. In the result from the LGPIF dataset, our DPM exhibited stable performance at a sample size $n=4529$ as well. However, a sample size of over 10,000 was not explored in this paper. With increasing amounts of data, our DPM framework raises the question of computational efficiency due to the growing demand for computational resources or degradation in performance (see Ni et al. 2020). This is an important consideration, especially in scenarios where the insurance loss information is expected to grow over time.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Variable Definitions
$i=1,\dots ,{N}_{h}$  Observation index i in a policy h 
$h=1,\dots ,H$  Policy index h with a total policy number H 
$j=1,\dots ,J$  Cluster index for J clusters 
${s}_{h}$  Cluster index $j=1,\dots ,J$ for observation h 
${n}_{j}$  Number of observations in cluster j 
${n}_{j}^{-h}$  Number of observations in cluster j with observation h removed 
${Y}_{ih}$  Individual loss i in a policy observation h 
${S}_{h}$  Outcome variable which is $\Sigma {Y}_{ih}$ in a policy observation h. 
$\tilde{S}$  Outcome variable which is $\Sigma {S}_{h}$ across entire policies 
${\mathit{X}}_{h}$  Vector of covariates (including ${\mathit{x}}_{1},{\mathit{x}}_{2}$) for a policy observation h 
${\mathit{x}}_{1}$  Vector of covariate (Fire5) 
${\mathit{x}}_{2}$  Vector of covariate (Ln(coverage)) 
${x}_{1h}$  Individual value of covariate (Fire5) for a policy observation h 
${x}_{2h}$  Individual value of covariate (Ln(coverage)) for a policy observation h 
${p}_{0}(\xb7)$  Parameter model (for prior) 
$p(\xb7)$  Parameter model (for posterior) 
${f}_{0}(\xb7)$  Data model (for continuous cluster) 
$f(\xb7)$  Data model (for discrete cluster) 
$\delta (\xb7)$  Logistic sigmoid function—expit(·)—to allow for a positive probability of the zero outcome 
${\mathit{\theta}}_{j}$  Set of parameters—$\mathit{\beta},{\sigma}^{2},\xi $—associated with $f(\Sigma Y\mathit{X})$ for cluster j 
${\mathit{w}}_{j}$  Set of parameters—$\pi ,\mu ,\tau $—associated with $f\left(\mathit{X}\right)$ for cluster j 
${\mathit{\omega}}_{j}$  Cluster weights (mixing coefficient) for cluster j 
${\mathit{\beta}}_{0},{\Sigma}_{0}$  Vector of initial regression coefficients and variance–covariance matrix (i.e., ${\widehat{\sigma}}^{2}{({\mathit{X}}^{T}\mathit{X})}^{-1}$ with ${\widehat{\sigma}}^{2}={(\Sigma Y-\Sigma \widehat{Y})}^{T}(\Sigma Y-\Sigma \widehat{Y})/(n-p)$) obtained from the baseline multivariate gamma regression of $\Sigma \widehat{Y}>0$ 
${\mathit{\beta}}_{j}$  Regression coefficient vector for a mean outcome estimation 
${\sigma}_{j}^{2}$  Clusterwise variation value for the outcome 
${\xi}_{j}$  Skewness parameter for log skewnormal outcome 
${\tilde{\mathit{\beta}}}_{0},{\tilde{\Sigma}}_{0}$  Vector of initial regression coefficients and variancecovariance matrix obtained from the baseline multivariate logistic regression of $\Sigma \widehat{Y}=0$ 
${\tilde{\mathit{\beta}}}_{j}$  Regression coefficient vector for a logistic function to handle zero outcomes 
${\pi}_{j}$  Proportion parameter for Bernoulli covariate 
${\mu}_{j},{\tau}_{j}$  Location and spread parameter for Gaussian covariate 
$\alpha $  Precision parameter that controls the variance of the clustering simulation. For instance, a larger $\alpha $ allows selecting more clusters. 
${G}_{0}$  Prior joint distribution for all parameters in the DPM: $\beta ,{\sigma}^{2},\xi ,\pi ,\mu ,\tau $, and $\alpha $. It allows all continuous, integrable distributions to be supported while retaining theoretical properties and computational tractability such as asymptotic consistency and efficient posterior estimation. 
${a}_{0},{b}_{0}$  Hyperparameters for inverse gamma density of ${\sigma}_{j}^{2}$ 
${c}_{0},{d}_{0}$  Hyperparameters for Beta density of ${\pi}_{j}$ 
${\nu}_{0}$  Hyperparameter for Student’s t density of ${\xi}_{j}$ 
${\mu}_{0},{\tau}_{0}^{2}$  Hyperparameters for Gaussian density of ${\mu}_{j}$ 
${e}_{0},{\gamma}_{0}$  Hyperparameters for inverse gamma density of ${\tau}_{j}^{2}$ 
${g}_{0},{h}_{0}$  Hyperparameters for gamma density of $\alpha $ 
$\eta $  Random probability value for gamma mixture density of the posterior on $\alpha $ 
${\pi}_{\eta}$  Mixing coefficient for gamma mixture density of the posterior on $\alpha $ 
Appendix A. Parameter Knowledge
Appendix A.1. Prior Kernel for Distributions of Outcome, Covariates, and Precision
Appendix A.2. Posterior Inference for Outcome, Covariates, and Precision
Algorithm A1 Posterior inference ${\mathit{\theta}}_{j}^{*}=\{{\mathit{\beta}}_{j}^{*},{\sigma}_{j}^{2*},{\xi}_{j}^{*},{\tilde{\mathit{\beta}}}_{j}^{*}\}$ 

Appendix B. Baseline Inference Algorithm for the DPM
Algorithm A2 DPM Gibbs sampling for new cluster development 

Appendix C. Development of the Distributional Components for the DPM
Appendix C.1. Derivation of the Distribution of Precision α
 Observation 1 forms a new cluster with a probability = $\frac{\alpha}{\alpha}$
 Observation 2 forms a new cluster with a probability = $\frac{\alpha}{\alpha +1}$
 Observation 3 enters into an existing cluster with a probability = $\frac{2}{\alpha +2}$
 Observation 4 enters into an existing cluster with a probability = $\frac{3}{\alpha +3}$
 Observation 5 forms a new cluster with a probability = $\frac{\alpha}{\alpha +4}$
Appendix C.2. Outcome Data Model of ${S}_{h}$ Development with the MAR Covariate ${\mathit{x}}_{1}$ for the Discrete Clusters
Appendix C.3. Covariate Data Model of ${\mathit{x}}_{2}$ Development with the MAR Covariate ${\mathit{x}}_{1}$ for the Continuous Clusters
References
 Aggarwal, Ankur, Michael B. Beck, Matthew Cann, Tim Ford, Dan Georgescu, Nirav Morjaria, Andrew Smith, Yvonne Taylor, Andreas Tsanakas, Louise Witts, and et al. 2016. Model risk–daring to open up the black box. British Actuarial Journal 21: 229–96. [Google Scholar] [CrossRef]
 Antoniak, Charles E. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2: 1152–74. [Google Scholar] [CrossRef]
 Bassetti, Federico, Roberto Casarin, and Fabrizio Leisen. 2014. Beta-product dependent Pitman–Yor processes for Bayesian inference. Journal of Econometrics 180: 49–72. [Google Scholar] [CrossRef]
 Beaulieu, Norman C., and Qiong Xie. 2003. Minimax approximation to lognormal sum distributions. Paper presented at the 57th IEEE Semiannual Vehicular Technology Conference, VTC 2003-Spring, Jeju, Republic of Korea, April 22–25; Piscataway: IEEE, vol. 2, pp. 1061–65. [Google Scholar]
 Billio, Monica, Roberto Casarin, and Luca Rossini. 2019. Bayesian nonparametric sparse VAR models. Journal of Econometrics 212: 97–115. [Google Scholar]
 Blackwell, David, and James B. MacQueen. 1973. Ferguson distributions via Pólya urn schemes. The Annals of Statistics 1: 353–55. [Google Scholar] [CrossRef]
 Blei, David M., and Peter I. Frazier. 2011. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research 12: 2461–88. [Google Scholar]
 Braun, Michael, Peter S. Fader, Eric T. Bradlow, and Howard Kunreuther. 2006. Modeling the “pseudodeductible” in insurance claims decisions. Management Science 52: 1258–72. [Google Scholar] [CrossRef]
 Browne, Mark J., JaeWook Chung, and Edward W. Frees. 2000. International property-liability insurance consumption. The Journal of Risk and Insurance 67: 73–90. [Google Scholar]
 Cairns, Andrew J. G., David Blake, Kevin Dowd, Guy D. Coughlan, and Marwa Khalaf-Allah. 2011. Bayesian stochastic mortality modelling for two populations. ASTIN Bulletin: The Journal of the IAA 41: 29–59. [Google Scholar]
 Diebolt, Jean, and Christian P. Robert. 1994. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological) 56: 363–75. [Google Scholar] [CrossRef]
 Escobar, Michael D., and Mike West. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90: 577–88. [Google Scholar] [CrossRef]
 Ferguson, Thomas S. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1: 209–30. [Google Scholar] [CrossRef]
 Furman, Edward, Daniel Hackmann, and Alexey Kuznetsov. 2020. On lognormal convolutions: An analytical–numerical method with applications to economic capital determination. Insurance: Mathematics and Economics 90: 120–34. [Google Scholar] [CrossRef]
 Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press. [Google Scholar]
 Gershman, Samuel J., and David M. Blei. 2012. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology 56: 1–12. [Google Scholar] [CrossRef]
 Ghosal, Subhashis. 2010. The Dirichlet process, related priors and posterior asymptotics. Bayesian Nonparametrics 28: 35. [Google Scholar]
 Griffin, Jim, and Mark Steel. 2006. Order-based dependent Dirichlet processes. Journal of the American Statistical Association 101: 179–94. [Google Scholar] [CrossRef]
 Griffin, Jim, and Mark Steel. 2011. Stick-breaking autoregressive processes. Journal of Econometrics 162: 383–96. [Google Scholar] [CrossRef]
 Hannah, Lauren A., David M. Blei, and Warren B. Powell. 2011. Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research 12: 1923–53. [Google Scholar]
 Hogg, Robert V., and Stuart A. Klugman. 2009. Loss Distributions. Hoboken: John Wiley & Sons. [Google Scholar]
 Hong, Liang, and Ryan Martin. 2017. A flexible Bayesian nonparametric model for predicting future insurance claims. North American Actuarial Journal 21: 228–41. [Google Scholar] [CrossRef]
 Hong, Liang, and Ryan Martin. 2018. Dirichlet process mixture models for insurance loss data. Scandinavian Actuarial Journal 2018: 545–54. [Google Scholar] [CrossRef]
 Huang, Yifan, and Shengwang Meng. 2020. A Bayesian nonparametric model and its application in insurance loss prediction. Insurance: Mathematics and Economics 93: 84–94. [Google Scholar] [CrossRef]
 Kaas, Rob, Marc Goovaerts, Jan Dhaene, and Michel Denuit. 2008. Modern Actuarial Risk Theory: Using R. Berlin and Heidelberg: Springer Science & Business Media, vol. 128. [Google Scholar]
 Lam, Chong Lai Joshua, and Tho Le-Ngoc. 2007. Log-shifted gamma approximation to lognormal sum distributions. IEEE Transactions on Vehicular Technology 56: 2121–29. [Google Scholar] [CrossRef]
 Li, Xue. 2008. A Novel Accurate Approximation Method of Lognormal Sum Random Variables. Ph.D. thesis, Wright State University, Dayton, OH, USA. [Google Scholar]
 Neal, Radford M. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9: 249–65. [Google Scholar]
 Neuhaus, John M., and Charles E. McCulloch. 2006. Separating between- and within-cluster covariate effects by using conditional and partitioning methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68: 859–72. [Google Scholar] [CrossRef]
 Ni, Yang, Yuan Ji, and Peter Müller. 2020. Consensus Monte Carlo for random subsets using shared anchors. Journal of Computational and Graphical Statistics 29: 703–14. [Google Scholar] [CrossRef]
 Quan, Zhiyu, and Emiliano A. Valdez. 2018. Predictive analytics of insurance claims using multivariate decision trees. Dependence Modeling 6: 377–407. [Google Scholar] [CrossRef]
 Richardson, Robert, and Brian Hartman. 2018. Bayesian nonparametric regression models for modeling and predicting healthcare claims. Insurance: Mathematics and Economics 83: 1–8. [Google Scholar] [CrossRef]
 Rodriguez, Abel, and David B. Dunson. 2011. Nonparametric Bayesian models through probit stick-breaking processes. Bayesian Analysis (Online) 6: 145–78. [Google Scholar]
 Roy, Jason, Kirsten J. Lum, Bret Zeldow, Jordan D. Dworkin, Vincent Lo Re III, and Michael J. Daniels. 2018. Bayesian nonparametric generative models for causal inference with missing at random covariates. Biometrics 74: 1193–202. [Google Scholar] [CrossRef]
 Sethuraman, Jayaram. 1994. A constructive definition of dirichlet priors. Statistica Sinica 4: 639–650. [Google Scholar]
 Shah, Anoop D., Jonathan W. Bartlett, James Carpenter, Owen Nicholas, and Harry Hemingway. 2014. Comparison of random forest and parametric imputation models for imputing missing data using mice: A caliber study. American Journal of Epidemiology 179: 764–74. [Google Scholar] [CrossRef] [PubMed]
 Shahbaba, Babak, and Radford Neal. 2009. Nonlinear models using dirichlet process mixtures. Journal of Machine Learning Research 10: 1829–50. [Google Scholar]
 Shams Esfand Abadi, Mostafa. 2022. Bayesian Nonparametric Regression Models for Insurance Claims Frequency and Severity. Ph.D. thesis, University of Nevada, Las Vegas, NV, USA. [Google Scholar]
 Si, Yajuan, and Jerome P. Reiter. 2013. Nonparametric bayesian multiple imputation for incomplete categorical variables in largescale assessment surveys. Journal of Educational and Behavioral Statistics 38: 499–521. [Google Scholar] [CrossRef]
 Suwandani, Ria Novita, and Yogo Purwono. 2021. Implementation of gaussian process regression in estimating motor vehicle insurance claims reserves. Journal of Asian Multicultural Research for Economy and Management Study 2: 38–48. [Google Scholar] [CrossRef]
 Teh, Yee Whye. 2010. Dirichlet Process. In Encyclopedia of Machine Learning. Berlin and Heidelberg: Springer Science & Business Media, pp. 280–87. [Google Scholar]
 Ungolo, Francesco, Torsten Kleinow, and Angus S. Macdonald. 2020. A hierarchical model for the joint mortality analysis of pension scheme data with missing covariates. Insurance: Mathematics and Economics 91: 68–84. [Google Scholar] [CrossRef]
 Zhao, Lian, and Jiu Ding. 2007. Least squares approximations to lognormal sum distributions. IEEE Transactions on Vehicular Technology 56: 991–97. [Google Scholar] [CrossRef]
| Model | AIC | SSPE | SAPE | 10% CTE | 50% CTE | 90% CTE | 95% CTE |
|---|---|---|---|---|---|---|---|
| Ga-GLM | 830.56 | 268.6 | 139.8 | 6.5 | 13.8 | 54.5 | 78.0 |
| Ga-MARS | 830.58 | 267.2 | 138.2 | 6.1 | 13.0 | 57.2 | 71.1 |
| Ga-GAM | 845.94 | 266.7 | 136.1 | 6.2 | 13.3 | 58.1 | 72.2 |
| Log-N DPM | n/a | 272.0 | 134.7 | 6.4 | 13.8 | 59.3 | 79.3 |
| Model | AIC | SSPE | SAPE | 10% CTE | 50% CTE | 90% CTE | 95% CTE |
|---|---|---|---|---|---|---|---|
| Tweedie-GLM | 26,270.3 | 2.04 × 10^{14} | 89,380,707 | 955.9 | 12,977.2 | 133,374.4 | 340,713.1 |
| Tweedie-MARS | 24,721.4 | 1.99 × 10^{14} | 88,594,850 | 961.7 | 10,391.0 | 129,409.2 | 355,112.6 |
| Tweedie-GAM | 21,948.9 | 1.95 × 10^{14} | 88,213,987 | 989.4 | 13,026.2 | 140,199.5 | 398,263.1 |
| Log-SN DPM | n/a | 1.98 × 10^{14} | 83,864,890 | 975.3 | 13,695.1 | 147,486.6 | 425,682.6 |
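The accuracy and tail-risk metrics in the two tables above can be reproduced with a short sketch. This assumes the standard definitions (not restated in this section): SSPE as the sum of squared prediction errors, SAPE as the sum of absolute prediction errors, and the p% CTE as the average loss beyond the empirical p-quantile; the paper's exact conventions (hold-out split, quantile interpolation) may differ.

```python
import numpy as np

def sspe(y, y_hat):
    """Sum of squared prediction errors over the evaluation set."""
    return float(np.sum((y - y_hat) ** 2))

def sape(y, y_hat):
    """Sum of absolute prediction errors over the evaluation set."""
    return float(np.sum(np.abs(y - y_hat)))

def cte(losses, level):
    """Conditional tail expectation at `level`: the mean of all losses
    at or above the empirical `level`-quantile."""
    q = np.quantile(losses, level)
    return float(losses[losses >= q].mean())
```

For example, `cte(losses, 0.95)` averages the worst 5% of losses, which is why the 95% CTE column grows so quickly for the heavy-tailed Tweedie and log-skew-normal fits.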
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. 
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, M.; Lindberg, D.; Crane, M.; Bezbradica, M. Dirichlet Process Log Skew-Normal Mixture with a Missing-at-Random Covariate in Insurance Claim Analysis. Econometrics 2023, 11, 24. https://doi.org/10.3390/econometrics11040024