1. Introduction
Obtaining accurate estimates of pure premiums is a central challenge in automobile insurance pricing. Insurers must determine premiums that reflect the underlying risk of each policyholder while ensuring fairness, competitiveness, and regulatory compliance, as premium estimation directly affects portfolio stability and profitability.
A cornerstone of actuarial practice is the frequency–severity framework, which decomposes the pure premium into two latent components: the expected number of claims and the expected claim size. By modeling these components separately, actuaries can employ statistical distributions tailored to the distinct stochastic properties of each process before recombining them for final loss estimation. This framework provides both interpretability and flexibility, making it fundamental to modern insurance ratemaking.
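Under the standard assumption that individual claim sizes are independent and identically distributed and independent of the claim count, the decomposition can be stated compactly as
\[
\text{Pure premium} \;=\; \mathbb{E}\!\left[\sum_{k=1}^{N} Y_k\right] \;=\; \mathbb{E}[N]\,\mathbb{E}[Y],
\]
where \(N\) denotes the number of claims and \(Y_k\) the individual claim amounts.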
Generalized Linear Models (GLMs) have long served as the industry standard for actuarial modeling. Existing literature highlights the capacity of GLMs to provide an interpretable yet flexible structure for non-normal, skewed, and heteroscedastic insurance data [1,2,3]. While Poisson and Negative Binomial models are commonly used for claim frequency, Gamma models are widely applied to claim severity due to the strictly positive and right-skewed nature of loss amounts. Extensions such as Generalized Additive Models (GAMs) allow for nonlinear relationships while preserving model transparency [4,5].
Previous studies have demonstrated the applicability of generalized linear models in modeling claim frequency and insurance pricing across a wide range of contexts, including emerging markets [4,5,6].
Machine learning (ML) has emerged as a powerful tool in insurance analytics. Tree-based algorithms and gradient boosting methods are particularly effective in capturing complex interactions and nonlinear relationships that traditional parametric models may overlook. Empirical evidence suggests that models such as XGBoost can improve performance in tasks ranging from claim frequency prediction to fraud detection [7,8,9].
Recent studies such as [10,11] provide empirical comparisons of decomposition-based and direct modeling approaches. These studies highlight that model performance can vary depending on the underlying data structure and the choice of evaluation metrics. Compared to these studies, our analysis provides a unified evaluation framework that applies both decomposition-based and direct modeling approaches to the same dataset using consistent evaluation metrics, allowing for a more controlled comparison of model performance.
Recent work has also explored both direct and component-based modeling strategies using machine learning. Direct modeling of total claim amounts, using methods such as Support Vector Regression (SVR), XGBoost, and neural networks, has demonstrated strong predictive accuracy for aggregate loss estimation [12]. Other studies incorporate machine learning within the classical frequency–severity framework, where gradient boosting models are compared with traditional GLMs [11]. These findings indicate that machine learning consistently improves claim frequency prediction, while results for claim severity remain more mixed due to the high variability of loss amounts.
There is also a growing body of work on neural network–based approaches to insurance modeling, including the contributions of Richman and Wüthrich [13,14], which demonstrate the effectiveness of deep learning methods in capturing complex nonlinear relationships in frequency and severity components. These approaches often integrate decomposition structures within flexible modeling frameworks, suggesting that the distinction between direct and component-based modeling may depend more on implementation than on fundamental modeling philosophy.
Alternative unified modeling approaches have also been proposed, most notably through the Tweedie distribution, which provides a compound Poisson–Gamma representation of aggregate losses [15,16]. While such models offer strong theoretical appeal, their empirical performance relative to decomposition-based methods may depend on model specification and data characteristics.
Motivated by these considerations and the mixed empirical findings in the literature, this study is guided by the following research questions: (i) whether machine learning methods improve predictive performance within the frequency–severity decomposition framework, and (ii) how direct modeling approaches compare with decomposition-based methods in predicting aggregate losses.
In particular, we evaluate model performance not only in terms of prediction accuracy but also in terms of the ability to identify high-risk policies.
To address these questions, we develop a comprehensive modeling framework that integrates classical actuarial models with modern machine learning techniques. We evaluate several models for claim frequency, including Poisson GLMs, spline-based models, and XGBoost. For claim severity, we compare a Gamma GLM with XGBoost. These models are combined within the frequency–severity decomposition framework to estimate pure premium. In addition, we examine direct modeling approaches, including XGBoost and Tweedie models, for predicting aggregate losses.
In this context, our contribution is a unified empirical comparison of decomposition and direct modeling approaches using both classical actuarial models and modern machine learning methods, evaluated under multiple performance metrics. This enables a consistent comparison across modeling approaches, highlighting not only differences in predictive accuracy but also the role of model structure and evaluation criteria in shaping empirical conclusions.
The contribution of this study is threefold. First, we show that machine learning methods can improve predictive performance for claim frequency and provide competitive performance for claim severity, particularly in capturing nonlinear relationships and interactions. Second, we provide empirical evidence that the frequency–severity decomposition framework remains competitive relative to direct modeling approaches, even when advanced machine learning methods are applied. Third, we identify a trade-off between prediction accuracy and risk segmentation, although the magnitude of this trade-off is modest and depends on model configuration.
Overall, this study contributes to the literature on actuarial science and financial mathematics by providing a comprehensive comparison of classical and modern approaches to insurance pricing. The results highlight the continued relevance of the frequency–severity decomposition framework while demonstrating how machine learning can be effectively integrated to enhance predictive performance and risk differentiation.
The remainder of the paper is organized as follows. Section 2 describes the dataset and data preparation procedures; Section 3 presents the modeling framework and statistical methods; Section 4 reports the empirical findings; and Section 5 concludes with a discussion of the results and their implications for insurance pricing and risk management.
2. Data Description and Preparation
This study utilizes the French motor third-party liability (freMTPL2) datasets from the CASdatasets package [17]. The raw data comprise two files: freMTPL2freq, containing 677,991 policy-year records, and freMTPL2sev, containing 26,444 individual claim amounts. These datasets provide a comprehensive set of risk features, including driver demographics, vehicle attributes, and geographic factors.
Table 1 summarizes the variables included in the frequency dataset. These covariates represent standard rating factors used in automobile insurance pricing, including driver age, vehicle characteristics, geographic region, and exposure.
The severity dataset contains 26,444 claim-level observations and includes the policy identifier and the corresponding claim amount for each reported claim. Because the frequency data are recorded at the policy level while the severity data are recorded at the claim level, the two datasets must be reconciled before modeling.
2.1. Data Integration and Preprocessing
To reconcile the policy-level frequency data with the claim-level severity data, all individual claim amounts in freMTPL2sev were aggregated by policy identifier (IDpol). This produced, for each policy, the total claim amount incurred during the exposure period and the number of associated claims. Policies with no reported claims were assigned a total loss of zero. The aggregated severity information was then merged with the frequency dataset to create a unified policy-level file containing exposure, claim counts, total incurred losses, and all rating variables.
For policies with at least one claim, an average claim severity variable was also computed for descriptive purposes. Pure premium values were obtained by dividing the total incurred loss by the policy’s exposure. These derived quantities are used only for exploratory analysis in this section; the formal notation and modeling framework are introduced later in Section 3.
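A minimal sketch of this integration and of the derived quantities described above, assuming the two raw tables have been loaded into pandas DataFrames named freq and sev with the CASdatasets column names (IDpol, ClaimNb, Exposure, ClaimAmount); the variable names are illustrative rather than prescribed by the paper:

```python
import pandas as pd

# Assumed inputs: freq (freMTPL2freq, policy level) and sev (freMTPL2sev, claim level),
# loaded as pandas DataFrames with the CASdatasets column names.
claims = (
    sev.groupby("IDpol")
       .agg(TotalClaimAmount=("ClaimAmount", "sum"),
            ClaimCount=("ClaimAmount", "size"))
       .reset_index()
)

# Merge onto the policy-level frequency file; policies without claims receive zero losses.
policy = freq.merge(claims, on="IDpol", how="left")
policy["TotalClaimAmount"] = policy["TotalClaimAmount"].fillna(0.0)
policy["ClaimCount"] = policy["ClaimCount"].fillna(0).astype(int)

# Derived quantities used only for the exploratory analysis in Section 2.2.
policy["PurePremium"] = policy["TotalClaimAmount"] / policy["Exposure"]
with_claims = policy["ClaimCount"] > 0
policy.loc[with_claims, "AvgSeverity"] = (
    policy.loc[with_claims, "TotalClaimAmount"] / policy.loc[with_claims, "ClaimCount"]
)
```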
2.2. Exploratory Data Analysis
The raw insurance portfolio exhibits the extreme class imbalance and heavy-tailed distributions typical of motor liability risks.
Figure 1 illustrates these characteristics using log-scaled axes to visualize the full range of the data.
The frequency distribution (Figure 1, left) is dominated by zero-claim policies, with a rapid decay in frequency as claim counts increase. Notably, the raw data contain rare observations of up to 16 claims per policy, appearing as isolated points in the extreme tail. Similarly, the claim severity and pure premium distributions (Figure 1, center and right) span several orders of magnitude. The severity plot reveals a significant concentration of claims around EUR 1000, but also shows an exceptionally long right tail reaching towards several million EUR, representing catastrophic losses.
These visualizations highlight the necessity of the preprocessing steps described in Section 2.3. Specifically, the extreme sparsity of high claim counts and the high-leverage outliers in the severity tail motivate the use of capping (winsorization) to ensure the numerical stability and generalizability of the frequency and severity models.
2.3. Data Cleaning and Preparation
This section describes the preprocessing steps applied to improve model stability and address the heavy-tailed nature of insurance data. These procedures are particularly important for both Poisson frequency models and Gamma severity models, which are sensitive to extreme observations.
We apply several preprocessing steps to mitigate the influence of extreme values. Specifically, claim frequency is capped at four, and claim severity is winsorized at EUR 100,000. These transformations are commonly used in actuarial applications to stabilize estimation in the presence of heavy-tailed loss distributions.
We acknowledge that these preprocessing choices may affect both the distributional properties of the data and the interpretation of the resulting pure premium estimates. For example, capping claim frequency implies that policyholders with more than four claims are treated equivalently, while winsorization truncates extreme claim amounts that may carry important information about tail risk.
To assess the robustness of these choices, we conduct sensitivity analyses using alternative frequency caps (2, 3, and 4) and severity thresholds (EUR 50,000, EUR 100,000, and EUR 200,000). The detailed results are reported in Appendix A.
The results show that frequency model performance remains largely unchanged across different caps, indicating robustness to this preprocessing choice. For severity models, the mean squared error increases as the winsorization threshold increases, reflecting the influence of larger claims rather than instability in model performance. Importantly, the relative comparison across models remains consistent.
Overall, these findings indicate that the main conclusions regarding model comparison are stable across different preprocessing specifications.
Policies with exposure below 0.1 years are excluded to avoid instability arising from extremely short exposure periods. We recognize that this filtering reduces the sample size and may introduce selection bias, and we therefore interpret the results with this limitation in mind.
Alternative approaches, such as robust regression techniques that downweight rather than truncate extreme observations, may offer a complementary strategy and are left for future research.
Exposure Filtering
Policies with extremely low exposure are prone to producing artificial variance in annualized claim rates. We excluded all records with an exposure below 0.1 years. This threshold ensures that the observations used for training represent a meaningful period of risk. As shown in Table 2, this step increased the mean exposure from 0.529 to 0.632 years.
Capping and Winsorization
Given the extreme right-skewness observed in Figure 1, we applied capping and winsorization as follows (a brief implementation sketch follows the list):
Claim Frequency: Observations with more than 4 claims were capped at 4. This choice is supported by the empirical distribution of the data, where observations with more than four claims are extremely rare (fewer than 0.001% of policies). Modeling these sparse observations separately would introduce estimation noise without improving predictive performance.
Claim Severity: The severity distribution exhibits substantial right-skewness, with a small number of extremely large claims (up to EUR 4.07 million) that are several orders of magnitude larger than the median claim size. Such observations can disproportionately influence model estimation, particularly for Gamma-based models and error-based evaluation metrics. Winsorization at EUR 100,000 is therefore applied to limit the influence of these extreme values while preserving the overall structure of the distribution.
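A compact sketch of these cleaning steps on the merged policy-level table, continuing the illustrative policy DataFrame from Section 2.1 and using the thresholds stated above; the variable on which winsorization is applied is simplified here:

```python
# Exposure filtering: drop policies with less than 0.1 years of exposure.
policy = policy[policy["Exposure"] >= 0.1].copy()

# Claim frequency capping: counts above 4 are set to 4.
policy["ClaimNb"] = policy["ClaimNb"].clip(upper=4)

# Claim severity winsorization at EUR 100,000 (shown on aggregated losses for brevity;
# in practice the cap is applied to claim severities).
policy["TotalClaimAmount"] = policy["TotalClaimAmount"].clip(upper=100_000)
```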
2.4. Final Modeling Dataset Profile
The resulting dataset provides a stabilized foundation for estimating pure premiums.
Table 3 summarizes the response variables that will be utilized in Section 3. The disparity between the mean and median values across all metrics reinforces the inherent skewness that persists even after cleaning, necessitating the specialized modeling framework proposed in this study.
4. Results
This section presents the empirical results for the claim frequency, claim severity, and pure premium models. We begin by evaluating the performance of alternative models for claim frequency, followed by claim severity, and then assess the combined frequency–severity decomposition and direct modeling approaches. Model comparisons are based on observed differences in evaluation metrics and are descriptive in nature; no formal statistical significance tests are conducted.
4.1. Claim Frequency Model Performance
Table 5 summarizes the performance of the claim frequency models. Lift is defined as the ratio of the observed loss in a selected high-risk segment to the average loss across the entire portfolio, providing a measure of the model’s ability to identify high-risk policies. We note that the observed differences in performance metrics are relatively small and should be interpreted with caution. While formal statistical significance tests could be conducted using repeated resampling or cross-validation, the current analysis focuses on out-of-sample predictive performance. The consistency of model rankings across multiple train–test splits provides additional support for the robustness of the results.
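As an illustration of this definition, the top-decile lift reported in Table 5 can be computed as follows (a sketch; array and function names are ours):

```python
import numpy as np

def top_decile_lift(y_obs, y_pred):
    """Observed mean outcome in the top 10% of predicted risk,
    relative to the observed mean over the whole test portfolio."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(y_pred, 0.9)   # boundary of the highest-risk decile
    top = y_pred >= threshold
    return y_obs[top].mean() / y_obs.mean()
```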
Among all models, XGBoost achieves the lowest MSE and MAE, as well as the highest correlation and lift. In particular, the lift in the top decile indicates a strong ability to identify high-risk policies, highlighting the effectiveness of machine learning methods for risk segmentation.
Among the parametric models, the Poisson model with spline terms and the GAM provide modest improvements over the standard Poisson GLM, suggesting that incorporating nonlinear effects enhances predictive performance. In contrast, the Negative Binomial model performs similarly to the Poisson GLM, indicating that accounting for overdispersion does not substantially improve predictive accuracy in this dataset.
We further evaluate model performance using distribution-specific deviance measures. For claim frequency models, we report the out-of-sample Poisson deviance, which provides a distribution-consistent assessment for count data and is defined as
\[
D_{\mathrm{Poisson}} \;=\; 2\sum_{i=1}^{n}\left[\, y_i \log\!\left(\frac{y_i}{\hat{y}_i}\right) - \left(y_i - \hat{y}_i\right)\right],
\]
where \(y_i\) denotes the observed claim count and \(\hat{y}_i\) the predicted value. The Poisson deviance measures the discrepancy between observed and predicted counts under the Poisson distribution, with lower values indicating a better fit, and aligns with the distributional assumptions of Poisson-based models. By convention, the term \(y_i \log(y_i/\hat{y}_i)\) is defined as zero when \(y_i = 0\).
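A short sketch of this computation (the manual form mirrors the definition above; scikit-learn's mean_poisson_deviance gives the per-observation average of the same quantity):

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

def poisson_deviance(y_obs, y_pred):
    """Out-of-sample Poisson deviance; the y*log(y/mu) term is taken as 0 when y = 0."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = np.where(y_obs > 0, y_obs * np.log(y_obs / y_pred), 0.0)
    return 2.0 * np.sum(log_term - (y_obs - y_pred))

# Averaged alternative: mean_poisson_deviance(y_obs, y_pred)
```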
The results indicate that XGBoost achieves the lowest deviance among all models, and the relative ranking of models is consistent with the error-based metrics, supporting the robustness of the empirical findings.
We note that error-based metrics such as MSE are sensitive to extreme values and may be influenced by preprocessing choices such as capping and winsorization. Therefore, these metrics are interpreted in conjunction with deviance-based measures.
To assess the robustness of lift-based comparisons, we repeat the analysis across multiple random train–test splits and observe that the relative ranking of models remains broadly consistent. We acknowledge that formal statistical inference for lift measures, such as confidence intervals or hypothesis tests, is not included and represents a direction for future research.
Finally, we note that proper scoring rules, such as the continuous ranked probability score (CRPS), may provide additional insights into predictive distributions. As the current analysis focuses on point predictions, we leave this as a direction for future research.
Claim Capture Curves
Figure 2 presents the cumulative claim capture curves for the competing models. The XGBoost model consistently captures a larger proportion of claims in the highest-risk segments, followed by the Poisson model with spline terms and the GAM. The standard Poisson and Negative Binomial models exhibit weaker concentration of claims.
These results are consistent with the lift statistics and confirm that models incorporating nonlinear effects improve risk segmentation, with machine learning methods providing additional gains.
Decile Lift Analysis
Figure 3 presents the decile lift curves for the competing models. The XGBoost model achieves the highest lift in the top decile, followed by the Poisson model with spline terms and the GAM. Differences across the middle deciles are relatively small, reflecting the low overall claim frequency and limited variability in moderate-risk groups.
Interpretation of Model Components
Figure 4 presents the variable importance results for the XGBoost frequency model, measured using gain. Gain reflects the average improvement in the objective function attributable to splits on each variable and therefore provides a direct measure of each predictor’s contribution to predictive performance. This metric is widely used in applied machine learning and is particularly relevant in actuarial applications where the focus is on predictive accuracy.
The results indicate that BonusMalus is by far the most influential predictor, followed by DrivAge, Density, and VehAge. These findings are consistent with actuarial intuition, as past claims experience and driver characteristics are known to be strong determinants of claim frequency.
We note that alternative importance measures, such as cover or frequency, may yield different rankings, as they capture different aspects of variable usage within the model. Consequently, the importance values should be interpreted in the context of the chosen metric. In particular, gain-based importance emphasizes variables that contribute most to reducing prediction error, which aligns with the objectives of risk modeling and premium estimation.
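Gain-based importance of this kind can be read directly from a fitted xgboost model (a sketch; the booster object name is ours):

```python
import pandas as pd

# `bst` is assumed to be a fitted xgboost.Booster for the frequency model.
gain = bst.get_score(importance_type="gain")   # feature name -> average gain per split
importance = pd.Series(gain, name="gain").sort_values(ascending=False)
print(importance.head(10))                     # e.g. BonusMalus, DrivAge, Density, VehAge, ...
```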
Figure 5 presents the estimated smooth effects from the GAM for selected predictors, along with approximate 95% confidence bands. Overall, the results reveal nonlinear relationships between several covariates and claim frequency. The effect of VehAge exhibits a mild nonlinear pattern, with a slight decrease for mid-range vehicle ages followed by a gradual increase, suggesting that both very new and older vehicles may be associated with higher risk. A similar but less pronounced pattern is observed for DrivAge, where the effect remains relatively stable across most ages but declines slightly at higher values, indicating a potential reduction in risk among older drivers.
In contrast, BonusMalus shows a strong nonlinear effect, with a pronounced increase in risk for higher values, reflecting its role as a key indicator of prior claims experience. This confirms its importance as a dominant predictor in the model. The effect of Density appears relatively flat, with narrow confidence bands across most of its range, suggesting that population density contributes limited additional explanatory power once other variables are accounted for.
The width of the confidence bands varies across the range of each predictor, with wider intervals in regions with fewer observations, indicating greater uncertainty in those areas. Overall, these results highlight the ability of the GAM to capture complex nonlinear effects while providing interpretable insights into the contribution of individual risk factors.
These findings are consistent with the use of GAMs as a flexible extension of GLMs, allowing nonlinear effects to be captured without imposing strict parametric assumptions.
Model Selection
The results indicate that XGBoost provides the strongest overall predictive performance, achieving the lowest error measures, the highest correlation, and the greatest lift in the highest-risk decile. In addition, the Poisson model with spline terms delivers competitive performance while retaining the interpretability and transparency of the generalized linear modeling framework widely used in actuarial practice.
To enable a direct comparison between a traditional actuarial approach and a modern machine learning method in subsequent analyses, both models are retained. The Poisson spline model serves as the representative GLM-based specification, while XGBoost represents the machine learning approach. This dual-model strategy allows us to evaluate not only predictive accuracy but also the trade-offs between interpretability and performance in insurance pricing applications.
4.2. Claim Severity Model Performance
Table 6 summarizes the performance of the claim severity models. We report out-of-sample Gamma deviance for claim severity models, ensuring consistency with the assumed distributional framework.
The Gamma deviance is defined as
\[
D_{\mathrm{Gamma}} \;=\; 2\sum_{i=1}^{n}\left[ -\log\!\left(\frac{y_i}{\hat{y}_i}\right) + \frac{y_i - \hat{y}_i}{\hat{y}_i}\right],
\]
where \(y_i\) denotes the observed claim severity and \(\hat{y}_i\) the predicted value. The Gamma deviance measures the discrepancy between observed and predicted values under the Gamma distribution, with lower values indicating a better fit. This metric is particularly appropriate for modeling positive, right-skewed outcomes such as claim severity. The low correlation values indicate that claim severity remains difficult to predict, reflecting the high variability and stochastic nature of individual claim amounts.
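A corresponding sketch for the Gamma case (scikit-learn's mean_gamma_deviance provides the averaged analogue; both observations and predictions must be strictly positive):

```python
import numpy as np
from sklearn.metrics import mean_gamma_deviance

def gamma_deviance(y_obs, y_pred):
    """Out-of-sample Gamma deviance for strictly positive severities and predictions."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 2.0 * np.sum(-np.log(y_obs / y_pred) + (y_obs - y_pred) / y_pred)

# Averaged alternative: mean_gamma_deviance(y_obs, y_pred)
```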
The comparison between the Gamma GLM and XGBoost reveals a trade-off across evaluation metrics. The XGBoost model achieves lower mean absolute error and higher correlation, indicating an improved ability to capture relative variation in claim severity.
In contrast, the Gamma GLM achieves a substantially lower Gamma deviance and slightly lower mean squared error, reflecting a better fit under the assumed distributional framework.
The difference in Gamma deviance is substantial, suggesting that the Gamma GLM provides a better distribution-consistent fit, while XGBoost offers improved predictive flexibility.
These results highlight that, while machine learning models may improve certain aspects of predictive performance, traditional actuarial models remain competitive when evaluated using distribution-consistent metrics. The differing performance across metrics underscores the importance of evaluating models using multiple criteria, particularly in the presence of heavy-tailed loss distributions.
Figure 6 illustrates the relationship between actual and predicted claim severity. The Gamma GLM produces highly concentrated predictions, indicating strong shrinkage toward the mean and limited ability to capture variability in claim sizes. In contrast, the XGBoost model exhibits greater dispersion and improved alignment with observed values, highlighting its ability to capture nonlinear relationships and complex interactions.
However, both models display considerable scatter, underscoring the inherently stochastic nature of claim severity and the difficulty of accurately predicting individual claim amounts. Observations with zero or near-zero values were excluded from the log-scale visualization to avoid numerical issues associated with logarithmic transformation.
4.3. Results for Frequency–Severity Decomposition for Pure Premium
Table 7 presents the performance of the frequency–severity decomposition models using both error-based and ranking-based metrics. Distribution-consistent measures such as the Tweedie deviance could also be considered for aggregate losses, but their computation depends on additional model assumptions and estimation procedures and is therefore left for future work.
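For clarity, the decomposition-based predictions evaluated here combine the two fitted components multiplicatively; a minimal sketch, assuming the frequency model predicts the expected claim count per unit exposure and the severity model the expected size of a claim (function and argument names are ours):

```python
import numpy as np

def pure_premium(freq_hat, sev_hat):
    """Combine frequency and severity predictions into a pure premium per unit exposure."""
    return np.asarray(freq_hat, dtype=float) * np.asarray(sev_hat, dtype=float)

def expected_total_loss(freq_hat, sev_hat, exposure):
    """Expected loss over each policy's exposure period."""
    return pure_premium(freq_hat, sev_hat) * np.asarray(exposure, dtype=float)
```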
Among the models considered, the XGBoost–XGBoost configuration achieves the lowest MSE and MAE, as well as the highest correlation, indicating that incorporating machine learning methods for both frequency and severity improves the overall accuracy of pure premium estimation.
In contrast, the ranking-based results reveal a different pattern. The XGBoost–Gamma model achieves the highest lift in the top decile, although the magnitude of the difference relative to other models is modest and should be interpreted with caution. This pattern suggests that improvements in modeling claim frequency may contribute to stronger risk segmentation, although the effect cannot be isolated within the current framework.
To assess the robustness of these findings, we repeat the analysis across multiple random train–test splits. While the magnitude of lift varies across samples, the relative ranking of models remains broadly consistent.
The divergence between error-based and lift-based metrics points to a potential trade-off between prediction accuracy and risk segmentation, although the differences are modest. While the XGBoost–XGBoost model minimizes prediction error, the XGBoost–Gamma model achieves higher lift in the highest-risk segment. This indicates that different model configurations may be preferred depending on the objective, such as pricing accuracy versus risk classification.
We also examine the sensitivity of lift-based results to preprocessing choices, particularly the capping of claim severity. While the magnitude of lift varies under alternative thresholds, the qualitative patterns remain similar.
The classical Spline–Gamma model yields the weakest performance across all metrics, although it remains a useful benchmark due to its interpretability.
Overall, these results suggest that machine learning improves predictive performance within the decomposition framework, while differences in risk segmentation depend on model configuration and evaluation criteria.
Risk Segmentation Performance
Figure 7 illustrates the decile lift curves for the decomposition models. All models exhibit a strong concentration of losses in the highest-risk decile, indicating effective identification of high-risk policies.
Among the models, the XGBoost–Gamma configuration achieves the highest lift. However, this difference should be interpreted cautiously, as lift-based metrics are sensitive to model specification, sampling variability, and preprocessing choices. In contrast, although the XGBoost–XGBoost model provides the best overall prediction accuracy, its lift is lower, which is consistent with the modest trade-off observed between accuracy and ranking-based performance.
Alternative measures of risk segmentation, such as the area under the lift curve, may provide additional insights and are left for future research.
These findings suggest that both frequency and severity components contribute to model performance, although their relative impact on risk segmentation cannot be fully disentangled within the current analysis.
The non-monotonic behavior observed in lower-risk deciles reflects the inherent variability of insurance losses, particularly for policies with low exposure. These results provide empirical evidence supporting the continued use of frequency–severity decomposition, particularly when combined with machine learning methods.
4.4. Comparison Between Decomposition and Direct Modeling for Total Loss
To evaluate the effectiveness of direct modeling approaches, we compare the best-performing decomposition model (XGBoost–XGBoost) with direct XGBoost and Tweedie models for total claim amount prediction. All models are trained and evaluated using the same training and testing splits to ensure a fair comparison of predictive performance.
Table 8 summarizes the performance of decomposition and direct modeling approaches.
The decomposition-based XGBoost–XGBoost model achieves the lowest MSE and MAE and the highest correlation among the models considered. Although the numerical differences are modest, these results suggest that modeling claim frequency and severity separately may offer some performance advantages in predicting total losses.
The direct XGBoost model provides competitive performance, indicating that flexible machine learning methods can capture nonlinear relationships in aggregate losses. However, its performance is slightly below that of the decomposition framework, suggesting that modeling aggregate losses directly may not fully exploit the structural information captured by the frequency–severity approach.
The Tweedie model is implemented as a generalized linear model with a log link and serves as a baseline parametric benchmark. While more flexible extensions of Tweedie regression are available, including spline-based and machine learning approaches, these are not considered in the current study. The observed differences in performance should therefore be interpreted in light of the differing levels of model flexibility.
The Tweedie power parameter is estimated using profile likelihood methods, yielding an estimated value of approximately 1.52. This value lies within the compound Poisson–Gamma range and is consistent with typical insurance applications. We do not report a confidence interval for this estimate, which represents a limitation of the current analysis.
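For reference, the direct Tweedie benchmark at this power value can be sketched with scikit-learn's TweedieRegressor (an illustration under our assumptions about the design matrices; the profile-likelihood estimation of the power parameter itself is not shown):

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance

# X_train / X_test are assumed numeric design matrices of the rating factors,
# y_train / y_test the total claim amounts per policy (exposure handling omitted for brevity).
tweedie_glm = TweedieRegressor(power=1.52, link="log", alpha=0.0, max_iter=1000)
tweedie_glm.fit(X_train, y_train)

y_hat = tweedie_glm.predict(X_test)
print(mean_tweedie_deviance(y_test, y_hat, power=1.52))
```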
The results indicate that, while the data exhibit a compound structure, the single-equation specification of the Tweedie GLM may be more restrictive than the decomposition framework, which allows for separate modeling of frequency and severity components.
Overall, these findings suggest that decomposition-based approaches can provide improved predictive performance in this setting. However, the comparison reflects differences in both modeling structure and functional flexibility, and conclusions are therefore interpreted in terms of empirical performance rather than definitive methodological superiority.
5. Conclusions and Discussion
This study examines the integration of classical actuarial models and modern machine learning techniques for insurance pricing, with particular emphasis on comparing the traditional frequency–severity decomposition framework with direct modeling approaches. The results provide important insights into predictive performance and practical implications for risk management.
The empirical findings show that machine learning methods, particularly XGBoost, improve predictive accuracy for both claim frequency and claim severity. The improvement is especially evident for severity modeling, where the Gamma GLM tends to shrink predictions toward the mean and fails to capture the variability of claim sizes. In contrast, XGBoost effectively models nonlinear relationships and interactions, leading to lower prediction errors. Despite these improvements, correlation values remain relatively low, reflecting the inherently stochastic nature of claim severity.
Within the decomposition framework, the XGBoost–XGBoost model achieves the lowest prediction error among the models considered. However, improvements in error-based metrics are moderate, which highlights the high variability of insurance losses, particularly for policies with low exposure.
An important contribution of this study is the identification of a trade-off between prediction accuracy and risk segmentation. Although the XGBoost–XGBoost model minimizes prediction error, the XGBoost–Gamma model achieves the highest lift and provides improved identification of high-risk policies. The results suggest that differences in risk segmentation may be influenced by both frequency and severity components, although their relative contributions cannot be isolated within the current modeling framework. These findings indicate that model configuration plays an important role in risk segmentation performance, rather than providing conclusive evidence about the dominance of a specific component.
The comparison between decomposition and direct modeling approaches provides further insight. While direct XGBoost and Tweedie models deliver competitive performance, they do not achieve lower error than the decomposition-based framework based on the evaluation metrics considered in this study.
A deeper comparison with existing studies helps explain differences in empirical findings. The results of [10] suggest that direct modeling approaches can achieve lower prediction error than decomposition-based methods in predicting aggregate losses, while our results indicate that decomposition remains competitive. Several factors may contribute to this discrepancy.
First, differences in data characteristics play an important role. The dataset used in this study exhibits strong zero-inflation and heavy-tailed severity, which can affect the relative performance of direct and decomposition approaches. In contrast, prior studies may use datasets with different distributional properties, leading to different conclusions.
Second, model specification differs across studies. In particular, [11] show that gradient boosting improves claim frequency prediction but that classical models may remain competitive for claim severity. This aligns with our findings, where machine learning provides consistent gains for frequency but more modest improvements for severity. Since decomposition models allow separate specification of frequency and severity components, they can better accommodate such differences.
Third, evaluation criteria influence the conclusions. Some studies focus primarily on aggregate prediction error, while our analysis also considers risk segmentation through lift metrics. These different objectives may favor different modeling approaches.
Finally, recent literature highlights that more flexible implementations of Tweedie and frequency–severity models, including deep learning extensions, can further affect comparative performance. For example, machine learning approaches can capture nonlinear dependencies and interactions that are not modeled in classical GLMs, but their performance depends on tuning, data structure, and evaluation metrics.
Overall, these considerations suggest that differences across studies may be attributed to variations in data, model specification, and evaluation objectives, rather than representing conflicting conclusions.
This discrepancy highlights an important distinction between general predictive modeling and actuarial applications. Direct modeling may perform well in settings where minimizing prediction error is the sole objective. However, in insurance pricing, the frequency–severity decomposition reflects the underlying structure of claim processes and enables separate modeling of distinct risk components. This structural advantage allows for greater flexibility in capturing heterogeneous risk factors and improves risk segmentation.
In addition, direct modeling imposes a single functional relationship between covariates and aggregate loss, which may be overly restrictive when frequency and severity are influenced by different drivers. In contrast, the decomposition framework accommodates distinct covariate effects for each component, resulting in improved predictive performance in practice.
From a practical perspective, these findings have important implications for insurance pricing and risk management. If the objective is to minimize prediction error, models that incorporate machine learning for both frequency and severity are preferred. If the goal is to identify high-risk policies for underwriting or portfolio management, models that emphasize frequency modeling are more effective. However, these conclusions should be interpreted with caution, as they depend on model specification, evaluation metrics, and data preprocessing choices.
The results are also consistent with prior studies comparing gradient boosting methods with classical GLMs within the frequency–severity framework [11]. As in previous work, machine learning improves claim frequency prediction. However, while earlier studies report mixed results for severity modeling, the present analysis shows that XGBoost can provide competitive or improved performance relative to the Gamma GLM when appropriately specified and tuned. This suggests that the relative performance of machine learning for severity is context-dependent and may improve with richer data and more flexible model configurations.
Overall, this study demonstrates that machine learning enhances predictive performance, but its effectiveness is maximized when combined with the classical frequency–severity decomposition framework. The results indicate that the relative performance of decomposition and direct modeling approaches depends on model specification and evaluation criteria, rather than reflecting a universal dominance of one approach. These findings provide empirical support for decomposition-based pricing approaches and offer a foundation for future research on hybrid models that balance predictive accuracy and risk segmentation in complex insurance environments.
We note that comparisons across model configurations combine differences in both model structure and model specification. As a result, the observed differences in lift cannot be attributed to a single component, such as frequency or severity. The findings should therefore be interpreted as empirical patterns rather than definitive evidence of the relative importance of individual components.
A more rigorous assessment of the relative contributions of frequency and severity would require a controlled modeling framework or variance decomposition analysis, which we leave for future research.
Recent developments in Tweedie-based modeling, including extensions using neural networks and gradient boosting, further expand the range of modeling approaches. Incorporating such methods into a unified comparison framework represents an important direction for future research.