1. Introduction
Obtaining accurate estimates of pure premiums is a central challenge in automobile insurance pricing. Insurers must determine premiums that reflect the underlying risk of each policyholder while ensuring fairness, competitiveness, and regulatory compliance, as premium estimation directly affects portfolio stability and profitability.
A cornerstone of actuarial practice is the frequency–severity framework, which decomposes the pure premium into two latent components: the expected number of claims and the expected claim size. By modeling these components separately, actuaries can employ statistical distributions tailored to the distinct stochastic properties of each process before recombining them for final loss estimation. This framework provides both interpretability and flexibility, making it fundamental to modern insurance ratemaking.
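Under the standard assumption that individual claim sizes are independent and identically distributed and independent of the claim count, the decomposition can be stated compactly as
\[
\text{Pure premium} \;=\; \mathbb{E}\!\left[\sum_{k=1}^{N} Y_k\right] \;=\; \mathbb{E}[N]\,\mathbb{E}[Y],
\]
where \(N\) denotes the number of claims and \(Y_k\) the individual claim amounts.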
Generalized Linear Models (GLMs) have long served as the industry standard for actuarial modeling. Existing literature highlights the capacity of GLMs to provide an interpretable yet flexible structure for non-normal, skewed, and heteroscedastic insurance data [1,2,3]. While Poisson and Negative Binomial models are commonly used for claim frequency, Gamma models are widely applied to claim severity due to the strictly positive and right-skewed nature of loss amounts. Extensions such as Generalized Additive Models (GAMs) allow for nonlinear relationships while preserving model transparency [4,5].
Previous studies have demonstrated the applicability of generalized linear models in modeling claim frequency and insurance pricing across a wide range of contexts, including emerging markets [4,5,6].
Machine learning (ML) has emerged as a powerful tool in insurance analytics. Tree-based algorithms and gradient boosting methods are particularly effective in capturing complex interactions and nonlinear relationships that traditional parametric models may overlook. Empirical evidence suggests that models such as XGBoost can improve performance in tasks ranging from claim frequency prediction to fraud detection [7,8,9].
Recent studies such as [10,11] provide empirical comparisons of decomposition-based and direct modeling approaches. These studies highlight that model performance can vary depending on the underlying data structure and the choice of evaluation metrics. Compared to these studies, our analysis provides a unified evaluation framework that applies both decomposition-based and direct modeling approaches to the same dataset using consistent evaluation metrics, allowing for a more controlled comparison of model performance.
Recent work has also explored both direct and component-based modeling strategies using machine learning. Direct modeling of total claim amounts, using methods such as Support Vector Regression (SVR), XGBoost, and neural networks, has demonstrated strong predictive accuracy for aggregate loss estimation [12]. Other studies incorporate machine learning within the classical frequency–severity framework, where gradient boosting models are compared with traditional GLMs [11]. These findings indicate that machine learning consistently improves claim frequency prediction, while results for claim severity remain more mixed due to the high variability of loss amounts.
There is also a growing body of work on neural network–based approaches to insurance modeling, including the contributions of Richman and Wüthrich [13,14], which demonstrate the effectiveness of deep learning methods in capturing complex nonlinear relationships in frequency and severity components. These approaches often integrate decomposition structures within flexible modeling frameworks, suggesting that the distinction between direct and component-based modeling may depend more on implementation than on fundamental modeling philosophy.
Alternative unified modeling approaches have also been proposed, most notably through the Tweedie distribution, which provides a compound Poisson–Gamma representation of aggregate losses [15,16]. While such models offer strong theoretical appeal, their empirical performance relative to decomposition-based methods may depend on model specification and data characteristics.
Motivated by these considerations and the mixed empirical findings in the literature, this study is guided by the following research questions: (i) whether machine learning methods improve predictive performance within the frequency–severity decomposition framework, and (ii) how direct modeling approaches compare with decomposition-based methods in predicting aggregate losses.
In particular, we evaluate model performance not only in terms of prediction accuracy but also in terms of the ability to identify high-risk policies.
To address these questions, we develop a comprehensive modeling framework that integrates classical actuarial models with modern machine learning techniques. We evaluate several models for claim frequency, including Poisson GLMs, spline-based models, and XGBoost. For claim severity, we compare a Gamma GLM with XGBoost. These models are combined within the frequency–severity decomposition framework to estimate pure premium. In addition, we examine direct modeling approaches, including XGBoost and Tweedie models, for predicting aggregate losses.
In this context, our contribution is a unified empirical comparison of decomposition and direct modeling approaches using both classical actuarial models and modern machine learning methods, evaluated under multiple performance metrics. This enables a consistent comparison across modeling approaches, highlighting not only differences in predictive accuracy but also the role of model structure and evaluation criteria in shaping empirical conclusions.
The contribution of this study is threefold. First, we show that machine learning methods can improve predictive performance for claim frequency and provide competitive performance for claim severity, particularly in capturing nonlinear relationships and interactions. Second, we provide empirical evidence that the frequency–severity decomposition framework remains competitive relative to direct modeling approaches, even when advanced machine learning methods are applied. Third, we identify a trade-off between prediction accuracy and risk segmentation, although the magnitude of this trade-off is modest and depends on model configuration.
Overall, this study contributes to the literature on actuarial science and financial mathematics by providing a comprehensive comparison of classical and modern approaches to insurance pricing. The results highlight the continued relevance of the frequency–severity decomposition framework while demonstrating how machine learning can be effectively integrated to enhance predictive performance and risk differentiation.
The remainder of the paper is organized as follows. Section 2 describes the dataset and data preparation procedures; Section 3 presents the modeling framework and statistical methods; Section 4 reports the empirical findings; and Section 5 concludes with a discussion of the results and their implications for insurance pricing and risk management.
2. Data Description and Preparation
This study utilizes the French motor third-party liability (freMTPL2) datasets from the CASdatasets package [17]. The raw data comprise two files: freMTPL2freq, containing 677,991 policy-year records, and freMTPL2sev, containing 26,444 individual claim amounts. These datasets provide a comprehensive set of risk features, including driver demographics, vehicle attributes, and geographic factors.
Table 1 summarizes the variables included in the frequency dataset. These covariates represent standard rating factors used in automobile insurance pricing, including driver age, vehicle characteristics, geographic region, and exposure.
The severity dataset contains 26,444 claim-level observations and includes the policy identifier and the corresponding claim amount for each reported claim. Because the frequency data are recorded at the policy level while the severity data are recorded at the claim level, the two datasets must be reconciled before modeling.
2.1. Data Integration and Preprocessing
To reconcile the policy-level frequency data with the claim-level severity data, all individual claim amounts in freMTPL2sev were aggregated by policy identifier (IDpol). This produced, for each policy, the total claim amount incurred during the exposure period and the number of associated claims. Policies with no reported claims were assigned a total loss of zero. The aggregated severity information was then merged with the frequency dataset to create a unified policy-level file containing exposure, claim counts, total incurred losses, and all rating variables.
For policies with at least one claim, an average claim severity variable was also computed for descriptive purposes. Pure premium values were obtained by dividing the total incurred loss by the policy’s exposure. These derived quantities are used only for exploratory analysis in this section; the formal notation and modeling framework are introduced later in Section 3.
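A minimal sketch of this integration and of the derived quantities described above, assuming the two raw tables have been loaded into pandas DataFrames named freq and sev with the CASdatasets column names (IDpol, ClaimNb, Exposure, ClaimAmount); the variable names are illustrative rather than prescribed by the paper:

```python
import pandas as pd

# Assumed inputs: freq (freMTPL2freq, policy level) and sev (freMTPL2sev, claim level),
# loaded as pandas DataFrames with the CASdatasets column names.
claims = (
    sev.groupby("IDpol")
       .agg(TotalClaimAmount=("ClaimAmount", "sum"),
            ClaimCount=("ClaimAmount", "size"))
       .reset_index()
)

# Merge onto the policy-level frequency file; policies without claims receive zero losses.
policy = freq.merge(claims, on="IDpol", how="left")
policy["TotalClaimAmount"] = policy["TotalClaimAmount"].fillna(0.0)
policy["ClaimCount"] = policy["ClaimCount"].fillna(0).astype(int)

# Derived quantities used only for the exploratory analysis in Section 2.2.
policy["PurePremium"] = policy["TotalClaimAmount"] / policy["Exposure"]
with_claims = policy["ClaimCount"] > 0
policy.loc[with_claims, "AvgSeverity"] = (
    policy.loc[with_claims, "TotalClaimAmount"] / policy.loc[with_claims, "ClaimCount"]
)
```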
2.2. Exploratory Data Analysis
The raw insurance portfolio exhibits the extreme class imbalance and heavy-tailed distributions typical of motor liability risks.
Figure 1 illustrates these characteristics using log-scaled axes to visualize the full range of the data.
The frequency distribution (Figure 1, left) is dominated by zero-claim policies, with a rapid decay in frequency as claim counts increase. Notably, the raw data contain rare observations of up to 16 claims per policy, appearing as isolated points in the extreme tail. Similarly, the claim severity and pure premium distributions (Figure 1, center and right) span several orders of magnitude. The severity plot reveals a significant concentration of claims around EUR 1000, but also shows an exceptionally long right tail reaching towards several million EUR, representing catastrophic losses.
These visualizations highlight the necessity of the preprocessing steps described in Section 2.3. Specifically, the extreme sparsity of high claim counts and the high-leverage outliers in the severity tail motivate the use of capping (winsorization) to ensure the numerical stability and generalizability of the frequency and severity models.
2.3. Data Cleaning and Preparation
This section describes the preprocessing steps applied to improve model stability and address the heavy-tailed nature of insurance data. These procedures are particularly important for both Poisson frequency models and Gamma severity models, which are sensitive to extreme observations.
We apply several preprocessing steps to mitigate the influence of extreme values. Specifically, claim frequency is capped at four, and claim severity is winsorized at EUR 100,000. These transformations are commonly used in actuarial applications to stabilize estimation in the presence of heavy-tailed loss distributions.
We acknowledge that these preprocessing choices may affect both the distributional properties of the data and the interpretation of the resulting pure premium estimates. For example, capping claim frequency implies that policyholders with more than four claims are treated equivalently, while winsorization truncates extreme claim amounts that may carry important information about tail risk.
To assess the robustness of these choices, we conduct sensitivity analyses using alternative frequency caps (2, 3, and 4) and severity thresholds (EUR 50,000, EUR 100,000, and EUR 200,000). The detailed results are reported in Appendix A.
The results show that frequency model performance remains largely unchanged across different caps, indicating robustness to this preprocessing choice. For severity models, the mean squared error increases as the winsorization threshold increases, reflecting the influence of larger claims rather than instability in model performance. Importantly, the relative comparison across models remains consistent.
Overall, these findings indicate that the main conclusions regarding model comparison are stable across different preprocessing specifications.
Policies with exposure below 0.1 years are excluded to avoid instability arising from extremely short exposure periods. We recognize that this filtering reduces the sample size and may introduce selection bias, and we therefore interpret the results with this limitation in mind.
Alternative approaches, such as robust regression techniques that downweight rather than truncate extreme observations, may offer a complementary strategy and are left for future research.
Exposure Filtering
Policies with extremely low exposure are prone to producing artificial variance in annualized claim rates. We excluded all records with an exposure below 0.1 years. This threshold ensures that the observations used for training represent a meaningful period of risk. As shown in Table 2, this step increased the mean exposure from 0.529 to 0.632 years.
Capping and Winsorization
Given the extreme right-skewness observed in Figure 1, we applied capping and winsorization as follows (a brief implementation sketch follows the list):
Claim Frequency: Observations with more than 4 claims were capped at 4. This choice is supported by the empirical distribution of the data, where observations with more than four claims are extremely rare (fewer than 0.001% of policies). Modeling these sparse observations separately would introduce estimation noise without improving predictive performance.
Claim Severity: The severity distribution exhibits substantial right-skewness, with a small number of extremely large claims (up to EUR 4.07 million) that are several orders of magnitude larger than the median claim size. Such observations can disproportionately influence model estimation, particularly for Gamma-based models and error-based evaluation metrics. Winsorization at EUR 100,000 is therefore applied to limit the influence of these extreme values while preserving the overall structure of the distribution.
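A compact sketch of these cleaning steps on the merged policy-level table, continuing the illustrative policy DataFrame from Section 2.1 and using the thresholds stated above; the variable on which winsorization is applied is simplified here:

```python
# Exposure filtering: drop policies with less than 0.1 years of exposure.
policy = policy[policy["Exposure"] >= 0.1].copy()

# Claim frequency capping: counts above 4 are set to 4.
policy["ClaimNb"] = policy["ClaimNb"].clip(upper=4)

# Claim severity winsorization at EUR 100,000 (shown on aggregated losses for brevity;
# in practice the cap is applied to claim severities).
policy["TotalClaimAmount"] = policy["TotalClaimAmount"].clip(upper=100_000)
```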
2.4. Final Modeling Dataset Profile
The resulting dataset provides a stabilized foundation for estimating pure premiums.
Table 3 summarizes the response variables that will be utilized in Section 3. The disparity between the mean and median values across all metrics reinforces the inherent skewness that persists even after cleaning, necessitating the specialized modeling framework proposed in this study.
4. Results
This section presents the empirical results for the claim frequency, claim severity, and pure premium models. We begin by evaluating the performance of alternative models for claim frequency, followed by claim severity, and then assess the combined frequency–severity decomposition and direct modeling approaches. Model comparisons are based on observed differences in evaluation metrics and are descriptive in nature; no formal statistical significance tests are conducted.
4.1. Claim Frequency Model Performance
Table 5 summarizes the performance of the claim frequency models. Lift is defined as the ratio of the observed loss in a selected high-risk segment to the average loss across the entire portfolio, providing a measure of the model’s ability to identify high-risk policies. We note that the observed differences in performance metrics are relatively small and should be interpreted with caution. While formal statistical significance tests could be conducted using repeated resampling or cross-validation, the current analysis focuses on out-of-sample predictive performance. The consistency of model rankings across multiple train–test splits provides additional support for the robustness of the results.
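As an illustration of this definition, the top-decile lift reported in Table 5 can be computed as follows (a sketch; array and function names are ours):

```python
import numpy as np

def top_decile_lift(y_obs, y_pred):
    """Observed mean outcome in the top 10% of predicted risk,
    relative to the observed mean over the whole test portfolio."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(y_pred, 0.9)   # boundary of the highest-risk decile
    top = y_pred >= threshold
    return y_obs[top].mean() / y_obs.mean()
```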
Among all models, XGBoost achieves the lowest MSE and MAE, as well as the highest correlation and lift. In particular, the lift in the top decile indicates a strong ability to identify high-risk policies, highlighting the effectiveness of machine learning methods for risk segmentation.
Among the parametric models, the Poisson model with spline terms and the GAM provide modest improvements over the standard Poisson GLM, suggesting that incorporating nonlinear effects enhances predictive performance. In contrast, the Negative Binomial model performs similarly to the Poisson GLM, indicating that accounting for overdispersion does not substantially improve predictive accuracy in this dataset.
We further evaluate model performance using distribution-specific deviance measures. For claim frequency models, we report the out-of-sample Poisson deviance, which provides a distribution-consistent assessment for count data and is defined as
\[
D_{\mathrm{Poisson}} \;=\; 2\sum_{i=1}^{n}\left[\, y_i \log\!\left(\frac{y_i}{\hat{y}_i}\right) - \left(y_i - \hat{y}_i\right)\right],
\]
where \(y_i\) denotes the observed claim count and \(\hat{y}_i\) the predicted value. The Poisson deviance measures the discrepancy between observed and predicted counts under the Poisson distribution, with lower values indicating a better fit, and aligns with the distributional assumptions of Poisson-based models. By convention, the term \(y_i \log(y_i/\hat{y}_i)\) is defined as zero when \(y_i = 0\).
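A short sketch of this computation (the manual form mirrors the definition above; scikit-learn's mean_poisson_deviance gives the per-observation average of the same quantity):

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

def poisson_deviance(y_obs, y_pred):
    """Out-of-sample Poisson deviance; the y*log(y/mu) term is taken as 0 when y = 0."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_term = np.where(y_obs > 0, y_obs * np.log(y_obs / y_pred), 0.0)
    return 2.0 * np.sum(log_term - (y_obs - y_pred))

# Averaged alternative: mean_poisson_deviance(y_obs, y_pred)
```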
The results indicate that XGBoost achieves the lowest deviance among all models, and the relative ranking of models is consistent with the error-based metrics, supporting the robustness of the empirical findings.
We note that error-based metrics such as MSE are sensitive to extreme values and may be influenced by preprocessing choices such as capping and winsorization. Therefore, these metrics are interpreted in conjunction with deviance-based measures.
To assess the robustness of lift-based comparisons, we repeat the analysis across multiple random train–test splits and observe that the relative ranking of models remains broadly consistent. We acknowledge that formal statistical inference for lift measures, such as confidence intervals or hypothesis tests, is not included and represents a direction for future research.
Finally, we note that proper scoring rules, such as the continuous ranked probability score (CRPS), may provide additional insights into predictive distributions. As the current analysis focuses on point predictions, we leave this as a direction for future research.
Claim Capture Curves
Figure 2 presents the cumulative claim capture curves for the competing models. The XGBoost model consistently captures a larger proportion of claims in the highest-risk segments, followed by the Poisson model with spline terms and the GAM. The standard Poisson and Negative Binomial models exhibit weaker concentration of claims.
These results are consistent with the lift statistics and confirm that models incorporating nonlinear effects improve risk segmentation, with machine learning methods providing additional gains.
Decile Lift Analysis
Figure 3 presents the decile lift curves for the competing models. The XGBoost model achieves the highest lift in the top decile, followed by the Poisson model with spline terms and the GAM. Differences across the middle deciles are relatively small, reflecting the low overall claim frequency and limited variability in moderate-risk groups.
Interpretation of Model Components
Figure 4 presents the variable importance results for the XGBoost frequency model, measured using gain. Gain reflects the average improvement in the objective function attributable to splits on each variable and therefore provides a direct measure of each predictor’s contribution to predictive performance. This metric is widely used in applied machine learning and is particularly relevant in actuarial applications where the focus is on predictive accuracy.
The results indicate that BonusMalus is by far the most influential predictor, followed by DrivAge, Density, and VehAge. These findings are consistent with actuarial intuition, as past claims experience and driver characteristics are known to be strong determinants of claim frequency.
We note that alternative importance measures, such as cover or frequency, may yield different rankings, as they capture different aspects of variable usage within the model. Consequently, the importance values should be interpreted in the context of the chosen metric. In particular, gain-based importance emphasizes variables that contribute most to reducing prediction error, which aligns with the objectives of risk modeling and premium estimation.
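Gain-based importance of this kind can be read directly from a fitted xgboost model (a sketch; the booster object name is ours):

```python
import pandas as pd

# `bst` is assumed to be a fitted xgboost.Booster for the frequency model.
gain = bst.get_score(importance_type="gain")   # feature name -> average gain per split
importance = pd.Series(gain, name="gain").sort_values(ascending=False)
print(importance.head(10))                     # e.g. BonusMalus, DrivAge, Density, VehAge, ...
```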
Figure 5 presents the estimated smooth effects from the GAM for selected predictors, along with approximate 95% confidence bands. Overall, the results reveal nonlinear relationships between several covariates and claim frequency. The effect of VehAge exhibits a mild nonlinear pattern, with a slight decrease for mid-range vehicle ages followed by a gradual increase, suggesting that both very new and older vehicles may be associated with higher risk. A similar but less pronounced pattern is observed for DrivAge, where the effect remains relatively stable across most ages but declines slightly at higher values, indicating a potential reduction in risk among older drivers.
In contrast, BonusMalus shows a strong nonlinear effect, with a pronounced increase in risk for higher values, reflecting its role as a key indicator of prior claims experience. This confirms its importance as a dominant predictor in the model. The effect of Density appears relatively flat, with narrow confidence bands across most of its range, suggesting that population density contributes limited additional explanatory power once other variables are accounted for.
The width of the confidence bands varies across the range of each predictor, with wider intervals in regions with fewer observations, indicating greater uncertainty in those areas. Overall, these results highlight the ability of the GAM to capture complex nonlinear effects while providing interpretable insights into the contribution of individual risk factors.
These findings are consistent with the use of GAMs as a flexible extension of GLMs, allowing nonlinear effects to be captured without imposing strict parametric assumptions.
Model Selection
The results indicate that XGBoost provides the strongest overall predictive performance, achieving the lowest error measures, the highest correlation, and the greatest lift in the highest-risk decile. In addition, the Poisson model with spline terms delivers competitive performance while retaining the interpretability and transparency of the generalized linear modeling framework widely used in actuarial practice.
To enable a direct comparison between a traditional actuarial approach and a modern machine learning method in subsequent analyses, both models are retained. The Poisson spline model serves as the representative GLM-based specification, while XGBoost represents the machine learning approach. This dual-model strategy allows us to evaluate not only predictive accuracy but also the trade-offs between interpretability and performance in insurance pricing applications.
4.2. Claim Severity Model Performance
Table 6 summarizes the performance of the claim severity models. We report out-of-sample Gamma deviance for claim severity models, ensuring consistency with the assumed distributional framework.
The Gamma deviance is defined as
\[
D_{\mathrm{Gamma}} \;=\; 2\sum_{i=1}^{n}\left[ -\log\!\left(\frac{y_i}{\hat{y}_i}\right) + \frac{y_i - \hat{y}_i}{\hat{y}_i}\right],
\]
where \(y_i\) denotes the observed claim severity and \(\hat{y}_i\) the predicted value. The Gamma deviance measures the discrepancy between observed and predicted values under the Gamma distribution, with lower values indicating a better fit. This metric is particularly appropriate for modeling positive, right-skewed outcomes such as claim severity. The low correlation values indicate that claim severity remains difficult to predict, reflecting the high variability and stochastic nature of individual claim amounts.
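A corresponding sketch for the Gamma case (scikit-learn's mean_gamma_deviance provides the averaged analogue; both observations and predictions must be strictly positive):

```python
import numpy as np
from sklearn.metrics import mean_gamma_deviance

def gamma_deviance(y_obs, y_pred):
    """Out-of-sample Gamma deviance for strictly positive severities and predictions."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 2.0 * np.sum(-np.log(y_obs / y_pred) + (y_obs - y_pred) / y_pred)

# Averaged alternative: mean_gamma_deviance(y_obs, y_pred)
```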
The comparison between the Gamma GLM and XGBoost reveals a trade-off across evaluation metrics. The XGBoost model achieves lower mean absolute error and higher correlation, indicating an improved ability to capture relative variation in claim severity.
In contrast, the Gamma GLM achieves a substantially lower Gamma deviance and slightly lower mean squared error, reflecting a better fit under the assumed distributional framework.
The difference in Gamma deviance is substantial, suggesting that the Gamma GLM provides a better distribution-consistent fit, while XGBoost offers improved predictive flexibility.
These results highlight that, while machine learning models may improve certain aspects of predictive performance, traditional actuarial models remain competitive when evaluated using distribution-consistent metrics. The differing performance across metrics underscores the importance of evaluating models using multiple criteria, particularly in the presence of heavy-tailed loss distributions.
Figure 6 illustrates the relationship between actual and predicted claim severity. The Gamma GLM produces highly concentrated predictions, indicating strong shrinkage toward the mean and limited ability to capture variability in claim sizes. In contrast, the XGBoost model exhibits greater dispersion and improved alignment with observed values, highlighting its ability to capture nonlinear relationships and complex interactions.
However, both models display considerable scatter, underscoring the inherently stochastic nature of claim severity and the difficulty of accurately predicting individual claim amounts. Observations with zero or near-zero values were excluded from the log-scale visualization to avoid numerical issues associated with logarithmic transformation.
4.3. Results for Frequency–Severity Decomposition for Pure Premium
Table 7 presents the performance of the frequency–severity decomposition models using both error-based and ranking-based metrics. Distribution-consistent measures such as the Tweedie deviance could also be considered for aggregate losses, but their computation depends on additional model assumptions and estimation procedures and is therefore left for future work.
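For clarity, the decomposition-based predictions evaluated here combine the two fitted components multiplicatively; a minimal sketch, assuming the frequency model predicts the expected claim count per unit exposure and the severity model the expected size of a claim (function and argument names are ours):

```python
import numpy as np

def pure_premium(freq_hat, sev_hat):
    """Combine frequency and severity predictions into a pure premium per unit exposure."""
    return np.asarray(freq_hat, dtype=float) * np.asarray(sev_hat, dtype=float)

def expected_total_loss(freq_hat, sev_hat, exposure):
    """Expected loss over each policy's exposure period."""
    return pure_premium(freq_hat, sev_hat) * np.asarray(exposure, dtype=float)
```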
Among the models considered, the XGBoost–XGBoost configuration achieves the lowest MSE and MAE, as well as the highest correlation, indicating that incorporating machine learning methods for both frequency and severity improves the overall accuracy of pure premium estimation.
In contrast, the ranking-based results reveal a different pattern. The XGBoost–Gamma model achieves the highest lift in the top decile, although the magnitude of the difference relative to other models is modest and should be interpreted with caution. This pattern suggests that improvements in modeling claim frequency may contribute to stronger risk segmentation, although the effect cannot be isolated within the current framework.
To assess the robustness of these findings, we repeat the analysis across multiple random train–test splits. While the magnitude of lift varies across samples, the relative ranking of models remains broadly consistent.
The divergence between error-based and lift-based metrics points to a potential trade-off between prediction accuracy and risk segmentation, although the differences are modest. While the XGBoost–XGBoost model minimizes prediction error, the XGBoost–Gamma model achieves higher lift in the highest-risk segment. This indicates that different model configurations may be preferred depending on the objective, such as pricing accuracy versus risk classification.
We also examine the sensitivity of lift-based results to preprocessing choices, particularly the capping of claim severity. While the magnitude of lift varies under alternative thresholds, the qualitative patterns remain similar.
The classical Spline–Gamma model yields the weakest performance across all metrics, although it remains a useful benchmark due to its interpretability.
Overall, these results suggest that machine learning improves predictive performance within the decomposition framework, while differences in risk segmentation depend on model configuration and evaluation criteria.
Risk Segmentation Performance
Figure 7 illustrates the decile lift curves for the decomposition models. All models exhibit a strong concentration of losses in the highest-risk decile, indicating effective identification of high-risk policies.
Among the models, the XGBoost–Gamma configuration achieves the highest lift. However, this difference should be interpreted cautiously, as lift-based metrics are sensitive to model specification, sampling variability, and preprocessing choices. In contrast, although the XGBoost–XGBoost model provides the best overall prediction accuracy, its lift is lower, which is consistent with the modest trade-off observed between accuracy and ranking-based performance.
Alternative measures of risk segmentation, such as the area under the lift curve, may provide additional insights and are left for future research.
These findings suggest that both frequency and severity components contribute to model performance, although their relative impact on risk segmentation cannot be fully disentangled within the current analysis.
The non-monotonic behavior observed in lower-risk deciles reflects the inherent variability of insurance losses, particularly for policies with low exposure. These results provide empirical evidence supporting the continued use of frequency–severity decomposition, particularly when combined with machine learning methods.
4.4. Comparison Between Decomposition and Direct Modeling for Total Loss
To evaluate the effectiveness of direct modeling approaches, we compare the best-performing decomposition model (XGBoost–XGBoost) with direct XGBoost and Tweedie models for total claim amount prediction. All models are trained and evaluated using the same training and testing splits to ensure a fair comparison of predictive performance.
Table 8 summarizes the performance of decomposition and direct modeling approaches.
The decomposition-based XGBoost–XGBoost model achieves the lowest MSE and MAE and the highest correlation among the models considered. Although the numerical differences are modest, these results suggest that modeling claim frequency and severity separately may offer some performance advantages in predicting total losses.
The direct XGBoost model provides competitive performance, indicating that flexible machine learning methods can capture nonlinear relationships in aggregate losses. However, its performance is slightly below that of the decomposition framework, suggesting that modeling aggregate losses directly may not fully exploit the structural information captured by the frequency–severity approach.
The Tweedie model is implemented as a generalized linear model with a log link and serves as a baseline parametric benchmark. While more flexible extensions of Tweedie regression are available, including spline-based and machine learning approaches, these are not considered in the current study. The observed differences in performance should therefore be interpreted in light of the differing levels of model flexibility.
The Tweedie power parameter is estimated using profile likelihood methods, yielding an estimated value of approximately 1.52. This value lies within the compound Poisson–Gamma range and is consistent with typical insurance applications. We do not report a confidence interval for this estimate, which represents a limitation of the current analysis.
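For reference, the direct Tweedie benchmark at this power value can be sketched with scikit-learn's TweedieRegressor (an illustration under our assumptions about the design matrices; the profile-likelihood estimation of the power parameter itself is not shown):

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor
from sklearn.metrics import mean_tweedie_deviance

# X_train / X_test are assumed numeric design matrices of the rating factors,
# y_train / y_test the total claim amounts per policy (exposure handling omitted for brevity).
tweedie_glm = TweedieRegressor(power=1.52, link="log", alpha=0.0, max_iter=1000)
tweedie_glm.fit(X_train, y_train)

y_hat = tweedie_glm.predict(X_test)
print(mean_tweedie_deviance(y_test, y_hat, power=1.52))
```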
The results indicate that, while the data exhibit a compound structure, the single-equation specification of the Tweedie GLM may be more restrictive than the decomposition framework, which allows for separate modeling of frequency and severity components.
Overall, these findings suggest that decomposition-based approaches can provide improved predictive performance in this setting. However, the comparison reflects differences in both modeling structure and functional flexibility, and conclusions are therefore interpreted in terms of empirical performance rather than definitive methodological superiority.
5. Conclusions and Discussion
This study examines the integration of classical actuarial models and modern machine learning techniques for insurance pricing, with particular emphasis on comparing the traditional frequency–severity decomposition framework with direct modeling approaches. The results provide important insights into predictive performance and practical implications for risk management.
The empirical findings show that machine learning methods, particularly XGBoost, improve predictive accuracy for both claim frequency and claim severity. The improvement is especially evident for severity modeling, where the Gamma GLM tends to shrink predictions toward the mean and fails to capture the variability of claim sizes. In contrast, XGBoost effectively models nonlinear relationships and interactions, leading to lower prediction errors. Despite these improvements, correlation values remain relatively low, reflecting the inherently stochastic nature of claim severity.
Within the decomposition framework, the XGBoost–XGBoost model achieves the lowest prediction error among the models considered. However, improvements in error-based metrics are moderate, which highlights the high variability of insurance losses, particularly for policies with low exposure.
An important contribution of this study is the identification of a trade-off between prediction accuracy and risk segmentation. Although the XGBoost–XGBoost model minimizes prediction error, the XGBoost–Gamma model achieves the highest lift and provides improved identification of high-risk policies. The results suggest that differences in risk segmentation may be influenced by both frequency and severity components, although their relative contributions cannot be isolated within the current modeling framework. These findings indicate that model configuration plays an important role in risk segmentation performance, rather than providing conclusive evidence about the dominance of a specific component.
The comparison between decomposition and direct modeling approaches provides further insight. While direct XGBoost and Tweedie models deliver competitive performance, they do not achieve lower error than the decomposition-based framework based on the evaluation metrics considered in this study.
A deeper comparison with existing studies helps explain differences in empirical findings. The results of [10] suggest that direct modeling approaches can achieve lower prediction error than decomposition-based methods in predicting aggregate losses, while our results indicate that decomposition remains competitive. Several factors may contribute to this discrepancy.
First, differences in data characteristics play an important role. The dataset used in this study exhibits strong zero-inflation and heavy-tailed severity, which can affect the relative performance of direct and decomposition approaches. In contrast, prior studies may use datasets with different distributional properties, leading to different conclusions.
Second, model specification differs across studies. In particular, [11] show that gradient boosting improves claim frequency prediction but that classical models may remain competitive for claim severity. This aligns with our findings, where machine learning provides consistent gains for frequency but more modest improvements for severity. Since decomposition models allow separate specification of frequency and severity components, they can better accommodate such differences.
Third, evaluation criteria influence the conclusions. Some studies focus primarily on aggregate prediction error, while our analysis also considers risk segmentation through lift metrics. These different objectives may favor different modeling approaches.
Finally, recent literature highlights that more flexible implementations of Tweedie and frequency–severity models, including deep learning extensions, can further affect comparative performance. For example, machine learning approaches can capture nonlinear dependencies and interactions that are not modeled in classical GLMs, but their performance depends on tuning, data structure, and evaluation metrics.
Overall, these considerations suggest that differences across studies may be attributed to variations in data, model specification, and evaluation objectives, rather than representing conflicting conclusions.
This discrepancy highlights an important distinction between general predictive modeling and actuarial applications. Direct modeling may perform well in settings where minimizing prediction error is the sole objective. However, in insurance pricing, the frequency–severity decomposition reflects the underlying structure of claim processes and enables separate modeling of distinct risk components. This structural advantage allows for greater flexibility in capturing heterogeneous risk factors and improves risk segmentation.
In addition, direct modeling imposes a single functional relationship between covariates and aggregate loss, which may be overly restrictive when frequency and severity are influenced by different drivers. In contrast, the decomposition framework accommodates distinct covariate effects for each component, resulting in improved predictive performance in practice.
From a practical perspective, these findings have important implications for insurance pricing and risk management. If the objective is to minimize prediction error, models that incorporate machine learning for both frequency and severity are preferred. If the goal is to identify high-risk policies for underwriting or portfolio management, models that emphasize frequency modeling are more effective. However, these conclusions should be interpreted with caution, as they depend on model specification, evaluation metrics, and data preprocessing choices.
The results are also consistent with prior studies comparing gradient boosting methods with classical GLMs within the frequency–severity framework [11]. As in previous work, machine learning improves claim frequency prediction. However, while earlier studies report mixed results for severity modeling, the present analysis shows that XGBoost can provide competitive or improved performance relative to the Gamma GLM when appropriately specified and tuned. This suggests that the relative performance of machine learning for severity is context-dependent and may improve with richer data and more flexible model configurations.
Overall, this study demonstrates that machine learning enhances predictive performance, but its effectiveness is maximized when combined with the classical frequency–severity decomposition framework. The results indicate that the relative performance of decomposition and direct modeling approaches depends on model specification and evaluation criteria, rather than reflecting a universal dominance of one approach. These findings provide empirical support for decomposition-based pricing approaches and offer a foundation for future research on hybrid models that balance predictive accuracy and risk segmentation in complex insurance environments.
We note that comparisons across model configurations combine differences in both model structure and model specification. As a result, the observed differences in lift cannot be attributed to a single component, such as frequency or severity. The findings should therefore be interpreted as empirical patterns rather than definitive evidence of the relative importance of individual components.
A more rigorous assessment of the relative contributions of frequency and severity would require a controlled modeling framework or variance decomposition analysis, which we leave for future research.
Recent developments in Tweedie-based modeling, including extensions using neural networks and gradient boosting, further expand the range of modeling approaches. Incorporating such methods into a unified comparison framework represents an important direction for future research.