The methodology introduced in
Section 3 is general and can be applied to different assets, regions and perils, as long as the results of an underlying catastrophe model are made available. In this section we apply it to tropical cyclone parametric risk transfer for a hotel in Kingston, Jamaica.
4.1. Underlying Catastrophe Model and Description of the Problem
The proposed methodology relies on the results of the Risk Engineering + Development (RED) hurricane model for the Caribbean and its associated exposure database (
www.redrisk.com). The loss computation module within the catastrophe model computes the losses that could be caused by both winds and storm surges at specific exposures. The stochastic catalog contains thousands of years of simulated hurricane activity. Each hurricane track is associated with a variable number of observation points at 6 h time steps, logging the corresponding physical parameters of the hurricane, such as the location of the cyclone eye, its maximum wind speed (MWS), minimum sea-level pressure (MSLP) and radius of maximum winds (RMW). More details on the RED model can be found in [
8].
To define the parametric model, a single specific insured asset is considered. Its location is designated with a star on the map in
Figure 3 and overlaid with a sample set of hurricane paths simulated with the catastrophe risk model. These synthetic events have estimated losses associated with the considered exposure as well as physical parameters for multiple observations along the path over time. The dots in the figure represent observational points of the hurricanes at different times, with a time step of 6 h. Thus, each hurricane shown in the figure is associated with a data pair containing information on the losses of the insured asset caused by the hurricane and its physical parameters. These include, at each observation point, the longitude and latitude of the eye of the cyclone, its MWS in kt units, its MSLP in mbar units, and its RMW in km units.
The payout y is defined in the context of a classical insurance policy, based on the loss l caused by the event, as follows:

y(l) = \begin{cases} 0, & l < l_{\mathrm{att}} \\ l, & l_{\mathrm{att}} \le l < l_{\mathrm{exh}} \\ l_{\mathrm{exh}}, & l \ge l_{\mathrm{exh}} \end{cases}

where l_{\mathrm{att}} and l_{\mathrm{exh}} denote the attachment and exhaustion levels. The policy is only liable for losses above the lower bound (or attachment level) of payable losses, while losses above the upper bound (or exhaustion level) disburse a payout limited by the policy limit of liability, reached at the exhaustion level.
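In code, this payout rule can be sketched as follows; the function name is ours, and the default bound values are the USD 15,000 and USD 115,000 stated in Section 4.2:

```python
def payout(loss, attachment=15_000.0, exhaustion=115_000.0):
    """Payout implied by the policy layer: nothing below the attachment
    level, the loss itself between the bounds, and a payout capped at the
    exhaustion level (the policy limit of liability) above it."""
    if loss < attachment:
        return 0.0
    return min(loss, exhaustion)
```

For example, a USD 50,000 loss is paid in full, while any loss at or above the exhaustion level is paid at the policy limit.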
As anticipated in
Section 2, this type of prediction problem is inherently complex and imbalanced. Events are categorized as non-payable, partially payable, or fully payable according to the stated lower and upper bounds of payable losses. As insurance policies typically target catastrophic, rare events, the majority of the stochastic events result in no payout, while a small subset leads to partial or full compensation. Thus, it is critical both to correctly categorize the payout type of each event and to accurately predict the numerical losses for partially payable events. Our approach combines classification and regression models within a hybrid track to address this imbalance, aiming to minimize the gap between predicted and actual payouts for each specific event.
Table 1 summarizes the distribution of the three payout classes for all events. The pronounced imbalance is reflected in the fact that only
of the events reach a compensable loss.
4.2. Feature Selection: Pre-Processing with Data Analytics
Algorithm 1 begins by initializing parameters, setting the number of closest observations, and enabling or disabling descriptive statistics and geodesic distance calculations. It then prints the settings. Next, it defines the point of interest (POI) with specific coordinates and sets the lower and upper bounds on the losses that can be insured, thus also determining the basic information of a specific insurance policy.
| Algorithm 1 Processing and Analysis of Event Losses. |
- 1: Initialize parameters
- 2: Print settings
- 3: Define the POI coordinates and the lower and upper bounds on payable losses
- 4: Load the losses file; compute the payout-type columns
- 5: Distribution of events: compute the non-payable, partially payable, and fully payable event counts; return the statistics as a table
- 6: Statistics and boxplot: compute descriptive statistics for the losses; generate a boxplot with the lower and upper bounds as reference lines
- 7: Load and process the stochastic catalogue: compute Euclidean and, if enabled, geodesic distances; sort by event code and distance to the POI; drop unnecessary columns
- 8: Descriptive statistics per event: compute count, mean, std, median, min, max, Q1, and Q3 for each event; merge the results into the working data frame
- 9: Count occurrences of each event: compute event-code frequencies and descriptive statistics; filter to events with losses; count occurrences of the filtered event codes in the catalogue
- 10: Select closest observations per event: keep only the specified number of closest observations per event
- 11: Transform the catalogue to single-event rows: pivot so that each row represents a unique event; replace NaN distances by forward-filling and multiplying by 2; merge the per-event statistics if enabled
- 12: Merge catalogue and losses data: merge on event codes; fill NaN losses with 0
Using Python 3.12.3, the algorithm loads a CSV file containing the loss data for each event with real losses and computes new columns for the payout type. The three categories are ordinally encoded according to the stated lower and upper bounds: 0 for events whose loss does not reach the lower bound of payable losses (USD 15,000); 1 for events whose loss reaches the lower bound but not the upper bound (USD 115,000); and 2 for events whose loss exceeds the upper bound and is therefore paid out only at the upper-bound amount. A function is then defined to compute and return the counts and percentages of non-payable, partially payable, and fully payable events.
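A minimal pandas sketch of this ordinal encoding (the column names are hypothetical; the bounds are those stated above):

```python
import pandas as pd

LOWER, UPPER = 15_000.0, 115_000.0  # stated lower and upper bounds

def payout_class(loss):
    # 0: non-payable, 1: partially payable, 2: fully payable
    if loss < LOWER:
        return 0
    return 1 if loss < UPPER else 2

losses = pd.DataFrame({"event_id": [1, 2, 3],
                       "loss": [5_000.0, 60_000.0, 300_000.0]})
losses["payout_class"] = losses["loss"].apply(payout_class)

# share of events in each class, as returned by the distribution function
shares = losses["payout_class"].value_counts(normalize=True)
```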
Table 2 presents the distribution of loss-producing events.
The algorithm then loads and processes a stochastic catalog from another CSV file, computing distances (Euclidean and optionally geodesic) and sorting the events by code and distance to the POI. Unnecessary columns are dropped. Descriptive statistics for each observational parameter (distance, MWS, MSLP, and RMW) are computed per event over all available observation points: count, mean, standard deviation, median, minimum, maximum, first quartile, and third quartile. These statistics can later be merged into the data frame used for model training, since they summarize information from all observation points. In line with the subsequent model training described in
Section 4.3 and
Section 4.4, the algorithm keeps only the specified number of closest observations for each event as the basic and necessary features. The catalog is then given a row-to-column transformation so that each row represents a unique event with the physical parameters of each retained observation. Some events form and dissipate quickly, lacking the required number of observations near the POI; because hurricane durations differ, the number of observations varies, which generates NaNs after the transformation. To maintain feature consistency across all events, missing distances were imputed by forward-filling the last observed value and multiplying it by two, a simple but physically motivated approximation assuming that the hurricane moved further away from the POI after its last observation. The other physical parameters (MWS, RMW, and MSLP) were forward-filled to reflect persistence of the physical conditions. If enabled, the per-event descriptive statistics are merged with the basic features consisting of the parameters of the closest observation points. Finally, the algorithm merges the processed event data with the loss data on event codes and fills NaN values in the losses with zero.
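The imputation step can be sketched as below, assuming hypothetical wide-format columns `dist_i` and `mws_i` for the i-th closest observation. The text does not say whether consecutive gaps compound the doubling, so this sketch doubles the last observed distance once for every missing slot:

```python
import numpy as np
import pandas as pd

# toy single-event row; NaN where the storm dissipated before
# accumulating that many observations near the POI
df = pd.DataFrame({
    "dist_1": [12.0], "dist_2": [30.0], "dist_3": [np.nan], "dist_4": [np.nan],
    "mws_1": [95.0], "mws_2": [80.0], "mws_3": [np.nan], "mws_4": [np.nan],
})

dist_cols = [c for c in df.columns if c.startswith("dist_")]
mws_cols = [c for c in df.columns if c.startswith("mws_")]

# distances: forward-fill the last observed value along the columns,
# then double only the imputed entries (storm assumed to move away)
mask = df[dist_cols].isna()
df[dist_cols] = df[dist_cols].ffill(axis=1)
df[dist_cols] = df[dist_cols].where(~mask, df[dist_cols] * 2)

# other physical parameters: plain forward-fill (persistence assumption)
df[mws_cols] = df[mws_cols].ffill(axis=1)
```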
4.3. First Modeling Track: Classification Models
The goal of this track is to train models that classify all events as precisely as possible. As explained before, two classification methods were implemented on this dataset. The first consists of two two-level classifiers: the first decides whether the event needs to be paid at all and, if so, the second decides whether the event needs to be fully paid at the upper bound value. The second approach builds a three-level classifier that directly categorizes events into the three payout types. Algorithm 2 begins by setting several initial parameters, including the number of closest observations to consider, whether to use feature importance, the maximum number of important features, and whether to add interaction terms. Next, it loads a dataset from a CSV file and analyzes the distribution of the event categories: non-payable, partially payable, and fully payable. For feature importance, an XGBRegressor model is trained on the training data. If the flag for using feature importance is enabled, only the important features are selected for training. If interaction terms are enabled, the algorithm adds 'distance-parameter' interaction features by weighting each physical parameter of the n closest observations by the inverse of its distance to the insured asset, thereby capturing the joint effect of storm parameters and spatial proximity. The dataset is then split into features (inputs) and targets (outputs). The data is further divided into training, validation, and testing sets, which are used to train the model, tune hyperparameters, and report final performance, respectively. Notice that the test set remains strictly held out throughout the entire study, with no exposure during training, validation, or model selection in any track. The algorithm calculates the absolute and relative frequency of each class in both the training-and-validation set and the test set, then combines these frequencies into a summary for analysis.
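A sketch of the interaction-term construction (column names are hypothetical; only MWS is shown, but the same loop would cover MSLP and RMW):

```python
import pandas as pd

# hypothetical wide-format features for the n closest observations
df = pd.DataFrame({
    "dist_1": [50.0], "dist_2": [120.0],
    "mws_1": [100.0], "mws_2": [85.0],
})

n_closest = 2
for i in range(1, n_closest + 1):
    for param in ("mws",):  # in the paper, also MSLP and RMW
        # weight each physical parameter by the inverse of its distance
        # to the insured asset, so nearby observations count for more
        df[f"{param}_over_dist_{i}"] = df[f"{param}_{i}"] / df[f"dist_{i}"]
```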
Table 3 shows this summary, where the overall training and validation sets and the test set are evenly split across different categories.
| Algorithm 2 Classification Analysis. |
- 1: Load the dataset; compute counts and percentages for each event category
- 2: Notebook settings: number of closest observations, feature-importance flag, maximum number of important features, interaction-terms flag
- 3: Split the data into training and test sets
- 4: Split the training set into training and validation sets; compute absolute and relative class frequencies for both
- 5: First two-level classification model: set up XGBoost with cross-validation and early stopping; get the best number of boosting rounds; train the final model with that number; predict on the validation set; evaluate performance on the validation set
- 6: Second two-level classification model: same XGBoost setup with cross-validation and early stopping; train with the optimal number of boosting rounds; predict on the validation events that need further classification; evaluate performance on this subset of the validation set
- 7: Summarize the results of the two classifiers; evaluate performance on the whole validation set
- 8: Three-level classification model: same XGBoost setup with cross-validation and early stopping; train with the optimal number of boosting rounds; evaluate performance on the validation set
For the classification model, both approaches follow similar steps. The data is converted into a format suitable for the XGBoost (2.1.3) algorithm. The algorithm sets the parameters for the XGBoost model, specifying either a binary classification problem with two classes or a multi-class classification problem with three classes. To mitigate class imbalance, XGBoost's built-in weighting mechanism was set to the ratio of the majority class (class 0) to the minority classes (classes 1 and 2). This re-weights the loss function so that errors on the minority classes have a proportionally higher impact during training. Early stopping is performed to determine the optimal number of boosting rounds, and the final model is trained using this optimal number. If feature importance is enabled, the algorithm retrieves and plots the feature importance scores from the trained model. The trained model is then used to make predictions on the testing set. The algorithm evaluates the models' performance using various metrics and computes confusion matrices to visualize the classification results. It should be noted that both classification methods use the same initial set of features here: the number of closest observations is set to 15, without restricting the input to the important features and without adding interaction terms. The classification results of the trained classifiers are reported on the same test set.
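The weighting can be sketched for the first (binary) classifier; the toy label counts are invented, and the call to `xgboost.train` with early stopping is left as a comment:

```python
import numpy as np

# toy imbalanced labels: 0 = non-payable, 1 = partially, 2 = fully payable
y_train = np.array([0] * 9_000 + [1] * 90 + [2] * 10)

# first two-level classifier: payable (class 1 or 2) vs non-payable (class 0)
n_majority = int(np.sum(y_train == 0))
n_minority = int(np.sum(y_train > 0))
scale_pos_weight = n_majority / n_minority  # minority errors weigh ~90x more

params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight,
    "eval_metric": "logloss",
}
# these params would then be passed to XGBoost with early stopping, e.g.:
# xgboost.train(params, dtrain, num_boost_round=1000,
#               evals=[(dval, "val")], early_stopping_rounds=50)
```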
Table 4 shows the performance of the two two-level classifiers and the three-level classifier on the same dataset, respectively. Both methods classify 'non-payable' events better than 'partially payable' and 'fully payable' events. In terms of F1-score, training two two-level classifiers and summarizing their results is slightly better for 'partially payable' events, while training a three-level classifier directly is slightly better for 'fully payable' events. However, in terms of recall on 'partially payable' and 'fully payable' events, the two two-level classifiers detect 'partially payable' events relatively well.
Figure 4 shows the performance on each class through their respective confusion matrices. It is interesting to note that the direct three-class classifier yields fewer misclassifications on ’non-payable’ events, yet it tends to make the more significant mistake of misclassifying ’non-payable’ events as ’fully payable’.
For the predictions given by the two classification methods,
Table 5 reports the agreement between the classification results of the two methods: they agree on 99.96% of the events and differ on very few. As described in
Section 3, the final classification of these disagreement events is deferred to a vote that also involves a third classifier, defined from the values predicted by the regression model together with the lower and upper bounds.
4.4. Second Modeling Track: Regression Algorithms
The goal of this track is to train models that predict the losses of loss-causing events as accurately as possible, so that the predicted payouts do not deviate too much from the amounts that should be paid. Algorithm 3 starts by setting several initial parameters, including the lower and upper bounds for losses, the number of closest observations to consider, and various flags for adding interaction features, removing rows with zero or extreme losses, and using feature importance. Next, the dataset is loaded from a CSV file, and the algorithm checks for any missing values. If the flag for adding interactions is enabled, interaction terms are added to the dataset for each distance-parameter combination. The algorithm then filters the data for regression analysis. It computes descriptive statistics for the total losses and, if specified, removes rows with zero or extreme losses. The regression track adheres to the same strict data partition established in
Section 4.3, and the held-out test set remains frozen until the final results of the hybrid models are reported on it. The remaining events are further divided into training and validation sets. Horizontal box plots are created to visualize the distribution of the training and validation sets.
| Algorithm 3 Regression Analysis and Models Ensemble. |
- 1: Notebook settings: lower and upper loss bounds, number of closest observations, flags for interaction features, removal of zero or extreme losses, and feature importance
- 2: Load the dataset and check for missing values
- 3: Filter data for regression
- 4: Split the data into training and test sets
- 5: Split the training set into training and validation sets; compute summary statistics for both
- 6: Plot horizontal boxplots
- 7: Feature transformation and filtering for dealing with imbalance, weight assignment, and/or feature selection
- 8: Regression models: initialize and train the XGBoost, LightGBM, and CatBoost regressors on the training set; make predictions on the validation set; compute evaluation metrics
- 9: Analysis of results: add actual and predicted losses to the results; scatter plot of the filtered data with reference lines and bounds
To prevent a large number of zeros from interfering with model training, events with zero losses were removed from the training set for the regression model. Two types of feature selection are implemented separately: using the 15 nearest observations, or using the 6 nearest observations together with the descriptive statistics of each physical parameter over all observations. It is worth noting that the interaction terms are also enabled here, extending the input features of the regression model.
Figure 5 shows the box plots of the original training and validation sets filtered for the regression model, with the stated lower and upper bounds as reference lines. Events that are payable (i.e., that reach the lower loss bound for payout) essentially appear as outliers.
To address the challenge of highly imbalanced data distribution, some pre-processing approaches mentioned in
Section 2 were implemented to systematically reduce the number of common cases or augment the rare samples (where the original output target is larger than the specific lower bound for payable losses) for the training set. In particular, random undersampling [
42,
43], random oversampling [
44,
45], and SMOGN, which stands for the combined method based on undersampling, oversampling and introduction of noise [
39], were implemented. Unlike non-heuristic random undersampling and oversampling for classification tasks, random undersampling and oversampling for regression require user-defined thresholds to identify the most common or less important observations for the values of the target variables [
38]. Here all samples that have losses below the lower bound of compensability (class 0) are considered the most common.
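Random oversampling for regression can be sketched as below, using the lower bound of compensability as the user-defined relevance threshold (the routine and toy data are illustrative, not the exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

LOWER = 15_000.0  # relevance threshold: losses below it are the common cases

def random_oversample(X, y, target_ratio=1.0):
    """Duplicate rare samples (y >= LOWER) at random until they amount to
    target_ratio times the number of common samples (y < LOWER)."""
    rare = np.where(y >= LOWER)[0]
    common = np.where(y < LOWER)[0]
    n_extra = int(target_ratio * len(common)) - len(rare)
    if n_extra <= 0:
        return X, y
    picks = rng.choice(rare, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), picks])
    return X[keep], y[keep]

# toy data: 90 common (sub-threshold) and 10 rare (payable) events
X = rng.normal(size=(100, 3))
y = np.concatenate([rng.uniform(0, 10_000, 90),
                    rng.uniform(20_000, 200_000, 10)])
X_res, y_res = random_oversample(X, y)
```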
Table 6 presents the distribution of the most common and the rare significant cases for the original training set and after processing by these three methods. For SMOGN, variation in the feature space affects the similarity measure between samples; thus, the post-processed distributions differ slightly under different feature choices.
Various gradient boosting decision tree models (GBDTs), namely XGBoost (2.1.3), LightGBM (4.6.0), and CatBoost (1.2.8), are initialized and trained on the enhanced training data. A weight adjustment policy was also employed during model training. This aligns with the general principle of weighted machine learning discussed in [41,46], where rare but important samples are emphasized by modifying the model's cost function.
After training, the three GBDTs can be used individually or in an ensemble that takes the median of their predictions. The performance of each prediction strategy was reported on the validation set. Following the concept of basis risk, the total absolute error (TAE) is defined as the sum of the absolute differences between the predicted payout (the predicted loss adjusted to a payout via the predefined lower and upper bounds) and the actual payout (the actual loss adjusted via the same bounds).
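The median ensemble and the TAE computation can be sketched as follows (toy numbers; `to_payout` applies the stated lower and upper bounds):

```python
import numpy as np

LOWER, UPPER = 15_000.0, 115_000.0

def to_payout(loss):
    """Adjust a loss (actual or predicted) to a payout via the bounds."""
    loss = np.asarray(loss, dtype=float)
    return np.where(loss < LOWER, 0.0, np.minimum(loss, UPPER))

def total_absolute_error(y_true, preds_by_model):
    """TAE between actual payouts and the payout implied by the
    median of the individual GBDT loss predictions."""
    ensemble_loss = np.median(np.vstack(preds_by_model), axis=0)
    return float(np.abs(to_payout(ensemble_loss) - to_payout(y_true)).sum())

y_true = np.array([10_000.0, 50_000.0, 200_000.0])
preds = [np.array([12_000.0, 55_000.0, 150_000.0]),   # e.g. XGBoost
         np.array([9_000.0, 60_000.0, 300_000.0]),    # e.g. LightGBM
         np.array([14_000.0, 52_000.0, 140_000.0])]   # e.g. CatBoost
tae = total_absolute_error(y_true, preds)
```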
Table 7 summarizes the best performance on the validation set, and the corresponding model or prediction strategy, for the models trained with the three pre-processing methods under the two types of feature choices, where 'best performance' means that the minimum TAE is achieved for the prediction of actual class 1 events.
The total error on the validation set is smallest when the pre-processing is simple random oversampling and the model is trained with CatBoost, with features consisting of the parameter information of the 6 nearest observation points, the descriptive statistics of all observations for each parameter, and the interaction terms. The scatter plots in
Figure 6 visualize the actual payment versus the predicted payout from the different regression models for the events in the partial-payment category (i.e., actual class 1) under this pre-processing and feature selection. The purple points show the predictions of the best CatBoost model.
4.5. Models Ensemble and Final Prediction
The final prediction results are reported on the entire original test set used for the classification model in
Section 4.3, containing the zero-loss events, for a total of 28,698 events. Since
Section 4.4 shows that the CatBoost model trained with the specific feature selection and simple random oversampling pre-processing performed best on the key payout events (partially payable), this model was used to predict the entire test set and to classify all events based on the stated lower and upper payout limits, in order to assess the generalizability of the overall prediction model.
Figure 7a illustrates the performance of the regression model used as a classifier. It is interesting to note that for 'partially payable' events, the recall of the regression-based classifier is 1, significantly higher than that of the two classifiers dedicated to event classification. The classifications predicted by the regression model are aggregated with those predicted by the two previous classifiers and voted on, and the class with the highest number of votes is taken as the final class. As specified in
Section 3, in the extreme case where the three classifiers give completely different classifications, the final class is the intermediate class, i.e., class 1, 'partially payable'. Even though this did not happen in this test set, the possibility cannot be ruled out.
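The voting rule can be sketched as (function name is ours):

```python
from collections import Counter

def vote(c1, c2, c3):
    """Majority vote over the three classifiers; if all three disagree,
    fall back to the intermediate class 1 ('partially payable')."""
    label, n = Counter([c1, c2, c3]).most_common(1)[0]
    return label if n > 1 else 1
```

For instance, two votes for class 2 outweigh one vote for class 1, while a three-way disagreement resolves to class 1.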
Figure 7b shows the performance of the voted classifiers. Compared to the two specialized classifiers in
Section 4.3, the recognition of ‘partially payable’ is significantly improved.
In the end, following the model-ensemble mechanism set in
Section 3, the final payment amount is predicted on the basis of the voted final classification and the numerical predictions of the regression model. If the agreed class is 0, i.e., 'non-payable', the final payout is 0; if the agreed class is 2, i.e., 'fully payable', the final payout is the upper bound of the payable loss. The remaining events fall into class 1, i.e., 'partially payable': if the regression-based classifier also assigns class 1, the predicted payout is the loss predicted by the best regression model referred to in
Section 4.4; otherwise (including the case where the three classifiers disagree completely), the final payout is the lower bound of the payable loss if the regression-based classification is 0, and the upper bound if it is 2.
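Putting these rules together, the final payout decision can be sketched as (function name is ours; bounds as stated):

```python
def final_payout(voted_class, reg_class, reg_loss,
                 lower=15_000.0, upper=115_000.0):
    """Map the voted class and the regression output to a payout."""
    if voted_class == 0:          # non-payable
        return 0.0
    if voted_class == 2:          # fully payable
        return upper
    # voted class 1: trust the regression value when it agrees,
    # otherwise fall back to the nearest bound
    if reg_class == 1:
        return reg_loss
    return lower if reg_class == 0 else upper
```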
The final predicted payment can be compared to the actual payout, obtained by adjusting the actual loss according to the lower and upper bounds.
Figure 8 visually illustrates the final prediction performance on the whole test set. Dot colors close to light yellow indicate heavy overlap; specific counts are marked above the corresponding points. Out of the 28,698 test events, except for a few large deviations, the overall predictions were in good agreement with the actual values.
Table 8 also records some performance metrics calculated from the final predictions.
For validation purposes, the method was tested against a linear regression baseline model (see
Section 6) and against a CatBoost regression that was improved through both the oversampling technique and the parameter settings. Obtaining results close to those of the hybrid framework required careful tuning of the best single regression method. Still, the method proposed in this paper outperforms them on both the reported test-set metrics and the five-fold cross-validation. The hybrid framework is also much more robust in terms of classification performance. Both CatBoost models were kept because they show different strengths in the classification task: CatBoost 2 learned to classify all three classes correctly, while CatBoost 1, despite better overall error metrics, could only identify Class 0 and Class 1.
As shown in
Figure 9, the two CatBoost configurations also differ in important ways. CatBoost 1 uses a Tweedie loss function with a variance power of 1.5, which is designed for insurance data that often contains many zero values and some very large values; this loss function helps the model handle the skewed distribution of insurance claims. CatBoost 1 runs for 1000 iterations with a learning rate of 0.05 and a tree depth of 5. In contrast, CatBoost 2 uses the RMSE loss function, a simpler approach that minimizes the squared differences between predictions and actual values, and runs for only 100 iterations, ten times fewer than CatBoost 1. Both versions use the same learning rate of 0.05 and depth of 5, so these parameters stay constant across the two approaches. The shorter training makes CatBoost 2 faster to fit, but CatBoost 1, with more iterations, may capture more complex patterns in the data because it has more time to learn.
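The two configurations can be summarized as plain parameter dictionaries, as they might be passed to `catboost.CatBoostRegressor(**params)` (values taken from the text above):

```python
catboost_1 = {
    "loss_function": "Tweedie:variance_power=1.5",  # zero-inflated, heavy-tailed losses
    "iterations": 1000,
    "learning_rate": 0.05,
    "depth": 5,
}
catboost_2 = {
    "loss_function": "RMSE",  # plain squared-error objective
    "iterations": 100,        # ten times fewer boosting rounds
    "learning_rate": 0.05,
    "depth": 5,
}

# parameters held constant across the two configurations
shared = {k for k in catboost_1 if catboost_1[k] == catboost_2.get(k)}
```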
Both CatBoost-based methods keep the actual loss values unchanged during oversampling. They only copy complete rows of data, including all features and the target loss amount. The difference between them is that the CatBoost 2 method has a stricter limit on how many copies it makes, while the CatBoost 1 method allows more copies but still maintains a reasonable upper bound. The CatBoost 2 method may stop at 500 total Class 1 examples, while the CatBoost 1 method could go higher if the data allows it without exceeding the 20x duplication limit per example. The changes in the CatBoost method also affected how the training sample was selected. Both CatBoost methods were trained on a careful selection that includes the same quantity of true zero losses and non-zero small losses together from Class 0. Then the method tries to balance the combined sum of Class 1 and Class 2 samples so they match as closely as possible to Class 0 training samples.
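A deterministic sketch of the capped duplication for CatBoost 2 (the tiling scheme is our illustration; the text does not specify how the copies are drawn):

```python
import numpy as np

def capped_oversample(idx_class1, target=500, max_dup=20):
    """Replicate Class 1 row indices up to `target` total examples,
    never exceeding `max_dup` copies of any single row."""
    n = len(idx_class1)
    total = min(target, n * max_dup)  # respect the per-row duplication limit
    # tile the indices so copies are spread evenly, then trim to the target
    reps = int(np.ceil(total / n))
    return np.tile(np.asarray(idx_class1), reps)[:total]

rows = np.arange(40)            # 40 original Class 1 rows
resampled = capped_oversample(rows)
```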
Figure 9,
Table 8 and
Table 9 show that the hybrid method outperforms the standalone regression methods in multiple aspects. The test set results demonstrate that the hybrid framework achieves lower error compared to both CatBoost models across all measured metrics. The k-fold cross-validation results (with 95% confidence intervals) confirm this pattern across all five folds, where the hybrid method maintains consistently better performance with smaller variations between folds. This indicates that the hybrid framework is not only more accurate but also more stable across different data splits. The classification performance also shows the robustness of the hybrid method, matching the best CatBoost model while keeping lower error. The confusion matrices reveal important differences in how each method classifies the three payment classes, with the hybrid framework showing balanced performance across all classes, achieving the smallest error of false payouts and missed payouts. These results from both the test set and cross-validation suggest that the hybrid approach provides more reliable predictions for insurance payout estimation than using a single regression model alone.
Table 10 summarizes the recommended model choices for different use cases. While the hybrid framework provides the best overall accuracy and robustness for deployment and policy design, CatBoost models can be used for rapid prototyping or benchmarking due to their simpler setup and faster training. Linear regression offers a fully transparent and interpretable alternative for explainable analysis, though with reduced predictive performance.