To test the practical performance of the lightning risk index insurance product constructed in Section 3, this section conducts a series of experiments based on satellite monitoring data and historical disaster records for the study area. The experiments comprise daily calculation of the insurance index and trigger determination, measurement of payout frequency and amount, comparison of different percentile parameters and model settings, and a reasonableness test of the loss ratio. Together, they aim to assess the safety and financial sustainability of the product.
4.1. Datasets and Experimental Setup
The study uses Level-2 (L2) lightning group data, compiled at minute-level intervals, from the Lightning Mapping Imager (LMI) on board the Chinese meteorological satellite FY-4A. The dataset spans 1 March 2023 to 31 May 2023, covers parts of China and Australia, and contains 2,061,376 samples. The variables are listed in Table 2.
The LMI detects lightning over China and the surrounding regions, which enables monitoring and tracking of severe convective weather and early warning of lightning disasters. It uses a CCD array operating at a wavelength of 777.4 nm to capture the optical emission of lightning. The detected optical signals are filtered and clustered into “Event”, “Group” and “Flash” products. An “Event” is an optical signal in a single pixel that exceeds the background threshold during one frame, while a “Group” is taken to represent a single lightning discharge.
Events are only instantaneous light signals detected by individual pixels; they are susceptible to background light and instrument noise and have a low, unstable signal-to-noise ratio. Flashes merge multiple independent discharge processes and lose important process details. Group data, by contrast, offer a stable signal with a high signal-to-noise ratio and allow rich features to be extracted, such as discharge duration, total energy and spatial coverage. These features reflect the strength and scale of the thunderstorm system and have the most direct physical significance for assessing lightning disaster risk.
The experiments were run in Python 3.13.2 on a computer with an AMD Ryzen 9 7945HX processor with Radeon Graphics (2.50 GHz), an NVIDIA GeForce RTX 4060 graphics card, and 16 GB of memory.
4.3. Calculation Results of Risk Indices
The experiment starts with principal component analysis (PCA). The first principal component explains 48.17% of the variance in this dataset and the second explains 33.68%, for a cumulative 81.85%, indicating that the first two components retain most of the information in the original data. The loading matrix is shown in Table 3, where PCA 1 and PCA 2 denote the loadings of the first and second principal components, respectively.
The first principal component has high and nearly equal positive loadings on total radiation energy and the number of lightning events, while its loading on the time distance from the peak-risk period is close to zero. This indicates that the first component mainly reflects the overall energy output scale of lightning activity and represents a combination of radiation energy and the number of events.
The second principal component has a very high positive loading on the time distance from the peak hour, while its loadings on total radiation energy and the number of lightning events are close to zero. This indicates that the second component primarily captures the temporal feature of lightning occurrence, namely the proximity to the peak-risk period.
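As a minimal sketch of how explained-variance ratios and loadings of this kind can be obtained, the following Python snippet performs PCA through an eigendecomposition of the covariance of standardized features. The data are synthetic and the names event_count, radiance and peak_distance are hypothetical stand-ins for the three lightning features; this is not the study's data or code, so the numbers will differ from those in Table 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins (assumed): two correlated "scale" features and one
# independent "timing" feature.
scale = rng.gamma(shape=2.0, scale=1.0, size=n)
event_count = scale + 0.3 * rng.normal(size=n)
radiance = scale + 0.3 * rng.normal(size=n)
peak_distance = rng.uniform(0.0, 12.0, size=n)
X = np.column_stack([event_count, radiance, peak_distance])

# Standardize each column, then eigendecompose the covariance of the
# standardized data (proportional to the correlation matrix).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns ascending eigenvalues; reorder to descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()  # share of variance per component
loadings = eigvecs[:, :2]                  # columns: PC1 and PC2 loadings
scores = Z @ loadings                      # component scores per sample

print(np.round(explained_ratio, 4))
print(np.round(loadings, 3))
```

In this synthetic setting the first component loads on the two correlated scale features and the second on the timing feature, which mirrors the qualitative pattern described above.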
After computing the training labels, we train the random forest model. The data are divided into 82,449 samples for training and 412,246 samples for testing. Key hyperparameters are optimized using grid search combined with five-fold cross-validation. The search space is set as follows:
Number of decision trees (n_estimators): 100, 200. This parameter controls the number of base learners in the ensemble. If the value is too small, the model may underfit; if it is too large, the computational cost increases with diminishing returns.
Maximum depth (max_depth): 10, 20, None, where None indicates that tree nodes will continue to split until all leaf nodes are pure or reach minimum sample limits, allowing models to learn complex non-linear relationships.
The evaluation metric for model performance is the coefficient of determination ($R^2$), which is chosen to ensure the robustness of parameter selection. The parameter combination that yields the highest average $R^2$ on the validation set is selected as the optimal model configuration. The best parameters determined by the grid search are n_estimators = 100 and max_depth = 20. Under this configuration, the $R^2$ of the model on the test set is 0.99926.
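The search procedure described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the study's code: the features and the label are synthetic (the label is a hypothetical linear combination standing in for the PCA-based training label), and the 80/20 split and random seeds are arbitrary choices. Only the search grid, the five-fold cross-validation and the $R^2$ metric are taken from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))        # synthetic features (assumed)
y = 0.6 * X[:, 0] + 0.4 * X[:, 1]    # hypothetical label, linear in the features

# Hypothetical 80/20 split; the study's own split is not reproduced here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Search space from the text.
param_grid = {"n_estimators": [100, 200], "max_depth": [10, 20, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="r2",  # coefficient of determination
    cv=5,          # five-fold cross-validation
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_r2 = search.best_estimator_.score(X_test, y_test)
importances = search.best_estimator_.feature_importances_

print(best_params, round(test_r2, 4))
print(np.round(importances, 4))  # impurity-based importances, summing to 1
```

GridSearchCV refits the best configuration on the full training split by default, so best_estimator_ can be scored directly on the held-out data.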
For the random forest model, a high coefficient of determination is expected and reasonable. The main reason is that the learning target of the model is the training label constructed through PCA, which is a linear combination of the original features. Random forests, with their strong nonlinear fitting capability, can approximate this deterministic mapping with extremely high accuracy. It is important to emphasize that the core output of the random forest model in this study is not the predicted value but the feature importance derived from the fitted model. This importance measures the contribution of each lightning feature to the training label and provides objective weights for the subsequent fuzzy comprehensive evaluation. Therefore, the high fitting accuracy of the model does not affect the rationality and interpretability of the final weights.
The feature weights provided by the model for group event count, group radiance and peak distance time are 0.6019, 0.2276 and 0.1705, respectively. The number of lightning events is therefore the largest contributor to risk, followed by total radiation energy and the time distance from the peak period. This shows that the frequency of lightning activity has the strongest impact on risk, while total energy and timing also matter.
The distribution of SHAP values, which shows the impact of each feature on the model output, is presented in Figure 4. The SHAP values of the number of lightning events and of total radiation energy are distributed on both sides of zero: negative values cluster near zero, while positive values are more dispersed and span a wide range. This suggests that a small number of high-energy, large-scale lightning events can raise the risk substantially, whereas most ordinary events contribute little, which is consistent with the rarity of extreme events. The two features show similar patterns on the positive and negative sides, suggesting that high-risk situations tend to combine high energy with a large number of events. The SHAP values of the time distance from the peak-risk period match the physical expectation for a temporal feature: the closer to the peak, the higher the risk, and the farther away, the lower the risk, with positive and negative effects of similar magnitude. Overall, the risk mechanisms identified by the model are consistent with physical understanding, and the feature importance results are reasonable.
SHAP values also have direct practical value in insurance operations. In practice, basis-risk disputes can arise when the index is triggered but the policyholder believes that no actual loss has occurred, or vice versa. SHAP value analysis provides a feature-by-feature decomposition of the contributions to each risk determination. This attribution can help insurers explain to policyholders the basis for each trigger decision, thereby enhancing the transparency and credibility of index-based insurance and reducing claim disputes.
After obtaining the feature weight vector, this paper uses fuzzy mathematics to calculate the lightning risk index. The distribution of samples across risk levels is shown in Table 4, which again adopts the lightning risk grading criteria from reference [44].
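To illustrate how a weight vector enters a fuzzy comprehensive evaluation, the sketch below applies the common weighted-average operator $B = W \cdot R$ to one sample. The weights are the feature importances reported in the text; the membership degrees and the three-level layout are hypothetical values chosen for illustration, and the membership functions and operator actually used in this study are those defined in Section 3, which are not reproduced here.

```python
# Feature weights reported in the text (random forest importances).
weights = {"event_count": 0.6019, "radiance": 0.2276, "peak_distance": 0.1705}

# Hypothetical membership degrees of ONE sample in (low, medium, high) risk.
# Illustrative values only, not taken from the paper.
membership = {
    "event_count": (0.70, 0.25, 0.05),
    "radiance": (0.50, 0.40, 0.10),
    "peak_distance": (0.20, 0.30, 0.50),
}


def fuzzy_evaluate(weights, membership):
    """Weighted-average operator B = W . R over the membership matrix R."""
    n_levels = len(next(iter(membership.values())))
    return tuple(
        sum(weights[f] * membership[f][k] for f in weights)
        for k in range(n_levels)
    )


evaluation = fuzzy_evaluate(weights, membership)
levels = ("low", "medium", "high")
# Maximum-membership principle: pick the level with the largest degree.
dominant = levels[max(range(len(levels)), key=lambda k: evaluation[k])]

print([round(v, 4) for v in evaluation], dominant)
```

Because the weights sum to 1 and each membership row sums to 1, the resulting evaluation vector also sums to 1, so its entries can be read as degrees of membership in each risk level.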
Table 4 shows that the distribution of sample risk levels is strongly skewed to the right. Low-risk samples dominate, while medium-risk samples account for about 4% and high-risk samples for only 0.03%. This structure is highly consistent with the natural pattern of lightning activity, in which long quiet periods are punctuated by short bursts, and it indicates that the model distinguishes well between normal conditions and extreme events.
The number and percentage of samples in each score range are shown in Figure 5.
Figure 5 shows a highly concentrated, single-peaked distribution of risk scores. The peak lies in the range [0.1, 0.2), and 93% of samples fall in the lower range [0.1, 0.3). As the score rises, the share of samples falls sharply: samples with scores above 0.5 account for 1.5% in total, and only 70 samples score above 0.9. This indicates that the model has a strong screening capability for extremely high-risk samples.
4.4. Insurance Premium and Claim Calculation Results
According to the methodology described in Section 3.2.1, the MEF and Hill plots are drawn in Figure 6 and Figure 7. In Figure 6, the mean excess begins to rise very rapidly at a threshold of 0.8120, so the MEF plot gives a trigger threshold of 0.8120. In Figure 7, a significant and sustained increase begins at K = 183, which corresponds to a threshold of 0.8324. Combining the two methods, we set the trigger threshold to their average, 0.8222. Considering that the study area covers two very large countries, China and Australia, the experiment assumes a compensation limit of 800,000 yuan per event. Following the methodology described in Section 3.3, the experiment calculates the theoretical compensation and insurance premiums for all data and consolidates them into the claim records shown in Table 5. Because of space limitations, only records with claims greater than zero are shown.
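The two diagnostics and the averaging step can be sketched as follows. This is a minimal illustration using the standard textbook definitions of the empirical mean excess function and the Hill estimator; the exact procedure of Section 3.2.1 is not reproduced, the sample is a synthetic Pareto draw rather than the study's risk scores, and the function names are hypothetical. Only the two thresholds 0.8120 and 0.8324 are taken from the text.

```python
import math
import random


def mean_excess(data, u):
    """Empirical mean excess e(u): average of (x - u) over observations x > u."""
    excesses = [x - u for x in data if x > u]
    return sum(excesses) / len(excesses) if excesses else float("nan")


def hill_estimator(data, k):
    """Hill estimate of the extreme value index from the k largest observations.

    Requires strictly positive data; compares the top k order statistics
    with the (k + 1)-th largest value.
    """
    xs = sorted(data, reverse=True)
    if not 0 < k < len(xs):
        raise ValueError("k must satisfy 0 < k < len(data)")
    return sum(math.log(xs[i] / xs[k]) for i in range(k)) / k


def combined_threshold(u_mef, u_hill):
    """Average the two candidate thresholds, as done in the text."""
    return (u_mef + u_hill) / 2.0


# Synthetic heavy-tailed sample (assumed): Pareto with tail index 2,
# for which the extreme value index is 1 / 2.
random.seed(0)
sample = [random.paretovariate(2.0) for _ in range(5000)]
print(mean_excess(sample, 2.0), hill_estimator(sample, 200))

# Thresholds read from the MEF plot and the Hill plot in the text.
print(round(combined_threshold(0.8120, 0.8324), 4))
```

In practice both functions would be evaluated over a range of thresholds or values of k and plotted; the threshold is then read off where the curve changes behaviour, which remains a judgment call.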
Table 5 shows that 64 events were triggered in the experimental data during the three-month period, of which 55 received partial payouts and nine were capped at 800,000 yuan. Given that this study is positioned as a regional insurance product, the nine capped events correspond to nine extremely intense convective weather events over the wide area covering China and Australia, a density of only about 0.42 per million square kilometers per month. This frequency is consistent with the climatic characteristics of thunderstorm activity in both countries, which supports the view that the model can effectively identify regional extreme disaster events.
Because the premium is highest in May, the final premium is set to 13,960,984.95 yuan.
4.5. Verification of Loss Ratio
This subsection uses monitoring data from the same satellite for 1 June 2023 to 30 June 2023 to simulate claim records and validates model stability by calculating the loss ratio. The loss ratio is calculated as

$$\text{Loss ratio} = \frac{\text{total claims paid in the test period}}{\text{premium}}$$

The June claim records are shown in Table 6.
Table 6 shows that all 30 days in June triggered compensation, with no zero-compensation day. This may be because the study area was in the peak summer period, when thunderstorm activity is strongest and the frequency and intensity of severe convective weather increase significantly. Single-day payoffs ranged from 486,899.4 yuan to 800,000 yuan, with the maximum reached on 5, 12 and 14 June. Overall, the payments were fairly evenly distributed and not extremely volatile.
According to the mandatory stress scenarios for stress testing in the Insurance Company Solvency Supervision Rules (II) issued by the China Banking and Insurance Regulatory Commission, an increase in the comprehensive loss ratio to 120% of the base scenario within the next quarter constitutes a stress scenario that requires attention. Because the test window in this experiment covers only a single month and the coverage area spans a wide range of China and Australia, the volatility of the loss ratio is magnified by the short observation window; a loss ratio between 0.5 and 1.5 is therefore considered reasonable. The total claims in June amount to 18,749,943.00 yuan, and their ratio to the premium derived from the previous three months of data is 1.343, indicating that the model shows no systematic pricing deviation on out-of-sample data. The loss ratio therefore passes the test, which supports the robustness of the model.
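The loss ratio check can be reproduced with a few lines of Python. The claim total, the premium and the acceptance band of 0.5 to 1.5 are the values stated in the text; the function names are hypothetical helpers introduced for this sketch.

```python
def loss_ratio(total_claims, premium):
    """Ratio of total claims paid to the premium collected."""
    if premium <= 0:
        raise ValueError("premium must be positive")
    return total_claims / premium


def within_band(ratio, lower=0.5, upper=1.5):
    """Check the ratio against the acceptance band used in the text."""
    return lower <= ratio <= upper


june_claims = 18_749_943.00  # total June claims from the text (yuan)
premium = 13_960_984.95      # final premium from Section 4.4 (yuan)

ratio = loss_ratio(june_claims, premium)
print(round(ratio, 3), within_band(ratio))  # 1.343 True
```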
4.6. Sensitivity Analysis
Lightning strikes are rare events, but when they occur they often cause severe consequences. In the lightning data, this is reflected in extremely high kurtosis and skewness. Therefore, the percentile thresholds in the membership functions cannot simply be set to the conventional values commonly used in fuzzy mathematics. To justify the parameter selection, we change the thresholds for the number of lightning strikes and for radiance intensity to the conventional percentiles, as shown in Table 7 and Table 8, and compare the resulting loss ratios across parameter settings. The peak time difference exhibits relatively normal skewness and kurtosis, so its thresholds remain unchanged. The experimental results are presented in Table 9.
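The effect of heavy skewness on percentile cut-offs can be illustrated with a small, self-contained sketch. The sample below is a synthetic lognormal draw, not the lightning data, and the 50th and 99th percentiles are hypothetical cut-offs chosen only to contrast a mid-range threshold with a tail-oriented one; they are not the values in Table 7 or Table 8.

```python
import math
import random
import statistics


def skewness(xs):
    """Sample skewness (third standardized moment)."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def excess_kurtosis(xs):
    """Sample excess kurtosis (fourth standardized moment minus 3)."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3.0


def percentile(xs, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def share_above(xs, cut):
    """Fraction of observations strictly above the cut-off."""
    return sum(1 for x in xs if x > cut) / len(xs)


# Synthetic right-skewed sample (assumed), standing in for a lightning feature.
random.seed(0)
sample = [random.lognormvariate(0.0, 1.5) for _ in range(20000)]

cut_mid = percentile(sample, 50)   # hypothetical mid-range cut-off
cut_tail = percentile(sample, 99)  # hypothetical tail-oriented cut-off

print(round(skewness(sample), 2), round(excess_kurtosis(sample), 2))
print(share_above(sample, cut_mid), share_above(sample, cut_tail))
```

With a mid-range cut-off, about half of the observations are placed above the threshold even though most of them are small in absolute terms, whereas a tail-oriented cut-off isolates only the extreme observations.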
The results show that the loss ratios in both cases deviate further from 1 than that of the original model. This is mainly because the conventional percentile method misclassifies a large amount of low-risk lightning data as medium-risk or even high-risk, introducing excessive noise that prevents the model from accurately identifying the lightning hazards that are truly likely to occur. The experiment therefore indicates that percentile parameters set according to the actual data distribution, rather than by empirical rules, are better suited to operational use than conventional percentile parameters.
4.7. Model Comparison
To quantify the contribution of each component in the proposed framework, this subsection compares the loss ratios of the original model, its reduced variants, and a variant in which the random forest is replaced by XGBoost. Since only the fuzzy logic-based method follows a conventional normalization practice for deriving the lightning risk index, the outputs of the other models were kept in their original unscaled form to ensure a fair comparison. For each model, the trigger threshold was selected using the same procedure, and the corresponding premiums and payoffs were calculated accordingly. The comparison results are presented in Table 10.
It is observed that when using only the PCA score for the payout simulation, no claim was triggered throughout the entire month of June. However, there were publicly reported lightning disaster events in China alone during that period, indicating that the model fails to capture actual lightning occurrences and would therefore be unacceptable to policyholders in practice. For the PCA + RF model, the MEF plot did not show a clear upward trend in the mean excess values, while the Hill plot produced stable and interpretable results. Therefore, the trigger threshold for this model was selected solely based on the Hill plot. This also suggests that the raw RF output alone may not be fully suitable for the operational scenario considered in this study. Among all model variants, the original framework yields a loss ratio closest to 1, indicating that it achieves the best overall performance and that each of its components makes a positive contribution to the model’s effectiveness.