The results of the analysis of human safety in fire accidents are presented here in three complementary stages. First, descriptive statistics provide an overview of the demographic structure of fire casualties, highlighting differences in age, gender, and injury severity between injured and deceased individuals. Second, stochastic modeling is applied to fit candidate probability distributions to the observed age data, allowing for a rigorous assessment of symmetry, skewness, and kurtosis effects. Finally, data mining techniques, including association rule mining and decision tree induction, are employed to uncover hidden patterns and conditional pathways that characterize the risk profiles of casualties. Together, the results thus obtained offer a comprehensive empirical foundation for subsequent interpretation and discussion.
3.1. Descriptive Statistics
This section presents the demographic and descriptive characteristics of fire casualties, focusing on age, gender, and injury severity. Descriptive statistics of the empirical distributions of these variables are shown in Table 3. Men clearly predominate among the injured, and further examination attributes this to occupational causes, to which men are more often exposed. The share of women among fatalities, however, is markedly higher than their share among the injured. This suggests that women, although less frequently injured, face a relatively higher risk of a fatal outcome, probably because of their age structure.
Regarding the age of the injured, the dominant group is the 20-to-64-year-old group, which accounts for 72.4% of all injured persons. Among the deceased, the dominant group is those over 65 years of age, who account for almost 59% of all deaths. Thus, injuries are concentrated in the working-age population, whereas fatalities disproportionately affect the elderly, underscoring age as a critical risk factor. Finally, most injuries have minor consequences, but the substantial proportion of serious injuries indicates the need for prevention and protection in work environments.
In addition, the ages of the injured and of the deceased are treated as two separate numerical variables. For this purpose, the statistical indicators usually used to summarize continuous attributes, namely measures of central tendency, dispersion, and distribution shape (skewness and kurtosis), are calculated. These values are shown in Table 4 and support the same conclusions as Table 3. In particular, injuries are clustered in the working-age population, with a symmetric distribution (mean ≈ median = 47, skewness ≈ 0) and moderate variability.
In contrast, fatalities are concentrated in the older population, with a markedly negatively skewed distribution (mean ≈ 65, median = 71, skewness < 0) and pronounced leptokurtosis (kurtosis greater than 3). This indicates that victims are concentrated at older ages, while extreme cases remain non-negligible, i.e., fatalities can occur in both the youngest and the oldest populations. Thus, a fundamental demographic difference between fire injuries and fire deaths is observed: injuries predominantly affect the economically active population, whereas deaths are disproportionately concentrated among the elderly.
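The summary measures discussed above can be computed along the following lines. This is a minimal sketch, not the authors' procedure: it uses Python with SciPy rather than R, the function name `summarize_ages` is hypothetical, and the sample is synthetic (a reflected log-normal draw chosen only because it is left-skewed), not the study data.

```python
import numpy as np
from scipy import stats


def summarize_ages(ages):
    """Return central tendency, dispersion, and shape measures for a sample of ages."""
    ages = np.asarray(ages, dtype=float)
    return {
        "n": int(ages.size),
        "mean": float(ages.mean()),
        "median": float(np.median(ages)),
        "sd": float(ages.std(ddof=1)),          # sample standard deviation
        "skewness": float(stats.skew(ages)),    # < 0 means a longer left tail
        # fisher=False gives Pearson kurtosis, for which the normal value is 3
        "kurtosis": float(stats.kurtosis(ages, fisher=False)),
    }


# Synthetic stand-in for the ages of the deceased (NOT the study data).
rng = np.random.default_rng(0)
synthetic_ages = 100.0 - rng.lognormal(mean=3.4, sigma=0.5, size=5000)
synthetic_ages = synthetic_ages[synthetic_ages > 0]   # keep only positive ages
summary = summarize_ages(synthetic_ages)
print(summary)
```

The `fisher=False` argument matters when comparing against a threshold of 3: SciPy's default reports excess kurtosis, for which the normal reference value is 0 rather than 3.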
Note that in addition to the demographic characteristics presented above, the dataset also includes several contextual attributes describing the temporal and spatial conditions of fire incidents, such as accident hour groups, seasonal variation, and fire object types. Although these factors are not the primary focus of the descriptive analysis, they are incorporated in the subsequent data mining procedures, where their relationships with injury and fatal outcomes are explored through association rule mining and decision tree modeling.
3.2. Stochastic Modeling
To formalize the observed demographic patterns, stochastic modeling is applied to fit candidate probability distributions to the age data of injuries and fatalities. To this end, the GAMLSS methodology proposed in [21] was used, which allows flexible modeling not only of the mean $\mu$ but also of the scale $\sigma$, shape $\nu$, and kurtosis $\tau$ parameters of the distribution. This makes it particularly suitable for analyzing the age distributions of injuries and fatalities, as the data show asymmetry and leptokurtosis that classical approaches cannot fully capture. For a distribution with parameters $\theta = (\theta_1, \theta_2, \theta_3, \theta_4) = (\mu, \sigma, \nu, \tau)$, GAMLSS defines the so-called linear predictor:
$$g_k(\theta_k) = \eta_k = X_k \beta_k + \sum_{j=1}^{J_k} s_{jk}(x_{jk}), \qquad k = 1, \dots, 4,$$
where $g_k$ is the link function for parameter $\theta_k$, $X_k \beta_k$ is the parametric (e.g., regression) component, and $s_{jk}$ are smooth functions (e.g., splines). In our study, the parameters are estimated using a penalized maximum likelihood method [22], while optimization is performed via Newton-Raphson and Fisher scoring algorithms [23]. For the smooth functions, a backfitting algorithm iteratively adjusts the parametric and nonparametric components [24], and the entire estimation procedure is carried out using the “gamlss.dist” package in the statistical software R [25].
The fitting of the age distribution of the injured ($X$-variable) is described in more detail below. As already mentioned, descriptive statistics showed that this distribution is symmetric, with moderate variability (SD ≈ 19.6) and mild leptokurtosis (≈2.4). To formalize these patterns, four candidate distributions were fitted: the Normal Distribution (ND), Generalized Normal Distribution (GND), Student’s $t$-distribution (TD), and Generalized $t$-distribution (GTD). These distributions were chosen for their symmetry, which was established in the empirical age distribution of injured persons. The estimated parameter values of these distributions, along with the corresponding goodness-of-fit statistics mentioned above, are shown in Table 5 below.
Based on Table 5, although the GTD achieved a competitive AIC, it did not pass the goodness-of-fit tests and was therefore not considered adequate despite its flexibility. Among the remaining candidates, the ND also failed the goodness-of-fit tests, and the TD proved to be too flexible, with an excessively large estimated parameter value. In contrast, the GND achieved the lowest AIC/BIC values, minimized the CDF-based mean square error (MSE), and was not rejected by the KS, AD, or CVM tests at the chosen significance level. Additionally, the estimated shape parameter of the GND indicates a moderate deviation from Gaussian tails. Thus, the GND, whose probability density function is normalized through the gamma function $\Gamma$, is the only candidate that provides an adequate fit. Accordingly, the GND best fits the age distribution of the injured and can be used as a symmetric age profile with moderately strong tails, which is consistent with the descriptive statistics (e.g., almost zero skewness).
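The model comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the GAMLSS procedure used in the study: it relies on plain maximum likelihood in SciPy, where `gennorm` has density $\frac{\beta}{2\alpha\Gamma(1/\beta)}\exp\{-(|x-\mu|/\alpha)^{\beta}\}$ with location $\mu$, scale $\alpha$, and shape $\beta$ (a parametrization that may differ from the one in "gamlss.dist"). The function name `compare_candidates`, the synthetic sample, and its parameters are hypothetical; the GTD and the AD test are left out because SciPy has no direct counterpart for them.

```python
import numpy as np
from scipy import stats


def compare_candidates(ages, candidates):
    """Fit each candidate by maximum likelihood and collect fit diagnostics."""
    ages = np.asarray(ages, dtype=float)
    n = ages.size
    rows = {}
    for name, dist in candidates.items():
        params = dist.fit(ages)                       # ML estimates: (shape(s), loc, scale)
        loglik = float(np.sum(dist.logpdf(ages, *params)))
        k = len(params)                               # number of estimated parameters
        ks = stats.kstest(ages, dist.cdf, args=params)
        cvm = stats.cramervonmises(ages, dist.cdf, args=params)
        rows[name] = {
            "params": params,
            "AIC": 2 * k - 2 * loglik,
            "BIC": k * np.log(n) - 2 * loglik,
            "KS_p": float(ks.pvalue),
            "CVM_p": float(cvm.pvalue),
        }
    return rows


candidates = {"ND": stats.norm, "GND": stats.gennorm, "TD": stats.t}

# Synthetic stand-in for the ages of the injured (NOT the study data):
# symmetric around 47 with tails lighter than the normal (hypothetical shape 3).
synthetic_ages = stats.gennorm.rvs(3.0, loc=47.0, scale=32.0, size=4000, random_state=1)
results = compare_candidates(synthetic_ages, candidates)
for name, row in sorted(results.items(), key=lambda item: item[1]["AIC"]):
    print(f"{name}: AIC={row['AIC']:.1f}  BIC={row['BIC']:.1f}  KS p={row['KS_p']:.3f}")
```

Because the parameters are estimated from the same sample that is then tested, the KS and CVM p-values computed this way tend to be optimistic; they are best read as a relative screen across candidates rather than as exact significance levels.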
This is also illustrated in Figure 1, which shows the empirical and competing distributions (left), as well as a quantile-quantile (Q-Q) plot of the fit of the GND to the empirical age distribution (right). Both panels show that the GND best fits the empirical age distribution of injured persons; accordingly, it was selected as the generative model for subsequent inference, prediction, and risk quantification.
In the case of stochastic modeling of the age distribution of the deceased ($Y$-variable), descriptive statistics showed that the distribution was negatively skewed (mean ≈ 64.5, median = 71, skewness ≈ −1.06), with high variability (SD ≈ 21.5) and pronounced leptokurtosis (≈ 3.61). These values indicate a distribution centered in older age groups, with negative skewness and heavier tails. This matches the empirical profile: fatalities are concentrated among elderly individuals, with a sharp peak and an extended left tail, which indicates non-negligible mortality probabilities in the younger population.
To formalize these patterns, as can be seen in Table 6, four candidate distributions were fitted: the Normal distribution (ND), Weibull distribution (WD), Generalized Gamma distribution (GGD), and Reflected Log-Normal distribution with positive support (RefLOGND+). The density of the RefLOGND+ is governed by a location parameter $\mu$ (e.g., mean value), a scale parameter $\sigma$, and a reflection constant $c$, and it involves the standard normal CDF $\Phi$. Note that, in this case, the reflection constant is chosen so that the RefLOGND+ takes only positive values. As Table 6 shows, the ND, WD, and GGD failed the goodness-of-fit tests, so the RefLOGND+ was the only candidate that provided an adequate fit. As with the GND for injuries, the RefLOGND+ achieved the lowest AIC/BIC and MSE values and is the only distribution not rejected by the KS, AD, and CVM tests.
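One way to read the RefLOGND+ construction is as $Y = c - Z$, where $Z$ is log-normal with parameters $\mu$ and $\sigma$ and $Y$ is restricted to $0 < Y < c$, so that the normalizing constant is $\Phi((\ln c - \mu)/\sigma)$. The sketch below implements that reading; the exact functional form is an assumption rather than the authors' formula, the function names are hypothetical, and the reflection constant `c_hyp` is a made-up value. Only `mu_hat` and `sigma_hat` are taken from Table 6.

```python
import numpy as np
from scipy import integrate, stats


def reflognd_cdf(y, mu, sigma, c):
    """P(Y <= y) for Y = c - Z, Z ~ LogNormal(mu, sigma), conditioned on 0 < Y < c."""
    y = np.clip(np.asarray(y, dtype=float), 0.0, c)
    base = stats.lognorm(s=sigma, scale=np.exp(mu))
    mass = stats.norm.cdf((np.log(c) - mu) / sigma)   # P(Z < c), the normalizing constant
    # {Y <= y} is the event {c - y <= Z < c}
    return (base.cdf(c) - base.cdf(c - y)) / mass


def reflognd_pdf(y, mu, sigma, c):
    """Density matching reflognd_cdf; zero outside the open interval (0, c)."""
    y = np.asarray(y, dtype=float)
    inside = (y > 0.0) & (y < c)
    z = np.where(inside, c - y, 1.0)                  # dummy value where the density is zero
    base = stats.lognorm(s=sigma, scale=np.exp(mu))
    mass = stats.norm.cdf((np.log(c) - mu) / sigma)
    return np.where(inside, base.pdf(z) / mass, 0.0)


mu_hat, sigma_hat = 3.500, 0.5534   # RefLOGND+ estimates reported in Table 6
c_hyp = 104.0                       # hypothetical reflection constant, NOT from the source

# Sanity checks: the density integrates to one and the median solves F(m) = 0.5.
total, _ = integrate.quad(
    lambda y: float(reflognd_pdf(y, mu_hat, sigma_hat, c_hyp)), 0.0, c_hyp
)
mass_hyp = stats.norm.cdf((np.log(c_hyp) - mu_hat) / sigma_hat)
median_age = c_hyp - stats.lognorm(s=sigma_hat, scale=np.exp(mu_hat)).ppf(0.5 * mass_hyp)
print(f"integral = {total:.6f}, median age = {median_age:.1f}")
```

Reflecting a right-skewed law in this way yields a left-skewed density with bounded support, which is the shape reported for the ages of the deceased.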
Table 6.
Estimated values of competing distributions and goodness-of-fit statistics (Y-variable).
| Distribution | Param. 1 | Param. 2 | Param. 3 | AIC | BIC | MSE | KS statistic (p-value) | AD statistic (p-value) | CVM statistic (p-value) |
|---|---|---|---|---|---|---|---|---|---|
| ND | 64.52 | 21.47 | | 7737.5 | 7747.0 | | (0.000) | (0.000) | (0.000) |
| WD | 70.88 | 3.313 | | 7876.9 | 7886.4 | | (0.000) | (0.000) | (0.000) |
| GGD | 81.83 | 0.1549 | 19.44 | 7521.5 | 7535.8 | | (0.000) | (0.000) | (0.000) |
| RefLOGND+ | 3.500 | 0.5534 | | 7465.1 | 7474.6 | | 0.0638 (0.0530) | 5121.6 (0.1655) | 1.1093 (0.1545) |
A similar conclusion can be drawn from Figure 2, which shows the fit of the four competing distributions to the empirical age distribution of the deceased, along with the Q-Q plot for the RefLOGND+ distribution. The empirical distribution clearly exhibits negative asymmetry with bounded support and concentration near the upper limit, which naturally motivates reflecting a right-skewed distribution. Thus, the RefLOGND+ is the most appropriate generative model for the age of the deceased, offering both descriptive accuracy and predictive utility. It captures the negative skewness and leptokurtosis observed in the data, outperforming the normal, Weibull, and generalized gamma alternatives. The Q-Q plot also shows that the RefLOGND+ fits the younger age categories less well. However, its accuracy for older age groups, which account for most deaths, makes the RefLOGND+ particularly suitable for risk assessment and simulation of future age-fatality patterns, which we now describe in more detail.
Using the empirically “best” distributions (GND for injuries and RefLOGND+ for fatalities), age-group predictions can be generated. As noted above, the selected models ensure that the predicted probabilities are not merely smoothed frequencies but model-based risk estimates for each age group. Predictive values for each age interval $[a_i, a_{i+1})$ are obtained from the CDF of the selected model. More precisely, for the variables injuries ($X$) and fatalities ($Y$), we calculate:
$$P(a_i \le X < a_{i+1}) = F_X(a_{i+1}) - F_X(a_i), \qquad P(a_i \le Y < a_{i+1}) = F_Y(a_{i+1}) - F_Y(a_i),$$
where $F_X$ and $F_Y$ are the CDFs of the GND and RefLOGND+, respectively. The predicted values are shown in Figure 3. For injuries ($X$-variable), a symmetric concentration in middle adulthood is noticeable, peaking at 35–49 years (≈ 0.26), remaining high at 50–64 years (≈ 0.25), and gradually decreasing thereafter. Mortality, on the other hand, remains low until midlife, followed by a sharp increase in the 50–64 age group (≈ 0.20), a peak at 65–79 years (≈ 0.35), and a still-high value in the 80+ group (≈ 0.28). This pattern is consistent with the selected models, which reflect activity-related injury risk on the one hand and the age-related frailty that drives mortality on the other. These predictions also support differentiated prevention strategies: exposure control for the working-age population aged 20–64 years and interventions targeting vulnerability for the population aged 65+ years.
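The age-group predictions described above reduce to differences of a fitted CDF at consecutive age-group edges. The sketch below illustrates the calculation under several assumptions: the edges below age 35 and the helper names are hypothetical, the GND parameters are made up rather than taken from Table 5, and the fatality model uses an assumed truncated-reflection form of the RefLOGND+ with a hypothetical reflection constant. Only `mu` and `sigma` come from Table 6.

```python
import numpy as np
from scipy import stats

# Age-group edges; the open-ended last group (80+) is closed with +inf.
edges = np.array([0.0, 20.0, 35.0, 50.0, 65.0, 80.0, np.inf])
labels = ["0-19", "20-34", "35-49", "50-64", "65-79", "80+"]


def group_probabilities(cdf, edges):
    """Return F(a_{i+1}) - F(a_i) for each pair of consecutive edges."""
    return np.diff(cdf(edges))


# Injuries: GND with hypothetical (shape, location, scale), NOT the Table 5 estimates.
injury_model = stats.gennorm(3.0, loc=47.0, scale=32.0)
p_injury = group_probabilities(injury_model.cdf, edges)

# Fatalities: assumed form Y = c - Z, Z log-normal, restricted to 0 < Y < c.
mu, sigma = 3.500, 0.5534      # estimates reported in Table 6
c = 104.0                      # hypothetical reflection constant
base = stats.lognorm(s=sigma, scale=np.exp(mu))


def fatality_cdf(age):
    """CDF of the assumed truncated, reflected log-normal."""
    age = np.clip(age, 0.0, c)
    return (base.cdf(c) - base.cdf(c - age)) / base.cdf(c)


p_fatal = group_probabilities(fatality_cdf, edges)

for label, pi, pf in zip(labels, p_injury, p_fatal):
    print(f"{label:>5}: injuries {pi:.3f}   fatalities {pf:.3f}")
```

Note that a GND defined on the whole real line places a small amount of probability below age 0, so the injury shares computed this way sum to slightly less than one; whether and how that mass is redistributed is a modeling choice not specified here.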
To further quantify the structural divergence between injuries and fatalities across age groups, we introduce the relative risk index ($RR$), defined for each age interval $I_i = [a_i, a_{i+1})$ as follows:
$$RR_i = \frac{p_Y(I_i)}{p_X(I_i)}.$$
Here, $p_Y(I_i)$ and $p_X(I_i)$ denote the predictive probabilities obtained from the fitted RefLOGND+ and GND distributions, respectively. This ratio measures the relative overrepresentation of a given age group among fatalities compared to injuries. Values $RR_i > 1$ indicate that the age group is proportionally more frequent among fatalities than among injuries, while $RR_i < 1$ indicates dominance of non-fatal outcomes.
For visualization purposes, the relative risk curve is presented in Figure 3 on a logarithmic scale, enabling symmetric interpretation of the structural transition around the neutral threshold $RR = 1$. For age groups below 65 years, the relative risk remains below unity, confirming that fire incidents in these categories are predominantly injury-driven and related to exposure mechanisms. After age 65, however, the index increases sharply, exceeding 2 in the 65–79 group and reaching values above 5 in the 80+ category. This indicates that, relative to injuries, the oldest population is overrepresented in fatal outcomes by a factor of more than five. The relative risk curve therefore provides quantitative confirmation of the demographic divergence identified by stochastic modeling. It supports the interpretation that fire injuries are primarily an exposure-driven phenomenon concentrated in the working-age population, while fire deaths are characterized by physiological fragility, vulnerability, and reduced adaptive capacity in older age.
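As a small worked check, the 50–64 group is the only one for which both predicted shares are quoted in the text (≈ 0.20 for fatalities and ≈ 0.25 for injuries). Taking the relative risk as the fatality share divided by the injury share, and using the natural logarithm purely for illustration (the base used in Figure 3 is not stated):

$$RR_{50\text{–}64} \approx \frac{0.20}{0.25} = 0.80, \qquad \ln RR_{50\text{–}64} \approx -0.22 .$$

This is consistent with the statement that the index stays below unity before age 65. The logarithmic scale is symmetric in the sense that $RR = 2$ and $RR = 1/2$ map to $\pm\ln 2 \approx \pm 0.69$, equally far from $\ln 1 = 0$, so a doubling and a halving of relative risk appear as deviations of the same size.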