Article

Forensic and Cause-and-Effect Analysis of Fire Safety in the Republic of Serbia: An Approach Based on Data Mining

by Nikola Mitrović 1, Vladica S. Stojanović 2,*, Mihailo Jovanović 2 and Dragan Mladjan 3

1 Department of Forensic Sciences, University of Criminal Investigation and Police Studies, 196 Cara Dušana Street, 11000 Belgrade, Serbia
2 Department of Computer Sciences and Informatics, University of Criminal Investigation and Police Studies, 196 Cara Dušana Street, 11000 Belgrade, Serbia
3 Department of Criminalistics, University of Criminal Investigation and Police Studies, 196 Cara Dušana Street, 11000 Belgrade, Serbia
* Author to whom correspondence should be addressed.
Fire 2025, 8(8), 302; https://doi.org/10.3390/fire8080302
Submission received: 21 June 2025 / Revised: 22 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025
(This article belongs to the Special Issue Fire Safety and Sustainability)

Abstract

The manuscript examines the cause-and-effect relationships of fires in the Republic of Serbia over a fifteen-year period, primarily from the perspective of human safety. For this purpose, numerical variables describing the numbers of injuries and deaths in fires were introduced, and various analysis and modeling techniques, viewed in the context of data mining (DM), were applied to them. First, stochastic modeling of the temporal dynamics of both observed variables was performed; subsequently, cluster analysis of the values of these variables was carried out using two different methods. Finally, by interpreting these variables as outputs (targets) of a classification problem, several decision trees were formed that describe how different fire causes relate to situations in which injuries or human casualties do or do not occur. In this way, several distinct types of fires were identified, including rare but deadly incidents that require urgent preventive measures. Key risk factors, such as fire cause, location, and season, were found to significantly influence human casualties. These findings provide practical insights for improving fire protection policies and emergency response. Through such a comprehensive analysis, it is believed that some important results have been obtained that precisely describe the specific relationships between the causes and consequences of fires occurring in the Republic of Serbia.

1. Introduction

Fires have almost always been and remain one of the most important and urgent problems of human society. Therefore, the scientific study of fires is of crucial importance, as it contributes to understanding, preventing, controlling and reducing their consequences. In addition to the numerous reasons for introducing scientific methodology into fire research, one of the key ones is their forensic analysis, i.e., determining the cause of the fire and reconstructing the events themselves, as well as the impact on the occurrence of fires with human casualties or major damage. Such analysis can support the definition of various preventive and insurance measures and processes, so it is not surprising that many researchers, in an attempt to understand the cause-and-effect relationships of fires, devote significant time to this topic [1,2].
In this sense, mathematical-predictive models that use various forms of machine learning and artificial intelligence techniques have been particularly important recently. Thus, for instance, Madaio et al. [3] developed the so-called Firebird framework in order to identify and prioritize fires in commercial buildings, while Choi et al. [4] perform forest fire risk prediction using Google Earth Engine and compare different machine learning models. The application of operational statistical methods in aerial firefighting interventions is given by Sherry et al. [5], while Gündüz et al. [6] use machine learning to detect burned areas after a fire. It should be noted that some general introduction to the application of machine learning methods in forest fire analysis is also given in [7,8,9]. In addition to the above, in several other related articles (see, e.g., [10,11,12,13]), the applications of various machine learning models are presented, and they continue to be improved and developed to this day.
Applying a similar approach, an analysis of the cause-and-effect relationships and features of fire incidents that primarily affect people’s safety has been conducted here. For this purpose, a dataset on fires in the territory of the Republic of Serbia was observed, covering the period since the official introduction of a new, more comprehensive methodology for recording fires. More precisely, the observed dataset was collected over a fifteen-year period, from 1 January 2009 to 31 December 2023. In this period, nearly 6000 fires in Serbia resulted in human casualties, either injuries or fatalities. While these represent a fraction of all recorded fire incidents in the country during that period, they account for the most serious public safety concerns. By analyzing fires with direct human consequences, this study aims to better understand the key risk factors and contexts in which fire-related harm occurs. To this end, several different DM techniques, based on some well-known theoretical results (see, e.g., [14,15]), were applied to this dataset; they provide a basis for discovering certain rules within the data and thus enable successful inference and decision-making.
In more detail, the structure of the manuscript is as follows: The next section, Section 2, describes a set of variables of different types that are important for revealing the relationships and factors influencing basic aspects of human safety, i.e., injuries or deaths caused by fires. The DM methods used in the analysis of the observed fire data are also presented here, along with their practical application. Subsequently, Section 3 contains the main research results obtained by applying different DM techniques: stochastic modeling, segmentation, and classification of the fire data. As previously pointed out, these DM techniques are applied primarily to the variables that describe injuries and deaths caused by fires. Therefore, by applying the so-called zero-and-one inflated Poisson distribution, stochastic modeling of the empirical distributions of these two variables was performed. Then, clustering of the sets of their values was carried out using the K-means and agglomerative algorithms. Finally, using a larger number of input (independent) variables, the classification of their impact on the occurrence of injuries or deaths from fires was carried out using decision tree models. The following section, Section 4, discusses and analyzes the obtained results, specifically from the point of view of identifying details and patterns that can improve the safety of the population, that is, reduce fire risks. Finally, the last section, Section 5, contains some concluding remarks.

2. Materials and Methods

In this section, the dataset used in the study is first described, with special emphasis on the independent (input) and dependent (output) variables that were introduced for this purpose. The basic techniques and tools that were implemented in the DM analysis of the observed dataset are also briefly discussed.

2.1. Dataset

In this study, as mentioned earlier, a dataset is used that refers to persons injured and killed in fires in the territory of the Republic of Serbia in the period from 2009 to 2023. The data were obtained from the Ministry of Internal Affairs of the Republic of Serbia, Sector for Analytics, Telecommunications and Informatics (SATIT). Based on them, it was found that a total of 6013 people were injured and 1370 people were killed in fires during the observed time period, across 5769 independent events. Furthermore, according to an official report from the International Association of Fire and Rescue Services (CTIF) [16], the age-standardized fire mortality rate in Serbia (ICD-10 codes X00-X09) is 1.5 per 100,000 inhabitants. As shown in Table 1, this value is higher than the world average, as well as higher than in most European countries with comparable population density. Similarly, the same data source shows that the fire injury rate in Serbia—4.9 per 100,000 inhabitants—is also above average, i.e., higher than in most of the other countries. Taking this into account, as well as the long-term statistical trend of growth of these indicators in Serbia [17], it is clear that deaths and injuries in fires represent a significant problem, which highlights the need for improving preventive and public safety measures.
In further analysis, each record of an individual fire is interpreted as an instance of the observed dataset, i.e., an individual event with information on the cause, location of the fire, type of object, place of occurrence, time of notification, arrival of the fire brigade and extinguishing of the fire, as well as the number of fatalities and the number of injured persons. At the same time, as already mentioned earlier, the main motive of this research is to determine the interdependence of the appropriate input variables to the number of injured and victims, which are viewed as output (target) variables. For this purpose, both of these groups of variables are described separately below.

2.1.1. Input Variables

Our research used a total of ten input variables, classified in accordance with the appropriate methodological guidelines. This classification was chosen primarily to enable their further DM analysis, particularly classification with respect to the values of the output (target) variables. The basic characteristics of all input variables are given in Table 2 below.
As can be seen, among the above input variables, six are nominal, two are ordinal, and two are numeric. Their brief description can be given as follows:
  • The “Cause of Fire (CoF)” variable contains five groups of values, labeled A, B, C, D, and E, which identify the causes that led to the fire, such as human factors, technical factors, or natural causes. The grouping of the different fire causes is based on the classification carried out in Section 3, and its detailed structure is shown in Table 3.
  • The variable “Fire Location (FL)” is classified into three groups and has a similar function to the “FCL” variable. Additionally, this variable provides more detailed information about the location of the fire in residential buildings, for example, a fire on the ground floor, a fire on the second floor, etc. The structure of the individual values of this variable is shown in Table 4.
  • The variable “Fire Category by Location (FCL)” contains four groups, representing information on the fire location (open space/indoor space) and the building type. The detailed structure of the individual values of this variable is shown in Table 5.
  • The “Season” variable is classified into four groups; the data indicate how the season and the associated weather conditions affect the occurrence of fires.
  • The variable “Day of Day (DoD)” is classified into two groups; the data distinguish fires occurring on weekdays from those occurring on weekends.
  • The variable “City of Fire Origin (CFO)” is used for spatial data classification, critical location identification, and geographic fire analysis. The values of this variable are divided into eight groups, using the Quantum Geographic Information System (QGIS) software [18]. These groups are presented in the form of maps in Figure A1 and Figure A2 of Appendix A.
  • The variable “Year of Fire Occurrence (YFO)” serves for the temporal classification of the data, whose values are analyzed in annual intervals.
  • The “Hour of Notification (HoN)” variable is also used for temporal classification; its values reflect the number and frequency of fire alarms in certain time intervals.
  • The “Alert/Arrival (A/A)” variable serves to evaluate the effectiveness of the fire response. It shows the time interval between receiving a fire notification and the arrival of the fire department at the location.
  • The “Alert/Extinguishment (A/E)” variable describes the interval from the time the fire department arrives at the location to the time the fire is extinguished.
In addition, note that for all of the above input variables (usually called attributes), their information gain (IG) was calculated with respect to the two output variables (“Injuries” and “Fatalities”, described below), and these values are also shown in Table 2. The IG value represents the difference in entropy between the states before and after the selection of the attribute A. More precisely, IG values are calculated according to the following formula:

IG(A) = H(S) - H(A \mid S),

where

H(S) = -\sum_{i=1}^{n} p_i \log_2 p_i

is the entropy over the classes S_1, \ldots, S_n of the set S, while p_1, \ldots, p_n are the a posteriori probabilities of choosing the classes S_1, \ldots, S_n, and

H(A \mid S) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} \cdot H(S_i)

is the entropy after dividing S into the classes S_i with respect to the attribute A. As can be seen, the attribute “Cause of the Fire” has the largest IG value and is therefore taken as primary during classification, i.e., in the formation of decision trees (see Section 3).
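As an illustration, the entropy and information-gain computation described above can be sketched in a few lines of Python. The data here are purely hypothetical toy values, not the Serbian fire records:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) = -sum_i p_i * log2(p_i) over class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute, target):
    """IG(A) = H(S) - sum_i (|S_i|/|S|) * H(S_i), where the subsets S_i
    of the target are induced by the values of the attribute A."""
    n = len(target)
    groups = {}
    for a, t in zip(attribute, target):
        groups.setdefault(a, []).append(t)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - remainder

# Hypothetical toy data: a cause attribute and a YES/NO injury outcome
cause = ["A", "A", "B", "B", "C", "C"]
injured = ["YES", "YES", "NO", "NO", "YES", "NO"]
ig = information_gain(cause, injured)
```

An attribute that splits the target perfectly would have IG equal to the full entropy H(S); here, causes A and B are pure while C is mixed, so the gain is strictly between 0 and H(S).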

2.1.2. Output Variables

In addition to the input variables mentioned above, the observed dataset also contains two output variables of numeric type (which we often call series), describing the total number of injured, X, and of fatalities, Y, in each of the observed fires. In this way, they can be interpreted as nonnegative integer-valued random variables, and some well-known procedures for estimating their parameters and stochastic modeling can be applied (see, e.g., Stojanović et al. [19,20]). The basic statistical indicators of these variables (their minima, quartiles, maxima, mode, mean, variance, standard deviation, skewness, and sums) are given in Table 6. From here it is easy to see, for instance, that the variable X mostly contains the value of one injured person, although its extreme value (column Max) is as high as 30 injured people. Similarly, variable Y, which describes the number of deaths, mostly contains zero values, indicating fires without fatalities. Unfortunately, there are also fatal incidents, with a maximum of 7 fire victims.
Additionally, note that for both output variables the mean values (column Mean) are approximately equal to their variances (column Var). Therefore, we assume that the empirical distributions of these two variables can be fitted using the well-known Poisson distribution, or rather its modifications known as the zero-and-one inflated Poisson distributions. Still, in order to perform a DM analysis of the observed data using classification techniques, i.e., decision trees, we also re-coded the values of the variables X, Y into two classes. More precisely, the values of these variables are coded as “YES” (if there are injuries, i.e., fatalities) or “NO” (if there are no injuries, i.e., fatalities). The main theoretical aspects of the DM-based approach thus defined are briefly presented below.

2.2. Data Segmentation (Clustering)

Clustering is the process of discovering groups (clusters) of similar values in observed data. It is a form of so-called unsupervised learning that involves searching the input database for spontaneously generated divisions between individual data. In simple terms, clustering divides data into smaller logical groups, so-called clusters, so that objects within the same cluster are similar to each other, while objects from different clusters are different from each other. Clusters are often used to detect changes or deviations, with the primary goal of finding individual data that do not fit into established norms (see, for instance, [21]). Thus, by considering the total number of injuries and deaths as two numerical variables, cluster analysis of fires in the Republic of Serbia allows for the identification of different types of incidents and the recognition of risk patterns that are not obvious from a simple overview of the data. The theoretical foundations of the clustering methods and techniques used in this study are briefly described below.

2.2.1. Silhouette Score Method

The Silhouette Score (SS) is a well-known method for selecting the optimal number of clusters, that is, for measuring the quality of clustering. For each element x_i, i = 1, \ldots, n, of the dataset X = \{x_1, x_2, \ldots, x_n\}, interpreted as points, the SS first calculates the following two values:
- a_i is the average distance of the point x_i from all other points within its cluster (compactness);
- b_i is the average distance to points in the nearest other cluster (separability).
Using these values, the so-called “silhouette” value for the point x_i is calculated as follows:

S_i = \frac{b_i - a_i}{\max(a_i, b_i)} = \begin{cases} 1 - a_i/b_i, & a_i < b_i \\ 0, & a_i = b_i \\ b_i/a_i - 1, & a_i > b_i, \end{cases} \qquad (1)

Thereafter, the overall SS value is calculated as the average of the values S_i obtained for all x_i, i = 1, \ldots, n. It is worth noting that, according to Equation (1), SS values range from −1 to 1, with higher values (closer to 1) indicating better clustering. This method is typically used by calculating SS values for different numbers of clusters k = 1, 2, \ldots. Then, the optimal k is chosen as the one for which the SS value is either the largest or stabilizes (a so-called SS plateau is obtained).
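The per-point silhouette values of Equation (1) can be sketched directly in Python. This is a minimal illustration on hypothetical one-dimensional data (e.g., casualty counts), not the study's actual clustering code:

```python
import numpy as np

def silhouette_values(points, labels):
    """Per-point silhouette S_i = (b_i - a_i) / max(a_i, b_i)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(points)
    # pairwise Euclidean distances (points may be 1-D counts)
    diff = points.reshape(n, -1)[:, None, :] - points.reshape(n, -1)[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    scores = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False  # exclude the point itself from a_i
        a_i = dist[i, same].mean() if same.any() else 0.0
        # b_i: smallest average distance to any other cluster
        b_i = min(dist[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b_i - a_i) / max(a_i, b_i)
    return scores

# Two well-separated hypothetical clusters -> average SS close to 1
x = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
lab = [0, 0, 0, 1, 1, 1]
ss = silhouette_values(x, lab).mean()
```

In practice one would evaluate this average for several candidate values of k and pick the one where the score peaks or plateaus, as described above.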

2.2.2. K-Means Algorithm

A large number of algorithms are used for clustering, and one of the most commonly used is known as the K-means algorithm. This algorithm finds a certain (predefined) number of k clusters that are represented by the so-called centroids (cluster centers). Based on the data and the given number of initial centroids, the K-means algorithm generates clusters, whereby the number of clusters, the maximum number of iterations, the maximum number of optimization steps, etc., can be preset. The algorithm itself is very simple and consists of the following steps:
  • Initialization step: k initial centroid points c_1, c_2, \ldots, c_k are (randomly) selected.
  • Assignment step: k clusters are formed based on the “proximity” of each point x_i, i = 1, \ldots, n, to the nearest centroid c_j, j = 1, \ldots, k. In other words, for a given dataset X, the algorithm seeks a partition into k clusters C_1, C_2, \ldots, C_k that minimizes the distance d(x_i, c_j), which is usually measured as the Euclidean or some other similarity metric. In this way, each point x_i \in X is assigned to the cluster C_j whose centroid c_j is closest to x_i.
  • Update step: New centroids are calculated as the mean points of each cluster.
  • Repeat steps 2 and 3 until the centroids no longer change.
It should be emphasized that K-means is efficient for large datasets, assuming that the clusters are spherical in shape. However, the algorithm is sensitive to so-called isolated data (outliers), which represents its main weakness.
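The initialization, assignment, and update steps above can be sketched in a few lines of Python. The data are hypothetical one-dimensional counts, purely for illustration:

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    """Plain K-means: random initial centroids, then repeat the
    assignment and update steps until the centroids stop moving."""
    pts = np.asarray(points, dtype=float).reshape(len(points), -1)
    rng = np.random.default_rng(seed)
    centroids = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid by Euclidean distance
        d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: new centroid = mean of its cluster (keep old if empty)
        new = np.array([pts[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Hypothetical 1-D data forming two obvious groups
data = [0, 0, 1, 1, 6, 7, 7, 8]
labels, centers = k_means(data, k=2)
```

On such clearly separated data, the two groups end up in distinct clusters regardless of the random initialization.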

2.2.3. Agglomerative-Hierarchical (AH) Clustering

The AH algorithm uses the so-called bottom-up approach and can be described through the following steps:
  • Step 1: Start with n clusters (each point is its own cluster).
  • Step 2: Calculate the distances between all clusters.
  • Step 3: Merge the two closest clusters into a new one.
  • Step 4: Update the distances between the clusters.
  • Step 5: Repeat steps 3 and 4 until the given number of clusters is formed.
Thus, the AH method starts by considering each point as a separate cluster, and in each subsequent iteration the two closest clusters are merged until the desired number of clusters remains. The advantages of this algorithm are that it does not require a prior definition of the number of clusters and allows simple visualization via the so-called dendrograms [21]. Still, from a computational point of view, it is more demanding than the previous K-means algorithm.
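The bottom-up merging procedure can be sketched as follows in Python, here with single linkage (closest-member distance) on hypothetical one-dimensional data; the linkage choice is an assumption for illustration only:

```python
def agglomerative(points, k):
    """Bottom-up clustering: start with one cluster per point and
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(points))]

    def cluster_dist(c1, c2):
        # single linkage: distance between the two closest members
        return min(abs(points[i] - points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        # find and merge the closest pair of clusters
        pairs = [(cluster_dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 1-D counts with two clear groups
pts = [0, 1, 1, 9, 10, 10]
result = agglomerative(pts, k=2)
```

Recording the merge order (instead of stopping at k clusters) is what yields the dendrogram visualization mentioned above.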

2.3. Data Classification (Decision Trees)

As already pointed out, one of the main ideas of this study is to analyze the cause-and-effect relationships in fires, thereby enabling the identification of key factors that influence the most serious outcomes of these events, such as injuries or deaths. In this way, classification models are obtained that use attributes such as the cause of the fire, time and location of the incident, based on which relationships between different types of fires can be discovered and cases in which injuries and/or human casualties occur are classified. This enables proactive risk management and more efficient resource allocation, with the aim of successfully preventing such incidents.

2.3.1. Basic Principles

Classification is a supervised learning technique in which a target (response or dependent) variable is predicted based on known input variables (predictors, attributes or independent variables). One of the most intuitive and commonly used methods for classification is the decision trees technique. Decision trees are a powerful and at the same time very simple graphical tool for classification, prediction and decision support [22,23]. They are typically used with large and highly correlated datasets by breaking them down into smaller sets using a set of rules. Each decision tree is a classifier that represents the classification function in the form of a tree, i.e., hierarchically, so that the nodes at the top of the tree have the greatest influence on the classification. In this way, decision trees model the decision-making process (from the root to the leaves), whereby data is divided over each node based on one attribute, and each leaf of the tree contains a specific prediction (class). The construction of a decision tree involves the application of the following principles:
- Selection of the best attribute for division (based on entropy, the Gini index, or statistical tests);
- Branching and continuing the division until the stopping criterion is met (when all cases are within the same class or the maximum depth of the tree is reached).

2.3.2. CHAID Algorithm

CHAID (Chi-squared Automatic Interaction Detector) is a decision-analysis method that uses statistical criteria, specifically the chi-square test of independence, to divide data into homogeneous groups. Basically, CHAID consists of several statistical procedures applied in multiple steps, starting from the selection of attributes for branching, through testing their dependence on the target variable, to the formation of nodes and branches. The first step in building a CHAID tree is to select the predictor X with the strongest statistical association with the target variable Y. For this purpose, the chi-square test of independence, i.e., the χ² statistic, is used:

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}, \qquad (2)

where r and s are the numbers of categories of the variables X and Y, respectively, f_{ij} are the observed, and \hat{f}_{ij} are the expected frequencies for each cell (see, for more detail, e.g., [24]).
Note that the χ² statistic, defined by Equation (2), allows the calculation of the so-called p-value, i.e., the probability of incorrectly rejecting the hypothesis of independence of the variables X, Y. Therefore, the attribute X with the lowest p-value is chosen to split the node, while, on the other hand, the node is not split if X does not have a statistically significant relationship with the dependent variable Y. This ensures that splitting is performed only when there is a real statistical basis for doing so, thus avoiding over-segmentation of the data.
An important feature of the CHAID algorithm is the ability to merge similar categories of independent variables. If some values (cases or categories) of the independent variable X do not show statistically significant differences compared to the dependent variable Y , they are merged. This is also implemented by applying the χ 2 test for each possible pair of categories. More precisely, if the p -value between two categories exceeds a given merging threshold α , those categories are merged into one. This process is repeated until a set of categories that are statistically significantly different from each other is obtained. In this way, the number of branches emerging from one node can be greater than two, resulting in a non-binary tree. This is a significant advantage of the CHAID algorithm over some other algorithms, as it allows for a more natural and interpretable data classification.

3. Results

This section presents the main results of the DM analysis of the observed dataset, obtained by applying three different techniques whose theoretical aspects were explained in the previous section. First, for the two numerical variables (series) X and Y, which represent the number of injuries and deaths in a fire, respectively, stochastic modeling of their empirical distributions was conducted. For this purpose, the so-called zero-and-one inflated Poisson distribution was used, and subsequently, segmentation (clustering) of the value sets of these variables was carried out. Finally, by re-coding the values of the variables X and Y as YES/NO cases (with or without injuries and fatalities), they can also be viewed as output (target) variables in the classification of causal fire cases. Therefore, using decision tree models and the CHAID algorithm, a classification of cause-and-effect fire cases was performed with respect to the output values of X and Y.

3.1. Stochastic Modeling

To model the empirical distributions of the variables X and Y, as explained earlier, we use the fact that the mean values and variances of the observed series are approximately equal. Thus, it can be assumed that the Poisson or some Poisson-based distributions can be used in this type of stochastic modeling. Accordingly, our idea is to model the empirical distributions of the variables X and Y with the so-called zero-and-one (0–1) inflated Poisson distribution. The 0–1 inflated distribution was presented in the pioneering work of Saito et al. [25], and thereafter, the 0–1 Poisson distribution was examined in detail by Zhang et al. [26,27]. Subsequently, many authors have developed various types of zero-and-one distributions and processes (see, e.g., [28,29,30]), and some of these results are applied in the following. To that end, Table 7 shows the summary counts of zeros and ones, the percentages of their occurrence, and the so-called zero-and-one (0–1) indices, within both sets of observed variables.
According to this, it can be clearly seen that the variable X has particularly frequent 1-values, while the Y variable mostly has 0-values. Moreover, the significance of the inflationary presence of 0–1 values was additionally tested using the so-called 0–1 indices (see, e.g., Weiß et al. [31]):

I_0 = \frac{\hat{p}_0 - \exp(-\bar{X}_n)}{\hat{\sigma}_0}, \qquad I_1 = \frac{\hat{p}_1 - \bar{X}_n \exp(-\bar{X}_n)}{\hat{\sigma}_1},

where n = 5769 is the sample size, \bar{X}_n = n^{-1} \sum_{i=1}^{n} x_i is the sample mean, and \hat{\sigma}_0, \hat{\sigma}_1 are the sample deviations of the statistics

Z_0 = \hat{p}_0 - \exp(-\bar{X}_n), \qquad Z_1 = \hat{p}_1 - \bar{X}_n \exp(-\bar{X}_n),

respectively. Note that in the case of the ordinary Poisson distribution, the equalities I_0 = 0 and I_1 = 0 hold.
Thus, by applying some general asymptotic results (see, e.g., [31]), it can be shown that both statistics Z_0 and Z_1 are asymptotically Gaussian distributed. By applying the usual procedure based on the standard Gaussian distribution, one can test the hypothesis H_0 that the presence of 0–1 values in these series is not significantly emphasized. The results of such testing are also shown in Table 7 above, where it is noticeable that the hypothesis H_0 is rejected in the case of index I_1 for the X-variable and I_0 for the Y-variable. Therefore, the series X has a significant presence of 1-values, and the series Y has a significant presence of 0-values, so both can be modeled using the zero-and-one inflated Poisson distribution.
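The raw inflation statistics underlying these indices can be sketched in Python. This is a simplified illustration on a hypothetical sample: it computes only Z_0 and Z_1, omitting the standard errors (and hence the full I_0, I_1 test statistics and Gaussian testing) for brevity:

```python
import math

def zero_one_statistics(sample):
    """Return p0_hat, p1_hat, the sample mean, and the raw inflation
    statistics Z0 = p0_hat - exp(-mean), Z1 = p1_hat - mean*exp(-mean).
    Under an ordinary Poisson model both Z0 and Z1 are close to zero."""
    n = len(sample)
    mean = sum(sample) / n
    p0 = sample.count(0) / n
    p1 = sample.count(1) / n
    z0 = p0 - math.exp(-mean)
    z1 = p1 - mean * math.exp(-mean)
    return p0, p1, mean, z0, z1

# Hypothetical 1-inflated sample: far more ones than a Poisson would give
sample = [1] * 70 + [0, 2, 2, 3] * 5 + [1] * 10
p0, p1, mean, z0, z1 = zero_one_statistics(sample)
```

For such a one-inflated sample, Z_1 comes out clearly positive (excess of ones) while Z_0 is negative (deficit of zeros), mirroring the pattern reported for the X-series.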
For that purpose, we first consider the X-variable, which we model using a 1-Poisson distribution, whose probability mass function (PMF) is defined as follows:

p_X(x) = \begin{cases} \phi_1 + \phi_2\, p(x; \lambda), & x = 1 \\ \phi_2\, p(x; \lambda), & x \neq 1, \end{cases} \qquad (3)

where x = 0, 1, 2, \ldots, and p(x; \lambda) = \lambda^x \exp(-\lambda)/x! is the PMF of the ordinary Poisson distribution with the parameter λ > 0. In addition, ϕ1 ∈ (0, 1) is a parameter that expresses the additional proportion of ones compared to the ordinary Poisson distribution, and ϕ2 = 1 − ϕ1. According to Equation (3), it is obvious that the PMF p_X(x) includes the 1-values more prominently than the ordinary Poisson distribution. Also, the 1-Poisson distribution depends on the unknown parameters λ, ϕ1, ϕ2 > 0, which can be estimated using various estimation methods.
To estimate these parameters, we apply the so-called method of moments (MoM), based on equating the first two theoretical moments \mu_r(\lambda, \phi_1) = E(X^r), r = 1, 2, with the empirical ones \hat{\mu}_r = n^{-1} \sum_{i=1}^{n} x_i^r, r = 1, 2. In the case of the 1-Poisson distribution, it can easily be shown that the theoretical moments satisfy:

\mu_r(\lambda, \phi_1) = \phi_1 + \phi_2 M_r(\lambda), \quad r = 1, 2,

where M_1(\lambda) = \lambda and M_2(\lambda) = \lambda^2 + \lambda are the first two theoretical moments of the ordinary Poisson distribution. Therefore, the MoM estimates can be obtained by solving the following system of equations:

\begin{cases} \phi_1 + \phi_2 \lambda = \hat{\mu}_1 \\ \phi_1 + \phi_2 (\lambda^2 + \lambda) = \hat{\mu}_2 \end{cases}

with respect to the unknown parameters λ, ϕ1, ϕ2 > 0. After some simple calculations, one obtains the solution to the above system:

\hat{\lambda}_{1/2} = \frac{b \pm \sqrt{b^2 - 4ab}}{2a}, \qquad \hat{\phi}_2 = \frac{a}{\hat{\lambda} - 1} = \frac{b}{\hat{\lambda}^2}, \qquad \hat{\phi}_1 = 1 - \hat{\phi}_2,

where

a = \hat{\mu}_1 - 1 = \frac{1}{n} \sum_{i=1}^{n} (x_i - 1), \qquad b = \hat{\mu}_2 - \hat{\mu}_1 = \frac{1}{n} \sum_{i=1}^{n} x_i (x_i - 1) \qquad (4)

are the first two sample falling factorial moments.
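This MoM procedure can be sketched in Python. The quadratic a·λ² − b·λ + b = 0 comes from eliminating ϕ2 between the relations ϕ2(λ − 1) = a and ϕ2λ² = b; the rule of keeping a root with an admissible mixing proportion, as well as the toy sample, are illustrative assumptions:

```python
import math

def fit_one_inflated_poisson(sample):
    """Method-of-moments estimates for the 1-inflated Poisson model:
    solve a*lam^2 - b*lam + b = 0 with a = mean(x - 1), b = mean(x*(x-1)),
    then phi2 = b / lam^2 and phi1 = 1 - phi2."""
    n = len(sample)
    a = sum(x - 1 for x in sample) / n
    b = sum(x * (x - 1) for x in sample) / n
    disc = math.sqrt(b * b - 4 * a * b)
    # two candidate roots; keep one giving a mixing proportion in (0, 1)
    for lam in ((b + disc) / (2 * a), (b - disc) / (2 * a)):
        phi2 = b / lam ** 2
        if lam > 0 and 0 < phi2 < 1:
            return lam, 1 - phi2, phi2
    raise ValueError("no admissible method-of-moments solution")

# Hypothetical toy sample with an excess of ones:
lam, phi1, phi2 = fit_one_inflated_poisson([1] * 8 + [0, 4])
```

By construction, the returned estimates satisfy the moment relations exactly: ϕ̂2(λ̂ − 1) = a and ϕ̂2·λ̂² = b.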
The results of this estimation procedure are shown in Table 8 below, where, in addition to the estimates of the 1-Poisson parameters λ, ϕ1, ϕ2 > 0, the estimate of the Poisson distribution parameter \hat{\lambda}_{Pois} = \hat{\mu}_1 = \bar{X}_n is also given. As can be seen, both distributions have (approximately) equal estimated values of the parameter λ, but the additional proportion of 1-values, expressed by the estimated value of the parameter ϕ̂1 ≈ 1/2, clearly indicates their pronounced presence. In that sense, for both distributions, statistical testing of their agreement with the empirical distribution of the X-series was also performed. For this purpose, the following three statistical tests are used:
- the Wilcoxon rank-sum test with continuity correction (W);
- the permutation asymptotic general independence test (Z);
- the asymptotic two-sample Kolmogorov–Smirnov test (D).
Note that all the above tests were carried out in the statistical programming language “R” (version 4.5.0), using the ‘twosamples’ software package [32]. The values of the test statistics, as well as the corresponding p-values, are also given in Table 8.
Based on them, it is obvious that in the case of the 1-Poisson distribution, the null hypothesis of its agreement with the empirical distribution of the X -series not rejected in any of the listed tests. In contrast, the ordinary Poisson distribution is clearly not an adequate stochastic model, as can also be seen in the following Figure 1. Similarly to the X -variable, we consider below modeling the empirical distribution of the Y -variable using a 0-Poisson distribution, whose PMF is given by the equality:
p Y y = ϕ 0 + ϕ 2 p y ; λ ,         y = 0 ϕ 2 p y ; λ ,                               y 0 .
Here, as before, $y = 0, 1, 2, \ldots$, $p(y;\lambda) = \lambda^y \exp(-\lambda)/y!$ is the PMF of the ordinary Poisson distribution, $\phi_0 \in (0, 1)$ is the additional proportion of zeros, and $\phi_2 = 1 - \phi_0$.
In the same way as for the 1-Poisson distribution, the unknown parameters $\lambda, \phi_0, \phi_2 > 0$ of the 0-Poisson distribution can be estimated using MoM, that is, by solving the following system of equations:
$$\begin{cases} \phi_2 \lambda = \hat{\mu}_1, \\ \phi_2 (\lambda^2 + \lambda) = \hat{\mu}_2. \end{cases}$$
After some computations, the solution to the above system is obtained as follows:
$$\hat{\lambda} = \frac{\hat{\mu}_2}{\hat{\mu}_1} - 1 = \frac{b}{a + 1}, \qquad \hat{\phi}_2 = \frac{\hat{\mu}_1}{\hat{\lambda}} = \frac{\hat{\mu}_2}{\hat{\lambda}(\hat{\lambda} + 1)}, \qquad \hat{\phi}_0 = 1 - \hat{\phi}_2,$$
where $\hat{\mu}_1, \hat{\mu}_2$ are the sample moments and $a, b$ are the sample falling factorial moments defined by Equation (4). The results of the parameter estimation of the 0-Poisson distribution, as well as the parameter $\hat{\lambda}_{Pois} = \hat{\mu}_1 = \bar{Y}_n$ of the ordinary Poisson distribution, are given in Table 9.
According to the results thus obtained, it is clear that both distributions have approximately equal estimated values of the parameter $\lambda$. Nevertheless, unlike the previous $X$-variable, here the additional proportion of 0-values is significantly smaller ($\hat{\phi}_0 \approx 0.08$). Therefore, both considered distributions can serve as adequate stochastic models of the $Y$-distribution. This is confirmed by all test statistics, which indicate the asymptotic equivalence of the fitted distributions with the empirical distribution of the $Y$-variable; that is, in neither case is there any basis for rejecting the hypothesis of equivalence of these distributions.
This is also visible in Figure 2 above, where, similarly to Figure 1, the time dynamics, ACF, and empirical distribution of the $Y$-series, fitted by the Poisson and 0-Poisson distributions, are shown. According to them, it is clear that asymptotic equivalence of the corresponding distributions exists, so both proposed fitting distributions can be considered adequate. Finally, note that, based on the fitted models, some other stochastic characteristics of the observed variables can also be examined (e.g., predicted probabilities of fires with a higher number of injuries and deaths). In addition, below we analyze some other characteristics of the observed numerical variables, related to the application of the previously mentioned DM techniques.
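As an illustration of such predictions, the MoM fit of the 0-Poisson model and a tail-probability estimate (e.g., the probability of a fire with at least a given number of deaths) can be sketched as follows (the function names and the simulated data are ours, not the study's code):

```python
import numpy as np
from math import exp, factorial

def fit_zero_poisson(y):
    """MoM estimates for the 0-Poisson model: lambda = mu2/mu1 - 1,
    phi2 = mu1/lambda (illustrative sketch, not the authors' code)."""
    y = np.asarray(y, dtype=float)
    m1, m2 = y.mean(), np.mean(y ** 2)
    lam = m2 / m1 - 1.0
    phi2 = m1 / lam
    return lam, 1.0 - phi2, phi2        # lambda_hat, phi0_hat, phi2_hat

def tail_prob(y0, lam, phi0):
    """P(Y >= y0) under the fitted 0-Poisson model; for y0 >= 1 the
    inflated zero mass does not contribute to the tail."""
    if y0 == 0:
        return 1.0
    pois_cdf = sum(lam ** k * exp(-lam) / factorial(k) for k in range(y0))
    return (1.0 - phi0) * (1.0 - pois_cdf)
```

The fitted parameters can then be plugged into `tail_prob` to estimate, for instance, how likely a fire with two or more fatalities is under the model.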

3.2. Data Clustering

As already explained in the previous section, here we consider the possibility of clustering the variables $X$ and $Y$, which describe, respectively, the number of injuries and the number of deaths from fires in the Republic of Serbia. To this end, we emphasize that the variables $X$ and $Y$ have a relatively low correlation (Pearson correlation coefficient of approximately 0.23), which justifies their clustering. In the first step, in accordance with the above, the Silhouette Score (SS) is used as the key tool for selecting the optimal number of clusters. It should be noted that the optimal number of clusters in unsupervised clustering methods (such as the K-means and AH methods) is not determined a priori. It is usually chosen heuristically, based on the behavior of the data structure, and in that setting the SS is one of the most commonly used metrics for assessing clustering quality. At the same time, it should be pointed out that the SS has a dual role, as it directly combines the compactness of the clusters (i.e., their internal distance) with the distance between them (the inter-cluster distance). In our case, the dependence of the SS values on the number of clusters $k$, shown in Figure 3, indicates the following:
  • For the K-means clustering, the SS values gradually increase from k = 2 to k = 9, reaching a plateau between k = 9 and k = 10, where they become identical.
  • Similarly, for the AH clustering, the SS values increase rapidly up to k = 10, with no further improvement from k = 10 to k = 11.
This stabilization (plateau) of the SS value indicates that the optimal granularity has been reached, and therefore the optimal number of clusters is chosen to be k = 9 for K-means and k = 10 for AH.
Namely, we assume that additional clusters no longer contribute significantly to the improvement in separation and coherence. In addition, the choice of 9–10 clusters allows relatively simple but sufficiently high-quality interpretability. Therefore, the values chosen in this way are expected to provide a satisfactory balance between detail and stability for both models. Finally, it should be noted that the optimal SS value for the AH method (0.9708) is slightly higher than the corresponding value for the K-means algorithm (0.9523). Therefore, AH is expected to cluster the data slightly better than the K-means algorithm.
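The selection procedure described above can be sketched with scikit-learn (a minimal illustration with assumed parameter settings; in the study, the rows of `data` would be the (X, Y) pairs):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(data, k_values=range(2, 12)):
    """Silhouette Score for K-means and AH clustering over a range of k.

    The plateau of the returned scores is what motivates the choice of
    k = 9 (K-means) and k = 10 (AH) in the text.
    """
    data = np.asarray(data, dtype=float)
    scores = {}
    for k in k_values:
        models = {
            "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
            "ah": AgglomerativeClustering(n_clusters=k),
        }
        for name, model in models.items():
            labels = model.fit_predict(data)
            scores[(name, k)] = silhouette_score(data, labels)
    return scores
```

Plotting `scores` against `k` for each method reproduces a diagnostic of the kind shown in Figure 3.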

3.2.1. K-Means Clustering

Clustering using the K-means algorithm, as previously highlighted, was performed with a total of k = 9 clusters. For this purpose, the "KMeans()" function from the "sklearn.cluster" library was used in the "IDLE Python 3.12" 64-bit environment [33]. The results of this clustering are shown in Figure 4 below, and their descriptive statistical analysis is shown in the following Table 10. It should be noted that the largest clusters are A (3745 cases) and B (1060 cases), which together account for about 83% of all data. Cluster A represents typical, minor fire incidents with one injury and no fatalities. On the contrary, cluster B includes cases with one fatality and no injuries, probably fires that escalated very quickly. The remaining clusters identify less common but significant patterns, which can be briefly described as follows:
- Cluster C: higher number of injuries, almost no fatalities.
- Cluster D: serious fires (an average of almost 10 injured).
- Cluster E: transition between mild and serious fires ("yellow alert").
- Clusters F and G: rare but extreme fires with a high number of victims and injuries.
- Cluster H: very specific accidents with few injuries and high mortality.
- Cluster I: mixed profile with 1–2 injured and fatalities.
Overall, clusters A and B represent the two main categories in the segmentation: one injury versus one death without injuries. On the other hand, cluster C (with an average of about 4.7 injuries and 0.09 deaths) includes moderately severe fires in buildings and vehicles with multiple occupants. Some typical causes of such fires (analyzed in more detail below) are electrical wiring, fireplaces, open flames, explosions, etc. In these cases, evacuation was faster, so mortality was extremely low, but people were injured while escaping or from smoke inhalation. This cluster can therefore serve as an indicator of functional evacuation, where human casualties are mostly avoided but serious physical consequences to health remain.
Also, cluster D is particularly important, as it indicates very serious fires occurring inside buildings with a high density of people, with a large number of injuries but also fatalities. This suggests incomplete evacuation, late intervention, or sudden escalation of the fire. Therefore, this cluster has a high priority for systematic data analysis. Cluster E includes a total of 659 fires of moderate intensity, with a certain number of injuries but no fatalities. A more detailed analysis of the data shows that these fires occur not only inside buildings and vehicles but also in open spaces.
Furthermore, clusters F and G, which contain only a few cases (F: 4, G: 3), identify extreme events, with a higher number of injuries and deaths, that are important for safety policies. Cluster G is particularly specific due to its high variability in the number of injured, expressed by the standard deviation (column StDev in Table 10).
Cluster H (with an average of approximately 0.13 injured and 2.1 fatalities) denotes accidents with a very low number of injuries but a relatively high mortality. It is also specific from an analysis point of view, as it indicates very critical situations, where the fire was extremely fast and deadly, and should not be ignored in safety policies.
Cluster I (~1.3 injured and ~1.0 fatalities) represents mixed cases where there is one (or more) injured and killed. These are most often fires in which there was (only) partial evacuation, which may indicate a certain incomplete effectiveness of safety procedures. Therefore, there is an opportunity here to improve safety policies, with the aim of saving lives through better fire detection and more efficient evacuation.
Finally, note that the segmentation of the observed dataset, obtained using K-means clustering, is useful for targeting security and prevention measures. For instance, clusters A, B, and E indicate common accidents, while F, G, and I are key for analyzing critical incidents. Therefore, the resulting segmentation provides a rich basis for prevention, intervention strategies, and risk categorization, successfully identifying dominant patterns and extreme cases.
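A per-cluster summary of the kind shown in Table 10 can be reproduced with a short pandas/scikit-learn sketch (function and column names are illustrative, assuming the raw injury and death counts as input):

```python
import pandas as pd
from sklearn.cluster import KMeans

def cluster_profile(injuries, deaths, k=9):
    """Per-cluster descriptive statistics in the spirit of Table 10:
    count, mean, and standard deviation of injuries and deaths."""
    df = pd.DataFrame({"injuries": injuries, "deaths": deaths})
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    df["cluster"] = km.fit_predict(df[["injuries", "deaths"]])
    return df.groupby("cluster").agg(
        count=("injuries", "size"),
        inj_mean=("injuries", "mean"), inj_stdev=("injuries", "std"),
        dth_mean=("deaths", "mean"), dth_stdev=("deaths", "std"),
    )
```

The returned table directly supports the cluster interpretations given above (e.g., size and severity of each cluster).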

3.2.2. AH Clustering

Hierarchical clustering produces similar results, but with a larger number of clusters, so the segmentation is somewhat more nuanced. It should also be noted (again) that the AH clustering method, in addition to a slightly larger number of clusters (k = 10), also has a higher SS value than the K-means algorithm. Similarly to the previous case, the AH clustering uses the "AgglomerativeClustering()" function from the "sklearn.cluster" library in "IDLE Python". The results of this clustering are shown in Figure 5, and their statistical analysis is given in Table 11 below.
A practical interpretation of clusters obtained by the agglomerative method, in the context of fires with injuries and fatalities, can be given as follows:
- Cluster A contains the most severe incidents, with an average of 21 injuries and 4.67 deaths. These are typically fires in large facilities (hospitals, factories), that is, mass tragedies. This cluster has high variability in the number of injuries and fatalities (columns StDev in Table 11) and is equivalent to cluster G in the K-means clustering.
- Cluster B, with an average of approximately 3.35 injuries and almost no deaths, shows medium-severity fires with successful evacuation.
- Cluster C (~10.5 injuries, ~0.67 deaths) indicates very serious situations and a potential delay in the response of the competent services.
- Cluster D (~5.8 injuries, 0 deaths) represents moderately risky events, probably fires in schools, business premises, etc.
- Cluster E (~1.2 injuries, 1 death) has mixed outcomes, most likely accidents with partial evacuation.
- Cluster F (~0.18 injured, ~2.1 deaths) shows extremely deadly fires, which typically occur indoors.
- Cluster G (~2.0 injured, 0 deaths) contains typical smaller fires with minor injuries.
- Cluster H (~1.0 injured, 0 deaths) represents the most common pattern, i.e., smaller fires with one person injured.
- Cluster I (~2.25 injured, ~5.25 deaths) indicates catastrophic fires with high mortality, which are quite rare but very serious.
- Cluster J (0 injured, 1 death) contains fires with a single fatality and no injuries, mostly explosions, sudden incidents, etc.
It is clear that this segmentation, as well as the previous one, provides a very rich basis for prevention, intervention strategies, and risk categorization. For instance, clusters H and J are obviously equivalent to clusters A and B, obtained by applying the K-means algorithm. Therefore, they contain the most frequent cases, with low intensity and mild consequences, one injured (H) or one victim without injuries (J), so they form the basis for common or minor incidents.
On the contrary, clusters B, D, and G show medium-severe fires, with more injuries but (almost) no deaths, and indicate, similar to the K-means algorithm, fires that occur inside buildings, vehicles, and open spaces. Further, clusters C and E have mixed outcomes, with both injuries and fatalities, indicating events with partial evacuation. Finally, clusters A, F and I are extremely serious incidents with high mortality rates, often with more fatalities than injuries. Therefore, they indicate incidents that require special attention from a safety point of view, as well as for a plan for crisis situations, with the aim of reducing the number of fire victims.

3.3. Classification Analysis

In this part, a classification of the above-mentioned target variables, i.e., classes with or without injuries and fatalities, was performed in relation to the set of input variables (described in the previous Table 2). In doing so, as previously pointed out, the variable “Cause of the Fire (CoF)” has the highest IG-values. Thus, already in the first classification step, as seen in Figure 6, this attribute “separates” all groups of fire causes (described in more detail in the previous Table 3).
It can be noted that the classification of fire causes from groups A, B and E is quite satisfactory (the lowest accuracy is 85%). Therefore, our main goal is the classification of the target variables within cause group C, and especially within group D, which is the only group containing a single fire cause: a cigarette butt. It is particularly worth noting that the current classification of the target variables "Injuries" and "Fatalities" for cause D is quite uniform, i.e., it has the highest percentage of victims and therefore requires a significantly more detailed analysis.
The classification itself was performed using the "Decision Tree" tool in the IBM SPSS Statistics (version 26) software environment [34], using cross-validation with ten folds and the variable "A/E" as an influence variable. As already explained, the obtained decision trees show the influence of the input variables listed in the previous Table 2 on situations in which personal injury or death occurs in a fire. In this way, the decision tree represents a classifier that provides a clear segmentation of the impact of different fire characteristics on the occurrence of risky outcomes for human safety. Some of the basic metrics of the classifier obtained in our research are shown in the following Table 12.
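An open-source analogue of this setup can be sketched with scikit-learn. Note the hedge: scikit-learn implements CART rather than CHAID, so the sketch below only approximates the SPSS procedure, and the encoding, depth limit, and names are our assumptions:

```python
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

def tree_cv_accuracy(X_cat, y, max_depth=7, folds=10):
    """Ten-fold cross-validated decision-tree classification.

    X_cat holds categorical inputs (e.g., CoF, FL, Season as in Table 2);
    y is the binary target ("Injuries" or "Fatalities" present or not).
    """
    X = OrdinalEncoder().fit_transform(X_cat)   # simple categorical encoding
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    return cross_val_score(clf, X, y, cv=folds).mean()
```

The depth limit of 7 mirrors the 6–7 branching levels reported for the trees in Table 12.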
According to this, it can be seen that both decision trees for cause group C are quite extensive (more than 100 nodes). In particular, the classification of the variable "Fatalities" yields the most extensive tree, in which all independent variables are included. On the other hand, when the target variable "Injuries" is classified, (only) the attribute "Hour of Notification" is not included. As for the metric characteristics of the trees for cause D, it is clear that they have a significantly smaller number of input variables and nodes, which is consistent with the size of this sub-dataset (272 instances). However, as highlighted above, this cause has the highest mortality rate and is therefore of particular importance in this study. At the same time, it can be noted that all obtained trees have a similar depth (6 or 7 branching levels), as well as approximately equal, relatively small standard errors. In addition, the number of terminal nodes ("leaves"), which give the final classification for a particular outcome, indicates the effective accuracy of the resulting decision trees.
The detailed structure of decision trees for cause D is shown in Figure A3 and Figure A4 in Appendix A. (Due to the size of the decision trees for group C causes, their graphical representation is provided in the “Supplementary Materials”.) As can be seen, the output values of both target variables are obtained according to the six independent variables (“FL”, “Season”, “CFO”, “FCL”, “A/A”, “DoD”), where the fire location (variable “FL”) is primary in both classifications. At the same time, both trees have a similar, multi-layered and detailed structure, indicating a complex decision-making process. Let us point out that the data segmentations with complete decision-making (100% confirmed cases indicating a very high risk of injury and death, that is, 100% negative cases indicating completely safe situations) are clearly differentiated. Therefore, both decision trees thus obtained can be considered highly predictive, with the potential to identify risk situations and enable targeted action (e.g., preventive measures in high-risk cities, seasons or categories of facilities).
As confirmation, Table 13 and Table 14 below provide the classification results for all the aforementioned decision trees. More precisely, both tables give the total numbers of correctly and incorrectly predicted outcomes for both target variables. The classification can be considered satisfactory, given that the total number of correct outcomes is significantly higher than the number of incorrect ones. In addition, the incorrect outputs are fairly evenly distributed, indicating that errors occur in different parts of the trees and, therefore, that there is no overfitting. This suggests that the independent variables make relatively equal contributions, i.e., there is no "dominant" variable that exclusively determines the outcome. Thus, multiple combinations of attributes influence each decision, making the model more stable and allowing its generalization. It can also be seen that the classification is significantly better in the case of cause D (cigarette butt), which contains fewer instances but, as already mentioned, represents an important fire cause from the point of view of high mortality.
The above classification tables allow for a more detailed analysis of the quality of the resulting decision trees. As is known, the accuracy measures of a particular decision tree can be expressed in several ways, and some of the most commonly used qualitative measures can be expressed in the following way:
$$\mathrm{Accuracy} = \frac{TP + TN}{N}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Sensitivity\ (Recall)} = \frac{TP}{TP + FN},$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}, \qquad F\text{-}\mathrm{measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}.$$
In the above, $TP$, $FP$, $TN$ and $FN$ denote the total numbers of true positive, false positive, true negative and false negative predictions, respectively, while $N = TP + FP + TN + FN$ is the total size of the observed dataset. The calculated values of all listed quality measures, obtained for all the aforementioned decision trees, are shown in the following Table 15.
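These measures follow directly from the confusion-matrix counts; a minimal sketch (ours, not the authors' code):

```python
def classification_metrics(tp, fp, tn, fn):
    """Quality measures from the confusion counts, as defined above."""
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)     # recall
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

For example, Sensitivity = 1.0 arises exactly when fn = 0, and Precision = Specificity = 1.0 exactly when fp = 0, which is how the cause-D results below should be read.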
According to these results, it is clear that there is a high accuracy rate, as a result of the favorable distribution of values in the classification matrices given in Table 13 and Table 14. For cause C, both models have satisfactory, high quality measures, with true positive cases better detected for the target variable "Injuries" and false positive cases reduced for the variable "Fatalities". On the other hand, the quality measures for cause D are extremely high, with all values very close to (or even equal to) one. For instance, the classification of injuries in the decision tree model is perfectly sensitive, since there are no false negative cases (Sensitivity = 1.0). Conversely, the classification model for fatalities never produces false positive cases (Precision = Specificity = 1.0). This confirms that the decision trees obtained by the CHAID algorithm are practically applicable for predicting outcomes based on the observed input variables.
In addition, the accuracy of the model's classification is also confirmed by the so-called ROC (Receiver Operating Characteristic) curves. As is well known, ROC curves are polygonal lines whose vertical axis shows the rate of true positive responses (i.e., sensitivity), while the horizontal axis shows the rate of false positive responses (the complement of specificity). The area under the ROC curve, usually denoted AUC (Area Under Curve), is a measure of classification quality, i.e., it represents the ability of the model to correctly distinguish between classes. As can be seen in Figure 7 and Figure 8, the very high AUC values (once again) confirm that the proposed classification model, based on decision trees, can provide a very high percentage of correct predictions.
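The ROC/AUC computation itself can be reproduced with scikit-learn; the labels and scores below are toy placeholders standing in for the trees' predicted class probabilities (not data from the study):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy ground-truth labels and predicted scores (placeholders, not study data).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.3])

# ROC curve: false positive rate vs. true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
```

Plotting `fpr` against `tpr` yields the polygonal curve described above; an AUC near 1 indicates near-perfect separation of the two classes.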

4. Discussion

The conducted analysis confirms the high value of DM methods in research related to human safety in fires. First, the application of stochastic modeling through the 0-1 Poisson distribution indicated a statistically significant inflation of zeros and ones in the number of injured and killed persons. This feature also indicates that the most common fires are those with one person injured, while fires with casualties are rarer but still present to a worrying extent. Such results provide a basis for predicting future events of a similar nature, as well as for formulating proactive measures for their prevention.
Additionally, cluster analysis using the K-means and the agglomerative (hierarchical) method allowed the identification of typical and atypical fire behavior patterns. In this regard, it should be noted that although both of the observed variables ( X ,   Y ) represent human consequences of fire incidents, the low Pearson correlation coefficient ( ρ 0.23 ) between the number of injuries and deaths indicates that these two outcomes are often the result of different types of fire dynamics. For instance, fires with rapid progression and no time for evacuation usually lead to deaths without injuries (e.g., cluster B in the K-means clustering), while those that allow partial escape may lead to multiple injuries but no deaths (e.g., cluster C in the same clustering method). Thus, the weak correlation confirms the need to treat injuries and fatalities as different dimensions in fire analysis and management.
It should also be pointed out that each identified cluster reveals specific patterns that suggest appropriate fire management strategies. For example, in the K-means clustering, cluster A represents minor incidents with only one injury (and no fatalities), indicating that standard safety protocols are proving effective in this case. In contrast, cluster B includes sudden deaths without injuries, which indicates the need to improve detection systems and fire-resistant infrastructure for such fires. On the other hand, as already highlighted above, cluster C indicates fires where (partial) evacuation led to multiple injuries but no fatalities. Cluster D, with a high number of injuries (but also fatalities), highlights the importance of effective evacuation procedures, while extreme clusters like F and G require an urgent focus on fire drills, structural safety, and specialized equipment. Finally, the presence of cluster I (a mixed injury–fatality profile) highlights partial failures in evacuation or delayed response, stressing the need for integrated risk mitigation strategies.
It is particularly significant that clusters with a large number of victims were identified, which, although rare, clearly indicates the need for specific crisis protocols. In this way, as already highlighted, the resulting segmentation provides the important basis for prevention, intervention strategies and risk categorization, successfully identifying dominant patterns and extreme cases. It is important to highlight that each cluster has been characterized in detail based on both statistical descriptors and practical implications. Despite the general similarity of segmentation results obtained via K-means and agglomerative-hierarchical (AH) methods, the AH algorithm showed a slightly higher Silhouette Score (0.9708 vs. 0.9523), suggesting a marginally better internal structure of the resulting clusters. Given the relatively small number of extreme cases in the dataset and the presence of subtle hierarchical patterns, the AH approach appears to be more appropriate for this type of safety-oriented analysis. This aligns with previous research where AH clustering was preferred in studies involving rare but critical events and non-globular clusters [35,36]. Future studies may further explore the use of density-based or model-based clustering for additional refinement.
Finally, the classification analysis performed using the CHAID algorithm enables the identification of key input variables that significantly influence the probability of injury or death. Among others, variables such as fire location, season, city, and facility category showed strong associations with the severity of consequences. At the same time, special emphasis is placed on fires caused by cigarette butts (group D), which show a high incidence of fatalities, clearly indicating the need for specific educational and preventive measures. The high precision and accuracy of the classification models further confirm the justification of this approach. In addition, the approximately uniform distribution of errors in the CHAID decision trees indicates stable and reliable models that do not make systematic errors on particular data. Since fire injuries and fatalities depend on multiple interdependent factors, uniform errors imply that no group is completely predictable, which is typical for events with numerous latent factors. This is a desirable signal in classification, especially in the domain of security and prevention.
We note also that all proposed algorithms successfully identify dominant patterns and extreme cases. In this way, as is seen, the applied DM techniques provide a concise and precise description of the results and their interpretation, as well as the conclusions that follow from them. This is also necessary from a security perspective, as it allows for better prevention, as well as more efficient and timely decision-making. Based on the above, we believe that this work provides an operational basis for the improvement of public policies and practices, including better spatial distribution of resources and recognition of risky seasons and facilities, as well as strategic planning of interventions. At the same time, the proposed methodological framework is easily adaptable to other types of accidents that involve endangering human lives [37,38].

5. Conclusions

This study explores the causes and consequences of fires in Serbia, focusing on human injuries and fatalities. Over 15 years of national fire data were analyzed, revealing that most fires result in either one injury or no victims, while a small number of cases lead to severe consequences. The results show that most fires in Serbia are low-intensity with limited consequences, yet certain causes, such as fires initiated by cigarette butts, are linked to disproportionately high mortality. Clustering and classification identified several high-risk fire patterns and contextual factors (e.g., season, location, type of facility) that require targeted interventions.
Thus, the findings in this study can serve as a foundation for the development of more effective prevention strategies, timely fire response mechanisms, and enhanced public safety measures. In addition, these findings may inform the design of tailored fire management practices suited to the specific needs of different regions in Serbia. For example, by aligning the identified risk profiles with local infrastructure, resource availability, and historical incident characteristics, fire services can improve the allocation of firefighting capacities, design more targeted prevention campaigns, and adapt emergency response protocols accordingly.
Although detailed data on local fire management practices were not available for every area, the results of this study may support some strategic improvements. For example, by aligning the specific characteristics of clusters with regional needs, local governments can refine emergency response plans, invest in targeted public education, and optimize resource allocation. Also, clusters with frequent fatalities may indicate the need for improved detection technologies, while those with multiple injuries could focus on better evacuation procedures. In this way, fire classification provides valuable insights for improving preparedness and reducing human losses.
Based on the aforementioned, the results achieved in this study can be considered significant, although it is also necessary to point out certain limitations of this research. First of all, as already pointed out, the available database does not contain information such as weather conditions, construction characteristics of buildings and age of victims, which would further enrich the analysis. Also, a comparison with other algorithms like CART, C5.0 or Random Forests can be included in some future research, in order to contribute to the robustness and broader insight into the performance of the model. To this end, the integration of geospatial data (GIS) and socio-demographic indicators would additionally enable more precise spatial risk analysis and the development of interactive early warning systems.
For future work, we emphasize that the proposed model can be expanded with additional socio-demographic factors, as well as the integration of weather and meteorological data in order to increase the predictive power of the system. The introduction of geolocation analysis could further improve spatial segmentation and enable the development of early warning systems and targeted prevention campaigns. Finally, the ethical and legal aspects of the implementation of predictive models in public safety systems can be considered, as can the application of such systems in accordance with data protection and human rights laws, especially if they are used for automatic risk assessment or decision-making.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/fire8080302/s1: Dataset used in this research; Figures S1–S3: Decision tree obtained for the target variable “Injuries” (cause C); Figures S4–S6: Decision tree obtained for the target variable “Fatalities” (cause C).

Author Contributions

Conceptualization, N.M., V.S.S. and D.M.; methodology, N.M., V.S.S. and D.M.; software, N.M., V.S.S. and M.J.; validation, N.M., V.S.S. and M.J.; formal analysis, N.M., V.S.S. and M.J.; investigation, N.M. and V.S.S.; resources, N.M.; data curation, N.M.; writing—original draft preparation, N.M. and V.S.S.; writing—review and editing, D.M. and M.J.; visualization, N.M. and V.S.S.; supervision, D.M.; project administration, M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials. Further inquiries can be directed at the corresponding author.

Acknowledgments

The authors sincerely thank the Ministry of Internal Affairs of the Republic of Serbia, which officially provided the dataset presented in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

  • Figure A1 below shows the grouped values of cities and populated areas in the Republic of Serbia where fires occurred (variable "CFO", listed in Table 2), in relation to the target variable "Injuries".
  • Figure A2 below shows the same grouped values in relation to the target variable "Fatalities".
  • Figure A3 shows the decision tree for cause D and the target variable "Injuries".
  • Figure A4 shows the decision tree for cause D and the target variable "Fatalities".
Figure A1. Map of populated places grouped within the “CFO” variable (target variable “Injuries”).
Figure A2. Map of populated places grouped within the “CFO” variable (target variable “Fatalities”).
Figure A3. Decision tree for the target variable “Injuries” (cause D).
Figure A4. Decision tree for the target variable “Fatalities” (cause D).

References

  1. Butry, D.T.; Prestemon, J.P.; Abt, K.L.; Sutphen, R. Economic Optimization of Wildfire Intervention Activities. Forest Policy Econ. 2010, 12, 115–121. [Google Scholar] [CrossRef]
  2. Zou, Y.; Rasch, P.J.; Wang, H.; Xie, Z.; Zhang, R. Increasing Large Wildfires over the Western United States Linked to Diminishing Sea Ice in the Arctic. Nat. Commun. 2021, 12, 6048. [Google Scholar] [CrossRef]
  3. Madaio, M.; Chen, S.T.; Haimson, O.L.; Zhang, W.; Cheng, X.; Hinds-Aldrich, M.; Chau, D.H.; Dilkina, B. Firebird: Predicting Fire Risk and Prioritizing Fire Inspections in Atlanta. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar]
  4. Choi, J.; Yun, Y.; Chae, H. Forest Fire Risk Prediction in South Korea Using Google Earth Engine: Comparison of Machine Learning Models. Land 2025, 14, 1155. [Google Scholar] [CrossRef]
  5. Sherry, L.; Chaudhari, M. Aerial Fire Fighting Operational Statistics (2024): Very Large/Large Air Tankers. Fire 2025, 8, 160. [Google Scholar] [CrossRef]
  6. Gündüz, H.İ.; Torun, A.T.; Gezgin, C. Post-Fire Burned Area Detection Using Machine Learning and Burn Severity Classification with Spectral Indices in İzmir: A SHAP-Driven XAI Approach. Fire 2025, 8, 121. [Google Scholar] [CrossRef]
  7. Alkhatib, R.; Sahwan, W.; Alkhatieb, A.; Schütt, B. A Brief Review of Machine Learning Algorithms in Forest Fires Science. Appl. Sci. 2023, 13, 8275. [Google Scholar] [CrossRef]
  8. Abid, F. A Survey of Machine Learning Algorithms Based Forest Fires Prediction and Detection Systems. Fire Technol. 2021, 57, 559–590. [Google Scholar] [CrossRef]
  9. Rubí, J.N.S.; Paulo de Carvalho, H.P.; Paulo, R.L.G. Application of Machine Learning Models in the Behavioral Study of Forest Fires in the Brazilian Federal District region. Eng. Appl. Artif. Intell. 2023, 118, 105649. [Google Scholar] [CrossRef]
  10. Jain, P.; Coogan, S.C.; Subramanian, S.G.; Crowley, M.; Taylor, S.; Flannigan, M.D. A Review of Machine Learning Applications in Wildfire Science and Management. Environ. Rev. 2020, 28, 478–505. [Google Scholar] [CrossRef]
  11. Sun, L.; Xu, C.; He, Y.; Zhao, Y.; Xu, Y.; Rui, X.; Xu, H. Adaptive Forest Fire Spread Simulation Algorithm Based on Cellular Automata. Forests 2021, 12, 1431. [Google Scholar] [CrossRef]
  12. Wood, D.A. Prediction and Data Mining of Burned Areas of Forest Fires: Optimized Data Matching and Mining Algorithm Provides Valuable Insight. Artif. Intell. Agric. 2021, 5, 24–42. [Google Scholar] [CrossRef]
  13. McNorton, J.R.; Di Giuseppe, F.; Pinnington, E.; Chantry, M.; Barnard, C. A Global Probability-of-Fire (PoF) Forecast. Geophys. Res. Lett. 2024, 51, e2023GL107929. [Google Scholar] [CrossRef]
  14. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar] [CrossRef]
  15. Aggarwal, C. Data Mining: The Textbook; Springer International Publishing AG: Cham, Switzerland, 2015. [Google Scholar] [CrossRef]
  16. The CTIF World Fire Statistics, Report № 29. 2024. Available online: https://www.ctif.org/world-fire-statistics (accessed on 13 July 2025).
  17. Marić, P.; Mlađan, D.; Stevanović, B.; Nikolić, G.; Đukanović, S. Statistical Approach for Establishing Individual Fire Risk in European Countries and Republic of Serbia (in Serbian). In Proceedings of the ISC: Safety Engineering; Fire and Explosion Protection, Novi Sad, Serbia, 26–27 September 2018; pp. 125–134. [Google Scholar]
  18. Quantum Geographic Information System (QGIS), Software. Available online: https://qgis.org/ (accessed on 13 June 2025).
  19. Stojanović, V.; Ljajko, E.; Tošić, M. Parameters Estimation in Non-Negative Integer-Valued Time Series: Approach Based on Probability Generating Functions. Axioms 2023, 12, 112. [Google Scholar] [CrossRef]
  20. Stojanović, V.S.; Bakouch, H.S.; Gajtanović, Z.; Almuhayfith, F.E.; Kuk, K. Integer-Valued Split-BREAK Process with a General Family of Innovations and Application to Accident Count Data Modeling. Axioms 2024, 13, 40. [Google Scholar] [CrossRef]
  21. Xu, R.; Wunsch, D. Survey of Clustering Algorithms. IEEE Trans. Neural Netw. 2005, 16, 645–678. [Google Scholar] [CrossRef]
  22. De Ville, B. Decision trees. Wiley Interdiscip. Rev. Comput. Stat. 2013, 5, 448–455. [Google Scholar] [CrossRef]
  23. Myles, A.J.; Feudale, R.N.; Liu, Y.; Woody, N.A.; Brown, S.D. An Introduction to Decision Tree Modeling. J. Chemom. 2004, 18, 275–285. [Google Scholar] [CrossRef]
  24. Ritschard, G. CHAID and Earlier Supervised Tree Methods. In Contemporary Issues in Exploratory Data Mining in the Behavioral Sciences; Routledge: New York, NY, USA, 2013; pp. 48–74. [Google Scholar]
  25. Saito, M.Y.; Rodrigues, J. A Bayesian Analysis of Zero and One Inflated Distributions. Rev. Mat. Estatíst. 2005, 23, 47–57. [Google Scholar]
  26. Zhang, C.; Tian, G.; Ng, K. Properties of the Zero-and-One Inflated Poisson Distribution and Likelihood-Based Inference Methods. Stat. Interface 2016, 9, 11–32. [Google Scholar] [CrossRef]
  27. Zhang, C.; Tian, G.-L.; Yuen, K.C.; Wu, Q.; Li, T. Multivariate Zero-and-One Inflated Poisson Model with Applications. J. Comput. Appl. Math. 2020, 365, 112356. [Google Scholar] [CrossRef]
  28. Qi, X.; Li, Q.; Zhu, F. Modeling Time Series of Count with Excess Zeros and Ones Based on INAR(1) Model with Zero-and-One Inflated Poisson Innovations. J. Comput. Appl. Math. 2019, 346, 572–590. [Google Scholar] [CrossRef]
  29. Mohammadi, Z.; Sajjadnia, Z.; Bakouch, H.S.; Sharafi, M. Zero-and-One Inflated Poisson–Lindley INAR(1) Process for Modelling Count Time Series with Extra Zeros and Ones. J. Stat. Comput. Simulat. 2022, 92, 2018–2040. [Google Scholar] [CrossRef]
  30. Stojanović, V.S.; Bakouch, H.S.; Ljajko, E.; Qarmalah, N. Zero-and-One Integer-Valued AR(1) Time Series with Power Series Innovations and Probability Generating Function Estimation Approach. Mathematics 2023, 11, 1772. [Google Scholar] [CrossRef]
  31. Weiß, C.H.; Homburg, A.; Puig, P. Testing for Zero Inflation and Overdispersion in INAR(1) Models. Stat. Pap. 2019, 60, 823–848. [Google Scholar] [CrossRef]
  32. Dowd, C. Twosamples: Fast Permutation Based Two Sample Tests, R Package, Version 2.0.1. 2023. Available online: https://cloud.r-project.org/web/packages/twosamples/index.html (accessed on 25 May 2025).
  33. Scikit-Learn Documentation. Available online: https://scikit-learn.org/stable/modules/clustering.html (accessed on 17 April 2025).
  34. Baizyldayeva, U.B.; Uskenbayeva, R.K.; Amanzholova, S.T. Decision Making Procedure: Applications of IBM SPSS Cluster Analysis and Decision Tree. World Appl. Sci. J. 2013, 21, 1207–1212. [Google Scholar]
  35. Abdalla, H.I. A Brief Comparison of K-means and Agglomerative Hierarchical Clustering Algorithms on Small Datasets. In Proceedings of the 2021 International Conference on Wireless Communications, Networking and Applications (WCNA 2021), Berlin, Germany, 17–19 December 2021; Lecture Notes in Electrical Engineering; Qian, Z., Jabbar, M., Li, X., Eds.; Springer: Singapore, 2021. [Google Scholar] [CrossRef]
  36. Peterson, A.D.; Ghosh, A.P.; Maitra, R. Merging K-means with Hierarchical Clustering for Identifying General-Shaped Groups. Stat (Int. Stat. Inst.) 2018, 7, e172. [Google Scholar] [CrossRef]
  37. Pireddu, A.; Bedini, A.; Lombardi, M.; Ciribini, A.L.C.; Berardi, D. A Review of Data Mining Strategies by Data Type, with a Focus on Construction Processes and Health and Safety Management. Int. J. Environ. Res. Public. Health 2024, 21, 831. [Google Scholar] [CrossRef]
  38. Linardos, V.; Drakaki, M.; Tzionas, P.; Karnavas, Y. Machine Learning in Disaster Management: Recent Developments in Methods and Applications. Mach. Learn. Knowl. Extract. 2022, 4, 446–473. [Google Scholar] [CrossRef]
Figure 1. Plot above: Dynamics of the number of injuries X within the observed dataset. Plots below: Autocorrelation function (ACF) of the variable X (left) and empirical distribution of the same variable (right), fitted with the ordinary Poisson and 1-Poisson distribution.
Figure 2. Plot above: Dynamics of the number of fatalities Y within the observed dataset. Plots below: Autocorrelation function (ACF) of the variable Y (left) and empirical distribution of the same variable (right), fitted with the ordinary Poisson and 0-Poisson distribution.
Figure 3. Dependence of SS values with respect to the number of clusters k: (a) K-means clustering method; (b) AH clustering.
Figure 4. K-means clustering with the optimal number of k = 9 clusters.
Figure 5. AH clustering with the optimal number of k = 10 clusters.
Figure 6. The first classification branch, obtained by the attribute “Cause of the Fire” and target variables: (a) Injuries; (b) Fatalities.
Figure 7. ROC areas of decision tree classification for cause C with target variables: (a) Injuries; (b) Fatalities.
Figure 8. ROC areas of decision tree classification for cause D with target variables: (a) Injuries; (b) Fatalities.
Table 1. Fire death and injury rates in Serbia and some other European countries (source: CTIF 2024 report [16]).

| Country  | Deaths per 100,000 Inh. | Injuries per 100,000 Inh. |
|----------|-------------------------|---------------------------|
| Bulgaria | 2.39 | 4.4 |
| Croatia  | 0.88 | 4.1 |
| Finland  | 0.92 | 6.6 |
| Greece   | 0.67 | 0.9 |
| Hungary  | 0.97 | 8.0 |
| Portugal | 0.52 | 13.2 |
| Serbia   | 1.50 | 4.9 |
| Average  | 1.14 | 3.2 |
Note: “Average” values are based on CTIF 2024 report [16], calculated across n = 55 reporting countries.
Table 2. Independent variables relevant for the cause-and-effect analysis of fires.

| Ord. Num. | Variable (Attribute) | Type | Values | Description | IG (Injuries) | IG (Fatalities) |
|---|---|---|---|---|---|---|
| 1. | Cause of the Fire (CoF) | Nominal | A, B, C, D, E | The various causes of fires. | 0.0602 | 0.0562 |
| 2. | Fire Location (FL) | Nominal | Position I–III | Positions of fire locations. | 0.0212 | 0.0197 |
| 3. | Fire Category by Location (FCL) | Nominal | Category I–IV | The various categories of fires. | 0.0230 | 0.0227 |
| 4. | Season | Nominal | Winter, Spring, Summer, Autumn | Season of the fire. | 0.0086 | 0.0090 |
| 5. | Day of Day (DoD) | Nominal | Weekday, Weekend | Day of the week when the fire occurred. | 0.0004 | 0.0002 |
| 6. | City of Fire Origin (CFO) | Nominal | Group I–VIII | City where the fire occurred. | 0.0086 | 0.0075 |
| 7. | Year of Fire Occurrence (YFO) | Ordinal | 2009, …, 2023 | The year the fire occurred. | 0.0040 | 0.0044 |
| 8. | Hour of Notification (HoN) | Ordinal | 1, …, 24 [h] | The hour of the day when the fire was reported. | 0.0082 | 0.0094 |
| 9. | Alert/Arrival (A/A) | Numeric | 0, …, 186 [min] | Time from notification to arrival of the fire service. | 0.0063 | 0.0055 |
| 10. | Alert/Extinguishment (A/E) | Numeric | 0, …, 1432 [min] | Time from notification to extinguishing the fire. | 0.0152 | 0.0187 |
Table 3. Description of grouped Causes of Fire (“CoF” variable).

| Values | Cause of the Fire | Number of Cases |
|---|---|---|
| A | Unspecified; Electrical conductors; Other causes; Collision; Exothermic reaction | 3504 |
| B | Explosion; Friction; Damage-defects | 268 |
| C | Open flames; Construction defects; Fireplaces; Conductors overheating from overload; Electrical devices | 1619 |
| D | Cigarette butt | 272 |
| E | Welding; Natural occurrences; Grinding; Self-ignition; Static electricity | 106 |
Table 4. Description of grouped Fire Location (“FL” variable).

| Values | Fire Location | Number of Cases |
|---|---|---|
| Position I | Ground floor | 2758 |
| Position II | Basement-basement; Floor from 1st to 4th; Attic; Attic-roof | 1532 |
| Position III | Floors from 4th to 7th; Floors from 8th to 15th; Floors higher than 16th; High attic; Unspecified | 1479 |
Table 5. Description of grouped Fire Category (by Location).

| Values | Fire Category (by Location) | Number of Cases |
|---|---|---|
| Category I | Residential building; Residential and commercial building; Trade and craft shop; Catering facility; Office building; Religious object; Health institution; Kindergarten/school/faculty; Hotel/motel; Cinema/theater; Nursing home; Department store; Home for neglected children | 3502 |
| Category II | Bus; Tanker trucks; Road vehicle; Other road vehicles; Freight road vehicle; Agricultural machines; Passenger car; Freight wagon; Water transportation vehicle; Electric locomotive; Means of air transport; Diesel locomotive | 562 |
| Category III | Coniferous forest; Orchard; Macchia (low vegetation); Mixed forest; Meadow; Other open space; Cereals; Deciduous forest; Vineyard | 716 |
| Category IV | Barrack/shed; Agricultural building; Other civil. facilities; Garbage dump; Container; Sil; Parking lot-garage; Production plant; Transformer station; Warehouses; Chimney; Construction site; Gas plant; Construction machinery; Working machinery; Refinery | 989 |
Table 6. Descriptive statistics of the number of injuries and fatalities.

| Variable | Min | 25% | 50% | 75% | Max | Mode | Mean | Var | StDev | Skew | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Injuries (X) | 0 | 1 | 1 | 1 | 30 | 1 | 1.042 | 1.041 | 1.020 | 7.862 | 6013 |
| Fatalities (Y) | 0 | 0 | 0 | 0 | 7 | 0 | 0.237 | 0.238 | 0.488 | 2.855 | 1370 |
Table 7. Results of zero-and-one values counting and testing for series X and Y.

| Variable | 0-Count | p̂₀ | I₀ (p-value) | 1-Count | p̂₁ | I₁ (p-value) |
|---|---|---|---|---|---|---|
| Injuries (X) | 1120 | 0.1941 | −0.5046 (0.6931) | 3840 | 0.6656 | 1.6819 * (0.0463) |
| Fatalities (Y) | 4507 | 0.7812 | 1.7203 * (0.0427) | 1183 | 0.2051 | 1.5250 (0.0636) |
* p < 0.05.
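The proportions p̂₀ and p̂₁ in Table 7 are simply the relative frequencies of zeros and ones in each count series (e.g., 1120 zeros and 3840 ones in the injuries series X). As a minimal sketch of that counting step, on a toy series (the per-series sample size is not restated in this section):

```python
from collections import Counter

def zero_one_proportions(series):
    """Empirical shares of zeros and ones in a count series, i.e. the
    estimates p-hat_0 and p-hat_1 whose inflation is tested in Table 7."""
    n = len(series)
    freq = Counter(series)  # missing keys default to 0
    return freq[0] / n, freq[1] / n

# Toy series (not the paper's data): 2 zeros and 5 ones out of 10 values.
sample = [0, 1, 1, 2, 1, 0, 3, 1, 1, 2]
p0_hat, p1_hat = zero_one_proportions(sample)
```

The inflation indices I₀ and I₁ then test whether these shares exceed what an ordinary Poisson model would predict (refs. [28,31]).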
Table 8. Estimates of the Poisson and 1-Poisson distributions, based on the sample from the X-series.

| Distribution | λ | φ₁ | φ₂ | W (p-value) | Z (p-value) | D (p-value) |
|---|---|---|---|---|---|---|
| Poisson | 1.0423 | — | — | 17,223,254 * (3.74 × 10⁻³) | 1.8472 (0.0647) | 0.1510 * (≈0.0000) |
| 1-Poisson | 1.0423 | 0.5009 | 0.4991 | 16,695,024 (0.6712) | 0.3409 (0.7332) | 0.0104 (0.9139) |
* p < 0.05.
Table 9. Estimates of the Poisson and 0-Poisson distributions, based on the sample from the Y-series.

| Distribution | λ | φ₀ | φ₂ | W (p-value) | Z (p-value) | D (p-value) |
|---|---|---|---|---|---|---|
| Poisson | 0.2375 | — | — | 16,539,970 (0.4340) | −1.5793 (0.1143) | 0.0154 (0.4984) |
| 0-Poisson | 0.2394 | 0.0813 | 0.9187 | 16,662,252 (0.8663) | 1.0421 (0.2793) | 0.0106 (0.9037) |
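The inflated distributions fitted in Tables 8 and 9 are two-component mixtures of a degenerate point mass and an ordinary Poisson component (refs. [25,26,30]). As a sketch under that mixture assumption, the 1-Poisson pmf can be written as P(X = x) = φ₁·1{x = 1} + φ₂·Pois(λ)(x); parameter names follow the table headers, while the estimation procedure itself (the probability generating function approach of refs. [19,30]) is not reproduced here:

```python
import math

def poisson_pmf(x, lam):
    """Ordinary Poisson probability mass function."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def one_inflated_poisson_pmf(x, lam, phi1, phi2):
    """1-inflated ("1-Poisson") mixture: a point mass at x = 1 with
    weight phi1 plus a Poisson(lam) component with weight phi2 = 1 - phi1."""
    return phi1 * (1.0 if x == 1 else 0.0) + phi2 * poisson_pmf(x, lam)

# Estimates from Table 8 (1-Poisson fit to the injuries series X).
lam, phi1, phi2 = 1.0423, 0.5009, 0.4991
total_mass = sum(one_inflated_poisson_pmf(x, lam, phi1, phi2) for x in range(100))
```

The extra point mass raises the probability at x = 1 well above the plain Poisson value, which is exactly the excess of ones flagged in Table 7. The 0-Poisson case in Table 9 is analogous, with the point mass placed at x = 0.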
Table 10. Descriptive statistics of clusters obtained using the K-means method.

| Label | Count | Inj. Mean | Inj. StDev | Inj. Min | Inj. 25% | Inj. 50% | Inj. 75% | Inj. Max | Fat. Mean | Fat. StDev | Fat. Min | Fat. 25% | Fat. 50% | Fat. 75% | Fat. Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 3745 | 1.000 | 0.000 | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 | 0.000 | 0.000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| B | 1060 | 0.000 | 0.000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 | 1.000 | 0.000 | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 |
| C | 101 | 4.693 | 0.956 | 4.0 | 4.00 | 4.0 | 5.00 | 7.0 | 0.089 | 0.349 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
| D | 12 | 9.917 | 1.929 | 8.0 | 8.75 | 10.0 | 10.25 | 15.0 | 0.500 | 1.000 | 0.0 | 0.00 | 0.0 | 0.25 | 3.0 |
| E | 659 | 2.184 | 0.387 | 2.0 | 2.00 | 2.0 | 2.00 | 3.0 | 0.000 | 0.000 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
| F | 4 | 2.250 | 2.062 | 0.0 | 1.50 | 2.0 | 2.75 | 5.0 | 5.250 | 0.957 | 4.0 | 4.75 | 5.5 | 6.00 | 6.0 |
| G | 3 | 21.00 | 8.185 | 14.0 | 16.50 | 19.0 | 24.50 | 30.0 | 4.667 | 2.082 | 3.0 | 3.50 | 4.0 | 5.50 | 7.0 |
| H | 66 | 0.136 | 0.426 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 | 2.106 | 0.310 | 2.0 | 2.00 | 2.0 | 2.00 | 3.0 |
| I | 119 | 1.303 | 0.590 | 1.0 | 1.00 | 1.0 | 1.00 | 4.0 | 1.017 | 0.129 | 1.0 | 1.00 | 1.0 | 1.00 | 2.0 |
Table 11. Descriptive statistics of clusters obtained using agglomerative (hierarchical) clustering.

| Label | Count | Inj. Mean | Inj. StDev | Inj. Min | Inj. 25% | Inj. 50% | Inj. 75% | Inj. Max | Fat. Mean | Fat. StDev | Fat. Min | Fat. 25% | Fat. 50% | Fat. 75% | Fat. Max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 3 | 21.00 | 8.185 | 14.0 | 16.5 | 19.0 | 24.5 | 30.0 | 4.667 | 2.082 | 3.0 | 3.50 | 4.0 | 5.5 | 7.0 |
| B | 187 | 3.348 | 0.521 | 3.0 | 3.0 | 3.0 | 4.00 | 6.0 | 0.080 | 0.326 | 0.0 | 0.00 | 0.0 | 0.0 | 2.0 |
| C | 9 | 10.56 | 1.810 | 9.0 | 10.0 | 10.0 | 11.0 | 15.0 | 0.667 | 1.118 | 0.0 | 0.00 | 0.0 | 1.0 | 3.0 |
| D | 43 | 5.837 | 0.949 | 5.0 | 5.0 | 6.0 | 6.00 | 8.0 | 0.000 | 0.000 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 |
| E | 113 | 1.204 | 0.404 | 1.0 | 1.0 | 1.0 | 1.00 | 2.0 | 1.000 | 0.000 | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
| F | 67 | 0.179 | 0.548 | 0.0 | 0.0 | 0.0 | 0.00 | 3.0 | 2.104 | 0.308 | 2.0 | 2.00 | 2.0 | 2.0 | 3.0 |
| G | 538 | 2.000 | 0.000 | 2.0 | 2.0 | 2.0 | 2.00 | 2.0 | 0.000 | 0.000 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 |
| H | 3745 | 1.000 | 0.000 | 1.0 | 1.0 | 1.0 | 1.00 | 1.0 | 0.000 | 0.000 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 |
| I | 4 | 2.250 | 2.061 | 0.0 | 1.5 | 2.0 | 2.75 | 5.0 | 5.250 | 0.957 | 4.0 | 4.75 | 5.5 | 6.0 | 6.0 |
| J | 1060 | 0.000 | 0.000 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 | 1.000 | 0.000 | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
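The clusters in Tables 10 and 11 were obtained with K-means and agglomerative hierarchical clustering; the paper relies on the scikit-learn implementations [33]. Purely as an illustration of the K-means step, here is a minimal pure-Python sketch of Lloyd's algorithm on toy (injuries, fatalities) pairs; it is a didactic sketch, not the implementation used in the study:

```python
def kmeans(points, k, iters=100):
    """Minimal Lloyd's algorithm on 2-D points: assign each point to the
    nearest centroid, then move each centroid to its cluster mean.
    Naive deterministic seeding (first k points) for reproducibility;
    real implementations such as scikit-learn's KMeans use k-means++."""
    centroids = [tuple(map(float, p)) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # centroids stopped moving: converged
            break
        centroids = new
    return centroids, clusters

# Toy (injuries, fatalities) pairs: a low-casualty and a high-casualty group.
pts = [(1, 0), (0, 1), (1, 1), (2, 0), (20, 5), (19, 4), (21, 6)]
cents, cls = kmeans(pts, k=2)
```

On this toy data the algorithm separates the two groups, mirroring how the paper's clustering isolates small, high-casualty clusters (such as G in Table 10) from the dominant single-injury cluster A.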
Table 12. Metric characteristics of obtained decision trees.

| Tree Metric | Cause C, Injuries | Cause C, Fatalities | Cause D, Injuries | Cause D, Fatalities |
|---|---|---|---|---|
| Total nodes | 120 | 139 | 53 | 52 |
| Terminal nodes | 78 | 91 | 35 | 33 |
| Tree depth | 7 | 6 | 7 | 7 |
| Input variables | 7 | 8 | 6 | 6 |
| Estimate | 0.035 | 0.048 | 0.007 | 0.007 |
| Std. Error | 0.005 | 0.006 | 0.005 | 0.005 |
Table 13. Classification table for obtained decision trees (cause C).

| Observed | Injuries: Pred. YES | Injuries: Pred. NO | Injuries: Total | Fatalities: Pred. YES | Fatalities: Pred. NO | Fatalities: Total |
|---|---|---|---|---|---|---|
| YES | 976 | 28 | 1004 | 381 | 39 | 420 |
| NO | 20 | 366 | 386 | 28 | 942 | 970 |
| Total | 996 | 394 | 1390 | 409 | 981 | 1390 |
Table 14. Classification table for obtained decision trees (cause D).

| Observed | Injuries: Pred. YES | Injuries: Pred. NO | Injuries: Total | Fatalities: Pred. YES | Fatalities: Pred. NO | Fatalities: Total |
|---|---|---|---|---|---|---|
| YES | 123 | 0 | 123 | 158 | 2 | 160 |
| NO | 2 | 147 | 149 | 0 | 112 | 112 |
| Total | 125 | 147 | 272 | 158 | 114 | 272 |
Table 15. Classification quality measures for observed fire causes and obtained decision trees.

| Measure | Cause C, Injuries | Cause C, Fatalities | Cause D, Injuries | Cause D, Fatalities |
|---|---|---|---|---|
| Accuracy | 0.9655 | 0.9518 | 0.9926 | 0.9926 |
| Precision | 0.9799 | 0.9315 | 0.9840 | 1.0000 |
| Sensitivity | 0.9721 | 0.9071 | 1.0000 | 0.9875 |
| Specificity | 0.9482 | 0.9711 | 0.9866 | 1.0000 |
| F-measure | 0.9760 | 0.9192 | 0.9919 | 0.9937 |
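The measures in Table 15 follow the standard confusion-matrix definitions. As a consistency check, the following sketch recomputes the cause C “Injuries” column from the counts in Table 13 (TP = 976, FN = 28, FP = 20, TN = 366):

```python
def classification_measures(tp, fn, fp, tn):
    """Standard confusion-matrix quality measures, as reported in Table 15."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall, a.k.a. true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f_measure

# Cause C, target "Injuries" (Table 13): observed YES row (976, 28), NO row (20, 366).
acc, prec, sens, spec, f1 = classification_measures(tp=976, fn=28, fp=20, tn=366)
# Rounded to four decimals these match the corresponding Table 15 column:
# 0.9655, 0.9799, 0.9721, 0.9482, 0.9760.
```

The same function reproduces the remaining columns from Tables 13 and 14 by swapping in the corresponding counts.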