Forest Fire Probability Mapping in Eastern Serbia: Logistic Regression versus Random Forest Method

Forest fire risk has increased globally during the previous decades. The Mediterranean region is traditionally the most at risk in Europe, but continental countries like Serbia have experienced significant economic and ecological losses due to forest fires. To prevent damage to forests and infrastructure, alongside other societal losses, it is necessary to create an effective protection system against fire, which minimizes the harmful effects. Forest fire probability mapping, as one of the basic tools in risk management, allows the allocation of resources for fire suppression, within a fire season, from zones with a lower risk to those under higher threat. Logistic regression (LR) has been used as a standard procedure in forest fire probability mapping, but in the last decade, machine learning methods such as fandom forest (RF) have become more frequent. The main goals in this study were to (i) determine the main explanatory variables for forest fire occurrence for both models, LR and RF, and (ii) map the probability of forest fire occurrence in Eastern Serbia based on LR and RF. The most important variable was drought code, followed by different anthropogenic features depending on the type of the model. The RF models demonstrated better overall predictive ability than LR models. The map produced may increase firefighting efficiency due to the early detection of forest fire and enable resources to be allocated in the eastern part of Serbia, which covers more than one-third of the


Introduction
Forest fires, as global phenomena, present numerous forms of threats to many countries around the world, from Canada and Siberia through to Indonesia, Australia, Africa, and the Amazonia. Although statistics show that the frequency of fires, alongside the total burnt area, have been decreasing from year to year globally [1], several regions will experience larger and more intense fires [2][3][4]. The Mediterranean region is traditionally the most at risk in Europe [5], but in recent years, forest fires have also become an important issue in Central Europe [6]. Among other European countries, Serbia experienced increased fire activity during the last two decades [7].
The increasing risk and associated damage caused by fires are usually linked to climate changes [8]. Considering the current climate scenario [9], which predicts an average temperature rise of 4-6 • C by the end of this century [10,11], and a decrease in total rainfall with an uneven distribution throughout the year in the south of Europe featuring long

Study Area
The study area is in the eastern and southern parts of Serbia ( Figure 1). It has a total land area of 30,235.5 km 2 . Broadleaved, conifer, and mixed forest cover 12,587.5, 170.7, 340.6 km 2 , respectively. Natural grasslands, transitional woodland-shrub, and sparsely vegetated areas cover 925.7, 2949.5, and 48.0 km 2 , respectively, while other areas represent agricultural, urban, or other non-wood areas. The elevation ranges from 28 to 2169 m. It has been observed that durations of extremely hot weather last longer and periods of extremely cold weather are shortened compared to the reference period of 1960-1990 [27]. These trends of extreme temperature conditions are most pronounced in the summer season [27]. A decrease in spring precipitation has been detected over the central and eastern parts of Serbia [28]. The annual quantity of rainfall is often insufficient, and droughts are evident in eastern and south-eastern parts of Serbia [28]. Scenarios where the monthly quantity of precipitation falls in only a few days of the month are expected to become more frequent, which will lead to more extreme weather events such as floods and droughts [29]. Serbia has two peaks in its fire season [30]. The first peak occurs in March, which is associated with 25% of the annual burnt area, while the second and largest peak appears in August [31]. In August 2012, more than 40 large fires were recorded, including two fires that were more than 1000 ha in size [32]. More than 16,000 ha of forest land were destroyed in 2007, with the value of the wood lost was estimated at 40 million euros [33].

Dependent Variable
Data regarding the fire location were obtained from NASA's Fire Information for Resource Management System (FIRMS), which distributes near real-time fire products from Moderate Resolution Imaging Spectroradiometer (MODIS) from the Terra and Aqua platforms with a spatial resolution of 1 km (https://firms.modaps.eosdis.nasa.gov). Each position of a MODIS-identified active fire represents the center of a 1 × 1 km pixel that is labeled by the algorithm as containing one or more fires inside the pixel. Only fire pixels with a confidence higher than 80% for the period from January 2001 to December 2018 were considered for further analysis, because in certain cases the product underestimates the occurrence of fire [34].

Independent Variables
Independent variables were grouped into four categories: topography, vegetation, anthropogenic factors, and climate. Specific variables within categories were selected based on previous studies on forest fire occurrence [24,[34][35][36][37][38][39]. Detailed information on preselected variables is presented in Table 1. Topographic parameters, elevation, slope, and aspect, were derived from the digital elevation model (DEM) with a precision of 3 arcsec previously downloaded from the United States Geological Survey (http://landsat.usgs.gov//landsatcollections.php). Average values of elevation, slope, and dominant aspect were calculated for each polygon in a 1 × 1 km grid using ArcGIS software (ESRI, Redlands, CA, USA). Obtained values of the slope and dominant aspect were divided into classes according to Carmo et al. [35], while elevations were divided into classes of 200 m ( Figure 2).
Vegetation and land cover data were obtained from the CORINE 2012 data set (https://land.copernicus.eu/pan-european/corine-land-cover/clc-2012). The vector layer was intersected with polygon grid data. Objects in this vector layer were filtered for the following CORINE 2012 land cover classes (CLCs): broad-leaved forest (BF), coniferous forest (CF), mixed forest (MF), natural grasslands (NG), transitional woodland-shrub (TWS), and sparsely vegetated areas (SVA) (Figure 2). Intersecting the polygon grid data with the polygon CLC layer filtered this way, a new polygon layer with a table of attributes containing a polygon grid Object ID, CLC class ID with its description, and an area of CLC class that falls into the respective grid polygon were generated. Data on drought code (DC), as a component of the Canadian Forest Fire Weather Index [40], were obtained from the Republic Hydrometeorological Service of Serbia (http://www.meteoalarm.rs/eng/fwi_osm.php) as tables with coordinates of meteorological stations with values on the drought code for each day in the observed period (2012-2018). Each of these tables was then converted into a spatial point layer with points representing meteorological stations and a respective table of attributes with values of the drought code. The layer was then used for interpolation using the ordinary kriging method [41], resulting in raster layers whose pixels represent values of drought code. This way, we were able to calculate zonal statistics of drought code for each day and each polygon of a grid layer for raster layers that represent the observed period. Finally, average values for each cell in the grid cells were calculated based on all of the drought code data for the observed period.
Layers of roads, populated places, railroads, and agricultural land were used for the calculation of the distance from each object of a 1 × 1 km grid to the nearest object of the respective layer, so the attribute table of the generated point and polygon grid layer was extended with distances to the nearest populated place, road, railway and agricultural land (https://www.openstreetmap.org).
Population data were obtained from a raster dataset available for download in Geo-TIFF format at the Center for International Earth Science Information Network (CIESIN), Columbia University (http://dx.doi.org/10.7927/H4639MPP). The sum of the number of people per polygon grid was also calculated by a zonal statistic tool.
At the end of the GIS portion of the analysis, Boolean values (Yes/No) were assigned to the elements of the grid by using the Spatial Join tool. Values of Yes were assigned to the elements for which the historical wildfire(s) occurred, and values of No were assigned to those for which it did not.
Each cell with at least one historical record of a fire event from 2001 to 2018 was classified as a fire cell and coded as one (1). In total, 429 cells were selected as a fire cell for Eastern Serbia. The use of binary LR alongside RF classification assumes that the predicted variable is dichotomous. Therefore, necessary non-fire cells were sampled based on the propensity score matching (PSM) method [42] and coded as zero (0). In general, a propensity score analysis reduces the effect of confounding in observational (nonrandomized) studies and can be used for matching, stratification, inverse probability of treatment weighting, and covariate adjustment, all based on the propensity score [43]. Thus, PSM forms matched sets with equal ratio of treated and untreated subjects [44], in our case, of fire and non-fire cells, who share a similar value of propensity score. Matching without replacement [45] was applied for the selection of non-fire cells in this study. Namely, non-fire cells can be selected only once as a match for a given fire cell. Among greedy and optimal matching procedures of PSM, the latter was used in this study. Both approaches choose the same sets of controls for the overall matched samples, but optimal matching does a better job of minimizing the difference in propensity score value within each pair [46]. PSM was conducted in the R package 'MatchIt' v 3.02 [47].
As it was important to avoid the selection of non-fire cells in the vicinity of the fire cells, because of the similar environmental conditions, we tested distances among created pairs of cells. The distance between pairs of fire and non-fire cells varied from 2.2 to 268.1 km, with an average of 134.2 km. Additionally, to validate the created models, whole paired samples were randomly divided into two equal subsamples, training and validation, with the same number of fire and non-fire cells [24,39,48,49].
One of the basic assumptions that must be met before applying LR is the absence of high correlations (multicollinearity) among the explanatory variables (predictors) included in the model. Explanatory variables were checked for multicollinearity by the variance inflation factor (VIF) and Spearman's rho correlation [50]. Only variables with VIF ≤ 10 and a Spearman's rho coefficient lower than 0.7 [51] were considered for model building.

Forest Fire Occurrence Frequency across Categorical Predictors
Forest fire distribution across categorical predictors was analyzed for all events in the 15 years. Observed versus expected frequencies were analyzed and compared. Observed frequencies represented the number of forest fires that occurred within the explanatory variable category, while the expected frequencies were represented by the surface covered by the respective variable category in the study area [24]. Comparisons between observed and expected frequencies were based on chi-square statistics [52] at a significance level of 0.001. Mean values were calculated for each class of categorical predictors, and histograms were created accordingly.

Modeling Procedures
Based on collected and generated data, the model of probabilities for forest fire occurrences was generated for each cell of the polygon grid for the territory of Eastern Serbia.

LR Models
The algorithm obtained by the LR calculates the conditional probability of the fire event occurring from one or more input variables [53]. Also, LR can be used for estimating the contribution and significance of each variable by the Wald test [54], and afterwards can select a combination of variables that can be introduced in more complex models [55]. To estimate forest fire occurrence probability in Eastern Serbia by LR, the following equation was applied: with "Pi" as the probability of the forest fire occurrence, "α" as a constant, "β" as a coefficient from regression, and "X" as original values of variables.

RF Models
This method uses a large number of decision trees, which produce their predictions and combine them into a single, more accurate, prediction [56]. RF has gained popularity in recent years, as it has been proven to perform better than LR according to several studies [39,57,58]. Specifically, it considerably outperforms LR in the accuracy measured, as well as the Brier score and area under curve (AUC) [59][60][61], on large and diverse datasets. Thus, considering the similarity of forest fire probability prediction to other risk-assessment applications, we considered RF as a strongly supported model of choice worth investigation. Similar to LR, the RF method is also good at handling both categorical and continuous types of explanatory variables.
The method proposed by Ye et al. and Genuer et al. [37,62] was applied for the variable selection. Each variable was included in the model N − 1 times, where N represents the number of variables considered as a predictor. RF models were run 16 times and an additional variable was excluded in each iteration. The variable importance obtained from each iteration was used for the calculation of relative variable importance, which represents an average value for a specific variable ranked from 0 to 1. All variables were scored from 1 to 16, according to the average importance. The variable with the highest average importance was scored as 1, and the variable with the lowest importance was scored as 16. Then, a RF model with each of the 15 highest ranked variables was generated for comparison with LR models. RF analyses were conducted by the default value of mtry (4), which represents the number of variables at each split and the 100 trees in the forest (ntree).

Model Validation
Model evaluation was conducted by a receiver operating curve (ROC) analysis, calculating the proportion of fire cells, classified as fire (sensitivity), and non-fire cells, classified as non-fire (specificity), for the obtained models. A ROC curve plots the values that represent sensitivity versus values that represent 1-specificity for the range of possible cut points [63]. AUC values between 0.5-0.7 indicate poor precision, values between 0.7-0.8 indicate acceptable precision, values between 0.8-0.9 indicate excellent precision, and values higher than 0.9 indicate outstanding model precision [63]. Additional evaluation of the models was conducted by 2 × 2 classification tables based on the overall accuracy. Thus, the tables are a result of the cross-classification of the dichotomous outcome variable, whose values are derived from the predicted and observed probabilities. Also, to compare the predictive capacity of the models produced, the distribution of fire cells of the validation group across classes of forest fire occurrence probability was analyzed. Due to the limited size of data, the k-fold cross-validation (CV) method was also applied to compare obtained models. Analysis was conducted in the R package 'Caret' v 6.0-86 [64] using "glm" and "rf" methods for LR and RF, respectively. ROC analysis, accuracy, and Kappa values obtained from CV were used to compare models, with k-value set as 10.
To designate each cell as fire or non-fire, an optimal cutoff point must first be defined. For the models selected by the LR and RF procedures, based on the ROC analysis and prediction accuracy obtained under the training subsets, the optimal cutoff point was determined by the sensitivity equals specificity method [65] using the easyROC webtool [66]. Then, the estimated probability for each cell was compared to the optimal cutoff point. If the estimated probability for the particular cell exceeded the optimal cutoff point, then that cell was designated as a fire cell. Conversely, if the estimated probability was lower than the optimal cutoff point, a particular cell was designated as a non-fire cell. Cutoff values were applied within selected models to both training and validation subsets.

Variable Importance Analysis
The quantification of variable importance is used for variable selection in many applied studies. To assess the importance of individual variables selected by the LR procedure, Wald statistics were applied to the models [67]. Variable importance in RF classification could be assessed by the permutation importance indices [68] or by the Gini impurity function for a classification problem [69]. For RF, the variable importance measures are summed across all trees in the forest and scaled in the same manner so that the most important variable has a value of 1.

Mapping Forest Fire Occurrence Probability
Forest fire occurrence probability calculated by the LR or RF models for fire cells and non-fire cells was used to map the forest fire occurrence probability in Eastern Serbia in All statistical analyses related to LR and RF were conducted using the software Statistica 13 (TIBCO Software Inc., Palo Alto, CA, USA).

Forest Fire Occurrence Frequency
Forest fire distributions across the categorical explanatory variables are displayed in Figure 3. An almost normal distribution of forest fires was recorded across elevation classes, with higher frequencies at elevations between 200 and 1000 (with a less pronounced peak in the class of 600−800 m), and with lower frequencies below and above this range (Figure 3a). The frequencies of forest fire across elevation classes highly correspond to surface area covered by the respective classes (χ 2 =39.83, p < 0.001). This connection was not significant between forest fire occurrence and aspect classes (χ 2 = 1.11, p < 0.774). Regarding aspect, the highest frequencies of fires occurred on southern aspects, then almost equally eastern and western aspects, while the lowest frequency was recorded on northern aspects (Figure 3b). The highest frequency of forest fire was recorded on the slopes with an inclination between 10 • and 19.99 • , then between 5 • and 9.99 • , while the lowest frequency was recorded at the lowest inclination (0−4.99 • ). No fire was recorded at the highest inclination of over 30 • (Figure 3c). Forest fires were influenced by slope (χ 2 = 47.54, p < 0.001). Also, land cover type had a significant influence on forest fire occurrence (χ 2 = 107.89, p < 0.001). More than 55% of forest fires occurred in broad-leaved forests, and almost 29% and 14% occurred in the transitional woodland-shrubs and natural grasslands respectively, while less than 2% of forest fires occurred in the mixed forest and coniferous forests. No forest fire was recorded in the sparsely vegetated area (Figure 3d).

Models of Forest Fire Occurrence
In the first step, the total forested area (TFA) variable was excluded for further analysis due to the high correlation (R = 0.84) with distance agricultural land (DAgL), as displayed in Figure S1 (see the Supplementary Materials). The remaining 16 explanatory variables met the conditions of VIF ≤ 10 and were considered for the future models. After several iterations, the final models were created by LR and RF procedure for 15 variables. The highest impact on fire probability had a drought code, followed by distance to municipality, distance to water, distance to railway, and distance to arable land, while a relative contribution of mixed forest had the lowest impact in the RF model. Also, in the LR models, the highest impact on fire probability had a drought code, while a relative contribution of coniferous forest had the lowest impact on forest fire occurrence ( Table 2).
The overall prediction accuracy of the LR and RF models, calculated by using an optimal cut-off value, was 86.7 and 87.7%, respectively (Table 3). Moreover, the LR and RF models display AUC values of 94.2% and 94.5%, respectively, affirming a high predictive power (Figure 4b).

Relative Importance of Variables
In the LR models, drought code was the most important variable for fire occurrences, followed by the distance to rail and agricultural land. Conversely, the lowest importance of the forest fire occurrence was associated with the contribution of coniferous and mixed forest, and also aspect classes ( Table 2). In the RF models, drought code was the most important variable followed by the distances to municipality, water, rail, and arable land. Similarly, to the LR models, contribution of mixed and coniferous forest as well as aspect classes had the lowest impact on the RF models (Table 2). Interestingly, distance to municipality was recognized by RF as very important model variable, while in the LR models this variable did not have a statistically significant effect on model performance.

Spatial Modeling of Probability of Fire Occurrence
Zones with a very high probability of forest fire occurrence were situated in the southeastern part of the study area in both models and vary from 19.7% (LR) to 18.9% (RF) of the forested area. Zones with a very low probability of forest fire occurrence were situated in the northwestern part of the study area and range from 21.1% (LR) to 22.2% (RF) of the forested area ( Figure 5).

Model Validation
LR models performed better in the very low fire distribution class, compared to the RF models, identifying lower forest fire incidence in the validation set of data. On the other hand, RF models performed better in high and very high classes than LR models, identifying a higher number of forest fire events in the same set of data (Table 4). If the two lowest classes (low and very low) classes are compared among models, then there is a very slight difference between the LR and RF models (e.g., 23.3% vs. 23.7%, respectively). On the other hand, the RF models classified 58.6% of forest fire events in the two higher classes (high and very high), while the LR models classified 54.0% in the same classes of the validation set of data. The overall prediction accuracy of the LR and RF models, based on 10-fold cross validation, was 86.5 (kappa = 0.7296) and 91.7% (kappa = 0.8345), respectively. Moreover, the LR and RF models display AUC values of 92.4% and 97.5%, respectively, after 10-fold cross validation.

Discussion
There has been a huge gap in forest fire risk assessment in Serbia without any system for forest fire risk assessment at the national or even regional scale [70]. Within this study, the first step has been made on that course by producing maps with forest fire occurrence probability that cover more than one-third of the state territory. To map the forest fire probability in Eastern Serbia, the area that has been the most affected by forest fires in the past [32], two different methods were applied, LR and RF. Both methods have been widely used in forest fire risk assessment within the past few decades, with the dominance of LR at the beginning of this century [24,35,50,55,71], while the RF method has prevailed since the last decade [25,37,58,60,70]. The RF models had slightly better performance than the LR models in the training and validation sets. Better performance of the RF models over the LR models in fire occurrence prediction is consistent with similar studies in our region [72,73] and other regions [39,58]. The LR models shift performance from outstanding in training data sets to excellent in validation data sets, while in the RF procedure this shift is less pronounced according to applied model classification [63]. Based on the results obtained within this study and literature survey, we can strongly recommend the use of the RF method for the forest fire occurrence mapping of the entire territory of Serbia.
The propensity-score matching method was used to select non-fire events for known fire events. This method was adopted from the field of medical sciences [74][75][76][77], like many others, and it was used for the first time in natural hazard risk assessment by Hudson et al. [78] for evaluating the effectiveness of flood damage mitigation measures. We used this method for the first time in forest fire risk assessment to pair historical fire event data obtained from the NASA FIRMS with non-fire events using elevation as matching criteria.
Elevation had a significant and positive effect on forest fire occurrence probability, with higher frequencies observed between 200 and 1000 m. The fire season is shorter at higher elevations due to snow melting later, leading to a lower fire frequency [79]. Also, a higher relative humidity due to a decrease in temperature of 1 • C per 100 m rise in elevation reduces chances for fuel ignition [80]. However, the rapid decrease in fire frequency at altitudes over 1000 m, observed by Ramón González et al. [71], was not recorded in our study. It is more likely that climate changes will influence a positive effect of the elevation on forest fire frequency in the future, as has already been shown by Schwartz et al. [81]. Southern, intermediate slopes between 5 • and 20 • experienced increased fire activity in our study due to the lower moisture of combustible material being exposed to the sun radiation more than other aspects with milder and steeper slopes [34,81,82]. All topographical features strongly affect vegetation and its burn ability [83]. Forest fire frequency was much higher in the broad-leaved forests than in the coniferous forests. The negative effect of the distance to water on forest fire occurrence probability may be linked to the NASA sensor's ability to detect only larger forest fires, while a smaller fire is suppressed easily when it is in the vicinity of the water bodies and the early phase of development are thereby not detected. Conifers cover less than 1% of the study area and were scattered in the higher mountains within the bigger broad-leaved forest complex and therefore were lesser exposed to anthropogenic activity, explaining the negative effect of its proportion on fire activity. Transitional woodland-shrubs are usually situated in the zone between forests and agricultural or arable land. Both types of land cover are connected to spring and autumn burning activity, and therefore transitional woodland-shrubs had increased fire frequency. Forests near human settlements and infrastructures that are densely populated are more prone to small fires due to negligence and/or accidental ignitions [84], while distant locations are prone to often larger, but less frequent, forest fires [85]. Therefore, an expected decrease in the population density in rural areas in the southern part of the study area [86] may result in lower fire activity in the future. Conversely, the expansion of urbanized areas and road cover with an associated increase in population density due to migration from south to north can be expected to lead to higher fire activity in the northern part of the study area.
Zones with the highest probability for forest fire occurrence are located in the southeastern part of the study area in all models, which correspond to the more pronounced drought periods during summer [27,28]. A higher drought code (higher temperature and lack of precipitation) leads to the lower moisture of fuel and makes an area more susceptible to ignition. It is well known that higher temperatures [87] reduce fuel moisture, making the fuels highly susceptible to ignition. Additionally, a study by Chang et al. [88] described low precipitation as a determinant factor for ignition. Drought code was the most important variable, followed by anthropogenic features, in both the LR and RF models. These results were consistent with other studies on determinant factors for the occurrence of wildfire where climatic and anthropogenic predictors had a higher influence on the fire occurrence probability [70,[89][90][91][92]. All those models were efficiently applied at a smaller scale (such as national parks or protected areas), while our models showed similar efficacy at a larger scale. The produced maps can be used by firefighting services for strategic and operative planning. Defined zones with higher forest fire occurrence probability, in the southeast of the study area, should be intensively monitored during the fire season, especially during the second peak in August [30], when the largest forest fires can be expected [31]. Intense monitoring allows early detection of forest fires and leads to rapid response. All of these measures, along with enhanced equipment for fire suppression, can significantly decrease the burnt area in Eastern Serbia. Also, silvicultural measures that reduce fire risk [23] can be applied under zones with a higher forest fire occurrence probability according to the created maps. Thus, in the fire prone zone, less flammable tree species [93] should be selected for afforestation. Additionally, to prevent the transfer of ground fire to the crown, the introduction of silvicultural measures, such as pruning of the lower branches, should be obligatory in order to reduce fire hazard in the vulnerable zones. Other fuel reduction treatments, such as thinning, prescribed burning, and fuel breaks, can be useful tools to achieve these objectives at the landscape level [20,21].

Conclusions
The overall accuracy of the RF models was higher than those of the LR models. Both types of model identified drought code and anthropogenic features as the most important forest fire predictors. The models displayed a very high predictive ability, but the RF models were slightly more efficient and could be recommended for forest fire occurrence mapping in the eastern part of Serbia. The obtained maps could improve the efficacy of forest fire suppression in the study area in several ways. First, the fire probability map could be used for position optimization of the devices used in the early detection of forest fires. Also, firefighting resource allocation could be planned and applied in a manner consistent with the fire frequency. Finally, forest management planning and silvicultural measures should be adapted in terms of the forest fire risk reduction, based on the obtained maps.
Supplementary Materials: The following are available online at https://www.mdpi.com/1999-4 907/12/1/5/s1: Figure S1: Correlation plot for all preselected variables based on Spearman's rho coefficient; Table S1: Tables of contingency for training and validation sets of data for the tested random forest models.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.