Wildfires are perturbing events that affect ecological processes and are part of the dynamics of many ecosystems worldwide, influencing their composition, structure, and functioning. These hazards can cause significant losses in terms of vegetation, houses, and human and animal lives [1]. Wildfires have become more frequent and extensive in recent years, mainly under the influence of climate change [4] and land use management [7]. Hence the need to study them through modeling, in terms of both fire spread and fire risk/susceptibility assessment, and to understand how to limit the disastrous effects they can have on the environment and on the socioeconomic fabric.
A wide range of methods has been developed to model fire hazards, typically taking advantage of Geographic Information System (GIS) and Remote Sensing (RS) techniques. These can be classified into physics-based methods, multi-criteria analyses coupled with statistical approaches, and machine learning methods. Physics-based methods simulate and predict fire behavior through mathematical equations of fluid mechanics, combustion of canopy biomass, and heat transfer mechanisms [11]. Multi-criteria analyses, coupled with statistical methods, assume that the probability of occurrence of burned areas can be quantitatively assessed by investigating the relationship between fire occurrences and predisposing factors [16]. Finally, machine learning (ML) methods are based on more or less complex data-driven algorithms able to model the hidden and non-linear relationships between a set of topographical and land use/land cover related predisposing factors and the observed wildfire events [25].
ML approaches successfully applied to wildfire risk assessment and susceptibility mapping mainly include Artificial Neural Networks (ANN) [27], Support Vector Machines (SVM) [26], Random Forests (RF) [25], and Multiple or Logistic Regression (LR) [25]. Most of the studies cited above compare different approaches to evaluate which one performs best. For example, the authors of [28] compared LR and ANN to model wildfire risk and to detect potential areas of fire occurrence; ANN achieved the higher accuracy, estimated as prediction capability. The authors of [27] found that a Kernel LR model outperformed the SVM-based benchmark for tropical forest fire susceptibility mapping in a protected area in Vietnam. In [37], the performances of three ML approaches (RF, SVM, and ANN) were evaluated for the elaboration of a wildfire susceptibility map in Iran; in this case RF provided the most accurate predictions. Similarly, the authors of [32] showed that RF had a higher predictive accuracy than SVM in a study in Dayu County (China). In [34], two stochastic approaches (RF and extreme learning machine) were compared with a deterministic procedure for wildfire susceptibility assessment in a region of Portugal, revealing the advantage of ML-based methods, especially RF, since it additionally provides an internal evaluation of the variable importance ranking. Several ML methods were compared in [39] to evaluate their potential for wildfire susceptibility mapping in Amol County (Iran); in this case too, RF achieved the highest accuracy, followed by SVM, while LR had the lowest.
Susceptibility is a fundamental concept in wildfire risk management. Broadly speaking, it can be defined as the extent to which a causal mechanism might affect and destabilize a potentially hazardous system [41]. In the spatio-temporal domain, susceptibility maps indicate areas with the potential to experience a particular hazard in the future based solely on the intrinsic local properties of a site and on the observed past events, expressed in terms of relative spatial likelihood. According to Tonini et al. [40], wildfire susceptibility maps display the probability of wildfire occurrence, ranked from low to high, under a given environmental context.
From the above-mentioned literature review, RF has proven to be one of the most effective algorithms for wildfire susceptibility assessment. This is attested by the higher accuracy achieved by RF in comparison with other methods. RF stands out among other ML algorithms for the following reasons: its calibration is quite easy, as it involves only a few parameters and the data do not need to be rescaled or transformed; it reduces the over-fitting typical of single decision trees and helps to improve the accuracy; it automatically outputs the variable importance ranking; and it can directly handle categorical variables, such as land use classes or vegetation type, which are key factors in wildfire susceptibility assessment.
The central-South American forest is one of the areas most affected by wildfires in the world [42]. Wildfire risk in the Amazonian forest will probably be higher in the near future because of the increasing frequency of drought periods coupled with the growing rate of deforestation [44]. The main cause of fire ignition in the area is human activity, principally the common practice of slash-and-burn [47]. This practice, locally known as chaqueo, consists of cutting trees and low vegetation or agricultural residues and burning the biomass to make way for agriculture, livestock, or logging, or simply to clear the agricultural land and prepare fields for the next year’s crop. It can easily get out of control and initiate large fires. For example, in 2019, Bolivia faced an extremely extensive wildfire event that had a serious ecological impact in the department of Santa Cruz [49]. This complex area is characterized by a mosaic where wet and dry tropical forests alternate with savannas, and it is extremely prone to wildfires. Although Bolivia is among the top ten countries in the world for expected annual burned forest area at risk [42], the literature on wildfire risk and suppression is quite limited, principally because of the scarcity of available data and resources [51]. To fill this gap, as part of the present study, we built an accurate dataset of burned areas based on the Moderate Resolution Imaging Spectroradiometer (MODIS) wildfire product, reporting the events that occurred in the entire department of Santa Cruz in the period 2010–2019. The factors that can predispose an area to wildfires, such as topography, land cover, and ecoregions, were also collected and processed in the form of digital spatial data. This information made it possible to estimate wildfire susceptibility in the entire department, with a special focus on the municipality of San Ignacio de Velasco. Analyses were performed using an ML-based approach, namely RF, and the outputs, presented in the form of maps, were finally validated. In addition, the influence of the different predisposing factors and the relative probability of prediction success over a range of discrete values, corresponding to the different classes of land cover and ecoregions, were investigated and discussed.
2. Study Area
The study area (Figure 1) corresponds to the department of Santa Cruz in Bolivia. It includes 15 provinces and 56 municipalities, and its capital, Santa Cruz de la Sierra, lies in the Andrés Ibáñez province. With an area of 370,621 km², Santa Cruz is the largest of the Bolivian departments, covering 34% of the entire national territory. It includes part of the Amazonian and the Chaco plains, with an elevation around 800 m.a.s.l. for the large majority of the territory, which can exceed 1500 m.a.s.l. on the sub-Andean relief.
Wildfires are historically recognized as part of the disturbance regime of the area. However, in recent years, due to their increasing frequency and extent, and to their relationship with human activities and agricultural practices, wildfire management and containment have become a challenging task [52]. The main cause of wildfires in Bolivia is the slash-and-burn agricultural practice, followed by activities related to pasture management, waste burning, hunting, and other human activities [48]. According to MODIS data (Figure 1), in the period from 2010 to 2019, about 2203 wildfires per year (Figure 2) resulted in an average burned area of 1,266,275 ha (Figure 3).
In general, the largest number of events occurs between July and October, with August and September being the two months with the highest concentration of burned area. Although the forest in this area is exposed to a marked seasonality, and hence is susceptible to changes in fire regimes, we did not consider this variability in the present study. Indeed, since there is typically only one fire season, coinciding with the driest months of the year, monthly burned areas were aggregated on a yearly basis.
Municipality of San Ignacio De Velasco
San Ignacio de Velasco is the largest municipality of the province of José Miguel de Velasco, located in the northeast of the department of Santa Cruz (Figure 1). This area was selected to carry out analyses at a more local scale because it was the municipality most affected by wildfires within the entire study area during the recent period (2010–2019) (Figure 4). With a surface of about 48,960 km², San Ignacio de Velasco has the highest demographic growth after the city of Santa Cruz de la Sierra. Administratively, its area is divided into twelve districts, of which two are highly urbanized, nine are inhabited by indigenous communities, and the last corresponds to Noel Kempff Mercado National Park. Forestry potential and tourism are the main resources of this municipality.
3. Materials and Methods
The methodological workflow developed in the present study (Figure 5) includes the following steps:
Implementation of the database, including: (i) the dependent variable (i.e., the burned areas derived from the MODIS product); (ii) the independent variables (based on topography, ecological conditions, and land cover/vegetation).
Implementation of an ML approach, using RF and five equal-size folds for the validation procedure, in order to maximize the spatial generalization of the predictions.
Elaboration of the wildfire susceptibility maps, based on the probabilistic outputs resulting from RF, and assessment of the variable importance ranking.
Validation of the model performance, carried out by estimating the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and by considering the temporal splitting of the original dataset into training (2010–2016) and testing (2017–2019) subsets.
3.1. Dependent Variable: Burned Areas
Data acquisition and implementation of the input datasets are the most challenging steps in any modeling study. Wildfire prevention, defense, and suppression plans require, first and foremost, accurate estimates of the differential susceptibility of the land to wildfires in relation to the characteristics of the territory and to past observed events. Thus, a key factor for wildfire susceptibility modeling is the observed burned areas, available as mapped fire perimeters and spanning several years.
Burned areas were selected and processed based on Collection 6 of the MODIS burned area mapping product (MCD64), which, as a result of its improvements, contains fewer unclassified grid cells. In more detail, MCD64 employs daily 500-m MODIS surface reflectance data coupled with 1-km MODIS active fire observations [54]. The MODIS burned area product was downloaded from the University of Maryland (ftp://ba1.geog.umd.edu/, accessed on 20 March 2020) as GeoTIFF files. In each monthly GeoTIFF, detected burned areas are labeled with the Julian day of burning. These features were then aggregated on a yearly basis and the yearly raster datasets clipped over the study area (Figure 1).
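The yearly aggregation described above can be illustrated with a minimal sketch. It is not the processing chain used in the study: it assumes that each monthly grid has already been read into a list of rows, that a positive cell value is a burn day, and that zero or negative values mean "not burned" (an assumed encoding); the function name `yearly_burned_mask` is hypothetical.

```python
def yearly_burned_mask(monthly_grids):
    """Combine monthly burn-date grids into one yearly burned/unburned mask.

    Each monthly grid is a list of rows. A cell value > 0 is taken to be
    the day of burning; any value <= 0 is treated as not burned
    (assumed encoding for this sketch).
    """
    rows = len(monthly_grids[0])
    cols = len(monthly_grids[0][0])
    mask = [[0] * cols for _ in range(rows)]
    for grid in monthly_grids:
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] > 0:
                    mask[r][c] = 1  # burned at least once during the year
    return mask


# Toy 2 x 3 example with two monthly grids
july = [[0, 190, 0], [-1, 0, 0]]
august = [[0, 0, 230], [-1, 0, 215]]
burned_mask = yearly_burned_mask([july, august])
```

A cell is flagged as burned if it burned in any month of the year, so repeated burning within the same year is counted once.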
3.2. Independent Variables: Predisposing Factors
The following independent variables (Table 1) provide detailed knowledge of the topography, ecological conditions, and land cover, including vegetation, making it possible to understand how these factors can predispose an area to wildfire occurrence: Digital Elevation Model (DEM) and slope (derived from the DEM), ecoregions (assemblages of species, natural communities, and environmental conditions [55]), and land cover (the physical material at the surface of the earth, which includes vegetation information). All of these maps were elaborated in raster format with a grid cell size of 100 × 100 m (Figure 6), which is fine enough to capture the spatial characteristics of wildfire locations and, at the same time, coarse enough to ensure a reasonable processing speed.
It is worth noting that meteorological factors, like wind speed and wind direction, temperature, humidity, and rainfall were not included as predisposing factors for this study since, according to Tonini et al. [40], these are local conditions that cause a hazard to occur if and only if the area is susceptible to that hazard (i.e., acting as triggering factors), while the susceptibility is assessed based only on factors that are stable over time (i.e., predisposing factors).
3.2.1. Topographic Conditions: Altitude and Slope
Topography is an important factor that influences wildfires because its properties affect the distribution, composition, and flammability of vegetation, local climate (such as average wind speeds), and human accessibility. Therefore, the raster DEM, obtained from the competent authorities of the Municipal Autonomous Government of Santa Cruz, played an important role in this study. Slope was derived from the DEM; this factor is important since an increase in slope can increase the fire spread rate. On steep terrain, fire spreads more quickly uphill and more slowly downhill.
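As an illustration of how slope can be derived from a DEM, the following sketch uses central finite differences on a grid stored as a list of rows. This is an assumption made for the example: the GIS routine actually used in the study is not specified here, and the function name `slope_degrees` and the default 100 m cell size (taken from the raster resolution stated above) are illustrative.

```python
import math


def slope_degrees(dem, cell_size=100.0):
    """Slope in degrees for the interior cells of a DEM (list of rows).

    Uses central differences in x and y; border cells are left as None
    because they lack a full neighborhood.
    """
    rows, cols = len(dem), len(dem[0])
    slope = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            # Slope angle of the steepest gradient
            slope[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return slope


# Toy DEM rising 100 m per 100 m cell from west to east
dem = [
    [100.0, 200.0, 300.0],
    [100.0, 200.0, 300.0],
    [100.0, 200.0, 300.0],
]
center_slope = slope_degrees(dem)[1][1]  # a 100% gradient, i.e., about 45 degrees
```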
3.2.2. Ecological Conditions: Ecoregions
Ecoregions can be defined as relatively large areas of land or water containing a distinct assemblage of natural communities and species which share similar environmental conditions. The biodiversity of flora, fauna, and ecosystems differs from one ecoregion to another [56]. This factor is important for the assessment of wildfires in Santa Cruz since it provides information about the vegetation of the area. The information on the vegetation characteristics of each ecoregion helped in determining which classes have to be prioritized and maintained. The department of Santa Cruz includes nine ecoregions (Table 2), described below [57]:
Yungas (also called humid forest): a cloud forest located between 1000 and 3300 m.a.s.l., where permanent moisture is supplied by cloud drizzle and rainfall brought from the Amazon basin by the easterlies. The beta diversity of this ecosystem is the highest in Bolivia.
Bolivian Tucuman forest: located between 300 and 3300 m.a.s.l. In this ecoregion, the minimum annual temperature range is lower than in Yungas because of the influence of cold southerly winds, called "surazos". The vegetation cover is dense, including trees more than 15 m tall.
Southwestern Amazon forest: located between 150 and 500 m.a.s.l., it is composed of all the Amazon forest types. The species richness is the same as that of the moist Yungas forest. Trees are more than 45 m tall. This region has suffered from strong human pressure.
Flooded savanna: located between 100 and 200 m.a.s.l., it is in fact a seasonally flooded savanna due to the numerous rivers from the Andes that flow through the Amazon lowlands.
Gran Chaco (also called dry forest): located between 200 and 600 m.a.s.l. It has the lowest mean annual precipitation (795 mm), a mean annual temperature of 21.7 °C, and a maximum of 48 °C. It is among the largest and best preserved dry forests in the world.
Chiquitano Dry forest: located in a transition zone between the moist Amazon rain forest and the Gran Chaco dry forest, at an altitude between 100 and 1400 m.a.s.l. It is endemic to Bolivia, highly biodiverse, and it has been extremely affected by wildfires in recent years.
Dry Inter-Andean forest: located between 500 and 3300 m.a.s.l., it includes patches of dry forest alternating with Yungas forest and deep inaccessible valleys. Due to its topographical specificity, this ecosystem is characterized by a variety of endemic species.
Chaco Serrano: dominated by the horco-quebracho (Schinopsis hanckeana) along with the drinking molle (Lithrea molleoides), especially in the south, and by a large number of cacti and spiny legumes in the north. At higher altitudes, the forest is replaced by grasslands or gramineous steppes with a predominance of species of the genera Stipa and Festuca.
Cerrado: a wide range of climatic conditions exists across the Cerrado ecoregion. Precipitation ranges between 1000 and 2000 mm per year, with a pronounced dry season from April to September, and mean annual temperatures range from 16 °C to 25 °C. This ecoregion is characterized by an enormous biodiversity of plants and animals that is progressively threatened by the expansion of agriculture and the burning of vegetation to make charcoal.
3.2.3. Landscape Features: Land Cover
Land cover represents the landscape features on the Earth’s surface. The different characteristics (such as fuel load and moisture content) of the distinct land cover types can affect the ignition and spread of fires. The official land cover map for Santa Cruz was elaborated based on the 2018 product of the Climate Change Initiative (CCI) of the European Space Agency (ESA) [60]. This map was reclassified taking into account the portion of the area covered by each class; classes covering less than 0.1% of the total study area were aggregated to the closest class. The resulting 10 classes are listed in Table 2.
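The reclassification rule can be sketched as follows. The 0.1% threshold comes from the text; everything else is an assumption for illustration, namely the function names, the toy class codes, and the merge table `merge_into`, which stands in for the analyst's choice of the "closest" class (the criterion used to pick it is not detailed here).

```python
from collections import Counter


def rare_classes(grid, min_share=0.001):
    """Return the class codes covering less than min_share of all cells."""
    counts = Counter(value for row in grid for value in row)
    total = sum(counts.values())
    return {cls for cls, n in counts.items() if n / total < min_share}


def reclassify(grid, merge_into):
    """Replace each class found in merge_into by its target class."""
    return [[merge_into.get(value, value) for value in row] for row in grid]


# Toy raster of 2000 cells with hypothetical class codes 10, 11, and 30
grid = [[10] * 50 for _ in range(40)]
grid[1] = [30] * 50   # class 30 covers 2.5% of the area
grid[0][0] = 11       # class 11 covers 0.05% of the area (below 0.1%)

rare = rare_classes(grid)               # classes to be merged
merged = reclassify(grid, {11: 10})     # hypothetical choice: 11 joins 10
```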
3.3. Machine Learning Approach: Random Forest
RF is an ensemble-learning algorithm based on decision trees [61]. As with ML-based approaches in general, RF is capable of learning from and making predictions on data by modeling the hidden relationships between a set of input and output variables. Inputs are the independent variables, or predisposing factors, while the outputs are the dependent variables, represented in the present case by the burned areas.
In more detail, decision trees are supervised classifiers that make decisions at multiple levels and consist of a root node and child nodes. At each node, decisions are made based on the values of the training predictors. The number of trees (ntree) used to build the model and the number of variables (mtry) randomly sampled as candidates at each split are the only hyperparameters that need to be specified by the user. The algorithm generates subsets of the training dataset by bootstrapping (i.e., random sampling with replacement), each containing about two-thirds of the observations. The remaining one-third of the observations (called “out-of-bag”) are kept out and used to assess the prediction error, which makes it possible to optimize the values of the hyperparameters. For each subset, a decision tree is grown in which variables are randomly selected at each node and the best variable is chosen by minimizing the Gini impurity. The latter denotes the probability of incorrectly classifying an observation if it were labeled randomly according to the class distribution in the dataset. For classification problems, the prediction for new data is finally obtained by majority voting over all the trees. This value can be converted into a probabilistic output by normalizing the number of votes over the number of trees (i.e., ntree). In this study, the hyperparameters were set to 500 for ntree and 3 for mtry.
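Two of the quantities described above lend themselves to a short worked example: the Gini impurity used to choose a split, and the conversion of tree votes into a probabilistic output. The sketch below is a stand-alone illustration, not the RF implementation used in the study; the function names and the toy labels (1 = burned, 0 = unburned) are assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Probability of mislabeling an observation drawn at random when it is
    labeled at random according to the class distribution of `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((k / n) ** 2 for k in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n


def burn_probability(tree_votes):
    """Fraction of trees voting for the 'burned' class (1)."""
    return sum(tree_votes) / len(tree_votes)


parent = [1, 1, 0, 0]
parent_gini = gini_impurity(parent)              # 1 - (0.5^2 + 0.5^2)
perfect_split = split_impurity([1, 1], [0, 0])   # both children are pure
mixed_split = split_impurity([1, 0], [1, 0])     # no improvement over parent

# 350 of 500 trees vote 'burned' for a given cell
probability = burn_probability([1] * 350 + [0] * 150)
```

Among candidate splits, the one with the lowest weighted impurity is preferred; here the split that separates the two classes completely reaches zero.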
In addition, RF makes it possible to assess the relative importance of each variable for the prediction. This is achieved by evaluating the Mean Decrease Accuracy, which estimates how much the prediction accuracy on the out-of-bag observations decreases when the values of that variable are randomly permuted. Moreover, the partial dependence plot gives a graphical depiction of the marginal effect of each variable on the class probability over different ranges of continuous or discrete values.
3.4. Model Validation
A well-established procedure to validate the outputs of a model in ML is to split the dataset into three subsets, defined as training, validation, and testing. In more detail, the training subset is needed to generate the model, which will be used to make predictions on new data. The ultimate purpose of the validation subset is the optimization of the hyperparameters of the model, performed through a trial-and-error process (i.e., by comparing the predicted values with the observed ones). Lastly, the testing subset provides an unbiased evaluation of the final model, making it possible to assess its performance on new data, supposed to be drawn from the same distribution as the training data. Indeed, a good model has to give accurate predictions on previously unseen data and avoid under- and over-fitting.
In principle, the out-of-bag observations in RF can serve for validation purposes. Nevertheless, when dealing with spatio-temporal environmental phenomena, the random selection of the out-of-bag data raises an issue of spatial autocorrelation: most of the training and validation data will probably be close to each other and thus share similar characteristics. To overcome this, we selected the validation subset using a spatial k-fold cross-validation approach. This consists of splitting the original training dataset into k folds, keeping out one fold at a time, training the model on the remaining k-1 folds, and finally validating the model on the kept-out fold. The process is repeated k times and the evaluation scores resulting from the folds are finally averaged. The use of spatial folds ensures a better spatial generalization of the final model. Here, we considered 5 folds, organized into spatial blocks of 100 × 100 km for the entire department of Santa Cruz, and of 50 × 50 km for the municipality of San Ignacio de Velasco. In the present study, the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) is the evaluation score used as an indicator of how well the model classifies the areas more susceptible to burning. The ROC curve is a graphical technique that plots the true positive rate (the percentage of positives correctly classified) against the false positive rate (the percentage of negatives incorrectly predicted as positives). An AUC of 0.5 denotes a classifier that performs no better than random guessing, while an AUC of 1 denotes a perfect classifier.
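The two ingredients of this validation step, spatial blocking and the AUC score, can be sketched in a few lines. The block size follows the text (100 km, expressed in metres for projected coordinates); the rule that maps a block to a fold, `(bx + 2 * by) % k`, is an assumption of this sketch and not necessarily the assignment used in the study. The AUC is computed in its rank form, as the probability that a randomly chosen burned cell receives a higher score than a randomly chosen unburned one, which is equivalent to the area under the ROC curve.

```python
def block_fold(x, y, block_size, k=5):
    """Assign a point to one of k folds according to its spatial block.

    All points inside the same block_size x block_size square share a
    fold, so nearby points are never split between training and
    validation. The block-to-fold rule is an assumed, systematic one.
    """
    bx = int(x // block_size)
    by = int(y // block_size)
    return (bx + 2 * by) % k


def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly;
    tied pairs count one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Two points in the same 100 km block share a fold; a neighbor block differs
fold_a = block_fold(50_000, 50_000, 100_000)
fold_b = block_fold(99_999, 10_000, 100_000)
fold_c = block_fold(150_000, 50_000, 100_000)

# Toy scores for a held-out fold (1 = burned, 0 = unburned)
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In a full cross-validation loop, the model would be trained on four folds, scored with `auc` on the fifth, and the five scores averaged.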
To assess the predictive performance of the model, that is, its ability to make good predictions of future events, an independent testing subset was selected by splitting the original dataset temporally. The burned areas observed in the period 2010–2016 were used in the training procedure, while the last three years of observations (2017–2019) were used for testing. To this end, we computed the percentage of the burned area in the testing dataset that falls within each range of probabilistic output values. We can therefore assume that the implemented model gives good predictions if this percentage is higher for high predicted values and lower for low predicted values.
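A minimal sketch of this temporal check is given below. It assumes that the ranges of predicted values are the four quartile classes of the susceptibility scores and that each grid cell carries a predicted probability and a flag telling whether it burned in the testing years; the function name is hypothetical and ties between equal scores are broken by cell order.

```python
def burned_share_by_quartile(probabilities, burned):
    """Percentage of the burned test cells falling in each quartile class
    of predicted susceptibility (index 0 = lowest, 3 = highest)."""
    total_burned = sum(burned)
    if total_burned == 0:
        raise ValueError("the testing subset contains no burned cells")
    n = len(probabilities)
    # Rank the cells from the lowest to the highest predicted probability
    order = sorted(range(n), key=lambda i: probabilities[i])
    counts = [0, 0, 0, 0]
    for rank, i in enumerate(order):
        quartile = 4 * rank // n   # 0..3 because rank < n
        counts[quartile] += burned[i]
    return [100.0 * c / total_burned for c in counts]


# Toy example: 8 cells, 4 of which burned in the testing period
predicted = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
observed = [0, 0, 0, 1, 0, 1, 1, 1]
shares = burned_share_by_quartile(predicted, observed)
```

In this toy case half of the burned cells fall in the highest quartile class and none in the lowest, the pattern expected from a model with good predictive skill.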
The central-South American forest is one of the major fire-prone areas worldwide [42]. Land use management and climate change have caused wildfires to become more frequent and extensive in recent years [69]. In Bolivia, the department of Santa Cruz was hit by extreme fires in 2019, and overall this area accounts for more than two-thirds of the total wildfires in the country [71]. Although Bolivia is on the list of the top countries for expected annual burned forest area, the literature on wildfires is quite limited, partly because of the scarcity of available data, resources, and digitalization. The present study contributes to filling this gap through the implementation of an accurate dataset of the burned areas that occurred in the entire department of Santa Cruz in the period 2010–2019. In addition, environmental variables that can favor wildfire ignition and spread (related to topography, land cover, and ecoregions) were collected and processed as digital spatial information. This accurate dataset made it possible to implement a machine learning-based model, using the Random Forest algorithm, which provides as output the probability of burning for each unit area. The model was applied at both the departmental and the municipality scale, the latter considering the smaller and more homogeneous area of San Ignacio de Velasco.
The area under the ROC curve (AUC) was computed for validation purposes, based on a 5-fold cross-validation procedure. In addition, an independent testing subset was selected by splitting the original dataset temporally, which made it possible to assess whether the model makes good predictions of future events. As a result, we obtained a probabilistic output from which the susceptibility maps for the two study areas were elaborated. The model validation gave good results, which improved at the municipality scale (AUC = 0.8) compared to the departmental scale (AUC = 0.73). The same holds for the prediction on the testing dataset (2017–2019): more than 50% of the burned area coincides with the last quartile of the susceptibility map for both study areas. It is worth noting that in the first two years (2017 and 2018) about 60% of the burned area was correctly predicted for the department of Santa Cruz, and that this value increased from 66.52% to 69.14% for San Ignacio de Velasco. Indeed, the extreme wildfire events of 2019 made the prediction of the model less accurate. In this regard, it must be stressed that the testing subset is supposed to be drawn from the same distribution as the training data. It follows that, if its distribution differs from that of the observations used to train the model (e.g., extreme or out-of-the-norm events), the model will result in weaker predictions. We must mention here that, among the two million wildfires registered worldwide every year in the recent period, only a few become extreme events [73]. Extreme wildfires cause substantial damage and often result in civilian and firefighter fatalities, since they tend to overwhelm suppression capabilities. Modeling and predicting these extraordinary events is very challenging, since their extreme behavior and impact result from the complex interaction between atmospheric processes and climatic conditions (such as heat waves, long dry periods, and very strong and variable winds) and local conditions (e.g., low fuel moisture content, landscape connectivity, inadequate initial attack, poor preparedness, and vulnerable communities). Thus, even if advanced wildfire susceptibility models are promising tools for wildfire assessment and prediction, future work should address the development of prediction mapping for extreme wildfire events.
The detailed investigation of the relative importance of each categorical class belonging to the variables ecoregions and land cover reveals that “flooded savanna” and “shrub or herbaceous cover, flooded, fresh/saline/brakish water” are the two classes most closely related to wildfires. This important outcome confirms recent findings that a seasonally wet and dry climate, coupled with hydrologic controls on the vegetation, creates in this ecoregion conditions favorable to the ignition and spread of large wildfires during the driest period, when the biomass is abundant [75]. The occurrence of large fires, initiated by slash-and-burn practices getting out of control, is predicted to increase in the near future, and the development of new tools for fire risk assessment and reduction is thus needed.
To conclude, it is worth underlining that wildfires are likely to become more dominant in Santa Cruz because of the vicious cycle linking the current slash-and-burn deforestation practices, the rapid frontier expansion, and longer drought periods. In a broader sense, these harmful conditions could be further exacerbated by the increased moisture stress acting on forests of great global relevance, such as the Amazon, Chiquitano, and Pantanal, coupled with the spreading use of fire in these areas. Therefore, key tools need to be available for wildfire prevention-planning programs aiming to reduce human, ecological, and material losses. In addition, to take full advantage of machine learning for the mapping and assessment of natural disasters, it is of paramount importance to develop accurate and up-to-date digital geospatial datasets, and more effort has to be made in this regard.
This study showed that: (i) Random Forest can successfully be employed for prediction mapping and to assess wildfire predisposing factors; (ii) it is possible to implement a simple but powerful model even for a country, such as Bolivia, with poor resources in terms of data availability and digitalization.