Machine Learning Methods and Synthetic Data Generation to Predict Large Wildfires

Wildfires are becoming more frequent in different parts of the globe, and predicting when and where they will occur remains a complex task. Identifying wildfire events with a high probability of becoming large wildfires is an important task for supporting initial attack planning. Different methods, including physics-based, statistical, and machine learning (ML) approaches, are used in wildfire analysis. Among these, ML-based methods are relatively novel. In addition, because the number of wildfires is much greater than the number of large wildfires, the dataset to be used in an ML model is imbalanced, which can lead to overfitting or underfitting. In this manuscript, we propose generating synthetic data from variables of interest together with ML models for the prediction of large wildfires. Specifically, five synthetic data generation methods have been evaluated, and their results are analyzed with four ML methods. The results show an improvement in predictive power when synthetic data are used, offering a new method to be taken into account in Decision Support Systems (DSS) when managing wildfires.


Introduction
Forest ecosystems have an inestimable capacity to sequester carbon dioxide (CO 2 ) through time [1], which is very important to the global carbon budget [2]. Droughts, insects and pathogens, landslides, hurricanes, and fires have negative impacts on natural environments [3]. Among these, wildfires are the greatest hazards to forest development [4]. Wildfires can be caused by either anthropogenic or natural causes. Anthropogenic causes, such as carelessness or arson, account for the highest percentage of forest fire origins [5,6], negatively impacting the economy and quality of life at both local and regional scales, in addition to harming natural environments. The average area burned worldwide in the last 16 years is about 340 million hectares [7], with the latest report on forest fires in Europe indicating that 178,000 hectares were burned [8]. In addition, the number of countries harmed in some way by large wildfires is higher than ever before. This damage is further exacerbated by the climate change that our planet is experiencing. For example, as the intensity and frequency of drought periods increase, so do the intensity and frequency of wildfires [9].
In this context, prediction, prevention, and management actions for wildfires are crucial. Decision Support Systems (DSS) for wildfires are powerful tools to prevent and manage forest fires by providing data for efficient resource use [10]. These DSSs are based mainly on (a) prediction models using geospatial (topography, land use, infrastructure, among others), satellite, and meteorological data [11,12], (b) thematic maps and risk indexes

Study Area
The study area is located in southwestern Spain (37°28′42″ N, 6°54′19″ W, WGS-84), more specifically in western Andalusia, covering the province of Huelva. Large wildfires, defined in this study as those with a burned area exceeding 500 hectares [75,76], recur in the area every year. While in Spain as a whole the percentage of large wildfires is low, 0.48% in 2017, in Huelva there is always at least one per year [8].
As an example, in 2018 there were only three large wildfires in Spain, one of which was in the study area [8]. This makes the area of special interest when modeling and predicting the presence of large wildfires. Figure 1a shows the location of 210 wildfires lasting more than 6 h within the study area between 2000 and 2018. The average burned area was 637 hectares, with a maximum of 34,290 hectares, which occurred on 27 July 2004. As Figure 1b shows, the number of large wildfires was significantly smaller than that of normal wildfires. This imbalanced sample biases machine learning results toward the majority class. For this reason, this study analyzes different sample balancing techniques through the generation of synthetic data.

Data Analysis
A total of 20 variables were analyzed for predicting the occurrence of large wildfires, including meteorological and environmental data as well as data calculated from Landsat and Moderate Resolution Imaging Spectroradiometer (MODIS) scenes [77] (Table 1). This variable selection was based on previous research [78][79][80]. The environmental variables were obtained from the Environmental Information Network of Andalusia (REDIAM) [81]. These variables have an annual temporal resolution whose derivation is too involved to describe in this paper; the details of and process of obtaining these variables can be found at http://www.juntadeandalucia.es/medioambiente/site/rediam (accessed on 17 December 2020). For the meteorological variables, two data sources were used. Firstly, data from the Spanish Meteorological Agency (AEMET) [82] were used to characterize pre-existing conditions across the total study area before wildfire incidences, while on-site data provided by the Regional Forest Fire Fighting Plan of Andalusia (INFOCA) [83] were used to define the meteorological conditions of specific wildfire locations. The variables in the study area characterize the seasonal behavior of the year in which wildfires occur. Thus, the following variables were used:
• Burn area mask: binary raster coverage where a value of 1 is assigned to each pixel affected by a wildfire and a value of 0 to each pixel not affected.
Figure 2 summarizes the workflow for classifying wildfire size. Firstly, data were preprocessed to generate new variables (Risks 1, 2, and 3). From this new dataset, the Random Forest (RF) technique was used to identify significant variables. Since large wildfires are far fewer than wildfires in general, several techniques to create synthetic datasets were analyzed to balance the sample size of both classes.
These new datasets, together with the original data, were used by four types of ML classifiers in order to predict the size of a wildfire. Finally, these results were evaluated to detect which type of synthetic data generation method and prediction model provided the best results. The prediction accuracy of both classes, wildfire and large wildfire, were also analyzed based on omission and commission errors.
The high number of candidate predictor variables and the low number of observations can impact machine learning results [86]; the accuracy of the classifier can be overly optimistic, resulting in an overfitted model [87]. To mitigate this, all the risk variables from REDIAM were summarized using three mean risk variables (Risks 1, 2, and 3). First, the variables were grouped according to whether they represented danger (Risk 1), risk associated with individual watersheds (Risk 2), or risk associated with geography (Risk 3). In the case of Risk 1, only one variable was linked, and that variable was renamed. The variables within Risks 2 and 3, described previously using REDIAM, were calculated as the mean value of their sub-variables. Then, RF was applied to this new set of variables to determine their individual importance. Variable importance was evaluated through the Gini index and Out of Bag accuracy, measuring the degree of association between a given variable and the classification result [88].
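The variable-importance step described above can be sketched as follows. This is a minimal illustration using random stand-in data, not the study's dataset: the Gini importances come from Scikit-learn's `feature_importances_` attribute (mean decrease in Gini impurity) and the Out of Bag accuracy from `oob_score_`.

```python
# Sketch of RF-based variable importance; data are random stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # stand-in for 4 predictors
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)   # stand-in size class

# oob_score=True yields the Out of Bag accuracy used alongside the Gini index
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

gini_importance = rf.feature_importances_   # normalized mean decrease in Gini
print(gini_importance, rf.oob_score_)
```

Ranking the variables by `gini_importance` and re-fitting with progressively fewer of them reproduces the kind of variable-selection curve shown in Figure 4b.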
Since the number of large wildfires (sample size equal to 53) was significantly smaller than that of the remaining wildfires (sample size equal to 157), the data were highly imbalanced, with results biased toward the majority wildfire class. Classifier models assume that data are drawn from a balanced distribution and therefore produce undesirable results in this case, which can be resolved by balancing techniques [89] divided into two groups: undersampling and oversampling. The former removes data from the majority class, while the latter generates synthetic data in the minority class to balance the ratio between the two classes. For undersampling, Random UnderSampling (RUS) and Tomek link (TK) were used. The RUS algorithm balances the classes through random elimination of instances from the majority class [90], while TK detects pairs of nearest-neighbor samples belonging to different classes [91]. TK can be used either, as in this manuscript, in undersampling mode (majority samples are removed) or in cleaning mode (both samples are removed) [92]. The oversampling methods used were the Synthetic Minority Oversampling TEchnique (SMOTE) and ADAptive SYNthetic sampling (ADASYN). With the SMOTE algorithm, the minority class is oversampled by forming convex combinations of neighboring samples [93,94], while ADASYN weights minority samples according to their level of difficulty of learning [95]. Finally, the Synthetic Minority Oversampling Technique-Tomek link (SMOTE-TK) method was used for balancing. This algorithm applies TK as an undersampling technique on the samples generated by SMOTE [96].
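The convex-combination idea behind SMOTE can be made concrete with a short sketch. This is a deliberately simplified stand-alone implementation for illustration only, not the imbalanced-learn code used in the study: each synthetic point is interpolated between a minority sample and one of its k nearest minority neighbors.

```python
# Minimal SMOTE-style oversampling sketch (simplified illustration).
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority matrix X_min (n x d)."""
    rng = np.random.default_rng(seed)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest minority neighbours
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))            # pick a minority sample
        b = neighbours[a, rng.integers(k)]      # and one of its neighbours
        lam = rng.random()                      # interpolation weight in [0, 1)
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])  # convex combination
    return out

# Balance a 157-vs-53 split as in the study: add 104 synthetic minority rows
minority = np.random.default_rng(1).random((53, 4))
synthetic = smote_like(minority, n_new=157 - 53)
print(synthetic.shape)  # (104, 4)
```

In practice, the `SMOTE`, `ADASYN`, and `SMOTETomek` classes of the imbalanced-learn package implement these resamplers with a common `fit_resample(X, y)` interface.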
Next, four different ML classification algorithms were applied to predict wildfire size: Random Forest (RF) [97], Multi-Layer Perceptron (MLP) [98], Support Vector Machine (SVC) [99], and LOGistic regression (LOG) [100]. A grid search was performed for each classifier to find optimal hyperparameters using Scikit-learn in Python (through the GridSearchCV class), as summarized in Table 2. Of the total number of wildfires, 70% were used in the training phase and 30% were used in testing. Training and testing were performed on a virtual machine on Google Colaboratory, a free cloud service from Google for machine learning applications, with an Intel Xeon 2.30 GHz Central Processing Unit and a Tesla K80 Graphical Processing Unit. Training took less than 9 min and testing only took a few seconds with this configuration. For the assessment of the ML models, True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) were counted in order to calculate accuracy (1), precision (2), recall (3), specificity (4), Geometric mean (G-mean) (5), and F1-score (6):
• Accuracy: the ratio of correctly predicted observations to the total observations.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)
• Precision: the ratio of correctly predicted positive observations to the total predicted positive observations.
Precision = TP / (TP + FP) (2)
• Recall: measures how well the classifier can detect positive observations.
Recall = TP / (TP + FN) (3)
• Specificity: measures how well the classifier can detect negative observations.
Specificity = TN / (TN + FP) (4)
• G-mean: the geometric mean of recall and specificity, showing the balance between classification of the majority and minority classes.
G-mean = √(Recall · Specificity) (5)
• F1-score: the harmonic mean of precision and recall.
F1-score = 2 · (Precision · Recall) / (Precision + Recall) (6)
In addition, omission and commission errors were calculated to analyze the results per wildfire class.
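The training and evaluation loop described above can be sketched end to end: a grid search over a hypothetical hyperparameter grid (not the actual grids of Table 2), a 70/30 split, and the six metrics of Equations (1)–(6) computed from the confusion-matrix counts. The data are random stand-ins, not the study's dataset.

```python
# Sketch of the grid-search + 70/30 evaluation pipeline; illustrative only.
import numpy as np
from math import sqrt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
X = rng.random((210, 4))                    # stand-in for 4 selected predictors
y = (X[:, 0] + X[:, 1] > 1.3).astype(int)   # imbalanced stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)    # 70% training / 30% testing

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},  # hypothetical grid
                    scoring="f1", cv=5)
grid.fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, grid.predict(X_te),
                                  labels=[0, 1]).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)                  # Eq. (1)
precision   = tp / (tp + fp)                                   # Eq. (2)
recall      = tp / (tp + fn)                                   # Eq. (3)
specificity = tn / (tn + fp)                                   # Eq. (4)
g_mean      = sqrt(recall * specificity)                       # Eq. (5)
f1          = 2 * precision * recall / (precision + recall)    # Eq. (6)
print(grid.best_params_, round(f1, 3))
```

The same skeleton applies to the other three classifiers by swapping the estimator and its parameter grid.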

Results
As in the previous analysis, the correlation coefficients between all paired variables used in this study are shown in Figure 3, where positive correlation is represented in blue and negative correlation in red. In addition, color intensity and circle size are proportional to the correlation coefficients. Wildfire size showed significant correlation with wind speed, LST, mean temperature, Risks 1, 2, and 3, forest vulnerability, and relative humidity. In addition, several fuel models were created for wildfire prediction, taking into account plant characteristics and their influence on the speed and intensity of flame propagation, as proposed by Rothermel [101]. However, the fuel model variable did not indicate any relationship with wildfire size or the other variables, which explains the absence of fuel model classification or mapping in this study. Similar results have been found in other research projects in Spain [102]. The Gini importance index results for the potential predictor variables are shown in Figure 4a. Wind speed and mean temperature were the most important variables, while Risk 1 and Risk 3 were the least important. On the other hand, Figure 4b shows that Out of Bag accuracy performs best with four variables. Based on these results, wind speed, mean temperature, relative humidity, and NDVI were the four selected predictor variables in this study. Once the variables were selected, different synthetic data generation methods were applied to balance the sample. Table 3 shows the sample size of each dataset generated.
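The pairwise correlation screening visualized in Figure 3 amounts to a Pearson correlation matrix over the variables; a minimal sketch with random stand-in data (the variable values are not the study's) could look like:

```python
# Pearson correlation matrix over the candidate variables; stand-in data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((210, 5))             # rows = wildfires, cols = variables
corr = np.corrcoef(data, rowvar=False)  # 5 x 5 symmetric correlation matrix

print(corr.shape)  # (5, 5)
```

Plotting `corr` as a color-coded matrix (blue for positive, red for negative, circle size proportional to magnitude) reproduces the style of Figure 3.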
Alongside the original dataset, the resulting quality of wildfire size prediction using the RF, MLP, Log, and SVC models, measured through Recall, F1-score, and G-mean, is shown in Table 4 and Figure 7. Based on the Recall parameter, the Log model showed the best results. In addition, RUS, SMOTE TOMEK, SMOTE, and ADASYN were the best methods for generating synthetic data for this model, while TOMEK LINKS showed the worst result, yielding recall values similar to the original data using SVC. On the other hand, MLP using SMOTE and SMOTE TOMEK data yielded the best results in the F1-score. As before, the original data gave the worst results. The same results appear with the G-mean parameter. The original unbalanced data alone did not provide better results than those described above, showing the advantage of using synthetic data to improve wildfire size prediction.
The overall prediction results above were discussed without considering the errors obtained in the analysis of the two wildfire classes. Figure 8 shows the errors of omission and commission for the wildfire and large wildfire classes for each model and dataset used. In general, omission errors in the prediction of wildfires (Figure 8(I.a)) were greater than those for large wildfires (Figure 8(I.b)), while the commission error was lower for large wildfires (Figure 8(II.b)) than for wildfires overall (Figure 8(II.a)). The lowest omission error in predicting wildfires (Figure 8(I.a)) was obtained when the original data were used, regardless of the model applied. However, when synthetic data were used in the same case, the number of wildfires predicted as large wildfires increased. Conversely, the omission error in large wildfire prediction decreased when synthetic data were used (Figure 8(I.b)). In this case, the Log model using SMOTE, ADASYN, SMOTE TK, and RUS gave the best results and therefore offered greater accuracy in predicting large wildfires. On the other hand, the commission error for wildfires (Figure 8(II.a)) was low in those cases where the Log model was applied using the SMOTE, ADASYN, SMOTE TK, and RUS synthetic methods. Here, using the original data showed the worst results. Finally, MLP using SMOTE and SMOTE TK and RF using TOMEK LINK and the original data gave a low commission error (Figure 8(II.b)).
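Per-class omission and commission errors can be computed directly from a two-class confusion matrix. The sketch below uses made-up counts (rows = reference class, columns = predicted class), not the study's results: omission error is the fraction of a class's reference samples that were missed, and commission error is the fraction of predictions for that class that were wrong.

```python
# Omission/commission errors from a 2x2 confusion matrix; made-up counts.
conf = [[38, 9],   # reference wildfire:       predicted wildfire / large
        [4, 12]]   # reference large wildfire: predicted wildfire / large

def omission(conf, c):
    """Fraction of reference samples of class c that were misclassified."""
    row = conf[c]
    return 1 - row[c] / sum(row)

def commission(conf, c):
    """Fraction of predictions of class c that belong to another class."""
    col = [r[c] for r in conf]
    return 1 - conf[c][c] / sum(col)

for c, name in enumerate(["wildfire", "large wildfire"]):
    print(name, round(omission(conf, c), 3), round(commission(conf, c), 3))
```

These are the producer's/user's accuracy complements familiar from remote-sensing accuracy assessment, evaluated per model and per resampled dataset in Figure 8.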


Discussion
ML methods applied in studies of burn-area prediction are relatively novel compared to other wildfire applications. To date, these methods have been applied to forecast or predict the total area burned and fire occurrence [103,104]. However, these results do not take into account the environmental conditions at the time a wildfire starts. In the proposed methodology, a total of 20 variables have been evaluated to predict the occurrence of a large wildfire. Of these, four have been selected: wind speed, mean temperature, relative humidity, and NDVI. Each of these is linked to real-time or near real-time data. The first three variables come from a weather station installed close to the wildfire location and the last from the most recent Landsat scene at the time of the wildfire, allowing the prediction to be adapted to what is happening at that precise moment of the fire. The selection of meteorological variables is similar to that in previous research [74].
The fuel model variable was not selected, although it has been used in previous research projects related to wildfires [105,106]. Our results on fuel models were not conclusive, which was likely due to the low temporal resolution of these data, an aspect that has been noted by other authors working in the same region [102]. Previous studies modeling burn area mainly use multiple linear regression models [48,107]; there are few studies using ML techniques in this field [55,108,109]. These studies make predictions of wildfire probability without taking into account the environmental conditions at the onset of the fire. On the other hand, many large wildfire predictions are suboptimal, as they are concentrated in small regions where no general model fits appropriately, most likely due to the small number of large wildfires [110]. Therefore, the prediction of large wildfires presents difficulties, as they are uncommon events with respect to overall wildfire occurrence. Earlier research, such as [111], proposes the need to establish a threshold to delimit a large fire event [112]; however, this threshold is debatable [113]. In addition, recent ML-based models at continental or global scales for predicting burn areas offer good results in general terms but fail to distinguish large wildfires [114]. This imbalance of data justifies the use of synthetic data as proposed in this project.
Based on our results and considering both effectiveness and efficiency, Log and MLP were the best-performing models. The Log model yielded very low errors of omission in the prediction of large wildfires when synthetic data were used, except with Tomek links. However, this model was not the most effective, as its wildfire omission error was the highest. Thus, the model adopts a conservative profile: the necessary resources will always be mobilized for a large wildfire, accepting that on some occasions the resources will be oversized. On the other hand, using MLP as a predictive model with SMOTE or SMOTE TK as the technique to generate synthetic data makes the response more efficient but slightly less effective: the error of omission for a large wildfire is slightly increased, but the error for wildfires is smaller. Finally, the use of the original data is not recommended, mainly because of the high number of omissions in the prediction of a large fire, indicating the need to balance the sample.
In this study, the use of machine learning applications based on synthetic data to generate a predictive model of the presence of a large wildfire in its early stages has been evaluated. Of all the variables analyzed, the most important were those with a very high temporal resolution rather than historical variables; therefore, the deployment of sensors over the wildfire area is highly recommended in the initial phase of extinction in order to monitor temperature and wind speed. On the other hand, although the fuel model variable was not selected in this study, future work should use updated fuel model data to improve the results. Furthermore, we propose the evaluation of this methodology over a larger working area, at country or continental scale, to assess its suitability.

Conclusions
Wildfires are one of the most dangerous natural hazards across the world and, for this reason, any effort to support their analysis and management is important. Knowing at an early stage whether a wildfire is going to become a large wildfire permits better management of human and material resources. In this study, the use of machine learning methods together with the appropriate selection of variables has provided satisfactory results in the prediction of large wildfires. For this, the selection and processing of data is one of the most important aspects. In this context, the analysis carried out has shown that data registered at the time of the wildfire were more important than those based on historical series and that it is necessary to balance the data sample due to the higher occurrence of wildfires compared to large wildfires. Given the promising results presented here, the proposed methodology will be applied in future campaigns and will be extended to other regions.

Data Availability Statement: Data generated in this study are available from the corresponding author.