Data-Driven Wildﬁre Risk Prediction in Northern California

: Over the years, rampant wildﬁres have plagued the state of California, creating economic and environmental loss. In 2018, wildﬁres cost nearly 800 million dollars in economic loss and claimed more than 100 lives in California. Over 1.6 million acres of land has burned and caused large sums of environmental damage. Although, recently, researchers have introduced machine learning models and algorithms in predicting the wildﬁre risks, these results focused on special perspectives and were restricted to a limited number of data parameters. In this paper, we have proposed two data-driven machine learning approaches based on random forest models to predict the wildﬁre risk at areas near Monticello and Winters, California. This study demonstrated how the models were developed and applied with comprehensive data parameters such as powerlines, terrain, and vegetation in different perspectives that improved the spatial and temporal accuracy in predicting the risk of wildﬁre including ﬁre ignition. The combined model uses the spatial and the temporal parameters as a single combined dataset to train and predict the ﬁre risk, whereas the ensemble model was fed separate parameters that were later stacked to work as a single model. Our experiment shows that the combined model produced better results compared to the ensemble of random forest models on separate spatial data in terms of accuracy. The models were validated with Receiver Operating Characteristic (ROC) curves, learning curves, and evaluation metrics such as: accuracy, confusion matrices, and classiﬁcation report. The study results showed and achieved cutting-edge accuracy of 92% in predicting the wildﬁre risks, including ignition by utilizing the regional spatial and temporal data along with standard data parameters in Northern California.


Introduction
Due to changing climate and rising temperatures, the prevalent reality has resulted in longer and more intense wildfire seasons [1]. Wildfires, also known as wildland fires or forest fires, are uncontrolled fires occurring in forest or grassland that can threaten human lives, structures, and impact ecosystems and natural resources [2]. Many efforts have been made to contain its rapid spread, evacuate the human population, and mitigate the losses. It is extremely difficult for humans and wildlife to escape from the wildfires as they can have an insurmountable pace of 14 miles per hour. Generally, these wildfires are triggered by extreme heat, dry fuel, and human factors. However, according to the U.S. Department of Interior, 90% of wildfires are caused by people [3]. The fire history statistics of the U.S. show that California is one of the states that faces more wildfire outbreaks due to weather conditions, such as drought, dry lightning, and excessive heat [4]. According to the California Department of Forestry and Fire prevention (Cal Fire), in 2018, one of the worst years in California history, they witnessed 7571 fires that burned across 1.6 million classification analysis. Unlike the previous two papers [15,16], Marcos R. and Juan R. [17] introduced human-caused factors in wildfire risk prediction based on four different machine models-Logistic Regression (LR), Random Forest (RF), Boosting Regression Trees (BRT), and Support Vector Machines (SVM)-to assess the wildfire risk prediction for the evaluation of human-induced wildfires in Spain. They found RF and BRT to be better models in terms of accuracy for the area under the curve than the SVM or LR models. Caroline Famiglietti et al. [18] used three different machine learning models (Decision Trees (DT), Logistic Regression (LR), and Multi-Layer Perceptron (MLP)) to predict fire risk in Northern California based on satellite data. LR is a linear classifier that provides the output as a probabilistic prediction; DT is a non-parametric approach used for classification and regression; and MLP is a feedforward neural network that consists of input, hidden, and output layers. Mahsa Salehi et al. [19] created a Context-Based Fire Risk (CBFR) high accuracy predictive model using ensemble learning techniques by considering temporal variations to predict the wildfire in the region of Blue Mountains, Australia. Their unsupervised machine learning model can maintain multiple historical models for different temporal variations. Recently, big data-driven approaches have also been used to study the wildfire risk assessment. For instance, Ritaban Dutta et al. [20] developed a two-layered machine learning model using supervised and unsupervised learning techniques to establish a relation between fire incidence and weather data. They demonstrated a bush fire incidence hot spot on a weekly temporal and spatial resolution with a very high accuracy of 91%. Table 1 shows a comparison of different research studies in different parts of the world using machine learning models to predict the wildfire risks. This research paper reports our wildfire risk prediction results based on data sets from our study area near Monticello and Winters, California. In the research, the aim was to address the practical wildfire challenge in real-time weather forecasts and location-based risk prediction by using two data-driven machine learning approaches based on ensemble machine models and regional-based comprehensive data, including location-based weather data, remote sensing data in terrain and land vegetation, as well as wildfire history data and human factors.
Conventional approaches employ mathematical and statistical methods with a reliance on equations and calculated metrics. These traditional techniques suffer from lower accuracy, efficiency, unclear cut-offs, the complexity of equations, and lack the processing capabilities to support real-time decision making. With the advent of computer-assisted fire prediction, Machine Learning and Neural Networks have been utilized to improve this lagging outcome. However, only a limited parameter such as Normalized Difference Vegetation Index (NDVI), Normalized Difference Water Index (NDWI), and Enhanced Vegetation Index (EVI) were employed in the earlier studies with limited accuracy. NDVI is the standard vegetation index used throughout the world by the scientific community. EVI is the enhanced version of NDVI to address the shortcomings of NDVI. It considers the environmental conditions in the atmosphere that influence the index values. NDWI gives the water contact in vegetation and is a key indicator of fire ignition risk. Certain environmental conditions favor fire ignition and increase the likelihood of wildfires.
To cope with this deficiency, we have incorporated a wide range of parameters and indices and focused on improving the spatial and temporal accuracy of the outcome. This will eventually facilitate the smart management of wildfires. Our proposed model has better temporal accuracy than the discussed models and can achieve real-time prediction of fires being trained with the region-specific data parameters enables it to achieve the high spatial accuracy and can pinpoint an incident in real-time if the most recent spatial data gets provided to the model. In comparison to the other models, the major intellectual merit that our proposed models can have is that these can be effective with hourly or biweekly data for region-based wildfire risk prediction along with augmented spatial sensitivity. Unlike current wildfire models, our proposed models are multi-columned machine learning models, which are trained and developed to address the needs of region-based wildfire risk prediction including the fire ignition. The organization of the paper is structured as follows. Section 2 reports the area under the study and discusses the set of comprehensive data parameters and data preprocessing, and training and test data preparation. Section 3 describes an ensemble machine learning approach (Base model I, Base model II, and AdaBoost) and a combined model with their architecture, training, and comparative evaluation results. Section 4 discusses the conclusion and future work.

Study Area and Datasets
The study area near Monticello and Winters, California, is characterized by complex topography and dense and heterogeneous vegetation as shown by the bounded area in Figure 1. This region is enveloped between Davis and Napa. Our main intent is to perform a region-based wildfire risk analysis and prediction; therefore, we divided our study area into 63 grids of dimension 1 × 1 km. The date range for the investigated study is 1/1/2015 to 12/31/2019. We tried to leverage the advancements in Big Data technologies as fire systems can be modeled using complex data such as satellite, weather, terrain, powerline, and fire history. We have considered data parameters in fire history, weather, land vegetation and terrain, and powerline for this research project.

Study Area and Datasets
The study area near Monticello and Winters, California, is characterized by complex topography and dense and heterogeneous vegetation as shown by the bounded area in Figure 1. This region is enveloped between Davis and Napa. Our main intent is to perform a region-based wildfire risk analysis and prediction; therefore, we divided our study area into 63 grids of dimension 1 × 1 km. The date range for the investigated study is 1/1/2015 to 12/31/2019. We tried to leverage the advancements in Big Data technologies as fire systems can be modeled using complex data such as satellite, weather, terrain, powerline, and fire history. We have considered data parameters in fire history, weather, land vegetation and terrain, and powerline for this research project.

Data Sources
The various data types and their sources are given below in Table 2.

Data Sources
The various data types and their sources are given below in Table 2.

Data Pre-Processing
This section reports the diverse data processing for the dataset to be used for training the model to predict the wildfire. Our dataset consists of fire history, weather, vegetation, powerline, and terrain data.
We have used QGIS (Geographic Information System) to visualize the fire perimeter in a map. Thereafter, a mesh or gridded square was embedded on the area to isolate a fire-prone region, before downloading the coordinates of the vertices for further data collection from multiple sources, and computational analysis.

A. Fire history data processing
The fire history data is pre-cleaned and validated by Fire and Resource Assessment Program (FRAP), which helped us in identifying the location and relevant information about the fire incident [21]. 'YEAR_', 'ALARM_DATE', 'CONT_DATE', 'geometry', 'CAUSE', 'REPORT_AC', 'GIS_ACRES' were the features selected for the statistical analysis and five new fields were created for fires each year to integrate the information into the Target field.

B. Weather data processing
For climate and weather data, 4 data parameters were selected out of the 22 hourly variables (i.e., dry bulb temperature, relative humidity, wind speed, and hourly precipitation). To deal with missing data, different techniques were applied for the different scenarios below.

•
For a single station, if the previous and next hour record is present for the missing value, we took the mean of the previous and next hour record and filled the missing value • For a single station, if there is consecutive data missing for up to 10 days, we took the previous years' data of the same date For datasets containing non-numerical marker values such as 'T' to indicate trace amounts of precipitation, 'T' got converted to 0.0001. All the outliers outside the range were removed. Subsequently, the training dataset was normalized.
To create the weather dataset for the models, the selected area of study was divided into 1000 × 1000-m 2 grid cells using free and open-source tool QGIS (Geographic Information System) to visualize the fire perimeter in a map ( Figure 2). Thereafter, a mesh or gridded square layer was embedded on the area to isolate a fire-prone region/s, and coordinates of the grids were downloaded to be mapped with the weather data. To keep the data balanced for each sample of fire start day's weather data, another sample from a no-fire day's weather data was added to the dataset, and various combinations of years were included as well.
into 1000 × 1000-meter grid cells using free and open-source tool QGIS (Geographic Information System) to visualize the fire perimeter in a map ( Figure 2). Thereafter, a mesh or gridded square layer was embedded on the area to isolate a fire-prone region/s, and coordinates of the grids were downloaded to be mapped with the weather data. To keep the data balanced for each sample of fire start day's weather data, another sample from a nofire day's weather data was added to the dataset, and various combinations of years were included as well.

C. Vegetation data processing
For vegetation data, Landsat 8 dataset was gathered based on the area of interest geometry, and three separate GEE (Google Earth Explorer) data products applied on the image collection to derive NDVI (Normalized Difference Vegetation Index) [22][23][24], EVI (Enhanced Vegetation Index), and NDWI (Normalized Difference Water Index) vegetation indices. In order to overlay the vegetation indices in the area of study grid, the area was segmented into grids (1 km × 1 km) using the grid endpoints. Thirteen points including the four corners, four midpoints, centroid, and four diagonal points (or midpoints of four squares enclosed in the original square) were isolated. Vegetation indices of each of these points were directly fetched from GEE. Thereafter, the value for each grid was calculated as the average value of the 13 points after a series of experiments sampling 1, 5, and 13 points per grid. An elaborate analysis of this sampling methodology revealed that 13-point sampling per grid is sufficient and represented the vegetation of this area quite accurately. Taking all available points in the grid, features such as bodies of water can completely skew the calculation. For example, if half of a grid is a body of water, the other half with land can catch fire due to thick vegetation. For this reason, a 13-point approach for fire risk prediction was ideal compared to total aggregation of point-wise vegetation indices.
As part of data cleaning, we grouped points based on point id, start and end date and calculated the mean value such that a single value/row remains for a single inner grid at a given time. We integrated the original grid table with geometry, left, right, top, and bottom coordinates. Thereafter, the missing values were imputed. We considered using 'time' and 'linear' interpolation deemed most suitable for the time-series null value imputation [25]. The mathematical formula and Landsat bands-based formula for calculating Vegetation Index such as Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), and the Normalized Difference Water Index (NDWI) is given below in Equations (1)-(6) respectively [26]. NIR stands for Near-infrared band, whereas SWIR is a shortwave infrared band. The L value in EVI formula adjusts the canopy background, whereas the C value is the coefficient of atmospheric resistance. B is a blue band and G is the gain factor [27].
Range: −1 to 1 In order to make the data balanced in terms of fire days vs. no-fire days, Synthetic Minority Oversampling Technique (SMOTE) [28] was used to generate the synthetic data to oversample the minority class (fire-days).
To display the fire probability of each of these indices, we chose a small subset of data before and after the fire, then set the data points before the fire to 1 (fire expected) and after the fire to 0 (No Fire). NDVI and EVI are reliable indicators of the likelihood of fire. The mean NDVI is 0.304 with a standard deviation of 0.148, whereas the mean EVI is 0.302 with a standard deviation of 0.111. Mean NDWI is 0.107 with a standard deviation of 0.135. Figure 3 shows the vegetation indices data before, during, and after the fire for one of the grid cells (id#25), which indicates the totally different ranges for the NDVI, EVI, and NDWI before, during, and after the fire. after the fire to 0 (No Fire). NDVI and EVI are reliable indicators of the likelihood of fire. The mean NDVI is 0.304 with a standard deviation of 0.148, whereas the mean EVI is 0.302 with a standard deviation of 0.111. Mean NDWI is 0.107 with a standard deviation of 0.135. Figure 3 shows the vegetation indices data before, during, and after the fire for one of the grid cells (id#25), which indicates the totally different ranges for the NDVI, EVI, and NDWI before, during, and after the fire.

D. Powerline data processing
The theoretical reason for using powerline dataset was that, although it does not provide enough information on wildfire prediction, it is a major factor in modeling wind direction and micro weather conditions to predict wildfire ignition. Powerline dataset alone lacks in temporal information since it does not change frequently. Combining with the weather dataset, it is effective in correlating the wildfire ignition conditions (powerline related fires usually occur during the windy weather).
We merged the spatial vector layers file with fire history and bounding box grid using the tools in QGIS software and exported the resultant data as a flat file. After exporting the flat file grids with no powerline passing through, it ended up having null values. For all the columns except Status and Circuit, null values were replaced with zero as the pow-

D. Powerline data processing
The theoretical reason for using powerline dataset was that, although it does not provide enough information on wildfire prediction, it is a major factor in modeling wind direction and micro weather conditions to predict wildfire ignition. Powerline dataset alone lacks in temporal information since it does not change frequently. Combining with the weather dataset, it is effective in correlating the wildfire ignition conditions (powerline related fires usually occur during the windy weather). We merged the spatial vector layers file with fire history and bounding box grid using the tools in QGIS software and exported the resultant data as a flat file. After exporting the flat file grids with no powerline passing through, it ended up having null values. For all the columns except Status and Circuit, null values were replaced with zero as the powerlines were not going through those grids. However, Status column null values were replaced as "Not operating" and, for Circuit column, null values were replaced with "Other" as the categorical variables.

E. Terrain data processing
For terrain data, 1-m Digital Elevation Models (DEM) maps were downloaded from USGS (U.S. Geological Survey) National Map 3DEP Downloadable Data Collection, merged with fire history data, and overlaid onto the study area bounding box grid using the inbuilt zonal statistics algorithm in QGIS. Then, we calculated the statistics for three features (slope, hill shade, aspect) in each grid and used them as the input to the machine learning algorithm.
Slope: Slope represents the rate of change of elevation for each digital elevation model (DEM) cell. It is the first derivative of a DEM, and by default, the slope appears as a grayscale image. We can add the Colormap function to specify a color scheme. The inclination of the slope can be output as either a value in degrees or percent rise utilizing these three options: (1) Degree: The inclination of the slope is calculated in degrees. The values range from 0 to 90. (2) Scaled: The inclination of the slope is calculated the same as degrees, but the z-factor is adjusted for scale. (3) Percent Rise: The inclination of the slope is output as percentage values. The values range from 0 to essentially infinity. A flat surface is 0 percent, and a 45-degree surface is 100 percent, and as the surface becomes more vertical, the percent rise becomes increasingly larger.
Hill Shade: To get a better look at the terrain, it is possible to calculate a hill shade, which is a raster that maps the terrain using light and shadow. A hill shade can provide very useful information about the sunlight at a given time of day, and it can also be used for aesthetic purposes to make the map look better.
Aspect: Aspect is the compass direction that a slope faces. After applying 1 × 1 km grid, statistics of the slope, hill shade, and aspect are calculated for each grid on the map.
We excluded the data for year 2014 from our model experimentation as weather data was missing for the same in dataset. With the four years of data available, we experimented with subsets of data for fitting and building the model as shown in Table 3. Combination of different years from the above types, along with stratified samples, were considered in model building after statistically analyzing the samples. The mentioned data sources, along with stratified samples, were combined and considered for model training, validation, and test datasets. Eighty percent and twenty percent of the dataset is used as training and test dataset, respectively. Furthermore, the training dataset is divided into 4:1 ratio for training and validation datasets.  Figure 4a,b are the most important features for the finalized training dataset in regard to base model I and base model II of the ensemble model, respectively. Figure 4c shows the same for combined model. The merits of our proposed model and the steps it uses to predict risk are as follows.

Wildfire Risk Prediction Models
In this section, we first formally discuss the merits of the proposed models, then explain the steps of our proposed risk prediction models in detail. We have already discussed the limitations of traditional and current machine learning models in the introduction section. Here, we will discuss the merits of our proposed models in brief. We have developed the models with the hourly or bi-weekly data for region-based wildfire risk prediction with augmented spatial sensitivity. Unlike the mentioned recent work in introduction section on wildfire models, we need a model that deals in wide range of variables such as weather, vegetation indices, terrain, and powerlines data, with both the spatial and temporal dimensions to address the practical wildfire challenges posed by the frequent fluctuations in weather forecast and region-based risk prediction. Hence, we used data-driven machine learning models that run on comprehensive datasets with multiple parameters including location-based weather, terrain, vegetation, and powerlines data, along with the fire history data to predict fire risk in our study area based on satellite data.
The high-level architectures of our proposed ensemble and combined machine learning models that ingest grid-based values are discussed in the following sections. The ensemble model stacks the best models, whereas the combined model runs on a combined dataset using the best-performing algorithm. As we have multiple layers of complex data sources, we decided to extract grid-specific data for the parameters in our study area to make the integration of datasets easier. The models proposed in this paper to predict the wildfire risk are the ensemble and combined models that run on comprehensive datasets with multiple parameters including location-based weather, terrain, vegetation, and powerlines data, along with the fire history data.
Unlike other existing models, these models are integrated models powered by machine learning algorithm such as Adaboost, Decision trees, Gradient descent, Multi-layered perceptron, Random Forest Tree (RF), and Long Short-Term Memory (LSTM) to address convoluted location-specific wildfire risk prediction. In terms of accuracy of individual models, the Random forest algorithm outperformed all other algorithms when experimented with varied target labeling and subsets of the datasets (refer Table 3). Thus, random forest was an optimal choice for the ensemble and combined wildfire risk predicting models (refer to Figure 5 for the random first tree visualization).

Wildfire Risk Prediction Models
In this section, we first formally discuss the merits of the proposed models, then explain the steps of our proposed risk prediction models in detail. We have already discussed the limitations of traditional and current machine learning models in the introduction section. Here, we will discuss the merits of our proposed models in brief. We have developed the models with the hourly or bi-weekly data for region-based wildfire risk prediction with augmented spatial sensitivity. Unlike the mentioned recent work in introduction section on wildfire models, we need a model that deals in wide range of variables such as weather, vegetation indices, terrain, and powerlines data, with both the spatial and temporal dimensions to address the practical wildfire challenges posed by the frequent fluctuations in weather forecast and region-based risk prediction. Hence, we used data-driven machine learning models that run on comprehensive datasets with multiple parameters including location-based weather, terrain, vegetation, and powerlines data, along with the fire history data to predict fire risk in our study area based on satellite data.
The high-level architectures of our proposed ensemble and combined machine learning models that ingest grid-based values are discussed in the following sections. The ensemble model stacks the best models, whereas the combined model runs on a combined dataset using the best-performing algorithm. As we have multiple layers of complex data sources, we decided to extract grid-specific data for the parameters in our study area to make the integration of datasets easier. The models proposed in this paper to predict the wildfire risk are the ensemble and combined models that run on comprehensive datasets with multiple parameters including location-based weather, terrain, vegetation, and powerlines data, along with the fire history data.
Unlike other existing models, these models are integrated models powered by machine learning algorithm such as Adaboost, Decision trees, Gradient descent, Multi-layered perceptron, Random Forest Tree (RF), and Long Short-Term Memory (LSTM) to address convoluted location-specific wildfire risk prediction. In terms of accuracy of individual models, the Random forest algorithm outperformed all other algorithms when experimented with varied target labeling and subsets of the datasets (refer Table 3). Thus, random forest was an optimal choice for the ensemble and combined wildfire risk predicting models (refer to Figure 5 for the random first tree visualization). Atmosphere 2021, 12, x FOR PEER REVIEW 11 of 19 Figure 5. Generated random forest tree for weather, terrain, vegetation, and powerlines data.
Accuracy comparison among machine learning with hyper-parameters are represented in Table 4. The Random Forest algorithm generally has better performance than decision trees. This is due to the Random Forest algorithm being an ensemble of decision trees. Thus, we employed Random Forest on the collected dataset for predicting the wildfire risk. There are two main advantages of using Random Forest. First, the algorithm handles both numerical and categorical data, which allows more options on the choice of features. Second, since a random forest algorithm aggregates a collection of decision trees, it limits overfitting to a great context.

An Ensemble Model
In this section, we present an ensemble model, two base (weak) random forest algorithms used on a different set of data sources and their predicted probabilities combined to ensemble with stacking. For the first base model, data features included weather, powerline, and terrain data chose 8-day composites of vegetation indices available in Google Earth Engine (GEE) cloud catalogs as image collections for data ingestion.
Weather, terrain, and powerline data were sparse and lacked sufficient temporal variation. Hence, they were clubbed together to produce the time series data for the base Accuracy comparison among machine learning with hyper-parameters are represented in Table 4. The Random Forest algorithm generally has better performance than decision trees. This is due to the Random Forest algorithm being an ensemble of decision trees. Thus, we employed Random Forest on the collected dataset for predicting the wildfire risk. There are two main advantages of using Random Forest. First, the algorithm handles both numerical and categorical data, which allows more options on the choice of features. Second, since a random forest algorithm aggregates a collection of decision trees, it limits overfitting to a great context.

An Ensemble Model
In this section, we present an ensemble model, two base (weak) random forest algorithms used on a different set of data sources and their predicted probabilities combined to ensemble with stacking. For the first base model, data features included weather, powerline, and terrain data chose 8-day composites of vegetation indices available in Google Earth Engine (GEE) cloud catalogs as image collections for data ingestion.
Weather, terrain, and powerline data were sparse and lacked sufficient temporal variation. Hence, they were clubbed together to produce the time series data for the base model. Figure 6 shows the architecture of the Ensample model. Data from each source (i.e., Weather, Powerline, Terrain, and Vegetation) were pre-processed, analyzed, and selected as part of the data transformation step. They were then inputted into two different random forest classifiers in order to calculate the wildfire risk probabilities from each, after which they were used in stacked generalization ensemble. In the stacked ensemble model, the outputs from the two weak models, called the base models (refer supplementary material for base models details and the interim results), were inputted to the meta classifier model resulting in a robust ensemble model. model. Figure 6 shows the architecture of the Ensample model. Data from each source (i.e., Weather, Powerline, Terrain, and Vegetation) were pre-processed, analyzed, and selected as part of the data transformation step. They were then inputted into two different random forest classifiers in order to calculate the wildfire risk probabilities from each, after which they were used in stacked generalization ensemble. In the stacked ensemble model, the outputs from the two weak models, called the base models (refer supplementary material for base models details and the interim results), were inputted to the meta classifier model resulting in a robust ensemble model. During our experiments, Adaboost classifier consistently gave high accuracy as the second-layer learning algorithm for the ensemble model, although Random forest fared best in terms of accuracy. However, the results from the Random forest were inconsistent and unreliable. Therefore, we ensembled the models using the Adaboost meta estimator for the two base models with random forest classifier.
The above-mentioned individual vegetation and weather models were stacked, and their output was used as input for the final ensemble model. The learning curve in terms of the scalability of the model (fitting time vs. # of training examples) for the ensemble model is shown in Figure 7a. The ensemble model with "base model I" and "base model II" with combined output converged obtained a cross-validation accuracy of ~84% as shown in Figure 7b. This score is higher than what was obtained from the weak models alone.
During our experiments, Adaboost classifier consistently gave high accuracy as the second-layer learning algorithm for the ensemble model, although Random forest fared best in terms of accuracy. However, the results from the Random forest were inconsistent and unreliable. Therefore, we ensembled the models using the Adaboost meta estimator for the two base models with random forest classifier.
The above-mentioned individual vegetation and weather models were stacked, and their output was used as input for the final ensemble model. The learning curve in terms of the scalability of the model (fitting time vs. # of training examples) for the ensemble model is shown in Figure 7a. The ensemble model with "base model I" and "base model II" with combined output converged obtained a cross-validation accuracy of~84% as shown in Figure 7b. This score is higher than what was obtained from the weak models alone.
Overall, precision, recall, and f1 score were marked in the range of 80-84% as shown in Figure 8a,b, but ratio tp / (tp + fn) dropped to 59% for the samples with wildfire risk in Figure 7d. Figure 7c shows the performance curve with an accuracy of~82%. For negative test data, the ensemble model resulted in a precision of 0.84 and an f1-score of 0.89 with support of 1695. For positive test data, the reported results were a precision of 0.80, an f1-score of 0.68, and a support of 725.
The ROC curve for the ensemble model is shown in Figure 9, which compares the true positive rate to the false positive as the threshold for predicting '1' change. The area under the ROC curve is inherently related to the accuracy, but the AUC-ROC is preferred because it is automatically adjusted to the baseline and gives a robust picture of how the classifier performs at different threshold choices. It also shows the best threshold and ensemble model results with modified threshold value. The ROC curve shows the best threshold of 0.499112 and a G-Means of 0.866.
The above-mentioned individual vegetation and weather models were stacked, and their output was used as input for the final ensemble model. The learning curve in terms of the scalability of the model (fitting time vs. # of training examples) for the ensemble model is shown in Figure 7a. The ensemble model with "base model I" and "base model II" with combined output converged obtained a cross-validation accuracy of ~84% as shown in Figure 7b. This score is higher than what was obtained from the weak models alone. Overall, precision, recall, and f1 score were marked in the range of 80-84% as shown in Figure 8a,b, but ratio tp / (tp + fn) dropped to 59% for the samples with wildfire risk in Figure 7d. Figure 7c shows the performance curve with an accuracy of ~82%. For negative test data, the ensemble model resulted in a precision of 0.84 and an f1-score of 0.89 with support of 1695. For positive test data, the reported results were a precision of 0.80, an f1score of 0.68, and a support of 725. Overall, precision, recall, and f1 score were marked in the range of 80-84% as shown in Figure 8a,b, but ratio tp / (tp + fn) dropped to 59% for the samples with wildfire risk in Figure 7d. Figure 7c shows the performance curve with an accuracy of ~82%. For negative test data, the ensemble model resulted in a precision of 0.84 and an f1-score of 0.89 with support of 1695. For positive test data, the reported results were a precision of 0.80, an f1score of 0.68, and a support of 725. The ROC curve for the ensemble model is shown in Figure 9, which compares the true positive rate to the false positive as the threshold for predicting '1' change. The area under the ROC curve is inherently related to the accuracy, but the AUC-ROC is preferred because it is automatically adjusted to the baseline and gives a robust picture of how the classifier performs at different threshold choices. It also shows the best threshold and ensemble model results with modified threshold value. The ROC curve shows the best threshold of 0.499112 and a G-Means of 0.866.

Combined Model
As seen in the diagram above ( Figure 10), data from each source (i.e., Weather, Powerline, Terrain, and Vegetation) combined as data features and placed into a random forest classifier. The combined dataset with all the parameters such as weather, vegetation, terrain, and power lines were utilized for training with Random forest classifier and stratified train test split fared best. Data before being ingested were pre-processed, analyzed, and selected as part of data transformation step, and then combined and ingested into random forest classifier, which in turn calculates the wildfire risk probability. For both the models (i.e., ensemble and combined models), the same hyperparameters settings, which are max_depth as 1 for the tree in the random and n_estimators as 200, worked best. The combined dataset with all the parameters such as weather, vegetation, terrain, and power lines, in a python dataframe format, was utilized for training various models. Training, test, and validation sets following a similar methodology as the weather dataset. Yet again, the type II dataset (from Table 3) with random forest classifier and stratified train test split fared.

Combined Model
As seen in the diagram above (Figure 10), data from each source (i.e., Weather, Powerline, Terrain, and Vegetation) combined as data features and placed into a random forest classifier. The combined dataset with all the parameters such as weather, vegetation, terrain, and power lines were utilized for training with Random forest classifier and stratified train test split fared best. Data before being ingested were pre-processed, analyzed, and selected as part of data transformation step, and then combined and ingested into random forest classifier, which in turn calculates the wildfire risk probability. For both the models (i.e., ensemble and combined models), the same hyperparameters settings, which are max_depth as 1 for the tree in the random and n_estimators as 200, worked best. The combined dataset with all the parameters such as weather, vegetation, terrain, and power lines, in a python dataframe format, was utilized for training various models. Training, test, and validation sets following a similar methodology as the weather dataset. Yet again, the type II dataset (from Table 3) with random forest classifier and stratified train test split fared.

Combined Model
As seen in the diagram above ( Figure 10), data from each source (i.e., Weather, Powerline, Terrain, and Vegetation) combined as data features and placed into a random forest classifier. The combined dataset with all the parameters such as weather, vegetation, terrain, and power lines were utilized for training with Random forest classifier and stratified train test split fared best. Data before being ingested were pre-processed, analyzed, and selected as part of data transformation step, and then combined and ingested into random forest classifier, which in turn calculates the wildfire risk probability. For both the models (i.e., ensemble and combined models), the same hyperparameters settings, which are max_depth as 1 for the tree in the random and n_estimators as 200, worked best. The combined dataset with all the parameters such as weather, vegetation, terrain, and power lines, in a python dataframe format, was utilized for training various models. Training, test, and validation sets following a similar methodology as the weather dataset. Yet again, the type II dataset (from Table 3) with random forest classifier and stratified train test split fared.   Figure 12a,b show the classification report, accuracy, and confusion matrix for the combined model. Clearly, combined random forest model has better evaluation results compared to ensemble model and the achieved model accuracy is 91%. Figure 11c shows the performance curve for the combined model. With the weather, vegetation, power lines, and terrain data being major parameters used in predicting fire risk, we tried different combinations of the data and models to get the best fit. The classification also shows the percentage for precision, recall, and f1 score for all classes. The classification report for the combined model shows balanced results with 0.95 precision and 0.92 f1-score with 1695 support for positive and 0.89 precision and 0.92 f1-score with 1694 support for negative test data. The dataset with all the above-mentioned parameters together was used to build one variation of the machine learning model. We called it a combined model that could fetch a model with 92% accuracy. This is by far the best model we obtained in terms of the evaluation results. Evaluation metrics for the combined model is shown in Figure  11d. Further, we tested it on a newer validation dataset (refer Figure 11b) and obtained good results. Thereafter, this robust model was considered as the final machine learning model for this project. The combined model, compared to the ensemble model, does a fantastic job in identifying the true negatives and true positives (i.e., 92% of the times it predicts the true negatives and true positives correctly). Out of the remaining 8%, false positives are close to 5.5%, and at the same time false negatives are approximately 2.5%.  Figure 12a,b show the classification report, accuracy, and confusion matrix for the combined model. Clearly, combined random forest model has better evaluation results compared to ensemble model and the achieved model accuracy is 91%. Figure 11c shows the performance curve for the combined model. With the weather, vegetation, power lines, and terrain data being major parameters used in predicting fire risk, we tried different combinations of the data and models to get the best fit. The classification also shows the percentage for precision, recall, and f1 score for all classes. The classification report for the combined model shows balanced results with 0.95 precision and 0.92 f1-score with 1695 support for positive and 0.89 precision and 0.92 f1-score with 1694 support for negative test data. The dataset with all the above-mentioned parameters together was used to build one variation of the machine learning model. We called it a combined model that could fetch a model with 92% accuracy. This is by far the best model we obtained in terms of the evaluation results. Evaluation metrics for the combined model is shown in Figure 11d. Further, we tested it on a newer validation dataset (refer Figure 11b) Figure 13 shows the best threshold and combined model results with a modified threshold value.
. Figure 13. ROC for Combined random forest model. Figure 14 shows a combined model prediction dashboard for the real-time prediction of wildfire risk for the area under study on 6/30/2018 along with terrain, powerlines, vegetation, and weather data values. The top left part displays all the grid cells for the area under study. Color-coded grids represent the fire risk probabilities for each of the grids individually based on the spatial and temporal conditions. The above figure visualizes the fire risk probability and the risk label on a chosen date. The fire risk is lower on southern and eastern boundaries on that particular day, mainly based on the calculated vegetation indices values and terrain conditions as the Eastern and Southern area is not fertile and has no vegetation that can lead to wildfire. However, the high risk on a particular date is due to a combination of vegetation, weather, terrain, along with the powerline data. Regions close to water may have lush vegetation that causes the heightened fire risk. If the weather conditions are dry, that provides an abundance of combustible fuel.  Figure 13 shows the best threshold and combined model results with a modified threshold value.  Figure 13 shows the best threshold and combined model results with a modified threshold value.  Color-coded grids represent the fire risk probabilities for each of the grids individually based on the spatial and temporal conditions. The above figure visualizes the fire risk probability and the risk label on a chosen date. The fire risk is lower on southern and eastern boundaries on that particular day, mainly based on the calculated vegetation indices values and terrain conditions as the Eastern and Southern area is not fertile and has no vegetation that can lead to wildfire. However, the high risk on a particular date is due to a combination of vegetation, weather, terrain, along with the powerline data. Regions close to water may have lush vegetation that causes the heightened fire risk. If the weather conditions are dry, that provides an abundance of combustible fuel.  Figure 14 shows a combined model prediction dashboard for the real-time prediction of wildfire risk for the area under study on 6/30/2018 along with terrain, powerlines, vegetation, and weather data values. The top left part displays all the grid cells for the area under study. Color-coded grids represent the fire risk probabilities for each of the grids individually based on the spatial and temporal conditions. The above figure visualizes the fire risk probability and the risk label on a chosen date. The fire risk is lower on southern and eastern boundaries on that particular day, mainly based on the calculated vegetation indices values and terrain conditions as the Eastern and Southern area is not fertile and has no vegetation that can lead to wildfire. However, the high risk on a particular date is due to a combination of vegetation, weather, terrain, along with the powerline data. Regions close to water may have lush vegetation that causes the heightened fire risk. If the weather conditions are dry, that provides an abundance of combustible fuel.

Conclusions
In this era of climate change and rising temperatures, blazing wildfires are becoming a year-round phenomenon [2]. Wildfires, also known as wildland fires, forest fires, or brush fires, are uncontrolled fires sweeping across millions of acres of land, causing severe and extensive damage to our ecosystem [2]. With this study, we have established a machine-learning-ready basis for future studies and explored the key impacting features for wildfire prediction from a machine learning point of view. Unlike the statistical and machine learning approaches, our model relies on the high number of spatial and temporal variables and not just the standard ones such as weather, with which we learned the past fire events and obtained cutting-edge accuracy of 92% and real-time effectiveness in wildfire risk prediction.
In comparison to the ensemble model (Base model I + Base model II -> Adaboost), the combined model emerged as the winner in terms of accuracy. The results were not surprising but as expected; the justification for achieving better accuracy in the combined model is that Ensemble Base Model I (weather, terrain, and powerline) had no knowledge of the vegetation and, similarly, Ensemble Base Model II (vegetation) had trained separately and, due to whose vegetation model, could not derive any relation with the other variables. However, the combined model with a complete set of datasets was able to deduce the relationship among all the independent variables and achieved 92% accuracy.
In order to be successful in predicting the fire risk (including the ignition), high temporal resolution is needed. The model must be trained on region-specific variables such as vegetation, terrain, and weather datasets. With high spatial and temporal accuracy datasets, it can pinpoint an incident in real-time if we can fetch the most recent data. However, drawbacks were observed, the first one being that our machine learning model depends heavily on the data availability for the terrain. Additionally, other data such as weather, powerline, vegetation, and fire history are required. To further apply our model in different regions of the world, it relies on weather conditions and other region-specific temporal and spatial variables that add the region-specific constraints to the model's hypothesis space and can attain better fire risk prediction accuracy.
Our models can be further improved to be generalized to all areas. For the future research direction, we are developing a wildfire machine learning platform to support wildfire risk analysis and prediction, as well as fire spreading prediction.

Conclusions
In this era of climate change and rising temperatures, blazing wildfires are becoming a year-round phenomenon [2]. Wildfires, also known as wildland fires, forest fires, or brush fires, are uncontrolled fires sweeping across millions of acres of land, causing severe and extensive damage to our ecosystem [2]. With this study, we have established a machinelearning-ready basis for future studies and explored the key impacting features for wildfire prediction from a machine learning point of view. Unlike the statistical and machine learning approaches, our model relies on the high number of spatial and temporal variables and not just the standard ones such as weather, with which we learned the past fire events and obtained cutting-edge accuracy of 92% and real-time effectiveness in wildfire risk prediction.
In comparison to the ensemble model (Base model I + Base model II -> Adaboost), the combined model emerged as the winner in terms of accuracy. The results were not surprising but as expected; the justification for achieving better accuracy in the combined model is that Ensemble Base Model I (weather, terrain, and powerline) had no knowledge of the vegetation and, similarly, Ensemble Base Model II (vegetation) had trained separately and, due to whose vegetation model, could not derive any relation with the other variables. However, the combined model with a complete set of datasets was able to deduce the relationship among all the independent variables and achieved 92% accuracy.
In order to be successful in predicting the fire risk (including the ignition), high temporal resolution is needed. The model must be trained on region-specific variables such as vegetation, terrain, and weather datasets. With high spatial and temporal accuracy datasets, it can pinpoint an incident in real-time if we can fetch the most recent data. However, drawbacks were observed, the first one being that our machine learning model depends heavily on the data availability for the terrain. Additionally, other data such as weather, powerline, vegetation, and fire history are required. To further apply our model in different regions of the world, it relies on weather conditions and other region-specific temporal and spatial variables that add the region-specific constraints to the model's hypothesis space and can attain better fire risk prediction accuracy.
Our models can be further improved to be generalized to all areas. For the future research direction, we are developing a wildfire machine learning platform to support wildfire risk analysis and prediction, as well as fire spreading prediction. Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://drive.google.com/drive/folders/1_3F12htWwtHFPxeyYx07Fly8Ts10hDYY? usp=sharing.