Forest Fire Prediction with Imbalanced Data Using a Deep Neural Network Method

Abstract: Forests suffer heavy losses due to fires. A prediction model based on environmental conditions, such as meteorological measurements and vegetation indices, is considered a promising tool for controlling forest fires. The construction of prediction models can be challenging due to (i) the need to select the features most relevant to the prediction task, and (ii) the heavily imbalanced data distribution, where the number of large-scale forest fires is much smaller than that of small-scale ones. In this paper, we propose a forest fire prediction method that employs a sparse autoencoder-based deep neural network and a novel data balancing procedure. The method was tested on a forest fire dataset collected from the Montesinho Natural Park of Portugal. Compared to the best prediction results of other state-of-the-art methods, the proposed method predicts large-scale forest fires more accurately, reducing the mean absolute error by 3–19.3 and the root mean squared error by 0.95–19.3. The proposed method can better support the advance management of wildland fires and the prevention of serious fire accidents. It is expected that the prediction performance could be further improved if additional information and more data become available.


Introduction
Forest covers 31% of the global land surface and is of great importance in the wildland ecosystem. Forest fires are one of the major challenges for the preservation of forests, and cause great economic and ecological losses and even the loss of human lives [1]. Even though much attention and expense have been devoted to monitoring and controlling forest fires [2], the global annual burned forest area is in the millions of hectares (ha) [3]. The development of prediction models is expected to benefit fire management strategies in the ecosystem [4].
Traditionally, a forest fire is monitored by watch-keepers on a watchtower, but it is not feasible to construct many watchtowers scattered across extensive forests, and a watch-keeper can only detect fires already occurring instead of predicting them [5]. It is known that the occurrence of a forest fire is correlated with environmental conditions. For example, a forest fire is more likely to occur in hot and dry conditions than in cold and humid ones [6]. In the modern world, since numerous meteorological stations are available, the collection of weather data is fast and cheap. In addition, with the help of satellite remote sensing technology, local area conditions, such as the state of crops and the land surface temperature, can be computed based on satellite images [7]. This information can greatly benefit the construction of real-time and low-cost forest fire prediction methods.
Environmental data include many different variables, such as temperature, air pressure, humidity, wind speed, and vegetation index. These variables are usually recorded as numerical values and have valuable patterns correlated with the occurrence of forest fires. Since the dimension of the environmental data, i.e., the number of variables, is large, it is difficult to manually select the features most relevant to the prediction task. To the best of our knowledge, a sparse autoencoder-based DNN has not been used for forest fire prediction, and the imbalance problem is seldom considered in this task.
The objective of this work was to develop a method able to predict a forest fire. We assume the availability of a set of historical data composed of many records of small-scale forest fires and few records of large-scale ones, which is common considering that the occurrence of small-scale forest fires is much more frequent. Specifically, we assume the availability of N records of forest fires, hereafter called data samples, (x_i, y_i), i = 1, ..., N, where x is a vector containing K numerical variables associated with the environmental condition of the forest fire, e.g., weather measurements and metrics computed based on satellite images, and y is the corresponding severity of the forest fire. The forest fire prediction method receives a test vector x_TEST as the input, containing the environmental variables collected at a certain location and time, and is required to provide a prediction y_TEST. The prediction is better when it is closer to the true value y_TRUE.

Benchmark Data for Forest Fire Prediction
We considered forest fire data collected during 2000–2003 from the Montesinho Natural Park of Portugal. The dataset contains 517 records of forest fires, whose environmental condition is described using 12 numerical variables and whose severity is represented by the burned area measured in ha. Table 1 shows the numerical variables. The Natural Park is divided into 81 subareas using a 9 × 9 grid; therefore, the x and y coordinates indicate a certain subarea. Considering that 517 data samples are not enough to reveal the relevance of 81 subareas to forest fires, the coordinates are not used for the prediction of the burned area. The "month" and "day" variables are transformed to numerical data by denoting January to December as 1 to 12 and Monday to Sunday as 1 to 7, respectively. Each of the remaining variables is normalized into the scale [0, 1], forming the K = 10 dimensional input vector x for the prediction of the burned area. More details of the dataset can be found in [5]. A histogram of the burned area is shown in Figure 1a. Notice that the burned area is obviously imbalanced: the number of small-scale fires is much greater than that of large-scale fires. In the dataset, there are 247 samples with a zero burned area, indicating that the burned area is lower than 0.01 ha.
To ease the imbalance problem, a logarithmic transformation was applied to the original burned area values:

y = [ln(y_raw + 1) − ln(y_raw,min + 1)] / [ln(y_raw,max + 1) − ln(y_raw,min + 1)]    (1)

where y_raw represents the original burned area, y_raw,max and y_raw,min are the maximum and minimum of the burned area, respectively, and y is the transformed burned area, which is normalized into the scale [0, 1] to be used as the output for the prediction task, as shown in Figure 1b.
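As a quick illustration, the transformation of Equation (1) and its inverse (used later to map predictions back to hectares) can be sketched in Python. This is a minimal sketch; the function names are ours, and the example areas are illustrative values spanning the dataset's range.

```python
import numpy as np

def log_transform(y_raw, y_min, y_max):
    """Map burned areas to [0, 1] on a logarithmic scale (Equation (1))."""
    num = np.log(y_raw + 1.0) - np.log(y_min + 1.0)
    den = np.log(y_max + 1.0) - np.log(y_min + 1.0)
    return num / den

def inverse_transform(y, y_min, y_max):
    """Map transformed outputs back to the original burned-area scale (ha)."""
    log_y = y * (np.log(y_max + 1.0) - np.log(y_min + 1.0)) + np.log(y_min + 1.0)
    return np.exp(log_y) - 1.0

# Illustrative areas from 0 ha (no measurable burn) up to a large fire
areas = np.array([0.0, 0.5, 10.0, 1090.84])
y = log_transform(areas, areas.min(), areas.max())
```

The logarithm compresses the long right tail of the burned-area distribution, so the many near-zero fires no longer dominate the output scale.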


Sparse Autoencoder
The sparse autoencoder [26] aims at extracting useful features from the K-dimensional inputs of N training samples x_i, i = 1, ..., N. As shown in Figure 2, the encoder extracts a feature vector q_i = (q_i,1, ..., q_i,K_1) from the input vector x_i as follows:

q_i = f_1(W_1 x_i + b_1)    (2)

where f_1, W_1, b_1 are the activation function, the weight matrix and the bias vector of the encoder, respectively. Then, the decoder reconstructs x_i to x̂_i based on q_i:

x̂_i = f_2(W_2 q_i + b_2)    (3)

where f_2, W_2, b_2 are the activation function, the weight matrix and the bias vector of the decoder, respectively. The training of the sparse autoencoder aims at minimizing the following loss function to encourage the extraction of discriminative features:

L = T_re + η_1 T_sparse + η_2 T_L2    (4)

where T_re, T_sparse and T_L2 are terms with respect to the reconstruction error, the sparsity regularization and the L_2 regularization, respectively, and η_1 and η_2 are coefficients. T_re measures how close the reconstruction x̂_i is to the input x_i:

T_re = (1/N) ∑_{i=1}^{N} ||x̂_i − x_i||²    (5)

T_sparse is used to constrain the hidden neurons to be inactive most of the time in order to extract discriminative features. Denote the mean activation of the j-th hidden neuron, j = 1, ..., K_1, over all the samples x_i, i = 1, ..., N, as:

p̂_j = (1/N) ∑_{i=1}^{N} q_i,j    (6)

Ideally, the expected value of p̂_j should be a small value, e.g., 0.05, since the activation is required to be at zero for most of the samples. T_sparse is computed using the Kullback–Leibler (KL) divergence to evaluate whether p̂_j is close to an expected value p:

T_sparse = ∑_{j=1}^{K_1} [ p ln(p / p̂_j) + (1 − p) ln((1 − p) / (1 − p̂_j)) ]    (7)

The above function reaches zero, its minimal value, when all p̂_j are equal to p. The T_L2 term is used to constrain the weight values to prevent the network from overfitting:

T_L2 = ||W_1||² + ||W_2||²    (8)
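As an illustration, the loss terms above can be computed directly in NumPy for a randomly initialized network. This is a minimal sketch with illustrative dimensions, using sigmoid activations so that the mean activations lie in (0, 1); in practice the parameters are learned by minimizing the loss with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, K1 = 100, 10, 100          # samples, input dim, hidden dim (illustrative)
X = rng.random((N, K))

# Randomly initialized encoder/decoder parameters (illustrative only)
W1, b1 = rng.normal(0, 0.1, (K1, K)), np.zeros(K1)
W2, b2 = rng.normal(0, 0.1, (K, K1)), np.zeros(K)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Q = sigmoid(X @ W1.T + b1)       # hidden activations q_i for all samples
X_hat = sigmoid(Q @ W2.T + b2)   # reconstructions of the inputs

T_re = np.mean(np.sum((X_hat - X) ** 2, axis=1))       # reconstruction error
p_hat = Q.mean(axis=0)                                 # mean activation per neuron
p = 0.05                                               # expected sparsity
T_sparse = np.sum(p * np.log(p / p_hat)                # KL sparsity penalty
                  + (1 - p) * np.log((1 - p) / (1 - p_hat)))
T_L2 = np.sum(W1 ** 2) + np.sum(W2 ** 2)               # weight regularization

eta1, eta2 = 1.0, 1e-7
loss = T_re + eta1 * T_sparse + eta2 * T_L2            # total loss
```

Note that each summand of the KL penalty is non-negative and vanishes only when the neuron's mean activation equals p, which is what pushes most hidden units toward inactivity.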

Deep Neural Networks
The DNN aims at constructing an empirical mapping function from the K-dimensional input space to the output. The DNN is composed of an encoder and a regression layer (Figure 3). The encoder extracts high-level features from x_i using M hidden layers, and the regression layer provides a prediction ŷ_i of y_i. Following the idea of the sparse autoencoder, T_sparse and T_L2 defined in Equation (4) are applied to assist the encoder in extracting high-level features. Denote the feature vectors progressively extracted from the hidden layers as q_i^1, ..., q_i^M. The dimension of q_i^1, K_1, is typically larger than the input dimension K to obtain a sparse-overcomplete feature vector, which was found to be capable of benefiting the feature extraction of the following layers [28]. For the following hidden layers, the dimension K_m < K_{m−1}, m = 2, 3, ..., M, forces effective feature extraction. The ReLU activation function is adopted for all the layers of the encoder to allow the fast training of the DNN [29].
The regression layer computes the prediction based on the high-level feature q_i^M:

ŷ_i = f(w q_i^M + b)    (9)

where f is the sigmoid activation function, and w and b are the weight vector and bias of the regression layer, respectively. Then, the training of the DNN is to minimize:

L = (1/N) ∑_{i=1}^{N} (ŷ_i − y_i)² + η_1 ∑_{m=1}^{M} T_sparse^m + η_2 T_L2    (10)

where the first term is the mean squared prediction error, T_sparse^m is the sparsity regularization for the m-th hidden layer, and the last term is the L_2 regularization.
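A forward pass through the encoder and regression layer can be sketched as follows. This is a minimal NumPy sketch with randomly initialized weights and the encoder structure K = 10, K_1 = 100, K_2 = 50, K_3 = 6 used later in the paper; training would fit these parameters by backpropagation on the regularized loss.

```python
import numpy as np

rng = np.random.default_rng(1)
dims = [10, 100, 50, 6]           # K, K1, K2, K3: sparse-overcomplete, then shrinking
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Encoder parameters for M = 3 hidden layers (illustrative initialization)
Ws = [rng.normal(0, 0.1, (dims[m + 1], dims[m])) for m in range(3)]
bs = [np.zeros(dims[m + 1]) for m in range(3)]
# Regression layer parameters
w, b = rng.normal(0, 0.1, dims[-1]), 0.0

def predict(x):
    """ReLU encoder followed by a sigmoid regression layer."""
    q = x
    for W, bias in zip(Ws, bs):
        q = relu(W @ q + bias)    # progressively extracted feature q^m
    return sigmoid(w @ q + b)     # prediction in (0, 1), matching the scaled output

y_hat = predict(rng.random(10))   # a single prediction for one input vector
```

The sigmoid output keeps predictions inside [0, 1], matching the scale of the transformed burned area, which can then be inverse-transformed to hectares.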

Data Balancing Procedure
In order to deal with the imbalanced distribution of the output y, a data balancing procedure is proposed. It is described as follows:

Step 1. Identify the range of the output [y_min, y_max], and divide it equally into N_a non-overlapping intervals, i.e., [y^(0) = y_min, y^(1)], (y^(1), y^(2)], ..., (y^(N_a−1), y^(N_a) = y_max];

Step 2. Generate a random number y_rand from the uniform distribution U[y_min, y_max] and select the interval (y^(j−1), y^(j)] for synthetic sample generation, where y^(j−1) < y_rand ≤ y^(j), j = 1, ..., N_a. Notice that each interval has the same probability of selection due to the random sampling from a uniform distribution, which helps to avoid over- or under-representing a specific interval;

Step 3. Randomly choose a sample (x_i, y_i) whose y_i is in the selected interval;

Step 4. Generate a synthetic sample (x_i + n_x, y_i) by introducing the Gaussian noise n_x. There is a probability of 10% that the elements of n_x are all zeros, i.e., the synthetic sample is the same as the original sample, and a probability of 90% that the elements of n_x are randomly sampled from a Gaussian distribution N(µ = 0, σ) to increase the diversity of the synthetic samples;

Step 5. Repeat Steps 2–4 until N_max random samplings are done. Then, the obtained synthetic samples are used for training the DNN.
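The five steps above can be sketched directly in NumPy. This is a minimal sketch; the function and variable names are ours, and the default hyperparameters follow the values used later in the paper.

```python
import numpy as np

def balance_data(X, y, n_intervals=100, n_max=20000, sigma=0.001, rng=None):
    """Generate synthetic samples whose outputs cover [y_min, y_max] uniformly."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1: equally spaced interval edges over the output range
    edges = np.linspace(y.min(), y.max(), n_intervals + 1)
    X_syn, y_syn = [], []
    for _ in range(n_max):
        # Step 2: select an interval via a uniform random number
        y_rand = rng.uniform(y.min(), y.max())
        j = int(np.clip(np.searchsorted(edges, y_rand, side="left"), 1, n_intervals))
        low_ok = y >= edges[0] if j == 1 else y > edges[j - 1]
        candidates = np.where(low_ok & (y <= edges[j]))[0]
        if len(candidates) == 0:
            continue  # no original sample falls in this interval
        # Step 3: randomly choose a sample from the selected interval
        i = rng.choice(candidates)
        # Step 4: keep the sample as-is with probability 10%, else add Gaussian noise
        noise = 0.0 if rng.random() < 0.1 else rng.normal(0.0, sigma, X.shape[1])
        X_syn.append(X[i] + noise)
        y_syn.append(y[i])
    # Step 5: after n_max samplings, return the synthetic training set
    return np.array(X_syn), np.array(y_syn)
```

Because the interval is chosen uniformly rather than in proportion to its sample count, rare large-output intervals are oversampled relative to the original distribution, which is exactly what counteracts the imbalance.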

Results
The general procedure of applying the proposed method to forest fire prediction is shown in Figure 4.
Since the number of data samples is limited, a 10-fold cross-validation is adopted to evaluate the performance of the DNN. The dataset is randomly divided into 10 subsets, each containing approximately 10% of the samples. For each fold of the cross-validation, one subset is taken as the test set, and the other subsets are used as the training set for building the DNN model with the data balancing procedure. The predictions are obtained on the test set using the DNN and are inversely transformed to the original scale. The subsets are used as the test set in turn to obtain predictions on the whole dataset. The performance of the DNN is evaluated using the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE):

MAE = (1/N) ∑_{i=1}^{N} |y_i − ŷ_i|    (11)

RMSE = √((1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²)    (12)
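The evaluation loop can be sketched as follows. This is a minimal NumPy sketch with a placeholder model standing in for the trained DNN; the function names are ours.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def cross_validate(X, y, fit_predict, k=10, rng=None):
    """k-fold CV: each subset serves once as the test set."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    y_pred = np.empty_like(y, dtype=float)
    for f in folds:
        train = np.setdiff1d(idx, f)              # all samples outside the fold
        y_pred[f] = fit_predict(X[train], y[train], X[f])
    return mae(y, y_pred), rmse(y, y_pred)

# Placeholder "model": always predicts the training mean
mean_model = lambda Xtr, ytr, Xte: np.full(len(Xte), ytr.mean())
```

Repeating the whole loop with different random permutations (10 times in the paper) averages out the effect of the particular dataset split.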
Lower values of these two metrics indicate better performance. Considering the random effect caused by dataset splitting, the 10-fold cross-validation is repeated 10 times. The average MAE and RMSE are computed as the final performance of the DNN.
Figure 5 shows an example of applying the data balancing procedure (Section 2.4). The scaled range [0, 1] of y is divided into N_a = 100 non-overlapping intervals of length 0.01. Figure 5a shows the number of samples with respect to the intervals, where the first interval contains many more samples than the others, i.e., the distribution of y is extremely imbalanced. Synthetic samples are generated by randomly choosing an interval, randomly choosing a sample from the interval, and adding Gaussian noise to the input of the sample. Since the variables of x are scaled into [0, 1], the standard deviation σ of the noise distribution N(µ = 0, σ) is set to 0.001 to slightly modify the original variables for the diversity of the synthetic samples. N_max = 20,000 samplings are performed and approximately 200 synthetic samples are generated for each interval, as shown in Figure 5b. The total number of synthetic samples is less than N_max since nothing is generated for the intervals which do not contain any data sample.
A DNN with M = 3 hidden layers is constructed. Its encoder structure is K = 10, K_1 = 100, K_2 = 50, K_3 = 6, which is set following the general principle discussed in Section 2.3. For the hyperparameters, η_1 = 1 and η_2 = 10^−7, which are set by computing the magnitude ratio in the loss function to keep the terms balanced. The expected sparsity p is chosen from the possible set {0.01, 0.02, ..., 0.2} using an internal 10-fold grid search, i.e., for each fold of the cross-validation, 10-fold cross-validations are first performed using only the training data for the DNN with each given value of p, and then the DNN with the best-performing p on the training set is selected to make predictions on the test set. The median value of the selected p is 0.08. The obtained results are reported in Table 2 in terms of the mean and standard deviation of the MAE and RMSE, which are computed based on the 10 repetitions of the cross-validation.
Table 2. Segmented performance of the different methods in the form of MAE average and standard deviation (RMSE average and standard deviation); the best performance in each interval is reported in bold.


Discussion
For comparison, the popular regression methods ANN, SVM and RF were used for forest fire prediction, and their average MAEs and RMSEs over 10 repetitions of the 10-fold cross-validation were computed. The ANN is a typical feedforward network with one hidden layer. The activation function of the hidden and output neurons is sigmoid. The number of hidden neurons, N_h, is selected by an internal 10-fold grid search considering the possible set {4, 6, ..., 20}; the median value of the selected N_h is 10. The SVM maps the input variables into a high-dimensional space, and then finds the best linear hyperplane for regression with the support of a nonlinear kernel function. The regularization parameter, C, is selected by an internal 10-fold grid search considering the possible set {0.01, 0.1, 1, 10, 100}. The median value of the selected C is 1. A Random Forest (RF) was constructed by averaging the outputs of multiple decision trees, each trained using training samples randomly selected by the bootstrap technique. The number of trees, N_tree, was selected by an internal 10-fold grid search considering the possible set {50, 100, ..., 500}; the median value of the selected N_tree is 200. All experiments were conducted using a computer with an Intel i7-8550U CPU, 8.00 GB RAM and the Windows 10 OS. The DNN was built using the Keras 2.0 framework, and the other models for comparison were constructed using the scikit-learn 0.24.2 package. The obtained results are shown in Table 2 in terms of the mean and standard deviation of the MAE and RMSE. The metrics were computed over 10 intervals of the burned area to investigate the prediction accuracy of the methods for different scales of forest fire.
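The baseline tuning described above can be sketched with scikit-learn. This is a minimal sketch on synthetic stand-in data (the real inputs are the scaled environmental variables); the ANN baseline is omitted, the SVM grid mirrors the one reported in the text, and the RF grid is a subset of the reported {50, 100, ..., 500} to keep the sketch fast.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((100, 10))
y = rng.random(100)  # synthetic stand-in for the transformed burned area

# SVM: tune the regularization parameter C over the reported grid
svm_search = GridSearchCV(SVR(kernel="rbf"),
                          {"C": [0.01, 0.1, 1, 10, 100]},
                          cv=10, scoring="neg_mean_absolute_error")
svm_search.fit(X, y)

# RF: tune the number of trees (subset of the reported grid for speed)
rf_search = GridSearchCV(RandomForestRegressor(random_state=0),
                         {"n_estimators": [50, 100, 200]},
                         cv=10, scoring="neg_mean_absolute_error")
rf_search.fit(X, y)
```

In the paper this internal grid search is run per outer fold using only the training data, so the test fold never influences the chosen hyperparameter.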
In Table 2, the ANN, SVM and RF give better performance for small-scale forest fires whose burned area is less than 15.42 ha. The proposed method outperforms the other methods when the scale is larger, indicating that the DNN successfully pays more attention to the large-scale forest fires with the support of the data balancing procedure.
To better understand the performance of the methods, Figure 6 shows the prediction results with respect to one fold of the cross-validation. Since the ANN, SVM and RF behave similarly, and to keep the figure clear, only the SVM, which has the best performance on the smallest interval, is shown for comparison. Since more than half of the samples in the dataset have a zero burned area, the SVM actually learns to always provide a very small value no matter what the input is. To further investigate the behavior of the different methods, Figure 7 shows the histogram of their predictions obtained from one fold of prediction. It is verified that the SVM, ANN and RF always provide small predictions regardless of the input variables. This means that the SVM, ANN and RF are not reliable; they ignore or extremely underestimate all large-scale fires.

Different from the classical methods, which fail to learn useful information, the proposed method pays attention to all fire scales. For most small-scale forest fires, the proposed method provides close predictions, and the relatively larger prediction error is mainly due to some occasional large predictions (Figure 6). For large-scale fires, the proposed method attempts closer predictions, though these are still not very accurate. The prediction histogram of the proposed method (Figure 7) is closer to the original data distribution (Figure 1a), indicating that the proposed method facilitates the extraction of the correlation from the input variables to the output. However, the performance of the proposed method is limited by the lack of sufficient information, since the synthetic data, generated by adding Gaussian noise to the original data, can help the training of the DNN but do not provide any new information beyond the original data. Therefore, the proposed method cannot fully detect the causes of large-scale fires, leading to the wrong over-estimation of some small-scale fires. As a result, it shows a trade-off between improving the large-scale prediction accuracy and over-estimating small-scale fires (Table 2).
The over-estimation of small-scale fires may cause overreaction and extra costs. However, considering that a large-scale fire can cause serious consequences and huge losses, its accurate prediction is usually more important. We expect that the performance of the proposed method can be further improved if more data containing more information are collected.
In Table 2, notice that the performance variations of the proposed method are larger than those of the SVM, ANN and RF. The variations of the MAE and RMSE are computed based on the results obtained from the 10 cross-validations. Heavily affected by the imbalanced data, the ANN, SVM and RF always provide very small and stable values in all the 10 cross-validations, resulting in small variations. The performance variation of the proposed method among the cross-validations is mainly due to the change in the training set (Figure 4). Since the dataset is relatively small (517 records), the random division of subsets makes the training set very different across the cross-validations. When more data are collected, the effect of the random data division can be weakened to reduce the variation of the proposed method. The large variations actually indicate that the proposed method tends to learn useful information in all the cross-validations.
To further investigate why it is difficult for the SVM, ANN and RF to learn useful patterns for fire prediction, the trends of forest fires by input variables are shown in Figure 8. With respect to the month, large-scale fires tend to occur in summer (July to September), which is consistent with common sense. The day of the week does not seem to strongly influence fire occurrence, and large-scale fires happened on all days except Friday. By definition, larger values of FFMC, DMC, DC and ISI suggest more severe burning conditions [5]. From the collected data, FFMC and DC behave in accordance with the definition, while DMC and ISI are more likely to indicate severe burning conditions in the middle of their ranges. The remaining four variables are the most intuitive. Large-scale fires are likely to be driven by high temperature, low RH (relative humidity), wind with a speed of about 2-6 km/h and a small amount of rain. However, notice that the variable values more likely to lead to severe burning can also suggest small-scale fires, and the variables correlate nonlinearly with the burned area and can have various value combinations. Thus, it is difficult to explicitly extract rules for fire prediction from the data. A method able to automatically extract useful features from the data is required. The ANN, SVM and RF rely on high-quality features strongly correlated with the output as the input, whereas the proposed method based on the DNN can further extract useful features for prediction from the input variables.
In summary, the SVM, ANN and RF cannot make meaningful predictions due to the complexity of the mapping between the input variables and the burned area. The proposed method performs relatively well in the prediction of large-scale forest fires. However, it sometimes over-estimates small-scale fires and its prediction variation is relatively large.
If more data are available, the superiority of the proposed method based on the DNN is expected to become more obvious and its drawbacks to be overcome. Fortunately, more and more forest monitoring data can be collected in the big data era, and the proposed method is a promising tool to be considered for forest fire prediction.


Conclusions
This work proposed a new method based on a sparse autoencoder-based DNN and a data balancing procedure for forest fire prediction using numerical environmental variables. Its main contributions are: (i) it employs a DNN model with sparse regularization, which can automatically extract features from a large amount of data without requiring expert intervention for feature selection, and (ii) it develops a data balancing procedure to tackle the problem of the imbalanced dataset.
The forest fire data collected from the Montesinho Natural Park of Portugal were used to investigate the performance of the proposed method. The dataset is seriously imbalanced, as more than half of the sample outputs are zeros. The results show that the proposed method outperforms the ANN, SVM and RF for the prediction of large-scale forest fires. The prediction error on small-scale forest fires is mainly due to some occasionally large predictions. The prediction of forest fires is a challenging task. It is expected that the prediction performance could be further improved if additional information and more data are available, given the capability of the DNN to extract useful information from a large amount of data.

Data Availability Statement:
Publicly available datasets were analyzed in this study. These data can be found here: http://archive.ics.uci.edu/ml/datasets/Forest+Fires (accessed on 29 February 2008).

Conflicts of Interest:
The authors declare no conflict of interest.