A Forest Fire Detection System Based on Ensemble Learning

Abstract: Due to the various shapes, textures, and colors of fires, forest fire detection is a challenging task. Traditional image processing methods rely heavily on manmade features, which are not universally applicable to all forest scenarios. To solve this problem, deep learning is applied to learn and extract features of forest fires adaptively. However, the limited learning and perception ability of individual learners is not sufficient for them to perform well on complex tasks. Furthermore, learners tend to focus too much on local information, namely the ground truth, and ignore global information, which may lead to false positives. In this paper, a novel ensemble learning method is proposed to detect forest fires in different scenarios. Firstly, two individual learners, Yolov5 and EfficientDet, are integrated to accomplish the fire detection process. Secondly, another individual learner, EfficientNet, is responsible for learning global information to avoid false positives. Finally, detection results are made based on the decisions of the three learners. Experiments on our dataset show that the proposed method improves detection performance by 2.5% to 10.9% and decreases false positives by 51.3%, without any extra latency.


Introduction
With the change of the Earth's climate, forest fires occur frequently all over the world; they not only cause serious economic losses and destroy the ecological environment, but also pose a great threat to the safety of human life.
Forest fires usually spread quickly and are difficult to control in a short time. Therefore, it is imperative to detect an early forest fire before it spreads, but traditional detection methods have obvious drawbacks in open forest areas. Sensor-based detection systems [1][2][3] perform well in indoor spaces, but they are difficult to install outdoors, considering the high coverage cost [4,5]. In addition, they cannot provide the important visual information that helps firefighters promptly grasp the situation at the fire scene. Infrared or ultraviolet detectors [6,7] are easily interfered with by the environment and, considering their short detection distance, are not suitable for large open areas. Satellite remote sensing [8] is good at detecting large-scale forest fires, but it cannot detect an early regional fire.
Impressed by the rise of computer vision technology, researchers have started to seek efficient and effective fire detection models based on image processing. Chen et al. [9] proposed an RGB (red, green, blue) model-based chromatic and disorder measurement for extracting fire pixels in video. The color information is responsible for extracting fire pixels, and dynamic information is used to verify whether a fire is real. Töreyin et al. [10] used a 1D temporal wavelet transform to detect flame flicker and applied a 2D spatial wavelet transform to identify moving fire regions. This method, which integrated color and temporal variation information, reduced false alarms in real-world scenes. Çelik et al. [11] studied diverse video sequences and images, and proposed a fuzzy color model using statistical analysis.
Combined with motion analysis, the model achieves good discrimination between fire and fire-like objects. Teng et al. [12] analyzed fire characteristics and proposed a real-time fire detection method based on hidden Markov models (HMMs), which extracts candidate fire pixels using moving-pixel detection, fire-color inspection, and pixel clustering. Chino et al. [13] found that most algorithms were designed for video, which has obvious limitations. To solve this problem, they proposed a novel fire detection method named BowFire, which combines color features with superpixel texture discrimination to detect fire in still images. In conclusion, most traditional fire detection methods based on image processing focus on creating artificial features like color, motion, and texture to detect fires.
However, powerful deep learners have begun to replace manmade features. They are better at learning features than humans, and the features they extract contain much deeper semantic information than manmade ones. Recently, deep learning has outperformed traditional manmade features in many fields and has been widely used in fire detection. Zhang et al. [14] created a forest fire benchmark and used Faster R-CNN (region-based convolutional neural network) [15], Yolo (you only look once) [16][17][18][19], and SSD (single shot multibox detector) [20] to detect fire. They found that SSD was better regarding efficiency, detection accuracy, and early fire detection ability. Moreover, they proposed an improved tiny-Yolo by adjusting the network architecture. Kim et al. [21] employed Faster R-CNN to detect fire and non-fire regions based on their spatial features. In addition, long short-term memory (LSTM) was used to verify the reliability of fire alarms. Lee et al. [22] proposed a video-based fire detection model, which used Faster R-CNN to generate a fire candidate region for each frame. Then, the structural similarity (SSIM) and mean square error (MSE) were calculated to determine the similarity between adjacent frames, and final fire regions were determined based on spatial and temporal features. Pan et al. [23] proposed a camera-based wildfire detection system via transfer learning, in which a block-based analysis strategy was used to improve fire detection accuracy. Redundant filters, which had low-energy impulse responses, were removed to ensure the model's efficiency on edge devices. Wu et al. [24] applied principal component analysis (PCA) to process forest fire images before feeding them into the training network. The combination of the two models was proven to enhance localization results.
In conclusion, when faced with the fire detection task, most researchers tend to assign only an individual learner to perform object detection, which is considered unreliable, since it may lead to false negatives.
In this paper, a novel method based on ensemble learning for forest fire detection is proposed. First, forest fire detection is a complicated and difficult task, making it highly impractical for an individual learner to detect fires in diverse scenarios. Every individual learner has its own expertise and can extract different features from an image, so integrating different individual learners can significantly improve the robustness of the model and enhance detection performance. Therefore, two individual object detectors, Yolov5 [25] and EfficientDet [26], are integrated to detect fires in parallel. These two learners work synergistically in detecting different types of forest fires, thereby improving detection accuracy. Second, object detectors only care about what fire looks like, so they do not take the whole image into consideration. In this case, fire-like objects will inevitably affect the detection results. To solve this problem, the EfficientNet image classifier [27] is incorporated into our model; its role is to enable the model to take full advantage of global information. Final detection results are made through a decision strategy according to the results of these three learners, which efficiently increases detection accuracy and decreases false positives.

Datasets
To ensure our learners can handle different kinds of forest fires (ground fire, trunk fire, and canopy fire), we collected images from multiple public fire datasets: BowFire [28], FD-dataset [29], ForestryImages [30], VisiFire [31], etc. After manual filtration, we created a single integrated forest fire dataset containing 10,581 images, with 2976 forest fire images and 7605 non-fire images. Representative samples of our dataset are shown in Figures 1-3.

Yolov5
Yolo is a state-of-the-art, real-time object detector, and Yolov5 builds on Yolov1-Yolov4. Continuous improvements have enabled it to achieve top performance on two official object detection datasets: Pascal VOC (visual object classes) [32] and Microsoft COCO (common objects in context) [33].
The network architecture of Yolov5 is shown in Figure 4. There are three reasons why we choose Yolov5 as our first learner. Firstly, Yolov5 incorporates the cross stage partial network (CSPNet) [34] into Darknet, creating CSPDarknet as its backbone. CSPNet solves the problem of repeated gradient information in large-scale backbones and integrates the gradient changes into the feature map, thereby decreasing the parameters and FLOPs (floating-point operations) of the model, which not only ensures inference speed and accuracy, but also reduces the model size. In the forest fire detection task, detection speed and accuracy are imperative, and a compact model size also determines inference efficiency on resource-poor edge devices. Secondly, Yolov5 applies the path aggregation network (PANet) [35] as its neck to boost information flow. PANet adopts a new feature pyramid network (FPN) structure with an enhanced bottom-up path, which improves the propagation of low-level features. At the same time, adaptive feature pooling, which links the feature grid and all feature levels, is used to make useful information in each feature level propagate directly to the following subnetwork. PANet improves the utilization of accurate localization signals in lower layers, which can obviously enhance the location accuracy of the object. Thirdly, the head of Yolov5, namely the Yolo layer, generates three different sizes of feature maps (18 × 18, 36 × 36, 72 × 72) to achieve multi-scale [18] prediction, enabling the model to handle small, medium, and large objects. A forest fire usually develops from a small-scale fire (ground fire) to a medium-scale fire (trunk fire), and then to a large-scale fire (canopy fire). Multi-scale detection ensures that the model can follow the size changes in the process of fire evolution.
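The three grid sizes quoted above are consistent with the standard Yolo detection strides of 32, 16, and 8 applied to a 576 × 576 input; the input resolution is our inference for illustration, not something the text states. A minimal sketch:

```python
# Feature-map grid sizes of a multi-scale detection head: each head level
# downsamples the input by its stride. The 18x18 / 36x36 / 72x72 grids in
# the text match a 576x576 input with the usual Yolo strides of 32/16/8
# (the input size is our assumption).

def head_grid_sizes(input_size: int, strides=(32, 16, 8)) -> list:
    """Return the square grid size produced at each detection stride."""
    return [input_size // s for s in strides]

print(head_grid_sizes(576))  # -> [18, 36, 72]
```

The coarse 18 × 18 grid covers large (canopy-scale) fires, while the fine 72 × 72 grid localizes small ground fires.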

EfficientDet
EfficientDet is a new family of object detectors developed by Google, and it consistently achieves better efficiency than prior art across a wide spectrum of resource constraints. Similar to Yolov5, EfficientDet has also achieved remarkable performances in Pascal VOC and Microsoft COCO tasks, and is widely used in real-world applications.
The network architecture of EfficientDet is shown in Figure 5. There are three reasons why we choose EfficientDet as our second learner. Firstly, EfficientDet employs the state-of-the-art network EfficientNet [27] as its backbone, giving the model sufficient ability to learn the complex features of diverse forest fires. Secondly, it applies an improved PANet, named the bi-directional feature pyramid network (Bi-FPN), as its neck to allow easy and fast multi-scale feature fusion. Bi-FPN introduces learnable weights, enabling the network to learn the importance of different input features, and repeatedly applies top-down and bottom-up multi-scale feature fusion. Compared with Yolov5's neck PANet, Bi-FPN achieves better performance with fewer parameters and FLOPs. Meanwhile, a different feature fusion strategy brings different semantic information, and thereby different detection results. Thirdly, similar to EfficientNet, it integrates a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time, which ensures maximum accuracy and efficiency under limited computing resources. With more available resources, accuracy is consistently improved. Our second learner, EfficientDet, with a different backbone, neck, and head, can learn information that Yolov5 cannot.
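The weighted fusion that Bi-FPN performs can be sketched with the "fast normalized fusion" rule described in the EfficientDet paper: each input feature map is weighted by a learnable non-negative scalar and the result is normalized by the sum of the weights. The feature maps and weights below are toy values, not learned ones:

```python
# Fast normalized fusion: O = sum_i(w_i * I_i) / (eps + sum_j(w_j)),
# with weights kept non-negative via ReLU, as described for Bi-FPN
# in the EfficientDet paper. Features here are flattened toy maps.

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-length feature maps with learnable, normalized weights."""
    w = [max(0.0, wi) for wi in weights]  # ReLU keeps each weight >= 0
    norm = eps + sum(w)
    return [sum(wi * f[k] for wi, f in zip(w, features)) / norm
            for k in range(len(features[0]))]

# Two toy flattened "feature maps"; the weights set their relative importance.
p_td = [1.0, 1.0, 1.0, 1.0]   # top-down pathway input
p_in = [3.0, 3.0, 3.0, 3.0]   # same-level lateral input
fused = fast_normalized_fusion([p_td, p_in], weights=[1.0, 1.0])
print(fused)  # each element ~2.0: equal weights reduce to a plain average
```

During training the weights are learned, so the network can emphasize whichever input level carries more useful fire evidence.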

Head. Similar to Yolov5, the data are first input to EfficientNet for feature extraction and then fed to Bi-FPN for feature fusion. Finally, the head outputs the detection results (class, score, location, and size).

EfficientNet
EfficientNet is an efficient network proposed by Google. It applies a novel model scaling strategy, namely the compound scaling method, to balance network depth, network width, and image resolution for better accuracy at a fixed resource budget. With this, EfficientNet outperformed other popular networks such as ResNet [36], DenseNet [37], and ResNeXt [38], achieving the highest Top-1 accuracy on the ImageNet image classification task.
The network architecture of EfficientNet is shown in Figure 6. We choose EfficientNet as our third learner because it achieves a superior trade-off between accuracy and efficiency. In our model, the third learner plays the most important role: it is responsible for learning the whole image to guide the detection, meaning that its decisions directly determine the final results. Meanwhile, it must be highly efficient; otherwise, it would slow down the entire model.

Our Model
In a real-world forest fire detection task, we need to handle different types of forest fires, such as ground fire, trunk fire, and canopy fire. These fires, influenced by the environment, are diverse in shape, texture, and even color, which makes it difficult for an individual learner to extract effective features. Through careful observation, we find that Yolov5 is better at learning long-area fires (Figure 7), but it sometimes misses objects (Figure 8). Meanwhile, even though EfficientDet is not sensitive to long-area fires (Figure 7), it is more careful than Yolov5, meaning that EfficientDet can make complementary detections (Figure 8). Therefore, we consider that integrating these two efficient learners with different specialties can improve detection accuracy.
Another issue is that the ability of the object detector is limited. It only learns the fire region, which is just a local pattern of the whole image, and ignores other information such as the background. As a result, the object detector may treat fire-like objects (e.g., the sun) as fires (Figure 9), thereby raising false alarms. Therefore, a good leader, EfficientNet, which has a full understanding of the whole image, is needed to guide the detection process.
To address the above two issues and make sure our model is robust to diverse scenarios, three deep learners are integrated to make decisions together (Figure 10). The first and second learners, Yolov5 and EfficientDet, act as object detectors, detecting fire locations in images by generating candidate boxes. Then, the non-maximum suppression algorithm [39] (Algorithm 1) is employed to eliminate redundant boxes, preserving the boxes with the top confidence. The third learner, EfficientNet, acts as a binary classifier responsible for learning the whole image to determine whether the image contains fire objects. Finally, the object detection results and the image classification results are sent to a decision strategy module: if the image is considered to contain fire objects, the object detection results are retained; otherwise, they are ignored.
In addition, integrating multiple learners does not affect the overall efficiency of the model, because the three learners are structurally independent, and the whole model is executed by multiple processes, meaning that each learner has a separate process responsible for it.
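The decision flow just described (two detectors, non-maximum suppression, classifier gate) can be sketched as follows. All function names, the box format, and the thresholds are illustrative assumptions for the sketch, not the paper's actual implementation:

```python
# Minimal sketch of the three-learner decision strategy: merge candidate
# boxes from two detectors, suppress redundant overlaps (NMS), and let a
# whole-image fire classifier decide whether to keep the detections.
# Boxes are (x1, y1, x2, y2, confidence) tuples; thresholds are illustrative.

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_thr=0.5):
    """Greedy non-maximum suppression: keep top-confidence boxes, drop overlaps."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept

def ensemble_decision(yolo_boxes, effdet_boxes, fire_prob, cls_thr=0.5):
    """Merge both detectors' boxes, NMS them, and gate on the classifier score."""
    if fire_prob < cls_thr:   # classifier says "no fire": discard all detections
        return []
    return nms(yolo_boxes + effdet_boxes)

# The two detectors find the same flame (overlapping boxes) plus one extra region.
y = [(10, 10, 50, 50, 0.9)]
e = [(12, 12, 52, 52, 0.8), (200, 80, 240, 120, 0.7)]
print(ensemble_decision(y, e, fire_prob=0.97))
# -> [(10, 10, 50, 50, 0.9), (200, 80, 240, 120, 0.7)]
```

The overlapping 0.8-confidence box is suppressed in favor of the 0.9 one, while the disjoint region survives; with a low classifier score the whole detection set would be discarded.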

Model Evaluation
We evaluate models using the Microsoft COCO criteria (Table 1), which are widely used in object detection tasks. However, fire is a special object, diverse in shape, texture, and color. The bounding box generated by an object detector may slightly differ from the ground truth (Figure 11), thereby influencing the calculation of average precision, even though the detector does identify the fire areas successfully. Therefore, to evaluate models more comprehensively, we introduce two additional evaluation metrics, namely frame accuracy (FA) and false positive rate (FPR). For one image, if the detector misses any fire object, we call it a frame false (FF), otherwise a frame true (FT). If the detector treats any fire-like object as fire, we call it a false positive (FP), otherwise a true positive (TP). Note that FA is calculated on the test set containing 476 forest images, and FPR is calculated on our challenging non-fire dataset containing 641 images with fire-like objects (e.g., the sun). The FA and FPR are calculated as Equations (1) and (2), respectively:

FA = FT / (FT + FF) × 100, (1)

FPR = FP / (FP + TP) × 100. (2)
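The two per-image metrics can be computed directly from the frame counts. The counts used below are made-up toy values, not the paper's measured results:

```python
# Frame accuracy (FA) and false positive rate (FPR) over per-image outcomes:
# FA is computed on fire images (FT = frames with no missed fire, FF = frames
# with a missed fire); FPR on non-fire images with fire-like objects.
# The example counts are illustrative, not the paper's results.

def frame_accuracy(ft: int, ff: int) -> float:
    """FA = FT / (FT + FF) * 100, evaluated on fire images."""
    return ft / (ft + ff) * 100

def false_positive_rate(fp: int, tp: int) -> float:
    """FPR = FP / (FP + TP) * 100, evaluated on fire-like non-fire images."""
    return fp / (fp + tp) * 100

print(frame_accuracy(ft=450, ff=26))      # 476 fire images -> ~94.5
print(false_positive_rate(fp=2, tp=639))  # 641 fire-like images -> ~0.3
```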

Training
We applied different strategies to train our three learners: Yolov5, EfficientDet, and EfficientNet. The object detectors, namely Yolov5 and EfficientDet, are trained with 2381 forest fire images and tested with 476 forest fire images. The image classifier, namely EfficientNet, is trained with 2381 forest fire images and 5804 non-fire images, and tested with 476 forest fire images and 1160 non-fire images. Note that the non-fire images contain normal images and images with fire-like objects (e.g., the sun). Each model is built with PyTorch [40] and trained on an NVIDIA RTX 2080 Ti. The details of our training strategy are shown in Table 2.

Comparison
We compare our model with typical one-stage object detectors. As shown in Table 3, even though Yolov5 and EfficientDet are the most powerful detectors in this task, their high false positive rates and missed detections cannot be ignored. By integrating them (2 learners), all evaluation metrics are significantly improved, but the false positive rate increases to 51.6%, since false positives come from both Yolov5 and EfficientDet. Under the guidance of our third learner, EfficientNet, the false positive rate is reduced to 0.3%. It is also worth mentioning that, after introducing the third learner, some metrics slightly decrease. This is because EfficientNet wrongly treats some fire images as non-fire ones and then ignores the object detection results, but we consider it worthwhile to sacrifice a tiny decrease in average precision and recall for a substantial improvement in the false positive rate. To sum up, our model (3 learners) is superior in AP 0.5 , AP S , AP M , AP L , AR 0.5 , AR S , AR M , AR L , FPR, and FA compared with other typical object detectors. These comprehensive improvements give the model better performance in detecting different types of forest fires: small-scale fires, medium-scale fires, large-scale fires, ground fires, trunk fires, canopy fires, and fires at night (Figures 12 and 13). Faced with fire-like objects (e.g., the sun), our model is not interfered with (Figure 14).

Discussion
Compared with other common objects that have fixed form, forest fire is a dynamic object [44]. In the real-world scenario, a forest fire usually starts from small-scale fire, develops to medium-scale fire, and then to big-scale fire [45]. In terms of types, it starts from ground fire, then spreads to the trunk, and finally to the canopy [46]. The various shapes, sizes, textures, and colors of forest fires make the fire evolution a complex process, and bring great difficulty in fire detection.
Therefore, it is highly imperative for detectors to be sensitive to different types of fires. Through careful experimental comparisons, we find that no single detector can handle all kinds of fires; they have respective advantages and disadvantages. Yolov5 is excellent at detecting long-area fires (Figure 7), but it frequently misses objects (Figure 8). EfficientDet is a more careful detector than Yolov5; even though it performs badly on long-area fires (Figure 7), it can detect fires that Yolov5 cannot (Figure 8), meaning that it is a good partner for Yolov5. Our model, which efficiently integrates the decisions of these two powerful learners, boosts detection performance by 2.5-10.9% in terms of AP 0.5 , AP S , AP M , AP L , AR 0.5 , AR S , AR M , and AR L . The significant improvements in average precision and average recall for small, medium, and large objects make the model more sensitive to the size changes of fires, thereby enhancing detection performance on the different types of forest fires in the fire evolution: ground fire, trunk fire, canopy fire, and fires at night (Figures 12 and 13).
Another problem is that the false positive rate of the improved model (2 learners) becomes higher, rising from 22.6% to 51.6%, since the model also integrates wrong detection results from both learners. To address this issue, we use 8185 images, containing 2381 forest fire images and 5804 non-fire images (fire-like images and normal forest images), to train our third learner, EfficientNet. A sufficient training set enables EfficientNet to show good discriminability between fire objects and fire-like objects, with 99.6% accuracy on 476 fire images and 99.7% accuracy on 676 fire-like images. With the help of the leader learner EfficientNet, wrong detection results are eliminated, and the false positive rate is significantly decreased to 0.3% (Figure 14). Noticeably, the addition of EfficientNet reduces AP 0.5 , AP M , AP L , AR 0.5 , AR M , and AR L by roughly 1%, because EfficientNet wrongly ignores 2 fire images containing medium-scale and large-scale fire objects.
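The jump from 22.6% to 51.6% is roughly what one would expect from OR-combining two detectors whose false positives overlap little. A back-of-the-envelope sketch: the 22.6% figure is quoted from the text, while the second detector's rate is a hypothetical value chosen only to illustrate the effect:

```python
# If two detectors' false positives were statistically independent, an
# OR-combination fires on a non-fire image whenever either detector does,
# giving a combined rate of p1 + p2 - p1*p2 (the union of the two events).
# p1 = 0.226 is from the text; p2 = 0.37 is a hypothetical illustration.

def union_fpr(p1: float, p2: float) -> float:
    """False positive rate of an OR-combination with independent errors."""
    return p1 + p2 - p1 * p2

print(round(union_fpr(0.226, 0.37), 3))  # -> 0.512, the same ballpark as 51.6%
```

This is why a union of detectors needs a global gate such as the EfficientNet classifier: the union inherits the false positives of both members.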
In terms of latency, the Yolo family is superior to EfficientDet and SSD. Excellent inference speed makes the Yolo family widely used in real-world applications, but experimental results show that it cannot achieve satisfactory performance on the forest fire detection task. The latency of EfficientDet is 65.6 ms, over twice that of Yolov5 (28.0 ms), but EfficientDet outperforms Yolov5 by over 5% in detection performance. We ensemble the three learners Yolov5 (28.0 ms), EfficientDet (65.6 ms), and EfficientNet (31.3 ms) in parallel to make sure that our model achieves the best performance without any extra latency. The final latency of our model (3 learners) is 66.8 ms, which shows that an excellent trade-off between detection performance and efficiency has been achieved, and that the model is applicable to real-time detection tasks.
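These latency figures can be sanity-checked: with each learner in its own process, end-to-end latency should approach the slowest branch rather than the sum of all three. A small sketch using the per-learner latencies reported above:

```python
# Parallel-ensemble latency check: three structurally independent learners,
# each in its own process, bound the total latency by the slowest branch
# plus fusion overhead, not by the sum. Per-learner values are from the text.

latencies_ms = {"Yolov5": 28.0, "EfficientDet": 65.6, "EfficientNet": 31.3}

sequential = sum(latencies_ms.values())      # cost if run one after another
parallel_bound = max(latencies_ms.values())  # lower bound when run in parallel
reported = 66.8                              # the ensemble's measured latency

print(round(sequential, 1), parallel_bound, round(reported - parallel_bound, 1))
# -> 124.9 65.6 1.2  (observed overhead over the slowest branch is ~1.2 ms)
```

The measured 66.8 ms sits just above the 65.6 ms bound set by EfficientDet, consistent with the claim that the ensemble adds no meaningful extra latency over its slowest member.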
For further improvement, we plan to study the labeling strategy for forest fires, since the quality of training data directly determines detection performance. Another interesting extension is to investigate the network architectures of the backbones and modify them so that they are specially designed for the forest fire detection task. Additionally, we will work on developing a forest fire tracking system that can classify different types of forest fires (ground fire, trunk fire, and canopy fire) to track the evolution and spread of forest fires.

Conclusions
The successful application of convolutional neural networks significantly improves the performance of object detection. However, a forest fire is a dynamic object with no fixed form, which an individual object detector cannot handle. In addition, object detectors are easily deceived by fire-like objects and generate false positives due to their limited visual field. To address these problems, a novel ensemble learning method for real-time forest fire detection is proposed in this paper. Two powerful object detectors (Yolov5 and EfficientDet) with different expertise are integrated to make the whole model more robust to diverse forest fire scenarios. Then, a leader (EfficientNet) is introduced to guide the detection process and reduce false positives. Experimental results show that, compared with other popular object detectors, our model achieves a superior trade-off among average precision, average recall, false positive rate, frame accuracy, and latency. These significant improvements make it possible for the model to perform well in real-world forestry applications.
Author Contributions: R.X. devised the programs and drafted the initial manuscript. H.L. and K.L. helped with data collection, data analysis, and figures and tables. L.C. contributed to fund acquisition and writing embellishment. Y.L. designed the project and revised the manuscript. All authors have read and agreed to the published version of the manuscript.