Remote Sensing
  • Article
  • Open Access

5 September 2021

Wildfire Segmentation Using Deep Vision Transformers

1 Perception, Robotics and Intelligent Machines Research Group (PRIME), Department of Computer Science, Université de Moncton, 18 Antonine-Maillet Ave, Moncton, NB E1A 3E9, Canada
2 SERCOM Laboratory, Ecole Polytechnique de Tunisie, Université de Carthage, La Marsa 77-1054, Tunisia
3 Telnet Innovation Labs, Telnet Holding, Parc Elghazela des Technologies de la Communication, Ariana 2088, Tunisia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Data Mining in Multi-Platform Remote Sensing

Abstract

In this paper, we address the problem of early forest fire detection and segmentation in order to predict fire spread and support firefighting. Techniques based on Convolutional Neural Networks are the most widely used and have proven efficient at this task. However, they remain limited in modeling the long-range relationships between objects in an image, due to the intrinsic locality of convolution operators. To overcome this drawback, Transformers, originally designed for sequence-to-sequence prediction, have emerged as alternative architectures; they use the self-attention mechanism to capture global dependencies between input and output sequences. In this context, we present the first study exploring the potential of vision Transformers for forest fire segmentation. Two vision-based Transformers are used, TransUNet and MedT. We design two frameworks based on these image Transformers, adapt them to our complex, unstructured environment, evaluate them with varying backbones, and optimize them for forest fire segmentation. Extensive evaluations of both frameworks revealed performance superior to current methods. The proposed approaches achieved state-of-the-art performance, with an F1-score of 97.7% for the TransUNet architecture and 96.0% for the MedT architecture. Analysis of the results showed that these models reduce fire pixel misclassifications thanks to the extraction of both global and local features, which provide finer detection of the fire's shape.

1. Introduction

Statistically, forest fires and arson cause severe damage. They lead to human and financial losses, the death of animals, and the destruction of forests and houses. Fires affect 350 million to 450 million hectares every year [1]. Thus, many researchers have focused on mitigating this impact by developing systems for fire detection at an early stage.
The first fire detection systems employed numerous fire sensing technologies such as gas, flame, heat, and smoke detectors. While these systems have managed to detect fires, they face limitations related to coverage area, false alarms, and slow response times [2]. Fortunately, these problems were partially solved by vision sensors that detect visual features of fire such as the shape, color, and dynamic texture of flames.
In recent years, Deep Learning (DL) approaches have been proposed to replace hand-crafted techniques in computer vision applications. They have shown impressive results in various tasks such as autonomous driving [3], pedestrian detection [4], and video surveillance [5,6]. DL approaches are used in forest fire segmentation to extract the geometrical characteristics of the fire, such as its height, width, and angle. These models, especially Convolutional Neural Networks (ConvNets), have also been successfully employed to predict and detect fire boundaries as well as to identify and segment each fire pixel [7,8]. Their impressive results help to develop metrology tools, which can be used in fire behavior modeling as well as providing the necessary inputs to mathematical propagation models. Nonetheless, due to the intrinsic locality of convolution operators, ConvNets remain limited when modeling long-range relationships between elements in the image. To overcome this problem, vision Transformers, designed for sequence-to-sequence prediction, have been explored. Based on the self-attention mechanism, they capture global dependencies between input and output sequences. Transformers have had considerable success in machine translation [9] and Natural Language Processing (NLP) [10,11,12]. This explains the many attempts by researchers to adapt them to image recognition tasks such as object detection [13], text-to-image synthesis [14], image super-resolution [15], and medical imaging [12,16,17]. Indeed, these models have proven robust and accurate, outperforming ConvNets in several applications [18].
In this paper, we present the first study exploring the potential of Transformers for forest fire segmentation using visible spectrum images. Two vision-based Transformers are considered: TransUNet [17] and MedT [16]. To exploit their strengths, these models were adapted to our problem. We used the Dice loss [19] as the loss function and the CorsicanFire dataset [20]. We show the feasibility of Transformers for fire segmentation under different settings, varying the input size and the backbone, which is either a pure Transformer or a hybrid combining a Convolutional Neural Network (CNN) and a Transformer. Then, we compare the two Transformer-based approaches against state-of-the-art models: U-Net [21], a color space fusion method [22], U²-Net [23], and EfficientSeg [24], which have provided excellent results for object segmentation.
The remainder of this paper is organized as follows: related works are reviewed in Section 2; the employed models are presented in Section 3; experimental results are presented in Section 4 and discussed in Section 5; finally, Section 6 draws the conclusions of this study.

3. Methods

In this section, we describe the models employed for our wildfire segmentation task, the training dataset, and the evaluation metrics used in this work.
Vision Transformers, designed for sequence-to-sequence prediction, have emerged as alternative architectures to ConvNet models, which show limitations in explicitly modeling long-range dependencies. In the previous section, we reviewed some of the most used vision Transformers. Two of them, TransUNet [17] and MedT [16], have shown interesting performance in medical image segmentation. Here, we analyze these two vision Transformers and explore their performance when applied to our problem, in what is, to our knowledge, the first study using vision Transformers for the forest fire segmentation task.
Furthermore, we present U²-Net [23] and EfficientSeg [24], two models used to carry out a comparative study with the two aforementioned Transformers. It is worth mentioning that both U²-Net and EfficientSeg have achieved excellent performance, outperforming state-of-the-art object segmentation methods based on Convolutional Neural Networks [23,24]. All these models are fed with RGB images and segment fire pixels to detect the exact shape of fire areas. The result is a binary mask, which defines the segmented forest fire area in the input image.

3.1. TransUNet

TransUNet [17] is a hybrid CNN-Transformer model, which integrates both a Transformer and the U-Net network. It combines the high-resolution local features extracted by a CNN with the global information encoded by the Transformer.
This model employs a hybrid CNN-Transformer as its encoder. The CNN first extracts feature maps. Then, patch embedding is applied and positional information is encoded. The Transformer encoder contains twelve Transformer layers, each including a normalization layer, a Multi-Layer Perceptron (MLP), and Multihead Self-Attention (MSA). The skip connections from the encoder and the output of the Transformer feed the decoder, which consists of multiple 3×3 convolutional layers, upsampling operators, and ReLU activations. All feature extractors were pretrained on the very large ImageNet dataset [95] [17]. Figure 1 illustrates the TransUNet architecture.
Figure 1. The proposed TransUNet architecture.
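Following the standard pre-norm formulation used by ViT-style encoders, on which TransUNet builds [17], the computation of Transformer layer ℓ (with LN denoting layer normalization and z the token sequence) can be sketched as:

```latex
z'_{\ell} = \mathrm{MSA}\big(\mathrm{LN}(z_{\ell-1})\big) + z_{\ell-1},
\qquad
z_{\ell} = \mathrm{MLP}\big(\mathrm{LN}(z'_{\ell})\big) + z'_{\ell},
\qquad \ell = 1, \dots, 12.
```

Each layer thus refines the token sequence through an attention sublayer and an MLP sublayer, both wrapped in residual connections; the exact details follow [17].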

3.2. Medical Transformer (MedT)

The majority of vision Transformers require large-scale training data for good performance. MedT [16] was proposed to avoid this problem. This Transformer adopts two concepts: a Local-Global (LoGo) training strategy and a gated axial-attention Transformer layer. The LoGo methodology is made up of two branches, a global branch and a local branch. The first operates on the original resolution of the input image; the second operates on patches of the input image.
Figure 2 shows the MedT architecture. First, a feature map is extracted from the input image using two convolutional blocks. Each block contains three convolutional layers, a batch normalization layer, and a ReLU activation. The extracted feature map then feeds the local and global branches, which include five and two Encoder-Decoder blocks, respectively, connected by skip connections. The encoder consists of a 1×1 convolutional layer, a normalization layer, and two multi-head attention layers, which operate along the height and width axes, respectively. The decoder also contains a convolutional layer, a batch normalization layer, and a ReLU activation.
Figure 2. The proposed MedT architecture.

3.3. U²-Net Architecture

U²-Net [23] is a deep network architecture. It is a two-level nested U-structure, which employs Residual U-blocks (RSU) to extract multi-scale information. An RSU block consists of a convolutional layer, a residual connection, and a U-Net-like structure containing convolutional layers, batch normalization layers, and ReLU activation functions to extract multi-scale features. U²-Net consists of six encoders, five decoders, and a saliency map fusion block. Each encoder and decoder stage is built from an RSU block. The saliency map fusion block contains a Sigmoid function and a 3×3 convolutional layer [23]. Figure 3 presents the U²-Net architecture.
Figure 3. The proposed U²-Net architecture.

3.4. EfficientSeg

EfficientSeg [24] is a modified, scalable model based on the U-Net structure and MobileNetV3 [96] blocks. It has an Encoder-Decoder structure and has been shown to outperform the U-Net model [24].
As presented in Figure 4, the EfficientSeg architecture consists of different blocks: 3×3 and 1×1 convolutional layers, each followed by a batch normalization layer and a ReLU activation function; Inverted Residual Blocks; four shortcut connections between encoder and decoder; and downsampling and upsampling operations.
Figure 4. The proposed EfficientSeg architecture.

3.5. Dataset

For the fire segmentation problem, only a limited number of datasets are available. We use the CorsicanFire dataset [20] to train and evaluate the proposed models. The CorsicanFire dataset contains RGB and NIR (near-infrared) images. In this dataset, NIR images are captured with a longer integration/exposure time, which increases the brightness of the fire area and makes segmentation easier with simple image processing techniques. Still, the obtained shape is less precise than with visible spectrum images, since it integrates the fire area over the full exposure time (the fire shape covers a larger area in the image). In this work, we are interested in the RGB images, which form the larger part of the dataset and are widely used for capturing fires. Our dataset consists of 1135 images and their corresponding masks. It covers the visual variability of fire pixels, such as color (red, orange, and white-yellow), different weather conditions, brightness, distance to the fire, and the presence of smoke. Figure 5 depicts samples of the CorsicanFire dataset and their corresponding masks.
Figure 5. Examples from the CorsicanFire dataset. From (top) to (bottom): RGB images and their corresponding masks.

3.6. Evaluation Metrics

We evaluate the proposed approaches using F1-score and inference time.
  • F1-score combines the recall and precision metrics to measure the performance of the model (Equation (1)):

    F1-score = (2 × Precision × Recall) / (Precision + Recall)    (1)
    Precision = TP / (TP + FP)    (2)
    Recall = TP / (TP + FN)    (3)

    where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively.
  • Inference time is defined as the average segmentation time per image on our test data.
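As an illustration, the metrics above can be computed from flat binary masks with a few lines of Python (a minimal sketch; the function name is ours, not from the paper):

```python
def precision_recall_f1(pred, truth):
    """Compute precision, recall, and F1-score from two flat binary masks.

    pred and truth are equal-length sequences of 0/1 pixel labels,
    where 1 marks a fire pixel.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In practice, the two masks are the flattened model prediction and ground truth for each test image, and the scores are averaged over the test set.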

4. Experiments and Results

In this section, we present the experimental settings and implementation details, then report the experimental results.

4.1. Experimental Results

The proposed Transformers were implemented using PyTorch [97]. Training and testing were performed on a machine with an NVIDIA GeForce RTX 2080Ti GPU. The data were split as follows: 815 images for training, 111 images for validation, and 209 images for testing. For all experiments, we adopted as the loss function the Dice loss [19], which maximizes the overlap between the predicted mask and the ground truth mask, a patch size p of 16, and simple data augmentation, namely rotations of 20 degrees and horizontal flips.
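The 815/111/209 split can be reproduced with a simple seeded shuffle; the sketch below is ours (the seed value and function name are assumptions, not from the paper):

```python
import random

def split_dataset(items, n_train=815, n_val=111, n_test=209, seed=42):
    """Shuffle and split a list of samples into train/val/test subsets."""
    assert len(items) == n_train + n_val + n_test  # 1135 images in total
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```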
Dice loss computes the similarity between the predicted mask and the ground truth mask (Equation (4)):
Dice loss = 1 − (2 |Y ∩ X|) / (|Y| + |X|),    (4)
where X is the set of predicted fire pixels, Y is the set of ground truth fire pixels, and Y ∩ X is their intersection.
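A minimal pure-Python sketch of the soft form of Equation (4), where per-pixel probabilities play the role of X (our illustration; training uses the generalized Dice loss of [19] implemented in PyTorch):

```python
def dice_loss(pred, truth, eps=1e-6):
    """Soft Dice loss between a predicted mask and a ground truth mask.

    pred holds per-pixel fire probabilities in [0, 1]; truth holds
    binary 0/1 labels. Returns 1 - 2|Y ∩ X| / (|Y| + |X|).
    """
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # eps guards against division by zero on empty masks
    return 1.0 - 2.0 * intersection / (total + eps)
```

The loss approaches 0 for a perfect overlap and 1 when prediction and ground truth are disjoint.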
The TransUNet Transformer was tested using a learning rate of 10⁻³, input resolutions of 224×224 and 512×512, and two backbones, the pure Transformer ViT and the hybrid CNN-Transformer ResNet50-ViT, pretrained on the large-scale ImageNet dataset [95].
The MedT Transformer was trained from scratch (no pretraining) using a learning rate of 10⁻², a hybrid ConvNet-Transformer as a backbone, and input resolutions of 224×224 and 256×256.
To evaluate the performance of the proposed Transformers, we first compared the F1-score values of the two Transformers, TransUNet and MedT, varying the backbone, the input resolution, and the size of the dataset. Then, we compared their results with various models: U²-Net, EfficientSeg, U-Net [21], and a color space fusion method [22].

4.1.1. Quantitative Results

The performances of both Transformers (TransUNet and MedT) are reported in Table 2. The Transformers reach F1-scores of 97.7% and 96.0%, respectively, demonstrating accurate and robust detection and segmentation of fire pixels. These models segment wildfire pixels well thanks to the use of both global and local features, providing finer details of the fire.
Table 2. Quantitative results of TransUNet and MedT on CorsicanFire dataset.
The Transformers with a hybrid backbone, TransUNet-Res50-ViT and MedT, extract more detail from the input image thanks to the spatial and local information captured by the ConvNet. These models outperform TransUNet-ViT, which uses a pure Transformer as its backbone.
Changing the input resolution from 224×224 to 256×256 or 512×512 yields some improvement in the F1-score (between 0.2% and 0.7%), due to the larger number of input patches. However, more computational capacity and time are required during training.
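The token-count arithmetic behind this observation follows from the patch size p = 16 used in our experiments:

```python
def num_patches(height, width, patch_size=16):
    """Number of non-overlapping patches (Transformer tokens) per image."""
    return (height // patch_size) * (width // patch_size)
```

Moving from 224×224 (196 tokens) to 512×512 (1024 tokens) greatly increases the sequence length, and since standard self-attention cost grows quadratically with the number of tokens, training compute rises accordingly.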
Using a LoGo methodology for learning, MedT extracts high level and finer features. This Transformer presents a great performance and proves its efficiency in segmenting fire pixels without pre-training.
TransUNet with a hybrid backbone obtains a better performance than MedT and proves its excellent ability to localize and segment forest fire pixels. However, it still depends on a pre-trained backbone on a very large dataset.
Table 3 presents a comparative analysis of TransUNet, MedT, U-Net, the color fusion method, U²-Net, and EfficientSeg, in terms of F1-score, on the CorsicanFire dataset. The results of both Transformers, TransUNet and MedT, are better than those of the deep CNN models (U-Net, U²-Net, and EfficientSeg) and the classical machine learning model (the color fusion method).
Table 3. Comparative analysis of TransUNet, MedT, U-Net, the color fusion method, U²-Net, and EfficientSeg on the CorsicanFire dataset with an image size of 224×224.
TransUNet-Res50-ViT and MedT obtain the best F1-scores of 97.5% and 95.5%, respectively. They outperform the color fusion method and the convolutional networks U-Net, EfficientSeg, and U²-Net. EfficientSeg shows a strong result compared to the color fusion method, U-Net, and U²-Net. However, it is more limited than TransUNet and MedT in modeling global information, and its inference time of 2 s is higher than that of TransUNet. U-Net trained with 1135 images outperforms U-Net trained with 419 images, the color fusion method, and U²-Net, thanks to its diverse feature maps. It also shows the best inference time, 0.02 s. U²-Net obtains an F1-score of 82.93%, which is better than the color fusion method but lower than the ConvNet models (U-Net and EfficientSeg) and the Transformers (TransUNet and MedT).

4.1.2. Qualitative Results

Similar to the quantitative results presented in Table 3, we can see in Figure 6, that TransUNet, with Res50-ViT as a backbone, segments fire pixels even better than manual annotation. This Transformer can correctly distinguish between fire and background under different conditions such as the presence of smoke, different weather conditions, and various brightnesses of the environment. It also proves its better ability to identify small fire areas and detect the precise shape of a fire.
Figure 6. Results of TransUNet-Res50-ViT. From (top) to (bottom): RGB images, their corresponding mask, and the predicted images of TransUNet-Res50-ViT.
Figure 7 depicts the results of TransUNet with the pure Transformer ViT as backbone. We can see that this Transformer correctly segments fire pixels under various conditions. However, it misclassifies fire border pixels (highlighted by red boxes) and fails to capture the exact shape of the fire area.
Figure 7. Results of TransUNet-ViT. From (top) to (bottom): RGB images, their corresponding mask, and the predicted images of TransUNet-ViT.
Figure 8 shows examples of the segmentation of MedT. Similar to TransUNet-Res50-ViT, we can see that MedT efficiently segments fire pixels and detects the precise shape of the fire. However, it still misclassifies some small fire areas (highlighted by red boxes).
Figure 8. Results of MedT. From (top) to (bottom): RGB images, their corresponding mask, and the predicted images of MedT.
The results of the U-Net model are depicted in Figure 9. We can see that the model correctly segments fire pixels and identifies the shape of a flame. However, it still misclassifies some small areas.
Figure 9. Results of U-Net. From (top) to (bottom): RGB images, their corresponding mask, and the predicted images of U-Net.
Figure 10 presents the results of the U²-Net model. We can see that this model does not identify the precise shape of the fire and misses some small fire areas.
Figure 10. Results of U²-Net. From (top) to (bottom): RGB images, their corresponding mask, and the predicted images of U²-Net.
Figure 11 illustrates some EfficientSeg results. We can see that EfficientSeg shows an excellent performance in segmenting fire pixels under different conditions, similar to Transformers TransUNet-Res50-ViT and MedT. It also correctly identifies the small areas of fire and their precise shapes.
Figure 11. Results of EfficientSeg. From (top) to (bottom): RGB images, their corresponding mask, and the predicted images of EfficientSeg.
In addition, we evaluated TransUNet and MedT using images downloaded from the web. We can see in Figure 12 that TransUNet-Res50-ViT accurately segments fire pixels and detects the precise shape of the fire under various conditions, such as in the presence of smoke. It shows better visual results than TransUNet-ViT, which fails to detect small fire areas. MedT also performs well in identifying fire pixels, although it misclassifies some small fire areas.
Figure 12. Results of TransUNet and MedT using web images. From (top) to (bottom): real RGB images, TransUNet-Res50-ViT results, TransUNet-ViT results, and MedT results.

5. Discussion

The Transformer techniques, TransUNet-Res50-ViT, TransUNet-ViT, and MedT, achieve excellent performance compared to the deep CNNs U²-Net, U-Net, and EfficientSeg, and to the classical machine learning method (the color fusion method), thanks to their rich extracted feature maps. However, they need higher inference times; for example, TransUNet-Res50-ViT and MedT require 1.41 and 2.72 s, respectively. TransUNet-Res50-ViT achieves the best F1-score, 97.7%, thanks to the use of both global and local features. This Transformer shows an excellent ability to localize and segment forest fire pixels as well as small fire areas. It correctly distinguishes between forest fire and background under different conditions, such as in the presence of smoke and in varying weather. However, it still depends on a backbone pretrained on a very large dataset, which requires substantial computational capacity and time during training.
MedT achieves an F1-score of 96%, better than TransUNet-ViT, thanks to the spatial and local information extracted by its convolutional layers. This model segments fire pixels and detects the precise shape of the fire without pretraining. However, it misses some small fire areas, and it requires more computational capacity and time during training with high-resolution inputs. TransUNet-ViT extracts fire features using the pure Transformer ViT. This model performs very well compared to the deep CNNs U-Net, U²-Net, and EfficientSeg, and obtains a low inference time of 0.13 s. It also segments forest fire pixels under various conditions, but it misdetects the border pixels of the fire and the exact shape of the fire area. EfficientSeg, U-Net, and U²-Net show their ability to segment forest fire pixels and detect the precise shape of the fire, but they still misclassify some small fire areas.
To conclude, the vision Transformers TransUNet and MedT show an excellent ability to segment forest fires and detect the shape of the fire front, outperforming current state-of-the-art architectures. TransUNet-Res50-ViT and MedT, which adopt a hybrid ConvNet-Transformer backbone, identify the fine details of forest fires. They prove promising for segmenting fire pixels under various conditions, such as in the presence of smoke and with changes in the brightness of the environment.

6. Conclusions

In this paper, we propose a new approach based on vision Transformers for forest fire segmentation. We explore two Transformers, TransUNet and MedT, adapted to segment and identify fire pixels using the CorsicanFire dataset. We evaluate the performance of the proposed Transformers by varying backbones and input sizes. Then, we present a comparative analysis of the two Transformers against current state-of-the-art models: U-Net, a color space fusion method, U²-Net, and EfficientSeg. TransUNet and MedT with a hybrid CNN-Transformer backbone outperformed the state-of-the-art methods and showed an excellent ability to segment fire pixels and identify the precise shape of a fire under different conditions, such as varying brightness and the presence of smoke. For future work, we first aim to adapt our Transformer-based algorithms to detect and track wildfires in videos by exploiting the spatiotemporal features of fire. We will also evaluate our models on the detection and segmentation of both smoke and fire pixels in urban environments.

Author Contributions

Conceptualization, M.A.A. and R.G.; methodology, R.G. and M.A.A.; software, R.G.; validation, R.G. and M.A.A.; formal analysis, R.G., M.A.A., M.J. and W.S.M.; writing—original draft preparation, R.G.; writing—review and editing, M.A.A., M.J., W.S.M. and R.A.; funding acquisition, M.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was enabled in part by support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference number RGPIN-2018-06233.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

This work uses a publicly available dataset CorsicanFire, see reference [20] for data availability. More details about the data are available under Section 3.5.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dimitropoulos, S. Fighting fire with science. Nature 2019, 576, 328–329. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Gaur, A.; Singh, A.; Kumar, A.; Kulkarni, K.S.; Lala, S.; Kapoor, K.; Srivastava, V.; Kumar, A.; Mukhopadhyay, S.C. Fire Sensing Technologies: A Review. IEEE Sens. J. 2019, 19, 3191–3202. [Google Scholar] [CrossRef]
  3. Kuutti, S.; Bowden, R.; Jin, Y.; Barber, P.; Fallah, S. A Survey of Deep Learning Applications to Autonomous Vehicle Control. IEEE Trans. Intell. Transp. Syst. 2021, 22, 712–733. [Google Scholar] [CrossRef]
  4. Tian, Y.; Luo, P.; Wang, X.; Tang, X. Deep Learning Strong Parts for Pedestrian Detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1904–1912. [Google Scholar]
  5. Pérez-Hernández, F.; Tabik, S.; Lamas, A.; Olmos, R.; Fujita, H.; Herrera, F. Object Detection Binary Classifiers methodology based on deep learning to identify small objects handled similarly: Application in video surveillance. Knowl.-Based Syst. 2020, 194, 105590. [Google Scholar] [CrossRef]
  6. Nawaratne, R.; Alahakoon, D.; De Silva, D.; Yu, X. Spatiotemporal Anomaly Detection Using Deep Learning for Real-Time Video Surveillance. IEEE Trans. Ind. Inform. 2020, 16, 393–402. [Google Scholar] [CrossRef]
  7. Gaur, A.; Singh, A.; Kumar, A.; Kumar, A.; Kapoor, K. Video flame and smoke based fire detection algorithms: A literature review. Fire Technol. 2020, 56, 1943–1980. [Google Scholar] [CrossRef]
  8. Ghali, R.; Jmal, M.; Souidene Mseddi, W.; Attia, R. Recent Advances in Fire Detection and Monitoring Systems: A Review. In Proceedings of the 18th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18), Genoa, Italy, 20–22 December 2018; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; Volume 1, pp. 332–340. [Google Scholar]
  9. Ott, M.; Edunov, S.; Grangier, D.; Auli, M. Scaling Neural Machine Translation. arXiv 2018, arXiv:1806.00187. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  11. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  12. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Visual Transformer. arXiv 2020, arXiv:2012.12556. [Google Scholar]
  13. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Computer Vision—ECCV; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  14. Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin, D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; et al. CogView: Mastering Text-to-Image Generation via Transformers. arXiv 2021, arXiv:2105.13290. [Google Scholar]
  15. Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning Texture Transformer Network for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 5791–5800. [Google Scholar]
  16. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical Transformer: Gated Axial-Attention for Medical Image Segmentation. arXiv 2021, arXiv:2102.10662. [Google Scholar]
  17. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  18. Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in Vision: A Survey. arXiv 2021, arXiv:2101.01169. [Google Scholar]
  19. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer International Publishing: Cham, Switzerland, 2017; pp. 240–248. [Google Scholar]
  20. Toulouse, T.; Rossi, L.; Campana, A.; Celik, T.; Akhloufi, M.A. Computer vision for wildfire research: An evolving image dataset for processing and analysis. Fire Saf. J. 2017, 92, 188–194. [Google Scholar] [CrossRef] [Green Version]
  21. Akhloufi, M.A.; Tokime, R.B.; Elassady, H. Wildland fires detection and segmentation using deep learning. Pattern recognition and tracking xxix. Int. Soc. Opt. Photonics Proc. SPIE 2018, 10649, 106490B. [Google Scholar] [CrossRef]
  22. Dzigal, D.; Akagic, A.; Buza, E.; Brdjanin, A.; Dardagan, N. Forest Fire Detection based on Color Spaces Combination. In Proceedings of the 11th International Conference on Electrical and Electronics Engineering (ELECO), Bursa, Turkey, 28–30 November 2019; pp. 595–599. [Google Scholar] [CrossRef]
  23. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  24. Yesilkaynak, V.B.; Sahin, Y.H.; Unal, G.B. EfficientSeg: An Efficient Semantic Segmentation Network. arXiv 2020, arXiv:2009.06469. [Google Scholar]
  25. Horng, W.B.; Peng, J.W.; Chen, C.Y. A new image-based real-time flame detection method using color analysis. In Proceedings of the IEEE Networking, Sensing and Control, Tucson, AZ, USA, 19–22 March 2005; pp. 100–105. [Google Scholar] [CrossRef]
  26. Çelik, T.; Demirel, H. Fire detection in video sequences using a generic color model. Fire Saf. J. 2009, 44, 147–158. [Google Scholar] [CrossRef]
  27. Chen, T.H.; Wu, P.H.; Chiou, Y.C. An early fire-detection method based on image processing. In Proceedings of the International Conference on Image Processing, Singapore, 24–27 October 2004; Volume 3, pp. 1707–1710. [Google Scholar] [CrossRef]
  28. Collumeau, J.F.; Laurent, H.; Hafiane, A.; Chetehouna, K. Fire scene segmentations for forest fire characterization: A comparative study. In Proceedings of the 18th IEEE International Conference on Image Processing, Brussels, Belgium, 11–14 September 2011; pp. 2973–2976. [Google Scholar] [CrossRef]
  29. Chino, D.Y.T.; Avalhais, L.P.S.; Rodrigues, J.F.; Traina, A.J.M. BoWFire: Detection of Fire in Still Images by Integrating Pixel Color and Texture Analysis. In Proceedings of the 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 6–29 August 2015; pp. 95–102. [Google Scholar] [CrossRef] [Green Version]
30. Chen, J.; He, Y.; Wang, J. Multi-feature fusion based fast video flame detection. Build. Environ. 2010, 45, 1113–1122.
31. Jamali, M.; Karimi, N.; Samavi, S. Saliency Based Fire Detection Using Texture and Color Features. In Proceedings of the 28th Iranian Conference on Electrical Engineering (ICEE), Tabriz, Iran, 4–6 August 2020; pp. 1–5.
32. Ko, B.C.; Cheong, K.H.; Nam, J.Y. Fire detection based on vision sensor and support vector machines. Fire Saf. J. 2009, 44, 322–329.
33. Foggia, P.; Saggese, A.; Vento, M. Real-Time Fire Detection for Video-Surveillance Applications Using a Combination of Experts Based on Color, Shape, and Motion. IEEE Trans. Circuits Syst. Video Technol. 2015, 25, 1545–1556.
34. Khondaker, A.; Khandaker, A.; Uddin, J. Computer Vision-based Early Fire Detection Using Enhanced Chromatic Segmentation and Optical Flow Analysis Technique. Int. Arab J. Inf. Technol. (IAJIT) 2020, 17, 947–953.
35. Emmy Prema, C.; Vinsley, S.S.; Suresh, S. Efficient Flame Detection Based on Static and Dynamic Texture Analysis in Forest Fire Detection. Fire Technol. 2018, 54, 255–288.
36. Wang, T.; Shi, L.; Yuan, P.; Bu, L.; Hou, X. A new fire detection method based on flame color dispersion and similarity in consecutive frames. In Proceedings of the Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 151–156.
37. Ajith, M.; Martínez-Ramón, M. Unsupervised Segmentation of Fire and Smoke From Infra-Red Videos. IEEE Access 2019, 7, 182381–182394.
38. Gonzalez, A.; Zuniga, M.D.; Nikulin, C.; Carvajal, G.; Cardenas, D.G.; Pedraza, M.A.; Fernandez, C.A.; Munoz, R.I.; Castro, N.A.; Rosales, B.F.; et al. Accurate fire detection through fully convolutional network. In Proceedings of the 7th Latin American Conference on Networked and Electronic Media (LACNEM), Valparaiso, Chile, 6–7 November 2017; pp. 1–6.
39. Dang-Ngoc, H.; Nguyen-Trung, H. Evaluation of Forest Fire Detection Model using Video captured by UAVs. In Proceedings of the 19th International Symposium on Communications and Information Technologies (ISCIT), Ho Chi Minh City, Vietnam, 25–27 September 2019; pp. 513–518.
40. Muhammad, K.; Ahmad, J.; Lv, Z.; Bellavista, P.; Yang, P.; Baik, S.W. Efficient Deep CNN-Based Fire Detection and Localization in Video Surveillance Applications. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 1419–1434.
41. Mlích, J.; Koplík, K.; Hradiš, M.; Zemčík, P. Fire Segmentation in Still Images. In Advanced Concepts for Intelligent Vision Systems (ACIVS); Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 27–37.
42. Harkat, H.; Nascimento, J.; Bernardino, A. Fire segmentation using a DeepLabv3+ architecture. In Image and Signal Processing for Remote Sensing XXVI; Proc. SPIE 2020, 11533, pp. 134–145.
43. Bochkov, V.S.; Kataeva, L.Y. wUUNet: Advanced Fully Convolutional Neural Network for Multiclass Fire Segmentation. Symmetry 2021, 13, 98.
44. Li, P.; Zhao, W. Image fire detection algorithms based on convolutional neural networks. Case Stud. Therm. Eng. 2020, 19, 100625.
45. Xu, R.; Lin, H.; Lu, K.; Cao, L.; Liu, Y. A Forest Fire Detection System Based on Ensemble Learning. Forests 2021, 12, 217.
46. Khan, R.A.; Uddin, J.; Corraya, S. Real-time fire detection using enhanced color segmentation and novel foreground extraction. In Proceedings of the 4th International Conference on Advances in Electrical Engineering (ICAEE), Dhaka, Bangladesh, 28–30 September 2017; pp. 488–493.
47. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
48. Pan, Z.; Xu, J.; Guo, Y.; Hu, Y.; Wang, G. Deep Learning Segmentation and Classification for Urban Village Using a Worldview Satellite Image Based on U-Net. Remote Sens. 2020, 12, 1574.
49. Bragilevsky, L.; Bajić, I.V. Deep learning for Amazon satellite image analysis. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), Victoria, BC, Canada, 21–23 August 2017; pp. 1–5.
50. Abdollahi, A.; Pradhan, B.; Shukla, N.; Chakraborty, S.; Alamri, A. Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-Of-The-Art Review. Remote Sens. 2020, 12, 1444.
51. Bakator, M.; Radosav, D. Deep Learning and Medical Diagnosis: A Review of Literature. Multimodal Technol. Interact. 2018, 2, 47.
52. Minaee, S.; Boykov, Y.Y.; Porikli, F.; Plaza, A.J.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021.
53. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
54. Iandola, F.N.; Moskewicz, M.W.; Ashraf, K.; Han, S.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
55. Xing, Y.; Zhong, L.; Zhong, X. An Encoder-Decoder Network Based FCN Architecture for Semantic Segmentation. Wirel. Commun. Mob. Comput. 2020, 2020, 8861886.
56. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
57. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
58. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
59. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
60. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
61. Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3485–3492.
62. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016, arXiv:1605.06409.
63. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
64. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
65. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
66. Jocher, G.; Stoken, A.; Chaurasia, A.; Borovec, J.; Chanvichet, V.; Kwon, Y.; TaoXie, S.; Changyu, L.; Abhiram, V.; Skalski, P.; et al. YOLOv5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 20 August 2021).
67. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
68. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
69. Girdhar, R.; Carreira, J.; Doersch, C.; Zisserman, A. Video Action Transformer Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 244–253.
70. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10502–10511.
71. He, X.; Chen, Y.; Lin, Z. Spatial-Spectral Transformer for Hyperspectral Image Classification. Remote Sens. 2021, 13, 498.
72. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
73. Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 843–852.
74. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv 2020, arXiv:2012.12877.
75. Chuvieco, E.; Mouillot, F.; van der Werf, G.R.; San Miguel, J.; Tanase, M.; Koutsias, N.; García, M.; Yebra, M.; Padilla, M.; Gitas, I.; et al. Historical background and current developments for mapping burned area from satellite Earth observation. Remote Sens. Environ. 2019, 225, 45–64.
76. Van der Werf, G.R.; Randerson, J.T.; Giglio, L.; van Leeuwen, T.T.; Chen, Y.; Rogers, B.M.; Mu, M.; van Marle, M.J.E.; Morton, D.C.; Collatz, G.J.; et al. Global fire emissions estimates during 1997–2016. Earth Syst. Sci. Data 2017, 9, 697–720.
77. Giglio, L.; Boschetti, L.; Roy, D.P.; Humber, M.L.; Justice, C.O. The Collection 6 MODIS burned area mapping algorithm and product. Remote Sens. Environ. 2018, 217, 72–85.
78. Key, C.H.; Benson, N.C. Landscape assessment (LA). In FIREMON: Fire Effects Monitoring and Inventory System; Gen. Tech. Rep. RMRS-GTR-164-CD; Lutes, D.C., Keane, R.E., Caratti, J.F., Key, C.H., Benson, N.C., Sutherland, S., Gangi, L.J., Eds.; U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station: Fort Collins, CO, USA, 2006; Volume 164, pp. LA-1–LA-55.
79. Roy, D.; Boschetti, L.; Trigg, S. Remote sensing of fire severity: Assessing the performance of the normalized burn ratio. IEEE Geosci. Remote Sens. Lett. 2006, 3, 112–116.
80. Miller, J.D.; Thode, A.E. Quantifying burn severity in a heterogeneous landscape with a relative version of the delta Normalized Burn Ratio (dNBR). Remote Sens. Environ. 2007, 109, 66–80.
81. Frampton, W.J.; Dash, J.; Watmough, G.; Milton, E.J. Evaluating the capabilities of Sentinel-2 for quantitative estimation of biophysical variables in vegetation. ISPRS J. Photogramm. Remote Sens. 2013, 82, 83–92.
82. Zheng, Z.; Wang, J.; Shan, B.; He, Y.; Liao, C.; Gao, Y.; Yang, S. A New Model for Transfer Learning-Based Mapping of Burn Severity. Remote Sens. 2020, 12, 708.
83. Rebecca, G.; Tim, D.; Warwick, H.; Luke, C. A remote sensing approach to mapping fire severity in south-eastern Australia using Sentinel 2 and random forest. Remote Sens. Environ. 2020, 240, 111702.
84. Zanetti, M.; Marinelli, D.; Bertoluzza, M.; Saha, S.; Bovolo, F.; Bruzzone, L.; Magliozzi, M.L.; Zavagli, M.; Costantini, M. A high resolution burned area detector for Sentinel-2 and Landsat-8. In Proceedings of the 10th International Workshop on the Analysis of Multitemporal Remote Sensing Images (MultiTemp), Shanghai, China, 5–7 August 2019; pp. 1–4.
85. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep Learning Classification of Land Cover and Crop Types Using Remote Sensing Data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782.
86. Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107.
87. Zhang, Q.; Yuan, Q.; Zeng, C.; Li, X.; Wei, Y. Missing Data Reconstruction in Remote Sensing Image With a Unified Spatial–Temporal–Spectral Deep Convolutional Neural Network. IEEE Trans. Geosci. Remote Sens. 2018, 56, 4274–4288.
88. Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 680–688.
89. Jeppesen, J.H.; Jacobsen, R.H.; Inceoglu, F.; Toftegaard, T.S. A cloud detection algorithm for satellite imagery based on deep learning. Remote Sens. Environ. 2019, 229, 247–259.
90. Pinto, M.M.; Libonati, R.; Trigo, R.M.; Trigo, I.F.; DaCamara, C.C. A deep learning approach for mapping and dating burned areas using temporal sequences of satellite images. ISPRS J. Photogramm. Remote Sens. 2020, 160, 260–274.
91. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
92. Farasin, A.; Colomba, L.; Garza, P. Double-Step U-Net: A Deep Learning-Based Approach for the Estimation of Wildfire Damage Severity through Sentinel-2 Satellite Data. Appl. Sci. 2020, 10, 4332.
93. Rahmatov, N.; Paul, A.; Saeed, F.; Seo, H. Realtime fire detection using CNN and search space navigation. J. Real-Time Image Process. 2021, 18, 1331–1340.
94. Khennou, F.; Ghaoui, J.; Akhloufi, M.A. Forest fire spread prediction using deep learning. In Geospatial Informatics XI; Proc. SPIE 2021, 11733, pp. 106–117.
95. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
96. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324.
97. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
