1. Introduction
Forest fires are among the most serious threats to forest resources, ecosystems, and human society. In recent years, the rise in human activities, coupled with the deteriorating global environment, has greatly contributed to an increase in forest fire incidents [1]. Owing to their large-area coverage, all-weather operation, low cost, and, in particular, their ability to monitor accidental fires caused by human behavior, vision-based fire and smoke detection techniques have become increasingly predominant in forest surveillance systems [2,3]. In this work, our primary focus is on smoke detection rather than fire detection, because smoke is usually an earlier sign of a fire incident.
In the literature, existing smoke detection methods can be roughly divided into three types: (1) classification of video frames, image blocks, or pixels into smoke and non-smoke classes; (2) estimation of bounding boxes of smoke objects in an image for smoke localization and detection; and (3) region segmentation of smoke objects in an image [4]. Supervised learning is a fundamental component of all three types, applied in smoke detection, recognition, localization, and segmentation. Early machine-learning-based smoke detection was supervised and can be traced back to Gubbi’s work [5], where samples of both classes were manually encoded as 60-dimensional feature vectors obtained from a three-level wavelet decomposition of 32 × 32 image blocks. A Support Vector Machine (SVM) was then trained to classify these samples into two classes: smoke and non-smoke. From then on, smoke detection research focused on two directions: feature construction and model selection. In general, a smoke detection task consists of three steps [6]. First, smoke features are constructed from smoke images or video frames to form vector-pattern samples. Second, a shallow learning model, such as an SVM, AdaBoost, or a random forest, is selected and trained on the featured samples. Finally, an unseen sample is identified by the trained model. Shallow learning models have two main shortcomings. The first concerns feature construction: the required features are usually hand-designed and rely heavily on the designer’s prior knowledge, and this over-reliance can lead to poor scalability of the resulting smoke detectors. The second is the limited representation capacity of shallow models. For example, shallow learning may be effective on a small dataset owing to handcrafted features, whereas on a large dataset handcrafted features may fail to characterize the intricate appearance of smoke. To alleviate these issues, various deep models were later introduced into smoke detection. For example, Luo et al. [7] proposed a convolutional neural network (CNN)-based smoke detector, where suspected smoke regions were first detected and their features were then extracted automatically by the CNN. To integrate feature extraction, candidate box extraction, and classification, Cheng [8] introduced a Fast R-CNN (Regional CNN) for smoke image recognition. Compared to shallow learning models, both deep methods improve the accuracy of smoke video detection. Additionally, considering the transparency of smoke, Zhan et al. [9] designed a ResNet (Residual Network) variant named ARGNet, in which two identical ResNet50-vd models were used for feature extraction: high-level features extracted by the first ResNet50-vd were transmitted to the second and combined with the low-level features extracted there, enhancing the ability to represent smoke, e.g., to capture the transparency of smoke objects. Experimental results on a constructed dataset named UAV-IoT (Unmanned Aerial Vehicle–Internet of Things) demonstrated that ARGNet achieved higher accuracy in UAV aerial forest fire smoke detection, as well as in identifying smoke-like objects and long-distance smoke, and in smoke positioning in complex scenes. In general, deep network models are superior to shallow ones. One explanation is that deep models have more powerful capabilities in semantic feature extraction; here, semantic information encompasses low-level features such as the color, texture, or position of a smoke object, as well as high-level features such as semantic fragments, contours, or smoke-object regions [10]. Thanks to their capacity to handle vast amounts of data, deep networks and their variants, including lightweight versions [11,12,13], have achieved great success in smoke detection; here we name only a few.
On the other hand, due to the black-box nature of complex network architectures, deep networks usually lack interpretability [14]. In this view, the attention mechanism provides a means to interpret the opaque behavior of neural architectures, which is valuable for vision-based tasks such as smoke detection [15,16]. Because the concept of saliency is consistent with the visual attention mechanism of human beings, a salient smoke detection method based on shallow learning was proposed in Ref. [17]. Although this method can detect most instances of smoke, it also suffers from a high false-alarm rate. Most attention-based smoke detection methods in the literature adhere to the deep learning paradigm; we give some examples here. In Ba et al.’s work, a model named SmokeNet [18] was proposed to enhance feature representation for smoke objects in satellite imagery by combining spatial and channel information; however, SmokeNet was still prone to excessive false alarms. To reduce false alarms, He et al. [19] introduced an attention-based CNN model that attempts to identify smoke in foggy environments, where a lightweight feature-level and decision-level fusion was used to improve the discrimination of smoke, fog, and other objects. Self-attention is another common mechanism in smoke detection, with multiple versions proposed for different purposes. For example, Jiang et al. [20] designed a Self-Attention Network for Smoke Detection named SAN-SD and applied it in industrial settings to detect smoke from low-resolution images of straw burning. Considering transparency and variability, Wang et al. [21] proposed a self-attention-based YOLO model named SASC-YOLO, where the self-attention mechanism is intended to emphasize smoke features, but its performance was evaluated only on synthetic smoke datasets. Additionally, Wang et al. proposed a Multi-level Feature Fusion Network (MFFNet) [22], where a module named Attention Feature Enhancement refines multi-scale features; however, this network was designed specifically for smoke screening in satellite remote sensing images. In summary, these methods may be effective in their individual scenarios, but they cannot be used directly for our forest smoke detection concerns for the following reasons: (1) the attention mechanisms used are often straightforward applications of generic computer-vision modules, with little or no consideration of the characteristics of smoke; and (2) they may be effective in simple scenarios, e.g., where a single large-scale smoke object appears, but fail to detect complex smoke objects, e.g., smoke in a complex forest scene involving multiple ignition points or varying scales.
In this work, we aim to propose a novel interpretable attention mechanism for multi-scale smoke detection. We highlight our contributions as follows.
- (1) We propose a novel attention mechanism called Spatial and Efficient Channel Attention, abbreviated as SECA, for forest smoke detection.
- (2) To improve the accuracy of multi-scale smoke detection, we construct feature maps via multi-kernel one-dimensional (1D) convolutions rather than the single-kernel 2D or 3D convolutions used in existing methods.
- (3) For interpretability, the channel features, also generated by 1D convolutions, can be explained as weights that emphasize the importance of spatial features along the channel direction.
- (4) In terms of ease of use, our SECA can be used either as two isolated modules or as a unified module, and it can easily be plugged into a base network for smoke detection (see the sketch after this list).
- (5) We also provide an acceleration strategy for model training by leveraging a DSConv-Haar Wavelet Downsampling (DHWD) technique.
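To make the design concrete, the following is a minimal PyTorch sketch of the two modules and their unified form, written from the description above; the kernel sizes, branch wiring, and all names other than SECA, MSCA, and ECA are our own assumptions, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: a 1D convolution over the pooled
    channel descriptor yields per-channel weights (ECA-Net style)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pool -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1D conv across channels
        return torch.sigmoid(y)[..., None, None]   # (B, C, 1, 1)

class MSCA(nn.Module):
    """Hypothetical multi-kernel 1D spatial module: depthwise 1xk and kx1
    convolutions at several kernel sizes, approximating anisotropic
    smoke diffusion along height and width."""
    def __init__(self, channels, kernels=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                          groups=channels, bias=False),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                          groups=channels, bias=False))
            for k in kernels)
        self.mix = nn.Conv2d(channels, channels, 1)  # fuse the branches

    def forward(self, x):
        return self.mix(sum(b(x) for b in self.branches)) * x

class SECA(nn.Module):
    """Unified module: channel weights from ECA re-weight MSCA features."""
    def __init__(self, channels):
        super().__init__()
        self.msca, self.eca = MSCA(channels), ECA(channels)

    def forward(self, x):
        return self.eca(x) * self.msca(x)
```

Because every operator preserves the input shape, such a block can be dropped between any two stages of a backbone, which is what "plug-and-play" amounts to in practice.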
The remainder of this paper is organized as follows. In Section 2, we review the works most related to ours. In Section 3, we detail our SECA mechanism, the resulting network, and the aforementioned acceleration strategy. Section 4 is dedicated to an extensive experimental evaluation, including a comparative analysis between our method and state-of-the-art (SOTA) methods. Finally, we draw conclusions in Section 5.
4. Results
To evaluate the performance of our proposed attention mechanism, SECA, we take YOLOv8 and YOLOv11 as base networks. All YOLO models are implemented in the Ultralytics PyTorch environment. For an extensive comparison, several state-of-the-art (SOTA) methods are used as baselines, detailed in the following subsections. The training epochs and batch size are set to 200 and 8, respectively, the parameters recommended for YOLO-series networks in the literature. Other parameters retain their default values, e.g., those provided by the Ultralytics community. The non-YOLO models are implemented in the MMDetection 3.0 environment with its default settings.
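As a usage illustration, a training run under these settings might look like the following sketch with the standard Ultralytics API; the dataset YAML path is a placeholder, and loading an SECA-augmented model from its own YAML is assumed to work analogously.

```python
from ultralytics import YOLO

# Base network; a SECA variant would be loaded from its own model YAML
# in the same way (hypothetical file names).
model = YOLO("yolov8n.pt")
model.train(data="forest_smoke.yaml", epochs=200, batch=8)
```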
The composite indicators Precision (P), Recall (R), mAP50, and mAP50-95 are also commonly used for performance assessment. For the convenience of readers, we list them as follows. They are defined as $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $\mathrm{mAP50} = \frac{1}{N}\sum_{i=1}^{N} AP_i$, and $\mathrm{mAP50\text{-}95} = \frac{1}{10N}\sum_{i=1}^{N}\sum_{j=1}^{10} AP_{i,j}$, where $N$ denotes the total number of classes. The average precision, defined as $AP = \int_0^1 P(R)\,\mathrm{d}R$, is the area under the Precision–Recall curve for a specific class. The indicator mAP50 calculates the mean average precision at a single IoU threshold of 0.5 (the percentage value of 50), assessing detection performance under relatively loose localization. Likewise, mAP50-95 provides a more comprehensive evaluation by averaging precision over 10 IoU thresholds, ranging from 50 to 95 in percentage (0.5 to 0.95) with a fixed step of 5 (0.05); we use it to assess the performance of multi-scale smoke detection as measured by IoU positioning accuracy, where $AP_{i,j}$ is the AP value of the $i$-th class at the $j$-th IoU threshold. In practice, P is usually used to measure the accuracy of smoke detection; a high P value means that the model tends to classify accurately. Likewise, R indicates the model’s ability to avoid missed detections of smoke; a high R value means the network identifies most potential smoke regions, enabling its use for early fire alarms. Additionally, the F1-score, defined as $F1 = \frac{2PR}{P + R}$, is also used as a performance metric because it is popular in vision-based smoke detection.
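For concreteness, a minimal Python sketch of these metrics follows; the trapezoidal AP integration is one common approximation (COCO-style tooling uses 101-point interpolation instead), and all function names are ours.

```python
import numpy as np

def precision_recall_f1(tp: int, fp: int, fn: int):
    """P, R, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def average_precision(recalls, precisions):
    """Area under the Precision-Recall curve for one class at one IoU
    threshold, approximated by the trapezoidal rule."""
    order = np.argsort(recalls)
    r = np.asarray(recalls)[order]
    p = np.asarray(precisions)[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

def mean_ap(ap_matrix):
    """ap_matrix[i, j]: AP of class i at IoU threshold j (j = 0.5..0.95).
    Column j=0 alone gives mAP50; averaging all 10 columns gives mAP50-95."""
    ap = np.asarray(ap_matrix)
    return ap[:, 0].mean(), ap.mean()
```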
To present the comparison clearly, we divide the experiments into two subsections. In the first, an ablation experiment is carried out on our SECA, evaluating the contribution and interpretability of the proposed modules. In the second, we compare our SECA with SOTA baselines. This comparison consists of two parts. The first part compares our SECA with several newly proposed SOTA attention modules, some of which have not previously been used for smoke detection; for fairness, these modules are embedded in the same positions as our SECA, as illustrated in Figure 3. In the second part, we compare our SECA-based YOLO networks, YOLO-SECA for short, with state-of-the-art smoke detectors. All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 4060 GPU (8 GB VRAM) and an Intel® Core™ i9-13900HX processor. The software environment consisted of Python 3.9, PyTorch 2.0.1, and CUDA 11.8, with code developed in Visual Studio Code (version 1.91.1).
4.1. Performance Assessment for Our SECA
In this subsection, we use an ablation experiment to evaluate the effectiveness of our proposed SECA, including the isolated MSCA and ECA modules, the unified SECA module, and the acceleration strategy DHWD. The results are reported in Table 1, Table 2, and Figure 6, respectively.
At first glance, Table 1 shows that in terms of the five indicators, the unified module YOLO-SECA performs best, followed by YOLO-SECA-DHWD. For example, YOLO-SECA achieved the highest P value of 0.818, the highest R of 0.779, and the highest mAP50-95 value of 0.345. On the other two indicators, it still achieves the second-best mAP50 value of 0.815, only 0.008 points below the best value of 0.823. Because YOLO-SECA does not use the DHWD strategy, its parameter count (the Parameter column in Table 1) is slightly higher than that of the pure YOLOv8. By contrast, when DHWD is applied, the Parameter value is greatly reduced, e.g., from 3.016 MB to 2.8 MB. On the other hand, when an isolated module, MSCA or ECA, is used, the resulting YOLOv8-based models can outperform the naive one, as shown in Table 1, where some indicator values of YOLO-MSCA and YOLO-ECA surpass those of the naive YOLOv8. Additionally, Table 1 shows the effectiveness of the proposed acceleration strategy DHWD: the Parameter values of YOLO-DHWD and YOLO-SECA-DHWD are reduced to 2.8 MB, compared to 3.011 MB for the naive model. This point is further verified in Table 2.
To give an intuitive view of the accuracy of the extracted features, we use heatmaps produced by Grad-CAM (Gradient-weighted Class Activation Mapping [32]), a tool widely used for feature visualization in computer vision. To display performance at different scales, especially for multi-scale smoke detection, we visualize the FUAV data with heatmaps. Owing to space limitations, two images from the FUAV data were selected to represent multi-scale smoke, and their heatmaps are shown in Figure 6. As shown in Figure 6a, each smoke image contains multiple complex smoke objects rather than the single simple object common in the literature. The complexity of the selected smoke objects lies in their variety of sizes and their diffusion in different directions, which can even exceed human vision; for example, it is difficult to determine the exact number of smoke regions. Correspondingly, Figure 6b–d shows the heatmaps produced by our attention modules MSCA, ECA, and SECA, respectively. For better visualization, we take the smoke objects in the second image as an example and highlight them with orange boxes. Comparing a heatmap with the original image gives an intuitive assessment of feature-extraction accuracy, and by this intuition it is evident that our SECA is superior to the other two attention modules: a majority of smoke regions are accentuated with brighter colors, while non-smoke regions appear in darker hues. On the one hand, the heatmap from our SECA is not a straightforward superposition of the heatmaps from MSCA and ECA; for example, the brighter smoke regions in the MSCA or ECA heatmaps do not simply reappear as the brighter regions of the SECA heatmap, and vice versa. On the other hand, the smoke location information captured by MSCA is transmitted to SECA; for example, the blue smoke regions within the MSCA orange box exhibit a more vivid hue than the original blue and are more distinctly visualized in the SECA orange box. We explain this using $\mathrm{SECA}(X)$ defined in Equation (9), where $\mathrm{SECA}(X) = A_C \otimes \mathrm{MSCA}(X)$ and $A_C = \mathrm{ECA}(X)$ is the channel output. For the position features $\mathrm{MSCA}(X)$, the channel output $A_C$ is used in SECA as weights that emphasize the importance of all the obtained location features. As a result, both the location information and the channel information of smoke objects are transferred to the final feature map, and when Grad-CAM is applied for visualization, an important feature is assigned a brighter color. In this view, this figure can also be viewed as evidence for the interpretability of our SECA.
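For readers who wish to reproduce such heatmaps, the following is a minimal, self-contained Grad-CAM sketch using PyTorch hooks; it is a generic implementation of the published algorithm, not the exact visualization code used here, and `score_fn` is a caller-supplied reduction of the model output to a scalar.

```python
import torch
import torch.nn.functional as F

class GradCAM:
    """Minimal Grad-CAM: weight each channel of a target layer's activation
    by the spatial mean of its gradient, then ReLU, sum, and upsample."""
    def __init__(self, model, target_layer):
        self.model = model.eval()
        self.acts, self.grads = None, None
        target_layer.register_forward_hook(self._save_acts)
        target_layer.register_full_backward_hook(self._save_grads)

    def _save_acts(self, module, inp, out):
        self.acts = out.detach()

    def _save_grads(self, module, grad_in, grad_out):
        self.grads = grad_out[0].detach()

    def __call__(self, x, score_fn):
        out = self.model(x)
        self.model.zero_grad()
        score_fn(out).backward()                       # scalar score
        weights = self.grads.mean(dim=(2, 3), keepdim=True)  # GAP over H, W
        cam = F.relu((weights * self.acts).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                            align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```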
To provide a clear comparison for our acceleration strategy DHWD, we take the standard convolutions (Conv) in YOLOv8 as a baseline and use training time (in seconds) for the performance assessment. For convenience, 300 images were randomly selected from the two datasets to form a small training set. These training images were fed to the naive model and to our YOLO-DHWD, where DHWD was configured according to the settings outlined in Figure 5. For a fair comparison, the input tensors of both models were unified to the size (8, 128, 256, 256). After 30 warm-up iterations, we recorded the training time in seconds, the FLOPs in GFLOPs, and the parameters in KB, and listed them in Table 2. Table 2 shows that our DHWD significantly surpasses the standard convolutions, running nearly three times faster than the convolutions employed in the base model.
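The following is a minimal sketch of a DSConv-Haar wavelet downsampling block consistent with this description: fixed 2 × 2 Haar filters split each channel into four half-resolution subbands, and a depthwise separable convolution then mixes the stacked subbands back to the target channel count. The filter normalization, channel widths, and class names are our assumptions, not the paper’s exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDownsample(nn.Module):
    """Fixed 2x2 Haar filters applied depthwise with stride 2; the four
    subbands (LL, LH, HL, HH) are stacked along the channel axis."""
    def __init__(self, channels):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        w = torch.stack([ll, lh, hl, hh]).unsqueeze(1)   # (4, 1, 2, 2)
        self.register_buffer("weight", w.repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x):                                # (B, C, H, W)
        return F.conv2d(x, self.weight, stride=2,
                        groups=self.channels)            # (B, 4C, H/2, W/2)

class DHWD(nn.Module):
    """Hypothetical DSConv-Haar downsampling: Haar subbands followed by a
    depthwise separable convolution mapping 4*c_in channels to c_out."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.haar = HaarDownsample(c_in)
        self.dsconv = nn.Sequential(
            nn.Conv2d(4 * c_in, 4 * c_in, 3, padding=1,
                      groups=4 * c_in, bias=False),      # depthwise
            nn.Conv2d(4 * c_in, c_out, 1, bias=False),   # pointwise
            nn.BatchNorm2d(c_out), nn.SiLU())

    def forward(self, x):
        return self.dsconv(self.haar(x))

x = torch.randn(8, 128, 256, 256)                        # size used in Table 2
y = DHWD(128, 128)(x)                                    # -> (8, 128, 128, 128)
```

Because the Haar filters are fixed buffers rather than learned weights, the block trades learned strided convolutions for a cheap, lossless subband split, which is one plausible source of the speed-up reported above.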
4.2. Comparison Between Our SECA and SOTA Baselines
As mentioned above, this subsection presents the results in two parts: a comparison among attention mechanisms, and a comparison among selected state-of-the-art (SOTA) smoke detection models.
4.2.1. Comparison of Attention Mechanisms
In computer vision and vision-based applications, attention mechanisms have gained immense popularity. To show the novelty of our proposal, we take newly proposed attention mechanisms as baselines, including Efficient Multi-Scale Attention (EMA, 2023) [33], Multi-scale Cross-axis Attention (MCA, 2023) [34], Channel Prior Convolutional Attention (CPCA, 2024) [25], and Global-to-Local Spatial Aggregation (GLSA, 2024) [35], although some of them have not previously been applied to smoke detection. We do not take the classic Coordinate Attention (CA, 2021) [27] and Convolutional Block Attention Module (CBAM, 2018) [26] as baselines, because EMA and CPCA can be viewed as newer versions of CA and CBAM, respectively. Furthermore, the chosen baseline attention mechanisms appear stronger at capturing multi-scale objects. For example, MCA is designed to exploit multi-scale features and long-range dependencies to capture the changes and morphology of objects of interest, while the authors of GLSA claim that it can utilize global and local spatial information for small-object detection. Given the similarity between these objects and the smoke we are concerned with, we are also interested in the efficacy of these baselines in smoke detection. For fairness, the baseline attention mechanisms are all used as plug-and-play modules replacing our SECAs, in the same positions as in the SECA network shown in Figure 3. To evaluate multi-scale smoke detection performance, we calculate five metrics on both the FUAV and WSv2 datasets. The FUAV dataset is assessed across multiple scales, while the WSv2 dataset is assessed on small-area, long-distance, and combined smoke scales. The quantitative and visual results are presented in Table 3 and Figure 7, respectively.
At first glance, Table 3 indicates that our SECA is still superior to the other attention mechanisms, with most of its indicator values shown in bold. In particular, on the FUAV data, three of the four smoke detection indicators achieved the best results, highlighted in bold in the table: our SECA achieved the highest mAP50 value of 0.815, the highest P (Precision) value of 0.818, and the highest F1-score of 0.798. Even on the indicator R (Recall), it achieves the second-best value of 0.779, only 0.008 points below that of EMA attention. Similarly, on WSv2, two of the four indicators for SECA also reached the highest levels, with an mAP50 of 0.704 and an F1-score of 0.708. On the other two indicators, our SECA achieves a P value of 0.778 and an R value of 0.650, merely 0.01 and 0.008 points below the best values obtained by CPCA and MCA attention, respectively. The reason may be attributed to two factors: (1) our SECA has stronger interpretability, e.g., capturing the diffusivity of smoke objects in both the height and width spatial directions; and (2) our multi-kernel 1D convolution (MK-DWConv1D) may have more potential for capturing different scales than the single-kernel 2D or 3D convolutions applied in the baseline attentions. To clarify, some challenging smoke objects are used for testing, and we visualize the detection results in Figure 7 for intuition.
In Figure 7, three smoke images selected from the WSv2 data are used for visualization. In each image, the smoke object highlighted by a red box in Figure 7a is notably more complex than the smoke objects in the FUAV data in terms of spatial characteristics such as size, orientation, and proximity to the observer. The detection results captured by the SECA, MCA, GLSA, EMA, and CPCA attentions are shown in Figure 7b–f. For better visualization, a red box is added to each correctly detected smoke object. Regarding performance, the columnar smoke object in the first image is captured by all attention-based methods. However, the remaining smoke objects in the other two images, perhaps because of their smaller area in the monitoring field, are sometimes missed by the baseline methods: the smoke in the second image is missed by MCA, GLSA, and CPCA, and the smoke in the third image is missed by MCA and EMA. With our SECA, Figure 7 shows that all of these challenging smoke objects are correctly detected. The main reason for this should again be attributed to our targeted design, i.e., using multi-kernel 1D convolutions to capture the diffusivity of smoke.
4.2.2. Comparison Among Different Networks
To provide a more extensive experiment, in this subsection we compare our method with SOTA mainstream models, namely Faster R-CNN [36], Mask R-CNN [37], RetinaNet [38], TOOD [39] (Task-aligned One-stage Object Detection), DAB-DETR [40] (Dynamic Anchor Boxes DETR), and RTMDet-Tiny [41] (Real-Time Models for Object Detection). We list the original papers in our references out of respect for the authors and to give readers a way to obtain the freely available code if needed. Most of these models have been widely applied in fire or smoke detection [42,43]. To show extensibility, two YOLO versions, YOLOv8 and the newly released YOLOv11, are used as backbones for our YOLO-SECA. We report the numerical results in Table 4 and visualizations of multi-scale smoke detection in Figure 8.
In Table 4, a comparison is conducted among ten models, including the baseline methods and ours, using five indicators for performance assessment. The indicator Latency (in milliseconds, ms) refers to the time delay between a trained model receiving an input and generating the corresponding output; we use it to measure testing time and list the average in the table. For easy reading, we divide the results into two groups, one for the baselines and the other for our models. The averaged results for each group show that our SECA-based models outperform the baseline methods on nearly all indicators, especially in computational efficiency, e.g., achieving an average FLOPs value of 7.3, nearly 18 times lower than the baseline average. We should point out that in this case, the computational efficiency may be attributed to both the base model and our DHWD strategy, given the differences between network architectures. A fairer comparison would plug our SECA into an alternative base network and compare the base model against the enhanced version, as done for YOLOv11 versus YOLOv11 + SECA + DHWD in Table 4; however, we do not explore this further here due to space constraints. As in Section 4.2.1, we also provide a visual comparison in Figure 8 for intuition.
For further evaluation, more complex smoke images are used for testing, two of which are selected for visualization. Here, complexity, in terms of human vision, refers to more smoke objects and more varied smoke areas. As shown in Figure 8, a varying number of smoke objects, five or seven, are annotated with bounding boxes of different sizes in Panels (a) and (i). The results on the first image show that among the seven models, our YOLO-SECA performs best, followed by Faster R-CNN, Mask R-CNN, and RetinaNet. For YOLO-SECA, nearly every smoke object is accurately detected; Faster R-CNN, Mask R-CNN, and RetinaNet each miss one of the five objects, while the remaining three models, TOOD, DAB-DETR, and RTMDet-Tiny, fare even worse, missing two or all of the smoke objects. In the second image, all seven smoke objects are detected by five of the models, whereas the remaining two models, TOOD and RTMDet-Tiny, miss three and five of the seven objects, respectively. In this case, our YOLO-SECA still outperforms the other models when the indicator IoU is considered, where IoU refers to the overlap between the ground-truth bounding box (mask) and a model’s predicted mask. YOLO-SECA demonstrates superior performance with the highest IoU value of 0.618; it is followed by Faster R-CNN, Mask R-CNN, DAB-DETR, and RetinaNet, with IoU values of 0.560, 0.558, 0.589, and 0.569, respectively. By contrast, TOOD and RTMDet-Tiny exhibit lower values of only 0.374 and 0.175.
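As a reference for the IoU figures quoted above, a minimal box-IoU computation (corner-format boxes; the function name is ours) is:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```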
5. Discussion
To improve the interpretability and accuracy of multi-scale smoke detection, in this paper we have proposed the SECA mechanism. Unlike existing methods, our spatial and channel information is extracted using multi-kernel 1D convolutions rather than single-kernel 2D or 3D convolutions, aiming to capture complex smoke, e.g., multi-scale smoke objects generated in complex forest-monitoring scenes. Although we have presented experimental results in Section 4, the indicators used for performance assessment were evaluated in isolation, and some of the indicator values of our method are not markedly higher or lower than those of the other methods. We therefore provide this discussion section for a more extensive analysis, e.g., in terms of statistical significance.
In this section, two types of statistical hypothesis tests are used. The first is the paired-sample t-test, which determines whether the mean difference between two sets of paired data differs significantly from zero; we use it to test individual indicators. The second is Hotelling’s T-squared test (T²-test for short), a multivariate generalization of the t-test: all indicator values collected in each run are viewed as a multivariate vector, and the T²-test identifies differences in means between the multivariate vectors generated by a pair of models. For both tests, the null hypothesis (H0) is that the means of the two paired groups are equal, i.e., the mean difference is zero; the alternative hypothesis (H1) is that there is a significant difference between the two means. The significance level is set to 0.05, a standard threshold in statistical testing, meaning a 5% probability of incorrectly rejecting the null hypothesis. In this discussion, if a p-value is less than 0.05, the result is considered statistically significant, leading to the conclusion that our method is statistically superior or inferior to the paired method. To collect data for the hypothesis tests, the procedures are run for five turns, and the indicator results of each turn are recorded. The averaged results (means) and p-values are summarized in Table 5; owing to space limitations, only three indicators, mAP50, P (Precision), and R (Recall), are shown. Correspondingly, for fairness, the multivariate vectors for the T²-test are also constructed from these three indicators.
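A minimal sketch of both tests on the five-run indicator data follows; SciPy provides the paired t-test directly, while Hotelling’s T² is computed from its standard definition (variable names and the data layout are our assumptions).

```python
import numpy as np
from scipy import stats

def paired_t(a, b):
    """Paired-sample t-test on one indicator across runs."""
    return stats.ttest_rel(a, b)             # (statistic, p-value)

def hotelling_t2(A, B):
    """Hotelling's T^2 for paired multivariate runs.
    A, B: (n_runs, n_indicators), e.g. 5 runs x (mAP50, P, R)."""
    D = np.asarray(A) - np.asarray(B)        # paired differences
    n, p = D.shape
    d_bar = D.mean(axis=0)
    S = np.cov(D, rowvar=False)              # covariance of the differences
    t2 = n * d_bar @ np.linalg.solve(S, d_bar)
    f_stat = (n - p) / (p * (n - 1)) * t2    # ~ F(p, n - p) under H0
    return t2, stats.f.sf(f_stat, p, n - p)  # (T^2, p-value)
```

Note that with five runs and three indicators the F-distribution has only (3, 2) degrees of freedom, which quantifies the concern raised below that five runs may be too few.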
In Table 5, the symbol “--” indicates that the multivariate vectors are omitted for space; listing them is also unnecessary, because each multivariate vector is simply composed of the three univariate indicator values. For example, the first multivariate mean would be (0.811, 0.816, 0.772), where the components correspond to the univariate means of mAP50, P, and R, respectively. The symbol “-” marks a meaningless value, arising when a t-test is carried out between YOLOv8 + SECA and itself. As a result, in both the t-tests and the T²-tests, most p-values show a significant difference between our method and the paired baseline method. The reason may be threefold. (1) The setting may be unfair to the baseline methods, as addressed above: because our focus is on multi-scale smoke objects, the constructed sample set may be biased toward complex objects, e.g., multiple ignition points or long-distance smoke objects appearing in a wild image or a forest-monitoring video. (2) A five-run experiment may not be sufficient for a statistical test. In this view, statistical significance is a double-edged sword: on the one hand, it facilitates the analysis and evaluation of different methods from a theoretical standpoint; on the other hand, it poses a significant challenge to training time, especially when a deep network must be trained on large-scale data for five or more runs. (3) It may be attributed to causes not yet identified. Unlike shallow learning, there is still much we do not know about attention mechanisms and deep learning. In classical machine learning, mathematics and statistics are important cornerstones that enable us to analyze and interpret model performance, whereas in attention mechanisms and attention-based deep learning, non-interpretability appears to be the prevailing trend. Regarding our smoke detection concern, we should point out that interpretability here refers to alignment with human intuition rather than strict mathematical or statistical rigor. For example, in human vision the diffusion of smoke objects is anisotropic, and in this work we expect it to be captured by multi-kernel 1D convolutions; however, this point is hard to prove mathematically or statistically.
6. Conclusions
In this work, we focus on forest smoke detection and propose a novel interpretable attention mechanism named Spatial and Efficient Channel Attention, termed SECA. To capture the diffusivity of forest smoke, we utilize both spatial and channel information for smoke feature extraction. For ease of use, our SECA mechanism can be viewed as two isolated attention mechanisms, MSCA and ECA, which construct spatial and channel attention maps, respectively. It can also be viewed as a unified attention, where the channel features are interpreted as weights emphasizing the importance of the components of the spatial MSCA. To further improve interpretability, we use multi-kernel 1D convolutions to extract finer-grained features from multi-scale smoke objects, instead of the single-kernel 2D or 3D convolutions used by existing smoke detectors. Additionally, to speed up the SECA-based YOLO networks, we provide a wavelet-based downsampling strategy called DHWD for model training. Extensive experiments on our collected data show that our proposal is superior to state-of-the-art attention mechanisms and attention-based deep models for multi-scale, long-distance, and small-area smoke detection. The numerical results in Table 4 demonstrate that our SECA-based models achieve higher average smoke detection accuracy, e.g., an increase of 4.2% in mAP50 and 3.7% in mAP50-95 compared to SOTA. On the other hand, we should also point out the shortcomings of this work. The performance of SECA is evaluated only on YOLO models, without considering non-YOLO networks. With bounding boxes as ground-truth annotations and predictions, the test results remain coarse-grained; no tests were carried out on finer-grained objects such as long-distance smoke represented by only a few pixels. Moreover, we have not yet delved deeper into 1D convolution.
Aside from general object detection, the difficulties in forest smoke detection are threefold. (1) As a non-rigid object, smoke exhibits irregular shapes, blurred region borders, and complex compositions; for example, it remains unknown which features are effective for smoke detection. (2) Annotation of smoke objects is coarse-grained, e.g., a bounding box may not annotate a spreading, long-distance smoke object very accurately; even so, the number of annotated smoke images and videos remains very small. (3) The class imbalance problem is prevalent, especially at the early fire stage. In future research, on the one hand, we plan to integrate additional sources of information, such as infrared data [45] and motion features [17], and employ multimodal fusion techniques to enhance the model’s detection performance; we also plan to compile more smoke datasets from real-world scenarios to improve the model’s ability to handle challenges across diverse environments. On the other hand, in pattern recognition and computer vision, both 1D convolutions and multi-kernel operations, e.g., 1D-2D joint convolution [46] and large selective kernels [47], are hot topics in the literature. Exploring how to integrate them into smoke detection tasks to enhance model interpretability is a potential direction for our future research.