Article

TCSN-YOLO: A Small-Target Object Detection Method for Fire Smoke

1 School of Intelligent Manufacturing, Taizhou Institute of Science and Technology, Nanjing University of Science and Technology, Taizhou 225300, China
2 School of Electronic and Electrical Engineering, Taizhou Institute of Science and Technology, Nanjing University of Science and Technology, Taizhou 225300, China
* Author to whom correspondence should be addressed.
Fire 2025, 8(12), 466; https://doi.org/10.3390/fire8120466
Submission received: 9 August 2025 / Revised: 4 November 2025 / Accepted: 25 November 2025 / Published: 29 November 2025

Abstract

Forest fires continue to pose a significant threat to public and personal safety. Detecting smoke in its early stages, or when it is far from the camera, is challenging because the smoke occupies only a small region of the captured image. This paper proposes a small-scale smoke detection algorithm, TCSN-YOLO, to address these challenges. First, a novel feature fusion module called trident fusion (TF) is designed and incorporated into the neck of the model; TF significantly enhances the recognition of small-target smoke. Second, to obtain global contextual information with high computational efficiency, we propose a Cross Attention Mechanism (CAM) that captures diverse smoke features by assigning attention weights in both the horizontal and vertical directions. Furthermore, we adopt SoftPool to preserve more detailed information in the feature maps. The Normalized Wasserstein Distance (NWD) metric is embedded into the loss function of our detector to distinguish positive and negative samples under the same threshold. Finally, we evaluate the proposed model on the AI For Humankind and FIgLib datasets. The experimental results demonstrate that our method achieves 37.1% APs, 90.3% AP50, and 40.4% AP50:95 with 45.34 M parameters and 170.5 G FLOPs.

1. Introduction

Forest resources are vital strategic assets for humanity, serving essential economic, ecological, and social functions. Forest fires not only pollute water and air but also pose significant threats to human lives and property. It is therefore crucial to detect forest fires accurately to protect life and property [1]. During the early stages of a forest fire, the smoke area is small and spreads slowly, making this the optimal time to extinguish the fire [2,3]. In computer vision, however, quickly and accurately identifying small, distant targets remains a challenge, especially forest fire smoke against complex backgrounds. Taking the Common Objects in Context (COCO) definition as an example, a small object occupies fewer than 32 × 32 pixels, a medium object between 32 × 32 and 96 × 96 pixels, and a large object more than 96 × 96 pixels [4]. How to extract and identify these small-target features with limited computation is the focus of this study.
Traditional vision-based detection approaches focus on hand-crafted smoke features [5]. Prema et al. [6] segmented the smoke region based on local YUV color information and extracted static and dynamic texture features such as uniformity, energy, correlation, and contrast. Filonenko et al. [7] utilized color and shape features for smoke detection; however, the color of smoke often depends on the burning material, and its shape is easily influenced by wind and airflow. Alamgir et al. [8] used local binary patterns combined with co-occurrence texture features to characterize smoke. Li et al. [9] proposed a flame detection framework that automatically determines the Gaussian distribution of flame color using Dirichlet process estimation. Krstinić et al. [10] used HSI color space transformations to achieve pixel-level smoke segmentation. Whether based on flame color or on smoke morphology and texture, these methods rely on the researchers' choices and designs and cannot fully represent the characteristics of smoke or the relationships among them. Consequently, traditional visual inspection methods generalize poorly, and environmental interference significantly impacts detection results.
Deep learning, the second family of visual detection methods, is used to extract smoke features automatically [11]. Wang et al. [12] proposed a lightweight smoke detection model, ESmokeNet, which includes an EdgeDet head that captures the edge information of smoke to improve the accuracy of the final decision; a context embedding module is also added to the backbone to enhance multi-scale feature extraction and fusion. Ba et al. [13] enhanced feature representation with a residual attention (RA) module that integrates the residual module with spatial and channel attention mechanisms. Hashemzadeh et al. [14] combined the visual feature extraction capabilities of convolutional neural networks (CNNs) with an efficient spatiotemporal feature computation method to achieve accurate, real-time detection of smoke in video sequences. Majid et al. [15] proposed a custom fire detection framework based on transfer learning, which uses Gradient-weighted Class Activation Mapping (Grad-CAM) to generate a coarse localization map highlighting the image regions most important for the prediction. Khan et al. [16] introduced DeepSmoke, a model built on EfficientNet and DeepLabv3+. Hosseini et al. [17] proposed UFS-Net for detecting smoke and flames in video frames; it employs a voting-based decision scheme that reduces the false positives of CNNs and improves system reliability. Zhu et al. [18] proposed TPH-YOLOv5, a small-object detection algorithm for drone-captured images, which adds a small-object detection head to YOLOv5 so that small objects can be detected on feature layers with richer pixels; transformer encoder blocks are also integrated into the backbone and neck to strengthen global information capture. To address the computational overhead caused by the additional prediction head in TPH-YOLOv5, Zhao et al. [19] removed the tiny-object prediction head and added a cross-layer asymmetric transformer (CA-Trans), reducing computation while retaining small-object detection performance. Amjad et al. [20] proposed a Dynamic Fire and Smoke Detection Model (DFDM), an optimized YOLOv7-tiny architecture that incorporates an asymptotic feature pyramid network (AFPN) to bridge the semantic gap and a cross-layer dual attention (CDA) mechanism to improve the detection of key fire and smoke features. Bahhar et al. [21] proposed a two-stage YOLO model in which YOLOv5s first detects smoke and the standard YOLOv5 combined with a voting ensemble CNN then gives the final decision. To reduce the false positive rate of smoke detection, Kwak et al. [22] used the dark channel prior algorithm to filter out interfering background information such as fog and applied the optical flow method to extract the dynamic features of smoke. Jeong et al. [23] proposed a combined model based on a YOLO detector and a long short-term memory (LSTM) classifier with a teacher-student framework to improve detection performance by exploiting the spatiotemporal characteristics of smoke. These models perform well in the field of object detection.
We believe that after multiple convolutions, the features of small objects are likely to vanish. Furthermore, convolution operations alone cannot model long-range dependencies among features, making it difficult to improve detection accuracy further.

2. Related Work

The attention mechanism can significantly improve the performance of object detection models by preserving global contextual features and extracting more discriminative information. Woo et al. [24] proposed the lightweight Convolutional Block Attention Module (CBAM), which adaptively refines feature maps by emphasizing informative features along both the channel and spatial dimensions, helping the model focus on key information; CBAM can be easily integrated into convolutional neural network (CNN) architectures to highlight relevant features and improve their representation. DETR [25] is a Transformer architecture designed for object detection that uses self-attention and positional embeddings to capture global contextual information and to understand the positional relationships of objects in images. Liu et al. [26] proposed the Swin Transformer, which captures long-range dependencies through shifted-window self-attention. Safarov et al. [27] proposed a novel feature extraction structure that replaces CSPDarknet53 in the YOLOv5 backbone with a vision transformer (ViT) and uses the self-attention mechanism of ViT to obtain long-range dependencies of the target; the detection results improve markedly in the presence of interference. However, algorithms based on self-attention suffer from drawbacks such as long training times.
Feature fusion prevents information loss. Rohra et al. [28] proposed a multi-scale feature fusion network (MSFFNet), which extracts sufficient semantic features to understand targets in both sparse and highly crowded scenes; instead of extracting features separately, it employs adaptive fusion to integrate the encoded features, thereby enlarging the receptive field and reducing computational cost. Singh and Parihar [29] proposed a bi-stream feature fusion (BFF) network for object detection in hazy environments, whose multi-level feature fusion unit fuses appropriate features between hazy images and mixed inputs, achieving excellent performance under haze. Ghiasi et al. [30] proposed a feature fusion network based on neural architecture search (NAS-FPN), which automatically generates high-performance, scalable feature pyramid structures, overcoming the suboptimality of manually designed FPNs and jointly optimizing accuracy and efficiency in multi-scale detection tasks. Achinek et al. [31] introduced a dense attention mechanism into the feature pyramid network (FPN) layers, which propagates semantic information and reduces the learning of redundant feature maps.
A good regression loss function significantly improves detection performance. Early anchor-based detectors usually use IoU to measure the overlap between the anchor box and the ground-truth (GT) box; IoU converges quickly and is scale-invariant, but it cannot accurately reflect how well the two boxes coincide. DIoU adds the distance between the two box centers to IoU. Its authors note that a good bounding box regression (BBR) loss should consider three factors, namely overlap, center distance, and aspect ratio; DIoU considers only the first two, but it already resolves the training divergence that occurs when IoU alone is used as the loss. CIoU incorporates all three factors by adding an aspect-ratio term to the penalty and therefore converges even faster. Focaler-IoU reformulates the IoU loss through a linear interval mapping, establishing a dynamic adaptation mechanism that treats regression samples differently according to their difficulty, thereby sharpening the model's focus on critical learning objectives [32]; this method performs particularly well in scenes with small targets.
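For concreteness, the following sketch computes IoU, DIoU, and CIoU for a pair of axis-aligned boxes in (x1, y1, x2, y2) format. It is an illustrative PyTorch implementation of the standard formulas, not the exact code of the detectors discussed above.

```python
import math
import torch

def iou_family(box1, box2, eps=1e-7):
    """Illustrative IoU/DIoU/CIoU for two boxes given as (x1, y1, x2, y2)."""
    # Intersection area
    inter_w = (torch.min(box1[2], box2[2]) - torch.max(box1[0], box2[0])).clamp(0)
    inter_h = (torch.min(box1[3], box2[3]) - torch.max(box1[1], box2[1])).clamp(0)
    inter = inter_w * inter_h

    w1, h1 = box1[2] - box1[0], box1[3] - box1[1]
    w2, h2 = box2[2] - box2[0], box2[3] - box2[1]
    union = w1 * h1 + w2 * h2 - inter + eps
    iou = inter / union

    # DIoU: penalize the normalized distance between box centers
    cw = torch.max(box1[2], box2[2]) - torch.min(box1[0], box2[0])   # enclosing box width
    ch = torch.max(box1[3], box2[3]) - torch.min(box1[1], box2[1])   # enclosing box height
    c2 = cw ** 2 + ch ** 2 + eps                                     # enclosing diagonal squared
    rho2 = ((box1[0] + box1[2] - box2[0] - box2[2]) ** 2 +
            (box1[1] + box1[3] - box2[1] - box2[3]) ** 2) / 4        # center distance squared
    diou = iou - rho2 / c2

    # CIoU: add an aspect-ratio consistency term on top of DIoU
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    ciou = diou - alpha * v
    return iou, diou, ciou

# A small box shifted by a few pixels: IoU drops sharply, DIoU/CIoU still provide a signal
print(iou_family(torch.tensor([10., 10., 20., 20.]), torch.tensor([13., 13., 23., 23.])))
```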
To address the shortcomings of existing research and improve smoke detection accuracy, this study designs an efficient feature extraction and feature fusion model for detecting small-target forest fire smoke. In the neck of the network, the algorithm innovatively introduces shallow feature maps from the backbone to obtain richer small-target information. Furthermore, this study focuses on the texture features of fire smoke by using sparse attention to compute long-range dependencies of contextual information efficiently. To preserve more detailed feature information during feature extraction, we replace the max pooling in the SPPF layer with SoftPool. Finally, we use the normalized Gaussian Wasserstein distance as the basis for the bounding box regression loss, which alleviates the extreme sensitivity of the IoU loss to the position deviations of tiny objects.
The main contributions of this work are as follows:
  • We propose a trident fusion (TF) module that integrates deep, intermediate-scale, and shallow features to retain more small-target information from the shallow layers.
  • We propose a Cross Attention Mechanism (CAM) to capture smoke texture features in both horizontal and vertical dimensions.
  • We replace MaxPool with SoftPool in the SPPF layer of YOLOv8 to minimize information loss during the pooling process while preserving the functionality of the pooling layer.
  • Instead of the original CIoU metric, NWD is embedded into the bounding box regression loss function of the detector model.
  • The performance of this algorithm was analyzed and compared with other mainstream object detection algorithms.
The rest of this article is organized as follows: Section 2 discusses existing research on smoke recognition, and Section 3 introduces the TCSN-YOLO detection algorithm for small smoke targets. Section 4 presents test verification and analysis. Finally, Section 5 summarizes the findings and elaborates on future research directions.

3. Materials and Methods

The overall network structure of the proposed small smoke target detection model, TCSN-YOLO, is shown in Figure 1. The proposed Trident Fusion (TF) module fuses shallow and deep feature maps, and SoftPool is integrated to enhance the efficiency of deep feature extraction and minimize information loss during downsampling. In addition, the Cross Attention Mechanism (CAM) is connected after the Soft-SPPF module, improving the network's attention to the region of interest along both the horizontal and vertical directions. Finally, an NWD-based regression loss is applied to the detector so that minor location deviations do not flip anchor labels, which enhances positive/negative sample differentiation and reduces the convergence difficulty of the network.

3.1. Trident Fusion

We propose the Trident Fusion (TF) module, a novel three-branch feature map fusion mechanism that enriches the spatial and semantic information delivered to the detection head. As illustrated in Figure 2, the deep feature layer P5 and the middle feature layer P4 of the backbone serve as one input branch of the TF module, the shallow feature layer P3 serves as the second branch, and the feature layer P2, which is shallower than P3 and has richer pixel information, serves as the third input. The TF module adjusts the three input feature maps to the same size and concatenates them. Compared with the original feature fusion module of the baseline network, the proposed module preserves rich low-level features, thereby enhancing small-target detection performance. The shallow feature layer P2 has a size of 160 × 160 × 128. After the depthwise convolution in the DSC (kernel size = 3, stride = 2, padding = 1), the size is reduced to 80 × 80 × 128; a pointwise convolution then linearly combines information from different channels to form the feature layer P2′. The upsampling module in Trident Fusion uses nearest-neighbor interpolation to enlarge the small input feature map (40 × 40 × 512) by a factor of 2 with minimal computational effort. Notably, the feature map P2 retains substantial positional information, which is critical for localizing small objects owing to its high spatial resolution. However, its rich background information contains complex textural characteristics that may interfere with detection. Therefore, the TF module deploys a depthwise separable convolution (DSC) [33] to filter out information irrelevant to the detected object.
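A minimal PyTorch sketch of this fusion step is given below. The P2 downsampling (3 × 3 depthwise convolution with stride 2 followed by a 1 × 1 pointwise convolution), the nearest-neighbor 2× upsampling of the 40 × 40 × 512 map, and the concatenation follow the description above; the 256-channel P3 input and the placement of normalization and activation inside the DSC are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv (stride 2 to downsample) followed by a 1x1 pointwise conv."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride,
                                   padding=1, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class TridentFusion(nn.Module):
    """Sketch of TF: align P2, P3, and an upsampled deep map to the 80x80 grid and concatenate."""
    def __init__(self, c_p2=128, c_p3=256, c_deep=512):
        super().__init__()
        self.p2_branch = DepthwiseSeparableConv(c_p2, c_p2, stride=2)  # 160x160 -> 80x80
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")    # 40x40 -> 80x80

    def forward(self, p2, p3, deep):
        p2 = self.p2_branch(p2)      # filter background clutter while downsampling
        deep = self.upsample(deep)   # bring deep semantics to the 80x80 grid
        return torch.cat([p2, p3, deep], dim=1)

# Shapes used in the paper's example: P2 160x160x128, P3 80x80x256, deep map 40x40x512
tf = TridentFusion()
out = tf(torch.randn(1, 128, 160, 160), torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40))
print(out.shape)  # torch.Size([1, 896, 80, 80])
```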
DSC comprises two stages: a depthwise convolution that operates on each channel of the feature map independently and a pointwise convolution that operates across channels. When DSC is applied to a 160 × 160 × $C_{in}$ feature map with a 3 × 3 kernel, the number of parameters is $3 \times 3 \times C_{in} + C_{in} \times C_{out}$, whereas a traditional convolution of the same kernel size requires $3 \times 3 \times C_{in} \times C_{out}$ parameters, where $C_{in}$ is the number of input feature map channels and $C_{out}$ is the number of output feature map channels. DSC therefore has far fewer parameters than a traditional convolutional layer with an equivalent kernel size. This approach effectively extends the receptive field while keeping parameter growth to a minimum.
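The parameter saving can be checked directly. The sketch below uses illustrative channel counts (128 input channels, 256 output channels) that are not taken from the paper.

```python
import torch.nn as nn

c_in, c_out = 128, 256  # illustrative channel counts

standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in, bias=False)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)

n_standard = sum(p.numel() for p in standard.parameters())   # 3*3*C_in*C_out = 294,912
n_dsc = sum(p.numel() for p in depthwise.parameters()) \
      + sum(p.numel() for p in pointwise.parameters())        # 3*3*C_in + C_in*C_out = 33,920
print(n_standard, n_dsc)  # DSC uses roughly one ninth of the parameters in this setting
```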

3.2. Cross Attention Module

We propose a Cross Attention Module (CAM) to capture contextual information for smoke detection with only a minor increase in computation. The network structure is illustrated in Figure 3. It consists of three convolutional layers, two self-attention operations, an addition operation, and a residual connection. Since self-attention shows remarkable performance in object detection, we adopt it as the main operation to extract features along the horizontal and vertical directions, and its outputs feed the addition operation. The residual connection facilitates cross-layer feature integration through additive operations, in which shallow-level features are directly combined with their corresponding deep representations. This design not only promotes hierarchical feature reuse but also mitigates gradient vanishing, thereby enhancing the model's generalization capacity by preserving information propagation pathways throughout the network depth.
The feature map X with spatial dimensions H × W × C is processed by the cross-attention module to yield the output feature map X′. Specifically, three 1 × 1 convolutional layers are applied to X to generate three feature maps Q, K, and V, respectively, where $Q, K \in \mathbb{R}^{c' \times h \times w}$ and $V \in \mathbb{R}^{c \times h \times w}$; $c'$ is $c/8$, with $c$ being the original channel count of X. We then reshape Q, K, and V to obtain the horizontal-direction feature maps $Q_h$, $K_h$, and $V_h$, and compute the correlation of all pixels along the horizontal direction, denoted $A_h$, via matrix multiplication:
$A_h = \mathrm{softmax}(Q_h K_h)$
where $Q_h \in \mathbb{R}^{w \times h \times c'}$ and $K_h \in \mathbb{R}^{w \times c' \times h}$.
$V_h$, obtained by reshaping V, is multiplied with $A_h$ to obtain the horizontal-direction feature matrix $X_h$. The attention computation in the vertical direction is analogous and yields the feature matrix $X_w \in \mathbb{R}^{c \times h \times w}$:
$X_h = A_h \times V_h$
$X_w = A_w \times V_w$
where $A_h \in \mathbb{R}^{w \times h \times h}$, $V_h \in \mathbb{R}^{w \times c \times h}$, $A_w \in \mathbb{R}^{h \times w \times w}$, and $V_w \in \mathbb{R}^{h \times c \times w}$.
Finally, the feature matrices $X_h$ and $X_w$ are fused by an addition operation, followed by a shortcut connection that facilitates the transmission of gradient information:
$X' = (X_h + X_w) + X$
Although CAM captures contextual information in both the horizontal and vertical directions, a single module is insufficient to capture long-range dependencies across all pixels. To tackle this problem, we connect two CAM modules in series. The first CAM takes the feature map X as input and outputs a feature map X′ containing the horizontal and vertical context of X. The second CAM then takes X′ as input and outputs a feature map X″ of the same shape as X that contains the global context of X.
The time and space complexity of classic self-attention is $O((h \times w) \times (h \times w))$. Serially stacking two CAMs reduces the time complexity to $O(2 \times (h \times w) \times (h + w))$, which greatly accelerates model training and inference.
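The sketch below illustrates one CAM block in this spirit: 1 × 1 convolutions produce Q, K, and V with the c/8 channel reduction described above, attention is computed independently along each row and each column, and the two directional outputs are added to the residual input. The exact reshaping order and any normalization layers shown in Figure 3 are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Sketch of one CAM block: self-attention along rows (horizontal) and columns (vertical)."""
    def __init__(self, c):
        super().__init__()
        c_red = max(c // 8, 1)             # Q/K channel reduction, as described in the text
        self.q = nn.Conv2d(c, c_red, 1)
        self.k = nn.Conv2d(c, c_red, 1)
        self.v = nn.Conv2d(c, c, 1)

    @staticmethod
    def _attend(q, k, v):
        # q, k: (B*rows, cols, c_red); v: (B*rows, cols, c) -> attention over the 'cols' axis
        attn = torch.softmax(q @ k.transpose(1, 2), dim=-1)   # (B*rows, cols, cols)
        return attn @ v                                        # (B*rows, cols, c)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Horizontal: treat each row independently, attend across the W dimension
        qh = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        kh = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        vh = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        xh = self._attend(qh, kh, vh).reshape(b, h, w, c).permute(0, 3, 1, 2)

        # Vertical: treat each column independently, attend across the H dimension
        qv = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        kv = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        vv = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        xv = self._attend(qv, kv, vv).reshape(b, w, h, c).permute(0, 3, 2, 1)

        return xh + xv + x                 # fuse the two directions and add the residual input

# Two CAM blocks in series propagate context to every pixel
cam = nn.Sequential(CrossAttention(256), CrossAttention(256))
print(cam(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```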

3.3. Soft-SPPF Module

Pooling reduces computational load by downsampling input feature maps while preserving salient features. Most architectures use max or average pooling; with max pooling, much of the feature information is discarded, whereas with average pooling the feature intensity over the whole region is averaged, weakening locally strong responses. This paper adopts SoftPool, which balances global and local information by assigning high weights to important features. SoftPool is differentiable, so every activation in the pooling region receives a gradient during backpropagation.
SoftPool is a softmax-weighted activation subsampling method. The feature weights are derived as follows:
$w_i = \dfrac{e^{a_i}}{\sum_{j \in R} e^{a_j}}$
In this formula, the weight $w_i$ equals the natural exponent of the activation value $a_i$ divided by the sum of the natural exponents of all activation values in the pooling region R. Higher activations therefore receive larger weights, making SoftPool a more balanced choice than simply selecting the maximum value. The weighted activations are summed to obtain SoftPool's output:
$\tilde{a} = \sum_{i \in R} w_i \times a_i$
We replace the max pooling in the original SPPF with SoftPool and name this structure the Soft-SPPF module, whose structure is illustrated in Figure 4. Initially, the input feature map passes through a convolutional layer for feature compression. Subsequently, three sequential SoftPool operations with a 5 × 5 kernel are applied to generate multi-scale feature maps. Finally, these feature maps of different scales are concatenated along the channel dimension and passed through a convolution layer to form a feature vector containing multi-scale information. During this process, the Soft-SPPF module compresses the number of channels of the input feature map from 512 to 256; after multi-scale feature fusion, the number of channels of the output feature map is restored to 512, significantly reducing the amount of computation.
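A compact sketch of Soft-SPPF is shown below. SoftPool is expressed through two average-pooling calls (the exponentially weighted numerator divided by the sum of exponentials); the stride-1, padding-2 configuration that keeps the spatial size constant and the omission of normalization layers are assumptions for illustration, while the 512 to 256 compression, the three chained 5 × 5 pooling operations, and the concatenation follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=5, stride=1, padding=2):
    """SoftPool: exponentially weighted average of activations in each pooling window."""
    w = torch.exp(x)
    # avg_pool of (w*x) divided by avg_pool of w equals the softmax-weighted sum per window
    # (no numerical-stability tricks; sufficient for a sketch)
    return F.avg_pool2d(w * x, kernel_size, stride, padding) / \
           F.avg_pool2d(w, kernel_size, stride, padding).clamp_min(1e-6)

class SoftSPPF(nn.Module):
    """Sketch of Soft-SPPF: 512 -> 256 channels, three chained 5x5 SoftPool ops, concat, 512 out."""
    def __init__(self, c_in=512, c_out=512):
        super().__init__()
        c_hidden = c_in // 2                               # 512 -> 256 compression
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, bias=False)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, bias=False)

    def forward(self, x):
        x = self.cv1(x)
        y1 = soft_pool2d(x)
        y2 = soft_pool2d(y1)
        y3 = soft_pool2d(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))  # fuse multi-scale context

sppf = SoftSPPF()
print(sppf(torch.randn(1, 512, 20, 20)).shape)  # torch.Size([1, 512, 20, 20])
```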

3.4. Normalized Gaussian Wasserstein Distance Loss

The Normalized Wasserstein Distance (NWD) is an evaluation metric specifically designed for small object detection [34]. By modeling bounding boxes as Gaussian distributions and computing the Wasserstein distance between them, as defined below, NWD offers a more nuanced, differentiable, and robust alternative to IoU. Crucially, NWD measures similarity even when the bounding boxes do not overlap, overcoming a limitation of traditional intersection-based metrics. Furthermore, its inherent insensitivity to target scale makes NWD particularly suitable for assessing the similarity of small targets.
$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \lVert Q_a - Q_b \rVert_2^2$
$Q_a = \left[ cx_a, cy_a, \dfrac{w_a}{2}, \dfrac{h_a}{2} \right]^{T}$
$Q_b = \left[ cx_b, cy_b, \dfrac{w_b}{2}, \dfrac{h_b}{2} \right]^{T}$
$W_2^2(\mathcal{N}_a, \mathcal{N}_b)$ denotes the distance between the two Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$, which are modeled from the bounding boxes via the vectors $Q_a$ and $Q_b$ defined above.
$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\!\left( -\dfrac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right)$
NWD is obtained by applying exponential normalization to the Wasserstein distance, enabling its direct use as a similarity measure, where the constant C is a scale normalization parameter that depends on the dataset. Based on the average number of pixels occupied by small smoke targets in the AI For Humankind training data, we set the normalization constant C to 18.
$L_{NWD} = 1 - NWD(\mathcal{N}_a, \mathcal{N}_b)$
We integrate NWD into the regression loss function in place of the CIoU loss, which gives the model scale invariance while maintaining its ability to measure the similarity between bounding boxes.
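Under the definitions above, the NWD loss can be sketched as follows for boxes parameterized as (cx, cy, w, h). This is an illustrative implementation with C = 18, not the authors' exact training code.

```python
import torch

def nwd_loss(pred, target, C=18.0, eps=1e-7):
    """Sketch of the NWD regression loss for boxes given as (cx, cy, w, h); C is the dataset constant."""
    # Summarize each box as the Gaussian parameter vector [cx, cy, w/2, h/2]
    qa = torch.stack([pred[..., 0], pred[..., 1], pred[..., 2] / 2, pred[..., 3] / 2], dim=-1)
    qb = torch.stack([target[..., 0], target[..., 1], target[..., 2] / 2, target[..., 3] / 2], dim=-1)
    w2 = ((qa - qb) ** 2).sum(dim=-1)              # squared 2-Wasserstein distance
    nwd = torch.exp(-torch.sqrt(w2 + eps) / C)     # exponential normalization to (0, 1]
    return 1.0 - nwd                               # loss = 1 - similarity

# Two small boxes offset by a few pixels still yield a usable similarity signal
pred = torch.tensor([[100.0, 100.0, 12.0, 10.0]])
gt = torch.tensor([[103.0, 101.0, 12.0, 10.0]])
print(nwd_loss(pred, gt))
```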

4. Results

4.1. Dataset

This paper utilizes an open dataset collected and organized by the non-governmental organization AI For Humankind to investigate the deployment of object detection models for early wildfire smoke identification [35]. The dataset contains 2192 annotated images captured by HPWREN cameras at a resolution of 640 × 480 pixels. To enhance model generalization during training, data augmentation is applied probabilistically: the image is mirrored with a probability of 20%, rotated by 10° with a probability of 20%, scaled by a factor of 0.9 with a probability of 20%, brightness-adjusted by a value of 0.4 with a probability of 20%, and translated by a value of 0.1 with a probability of 20%. The augmentation particularly increases the number of images with white clouds, strong light, and fog in the background to verify the effectiveness of the model under environmental interference. The expanded dataset comprises 12,620 images, randomly divided into training, validation, and test sets at a ratio of 8:1:1. Some representative images are shown in Figure 5.
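An illustrative torchvision version of this augmentation policy is sketched below. The probabilities and magnitudes follow the text, but the specific transform classes are our choice; in a detection pipeline the bounding boxes would have to be transformed together with the images, which these image-only transforms do not handle.

```python
import torchvision.transforms as T

# Image-level augmentation policy sketch (each operation applied with probability 0.2)
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.2),                                             # mirror
    T.RandomApply([T.RandomRotation(degrees=10)], p=0.2),                      # rotate up to 10 deg
    T.RandomApply([T.RandomAffine(degrees=0, scale=(0.9, 0.9))], p=0.2),       # scale by 0.9
    T.RandomApply([T.ColorJitter(brightness=0.4)], p=0.2),                     # brightness jitter 0.4
    T.RandomApply([T.RandomAffine(degrees=0, translate=(0.1, 0.1))], p=0.2),   # translate up to 0.1
])
```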

4.2. Implementation Details

To evaluate and verify the performance of TCSN-YOLO, we conducted several experiments. The hardware and software environments were consistent across all tests. Detailed environmental information is provided in Table 1.
In TCSN-YOLO, the activation function is SiLU, and the model is optimized with the Adaptive Moment Estimation (Adam) optimizer using a weight decay of 0.0005 and an initial learning rate of 0.001. Non-Maximum Suppression (NMS) is applied with an Intersection-over-Union (IoU) threshold of 0.5. For sample allocation, we employ a Task-Aligned Assigner that dynamically assigns positive and negative samples by computing the alignment score between each prediction box and the ground-truth box.
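A minimal sketch of these settings is given below; the tiny placeholder model is hypothetical and stands in for TCSN-YOLO, while the optimizer hyperparameters and the NMS threshold follow the values stated above.

```python
import torch
from torchvision.ops import nms

# Hypothetical placeholder model; the settings mirror the training configuration in the text
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.SiLU())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-4)

# Post-processing: NMS with an IoU threshold of 0.5 on (x1, y1, x2, y2) boxes
boxes = torch.tensor([[10., 10., 50., 50.], [12., 12., 52., 52.], [100., 100., 140., 150.]])
scores = torch.tensor([0.9, 0.75, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # indices of the retained boxes
```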

4.3. Experimental Model Evaluation Indicators

To evaluate the model's performance on small targets, we use APs (AP for small smoke objects at IoU = 0.50) as the main performance metric, supplemented by AP50 (IoU threshold of 0.5) and AP50:95 (averaged over IoU thresholds from 0.5 to 0.95); a compact computation sketch is given after the list below. In addition, because the intended TCSN-YOLO application scenario is an embedded platform whose computing resources constrain the scale of the algorithm, we also measure two pivotal computational complexity metrics: the trainable parameter count (Params) and floating-point operations (FLOPs).
  • Precision quantifies the proportion of correctly identified positive instances among all samples predicted as positive. The calculation formula is as follows:
    $P = \dfrac{TP}{TP + FP}$
  • Recall measures the proportion of true positive instances correctly predicted by the model among all actual positive cases. The computational formula is expressed as follows:
    $R = \dfrac{TP}{TP + FN}$
  • Average Precision.
    $AP = \int_{0}^{1} P(R)\, dR$
    Average Precision (AP) quantifies the model’s precision performance averaged across all recall levels for a specific category, typically computed as the area under the Precision–Recall (PR) curve.
  • Params. This quantifies the total parameter count of the model; while it does not directly influence inference speed, it fundamentally determines the memory requirements during inference.
  • FLOPs. This quantifies the computational complexity of the model and serves as a key metric for evaluating model efficiency.
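The sketch below shows how precision, recall, and AP are computed from a ranked list of detections, where each detection is flagged as a true positive if it matches a ground-truth box at the chosen IoU threshold. It follows the standard all-point interpolation and is not the exact evaluation code used in this work.

```python
import numpy as np

def average_precision(tp_flags, scores, num_gt):
    """Precision/recall/AP from a ranked detection list (all-point interpolation)."""
    order = np.argsort(-np.asarray(scores))          # sort detections by confidence
    tp = np.asarray(tp_flags, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)
    # Monotone precision envelope, then sum the rectangle areas under the PR curve
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    ap = np.sum((r[1:] - r[:-1]) * p[1:])
    return precision, recall, ap

# Toy example: four detections (1 = matched a GT box at the chosen IoU threshold), three GT boxes
precision, recall, ap = average_precision([1, 0, 1, 1], [0.9, 0.8, 0.7, 0.6], num_gt=3)
print(precision, recall, ap)   # ap is about 0.83 here
```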

4.4. Ablation Analysis of Network Structures

YOLOv8, which performs well on standard datasets such as Microsoft COCO, is used as the baseline for the ablation experiments. The YOLOv8L model achieves a mean average precision (mAP) of 52.9% on COCO. This high accuracy stems from the C2f module adopted in its backbone, which reduces parameters by 28% while enhancing feature reuse by shortening gradient paths and introducing cross-stage connections. Its neck restructures the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN), significantly improving hierarchical feature fusion across scales. These factors contribute to improved detection accuracy and efficiency.
To validate the effectiveness of each improvement strategy, we conduct a series of ablation experiments with different settings on the validation set of the AI For Humankind dataset. As shown in Table 2, where each row indicates which strategies are applied on top of the baseline, every enhancement strategy improves detection performance beyond the baseline model. Specifically, the integration of SoftPool, TF, CAM, and the NWD loss yields measurable gains, and all strategies together elevate the key detection metrics.
The proposed SoftPool enhances the accuracy of the network model in practical applications, improving AP50 by 0.2% and AP50:95 by 0.3%. The exponential weighting preserves the basic attributes of the input features while amplifying stronger activations.
The integration of the TF module into the feature fusion structure introduces shallow features rich in location information from the backbone, thereby preserving clear details of small targets. In addition, the TF module filters the background noise in the shallow features through the DSC module. Baseline + TF achieves 36.9% APs, 0.5% higher than the original YOLOv8, with similar parameters (43.76 M vs. 43.71 M) and slightly higher FLOPs (166.1 G vs. 165.2 G).
The CAM module extracts contextual features from the feature map along both the horizontal and vertical axes. It obtains long-range dependencies between pixels with only a small increase in computation, enabling the model to balance computational efficiency and detection performance.
Integrating NWD into the regression loss function in place of the CIoU loss ensures that small-target detections are not judged as negative samples because of slight deviations from the ground truth, thereby significantly improving small-target detection robustness. The NWD loss brings significant improvements in AP50, AP50:95, and APs; its effect on APs exceeds that of the other improvement strategies, demonstrating that scale invariance plays a critical role in boosting small-target detection performance.
The proposed TCSN-YOLO model combines all of these improvement strategies. For small objects in complex environments, it shows excellent detection performance, achieving higher confidence and fewer false positives and false negatives. Compared with the baseline model, TCSN-YOLO improves AP50 by 1.9%, AP50:95 by 2.6%, and APs by 0.7%.

4.5. Comparison with Mainstream Methods

To evaluate the detection performance of TCSN-YOLO, we conducted comparative experiments with several mainstream object detection models, including TPH-YOLOv5, TPH-YOLOv5++, DETR, and YOLOv8. The experiments were performed on the AI For Humankind dataset, and the results are summarized in Table 3. Comparative analysis reveals that the proposed TCSN-YOLO model achieves the highest detection precision among all models, although its FLOPs are slightly higher than those of TPH-YOLOv5++ and YOLOv8.
The experimental results can be explained as follows. Although the additional prediction head in TPH-YOLOv5 significantly improves small object detection, it also incurs large computational and memory overhead; moreover, the model is not scale-invariant and is prone to false and missed detections. TPH-YOLOv5++ improves on TPH-YOLOv5 by eliminating the dedicated tiny-object detection head; instead, it employs the Cross-layer Asymmetric Transformer (CA-Trans) module to enhance small-object feature expression and models sparse cross-layer attention through the Sparse Local Attention (SLA) module, maintaining high small-target accuracy while reducing computation. However, SLA only computes attention between neighbors and lacks a mechanism for capturing global contextual relationships. DETR adopts a Transformer encoder-decoder architecture and uses its global attention mechanism to analyze the contextual information of the entire image; it shows clear advantages on large targets, but its performance on small targets still needs improvement. Compared with the aforementioned models, YOLOv8 has fewer parameters, lower FLOPs, and strong detection performance; however, in complex backgrounds, its deep feature maps often contain insufficient semantic information to maintain robust detection of small objects. Through the efficient integration of complementary shallow feature maps, the proposed detection model gains more comprehensive and effective representations, significantly enhancing small object detection under these challenging conditions. Consequently, TCSN-YOLO maintains high detection performance for small objects in complex backgrounds with low computational complexity and memory overhead, whereas the competing methods degrade under similar conditions.
To further evaluate detection performance, we selected images with complex backgrounds of fog, strong light, and white clouds to compare the mainstream models with TCSN-YOLO. As shown in Table 4, column (a) presents results without obvious interference, column (b) results under strong light, column (c) results with white clouds in the background, and column (d) results with fog in the background. The results show that TCSN-YOLO is little affected by background interference and is robust to environmental noise. Compared with the other models, it detects the category and position of smoke targets more accurately; it has learned more complete and detailed small-target features, which greatly alleviates missed detections.
Figure 6 compares the training behavior with that of the other mainstream methods, where Figure 6a–c show the AP50, AP50:90, and box regression loss curves, respectively. To visualize the comparison clearly, we present the first 100 training epochs. TCSN-YOLO achieves superior performance across all key metrics, demonstrating its enhanced detection capability. Furthermore, its rapid convergence improves training efficiency and conserves computational resources. These results verify the practicality of the model for actual deployment.

4.6. Generalization Experiment

We verify the generalization ability of the TCSN-YOLO model on the AI For Humankind and FIgLib datasets [36]. The experimental setup is identical to that of the previous experiments. The results are presented in Table 5.
The AI For Humankind dataset includes 2194 bounding-box-annotated images captured by HPWREN cameras on remote mountaintops in Southern California, covering different backgrounds (clouds, strong light, fog, etc.); the smoke images are annotated in the Pascal VOC format. FIgLib was developed and is maintained by Hans-Werner Braun at the University of California San Diego (UCSD) to support neural network researchers in advancing early fire detection systems and fundamental research on fire initiation mechanisms. The dataset contains 24,800 high-resolution images of 1536 × 2048 or 2048 × 3072 pixels, covering terrains such as mountains, deserts, and coastal regions, and includes temporal sequences spanning an entire year to capture seasonal changes.
The experimental results confirm that TCSN-YOLO maintains high accuracy on both datasets, demonstrating exceptional resilience to interference and robust feature extraction. The model consistently achieves high AP and AP50:95 values under diverse conditions, indicating strong performance stability for small targets in complex scenes and validating its suitability for challenging small-target smoke detection applications.
The visualization results on the FIgLib dataset are shown in Figure 7. We selected representative images from FIgLib to validate the accuracy of the TCSN-YOLO architecture across different fire scenarios, including scenes with a variety of interfering backgrounds: for example, in the third image strong light blurs the outline of the smoke, while in the fourth image the color and shape of the smoke closely resemble the background mountains. The visualization confirms that TCSN-YOLO accurately identifies small smoke targets and remains robust in these complex scenarios, and its consistent performance on unseen data further affirms its strong generalization capability.
Consequently, ablation, comparative, and generalization experiments demonstrate that TCSN-YOLO surpasses mainstream small object detectors while exhibiting robust generalization capabilities. The model sustains high detection precision and stability in challenging scenarios, including complex backgrounds and small targets.

5. Discussion

Aiming at the challenge of detecting small smoke targets in complex backgrounds, we propose TCSN-YOLO, a model based on YOLOv8. First, we propose trident fusion (TF), which receives more shallow features to identify small-sized smoke. Then, to address the difficulty of capturing global dependencies of small smoke, CAM is introduced to strengthen feature extraction along different directions of the feature map. Moreover, we substitute SoftPool for the max pooling in the SPPF module, which minimizes information loss during pooling while preserving the functionality of the pooling layer. Finally, the NWD metric is embedded into the loss function of our detector to distinguish positive and negative samples under the same threshold. We attribute the performance gains to each architectural innovation through a series of ablation experiments and conduct a comprehensive comparison with established classical algorithms. TCSN-YOLO achieves 37.1% APs, 90.3% AP50, and 40.4% AP50:95, surpassing existing methods and demonstrating its effectiveness. TCSN-YOLO also demonstrates excellent anti-interference capability and accuracy, making it very effective for small-target fire smoke detection scenarios that require high-precision recognition and robustness.
Although TCSN-YOLO performs effectively in detecting small smoke targets, interference factors such as sunlight at specific angles or low smoke concentration can still lead to misjudgments. Future work will study the characteristic differences between smoke and these interfering factors to enhance the algorithm's robustness. Additionally, while the complex structure of TCSN-YOLO provides powerful capabilities, it introduces optimization challenges; the increased number of parameters imposes greater computational demands on the hardware, potentially limiting real-time deployment. Furthermore, although we have expanded the dataset, it is still relatively small; smaller datasets contain fewer smoke features and insufficient scene coverage, which brings a risk of overfitting. We will therefore collect smoke data from more scenarios to improve the generalization ability of the model. Future research should also prioritize cross-modal distillation, in which a multimodal teacher model composed of text and images guides a unimodal student model to learn cross-modal correlation features; this reduces the number of model parameters while enabling the student model to achieve near-multimodal perception capabilities.

Author Contributions

Conceptualization, C.Y.; software, C.Y.; supervision, Z.J.; visualization, W.H. and W.G.; writing—original draft, C.Y.; writing—review and editing, Z.J. and W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science Foundation of China grant number 62375129.

Data Availability Statement

The dataset referenced is publicly available (see [35,36]). The experimental data supporting the findings of this study are available from the first author (C.Y.) upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sathishkumar, V.E.; Cho, J.; Subramanian, M.; Naren, O.S. Forest fire and smoke detection using deep learning-based learning without forgetting. Fire Ecol. 2023, 19, 9. [Google Scholar] [CrossRef]
  2. Kim, S.Y.; Muminov, A. Forest fire smoke detection based on deep learning approaches and unmanned aerial vehicle images. Sensors 2023, 23, 5702. [Google Scholar] [CrossRef] [PubMed]
  3. Çınarer, G. Hybrid Backbone-Based Deep Learning Model for Early Detection of Forest Fire Smoke. Appl. Sci. 2025, 15, 7178. [Google Scholar] [CrossRef]
  4. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
  5. Geetha, S.; Abhishek, C.S.; Akshayanat, C.S. Machine vision based fire detection techniques: A survey. Fire Technol. 2021, 57, 591–623. [Google Scholar] [CrossRef]
  6. Prema, C.E.; Suresh, S.; Krishnan, M.N.; Leema, N. A novel efficient video smoke detection algorithm using co-occurrence of local binary pattern variants. Fire Technol. 2022, 58, 3139–3165. [Google Scholar] [CrossRef]
  7. Filonenko, A.; Hernández, D.C.; Jo, K.H. Fast smoke detection for video surveillance using CUDA. IEEE Trans. Ind. Inform. 2017, 14, 725–733. [Google Scholar] [CrossRef]
  8. Alamgir, N.; Nguyen, K.; Chandran, V.; Boles, W. Combining multi-channel color space with local binary co-occurrence feature descriptors for accurate smoke detection from surveillance videos. Fire Saf. J. 2018, 102, 1–10. [Google Scholar] [CrossRef]
  9. Li, Z.; Mihaylova, L.S.; Isupova, O.; Rossi, L. Autonomous flame detection in videos with a Dirichlet process Gaussian mixture color model. IEEE Trans. Ind. Inform. 2017, 14, 1146–1154. [Google Scholar] [CrossRef]
  10. Krstinić, D.; Stipaničev, D.; Jakovčević, T. Histogram-based smoke segmentation in forest fire detection system. Inf. Technol. Control 2009, 38, 237–244. [Google Scholar]
  11. Sultan, T.; Chowdhury, M.S.; Safran, M.; Mridha, M.F.; Dey, N. Deep learning-based multistage fire detection system and emerging direction. Fire 2024, 7, 451. [Google Scholar] [CrossRef]
  12. Wang, J.; Zhang, X.; Zhang, C. A lightweight smoke detection network incorporated with the edge cue. Expert Syst. Appl. 2024, 241, 122583. [Google Scholar] [CrossRef]
  13. Ba, R.; Chen, C.; Yuan, J.; Song, W.; Lo, S. SmokeNet: Satellite smoke scene detection using convolutional neural network with spatial and channel-wise attention. Remote Sens. 2019, 11, 1702. [Google Scholar] [CrossRef]
  14. Hashemzadeh, M.; Farajzadeh, N.; Heydari, M. Smoke detection in video using convolutional neural networks and efficient spatio-temporal features. Appl. Soft Comput. 2022, 128, 109496. [Google Scholar] [CrossRef]
  15. Majid, S.; Alenezi, F.; Masood, S.; Ahmad, M.; Gündüz, E.S.; Polat, K. Attention based CNN model for fire detection and localization in real-world images. Expert Syst. Appl. 2022, 189, 116114. [Google Scholar] [CrossRef]
  16. Khan, S.; Muhammad, K.; Hussain, T.; Del Ser, J.; Cuzzolin, F.; Bhattacharyya, S.; Akhtar, Z.; de Albuquerque, V.H.C. Deepsmoke: Deep learning model for smoke detection and segmentation in outdoor environments. Expert Syst. Appl. 2021, 182, 115125. [Google Scholar] [CrossRef]
  17. Hosseini, A.; Hashemzadeh, M.; Farajzadeh, N. UFS-Net: A unified flame and smoke detection method for early detection of fire in video surveillance applications using CNNs. J. Comput. Sci. 2022, 61, 101638. [Google Scholar] [CrossRef]
  18. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11 October 2021; pp. 2778–2788. [Google Scholar]
  19. Zhao, Q.; Liu, B.; Lyu, S.; Wang, C.; Zhang, H. Tph-yolov5++: Boosting object detection on drone-captured scenarios with cross-layer asymmetric transformer. Remote Sens. 2023, 15, 1687. [Google Scholar] [CrossRef]
  20. Amjad, A.; Huroon, A.M.; Chang, H.T.; Tai, L.C. Dynamic fire and smoke detection module with enhanced feature integration and attention mechanisms. Pattern Anal. Appl. 2025, 28, 81. [Google Scholar] [CrossRef]
  21. Bahhar, C.; Ksibi, A.; Ayadi, M.; Jamjoom, M.M.; Ullah, Z.; Soufiene, B.O.; Sakli, H. Wildfire and smoke detection using staged YOLO model and ensemble CNN. Electronics 2023, 12, 228. [Google Scholar] [CrossRef]
  22. Kwak, D.K.; Ryu, J.K. A study on the dynamic image-based dark channel prior and smoke detection using deep learning. J. Electr. Eng. Technol. 2022, 17, 581–589. [Google Scholar] [CrossRef]
  23. Jeong, M.; Park, M.; Nam, J.; Ko, B.C. Light-Weight Student LSTM for Real-Time Wildfire Smoke Detection. Sensors 2020, 20, 5508. [Google Scholar] [CrossRef] [PubMed]
  24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  25. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23 August 2020; pp. 213–229. [Google Scholar]
  26. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 17 March 2021; pp. 10012–10022. [Google Scholar]
  27. Safarov, F.; Muksimova, S.; Kamoliddin, M.; Cho, Y.I. Fire and Smoke Detection in Complex Environments. Fire 2024, 7, 389. [Google Scholar] [CrossRef]
  28. Rohra, A.; Yin, B.; Bilal, H.; Kumar, A.; Ali, M.; Li, Y. MSFFNet: Multi-scale feature fusion network with semantic optimization for crowd counting. Pattern Anal. Appl. 2025, 28, 21. [Google Scholar] [CrossRef]
  29. Singh, K.; Parihar, A.S. Bff: Bi-stream feature fusion for object detection in hazy environment. Signal Image Video Process. 2024, 18, 3097–3107. [Google Scholar] [CrossRef]
  30. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16 June 2019; pp. 7036–7045. [Google Scholar]
  31. Achinek, D.N.; Shehu, I.S.; Athuman, A.M.; Fu, X. DAF-Net: Dense attention feature pyramid network for multiscale object detection. Int. J. Multimed. Inf. Retr. 2024, 13, 18. [Google Scholar] [CrossRef]
  32. Zhang, H.; Zhang, S. Focaler-iou: More focused intersection over union loss. arXiv 2024, arXiv:2401.10525. [Google Scholar]
  33. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  34. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  35. HPWREN/AI for Mankind. Available online: https://github.com/aiformankind/wildfire-smoke-dataset (accessed on 20 November 2024).
  36. Dewangan, A.; Pande, Y.; Braun, H.W.; Vernon, F.; Perez, I.; Altintas, I.; Cottrell, G.W.; Nguyen, M.H. FIgLib & SmokeyNet: Dataset and deep learning model for real-time wildland fire smoke detection. Remote Sens. 2022, 14, 1007. [Google Scholar]
Figure 1. TCSN-YOLO network architecture.
Figure 2. Trident Fusion module structure.
Figure 3. Cross attention module structure.
Figure 4. Soft-SPPF module structure.
Figure 5. Some representative images from AI For Humankind dataset. (a) Complex background, gray rocks. (b) Complex background, fog. (c) Complex background, strong light. (d) Simple background, small object. (e) Simple background, medium object. (f) Complex background, white clouds.
Figure 6. Comparison with other mainstream methods. (a) AP50 curves. (b) AP50:90 curves. (c) Box regression loss curves.
Figure 7. Visualization results on the FIgLib dataset. (a) Wildfire scene, stones. (b) Wildfire scene, roads, mountains. (c) Wildfire scene, strong light. (d) Wildfire scene, trees, mountains.
Table 1. Hardware and software environment.
Hardware environment: CPU Intel Xeon Gold 6330; RAM 64 GB; GPU NVIDIA GeForce RTX 3090; video memory 16 GB.
Software environment: OS Ubuntu 18.04; language Python 3.8; framework PyTorch 1.9.0.
Table 2. Results of the ablation experiment. Each row lists the strategies applied on top of the YOLOv8 baseline.
Baseline: AP50 88.4%, AP50:95 37.8%, APs 36.4%, APM 56.8%, Params 43.71 M, FLOPs 165.2 G
Baseline + TF: AP50 89.5%, AP50:95 38.7%, APs 36.9%, APM 57.4%, Params 43.76 M, FLOPs 166.1 G
Baseline + CAM: AP50 88.9%, AP50:95 38.2%, APs 36.6%, APM 58.2%, Params 44.18 M, FLOPs 167.4 G
Baseline + SoftPool: AP50 88.6%, AP50:95 38.1%, APs 36.5%, APM 57.1%, Params 43.82 M, FLOPs 165.9 G
Baseline + NWD-Loss: AP50 88.9%, AP50:95 38.5%, APs 37.0%, APM 56.9%, Params 43.71 M, FLOPs 165.4 G
Baseline + TF + CAM + SoftPool + NWD-Loss (TCSN-YOLO): AP50 90.3%, AP50:95 40.4%, APs 37.1%, APM 61.6%, Params 45.34 M, FLOPs 170.5 G
Table 3. Comparative performances of the proposed and mainstream models on the AI For Mankind dataset.
TPH-YOLOv5: AP50 87.5%, AP50:95 37.5%, APs 35.1%, APM 58.0%, Params 45.36 M, FLOPs 260.1 G
TPH-YOLOv5++: AP50 87.7%, AP50:95 38.8%, APs 36.1%, APM 59.5%, Params 41.49 M, FLOPs 160.0 G
DETR: AP50 78.8%, AP50:95 30.0%, APs 29.6%, APM 55.4%, Params 60.28 M, FLOPs 253.6 G
YOLOv8: AP50 88.4%, AP50:95 37.8%, APs 36.4%, APM 56.8%, Params 43.71 M, FLOPs 165.2 G
TCSN-YOLO (ours): AP50 90.3%, AP50:95 40.4%, APs 37.1%, APM 61.6%, Params 45.34 M, FLOPs 170.5 G
Table 4. Visual comparison of detection results of different models on the AI For Humankind dataset. Column (a) shows results without obvious interference, column (b) results under strong light, column (c) results with white clouds in the background, and column (d) results with fog in the background. Rows show the detection-result images for TPH-YOLOv5, TPH-YOLOv5++, DETR, YOLOv8, and TCSN-YOLO.
Table 5. Generalization performance of the TCSN-YOLO model evaluated on the FIgLib dataset.
AI For Humankind: AP50 90.3%, AP50:95 40.4%, APs 37.1%, APM 61.6%
FIgLib: AP50 88.0%, AP50:95 38.2%, APs 35.8%, APM 61.1%
