Light-YOLOv5: A Lightweight Algorithm for Improved YOLOv5 in Complex Fire Scenarios

Fire-detection technology is of great importance for successful fire-prevention measures, and image-based fire detection is one effective method. At present, object-detection algorithms fall short in both detection speed and accuracy when applied to complex fire scenarios. In this study, a lightweight fire-detection algorithm, Light-YOLOv5 (You Only Look Once version five), is presented. First, a separable vision transformer (SepViT) block replaces several C3 modules in the final layer of the backbone network to strengthen the backbone's contact with global information and improve the extraction of flame and smoke features; second, a light bidirectional feature pyramid network (Light-BiFPN) is designed to lighten the model while improving feature fusion and balancing speed and accuracy during fire detection; third, a global attention mechanism (GAM) is fused into the network so that the model focuses more on global dimensional features, further improving detection accuracy; and finally, the Mish activation function and the SIoU loss are utilized to simultaneously increase convergence speed and enhance accuracy. The experimental results show that, compared to the original algorithm, the mean average precision (mAP) of Light-YOLOv5 increases by 3.3%, the number of parameters decreases by 27.1%, and the floating-point operations (FLOPs) decrease by 19.1%. The detection speed reaches 91.1 FPS, enabling real-time detection of targets in complex fire scenarios.

Due to the irregular shape of smoke, its uneven spatial distribution, and its short existence time, accurate smoke detection is very difficult. Traditional methods usually detect smoke by features such as its color, texture, and shape. Favorskaya et al. [1] used dynamic texture features, detecting smoke with two-dimensional and three-dimensional LBP histograms, which can exclude the interference of wind in static scenes. Dimitropoulos et al. [2] used an HSV model and an adaptive median algorithm for preprocessing, followed by a high-order linear dynamical system for dynamic texture analysis of smoke, which significantly improves detection accuracy. Wang et al. [3] proposed a flame-detection method that combines the dynamic and static features of flames in video, reducing the influence of the environment by combining flame color features with local features.
In recent years, with the rapid development of machine learning, the latest object-detection algorithms in deep learning have been applied to the field of fire detection. Wang et al. [4] used an improved YOLOv4 network for real-time smoke and fire detection, dramatically reducing the number of parameters to improve detection speed and successfully deploying to UAVs, but with lower accuracy than the original algorithm. Zhang et al. [5] proposed a T-YOLOX fire-detection algorithm using the ViT technique to improve the accuracy of detecting smoke, fire, and people, but did not discuss the number of parameters or the computational cost. Zhao et al. [6] proposed an improved Fire-YOLO algorithm for forest fires to enhance the detection of small targets and reduce the model size, but again the number of parameters and computational cost were not discussed. Li et al. [7] improved the YOLOv3-tiny algorithm, raising fire-detection accuracy through multi-scale fusion and k-means clustering, but the detection speed is not ideal and the application scenario is narrow. Yue et al. [8] reduced the false-detection rate by increasing the resolution of the feature map and expanding the receptive field, but the detection speed is not satisfactory. Wu et al. [9] added dilated convolutions to the SPP module of YOLOv5 and used the GELU activation function and DIoU-NMS to improve speed and accuracy, improving the robustness of fire detection and meeting the requirements of video fire detection. Zheng et al. [10] built an improved DCNN model to identify forest fires and used transfer learning and PCA techniques to improve accuracy, but did not analyze real-time performance. Xue et al. [11] used an SPPFP module to replace the SPPF module in YOLOv5, added the CBAM attention module, and applied transfer learning and other methods to improve the detection accuracy of small and medium-sized targets in forest fires, achieving good results but sacrificing speed. Wu et al.
[12] improved YOLOv4-tiny by adding the SE attention mechanism and using multi-scale detection to enhance the detection of small targets and occluded objects, meeting the requirements of ship-fire detection at sea. Shahid et al. [13] used the vision transformer for fire detection and demonstrated the feasibility of ViT, but the number of parameters and computational cost could be improved, and real-time performance was not discussed. In summary, some of the works mentioned above are unsatisfactory in speed, some in accuracy, and some consider too narrow a range of environments to achieve a balance among the three. Therefore, this paper proposes the Light-YOLOv5s method for fire detection, aiming to balance speed and accuracy in complex environments.

Baseline
YOLOv5 is an object-detection network of the YOLO series, famous for being fast, lightweight, and accurate. The structure of YOLOv5 consists of four modules: input, backbone, neck, and prediction. Compared with YOLOv4, YOLOv5 adds mosaic data augmentation and adaptive anchor-box calculation, and uses the Leaky ReLU and Sigmoid activation functions, among other changes. YOLOv5 comes in n, s, m, l, and x versions. After experimental comparison, we chose YOLOv5n, which offers both speed and accuracy, as the baseline for improvement, and we call the improved model Light-YOLOv5.

Separable Vision Transformer
In recent years, the Vision Transformer [14][15] has achieved great success in a range of computer-vision tasks, with performance exceeding that of CNNs in major domains. However, this performance usually comes at the cost of increased computational complexity and parameter count; such algorithms often require expensive GPU computing power and are difficult to deploy on mobile devices. The Separable Vision Transformer [16] addresses this challenge by maintaining accuracy while balancing computational cost. In this paper, the last layer of the backbone network is replaced with the SepViT Block, which enhances the feature-extraction capability of the model and optimizes the network's handling of global information. The SepViT Block uses depthwise self-attention and pointwise self-attention to reduce computation while enabling local information communication and global information interaction across windows. First, each window of the divided feature map is treated as one of the input channels, each window carrying its own information; a depthwise self-attention (DWA) is then performed on each window token and its pixel tokens. The operation of DWA is as follows:

DWA(f) = Attention(f · W_Q, f · W_K, f · W_V),

where f denotes the feature tokens, composed of window tokens and pixel tokens; W_Q, W_K, and W_V represent the three linear layers for the query, key, and value computations of a routine self-attention; and Attention denotes a standard self-attention operation. After the DWA operation, pointwise self-attention (PWA) is used to establish connections among windows and generate the attention map via layer normalization (LN) and the GELU activation function.
The operation of PWA is as follows:

PWA(f, wt) = Attention(GELU(LN(wt)) · W_Q, GELU(LN(wt)) · W_K, f),

where wt denotes the window tokens. The SepViT Block can then be expressed as f_n = PWA(f, wt), where f_n denotes the output of the SepViT Block.
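To make the window-wise attention concrete, the following is a minimal NumPy sketch of the standard scaled dot-product self-attention that DWA applies independently within each window. The window partitioning, the learned `W_Q`/`W_K`/`W_V` projections (here plain random matrices), and the window/pixel-token layout are simplified placeholders, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def depthwise_window_attention(f, w_q, w_k, w_v):
    """DWA(f) = Attention(f @ W_Q, f @ W_K, f @ W_V), applied
    independently per window. f has shape (num_windows, tokens, dim)."""
    q, k, v = f @ w_q, f @ w_k, f @ w_v
    scale = 1.0 / np.sqrt(q.shape[-1])
    # attention is computed within each window only: (num_windows, tokens, tokens)
    attn = softmax(q @ k.transpose(0, 2, 1) * scale)
    return attn @ v

rng = np.random.default_rng(0)
dim = 8
f = rng.standard_normal((4, 17, dim))   # 4 windows, 16 pixel tokens + 1 window token each
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
out = depthwise_window_attention(f, w_q, w_k, w_v)
print(out.shape)  # (4, 17, 8)
```

Because the attention matrix is block-diagonal over windows, its cost grows with the window size rather than the full token count, which is the source of SepViT's savings.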
Light-BiFPN

In fire detection, speed and accuracy are equally important: timely and accurate detection of fires can greatly reduce damage. We found in our experiments that depth-wise separable convolution (DSC) and Ghost convolution differ little in accuracy, while DSC reduces the number of parameters and calculations to a greater extent. However, DSC has the drawback that the channel information of the input image is kept separated during the calculation. To solve this problem, we improve the DSC block in [24] by channel-shuffling the features of the DSC output; we call the improved module DSSConv, whose structure is shown in Figure 3(b). Depth-wise separable convolution consists of a depth-wise convolution followed by a point-wise convolution. Given an input feature map P of size H × W × C, depth-wise convolution applies one filter per input channel:

G_m = K_m * P_m,

where K is the depth-wise convolutional kernel of size H_k × W_k × C, and the m-th filter in K is applied to the m-th channel in P to produce the m-th channel of the filtered output feature map G. New features are then generated by a 1×1 point-wise convolution; the calculation process is shown in Figure 3(a).

YOLOv5 uses PANet [25] in the neck for feature extraction and fusion. Its bottom-up and top-down bidirectional fusion achieves good results, but fire-detection environments are usually complex, and more features need to be fused for better results. BiFPN is a weighted bidirectional feature pyramid network that connects the input and output nodes of the same layer across layers, achieving higher-level fusion and shortening the information-transfer path between higher and lower layers. Since the weighting adds computation, this paper removes the weighted feature fusion to make the neck network further lightweight.
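The channel-shuffle step that DSSConv adds after the DSC output can be sketched in a few lines of NumPy. The idea is the standard group-shuffle reshape trick: split the channel axis into groups, swap the group axis with the within-group axis, and flatten back, so that information from different channel groups is interleaved.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Remix channel information that depth-wise separable convolution
    keeps separated: reshape (N, C, H, W) -> (N, g, C//g, H, W),
    swap the two group axes, and flatten back to (N, C, H, W)."""
    n, c, h, w = x.shape
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)

x = np.arange(8).reshape(1, 8, 1, 1)       # channels numbered 0..7
y = channel_shuffle(x, groups=2)
print(y.ravel().tolist())  # [0, 4, 1, 5, 2, 6, 3, 7]
```

The operation is parameter-free and nearly cost-free, which is why it can be inserted after DSC without undoing the parameter savings.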

Global Attention Mechanism
The complex environment of fire detection is prone to false and missed detections. GAM [26] is used to strengthen the connection between space and channels, reduce the loss of flame and smoke information, and amplify global-dimension features. Given an input feature map F1, GAM computes

F2 = Mc(F1) ⊗ F1,  F3 = Ms(F2) ⊗ F2,

where Mc is the channel attention map, Ms is the spatial attention map, and ⊗ denotes element-wise multiplication. We add GAM to the bottleneck module; the resulting structure is shown in Figure 5.
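The two-stage gating above can be sketched as follows. Note this is an illustrative simplification: in GAM proper, `Mc` is an MLP over permuted channel features and `Ms` uses 7×7 convolutions, whereas here both sub-modules are toy stand-ins (a linear map on channel means, and a channel-mean spatial map) just to show the F2/F3 data flow.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gam_gating(f1, channel_mlp, spatial_map):
    """Two-stage GAM gating: F2 = Mc(F1) * F1, then F3 = Ms(F2) * F2.
    channel_mlp: stand-in for the channel-attention sub-module,
    mapping per-channel statistics (N, C) to per-channel logits.
    spatial_map: stand-in for the spatial sub-module, (N,C,H,W) -> (N,1,H,W)."""
    # channel attention: one gate per channel, broadcast over H and W
    mc = sigmoid(channel_mlp(f1.mean(axis=(2, 3))))        # (N, C)
    f2 = mc[:, :, None, None] * f1
    # spatial attention: one gate per location, broadcast over channels
    ms = sigmoid(spatial_map(f2))                          # (N, 1, H, W)
    return ms * f2

rng = np.random.default_rng(1)
w_c = rng.standard_normal((4, 4)) * 0.1
f1 = rng.standard_normal((2, 4, 8, 8))
f3 = gam_gating(f1,
                channel_mlp=lambda s: s @ w_c,
                spatial_map=lambda f: f.mean(axis=1, keepdims=True))
print(f3.shape)  # (2, 4, 8, 8)
```

Since both gates lie in (0, 1), the output keeps the input's shape while attenuating features the attention maps consider unimportant.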

The IoU Loss and Activation
IoU [27] loss predicts the localization of the bounding-box regression more accurately, and the variant most commonly used in the YOLO series is CIoU [28]. As research has progressed, more and more IoU variants have appeared, such as DIoU [29], GIoU [30], EIoU [31], and the latest, SIoU [32]. The CIoU loss is defined as

L_CIoU = 1 − IoU + ρ²(b, b_gt)/d² + αv,

where v is used to measure the similarity of the aspect ratios.
The CIoU used by YOLOv5 relies on the aggregation of bounding-box regression metrics and does not consider the direction of the mismatch between the ground-truth box and the predicted box. This makes it inferior to SIoU in terms of training speed and prediction accuracy.
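All of these losses build on the plain intersection-over-union ratio, which can be computed for two axis-aligned boxes as follows (a minimal sketch; the box format `(x1, y1, x2, y2)` is assumed for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # zero if no overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```

The variants (DIoU, CIoU, SIoU, ...) add penalty terms to this ratio, for center distance, aspect ratio, or, in SIoU's case, the angle of the mismatch.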
In lightweight networks, HSwish, Mish, and LeakyReLU are faster than ReLU in terms of training speed. They can be defined as

Mish(x) = x · tanh(ln(1 + e^x)),
HSwish(x) = x · ReLU6(x + 3)/6,
LeakyReLU(x) = max(αx, x), 0 < α < 1.

We found experimentally that the Mish activation function is more accurate than the others; detailed comparison experiments are given in Section 4.

Datasets
Since there is a lack of authoritative datasets for fire detection, the dataset used in this paper is derived from public datasets and web images and contains 21,136 images. It covers various scenarios, such as forest fires, indoor fires, urban fires, and traffic fires. Figure 6 shows part of the dataset.

Training Environment and Details
This paper uses the Ubuntu 18.04 operating system, an NVIDIA GeForce RTX 3060 GPU, CUDA 11.1, and Python 3.8.8. Models are trained with a batch size of 16, an initial learning rate of 0.01, 100 training epochs, and an input image size of 448 × 448.

Model Evaluation
In this paper, precision (P), recall (R), average precision (AP), mean average precision (mAP), parameters, computation, inference time, and FPS are used as evaluation metrics for model performance. AP is the area under the precision-recall (PR) curve, and mAP denotes the average of the APs over all categories. The specific formulas are:

P = TP / (TP + FP),  R = TP / (TP + FN).
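The metrics above can be sketched in NumPy. The precision and recall formulas are exactly those given; the AP function uses the common all-point-interpolation scheme for the area under the PR curve, which is one standard choice (evaluation toolkits differ in the exact interpolation used).

```python
import numpy as np

def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve, with
    all-point interpolation (recall assumed sorted ascending)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # make precision monotonically non-increasing, scanning right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

p, r = precision_recall(tp=80, fp=20, fn=10)
print(p, r)            # 0.8 0.888...
ap = average_precision(np.array([0.2, 0.5, 1.0]),
                       np.array([1.0, 0.8, 0.6]))
print(ap)              # 0.74
```

mAP is then simply the mean of the per-class AP values.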

Result Analysis and Ablation Experiments
To further verify the validity of the model, in this section we perform a series of ablation experiments. As shown in Table 2, the detection speed of Mish is not as fast as that of LeakyReLU and HSwish, but its accuracy is better, while SIoU matches CIoU in detection speed and exceeds it in accuracy; we therefore choose the combination of the Mish activation function and the SIoU loss.
To further validate the effectiveness of Light-BiFPN, we compared it against the latest approach of replacing the backbone with a lightweight network. We found that ShuffleNetv2 has the fastest detection speed but is hardly satisfactory in accuracy, while Light-BiFPN is the second fastest after ShuffleNetv2, with much higher accuracy than the other lightweight networks. The results are shown in Table 3.
Finally, we conducted ablation experiments combining all the improved methods; the results are shown in Table 4.
Compared with the original algorithm, the mAP of Light-YOLOv5 improves by 3.3%, and the number of parameters and the computation are both reduced; although the detection speed is lower than that of the original algorithm, it still meets industrial detection needs. We also compared Light-YOLOv5 with the most advanced detectors at this stage to further verify the effectiveness of the method; the comparison results are shown in Table 5. Although Light-YOLOv5 is inferior to YOLOv7-tiny and YOLOv3-tiny in detection speed, it is much better than these detectors on the other metrics, with an mAP 6.8% higher than the latest YOLOv7-tiny, further proving the effectiveness of the method in this paper. Detection examples are shown in Figure 7.

Figure 1. Architecture of the Light-YOLOv5 model.

In the SepViT Block expression, the inputs are feature maps and the learned window tokens; Concat denotes the concatenation operation and Slice denotes the slice operation. Figure 2 shows the structure of the SepViT Block.

Figure 2. Overall structure of the SepViT Block.

Figure 3. (a) The calculation process of DSC. (b) The structure of the DSSConv module.

We designed DSSbottleneck and DSSC3 based on the bottleneck and C3 modules of YOLOv5; their structures are shown in Figure 4(a) and (b).

Figure 4. (a) The structure of the DSSbottleneck module. (b) The structure of the DSSC3 module.

Figure 5. The structure of the GAMbottleneck.
The parameters A and B represent the areas of the ground-truth bounding box and the predicted bounding box, respectively; C denotes the minimum enclosing box of the two; b and b_gt represent the centroids of the predicted and ground-truth bounding boxes, respectively; ρ represents the Euclidean distance between the two centroids; d is the diagonal length of the smallest enclosing region that contains both boxes; and α is a weight function. TP (True Positives) are samples classified as positive that are truly positive; FP (False Positives) are samples classified as positive that are actually negative; FN (False Negatives) are samples classified as negative that are actually positive.

Figure 6. Example images of the dataset.

Figure 7. Comparison of detection results from Light-YOLOv5 and the original algorithm: (a) YOLOv5n; (b) Light-YOLOv5.

Discussion

In summary, this paper proposes Light-YOLOv5, a lightweight algorithm for fire detection in complex scenarios that achieves a balance of efficiency and performance. Light-YOLOv5 uses YOLOv5n as the baseline, a SepViT Block to strengthen the connection between the backbone network and global information, Light-BiFPN to strengthen feature extraction while lightening the network, a fused GAM module to reduce the loss of information, and finally the Mish activation function to improve accuracy and the SIoU loss to improve the convergence speed of training. Experimental results on a self-developed fire dataset show that Light-YOLOv5 achieves a 3.3% higher mAP than the baseline model and a 6.8% higher mAP than the latest YOLOv7-tiny, with detection speeds that meet industrial requirements.

As shown in Table 1, we compared the different versions of YOLOv5. The mAP of YOLOv5n is 1.7% lower than that of YOLOv5s and 2.8% lower than that of YOLOv5m, but its computation is 73.6% less than YOLOv5s and 91.3% less than YOLOv5m, its parameter count is 74.8% less than YOLOv5s and 91.5% less than YOLOv5m, and its detection speed is faster than both. We therefore chose YOLOv5n as the baseline. We then conducted comparison experiments on the LeakyReLU, Mish, and HSwish activation functions and on the CIoU and SIoU losses, with YOLOv5n as the baseline; the results are shown in Table 2.

Table 1. Performance comparison of different versions of YOLOv5.

Table 2. Comparison results of different activation functions and IoU losses under the same model.

Table 3. Comparative experiments of Light-BiFPN and different state-of-the-art lightweight models.

Table 4. Results of ablation experiments with different modified methods.

Table 5. Comparison of the most advanced detectors at this stage.