A Flame Detection Algorithm Based on Improved YOLOv7

Abstract: Flame recognition is of great significance in fire prevention. However, current algorithms for flame detection suffer from problems such as missing detection and false detection.


Introduction
Fire disaster is one of the most common and widespread hazards to public safety, posing a threat to human life and property. In fact, fires can be extinguished relatively easily in their early stages. Therefore, early detection plays an important role in fire prevention and can reduce economic losses and casualties. However, modern buildings are large and complicated structures, which dramatically increases the difficulty of fire detection. Accordingly, a more advanced fire detection method is called for.
The development of flame detection technology has passed through several stages: sensor-based detection, traditional image processing methods, and deep learning methods. Given the importance of fire monitoring, extensive research has been carried out on flame detection. Traditional fire alarm systems generally detect changes in physical quantities such as smoke concentration and temperature in the environment [1]. When these quantities reach a certain threshold, an alarm is triggered, but there is always a time lag. Current flame detection methods can be classified into two main categories: those based on manually defined flame features and those based on convolutional neural networks. In the early stages of a fire, most flames are small and scattered, and their color is similar to natural light and sunlight. Therefore, the detection accuracy of methods based on manually defined flame features is quite low, since small target flames are difficult to detect. Researchers have tried to improve these methods by combining the flame's YUV color model, shape features, and motion features [2] and by combining the RGB color model with the ViBe background extraction algorithm [3], yet the detection effect is still not satisfactory. Convolutional neural networks have strong learning ability, fault tolerance, and high speed; thus, they are commonly used in image recognition and classification. Currently, the convolutional neural networks (CNNs) used for object detection mainly include region-based convolutional neural networks (R-CNN) [4] and the YOLO series [5][6][7][8][9]. Compared with other CNNs, the YOLO series better extracts global information from images and can be trained end-to-end, which makes it a more suitable option for flame detection. In recent years, the YOLO series has been developed over many generations [10][11][12][13][14].
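As a concrete illustration of the manually defined features mentioned above, a classic hand-crafted criterion tests the RGB channel ordering of each pixel. The rule and threshold below are a common textbook heuristic, not the exact rules used in [2] or [3]:

```python
import numpy as np

def rgb_flame_mask(image, r_threshold=190):
    """Classic hand-crafted flame rule: a pixel is flame-like when
    R exceeds a threshold and R >= G > B (flames are red/yellow dominant).
    The threshold value is an illustrative assumption."""
    r = image[..., 0].astype(np.int32)
    g = image[..., 1].astype(np.int32)
    b = image[..., 2].astype(np.int32)
    return (r > r_threshold) & (r >= g) & (g > b)

# A bright orange pixel passes; a bluish sky pixel does not.
img = np.array([[[230, 160, 40], [120, 150, 220]]], dtype=np.uint8)
mask = rgb_flame_mask(img)
```

Rules of this kind break down precisely in the early-fire conditions described above, where small flame regions resemble sunlight or warm lighting.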
However, current methods have issues such as low detection accuracy and a high false-negative rate for small target flames, which cannot meet the requirements of fire prevention. YOLOv7 [15] is the latest version of the YOLO series, and in [16] it is shown that YOLOv7 clearly outperforms YOLOv5 and YOLOv6 in target detection. In earlier works, YOLOv7 has been applied to safety helmet detection, urban vehicle detection, road damage detection, tree species identification, and vehicle-related distance estimation [16][17][18][19][20]. In this work, we propose a new flame detection algorithm based on YOLOv7, which is, to our knowledge, the first time that YOLOv7 has been applied in this field. In our algorithm, we replace a convolution of the MP-1 module with the SimAM structure, a parameter-free attention mechanism [21], which alleviates the missing detection problem. Furthermore, we replace a convolution of the ELAN-W module with a ConvNeXt-based CNeB module [22] to improve detection accuracy and reduce false detections in complex environments. Additionally, to further improve the robustness of the algorithm, we construct a self-built data set by combining several publicly available data sets from various application scenarios, so that it contains a sufficient data volume and a variety of complex detection environments. The experimental results demonstrate that our algorithm is distinctly stronger than the original YOLOv7 and YOLOv5 in all performance metrics. This paper is structured as follows. Section 2 presents the background and related works, and Section 3 introduces the improvements of our proposed model based on YOLOv7. Section 4 describes the experiment design, including the data set and performance metrics, and Section 5 analyzes the experimental results. Finally, the paper closes with a conclusion.

Background and Related Work
The goal of this section is to introduce the background of the field of target detection and some related works on parameter-free attention modules and ConvNeXt modules, which are applied in our proposed model.

Background on Target Detection
Target detection is a hot research topic in computer vision, and it can be divided into traditional target detection [23] and deep-learning-based target detection [24][25][26][27][28]. Traditional methods mainly rely on feature extractors, which use sliding windows to extract image features and generate a large number of target candidate regions [29]. However, these methods are cumbersome and suffer from serious window redundancy, slow detection speed, and low detection accuracy. Recently, convolutional neural networks, the most popular deep learning architecture, have become widespread tools for image feature extraction and classification [30][31][32][33][34][35][36]. Deep-learning-based target detection technologies can adaptively learn high-level semantic information of images by using multi-structure network models along with their training algorithms.
In 2014, Girshick et al. successfully applied convolutional neural networks to target detection and proposed the R-CNN algorithm [4], which combined AlexNet [37] with selective search algorithms [38]. The detection accuracy of R-CNN reached 58.5% on the PASCAL VOC2007 data set, a significant advance over traditional target detection algorithms. The YOLO series is a deep-learning-based approach to real-time object detection and is often used in flame detection. In 2015, Redmon et al. proposed the YOLOv1 algorithm, which integrates classification, positioning, and detection in one network [10]. Since then, the YOLO series has been developed over many upgraded versions, and there are many flame detection algorithms based on YOLOv3, YOLOv4, and YOLOv5 [5][6][7][8][9]. YOLOv7 [15] is the latest version of the YOLO series, and in [16] it is shown that YOLOv7 clearly outperforms YOLOv5 and YOLOv6 in target detection. However, as far as we know, YOLOv7 has not yet been applied to flame detection. In this work, we propose an improved detection algorithm based on YOLOv7, which is, to our knowledge, the first application of YOLOv7 in this field.

Parameter-Free Attention Module
The attention mechanism is derived from the study of human vision: the machine selectively focuses on a specific part of the visual area and ignores other irrelevant information. In recent years, the attention mechanism has been widely used in many areas, including image processing, speech recognition, and natural language processing [39][40][41][42][43][44][45]. In 2021, Yang et al. proposed the simple, parameter-free attention module (SimAM) [21]. Compared to the one-dimensional ECA attention [46] and the two-dimensional CBAM attention [47], the SimAM module derives three-dimensional attention weights without adding any parameters, and it is simple and efficient. The structure comparison is shown in Figure 1. To estimate the importance of each neuron, SimAM exploits the observation that informative neurons suppress their surrounding neurons, a phenomenon known as spatial inhibition [48]. Informative neurons are found by measuring the linear separability between a target neuron and the others, and an energy function is defined for each neuron in Equation (1):

e_t(w_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1} \sum_{i=1}^{M-1} (y_o - \hat{x}_i)^2        (1)

where t is the target neuron, x_i are the other neurons in the same channel of the input features, \hat{t} = w_t t + b_t and \hat{x}_i = w_t x_i + b_t are their linear transforms, and M is the number of neurons in the channel. By minimizing Equation (1), the linear separability between neurons can be found [49]. For ease of computation, the labels y_t and y_o are set to the binary values 1 and -1, and a regularizer \lambda w_t^2 is added. The minimal energy then has the closed form given in Equation (2):

e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}        (2)

where \hat{\mu} and \hat{\sigma}^2 are the mean and variance of the neurons in the channel. Equation (2) shows that the smaller the energy e_t^* is, the greater the linear separability between the target neuron and its surrounding neurons, and hence the richer its information. Therefore, 1/e_t^* is used to indicate the importance of each neuron.
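To make the minimum-energy weighting concrete, the following NumPy sketch (function and variable names are ours, not the paper's) computes the per-neuron importance 1/e_t^* up to a constant offset and gates the input with a sigmoid, following the published SimAM reference design:

```python
import numpy as np

def simam(x, e_lambda=1e-4):
    """SimAM attention sketch for a single feature map of shape (C, H, W):
    per neuron, importance grows with (t - mu)^2 relative to the channel
    variance, matching 1/e_t* from Equation (2) up to a constant."""
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)
    d = (x - mu) ** 2                           # (t - mu)^2 per neuron
    v = d.sum(axis=(1, 2), keepdims=True) / n   # channel variance estimate
    e_inv = d / (4 * (v + e_lambda)) + 0.5      # importance, up to a constant
    return x * (1.0 / (1.0 + np.exp(-e_inv)))   # sigmoid gate on the input

x = np.random.randn(8, 16, 16)
y = simam(x)   # same shape as x; no learned parameters involved
```

Because the sigmoid gate lies strictly between 0 and 1, the module rescales activations without ever amplifying them, which is why it adds no parameters and negligible cost.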

ConvNeXt Model
ConvNeXt is a pure convolutional neural network proposed by Liu et al. [50]. It has a simple structure, and its accuracy and inference speed far exceed those of the Swin Transformer [51].
The structure is shown in Figure 2, where H, W, and dim represent the height, width, and number of channels of the feature map, respectively. The ConvNeXt network adjusts the stage stacking ratio of ResNet50 from (3, 4, 6, 3) to (3, 3, 9, 3) in order to capture more complex features. First, the input feature map is downsampled by a convolution layer with kernel size 4 and stride 4, which also changes the number of channels to 96, and then passes through four stages of ConvNeXt blocks. Each ConvNeXt block adopts an inverted bottleneck design, which effectively avoids information loss during downsampling, yielding a feature map of size 7 × 7 × 768 that is finally passed through a global average pooling layer. Layer normalization (LN) reduces the model's dependence on the initialization parameters, further improving accuracy, and the output is produced by a linear classifier.
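The stage layout described above can be traced with a few lines of Python; the block counts (3, 3, 9, 3), widths (96, 192, 384, 768), and the 4 × 4 stride-4 stem follow the standard ConvNeXt-T configuration and reproduce the 7 × 7 × 768 output for a 224 × 224 input:

```python
def convnext_shapes(input_hw=224):
    """Trace feature-map sizes through ConvNeXt-T: a 4x4 stride-4 stem,
    then stages with block counts (3, 3, 9, 3) and channel widths
    (96, 192, 384, 768), downsampling 2x between stages."""
    blocks = (3, 3, 9, 3)
    dims = (96, 192, 384, 768)
    hw = input_hw // 4                  # stem: 4x4 conv, stride 4
    shapes = []
    for i, (n, dim) in enumerate(zip(blocks, dims)):
        if i > 0:
            hw //= 2                    # 2x2 stride-2 downsample layer
        shapes.append((hw, hw, dim))    # size after this stage's n blocks
    return shapes

print(convnext_shapes())  # [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```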

Proposed Model
YOLOv7 is the most powerful target detection model of the YOLO series. Compared to earlier versions, YOLOv7 is more efficient and more accurate, and it reaches a higher detection speed with the same computing resources.
In YOLOv7, the image is first resized to 640 × 640 and fed to the backbone network. Then, three feature maps of different sizes are output through the head network. Finally, prediction results are obtained through reparameterization and convolution. In general, YOLOv7 optimizes the model with structural reparameterization and dynamic label assignment. However, for non-rigid objects such as flames, there are still deficiencies, including missing detection, false detection, and low detection accuracy and efficiency. To mitigate these problems, we propose an improved YOLOv7 network in which convolutions in the MP-1 and ELAN-W modules are replaced with the three-dimensional SimAM attention mechanism and a ConvNeXt-based CNeB module, respectively. The newly added modules can be seen in detail in the MP-1 and ELAN-W module structure diagrams in Figure 3.
Figure 3. Improvement of the YOLOv7 overall structure.
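For the 640 × 640 input mentioned above, the three head feature maps correspond to strides of 8, 16, and 32 (a standard YOLO-family assumption, not stated explicitly in this paper), which fixes their grid sizes:

```python
def yolov7_grid_sizes(img_size=640, strides=(8, 16, 32)):
    """Grid sizes of the three detection heads: one cell per stride-sized
    patch of the input image. Strides 8/16/32 are a YOLO-family convention."""
    return [img_size // s for s in strides]

print(yolov7_grid_sizes())  # [80, 40, 20]
```

The largest grid (80 × 80) is the one responsible for small targets such as early-stage flames, which is why degradation in its features directly causes missed detections.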

Improvement on YOLOv7 Backbone Structure
In the MP-1 module of the YOLOv7 network, the upper branch extracts image information through maximum pooling followed by a 1 × 1 convolution, taking the local maximum, while the lower branch extracts image information through two convolutions. When the information passes through the 3 × 3 convolution with stride 2, some fine-grained information is lost, which reduces the feature learning ability of the network, so small targets cannot be perceived. As shown in Figure 4, this convolution in the lower branch is replaced here by the SimAM attention mechanism, which suppresses the interference of complex backgrounds, enhances the extraction of target features, and significantly alleviates the missing detection of small targets.
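The modified block can be sketched in PyTorch as follows. Since SimAM itself does not downsample, this sketch keeps a stride-2 3 × 3 convolution after the attention gate so the two branches still match in size; the exact wiring follows the paper's Figure 4, and the layer arrangement here is our assumption:

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention gate (Yang et al., 2021)."""
    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        return x * torch.sigmoid(d / (4 * (v + self.e_lambda)) + 0.5)

class ImprovedMP1(nn.Module):
    """Sketch of the modified MP-1 block: the upper branch keeps
    maxpool + 1x1 conv; the lower branch gates features with SimAM
    before downsampling, then the two halves are concatenated."""
    def __init__(self, c):
        super().__init__()
        self.pool = nn.MaxPool2d(2, 2)
        self.cv_top = nn.Conv2d(c, c // 2, 1)
        self.cv_bot1 = nn.Conv2d(c, c // 2, 1)
        self.simam = SimAM()
        self.cv_bot2 = nn.Conv2d(c // 2, c // 2, 3, stride=2, padding=1)

    def forward(self, x):
        top = self.cv_top(self.pool(x))
        bot = self.cv_bot2(self.simam(self.cv_bot1(x)))
        return torch.cat([top, bot], dim=1)

x = torch.randn(1, 64, 32, 32)
y = ImprovedMP1(64)(x)   # halves spatial size, keeps channel count
```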

Improvement on YOLOv7 Head Structure
To enable a deeper network to learn and converge effectively, the ELAN-W structure was proposed in YOLOv7. Experiments show that by controlling the shortest and longest gradient paths, the learning ability of the network can be improved without destroying the original gradient path.
As shown in Figure 5, we propose an improved ELAN-W module by replacing a 1 × 1 convolution with a CNeB module built on ConvNeXt. The upper branch first goes through a 1 × 1 convolution to change the number of channels; the image features are then extracted through four 3 × 3 convolutions and finally passed to the CNeB module, which further improves the network's extraction of image features. The lower branch contains a 1 × 1 convolution operation that controls the change in the number of channels. The improved ELAN-W module further improves detection accuracy and alleviates the false detection problem.
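A CNeB block can be sketched as a standard ConvNeXt block (our assumption: 7 × 7 depthwise convolution, layer normalization, a 4× inverted-bottleneck MLP with GELU, and a residual connection), which preserves the feature-map shape and can therefore stand in for a 1 × 1 convolution inside ELAN-W:

```python
import torch
import torch.nn as nn

class CNeB(nn.Module):
    """ConvNeXt-style block sketch: 7x7 depthwise conv -> LayerNorm
    (channels-last) -> 1x1 expand 4x -> GELU -> 1x1 project, plus a
    residual connection. Shape-preserving, so it can replace a 1x1 conv."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)          # applied on the channel dim
        self.pw1 = nn.Linear(dim, 4 * dim)     # inverted bottleneck expand
        self.act = nn.GELU()
        self.pw2 = nn.Linear(4 * dim, dim)     # project back down

    def forward(self, x):
        y = self.dw(x).permute(0, 2, 3, 1)     # NCHW -> NHWC
        y = self.pw2(self.act(self.pw1(self.norm(y))))
        return x + y.permute(0, 3, 1, 2)       # NHWC -> NCHW, residual add

x = torch.randn(1, 96, 20, 20)
y = CNeB(96)(x)   # same shape in and out
```

The large depthwise kernel gives the block a wider receptive field than the 1 × 1 convolution it replaces, which is consistent with the claimed gain in feature extraction.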

Experiment Design
Our experiments were performed on the Ubuntu 16.04 operating system with an NVIDIA TITAN Xp GPU and display driver version 470.74. Python 3.7.13, CUDA 12.1, and the PyTorch framework were used, with PyCharm as the development environment.

Dataset
The data set plays a crucial role in deep learning training, as it makes the model robust and generalizable. To satisfy the requirements of our experiments, we built a flame data set covering various scenes. The images were mainly drawn from multiple public flame data sets, such as Kaggle and ImageNet [52,53], and obtained by searching Google. In total, we chose 8778 different effective images, of which the training set accounts for 70%, the test set for 20%, and the validation set for 10%; some examples are shown in Figure 6.
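The 70/20/10 partition of the 8778 images can be reproduced with a simple shuffled split (file names below are placeholders, and the seed is arbitrary):

```python
import random

def split_dataset(paths, seed=0):
    """Shuffle and split image paths 70/20/10 into train/test/val,
    matching the proportions used for the 8778-image data set."""
    rng = random.Random(seed)
    paths = list(paths)
    rng.shuffle(paths)
    n = len(paths)
    n_train, n_test = int(0.7 * n), int(0.2 * n)
    return (paths[:n_train],
            paths[n_train:n_train + n_test],
            paths[n_train + n_test:])

train, test, val = split_dataset([f"img_{i}.jpg" for i in range(8778)])
```

Shuffling before splitting matters here because the images come from several sources; without it, one split could be dominated by a single public data set.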

Evaluation Metrics
The detection accuracy in this experiment is measured by the mean average precision (mAP), which averages the average precision (AP) of all categories; AP is calculated from the precision-recall (PR) curve. The calculation formulas are as follows:

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad AP = \int_0^1 P(R)\, dR, \quad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i,

where P denotes precision, R denotes recall, TP (true positives) is the number of correctly predicted positive samples, FP (false positives) is the number of negative samples wrongly predicted as positive, FN (false negatives) is the number of positive samples wrongly predicted as negative, and n is the total number of target categories. The F1 score is another important performance metric that combines precision and recall, taking both false positives and false negatives into account:

F1 = \frac{2 \times P \times R}{P + R}.
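The precision, recall, and F1 formulas reduce to a few lines of Python; the counts below are illustrative, not experimental results:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts."""
    p = tp / (tp + fp)           # P = TP / (TP + FP)
    r = tp / (tp + fn)           # R = TP / (TP + FN)
    f1 = 2 * p * r / (p + r)     # harmonic mean of P and R
    return p, r, f1

# Illustrative counts: 80 correct detections, 20 false alarms, 20 misses.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```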

Results and Analysis
In our experiments, we measured the performance of our improved model as well as the original YOLOv5 and YOLOv7; the results are shown in Figure 7. To compare these models fairly, we used the same data set, training parameters, and methods for each model. In Figure 7, YOLOv7-Improved-1 denotes the network with only the SimAM attention mechanism shown in Figure 4, YOLOv7-Improved-2 denotes the network with only the ConvNeXt-based module shown in Figure 5, and YOLOv7-Improved-3 denotes the network with both. It can be seen from Figure 7 that neither underfitting nor overfitting occurred in any experiment.
To avoid randomness, the experiments for each model were repeated three times, and the average value is taken as the final result. An overview of the experimental results is given in Table 1. It can be seen that the overall performance of the original YOLOv7 is significantly better than that of YOLOv5. Furthermore, YOLOv7-Improved-1 and YOLOv7-Improved-2 already exceed the original YOLOv7 in every metric, and YOLOv7-Improved-3 is stronger still. The mAP_0.5 of YOLOv7-Improved-3 increased by 7% compared to the original YOLOv7, precision increased by 5.3%, and the F1 score increased by 4.1%. For a comparative analysis of the models, we randomly selected three images from the test set and detected the flames in them with YOLOv5, YOLOv7, and YOLOv7-Improved-3. The detection results are shown in Figure 8. For Image a, the confidence of YOLOv5 is 58%, that of YOLOv7 is 72%, and that of YOLOv7-Improved-3 reaches 82%, a significant improvement. For Image b, YOLOv5 fails to detect the target in the lower right corner, while YOLOv7 and YOLOv7-Improved-3 detect it successfully, and the confidence of YOLOv7-Improved-3 is higher than that of YOLOv7. For Image c, both YOLOv5 and YOLOv7 produce serious false detections due to the complex background and environment, while YOLOv7-Improved-3 detects the flame with high confidence.

Conclusions
In this work, we introduced a flame detection algorithm based on an improved YOLOv7 network. First, we added the SimAM parameter-free attention to the MP-1 module of YOLOv7 to alleviate its missing detection problem. Next, a ConvNeXt-based CNeB module replaces an ordinary convolution in the ELAN-W module, which reduces the false detections of YOLOv5 and YOLOv7 and also improves model performance. Additionally, to enhance the robustness and generalization of our algorithm, a sufficiently large self-built data set covering various application scenes was created from different public data sets. The experimental results show that the improved YOLOv7 model achieves higher accuracy in flame detection without adding too many parameters, and missing and false detections are distinctly reduced. Although its detection speed is lower than that of the original YOLOv7, it still basically satisfies the requirement of real-time detection. In future work, the network structure will be studied further to improve detection speed and real-time performance.

Conflicts of Interest:
The authors declare no conflict of interest.