1. Introduction
As an important part of global resources, forests provide a habitable environment for human beings and perform a variety of ecological functions, such as conserving water and soil, maintaining species diversity, and maintaining the balance of carbon and oxygen in the atmosphere [1,2]. However, due to climate change and global warming, increasingly extreme hot weather has emerged in recent years, causing forest fires to increase dramatically in frequency and scale. Because of the stochastic nature of forest fires, they are difficult to extinguish and can bring huge losses to forest resources, people's lives, and property, seriously damaging the balance of forest ecosystems [3]. The earlier a forest fire is detected, the sooner it can be extinguished, preventing more extensive damage. To reduce the danger of forest fires, daily monitoring and timely detection are therefore of great importance [4,5,6]. Common means of forest fire monitoring include ground manual patrol, fixed-point lookout monitoring, satellite remote sensing monitoring [7], and aerial monitoring [8]. The Unmanned Aerial Vehicle (UAV) is a new type of aviation platform. As its technology has matured, it has been applied in many fields, such as weather monitoring, disaster monitoring, power line inspection, and disaster rescue. Because UAVs are easy to operate, highly mobile, low cost, easy to maintain, lightweight, and small, and offer real-time inspection capabilities, they also show clear advantages in forest fire prevention, forest patrol, and related operations [9,10,11,12]. Using images taken by UAVs as data, combined with computer techniques such as image processing to identify forest fires, has become one of the most commonly used tools in forest fire monitoring. This is based on an unmanned aircraft forest fire detection (UAV-FFD) platform, through which fire images can be captured and transmitted to the ground station in real time. Using a large-scale YOLOv3 network, Jiao et al. [13] developed a fire detection algorithm that achieves a recognition rate of around 91% and can process 30 frames per second. Francesco et al. [14] suggested a noise-resistant algorithm for detecting forest fires through edge detection. This one-dimensional algorithm uses infrared images captured by UAVs as input and leverages the established physical attributes of the target of interest to amplify the discontinuity of feature edges. A long short-term memory convolutional neural network model, in combination with the gray wolf optimization algorithm, was introduced by Wang et al. [15] to forecast the spread of fires using infrared images captured by UAVs. This approach yields relatively precise predictions in a timely manner, enabling real-time decision-making for fire suppression.
Most existing forest fire detection methods employ either visual images or infrared images. Visual images have the advantages of rich color, high resolution, and clear environmental texture information. However, visual images are more susceptible to environmental factors. For example, when the UAV is at a high altitude, smoke in the air tends to obscure information about the ground flames, preventing firefighters from determining the exact location and scale of the fire. Infrared images can show the temperature of an object and are more suitable for flame identification and detection tasks. However, infrared images have low spatial resolution, low contrast, poor environmental texture information, and blurred visual effects. This makes it difficult for firefighters to judge the location of the fire, so infrared images play little role in subsequent forest fire-fighting work. Therefore, if the visual and infrared images can be fused into one image, combining the advantages of both, the fire information can be detected accurately while the location of the forest fire can be determined more clearly from the environmental information in the image.
Visual and infrared image fusion is an important application of image fusion technology, which integrates the information from visual and infrared images to obtain a comprehensive image containing both types of information. Visual images can provide information such as the shape and color of targets, but they are severely limited at night or in low-light conditions. Infrared images, on the other hand, can provide information on the thermal distribution of objects and can accurately detect targets even in dark environments. Visual and infrared images have different features at object edges, contours, and other aspects, and fusing them can reduce the noise in the image while preserving the important features of both [16]. By inputting both the infrared and visual images into a deep learning network, Li Hui [17] carried out a series of feature extraction and fusion procedures. The unique feature of this network is its adaptability to diverse input resolutions and its ability to generate fusion images with any desired output resolution. Duan et al. [18] proposed a dual-scale fusion method based on parallel salient features. This method first performs saliency detection on the infrared and visible light images to obtain saliency maps of the two images. Then, using dual-scale analysis, the images are divided into two scales. The saliency maps of the two scales are fused separately to obtain two fusion images. Finally, a feature selection method is used to combine the fusion images of the two scales to generate the final fusion image. By decomposing the visual and infrared images into low-frequency and high-frequency components, Yin et al. [19] extracted the feature information of significant objects from the low-frequency component and combined the low-frequency components of both images using a weighted average method, while the high-frequency components of both images were fused using a detail-preserving technique.
Current methods for forest fire recognition and detection mainly focus on single-spectrum images. Visible light images offer rich colors, high resolution, and clear environmental texture information, making observation more intuitive. Flame information can be identified by extracting and processing features such as color and brightness. However, visible light images are susceptible to environmental factors, such as smoke, which can obscure the ground fire information, making it difficult to judge the location and size of the fire accurately. Moreover, visible light images may not produce satisfactory results under low-light conditions. Infrared images are grayscale images that reflect object temperatures, with higher grayscale values indicating higher temperatures, and vice versa. This characteristic makes infrared images more suitable for forest fire recognition and detection tasks, as flame information can be accurately determined by processing the grayscale value of each pixel in the image. However, infrared images have low spatial resolution, low contrast, poor environmental texture information, and blurry visual effects, making it difficult to judge the fire's location from the environmental information in the image. Moreover, the feedback provided by infrared images to firefighters is not always clear. By fusing the visible light and infrared images using image fusion methods, the advantages of both can be combined, ensuring the accurate detection of forest fire information and providing clear information about the fire's location through the environmental information in the image. This can facilitate the subsequent firefighting work.
To address the limited information-expression capacity of single images and the inadequacy of forest fire monitoring under a single spectrum, this study proposed a target monitoring method based on fused images for the early detection of forest fires, with the goal of early fire warning. By fusing visible and infrared images and exploiting the complete information and rich features of the fused images, the method minimized the false alarms and missed alarms associated with identifying forest fires. The study began by creating a dataset of visible and infrared images captured by a UAV that contained fire information and by pre-processing the dataset. A deep learning image fusion network, Fire Fusion-Net (FF-Net), was then proposed based on the VIF-Net architecture. FF-Net was enhanced by adding an attention mechanism motivated by the distinct local brightness of forest fire images, leading to superior image fusion. Target detection on the fused images was performed using the YOLOv5 network. Finally, the experimental results were compared and analyzed. The main contributions of this study are as follows:
Using an unmanned aerial vehicle, this study built a multispectral forest fire image dataset to support the integration of image fusion and target detection techniques. Compared to publicly available image fusion datasets, this dataset offered significant advantages in terms of image quantity, resolution, and content richness. As such, it could be more effectively employed for deep learning-based multispectral image fusion.
In this study, an image fusion network named Fire Fusion-Net (FF-Net) was proposed based on the dense block architecture. An attention mechanism was incorporated to enhance the fusion effect in regions with prominent features, such as the flames in the images. Additionally, the impact of brightness on high-resolution images was taken into account and the loss function was improved. Fusion experiments were conducted with different algorithms on both traditional public image fusion datasets and the image fusion dataset constructed in this study. The results indicated that the proposed method outperformed the other methods, with lower distortion, less noise, and better fusion evaluation metrics. This image fusion approach is not limited to forest fire image fusion and could be applied to other image fusion tasks as well.
The present study proposed a method of forest fire identification that uses fused images for target detection, which exhibited a higher accuracy rate and reduced false alarm and missed alarm rates compared with single-spectrum recognition. This approach could significantly enhance the reliability of forest fire identification.
3. Methods
In this paper, the FF-Net network was proposed to fuse the visual and infrared images. First, the network architecture of FF-Net was described in detail. Then, the attention mechanism module was introduced into the FF-Net network. Finally, the loss function of the network was improved.
3.1. FF-Net Network Architecture
In order to simplify the network structure, the fusion strategy for the RGB visual images was the same as that for the grayscale infrared images in this paper. Feature extraction, feature fusion, and image reconstruction were the three primary parts of the FF-Net network architecture, as seen in Figure 5. I_A and I_B, respectively, represent the visual and infrared images supplied to the dual channels. The feature extraction part contained DenseBlock [21] modules. The first layer, C11/C12, contained 3 × 3 filters to extract the rough features, and the dense block contained three convolutional layers, also with 3 × 3 filters, where each layer's output was cascaded as the input of the next layer.
For each convolutional layer in the feature extraction part, the number of feature map channels was 16. The architecture of the encoder had two advantages. First, the filter size and stride of the convolutional operation were 3 × 3 and 1, respectively. With this strategy, the input image could be of any size. Second, the dense block architecture could preserve the deep features as much as possible in the encoding network, ensuring that all the salient features were used in the fusion strategy.
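To make this architecture concrete, the following is a minimal TensorFlow/Keras sketch of one feature-extraction branch as described above. It is an illustrative reading, not the authors' released code: the layer names, ReLU activations, and the use of concatenation for the cascading are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block_encoder(inputs):
    """Sketch of one feature-extraction branch: a 3 x 3 'rough feature' layer
    (C11/C12) followed by three densely connected 3 x 3 convolutions, each with
    16 filters, whose outputs are cascaded into the next layer."""
    x = layers.Conv2D(16, 3, strides=1, padding="same", activation="relu")(inputs)
    features = [x]
    for _ in range(3):
        inp = layers.Concatenate()(features) if len(features) > 1 else features[0]
        features.append(
            layers.Conv2D(16, 3, strides=1, padding="same", activation="relu")(inp))
    # Cascading all outputs yields 4 x 16 = 64 deep-feature channels per branch.
    return layers.Concatenate()(features)

# Dual input channels for the visual (I_A, RGB) and infrared (I_B, grayscale) images;
# stride-1, 'same'-padded 3 x 3 convolutions accept inputs of any spatial size.
i_a = layers.Input(shape=(None, None, 3))
i_b = layers.Input(shape=(None, None, 1))
phi_a = dense_block_encoder(i_a)
phi_b = dense_block_encoder(i_b)
```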
The feature fusion section included an attention module and an additive fusion strategy module. The features obtained from the dense blocks were weighted using the attention module, and then the weighted features were fused using the additive fusion strategy. The attention mechanism module is described in detail in Section 3.2. The additive fusion strategy was the same as the one used in DeepFuse [22], and its operation process is shown in Figure 6.
In our network, m ∈ {1, 2, …, M}, with M = 64, represented the number of feature maps, and k indicated the number of input images from which the feature maps were obtained. φ_i^m (i = 1, …, k) denoted the feature maps obtained by the encoder from the i-th input image, and f^m denoted the fused feature maps. The addition strategy was formulated by Equation (1):

$f^{m}(x, y) = \sum_{i=1}^{k} \phi_{i}^{m}(x, y)$,  (1)

where (x, y) denoted the corresponding position in the feature maps and the fused feature maps. Then, f^m was used as the input to the decoder, and the final fused image was reconstructed by the image reconstruction part.
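Continuing the encoder sketch above, the additive strategy itself reduces to an element-wise sum of the branch feature maps; phi_a and phi_b below are the branch outputs from the earlier sketch and stand in for the attention-refined features.

```python
# Equation (1) as a layer: f^m(x, y) = phi_A^m(x, y) + phi_B^m(x, y),
# applied position-wise to the two branches' (attention-refined) feature maps.
fused_features = layers.Add()([phi_a, phi_b])
```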
Finally, the fused features produced by the fusion layer were reconstructed into the output image by four further convolutional layers, C2, C3, C4, and C5. The more detailed network architecture is shown in Table 4.
3.2. Attention Mechanism
In the initial network structure, there was no attention module between the feature extraction and feature fusion sections. This was because the initial network was trained on 64 × 64 patches randomly cropped from the public TNO dataset, which was feasible since the original image fusion dataset had a low resolution. However, the UAV-based forest fire multispectral image dataset constructed in this study has a higher resolution, and some of its information has obvious local features, such as the flame regions in the images. Therefore, this study improved the feature fusion section of the network.
As shown in Sanghyun et al.'s study [23], an attention mechanism not only makes the network focus on the region of interest, but also improves the representation of that region. The goal was to improve the representation by using the attention mechanism: focusing on the important features and suppressing the unnecessary ones. In the feature fusion part of the FF-Net network, the features extracted by the deep feature block were passed to the attention mechanism, which refined the global features before fusion while enhancing the local features of greater interest. The features enhanced by the attention mechanism were then connected directly to the feature fusion layer.
The convolutional block attention module (CBAM) [24], an approach that improves the expressiveness of the network, was employed in this paper. CBAM uses two sequential sub-modules (channel and spatial) to refine features with attention, achieving significant performance improvements while keeping the overhead small. The CBAM was given an intermediate feature map F ∈ R^(C × H × W) as the input, and its operation was divided into two parts. First, the input was globally max-pooled and mean-pooled per channel (over the spatial dimensions), and the two resulting one-dimensional vectors were fed to a shared fully connected layer and summed to generate the one-dimensional channel attention M_C ∈ R^(C × 1 × 1); the channel attention was then multiplied element-wise with the input to obtain the channel-refined feature map F′. Second, F′ was globally max-pooled and mean-pooled along the channel dimension, and the two resulting two-dimensional maps were concatenated and convolved to generate the two-dimensional spatial attention M_S ∈ R^(1 × H × W), which was then multiplied element-wise with F′. The specific process is shown in Figure 7, and the CBAM attention generation process can be described as follows:

$F' = M_C(F) \otimes F, \quad F'' = M_S(F') \otimes F'$,

where ⊗ denotes element-wise multiplication.
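The following is a minimal TensorFlow sketch of this two-step refinement, written against the description above; the reduction ratio of the shared MLP and the 7 × 7 spatial kernel are assumptions taken from the original CBAM paper rather than values reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam(feature_map, reduction=8, spatial_kernel=7):
    """Channel attention followed by spatial attention (CBAM), per Section 3.2."""
    channels = feature_map.shape[-1]

    # Channel attention: global max- and mean-pooling per channel, a shared
    # two-layer MLP, element-wise summation, and a sigmoid -> M_C (C x 1 x 1).
    shared_mlp = tf.keras.Sequential([
        layers.Dense(channels // reduction, activation="relu"),
        layers.Dense(channels),
    ])
    avg_vec = layers.GlobalAveragePooling2D()(feature_map)
    max_vec = layers.GlobalMaxPooling2D()(feature_map)
    m_c = tf.sigmoid(shared_mlp(avg_vec) + shared_mlp(max_vec))
    f_prime = feature_map * tf.reshape(m_c, (-1, 1, 1, channels))    # F' = M_C(F) * F

    # Spatial attention: max- and mean-pooling along the channel axis,
    # concatenation, a single convolution, and a sigmoid -> M_S (1 x H x W).
    avg_map = tf.reduce_mean(f_prime, axis=-1, keepdims=True)
    max_map = tf.reduce_max(f_prime, axis=-1, keepdims=True)
    m_s = layers.Conv2D(1, spatial_kernel, padding="same", activation="sigmoid")(
        tf.concat([avg_map, max_map], axis=-1))
    return f_prime * m_s                                              # F'' = M_S(F') * F'

# Example: refine a batch of 64-channel deep features before the additive fusion.
refined = cbam(tf.random.normal([2, 128, 160, 64]))
```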
3.3. Loss Function
In this section, we set the M-SSIM and TV as loss functions with the aim of implementing unsupervised learning and determining the appropriate parameters to fully utilize the network.
SSIM is the structural similarity index between two different images, as seen in Equation (3). It combines three factors (luminance, structure, and contrast) to comprehensively assess picture quality. In the original network, due to the limitation of the dataset, the luminance at the lower spatial resolution could not measure the consistency of the global luminance, so the luminance component was neglected. Let X be the reference image and Y be the test image, as described in Equation (3). In our study, however, the multispectral forest fire image dataset was of higher resolution, and there were regions in the image where the local brightness of the flame was more obvious, so we rewrote Equation (3) as Equation (4), where μ and σ denote the mean and standard deviation, respectively, and σ_XY is the cross-correlation between X and Y. C_1 and C_2 are stability coefficients to deal with cases where the mean and variance are close to zero. The standard deviation of the Gaussian window was set to 1.5 in the calculation.
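For reference, the standard single-window SSIM with the luminance term retained, which is our reading of what Equation (4) restores, can be written as follows; the exact arrangement of the constants in the paper's equation is an assumption.

```latex
\mathrm{SSIM}(X, Y \mid W) =
  \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}
       {(\mu_X^{2} + \mu_Y^{2} + C_1)(\sigma_X^{2} + \sigma_Y^{2} + C_2)}
```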
Then, SSIM(I_A, I_F | W) and SSIM(I_B, I_F | W) were calculated according to Equation (4), where I_A, I_B, and I_F denoted the visual, infrared, and fused images, respectively, and W represented the sliding window with a size of m × n, which moved pixel by pixel from the top-left to the bottom-right. This study set C_1 and C_2 as 9 × 10^−4, and the size of the window as 11 × 11. Generally, the local grayscale value increases with the richness of the thermal radiation information, so the temperature of a thermal target can be measured by the intensity of its pixels. Therefore, we leveraged the mean intensity E(I | W) = (1/(m × n)) Σ_{i∈W} P_i to calculate the average intensity of the pixels in the local window and to measure the score of SSIM, where P_i was the value of pixel i in the window.
A function was created to adaptively learn the deep features when E(I_B | W) is larger than or equal to E(I_A | W), indicating that the local window of I_B includes more thermal radiation. The formulas are provided in Equations (5) and (6), where N represents the total number of sliding windows in a single image.
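Since the bodies of Equations (5) and (6) are not reproduced here, the following TensorFlow sketch shows one way the adaptive selection described above could be implemented: per-window SSIM maps are computed for both source-fused pairs, and each window contributes the score of whichever source has the higher mean intensity. The uniform pooling window (instead of the 11 × 11 Gaussian window with σ = 1.5), the single-channel inputs in [0, 1] with shape (batch, H, W, 1), and the final 1 − mean form of the loss are assumptions.

```python
import tensorflow as tf

def _local_stats(x, y, win=11):
    """Per-window means, variances, and covariance via average pooling
    (a uniform window is used here as a simplification of the Gaussian window)."""
    pool = lambda t: tf.nn.avg_pool2d(t, ksize=win, strides=1, padding="VALID")
    mu_x, mu_y = pool(x), pool(y)
    var_x = pool(x * x) - mu_x ** 2
    var_y = pool(y * y) - mu_y ** 2
    cov_xy = pool(x * y) - mu_x * mu_y
    return mu_x, mu_y, var_x, var_y, cov_xy

def ssim_map(x, y, c1=9e-4, c2=9e-4, win=11):
    """Per-window SSIM scores with the luminance term retained (cf. Equation (4))."""
    mu_x, mu_y, var_x, var_y, cov_xy = _local_stats(x, y, win)
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def m_ssim_loss(i_a, i_b, i_f, win=11):
    """Adaptive M-SSIM: each window scores the fused image I_F against whichever
    source window has the higher mean intensity, i.e. I_B where E(I_B|W) >= E(I_A|W)."""
    pool = lambda t: tf.nn.avg_pool2d(t, ksize=win, strides=1, padding="VALID")
    use_ir = pool(i_b) >= pool(i_a)                        # E(I_B|W) >= E(I_A|W)
    score = tf.where(use_ir, ssim_map(i_b, i_f, win=win), ssim_map(i_a, i_f, win=win))
    return 1.0 - tf.reduce_mean(score)                     # averaged over the N windows
```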
This paper introduced the total variation function into the design of a mixed loss function in order to achieve gradient transformation and remove some noise. In this loss, R is the difference between the visual and fused images, ‖·‖_2 is the l_2 distance, and L_TV denotes the total variation loss function. Since the two loss terms were not of the same order of magnitude, a relatively low weight of L_SSIM in the loss function led to low contrast and low quality in the fused image, whereas a relatively high weight of L_SSIM caused the details in the visual images to be lost to a certain degree. To achieve an approximate tradeoff between the infrared and visual features, we set a hyper-parameter λ, which was assigned different values to weigh the two terms against each other in the combined loss function.
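A sketch of the total-variation term and of one way to combine it with the M-SSIM loss from the previous sketch is given below; taking R as the fused-minus-visual difference, using squared neighbour differences for the l_2 distance, and placing λ on the TV term are all assumptions, since the paper's final equation is not reproduced here.

```python
import tensorflow as tf  # m_ssim_loss is defined in the previous sketch

def tv_loss(i_a, i_f):
    """Total variation of R = I_F - I_A: penalizes differences between neighbouring
    pixels of R, transferring gradients from the visual image while suppressing noise."""
    r = i_f - i_a
    dh = r[:, :, 1:, :] - r[:, :, :-1, :]      # horizontal neighbour differences
    dv = r[:, 1:, :, :] - r[:, :-1, :, :]      # vertical neighbour differences
    return tf.reduce_sum(tf.square(dh)) + tf.reduce_sum(tf.square(dv))

def total_loss(i_a, i_b, i_f, lam=1.0):
    """Combined loss balancing structural similarity and total variation with lambda."""
    return m_ssim_loss(i_a, i_b, i_f) + lam * tv_loss(i_a, i_f)
```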
3.4. Experimental Parameter Settings
The experiments were implemented in TensorFlow and trained on a PC equipped with an AMD Ryzen 7 4800H CPU (2.90 GHz, with Radeon Graphics), 8 GB of RAM, and an NVIDIA GeForce RTX 2060 GPU. Some of the comparative experiments were run in MATLAB R2020a.
To fully evaluate the algorithm, we conducted experiments on both the TNO dataset and the self-built dataset and compared our method with several advanced image fusion methods, including three traditional methods, the Dual-Tree Complex Wavelet Transform (DTCWT), Adaptive Sparse Representation (ASR), and Cross Bilateral Filter (CBF), and three deep learning methods, FusionGAN [25], U2Fusion [26], and DenseFuse [27]. All six implementations are publicly available, and we used the parameters reported in the original papers.
Subjective visual evaluation systems are susceptible to human factors such as visual acuity, subjective preferences, and personal emotions. In addition, the differences between image fusion results based on subjective evaluation are not significant in most cases. Therefore, it is essential to analyze the fusion performance based on quantitative evaluation. Eight image fusion metrics were selected for quantitative evaluation: entropy (EN), mutual information (MI), QAB/F, standard deviation (SD), spatial frequency (SF), average gradient (AG), mean squared error (MSE), and peak signal-to-noise ratio (PSNR) [28].
Entropy (EN) measures the amount of information contained in the fused image according to information theory. It is defined as follows:

$EN = -\sum_{l=0}^{L-1} p_l \log_2 p_l$,

where L represents the number of gray levels and p_l represents the normalized histogram of the corresponding gray level in the fused image. The larger the EN, the more information is contained in the fused image and the better the fusion algorithm performs.
The mutual information (MI) metric is a quality measure that calculates the amount of information transferred from the source images to the fused image. It is defined as follows:

$MI = MI_{A,F} + MI_{B,F}$, with $MI_{X,F} = \sum_{x,f} p_{X,F}(x,f) \log_2 \frac{p_{X,F}(x,f)}{p_X(x)\, p_F(f)}$,

where p_X and p_F denote the marginal histograms of the source image X and the fused image F, respectively, and p_{X,F} denotes the joint histogram of the source image X and the fused image F. A high MI measure shows that significant information is transferred from the source images to the fused image, indicating good fusion performance.
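A NumPy sketch of this metric under the definition above is shown below; treating the images as 8-bit arrays and summing the contributions of both source images are assumptions consistent with the usual convention in the fusion metrics literature [28].

```python
import numpy as np

def mutual_information(source, fused, bins=256):
    """MI between a source image and the fused image from their joint histogram."""
    joint, _, _ = np.histogram2d(source.ravel(), fused.ravel(),
                                 bins=bins, range=[[0, 255], [0, 255]])
    p_xy = joint / joint.sum()                 # normalized joint histogram
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal histogram of the source image
    p_f = p_xy.sum(axis=0, keepdims=True)      # marginal histogram of the fused image
    nz = p_xy > 0                              # avoid log(0)
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_f)[nz])))

def fusion_mi(i_a, i_b, i_f):
    """Total information transferred from both source images to the fused image."""
    return mutual_information(i_a, i_f) + mutual_information(i_b, i_f)
```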
QAB/F calculates the amount of edge information transferred from the source images to the fused image, assuming that the edge information in the source images is retained in the fused image. It is defined as follows:

$Q^{AB/F} = \frac{\sum_{i,j}\left(Q^{AF}(i,j)\, w^{A}(i,j) + Q^{BF}(i,j)\, w^{B}(i,j)\right)}{\sum_{i,j}\left(w^{A}(i,j) + w^{B}(i,j)\right)}$,

where Q^{AF}(i, j) and Q^{BF}(i, j) denote the edge strength and orientation preservation values at location (i, j), and w^{A}(i, j) and w^{B}(i, j) denote the weights that express the importance of each source image to the fused image. A high QAB/F value indicates that a significant amount of edge information is conveyed to the fused image.
The standard deviation (SD) metric is based on a statistical concept and reflects the distribution and contrast of the fused image. It is defined as follows:

$SD = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(F(i,j) - \mu\right)^{2}}$,

where μ denotes the mean value of the fused image and M × N is its size. A fused image with high contrast often results in a large SD, which means that the fused image achieves a good visual effect.
Spatial frequency (SF) is a gradient-based image quality index built from the horizontal and vertical gradients, which are also called the spatial row frequency (RF) and column frequency (CF), respectively. It is defined as follows:

$RF = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=2}^{N}\left(F(i,j) - F(i,j-1)\right)^{2}}$, $CF = \sqrt{\frac{1}{MN}\sum_{i=2}^{M}\sum_{j=1}^{N}\left(F(i,j) - F(i-1,j)\right)^{2}}$, $SF = \sqrt{RF^{2} + CF^{2}}$.

A fused image with a large SF is sensitive to human perception according to the human visual system and has rich edges and textures.
The average gradient (AG) metric quantifies the gradient information of the fused image and represents its detail and texture. It is defined as follows:

$AG = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\sqrt{\frac{\left(F(i+1,j) - F(i,j)\right)^{2} + \left(F(i,j+1) - F(i,j)\right)^{2}}{2}}$.

The larger the AG metric, the more gradient information the fused image contains and the better the performance of the fusion algorithm.
The mean squared error (MSE) computes the error between the fused image and the source images and, hence, measures their dissimilarity. It is defined as follows:

$MSE = \frac{MSE_{A,F} + MSE_{B,F}}{2}$, with $MSE_{X,F} = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(X(i,j) - F(i,j)\right)^{2}$.

A small MSE metric indicates good fusion performance, which means that the fused image approximates the source images and minimal error occurs in the fusion process.
The peak signal-to-noise ratio (PSNR) metric is the ratio of the peak power to the noise power in the fused image and, thus, reflects the distortion introduced during the fusion process. It is defined as follows:

$PSNR = 10 \log_{10}\frac{r^{2}}{MSE}$,

where r denotes the peak value of the fused image. The larger the PSNR, the closer the fused image is to the source images and the less distortion the fusion method produces.
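For concreteness, NumPy sketches of several of these metrics, under the standard definitions assumed above and for 8-bit images (r = 255), are given below.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the normalized gray-level histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]                                  # drop empty bins to avoid log(0)
    return float(-np.sum(p * np.log2(p)))

def std_dev(img):
    """SD: spread of the fused image's gray levels around its mean."""
    return float(np.std(img))

def spatial_frequency(img):
    """SF: combined row (horizontal) and column (vertical) gradient energy."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

def psnr(source, fused, peak=255.0):
    """PSNR: peak power over the mean squared error between source and fused images."""
    mse = np.mean((source.astype(np.float64) - fused.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(peak ** 2 / mse))
```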
5. Conclusions
In order to address the problem of early warning and monitoring of forest fires, this study proposed a forest fire detection method based on visual and infrared image fusion and analyzed both the performance of the image fusion network and the detection performance on the fused images. First, we constructed a simulated forest fire dataset containing daytime and nighttime images. Next, an improved FF-Net network combining an attention mechanism with the image fusion network was proposed and experimentally validated against other image fusion methods. Then, the fused images, visual images, and infrared images were compared for target detection with the YOLOv5 network to analyze the detection performance of the fused images. The results showed that, compared with some commonly used image fusion methods, the improved FF-Net network had a stronger image fusion capability: the fused images were clearer and more comprehensive in terms of information, and image fusion indexes such as EN, MI, and SF improved noticeably. In terms of target detection, compared with the visual and infrared images, the fused images yielded a higher accuracy rate, and the false alarm and missed alarm rates were reduced, which could effectively improve the reliability of forest fire identification and is of great significance for early forest fire warning. In addition, compared with the visual images, the fused images could more accurately indicate the specific degree of flame burning; compared with the infrared images, the fused images had more obvious environmental texture information, making it easier to determine the fire location clearly, which is more meaningful for the accurate judgment of the fire situation and the subsequent suppression work. However, some problems still need to be addressed in subsequent research. This study did not conduct simulated fire experiments in a real forest environment and therefore did not obtain more realistic forest fire images to support the dataset. Validating the effectiveness of the method in a more realistic environment is the main direction of follow-up work.