Few-Shot Fine-Grained Forest Fire Smoke Recognition Based on Metric Learning

To date, most existing forest fire smoke detection methods rely on coarse-grained identification, which only distinguishes smoke from non-smoke. Non-fire smoke and fire smoke are thus treated the same, resulting in false alarms within the smoke classes. Fine-grained identification of smoke, which can distinguish non-fire smoke from fire smoke, is of great significance for accurate forest fire monitoring; however, it normally requires a large database. In this paper, for the first time, we combine fine-grained smoke recognition with the few-shot technique, using metric learning to identify fire smoke from the limited available database. Experimental comparison and analysis show that the new method performs well in terms of both the structure of the feature extraction network and the training method, achieving an accuracy of 93.75% for fire smoke identification.


Introduction
Forest fires, classified as highly destructive disasters, often cause significant economic losses and ecological damage. In recent decades, as global warming intensifies and the climate changes, forest fires have become very frequent in certain regions such as Canada and the western U.S. Once a forest fire spreads, it burns a large number of trees, and in many circumstances, it is difficult for the forest to return to its original state. Forest fires may also induce soil erosion and environmental pollution, lead to deaths or habitat loss for wildlife, threaten the biodiversity of the planet's ecology [1], and bring immeasurable loss of property and life to people and society [2].
Thus, to prevent the spread of forest fires and mitigate their impacts, it is extremely important to detect fires at an early stage. Currently, forest fire detection methods can be divided into three categories: smoke detection [3], flame detection [4], or a combination of both [5][6][7]. However, due to the large amount of vegetation in the forest environment, flames are easily blocked in the early stages of a fire, limiting the application of flame detection to early forest fire detection.
On the other hand, the smoke produced spreads over time and is not blocked by vegetation. Therefore, it is a good candidate for detection and warning in the early stages of a forest fire. However, in addition to the smoke produced by fires, non-hazardous smoke is produced by other activities such as home cooking and factory manufacturing, as shown in Figure 1. If non-hazardous smoke in the monitored regions is detected and triggers a forest fire alarm, it is a false alarm known as an In-Smoke False Positive problem.
Currently, to detect smoke in a forest, there are two available technologies: smoke sensors [8] and visual cameras [9]. Smoke sensors detect the physicochemical properties of smoke, which is challenging in open forest monitoring scenarios, and they cannot distinguish between smoke produced by a forest fire and smoke from non-hazardous causes. Visual cameras record images [10] or videos [11][12][13][14][15], and smoke detection is performed by feature extraction to identify, locate, or segment smoke in the recorded images or videos.
The traditional visual detection method directly uses artificially designed features. For example, the commonly used feature descriptions in the mainstream traditional image-based smoke detection methods [16,17] include color, texture, edge and other features. These cannot be used to distinguish different types of smoke, so the traditional image-based smoke detection method cannot avoid the false positive issues when different classes of smoke are present in the detection scenes. Compared with image-based smoke detection, the traditional video-based smoke detection can contain dynamic features. However, there is no significant difference between the dynamic features of various classes of smokes. Thus, the traditional video-based smoke detection can only be used to distinguish smoke and non-smoke, and cannot avoid the false positives in the smoke class.
Recently, based on the recorded images/videos from visual detection tools, researchers tend to use deep learning methods to detect smoke through learning features from the training data followed by detecting smoke using these extracted features. For visual smoke detection based on deep learning methods, depending on the nature of the detection tasks, the accuracy varies with the composition of the data set, the design of the network structure, and the learned features. Commonly, the different types of detection tasks are as follows: dichotomy tasks with smoke and non-smoke [18,19]; three-category tasks with smoke, non-smoke, and flame [20]; four-category tasks with smoke, smoke with fog, non-smoke, and non-smoke with fog [21]; five-category tasks with smoke, clouds, fog, trees, and cliffs [22]; target detection tasks for locating smoke [23][24][25]; and image segmentation tasks that demarcate the smoke area [26,27]. It can be seen that the common tasks do not divide smokes into different classes and the network cannot extract the characteristics that distinguish different classes of smokes. Thus, the current deep learning-based smoke detection methods cannot avoid false positives in the smoke classes.
However, non-hazardous smoke is very common in forest monitoring scenarios. Frequent false alarms cause unnecessary panic among the relevant personnel, waste human resources, reduce trust in the fire detection system, and may lead the responsible firefighters to ignore fire alarms and miss actual forest fires. To reduce the false alarm rate for forest fire smoke, this paper proposes, for the first time, a few-shot fine-grained smoke identification method using metric learning to screen out non-hazardous smoke, as shown in Figure 2. To address the constraint that only limited smoke data can be obtained for the different smoke classes, this paper first establishes a few-shot dataset with a small number of images from multiple smoke classes, including cooking, factory, and forest fire smoke. Then, suitable models are designed to identify the three smoke classes using a feature extraction network and a loss function with its corresponding training method. Specifically, for the feature extraction network, this paper compares a variety of classical networks and analyzes the relationship between the performance of a network and the ratio of its parameter quantity to its number of layers.
For the loss function, this paper compares the contrast loss functions commonly used in metric learning [28], including contrastive loss, triplet loss, circle loss, and instance loss, with the cross-entropy loss function commonly used in probability learning. Two support strategies, using training samples and using new samples, are compared when contrast loss is used.
The rest of this paper is arranged as follows: the methods of this article are introduced in detail in Section 2; Section 3 presents the experimental results and an analysis of the methods proposed in this article, and finally, the findings are summarized in Section 4.

Task Definition
In this paper, considering that forest fire smoke detection scenes are mostly disturbed by cooking smoke generated by residences and by factory smoke, the fine-grained classification includes three categories: cooking smoke, factory smoke, and fire smoke. The feasibility of this task rests on human experience: different types of smoke can be distinguished by judging the smoke source, which is similar to fine-grained recognition tasks in other fields. Therefore, the metric learning applicable in other fields is also applicable to the smoke domain. However, unlike fine-grained recognition tasks such as face recognition, bird species identification, and automobile brand identification, the types of smoke are more limited and deterministic. Thus, the network is usually not tested on new classes, and the few-shot situation in fine-grained smoke identification arises not from new classes but from the low-probability nature of fire smoke itself. Therefore, in this task, the small number of fire smoke samples is also used in the training process of the network, which differs from the conventional metric-learning-based few-shot process and is an adaptive change made after fully considering the characteristics of the task.

Dataset
The dataset used in this paper is self-built; part of it is derived from internet pictures, and the other part from real forest fire smoke detection scenarios. Considering the needs of the identification task, the dataset includes three types of smoke images: cooking smoke, factory smoke, and fire smoke. Cooking smoke and factory smoke are categorized as non-hazardous smoke and should not cause fire alarms. Because real smoke data are sparse, the dataset was constructed with only a small number of samples: the training set has 9 images per category, for a total of 27, and the test set has 16 images per category, for a total of 48. Figure 3 shows example data from the dataset. It is desirable to obtain a fine-grained smoke recognition model from this small number of samples.
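The 9/16 images-per-class split described above can be sketched as a simple directory-based loader. This is illustrative only: the `root/{train,test}/{class}/*.jpg` layout and the class names are assumptions, not the paper's actual file structure.

```python
from pathlib import Path
import tempfile

CLASSES = ["cooking", "factory", "fire"]  # class names assumed for illustration

def list_split(root, split):
    """Collect (path, class_index) pairs for one split of the few-shot dataset.
    Assumes a hypothetical layout of root/{train,test}/{class}/*.jpg."""
    samples = []
    for idx, name in enumerate(CLASSES):
        for p in sorted(Path(root, split, name).glob("*.jpg")):
            samples.append((p, idx))
    return samples

# build a dummy directory mirroring the paper's split: 9 train / 16 test per class
root = tempfile.mkdtemp()
for split, n in [("train", 9), ("test", 16)]:
    for name in CLASSES:
        d = Path(root, split, name)
        d.mkdir(parents=True)
        for k in range(n):
            (d / f"{k}.jpg").touch()

train = list_split(root, "train")  # 27 samples in total
test = list_split(root, "test")    # 48 samples in total
```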



Network Structure
Due to the small number of samples, it is very important to select an appropriate feature extraction network for training. An optimal network structure requires a network depth suitable for the proposed task, which ranges from 6 to 30 layers, and an appropriate network size within that depth range. The ResNet18 [29] network structure, shown in Figure 4, is selected in this paper, as it falls within this depth range and is the best-performing moderately sized network structure. The network consists of 18 layers with a shortcut connection across each residual module to prevent gradients from vanishing or exploding.
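The shortcut connection mentioned above can be illustrated with a minimal NumPy sketch of a residual block, y = relu(F(x) + x). This is a toy model of the mechanism, not the paper's ResNet18 implementation: even when the learned transform F contributes nothing, the identity path still carries the signal (and, symmetrically, the gradient) through the block.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Basic ResNet-style block: two linear transforms plus a shortcut
    (identity) connection, y = relu(F(x) + x)."""
    out = relu(x @ w1)   # first transform + activation
    out = out @ w2       # second transform
    return relu(out + x) # the shortcut adds the input back before activation

x = np.array([1.0, -1.0])
w_zero = np.zeros((2, 2))
# With zero weights F(x) = 0, so the block reduces to relu(x): the shortcut
# lets information pass through even when the transform does nothing.
y = residual_block(x, w_zero, w_zero)
```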

The Loss Function and Its Corresponding Training Process
Deep metric learning has achieved promising results in face recognition tasks. The feature extraction network can determine whether two faces are the same person through learning local key features, such as a person's facial features, that distinguish a person's face. This is similar to the smoke fine-grained recognition task, but the key feature that distinguishes smoke is the source of smoke, which is learned by training with smoke samples.
In metric learning, classification is based on the Euclidean distance between eigenvectors. If the distance between the eigenvectors of two samples is less than a threshold, they are considered to be of the same category; otherwise, they are considered to belong to different categories. Thus, the goal of metric learning is to bring the eigenvectors of samples of the same class closer together and to push the eigenvectors of samples of different classes farther apart. To achieve this, it is necessary to determine the loss function used in training. In deep metric learning, contrast loss has a concise form and converges easily during training, so this paper uses contrast loss to train the network. For contrast loss, L_contrastive, a training sample consists of two training images. When the categories of the two images are the same, the true label of the training sample is 0; otherwise, it is 1. The two eigenvectors of the images are obtained through the feature extraction network, and the loss is given by Equation (1):

L_contrastive = (1/2)(1 − Y) D_W^2 + (1/2) Y [max(0, m − D_W)]^2    (1)

where Y is the true label of the sample, D_W is the Euclidean distance between the two eigenvectors in the sample, and m is a hyper-parameter, shown in Figure 5, which controls the minimum distance between samples of different classes. The loss is greater than zero whenever the distance between samples of the same class is greater than zero, and loss is incurred when the distance between samples of different classes is less than the margin m.
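Equation (1) can be sketched directly in NumPy. The example values below follow the label convention above (Y = 0 for a same-class pair, Y = 1 for a different-class pair); this is a sketch of the loss formula, not the paper's training code.

```python
import numpy as np

def contrastive_loss(f_a, f_b, y, m=1.0):
    """Contrastive loss per Equation (1): y = 0 for a same-class pair,
    y = 1 for a different-class pair; m is the margin hyper-parameter."""
    d = np.linalg.norm(np.asarray(f_a) - np.asarray(f_b))  # Euclidean distance D_W
    return 0.5 * (1 - y) * d**2 + 0.5 * y * max(0.0, m - d)**2

# same-class pair: loss grows with the squared distance (d = 1.0 here)
same = contrastive_loss([0.0, 0.0], [0.6, 0.8], y=0)
# different-class pair closer than the margin (d = 0.5 < m): loss is incurred
diff = contrastive_loss([0.0, 0.0], [0.3, 0.4], y=1, m=1.0)
# different-class pair beyond the margin (d = 5 > m): no loss
far = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1, m=1.0)
```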
Figure 6 shows the training process of the network. A pair of images I_a (i = i_a, j = j_a) and I_b (i = i_b, j = j_b) is randomly selected from the training set T_ij to constitute a training sample S, where (i_a, j_a) ≠ (i_b, j_b). When i_a = i_b, the two images belong to the same category and the sample S is labeled 0; otherwise, it is labeled 1. Next, I_a and I_b are fed into the feature extraction network D, and the corresponding feature vectors F_a = D(I_a) and F_b = D(I_b) are obtained. Then, the Euclidean distance D_W between F_a and F_b is calculated and the corresponding L_contrastive is obtained. The parameters of the network are obtained by minimizing L_contrastive.
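The pair-sampling step of this training procedure can be sketched as follows. The dictionary-of-lists dataset shape is an assumption for illustration; the labeling rule (0 when the class indices match, 1 otherwise) is the one stated above.

```python
import random

def sample_pair(dataset):
    """dataset: dict mapping class index i -> list of images.
    Returns two distinct images and the pair label Y
    (0 if the classes match, 1 otherwise)."""
    indices = [(i, j) for i, imgs in dataset.items() for j in range(len(imgs))]
    (ia, ja), (ib, jb) = random.sample(indices, 2)  # two distinct (i, j) picks
    y = 0 if ia == ib else 1
    return dataset[ia][ja], dataset[ib][jb], y

# toy dataset with two classes of two images each
data = {0: ["a0", "a1"], 1: ["b0", "b1"]}
img_a, img_b, label = sample_pair(data)
```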

Model Framework for Fine-Grained Smoke Identification Tasks
After determining the network structure and training, the model framework of the task is shown in Figure 7. The metric model proposed in this paper uses the ResNet network as feature extractor D, and uses the feature center Ci provided by the training set to guide the classification during the testing phase. In this paper, the lighter ResNet18 in the ResNet family was adopted to minimize overfitting of the depth model in the case of small sample sizes.

Specifically, the feature center Ci is generated by the training set Tij, where i = 1, 2, 3 is the category number and j = 1, …, Ni is the sample number. The eigenvectors of the training set are first obtained by D, i.e., Fij = D(Tij), and Ci is the average of Fij over the j dimension. I is the input test image, and its feature vector F is generated by D: F = D(I). The distance di from F to each Ci is then computed, and finally the category of I is determined as C(I) = argmin_i(di).
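The feature-center computation and nearest-center classification described here can be sketched in NumPy; the two-dimensional toy features below stand in for the network's eigenvectors.

```python
import numpy as np

def feature_centers(features, labels):
    """Ci: mean feature vector of each class in the support (training) set."""
    classes = sorted(set(labels))
    return {i: np.mean([f for f, l in zip(features, labels) if l == i], axis=0)
            for i in classes}

def classify(f, centers):
    """C(I) = argmin_i(di): assign the class whose center is nearest."""
    return min(centers, key=lambda i: np.linalg.norm(f - centers[i]))

feats = [np.array([0.0, 0.0]), np.array([0.0, 2.0]), np.array([10.0, 10.0])]
labels = [1, 1, 2]
C = feature_centers(feats, labels)      # C[1] = [0, 1], C[2] = [10, 10]
pred = classify(np.array([1.0, 1.0]), C)  # nearest to C[1]
```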

Case Studies and Discussion
To validate the proposed method, case studies on the experimental database described in Section 2.2 were performed, analyzed, and compared across the three main elements of the method. Figure 8a shows the training curve obtained by training ResNet18 with the training dataset and the contrast loss. The training set is taken as the support set, and both the support set and the test set are passed through the feature extraction network to obtain feature vectors. Their distribution in the feature space is shown in Figure 8b, where the star data points represent the feature centers Ci and the circular data points represent the feature vectors of the test data. The distribution of the feature vectors of each category shows a clear tendency toward intra-class aggregation and inter-class separation.
Figure 9 shows the test results in the form of a confusion matrix. The recognition of factory smoke is better, which may be because the locations where factory smoke appears tend to have simpler backgrounds, making the characteristics of the smoke source site more obvious. The network still shows a certain error rate in distinguishing cooking smoke from fire smoke. This may be caused by areas of overlap between human dwellings and vegetation in a forest environment, which results in similar backgrounds for the two types of smoke. In addition, cooking smoke has characteristics similar to those of fire smoke and is not highly distinctive in terms of color, texture, etc.
Feature extractor D learns how to extract features F of smoke that distinguish between different categories of smoke. For example, its extracted features may be related to the source of the smoke: cooking smoke is often above a roof, factory smoke rises from high chimneys, and forest fire smoke usually rises in grassy groves.
The trained feature extractor results in the feature vectors being distributed in the feature space according to the category; that is, the sample feature vectors of different classes are far away, and the sample feature vectors of the same class are close. Feature center C i is an important basis for classification, and after each test sample obtains a feature vector, it is necessary to calculate the distance from feature center C i to determine its category.
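The confusion-matrix evaluation reported in Figure 9 can be sketched as follows. The toy labels below are illustrative only, not the paper's test data; class indices 0, 1, 2 stand for cooking, factory, and fire smoke.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows: true class; columns: predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# toy example: one fire-smoke image (class 2) mistaken for cooking smoke (class 0)
cm = confusion_matrix([0, 1, 2, 2], [0, 1, 2, 0])
acc = np.trace(cm) / cm.sum()  # overall accuracy = correct / total
```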


As can be seen in Figure 10, the network is largely based on the source of the smoke for type judgment; for example, for cooking smoke, the network is based on the house, for factory smoke, the network is based on the chimney, and for fire smoke, the network is based on grass.

Network Optimization and Discussions
The primary element of the model architecture is the backbone network, and in this paper, two issues need to be considered when determining the backbone network to be used. First, the number of samples is very small, requiring the network to have few parameters; otherwise, the network would overfit. Second, the fine-grained smoke recognition task requires extracting the semantic features of the smoke source from the image, which places certain requirements on the depth of the network. When the number of network layers is small, the semantic features may not be extracted. Under these constraints, a network is expected to perform better when it is deeper and its number of parameters is small. However, when the network is too sparse, it also faces the issue of difficult semantic feature extraction. Therefore, to optimize the network, we first limited the number of layers of the network to between 6 and 30 according to the nature of the task, and then selected a variety of networks in this range for testing, as shown in Table 1, including GoogleNet, MobileNet, ResNet18, Shallow CNN, AlexNet, and VGGNet. Among these networks, to ensure the coherence of the data, a shallow CNN network was established based on the ResNet18 and AlexNet networks. Figure 11 shows the specific structure of the shallow CNN network, where the input data has a size of 100 × 100 and the number of channels is three. The convolutional layer has a kernel size of 3 × 3 and an output channel count of four. The fully connected layer has an input length of 80,000 and an output length of 500. These networks have different ratios of parameter quantity to number of layers, with NL representing the number of layers of the network and NP representing the parameter quantity of the network.

Figure 11. Structure of a CNN network used to ensure data coherence.
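As a rough check on NP for the shallow CNN, the parameter counts of the layers described above can be computed directly. This is a minimal sketch using only the sizes stated in the text; the full architecture in Figure 11 may contain additional layers not counted here:

```python
def conv2d_params(in_ch, out_ch, kernel):
    """Weights (kernel*kernel*in_ch per filter) plus one bias per output channel."""
    return kernel * kernel * in_ch * out_ch + out_ch

def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features

conv = conv2d_params(3, 4, 3)        # 3-channel input, four 3x3 filters
fc = linear_params(80_000, 500)      # fully connected layer sizes from the text
print(conv, fc)  # 112 40000500
```

The fully connected layer dominates the parameter count by several orders of magnitude, which illustrates why the NP/NL ratio varies so widely across the candidate networks.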
Figure 12 shows the relationship between the NP/NL ratio and the accuracy of the tested networks. It can be seen that when NP/NL is too large or too small, the network accuracy declines. ResNet18 performed best in terms of detection accuracy, validating the selection of ResNet18 as the backbone network in this paper.

In addition to the backbone model, the loss function is another important element of the model architecture. To verify the effectiveness of metric learning in fine-grained identification tasks, this paper compares loss functions commonly used in metric learning, including contrastive loss, triplet loss, circle loss, and instance loss, with the cross-entropy loss function commonly used in probability learning. It was found through experiments that the design of triplet loss, circle loss, and instance loss increases the convergence difficulty of the network, especially with few training samples, resulting in low test accuracy: 64.2%, 54.8%, and 64.2%, respectively. Therefore, hereafter we mainly discuss contrastive loss and cross-entropy loss. When using the cross-entropy loss function, the output dimension of the last fully connected layer of the network is the number of categories, which is three in this task. The output vector is normalized using the softmax activation function, on the basis of which the maximum-probability principle is used for classification.

In metric learning, support samples are necessary to determine the category of test samples. In the usual metric learning task, due to the limitations of the task itself, the type of a test sample is new to the network, so the support samples are also new samples.
However, in the fine-grained smoke recognition task, there is no need to treat fire smoke as a new type; therefore, to make full use of the learning ability of the network, we directly use the training samples as support samples. For example, the embedding vectors of all fire smoke training samples are averaged to obtain the support vector of the fire smoke type, and the support vectors of factory smoke and cooking smoke are obtained in the same way.
For a test sample, after obtaining its embedding vector, its type is determined by calculating the distance to each type's support vector. Table 2 and Figure 13 show the results of the comparison. The sub-plots in Figure 13 correspond to the items in Table 2 and show the distributions of the category-centered feature vectors and the test data feature vectors in the feature space. Comparing Figure 13a,b, it can be seen that the larger spacing is more conducive to the differentiation of categories. This shows that using the training set learned by the network as the support set helps the network obtain more accurate category-centered feature vectors. From Figure 13c, it can be seen that, due to the use of a shallow CNN as the feature extraction network, the feature vectors of the test data cannot be well clustered around the centers of each class, indicating that the shallow CNN's limited learning restricted the selection of features. In Figure 13d, although the test data show good convergence, since the loss function has no interval constraint and no support set, the samples near the boundaries of each category are prone to misclassification.
Figure 13. The comparison results presented in the form of feature space distribution maps (a-d).
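The two losses compared above can be sketched in miniature. This is an illustrative implementation only, assuming a Euclidean embedding distance and a unit margin for the contrastive loss; the exact formulation and margin used in the paper are not specified:

```python
import math

def contrastive_loss(f1, f2, same_class, margin=1.0):
    """Pull same-class embeddings together; push different-class
    embeddings at least `margin` apart (Hadsell-style pairwise loss)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

def cross_entropy(logits, true_idx):
    """Softmax cross-entropy for one sample: the three logits are the
    outputs of the last fully connected layer, one per smoke category."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    probs = [e / s for e in exps]
    return -math.log(probs[true_idx])

# A same-class pair one unit apart is penalized; a different-class pair
# beyond the margin incurs no loss.
print(contrastive_loss([0.0, 0.0], [1.0, 0.0], same_class=True))   # 1.0
print(contrastive_loss([0.0, 0.0], [2.0, 0.0], same_class=False))  # 0.0
```

Note that the cross-entropy objective only maximizes the probability of the correct class and imposes no margin between classes, which corresponds to the missing "interval constraint" noted for Figure 13d.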

Conclusions and Future Work
In this paper, fine-grained smoke recognition under the few-shot condition was achieved using the metric learning method. Specifically, ResNet18 was selected as the feature extractor and the contrastive loss was applied for training. The feature centers of the training set were introduced during the testing phase to guide classification. The developed method obtained a detection accuracy of 93.75% on the test set of the self-built dataset. By applying the method of this paper to coarse-grained smoke detection methods, the non-fire smoke in the detection results can be further screened out, thereby reducing the risk of false positives within the smoke class in forest fire monitoring scenarios.
In the future, to improve the overall accuracy of forest fire smoke detection tasks, it is necessary to further improve the inter-class accuracy of smoke detection. Additionally, the attention of the network still needs to be optimized for various types of smoke, especially fire smoke.