2.1. Data Collection
Existing public datasets of forest fires tend to lack diversity [16]. In this paper, Blender [27] was used to generate smoke with various appearances and resolutions by setting different lighting, wind, airflow, gravity, etc., and by rendering the virtual smoke over arbitrary background images to produce synthetic images, as shown in Figure 1.
For the synthetic images to be suitable for subsequent pixel-level domain adaptation, the style of each image's foreground and background should be uniform to ensure visual harmony. Therefore, virtual background images matching the virtual smoke images were collected from the video game Red Dead Redemption 2. To ensure environmental diversity, we gathered 1500 images covering scenes from different time periods, forest types, environments, and seasons. Samples of the background image data are presented in Figure 2.
We built a synthetic smoke dataset containing 2000 images of size 256 × 256 by rendering each virtual background image with a virtual smoke image. Figure 3 shows samples of the synthetic image dataset.
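As an illustration of this compositing step, the following is a minimal Python sketch using Pillow. The directory layout, the one-smoke-per-background pairing, and the assumption that the rendered smoke carries an alpha channel are ours for the example; the paper does not specify these details.

```python
from pathlib import Path
from PIL import Image

# Hypothetical directory layout; the paper does not specify file organization.
SMOKE_DIR = Path("renders/smoke")     # RGBA renders of Blender smoke (alpha = opacity)
BACKGROUND_DIR = Path("backgrounds")  # RGB wildland screenshots
OUT_DIR = Path("synthetic")
OUT_DIR.mkdir(exist_ok=True)

def composite(smoke_path: Path, background_path: Path, size: int = 256) -> Image.Image:
    """Paste a rendered smoke plume (with alpha) over a background image."""
    background = Image.open(background_path).convert("RGBA").resize((size, size))
    smoke = Image.open(smoke_path).convert("RGBA").resize((size, size))
    # Alpha compositing keeps the background visible through thin smoke.
    return Image.alpha_composite(background, smoke).convert("RGB")

for smoke_path, background_path in zip(sorted(SMOKE_DIR.glob("*.png")),
                                       sorted(BACKGROUND_DIR.glob("*.png"))):
    composite(smoke_path, background_path).save(OUT_DIR / f"{smoke_path.stem}_syn.png")
```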
For real smoke data, wildfire smoke images were mainly collected from the internet, the public dataset of the State Key Laboratory of Fire Science, USTC [28], and the public dataset of Yuan Feiniu [29]. After data cleaning, we obtained a total of 2500 wildfire smoke images. In addition, a large number of real wildland background images were collected from the internet and screened, resulting in 4000 wildland background images. Of the images above, 1040 were selected to build the test set (520 smoke images and 520 non-smoke images); Figure 4 shows some samples of the test set.
2.2. Pixel-Level Based Domain Adaptation
The pixel-level domain adaptation was performed based on CycleGAN, as shown in Figure 5. $S$ is the source domain and $T$ is the target domain; the data $x_s$ in the source domain are synthetic smoke images, and the data $x_t$ in the target domain are real smoke images. The generator $G$ learns a mapping function from the synthetic smoke images (the source domain) to the real smoke images (the target domain), and the generator $F$ learns the inverse mapping. $G$ and $F$ are used to generate the translated images $G(x_s)$ and $F(x_t)$, respectively. Our goal is to convert the source domain images into the style of the target domain images to obtain the data in $G(S)$, i.e., photorealistic smoke images. To counter mode collapse [6] and the loss of structural information in the source domain images [30], a cycle consistency loss was introduced into the training process as a regularizer. Specifically, for $G$ and $F$, one of our goals is $x_s \rightarrow G(x_s) \rightarrow F(G(x_s)) \approx x_s$; the other goal is the inverse process for $x_t$, i.e., $x_t \rightarrow F(x_t) \rightarrow G(F(x_t)) \approx x_t$. Equation (1) shows the whole cycle consistency loss:

$$\mathcal{L}_{cyc}(G,F)=\mathbb{E}_{x_s\sim p_{data}(x_s)}\big[\lVert F(G(x_s))-x_s\rVert_1\big]+\mathbb{E}_{x_t\sim p_{data}(x_t)}\big[\lVert G(F(x_t))-x_t\rVert_1\big]\tag{1}$$
In addition, two discriminators $D_S$ and $D_T$ are trained to distinguish real images from generated ones. The adversarial loss [6] is formulated as Equations (2) and (3):

$$\mathcal{L}_{GAN}(G,D_T)=\mathbb{E}_{x_t\sim p_{data}(x_t)}\big[\log D_T(x_t)\big]+\mathbb{E}_{x_s\sim p_{data}(x_s)}\big[\log\big(1-D_T(G(x_s))\big)\big]\tag{2}$$

$$\mathcal{L}_{GAN}(F,D_S)=\mathbb{E}_{x_s\sim p_{data}(x_s)}\big[\log D_S(x_s)\big]+\mathbb{E}_{x_t\sim p_{data}(x_t)}\big[\log\big(1-D_S(F(x_t))\big)\big]\tag{3}$$
The overall loss of CycleGAN is defined as Equation (4):

$$\mathcal{L}(G,F,D_S,D_T)=\mathcal{L}_{GAN}(G,D_T)+\mathcal{L}_{GAN}(F,D_S)+\lambda\,\mathcal{L}_{cyc}(G,F)\tag{4}$$

where $\lambda$ is the weight of the cycle consistency loss. To improve training efficiency, the CycleGAN model used in this paper was compressed with the general-purpose compression framework of [25], reducing the computational cost of the generator and the model size.
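For illustration, the following is a minimal PyTorch sketch of the losses in Equations (1)–(4) for one training batch. The module interfaces, the binary-cross-entropy (non-saturating) formulation of the adversarial terms, and the default weight lam = 10.0 are assumptions for the example rather than settings reported in this paper.

```python
import torch
import torch.nn.functional as nnf

def cyclegan_losses(G, F, D_S, D_T, x_s, x_t, lam=10.0):
    """Losses of Equations (1)-(4) for one batch.
    G: S->T generator, F: T->S generator; D_S, D_T: discriminators (logit outputs);
    x_s: synthetic smoke batch, x_t: real smoke batch; lam: cycle-loss weight.
    """
    fake_t, fake_s = G(x_s), F(x_t)

    # Equation (1): cycle consistency (L1 reconstruction in both directions).
    loss_cyc = nnf.l1_loss(F(fake_t), x_s) + nnf.l1_loss(G(fake_s), x_t)

    def bce(pred, target_is_real):
        target = torch.ones_like(pred) if target_is_real else torch.zeros_like(pred)
        return nnf.binary_cross_entropy_with_logits(pred, target)

    # Equations (2)-(3), generator side: make the discriminators label fakes as real.
    loss_gan = bce(D_T(fake_t), True) + bce(D_S(fake_s), True)

    # Equation (4): overall generator objective.
    loss_G = loss_gan + lam * loss_cyc

    # Discriminator side: real -> 1, generated -> 0.
    loss_D_T = 0.5 * (bce(D_T(x_t), True) + bce(D_T(fake_t.detach()), False))
    loss_D_S = 0.5 * (bce(D_S(x_s), True) + bce(D_S(fake_s.detach()), False))
    return loss_G, loss_D_S, loss_D_T
```

Detaching the generated images in the discriminator terms is the usual way to keep discriminator updates from back-propagating into the generators.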
2.3. Feature-Level Based Domain Adaptation
To learn high-level semantic information about the smoke images and to further reduce the difference between the feature distributions of the photorealistic smoke images and the real smoke images, a feature-level domain adaptation method combining ADDA [24] with DeepCORAL [26] was proposed in this paper. In this section, the smoke images in the source domain are the photorealistic smoke images obtained by pixel-level domain adaptation, while all the non-smoke images in the source domain are real non-smoke images. This differs from the traditional domain adaptation setting because the two categories in the source domain come from different data sources. However, since our goal is the binary classification of smoke and non-smoke images, we only need to focus on the generalized features of smoke rather than on the features of the non-smoke images. Experimentally, this setup did not affect the performance of the model. The images in the target domain consist of real smoke images and real non-smoke images.
The source domain images $X_s$ are used with their labels $Y_s$ throughout the feature-level domain adaptation process, while the target domain images $X_t$ are used without labels. The aim of feature-level domain adaptation is to train a target representation mapping $M_t$ and a classifier $C_t$ that can accurately classify the target images into two classes, smoke and non-smoke, even in the absence of target domain annotations. Since it is not possible to perform supervised training directly on the target domain, a source representation mapping $M_s$ and a source classifier $C_s$ were trained using the source domain images in the pre-training phase, as shown in Figure 6, where the source classifier is trained using the standard cross-entropy loss:

$$\mathcal{L}_{cls}(X_s,Y_s)=-\,\mathbb{E}_{(x_s,y_s)\sim(X_s,Y_s)}\sum_{k=1}^{K}\mathbb{1}_{[k=y_s]}\log C(M_s(x_s))\tag{5}$$
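A minimal PyTorch sketch of this pre-training phase is given below; the backbone architecture, feature width, and optimizer settings are illustrative assumptions, as the paper does not specify them here.

```python
import torch
import torch.nn as nn

# Illustrative modules; the paper does not specify the backbone architecture.
M_s = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())   # source mapping M_s
C = nn.Linear(32, 2)                                         # smoke / non-smoke classifier

optimizer = torch.optim.Adam(list(M_s.parameters()) + list(C.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()  # the cross-entropy loss of Equation (5)

def pretrain_step(x_s, y_s):
    """One supervised step on labeled source images (Equation (5))."""
    optimizer.zero_grad()
    loss = criterion(C(M_s(x_s)), y_s)
    loss.backward()
    optimizer.step()
    return loss.item()
```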
In the adversarial adaptation phase, the main objective is to regularize the training of the source and target mappings so as to minimize the distance between the feature distributions extracted by them, $M_s(X_s)$ and $M_t(X_t)$. Under such conditions, the source classifier can be applied directly to the target representation, i.e., $C = C_s = C_t$.
Based on the idea of adversarial training, the training losses of the domain classifier $D$ and the target mapping $M_t$ are optimized in an alternating minimization process, as shown in Figure 7. First, the training goal of the domain classifier is to accurately distinguish whether the data come from the source domain or the target domain. The domain classifier $D$ is, therefore, optimized using a standard supervised loss, defined as follows:

$$\mathcal{L}_{adv_D}(X_s,X_t,M_s,M_t)=-\,\mathbb{E}_{x_s\sim X_s}\big[\log D(M_s(x_s))\big]-\mathbb{E}_{x_t\sim X_t}\big[\log\big(1-D(M_t(x_t))\big)\big]\tag{6}$$
When training the target mapping $M_t$, the standard cross-entropy loss is also used, but with the domain labels inverted so that $M_t$ is rewarded for fooling $D$, as defined in Equation (7):

$$\mathcal{L}_{adv_M}(X_s,X_t,D)=-\,\mathbb{E}_{x_t\sim X_t}\big[\log D(M_t(x_t))\big]\tag{7}$$
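A compact PyTorch sketch of this alternating optimization, Equations (6) and (7), is shown below; the function and optimizer names are our illustrative choices, with $M_s$ frozen after pre-training and $D$ producing a single logit per sample.

```python
import torch
import torch.nn.functional as nnf

def adda_step(M_s, M_t, D, opt_D, opt_Mt, x_s, x_t):
    """One alternating update: Equation (6) for D, then Equation (7) for M_t."""
    # --- Equation (6): domain classifier learns source = 1 vs. target = 0. ---
    opt_D.zero_grad()
    pred_s = D(M_s(x_s).detach())   # source features are fixed after pre-training
    pred_t = D(M_t(x_t).detach())
    loss_D = (nnf.binary_cross_entropy_with_logits(pred_s, torch.ones_like(pred_s)) +
              nnf.binary_cross_entropy_with_logits(pred_t, torch.zeros_like(pred_t)))
    loss_D.backward()
    opt_D.step()

    # --- Equation (7): target mapping tries to make D label target features as source. ---
    opt_Mt.zero_grad()
    pred_t = D(M_t(x_t))
    loss_Mt = nnf.binary_cross_entropy_with_logits(pred_t, torch.ones_like(pred_t))
    loss_Mt.backward()
    opt_Mt.step()
    return loss_D.item(), loss_Mt.item()
```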
In such a setting, the optimization of the target mapping $M_t$ and the domain classifier $D$ proceeds as adversarial training. In addition, to further align the correlations of the features in the source and target domains, the calculation of the CORAL loss [26] was added to ADDA [24]. This aligns the second-order statistics of the source and target domain feature distributions, which helps to confuse the two distributions. The CORAL loss is defined as follows:

$$\mathcal{L}_{CORAL}=\frac{1}{4d^2}\,\lVert C_S-C_T\rVert_F^2\tag{8}$$
where $C_S$ and $C_T$ are the covariance matrices of the $d$-dimensional features in the source domain and the target domain, respectively, and $\lVert\cdot\rVert_F^2$ denotes the squared Frobenius norm. The CORAL loss is also computed in the adversarial adaptation phase, as shown in Figure 7, where the learning rate of the classifier $C$ is set to 0 during backpropagation so that the CORAL loss trains only the target mapping $M_t$.
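Equation (8) can be computed directly from two feature batches; a sketch following the DeepCORAL formulation (batch covariance with the usual n − 1 normalization) is shown below.

```python
import torch

def coral_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """CORAL loss of Equation (8) for feature batches of shape (n, d)."""
    d = f_s.size(1)

    def covariance(f):
        n = f.size(0)
        f = f - f.mean(dim=0, keepdim=True)      # center the features
        return (f.t() @ f) / (n - 1)             # (d, d) covariance matrix

    c_s, c_t = covariance(f_s), covariance(f_t)  # C_S and C_T
    return torch.sum((c_s - c_t) ** 2) / (4 * d * d)
```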
Therefore, the overall loss function of the target mapping $M_t$ is shown below:

$$\mathcal{L}_{M}=\mathcal{L}_{adv_M}+\beta\,\mathcal{L}_{CORAL}\tag{9}$$

where $\beta$ denotes the weight of the CORAL loss during training, which varies from 0 to 1 with the training epochs.
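The paper states only that this weight grows from 0 to 1 over the training epochs; a linear ramp is one plausible schedule (our assumption, not a schedule given in the paper):

```python
def coral_weight(epoch: int, total_epochs: int) -> float:
    """Linear ramp of the CORAL weight beta from 0 to 1 over training (assumed schedule)."""
    return min(1.0, epoch / max(1, total_epochs - 1))
```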
In summary, the overall process of feature-level domain adaptation is as follows: first, a Source CNN and a classifier are trained using the source domain images; after this training, the parameters of these two parts are no longer updated. In the adversarial training phase, the Target CNN is initialized with the weights of the Source CNN, and the source and target domain images are used as inputs to the Source CNN and the Target CNN, respectively. The features obtained after the mappings are used to calculate $\mathcal{L}_{adv_D}$ and $\mathcal{L}_{adv_M}$. At the same time, the source images and the target images are jointly used as input to the Target CNN, the mapped features are fed to the pre-trained classifier, and the final output is used to calculate $\mathcal{L}_{CORAL}$.
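Read together with the sketches above, one epoch of this adversarial adaptation phase might be organized as follows. The data loaders, the epoch count, and the choice to feed the classifier outputs into the CORAL term follow our reading of the description above and are assumptions, not code from the paper.

```python
# Usage sketch combining the snippets above: adda_step (Equations (6)-(7)),
# coral_loss (Equation (8)), and coral_weight (the assumed beta schedule).
# source_loader / target_loader are assumed to yield (images, labels) batches.
num_epochs = 20  # illustrative value

for epoch in range(num_epochs):
    beta = coral_weight(epoch, num_epochs)
    for (x_s, _), (x_t, _) in zip(source_loader, target_loader):
        # Alternating adversarial updates of D and M_t.
        adda_step(M_s, M_t, D, opt_D, opt_Mt, x_s, x_t)

        # CORAL term, Equation (9): C is frozen (lr 0), so gradients flow
        # through it but only the target mapping M_t is updated.
        opt_Mt.zero_grad()
        out_s = C(M_t(x_s))   # source and target images jointly through the Target CNN
        out_t = C(M_t(x_t))
        loss = beta * coral_loss(out_s, out_t)
        loss.backward()
        opt_Mt.step()
```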