1. Introduction
A thermal infrared (TIR) camera captures the infrared radiation emitted by objects in a scene, which gives TIR imaging a strong advantage in night or low-light working environments where visible-spectrum images lack adequate color and texture; consequently, TIR cameras have been widely used in industrial surveillance and on drones in recent years. However, TIR images carry no color information, which makes it difficult for humans to distinguish objects in the scene and limits their use in some important contexts, such as emergency rescue. To improve both human interpretation and automated processing, many researchers are studying how to translate TIR images into color visible (CV) images for computer vision tasks such as object tracking, crowd counting, panoptic segmentation, and image fusion.
Furthermore, a TIR image is formed by converting temperature into intensity, so it has only a single channel that mainly encodes temperature, whereas a CV image encodes color in three channels, usually called RGB. The information carried by TIR and CV therefore lies in different domains, and TIR to CV translation is a one-to-three value mapping that involves translating both texture and color. Moreover, due to the limitations of imaging mechanisms and camera manufacturing processes, TIR images have limited resolution and less prominent texture; these large differences between the two modalities pose challenges to the design of image translation models [1]. Most deep learning (DL)-based image translation algorithms use an end-to-end neural network that directly translates single-channel TIR images into three-channel CV images. As a result, the translation works well when the CV content is simple and the TIR texture is relatively rich (such as facial TIR images) [2]; however, when the CV scene is complex (such as a street view or a natural landscape), the translated images tend to exhibit large areas of color confusion and texture anomalies.
The translation from TIR to CV involves a one-to-many mapping between values in different domains, the core of which is translating temperature-domain information into the color domain. To reduce the ambiguity of this process, we restrict the cross-domain translation to learning a one-to-one mapping, i.e., we first translate TIR images into grayscale visible (GV) images. Existing translation algorithms, such as QS-Attn [3], decompose one temperature value into three RGB values, causing the translated image to show blurred edges and color confusion, as shown in Figure 1e; this reflects the ambiguity of the temperature-to-RGB mapping in the absence of sufficient constraints. The ambiguity is reduced when temperature is translated only into grayscale values, as shown in Figure 1c.
Although a further translation from GV to CV is still required, it introduces no blurred edges or color confusion because it operates within a single domain. Existing studies [4,5] show that the only drawback of GV to CV translation is the difficulty of restoring the true scene colors; nevertheless, the resulting image fully meets the visual needs of the human eye, as shown in Figure 1f.
Since this study focuses on TIR to GV translation, an improved CycleGAN called GMA-CycleGAN (Gray Mask Attention-CycleGAN), i.e., a grayscale cycle-consistent GAN with mask attention, is proposed. To improve the texture of salient objects (pedestrians and vehicles) and thereby support object detection and semantic segmentation in practical applications, a mask attention module is proposed; it is built from TIR temperature masks and CV semantic masks that separate salient objects from the background and adds no network parameters. In GMA-CycleGAN, the mask attention module is inserted into the feature-encoding and feature-decoding convolutional layers of the generator. In addition, a perceptual loss term is added to the original CycleGAN loss function to make the translated image closer to the real image in the feature space.
The remainder of this paper is organized as follows: Section 2 reviews the related work, Section 3 presents the proposed algorithm, Section 4 describes the experiments and datasets, Section 5 analyzes and discusses the results, and Section 6 concludes this work.
2. Related Work
Due to the different imaging principles of thermal and visible-light sensors, the temperature domain of TIR differs greatly from the color domain of CV, and it is difficult for traditional methods to directly identify the mapping from TIR to CV. Early studies [6,7] fused near-infrared images with TIR to supplement texture information, obtained a fused image approximating grayscale visible light, and then colored it according to the color distribution of a reference color image, so the visual realism of the resulting images is poor.
In recent years, DL has been widely used in various computer vision tasks, and the powerful fitting ability of neural networks has brought notable progress to TIR to CV image translation. Scholars have proposed many DL-based network models, which can be divided into two categories according to whether adversarial training is used: convolutional neural network (CNN) models and generative adversarial network (GAN) models. Among CNN-based models, the TIR2lab model proposed by Berg et al. [8] was the first end-to-end TIR translation model. They hypothesized that a CNN with an autoencoder structure could learn the luminance-to-chromaticity mapping of paired TIR and CV images and, for the first time, used a neural network to directly translate TIR to CV. To give small objects in the translated images more realistic and richer texture, Wang et al. [9] proposed an attention-based hierarchical thermal infrared image colorization network (AHTIC-Net), which uses multi-scale network structures to extract features of objects of different sizes and thereby strengthens the model's attention to small objects during training. In general, CNN-based translation models have an intuitive structure and a simple training procedure; however, because their loss functions constrain the translation insufficiently, the translated CV images suffer from local detail distortion, low contrast, and a blurred visual effect.
Thanks to the adversarial training of a generative model against a discriminative model, GANs behave better than CNNs in image generation and can be trained on unpaired TIR-CV datasets. For instance, Isola et al. [10] proposed pix2pix to learn the mapping from a source image to a target image using paired datasets. To overcome this dataset limitation, Zhu et al. [11] proposed CycleGAN, which uses two symmetric GANs to form a closed-loop network: one GAN translates images from the source domain to the target domain, the other translates the target-domain image back to the source domain, and a cycle-consistency loss encourages the twice-translated image to be identical to the original. Pix2pix and CycleGAN have greatly improved translation between CV images, so GAN-based image translation has attracted the attention of many scholars. Subsequently, several researchers improved GANs by introducing methods such as contrastive learning and attention mechanisms and proposed a variety of unpaired image translation models [3,12,13,14]. These GAN-based models achieve good results in tasks such as semantic-map-to-CV translation, super-resolution, image inpainting, and style transfer; however, their visual effect is not ideal when they are directly applied to TIR-CV translation. Furthermore, Kuang et al. [15] improved the pix2pix method and proposed TIC-CGAN, which was the first to use a GAN for TIR image translation in traffic scenes. To enrich the texture of the translated images, TIC-CGAN replaced the pix2pix generator with a coarse-to-fine generator, which led to finer texture in the target images.
The CNN- and GAN-based methods mentioned above train their networks on paired TIR-CV datasets, translating daytime TIR into daytime CV and nighttime TIR into nighttime CV, respectively. Since the high beams of oncoming vehicles in nighttime traffic scenes interfere with RGB imaging, the visual quality of nighttime CV is inferior to that of daytime CV; Luo et al. therefore improved CycleGAN and proposed PearlGAN [16], which uses an unpaired training mode to translate nighttime TIR into daytime CV. Although these improved GAN models increase the realism of the translated images, the generated CV images still exhibit unclear texture and color distortion. This is due mainly to the one-to-many correspondence in TIR to CV translation, which is an ill-posed problem [17]. Thus, end-to-end translations that map single-channel TIR directly to three-channel CV, including the CNN models above, cannot handle the one-to-many mapping between the temperature domain and the color domain well, resulting in varying degrees of color confusion and edge blurring in the translated images.
To reduce the instability of solving this ill-posed problem, we propose decomposing the TIR to CV translation into a two-phase process: the first phase translates TIR to GV, and the second translates GV to CV. In our experiments, we use the original CycleGAN as the base model. First, we change the one-to-three-channel TIR to CV translation into a one-to-one-channel TIR to GV translation. Although this does not intrinsically solve the problem of matching temperature to color, it reduces the uncertainty of the temperature-to-color translation, helping to sharpen image edges and reduce unwanted color noise. Inspired by the spatial attention module of AttentionGAN [18], and to better distinguish salient objects (movable pedestrians and vehicles) in the generated image, we separate objects from the background and extract a semantic mask from CV and a temperature mask from TIR, making the salient objects clearer without increasing the number of network parameters. Moreover, since the adversarial loss used in image translation tasks tends to produce distorted textures [17], a perceptual loss term is added to the original CycleGAN loss function to encourage the translated image to be more similar to the real image in the feature space, bringing its texture closer to that of the true GV image. Finally, considering that grayscale image colorization operates within a single domain and does not produce edge blurring or noise [4], the original CycleGAN is used directly for the GV to CV translation.
3. Methods
3.1. The Framework of GMA-CycleGAN
The flowchart of TIR-GV image translation based on our improved CycleGAN (GMA-CycleGAN) is shown in Figure 2, where $A$ and $B$ denote the two data domains, namely TIR and GV, $G_{A \to B}$ and $G_{B \to A}$ are the two mask attention-based CycleGAN generators, and $D_B$ and $D_A$ are the two CycleGAN discriminators. The first row represents the TIR to GV translation, whereas the second row represents the GV to TIR translation. An unpaired training scheme is used, where the real images $a$ and $b$ are randomly selected from the TIR and GV datasets, respectively. Taking the first row of Figure 2 as an example (the second row is handled symmetrically), the input real TIR image $a$ is translated by $G_{A \to B}$ to GV, and then the discriminator $D_B$ determines whether the generated GV image and the real GV image $b$ are real or fake and calculates the adversarial loss. Since the adversarial loss calculated by the CycleGAN discriminators can cause distorted textures in the generated images, we introduce a perceptual loss (pl) based on the VGG-16 feature extractor [19] to measure the difference between the global features of the generated and real images, which makes the overall visual effect of the generated images more realistic. In addition, CycleGAN also feeds the generated image and the real image $a$ into the generator $G_{B \to A}$ and computes the differences between the two translated images and the real image $a$, namely the cycle-consistency loss and the identity mapping loss.
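To make this flow concrete, the following minimal PyTorch sketch shows how the loss terms of one translation direction could be assembled; the function and variable names are ours rather than from the released code, and the mask inputs to the generators as well as the symmetric B-to-A half are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def loss_terms_a2b(real_a, real_b, G_ab, G_ba, D_b, perceptual):
    """Sketch of the A->B (TIR->GV) half of one GMA-CycleGAN iteration.

    G_ab / G_ba are the two generators, D_b is the GV discriminator, and
    `perceptual` is a VGG-16 feature-distance criterion (see Section 3.4).
    """
    fake_b = G_ab(real_a)     # translate the real TIR image into a generated GV image
    rec_a = G_ba(fake_b)      # translate it back for the cycle-consistency loss
    idt_a = G_ba(real_a)      # identity mapping of the real TIR image

    pred_fake = D_b(fake_b)   # discriminator judges generated vs. real GV
    adv = F.mse_loss(pred_fake, torch.ones_like(pred_fake))  # LSGAN-style, as in CycleGAN

    return {
        "adv": adv,
        "cyc": F.l1_loss(rec_a, real_a),   # cycle-consistency loss
        "idt": F.l1_loss(idt_a, real_a),   # identity mapping loss
        "pl": perceptual(fake_b, real_b),  # perceptual loss on VGG-16 features
    }
```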
The remainder of this section introduces the temperature mask and the semantic mask in Section 3.2, the improved mask attention-based generator in Section 3.3, and the improved CycleGAN loss function in Section 3.4.
3.2. Temperature Mask and Semantic Mask
To better identify the salient objects (movable pedestrians and vehicles) in the translated images and to support downstream tasks such as object recognition and object tracking after translation, we separate the objects from the background in both CV and TIR, extract a semantic mask and a temperature mask, respectively, and feed them into the generator as prior knowledge.
We extracted semantic maps from the real CV images using Mask2Former [20], a semantic segmentation model pre-trained on the Cityscapes dataset, and then assigned a value of zero to the background (e.g., sky, vegetation, and road) and a value of one to the objects to obtain binary semantic masks, as shown in Figure 3a–c. The semantic masks of daytime scenes are better than those of nighttime scenes, because the pre-trained segmentation model misjudges object categories in the dark areas of nighttime CV images and has difficulty distinguishing distant pedestrians; therefore, the more accurate temperature mask is used instead of the semantic mask in nighttime scenes.
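The binarization step itself is simple; the sketch below assumes the per-pixel Cityscapes train-ID label map has already been produced by the pre-trained Mask2Former, and the foreground ID set reflects the standard Cityscapes labeling:

```python
import numpy as np

# Cityscapes train IDs kept as salient foreground (person, rider, car, truck,
# bus, train, motorcycle, bicycle); everything else (road, sky, vegetation, ...)
# is treated as background.
FOREGROUND_IDS = [11, 12, 13, 14, 15, 16, 17, 18]

def semantic_mask(label_map: np.ndarray) -> np.ndarray:
    """Binarize an (H, W) Mask2Former label map: 1 on salient objects, 0 elsewhere."""
    return np.isin(label_map, FOREGROUND_IDS).astype(np.float32)
```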
Moreover, the raw TIR image contains the temperature information of the imaged area. Because different types of objects absorb, emit, and reflect heat differently, specific objects appear differently in TIR. Exploiting this property, we separate objects from the background by thresholding pixel values. Human body temperature is relatively stable, so pedestrians are the easiest to separate from the background; a car is hotter when its engine is running, and the metal and glass on its outer surface are more reflective when the engine is off, so vehicles are also easy to distinguish. Since a thermal infrared sensor receives both an object's own radiation and environmental radiation, and the influence of solar radiation on thermal imaging is small at night, the camera is then sensitive almost exclusively to the heat emitted by the objects themselves, which makes the segmentation of pedestrians and vehicles more accurate at night than during the day. In addition, as daytime lighting conditions vary, the threshold chosen for segmentation should also vary [21]. Therefore, we used the method proposed in [22] to divide the FLIR dataset into three scenarios, sunny day, cloudy day, and night, and then extracted the corresponding salient-object temperature threshold windows for each scenario.
We set TIR pixels whose values fall inside the threshold window to one and pixels outside the window to zero to obtain binary temperature masks, as shown in Figure 3d,e. As the figure shows, on sunny days pedestrians are segmented well, but the segmentation of cars in the distant parking lot is noisy and the road surface also introduces some noise; on cloudy days, vehicles and pedestrians can also be separated, though with a certain amount of noise; at night, even distant people are segmented accurately with little noise. Since semantic masks are less noisy than temperature masks during the day, whereas the opposite holds in nighttime scenes with less ambient radiation, we use semantic masks during the day and temperature masks at night. In general, both temperature and semantic masks contain some noise, but our network does not rely entirely on the masks, as it also exploits other feature information of the original TIR and CV images; thus, the mask noise has only a small effect.
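The thresholding can be sketched as follows; the window limits are placeholders that would be replaced by the scenario-specific values estimated with the method of [22]:

```python
import numpy as np

def temperature_mask(tir: np.ndarray, t_low: float, t_high: float) -> np.ndarray:
    """Binarize a single-channel TIR image with a scenario-specific threshold window.

    Pixels inside [t_low, t_high] are treated as salient objects (pedestrians,
    vehicles) and set to 1; all other pixels become 0.
    """
    return ((tir >= t_low) & (tir <= t_high)).astype(np.float32)

# Hypothetical usage with 8-bit TIR values and an illustrative window.
tir_image = np.random.randint(0, 256, size=(512, 640)).astype(np.float32)
mask_night = temperature_mask(tir_image, t_low=180, t_high=255)
```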
3.3. Generator Based on Mask Attention
Spatial attention focuses on local information within the spatial domain, that is, it identifies the areas of the feature map that deserve attention, yielding better network outputs. General spatial attention is computed by neural networks and is therefore posterior knowledge, whereas the temperature mask and the semantic mask in this study, inferred from the threshold window and the pre-trained segmentation model, can be regarded as a type of spatial attention (i.e., mask attention) based on prior knowledge that focuses on the mask region without increasing the number of network parameters. The proposed mask attention multiplies the input feature map with the mask channel-wise and pixel-wise, as expressed in Equation (1):

$F_{out} = F_{in} \odot \big( M + w\,(1 - M) \big)$,  (1)

where $F_{in}, F_{out} \in \mathbb{R}^{C \times H \times W}$ are the input and output feature maps of the mask attention, respectively, $M$ represents the binary mask whose height and width are equal to those of the feature maps, and $C$, $H$, and $W$ represent the number of channels, the height, and the width of the feature maps, respectively. In addition, $w$ is a parameter that adjusts the attention strength of the mask: the binary mask is converted into a weight mask by applying $M + w(1 - M)$. To emphasize the salient objects and suppress the background, we set $w \in (0, 1)$, since this parameter yields weaker attention as it approaches one and stronger attention as it approaches zero. For feature maps of different spatial sizes, the original mask is passed through a pooling layer with a $3 \times 3$ kernel, stride 2, and padding 1 to keep its height and width consistent with the feature maps.
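A minimal PyTorch sketch of this mask attention operation is shown below; the module and parameter names are ours, and the choice of max pooling to downscale the mask is an assumption consistent with the 3 × 3, stride-2, padding-1 layer described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAttention(nn.Module):
    """Mask attention of Equation (1): F_out = F_in * (M + w * (1 - M)).

    The binary mask keeps salient-object features at weight 1 and scales
    background features by w in (0, 1); no learnable parameters are added.
    """

    def __init__(self, w: float = 0.6):
        super().__init__()
        self.w = w

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat: (N, C, H, W) feature maps; mask: (N, 1, H0, W0) binary mask.
        # Downscale the mask with 3x3 / stride-2 / padding-1 pooling until its
        # spatial size matches the feature map (mirroring the encoder strides).
        while mask.shape[-1] > feat.shape[-1]:
            mask = F.max_pool2d(mask, kernel_size=3, stride=2, padding=1)
        weight = mask + self.w * (1.0 - mask)   # binary mask -> weight mask
        return feat * weight                     # broadcast over the channel axis
```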
The GMA-CycleGAN generator inserts mask attention between the convolutional layers, the instance normalization (IN) layers, and the ReLU activation layers of the original CycleGAN encoder and decoder. As shown in Figure 4, the encoder uses three convolutional layers to extract feature maps of size $256 \times \frac{H}{4} \times \frac{W}{4}$ from a source image of size $1 \times H \times W$; the translator then converts the feature maps extracted in the previous step into feature maps of the target image of the same size through nine residual blocks. Finally, the decoder uses three deconvolution layers to restore low-level features from the feature maps and output a $1 \times H \times W$ image. To maintain the symmetry of the CycleGAN generator, in the encoder we insert the mask attention module after the initial convolutional layer ($7 \times 7$ kernel, stride one) and after each of the two down-sampling convolutional layers ($3 \times 3$ kernel, stride two); the other structures are the same as in the original CycleGAN, and the corresponding modifications are added symmetrically to the decoder. Since the CycleGAN model is mirror-symmetric, the mask attention module is placed after the convolution operation during encoding and before the deconvolution operation during decoding.
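The placement described above can be sketched as follows (a simplified encoder only, using the MaskAttention module from the previous sketch; channel widths follow the original CycleGAN generator and the layer names are ours):

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder stage: Conv -> MaskAttention -> InstanceNorm -> ReLU."""

    def __init__(self, in_ch, out_ch, kernel, stride, padding, w=0.6):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)
        self.attn = MaskAttention(w)            # defined in the previous sketch
        self.norm = nn.InstanceNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, mask):
        return self.act(self.norm(self.attn(self.conv(x), mask)))

def build_encoder(w=0.6):
    # Initial 7x7 stride-1 convolution followed by two 3x3 stride-2
    # down-sampling convolutions, each wrapped with mask attention (Figure 4);
    # the decoder mirrors this layout with mask attention before each deconvolution.
    return nn.ModuleList([
        EncoderBlock(1, 64, kernel=7, stride=1, padding=3, w=w),
        EncoderBlock(64, 128, kernel=3, stride=2, padding=1, w=w),
        EncoderBlock(128, 256, kernel=3, stride=2, padding=1, w=w),
    ])
```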
3.4. Loss Function of GMA-CycleGAN
Let $A$ and $B$ represent the source image domain (TIR) and the target image domain (GV), respectively, and let $a$ and $b$ represent a source image and a target image, respectively. The original CycleGAN has three loss functions: the GAN loss, the identity mapping loss, and the cycle-consistency loss. The GAN loss comprises two adversarial terms, $\mathcal{L}_{GAN}(G_{A \to B}, D_B)$ and $\mathcal{L}_{GAN}(G_{B \to A}, D_A)$, which encourage the generated samples to follow the same distribution as the real samples. The cycle-consistency loss $\mathcal{L}_{cyc}$ encourages a sample to remain unchanged after passing through both generators, i.e., $G_{B \to A}(G_{A \to B}(a)) \approx a$ and $G_{A \to B}(G_{B \to A}(b)) \approx b$. As for the identity mapping loss $\mathcal{L}_{idt}$, it preserves the hue consistency between the generated image and the original image.
Since the adversarial loss can cause texture distortion in the generated image, we add a perceptual loss term $\mathcal{L}_{pl}$ to make the generated texture more realistic. The perceptual loss measures the distance between the feature maps of the generated image and those of the real image, bringing their high-level semantic information closer together. Following [23], we use a VGG-16 network pre-trained on the ImageNet dataset [24] as the feature extractor, and the perceptual loss is expressed in Equation (2):

$\mathcal{L}_{pl} = \big\| \phi(\hat{y}) - \phi(y) \big\|_2^2$,  (2)

where $\phi(\cdot)$ denotes the feature maps extracted by the pre-trained VGG-16, and $\hat{y}$ and $y$ denote the generated image and the real image, respectively.
Since the pre-trained VGG-16 network requires a three-channel input, while TIR and GV images are single-channel, we replicate the single channel three times and use the resulting image as the VGG-16 input.
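A sketch of this perceptual loss with torchvision's pre-trained VGG-16 is given below; the cut-off layer, the mean-squared distance, and the omission of ImageNet input normalization are our simplifying assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(nn.Module):
    """Feature-space distance computed with a frozen ImageNet-pre-trained VGG-16."""

    def __init__(self, cutoff: int = 16):
        super().__init__()
        # The first 16 layers of `features` end at relu3_3; other cut-offs are possible.
        self.extractor = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:cutoff].eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)

    def forward(self, fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
        # TIR / GV images are single-channel: replicate to three channels so
        # they match the VGG-16 input format, as described in the text.
        if fake.size(1) == 1:
            fake = fake.repeat(1, 3, 1, 1)
            real = real.repeat(1, 3, 1, 1)
        return nn.functional.mse_loss(self.extractor(fake), self.extractor(real))
```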
The final loss function of the model is a weighted combination of the individual losses, expressed as follows:

$\mathcal{L} = \mathcal{L}_{GAN}(G_{A \to B}, D_B) + \mathcal{L}_{GAN}(G_{B \to A}, D_A) + \lambda_{cyc}\,\mathcal{L}_{cyc} + \lambda_{idt}\,\mathcal{L}_{idt} + \lambda_{pl}\,\mathcal{L}_{pl}$,

where $\lambda_{cyc}$, $\lambda_{idt}$, and $\lambda_{pl}$ are the weight coefficients that adjust the ratio of the cycle-consistency loss, the identity mapping loss, and the perceptual loss, respectively.
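For completeness, a small sketch of the weighted combination is shown below; the numeric weights are illustrative placeholders rather than values from the paper, and `terms` is the dictionary returned by the training-step sketch in Section 3.1:

```python
def total_loss(terms, lambda_cyc=10.0, lambda_idt=5.0, lambda_pl=1.0):
    """Weighted sum of the adversarial, cycle, identity, and perceptual terms."""
    return (terms["adv"]
            + lambda_cyc * terms["cyc"]
            + lambda_idt * terms["idt"]
            + lambda_pl * terms["pl"])
```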
5. Results and Discussion
First, we evaluate the placement of the mask attention at different positions of the generator encoder: the FID of the six corresponding models under different mask attention parameters w is reported in Table 1. According to the table, the FID is smallest when the mask attention is added to the first two convolutional layers, i.e., the GMA-CycleGAN_4 model achieves the best translation. We believe that the first two convolutional layers extract low-level features of the TIR image, such as edge information, so the mask attention can guide the network to distinguish object boundaries from the background, whereas the third convolutional layer extracts more complex global features, which mask attention may disrupt. The table also shows that it is best to set the parameter w to 0.6: a smaller w over-suppresses the features of the background region, while a larger w does not distinguish the background from the objects sufficiently.
Since few studies have used the FLIR dataset for TIR translation, we selected five typical and popular unpaired image translation models for comparison with ours, namely CycleGAN [11], U-GAT-IT [12], NICE-GAN [13], CUT [14], and QS-Attn [3], and used their open-source code for training and testing. Table 2 reports the quantitative evaluation metrics of each model on the test set, where our model outperforms the other models across the board. Among the typical models, QS-Attn obtains the best realism indicator and NICE-GAN the best faithfulness indicator. Compared with these two state-of-the-art (SOTA) models, our model reduces the FID by 2.42, increases the PSNR by 1.43, and leaves the SSIM essentially unchanged. Moreover, our model takes about twice as long to train as CycleGAN but is still faster to train than the other models, with QS-Attn requiring the longest training time of about 9 days.
For subjective evaluation, we provide three nighttime TIR images, four daytime TIR images, and their corresponding translated CV images for the different models, as shown in Figure 5. Since all models in Figure 5 were trained on the unpaired dataset, in which daytime CV images account for the majority (about 80%), they readily translate night images into day images; this might suggest that the models are overfitted for night image translation, yet the loss curves showed no overfitting during training. In fact, to obtain better RGB visual effects, we intend to translate all TIR images into day images regardless of whether they were captured during the day or at night. In future work, we will consider supplementing additional information to improve the translation from nighttime TIR images to daytime CV images.
In both daytime and nighttime scenes, the images translated by the typical models show varying degrees of color confusion, whereas the proposed model does not, which demonstrates the advantage of translating TIR to GV first and then GV to CV. Moreover, in the images translated by the typical models, the edges of the sky and roads are clear but those of ground objects (such as pedestrians and vegetation) are more blurred, whereas the images translated by our model exhibit fewer blurred edges. Overall, the proposed model produces images with a better overall effect, more realistic textures, and a clearer distinction between salient objects (people and vehicles) and the background.
We also explore the impact of each component in ablation studies. Comparison results are shown in
Figure 6. As can be seen, removing the perceptual loss leads to distorted details, such as the white circles with blue edges (marked in green rectangles) in the two rows of
Figure 6b and the abnormal red color around the two pedestrians in the second row of
Figure 6b. Additionally, removing the mask attention leads to semantic confusion between salient objects and the background, such as the two pedestrians and the pick-up truck in the second row of
Figure 6c. Therefore, each component is indispensable for generating high-quality CV images.
Table 3 reports the quantitative evaluation metrics of the ablation studies on GMA-CycleGAN_4. The full GMA-CycleGAN_4 achieves the best metrics, while removing either the perceptual loss or the mask attention degrades performance: omitting the perceptual loss worsens the FID by 2.66 and reduces the PSNR and SSIM by 1.07 and 0.0086, respectively, while omitting the mask attention worsens the FID by 2.02 and reduces the PSNR and SSIM by 0.92 and 0.0062, respectively.
6. Conclusions
To address the color distortion and edge blurring produced by existing end-to-end TIR to CV translation models, we propose an improved CycleGAN (GMA-CycleGAN) that first translates TIR to GV and then uses the original CycleGAN to translate GV to CV. In this way, only a one-to-one mapping, the TIR to GV translation, is considered when crossing from the temperature domain to the color domain, which reduces the color ambiguity caused by cross-domain translation. We also use the temperature mask of TIR and the semantic mask of CV as prior knowledge to enhance the edge information of salient objects. In addition, to mitigate the texture distortion caused by the adversarial loss, a perceptual loss is added to the CycleGAN loss function. In terms of objective evaluation, compared with existing SOTA methods, our model requires less training time, reduces the FID by 2.42, and increases the PSNR by 1.43. In terms of subjective evaluation, the experimental results show that the texture and color of the images translated by our method are more realistic and the edge information of salient objects is richer. The results validate the effectiveness of the proposed method and indicate its relevance to many fields, such as autonomous vehicles, emergency rescue, robot navigation, and nighttime video surveillance.
Future work is threefold. First, to apply our method to a wider range of datasets, the temperature mask needs to be extracted from JPG- or PNG-format image files. Second, semantic segmentation of TIR and GV images, which can be used to maintain semantic consistency between real and generated images, is needed to further improve the quality of the translated images. Third, the generalization ability of our model is not ideal on test datasets that differ considerably from the aligned FLIR dataset, and we plan to address this problem in future research.