Article

VQ-InfraTrans: A Unified Framework for RGB-IR Translation with Hybrid Transformer

School of Optics and Photonics, Beijing Institute of Technology, No. 5 South Street, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(24), 5661; https://doi.org/10.3390/rs15245661
Submission received: 27 September 2023 / Revised: 22 November 2023 / Accepted: 4 December 2023 / Published: 7 December 2023
(This article belongs to the Special Issue Computer Vision and Image Processing in Remote Sensing)

Abstract

Infrared (IR) images containing rich spectral information are essential in many fields. Most current RGB-IR transfer work relies on conditional generative models trained on IR images from specific devices and scenes. However, these models only establish an empirical mapping between RGB and IR images within a single dataset, so they cannot handle multi-scene and multi-band (0.7–3 μm and 8–15 μm) transfer tasks. To address this challenge, we propose VQ-InfraTrans, a comprehensive framework for transferring images from the visible spectrum to the infrared spectrum. Our framework incorporates a multi-mode approach to RGB-IR image transfer, encompassing both unconditional and conditional transfers, and achieves diverse and flexible image transformations. Instead of training individual models for each specific condition or dataset, we propose a two-stage transfer framework that integrates diverse requirements into a unified model, using a composite encoder–decoder based on VQ-GAN and a multi-path transformer to translate multi-modal images from RGB to infrared. To address the significant errors that arise when transferring specific targets due to their radiance, we developed a hybrid editing module that precisely maps spectral transfer information for specific local targets. The qualitative and quantitative comparisons conducted in this work reveal substantial enhancements compared to prior algorithms: the objective evaluation metric SSIM (structural similarity index) was improved by 2.24% and the PSNR (peak signal-to-noise ratio) by 2.71%.


1. Introduction

Complex illumination scenarios have adversely affected the accuracy of visible-light data in recent years. These conditions are beyond our control and significantly reduce the usefulness of the images, posing a significant challenge for processing and training and limiting the range of applications for these data. Infrared (IR) images (both 0.7–3 μm and 8–15 μm) offer radiation-intensity texture information that visible images lack, making them particularly helpful in daytime, nighttime, and complex scenes. In low-light conditions, infrared images captured through thermal radiation (8–15 μm) provide enriched semantic information, and objects with high thermal temperatures can reveal discernible features within intricate scenes. Deep-learning-based cross-modal image translation has therefore become a hot topic in remote sensing research in recent years. Many researchers are studying how to translate RGB images into infrared images for deep-learning-based visual tasks such as object tracking, crowd counting, panoramic segmentation, and image fusion in urban scenarios. The utilization of RGB-IR datasets in the aforementioned tasks holds the potential to provide comprehensive multi-band fusion data for urban scenes, thereby facilitating precise modeling across different scenarios.
Large-scale neural network algorithms based on RGB features can be trained using large monomodal public datasets, such as ImageNet [1], PASCAL VOC [2], and MS COCO [3]. However, compared to RGB datasets, public infrared image datasets often suffer from limitations such as limited scene diversity, a lack of diverse target categories, low data volume, and low resolution. Therefore, researchers have developed a large number of deep-learning-based style transfer algorithms that convert RGB images to infrared images through end-to-end translation, such as CNNs [4,5,6], GANs [7,8,9], and attention networks [10,11,12], to learn and fit the mapping relationship between RGB and IR images. These RGB-IR algorithms approach the task as a pixel-level conditional generation problem. IR images convert the radiation intensity field into grayscale images, so the mapping relationship between IR and RGB images is not based on spectral physical characteristics, and there is no strict pixel-level correspondence [13]. The research in [14,15] indicated that, while a conditioned generative model can successfully generate customized IR images, such models primarily focus on texture or content transformation from RGB to IR, without considering the diverse types of migration mapping relationships between different spectral bands. The mono-modality transformation predominantly relies on simplistic semantic matching and transfer strategies, leading to unrealistic expression of radiation information. Due to the global feature extraction and generation mechanisms of the transfer model, vehicles and pedestrians exhibit significant disparities between the generated infrared textures and the ground truth, and they may even be overlooked in some results. Consequently, this limits flexibility and versatility across various scenarios and tasks. In practical applications, it is crucial for the model to accurately translate complex and diverse scenes, data, and task requirements. Therefore, designing a unified visible-infrared migration framework suitable for multi-scene and multi-task purposes holds significant practical value.
The RGB-infrared transfer framework encounters two specific challenges: limited scenario diversity and limited flexibility for general use. The generated translations should possess semantic integrity and visual rationality to ensure high-quality output. We therefore propose a novel multi-modal two-stage training framework called VQ-InfraTrans. Our framework can translate an RGB image from scratch or based on conditions, in a single mode or multi-modally, for example independently performing RGB image migration to short-wave infrared (SWIR, 0.7–3 μm) and long-wave infrared (LWIR, 8–15 μm) images. We mainly consider modalities including exemplar-based translation and particular target editing, which are the common interactive migration methods. This allows for scenario-specific translation from RGB to IR based on corresponding reference images, including an SWIR remote sensing scene dataset, autonomous driving scenes, and target recognition scenes with LWIR datasets. In this RGB-IR study, we propose a novel multi-modal translation approach. This method not only enhances the overall naturalness of human–computer interaction but also consolidates information from multiple data sources to generate more comprehensive results. First, we employ a VQ-GAN [16] structure in the initial training step to ensure the quality of the transfer results. Two encoder–decoder pairs with identical structures are used to reconstruct RGB and IR images by learning discretized codebooks. The vectorized content representation accurately reconstructs input image details and converts image features into a simpler, faster-to-compute codebook, enabling fast transfer and decoding of inter-domain features. In the second step, we adopt a hybrid dual-channel transformer to remix the IR and RGB codebooks for predicting RGB-IR pixel relationships. Finally, a CNN decoder is employed to reconstruct the mixed encoding into an IR image. To address the challenge of accurately generating high-radiance textures for vehicles and pedestrians in the infrared spectrum, our approach ensures alignment between the generated infrared images and the radiation features of ground-truth IR images. This process entails employing object detection algorithms to extract pedestrians and vehicles as IR texture masks, followed by local target texture reinforcement. Extensive experiments on multiple datasets demonstrated that our method can effectively generate high-quality results in various scenarios, as shown in Figure 1. In summary, we make the following contributions:
  • We propose a unified transfer learning framework that can generate multi-modal and multi-scene IR images from an RGB image. We introduce multi-modal transfer learning into IR image generation, achieving the best generalization and performance currently available through vectorized reconstruction and hybrid transformer models;
  • We propose a reference-based image translation model containing vectorized encoders and instance segmentation. The IR radiation distribution and instance mask from the model can enhance local target features with similar semantics in the synthetic IR image. When dealing with targets exhibiting high radiation features, our framework demonstrates the capability to produce high-quality infrared images.
Figure 1. Given an input RGB image, our framework VQ-InfraTrans is able to (A) produce diverse IR results unconditionally; (B) perform scenario-specific translation from RGB to IR based on corresponding reference images with partial controls, where the radiation texture of the target in the red box has been locally reinforced.
The remainder of this article is organized as follows: Section 2 reviews the relevant research. Section 3 presents the proposed algorithm. Section 4 presents the experimental results, analysis, and discussion. Section 5 concludes the article.

2. Related Work

Image-to-image translation was first discussed in [17], which learned the mapping function between source and target domains. This style of transfer work mainly deals with two significant challenges. First, the imaging principles of IR and RGB sensors differ, and the radiation field of IR varies significantly from the color space; consequently, traditional methods find it difficult to determine the mapping relationship between RGB and IR. Second, the mainstream infrared migration methods are based on end-to-end generative adversarial networks. Among them, cycle consistency is used to handle unpaired data [18,19], while an enhanced attribute space has been proposed to provide diversity [20]. Most algorithms for translating infrared images introduced architectures based on Cycle-GAN, such as DRIT++ [21,22,23,24]. In addition, some other algorithms also provide appropriate structural solutions for this task. For example, FastCUT [25] adopts one-sided translation without using cycle consistency to improve diversity [26,27], and U-GAT-IT [28] focuses explicitly on geometric transformations of content in translation. Kuang et al. [29] improved the pix2pix method and proposed TIC-CGAN, the first GAN application to translate thermal IR (8–15 μm) images in traffic scenes. The generator in ThermalGAN [30] utilized a U-Net-based architecture, and the authors used a unique dataset named ThermalWorld to enhance training. In DRIT [22], the authors introduced the use of multiple generators: each generator focused on learning the attributes of a different scene, and a classifier based on ResNet [31] was used to determine which generator's output was most suitable for a given input image.
Wang et al. [32] proposed an attention-based hierarchical thermal infrared image colorization network (AHTIC-Net) to enhance the realism and richness of texture information for small objects in translated images. It employs a multi-scale structure to extract features of objects with different sizes, thereby improving the model's focus on small objects during training. In recent years, many migration models have leaned towards universal style transfer (UST) methods. Representative UST methods include AdaIN [33], WCT [34], and Avatar-Net [35], and these methods have been continuously expanded upon [36,37,38]. However, they are limited in their disentanglement and reconstruction of image content during stylization. In addition, research on extracting image content structure and texture style features has matured. Gatys et al. [39] found that the layers of a CNN can extract content structure and style texture, and they proposed an optimization-based iterative generation method for stylized images. Johnson et al. [40] and Li and Wand [41] used end-to-end models to achieve real-time style transfer for a specific style. To enable more efficient applications, StyleBank and related methods [42,43,44] combined multiple styles into one model and achieved excellent stylization results. Chen et al. [45] proposed an internal–external style transfer algorithm (IEST) that includes two contrastive losses and can generate more natural stylized effects. However, existing encoder–transfer–decoder style transfer methods cannot handle long-range dependencies, which may result in the loss of detailed information.
Recently, the effectiveness of vector quantization (VQ) as an intermediate representation for generative models has been demonstrated [16,46]; VQ addresses the scaling problem of pixel representations by using quantized latent vectors [47,48]. Therefore, in this work, we explore the suitability of vectorization as the encoder for RGB-IR tasks, where the latent representation obtained through vector quantization serves as the intermediate representation.
VQ-InfraTrans builds on this intermediate vector-encoding representation (codebook), enabling both unconditional and end-to-end migration. The framework applies a transformer-based regression over the codebook of the input image, facilitating seamless style transfer. This approach mitigates the risk of losing fine-grained details during feature extraction and preserves the generated structures with fidelity.

3. Method

3.1. A Unified Framework

The VQ-InfraTrans framework aims to perform unconditional or conditioned translation of RGB images into IR images. The framework translates RGB images according to the corresponding IR bands or scenes, blending RGB image features with the radiation characteristics of local targets in IR images.
As a one-to-many mapping problem, the framework should generate IR results that are both aesthetically pleasing and semantically valid. To achieve this goal, as illustrated in Figure 2, our framework comprises a content/style encoder (i.e., an RGB image encoder and an IR image encoder), a vector quantized content codebook Z, and a hybrid-transformer codex. The training proceeds in two steps: (1) In the first step, we employ the VQ-GAN [16] codec as the model to be trained. Its robust and efficient image reconstruction capability enhances image resolution during editing and reduces the likelihood of content leakage. Given an input image, the content encoder and style encoder compute domain-specific features for domains C and S, and the generated content is aligned with the quantized codes Z of the image feature vectors. (2) Using a two-way mixed transformer with the codebook Z from both domains, we establish connections between tokens of different domains to learn globally consistent characteristics along with specific style and content attributes [49,50]. After refactoring the codes through the transformer decoder, the migrated features in Z are reconstructed into images using a CNN decoder similar to that of the first-step VQ-GAN [16].
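To make the two-stage data flow concrete, the sketch below traces one conditioned translation through the pipeline. It is a minimal illustration only: every module name (the encoders, the quantizers, the hybrid transformer, and the CNN decoder) is a placeholder for the components described in the following subsections, not the authors' implementation.

```python
import torch

# Minimal sketch of the two-stage inference path described above. Every callable
# passed in here is a placeholder for a component detailed in Sections 3.1.1-3.1.2.
def translate_rgb_to_ir(rgb_img, ir_ref, E_c, E_s, quantize_content, quantize_style,
                        hybrid_transformer, cnn_decoder):
    with torch.no_grad():
        c_hat = E_c(rgb_img)                    # content features from the RGB encoder
        s_hat = E_s(ir_ref)                     # style features from the IR reference encoder
        z_c = quantize_content(c_hat)           # nearest entries in the content codebook Z
        z_s = quantize_style(s_hat)             # nearest entries in the style codebook
        z_out = hybrid_transformer(z_c, z_s)    # two-path transformer remixes the token streams
        return cnn_decoder(z_out)               # second-stage CNN decoder reconstructs the IR image
```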

3.1.1. Vector Quantized Content Representation

Vector Quantization Encoder. In the initial stage, illustrated in Figure 2, a vector quantization strategy is employed to independently encode two distinct types of image-domain features, namely RGB image information and IR image information. Specifically, given visual domains $I_c \in \mathbb{R}^{H \times W \times 3}$ and $I_s \in \mathbb{R}^{H \times W \times 3}$, we construct a codebook $Z = \{z_k\}$ based on the vector quantization algorithm, consisting of learned content encodings $z_k \in \mathbb{R}^{n_c}$, where $n_c$ represents the encoding dimension. Given the two encoders ($E_c$ and $E_s$) and each feature item $\hat{c}_{ij} \in \mathbb{R}^{n_c}$ extracted by the content encoder $E_c$ and style encoder $E_s$, we find the closest encoding in the codebook $Z$ to obtain the vector representation $c$ through vector quantization:

$$c = \mathrm{vq}(\hat{c}) := \left(\arg\min_{z_k \in Z} \lVert \hat{c}_{ij} - z_k \rVert\right) \in \mathbb{R}^{h \times w \times n_c}$$
Loss Function for Vector Encoder. As the quantization operation $\mathrm{vq}$ is not differentiable for gradient back-propagation, we employ the straight-through trick [48] to copy the gradient from $c$ to $\hat{c}$ and optimize the codebook $Z$ using the self-reconstruction path and the loss function $\mathcal{L}_{vq}$:

$$\mathcal{L}_{vq} = \lVert \mathrm{sg}[\hat{c}] - c \rVert_2^2 + \lVert \mathrm{sg}[c] - \hat{c} \rVert_2^2$$

where $\mathrm{sg}[\cdot]$ is the stop-gradient operation.
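The following PyTorch sketch illustrates the nearest-neighbour lookup $\mathrm{vq}(\cdot)$, the straight-through gradient copy, and the codebook loss $\mathcal{L}_{vq}$ defined above. It is a minimal, hedged example: the codebook size, feature dimension, and the use of mean-squared error in place of the summed squared norm are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Sketch of the vector-quantization step with a straight-through estimator."""
    def __init__(self, n_e: int = 1024, n_c: int = 256, beta: float = 1.0):
        super().__init__()
        self.codebook = nn.Embedding(n_e, n_c)          # the learned codebook Z
        self.codebook.weight.data.uniform_(-1.0 / n_e, 1.0 / n_e)
        self.beta = beta

    def forward(self, c_hat: torch.Tensor):
        # c_hat: (B, n_c, H, W) continuous encoder output
        b, n_c, h, w = c_hat.shape
        flat = c_hat.permute(0, 2, 3, 1).reshape(-1, n_c)       # (B*H*W, n_c)
        # nearest codebook entry for every spatial position: c = vq(c_hat)
        dist = torch.cdist(flat, self.codebook.weight)           # (B*H*W, n_e)
        idx = dist.argmin(dim=1)
        c = self.codebook(idx).view(b, h, w, n_c).permute(0, 3, 1, 2)
        # codebook / commitment loss L_vq with stop-gradient terms
        loss = F.mse_loss(c, c_hat.detach()) + self.beta * F.mse_loss(c_hat, c.detach())
        # straight-through trick: copy gradients from c back to c_hat
        c = c_hat + (c - c_hat).detach()
        return c, idx.view(b, h, w), loss
```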

3.1.2. Hybrid-Transformer Translation

In the second step of model training, we take the content-invariant RGB image features obtained through vector quantization encoding and the style features encoded from infrared images, and merge them with a hybrid transformer to achieve unconditional RGB-IR translation (see Figure 2). The network consists of two different transformer encoders, one for content and the other for style [49]. In addition, a multi-layer transformer decoder [50] encodes the content sequence conditioned on the style sequence after the transformer encoders.
Transformer Encoder. The feature extraction and encoding process in this transformer encoder is as follows. The input image undergoes codebook discretization to obtain $z_q \in \mathbb{R}^{h \times w \times n_z}$. (Note that this part uses the pretrained model from the previous step, for forward propagation only.) To train the transformer encoder [16], we flatten $z_q$ into a sequence in $\mathbb{R}^{(h \cdot w) \times n_z}$, resulting in $h \cdot w$ codes that form the input token sequence. The input sequence is encoded into $Q$, $K$, $V$:

$$Q = z_q W_q, \quad K = z_q W_k, \quad V = z_q W_v$$

where $W_q, W_k, W_v \in \mathbb{R}^{C \times d_{head}}$, $z_q$ is the embedding of the input sequence via the vector encoder, and $d_{head} = C / N$. The multi-head attention $F_{MSA}(Q, K, V)$ is then calculated using

$$F_{MSA}(Q, K, V) = \mathrm{Concat}\left(\mathrm{Attention}_1(Q, K, V), \ldots, \mathrm{Attention}_N(Q, K, V)\right) W_o$$

where $W_o \in \mathbb{R}^{C \times C}$ is a learnable parameter and $N$ represents the number of attention heads.
$$Y_c' = F_{MSA}(Q, K, V) + Q$$

$$Y_c = F_{FFN}(Y_c') + Y_c'$$

The residual FFN structure $F_{FFN}(Y_c') = \max(0, Y_c' W_1 + b_1) W_2 + b_2$ is used, and layer normalization (LN) [51] is performed after each module. Similarly, the encoded style sequence $Y_s$ is obtained following the same calculation process as $Y_c$. The transformer encoder structures for the vectorized content encoding $Z_q$-content and the texture encoding $Z_q$-style are identical. Here, it is assumed that there is no need to impose semantic, edge, or structural constraints on the content encoder.
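As a concrete reference, the block below sketches one such transformer-encoder layer applied to the flattened token sequence $z_q$: self-attention producing $Y_c' = F_{MSA}(Q, K, V) + Q$, followed by the FFN residual and layer normalization. The head count and FFN width are assumptions, and PyTorch's built-in multi-head attention stands in for the per-head formulation above.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Sketch of one transformer-encoder layer over the vector-quantized tokens."""
    def __init__(self, dim: int = 512, n_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, z_q: torch.Tensor) -> torch.Tensor:
        # z_q: (B, h*w, dim) flattened token sequence from the codebook
        attn, _ = self.msa(z_q, z_q, z_q)       # self-attention: Q, K, V all from z_q
        y = self.ln1(z_q + attn)                # Y_c' = F_MSA(Q, K, V) + Q, then LN
        y = self.ln2(y + self.ffn(y))           # Y_c  = F_FFN(Y_c') + Y_c', then LN
        return y
```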
Transformer Decoder. Our transformer decoder translates based on the encoded content sequence $Y_c$ and the encoded style sequence $Y_s$. As shown in Figure 2, each transformer decoder layer consists of two MSA layers and one FFN layer [49]. The input to the transformer decoder comprises the encoded content sequence $Y_c$ and the style sequence $Y_s$: the content sequence generates the queries $Q$, and the style sequence generates the keys $K$ and values $V$.
After each block in the decoder, layer normalization (LN) is applied. The final result $X$ is computed as shown below:

$$Q = \hat{Y}_c W_q, \quad K = Y_s W_k, \quad V = Y_s W_v$$

$$X' = F_{MSA}(Q, K, V) + Q, \qquad X'' = F_{MSA}(X', K, V) + X', \qquad X = F_{FFN}(X'') + X''$$
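A minimal sketch of one decoder layer with this layout is given below: self-attention over the content tokens, cross-attention whose keys and values come from the style tokens, and a final FFN, each followed by a residual connection and layer normalization. Layer widths are illustrative assumptions rather than the authors' settings.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Sketch of one transformer-decoder layer: two MSA layers plus one FFN."""
    def __init__(self, dim: int = 512, n_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.ln3 = nn.LayerNorm(dim)

    def forward(self, y_c: torch.Tensor, y_s: torch.Tensor) -> torch.Tensor:
        # y_c: encoded content sequence, y_s: encoded style sequence, both (B, L, dim)
        x = self.ln1(y_c + self.self_attn(y_c, y_c, y_c)[0])   # X'  = F_MSA(Q, K, V) + Q
        x = self.ln2(x + self.cross_attn(x, y_s, y_s)[0])      # X'' = F_MSA(X', K, V) + X'
        x = self.ln3(x + self.ffn(x))                          # X   = F_FFN(X'') + X''
        return x
```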
Decoder. After applying the transformer for encoding and decoding, a new codebook $Z_q^{out}$ is generated. However, since $Z_q^{out}$ is already the style-translated result, the decoder of the VQ-GAN [16] trained in Section 3.1.1 is incompatible with this codebook. To address this issue, we introduce a new CNN decoder with the same layout as the vector decoder, which is trained separately in the second phase and reconstructs the translated image. The final output is the translated image $I_{out} \in \mathbb{R}^{H \times W \times 3}$.
The Loss Function for Content and Style Representations. To define the content loss $\mathcal{L}_c$, a pretrained VGG network extracts features $\phi_i$ from the translated image and the input content image, and the Euclidean distance between them is calculated for each layer:

$$\mathcal{L}_c = \frac{1}{N_l} \sum_{i=0}^{N_l} \lVert \phi_i(I_o) - \phi_i(I_c) \rVert_2$$
The style (texture) loss $\mathcal{L}_s$ is obtained similarly: the pretrained VGG extracts features from each layer of the input and output images, and the differences between the per-layer means and variances are measured:

$$\mathcal{L}_s = \frac{1}{N_l} \sum_{i=0}^{N_l} \lVert \mu(\phi_i(I_o)) - \mu(\phi_i(I_s)) \rVert_2 + \lVert \sigma(\phi_i(I_o)) - \sigma(\phi_i(I_s)) \rVert_2$$
As in the reconstruction of GAN networks, identity loss is a vital component for maintaining the stylistic consistency of the translation results [13]. Identity loss [51] is used to learn richer and more accurate content and style representations: two identical content (style) images are fed into the network, and the generated output $I_{cc}$ ($I_{ss}$) should be the same as the input $I_c$ ($I_s$). We therefore compute two identity loss terms to measure the difference between $I_c$ ($I_s$) and $I_{cc}$ ($I_{ss}$):

$$\mathcal{L}_{id1} = \lVert I_{cc} - I_c \rVert_2 + \lVert I_{ss} - I_s \rVert_2$$

$$\mathcal{L}_{id2} = \frac{1}{N_l} \sum_{i=0}^{N_l} \lVert \phi_i(I_{cc}) - \phi_i(I_c) \rVert_2 + \lVert \phi_i(I_{ss}) - \phi_i(I_s) \rVert_2$$
Finally, we obtain the overall loss function for this framework:

$$\mathcal{L} = \lambda_c \mathcal{L}_c + \lambda_s \mathcal{L}_s + \lambda_{id1} \mathcal{L}_{id1} + \lambda_{id2} \mathcal{L}_{id2}$$

We balance the differences in magnitude of the loss terms by setting $\lambda_c = 10$, $\lambda_s = 10$, $\lambda_{id1} = 20$, and $\lambda_{id2} = 1$.
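The sketch below assembles these four terms into the total objective with the weights given above. The choice of VGG-19 layers for $\phi_i$ and the use of mean-squared error in place of the unsquared Euclidean norm are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG-19 feature extractor; the layer indices used as phi_i are assumed.
_vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
_layers = [1, 6, 11, 20]  # relu1_1, relu2_1, relu3_1, relu4_1 (illustrative choice)

def vgg_feats(x):
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _layers:
            feats.append(h)
    return feats

def total_loss(I_o, I_c, I_s, I_cc, I_ss,
               lam_c=10.0, lam_s=10.0, lam_id1=20.0, lam_id2=1.0):
    f_o, f_c, f_s = vgg_feats(I_o), vgg_feats(I_c), vgg_feats(I_s)
    # content loss: per-layer distance between translated-output and content features
    L_c = sum(F.mse_loss(a, b) for a, b in zip(f_o, f_c)) / len(f_o)
    # style loss: match per-layer channel means and standard deviations
    L_s = sum(F.mse_loss(a.mean((2, 3)), b.mean((2, 3)))
              + F.mse_loss(a.std((2, 3)), b.std((2, 3)))
              for a, b in zip(f_o, f_s)) / len(f_o)
    # identity losses: self-reconstructions I_cc / I_ss should match their inputs
    L_id1 = F.mse_loss(I_cc, I_c) + F.mse_loss(I_ss, I_s)
    f_cc, f_ss = vgg_feats(I_cc), vgg_feats(I_ss)
    L_id2 = (sum(F.mse_loss(a, b) for a, b in zip(f_cc, f_c))
             + sum(F.mse_loss(a, b) for a, b in zip(f_ss, f_s))) / len(f_cc)
    return lam_c * L_c + lam_s * L_s + lam_id1 * L_id1 + lam_id2 * L_id2
```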

3.2. Local Target Reinforcement

In remote sensing tasks, we are often limited to a small number of infrared images of a scene and lack precise calibration data for these images. To ensure that the high-radiance targets of the vehicle and pedestrian categories in the generated images closely resemble the ground-truth infrared textures, we propose a semantic-matching-based local target grayscale enhancement method. This method leverages exemplars with semantically similar targets and uses a detection algorithm (YOLOv7 [52]) to extract pedestrians and vehicles as infrared texture masks. Subsequently, the infrared mask features of the local targets ($f_{Mask}$) are incorporated into the RGB features ($f_{RGB}$). At the same time, the system mixes all the IR references generated from the selected conditions for migration.
$$f_{content} = f_{RGB} + f_{Mask}$$
Specifically, as shown in Figure 3, we use instance segmentation to obtain the target mask of the infrared image and preserve the texture of the infrared target within the mask. The target radiance texture mask is input into the style encoder and then concatenated with the encoding result of the source RGB image. We then create a hybrid content code by mapping the targets' mask codebook through an embedding layer to generate embedded radiation features ($f_{content}$). The transformer encoder continues the style migration training on this combined result [53]. For the whole network, the mask map of the instantiated target is generated by the vector encoder trained in the first step and embedded into the content image encoding result $Y_c$. In the second training step, the transformer codebook learns that the IR-style image carries local target greyscale information, so it adjusts the greyscale of these local targets in the RGB image to make their local IR features more consistent with the actual situation. The detailed steps are shown in Figure 3.
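The sketch below summarizes this local reinforcement step, i.e., $f_{content} = f_{RGB} + f_{Mask}$. The detector interface and the helper names are hypothetical; only the overall masking-and-mixing logic follows the description above.

```python
import torch

# Sketch of local target reinforcement. `detector`, `content_encoder`, and
# `style_encoder` are hypothetical handles to the YOLOv7 detector and the
# frozen stage-one vector encoders; they are not the authors' exact interfaces.
def reinforce_local_targets(rgb_img, exemplar_ir, detector, content_encoder, style_encoder):
    # 1. detect high-radiance targets (pedestrians, vehicles) in the exemplar IR image
    boxes = detector(exemplar_ir)                   # assumed to return (x1, y1, x2, y2) tuples
    mask = torch.zeros_like(exemplar_ir)
    for x1, y1, x2, y2 in boxes:
        mask[..., y1:y2, x1:x2] = 1.0
    ir_target_texture = exemplar_ir * mask          # keep IR texture only inside the target masks

    # 2. encode both streams with the pretrained (frozen) vector encoders
    with torch.no_grad():
        f_rgb = content_encoder(rgb_img)            # content features of the source RGB image
        f_mask = style_encoder(ir_target_texture)   # masked IR radiance features

    # 3. mix the masked IR radiance features into the content code fed to the transformer
    f_content = f_rgb + f_mask
    return f_content
```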

4. Experiments

In this section, we present the results and analysis of the experiments conducted within this framework, along with example translations and their results. All experiments were conducted on 4 NVIDIA RTX3090 GPUs (ASUS, Taiwan, China).

4.1. Compared Baselines and Datasets

We utilized three datasets containing infrared and RGB image pairs: VEDAI (0.7–3 μm) [54], FLIR (7.5–13.5 μm), and M3FD (7.5–13.5 μm) [55]. Unlike the three models of the control group, we did not train a separate model on each of the three datasets. Instead, we sampled IR images from all three datasets and included them in a single training set, to enhance the generalization and adaptability of our unified model.
Specifically, VEDAI [54] is a dataset for vehicle detection in aerial imagery. It consists of 1268 image pairs (RGB and IR images); we used 1068 image pairs for training and 200 for subsequent testing. For the FLIR dataset, we used 8863 image pairs for training and 1200 image pairs for testing. For the M3FD [55] dataset, we selected 4200 RGB-IR paired images under different conditions, such as nighttime and sunlight.
The settings for the network parameters in the experiments were as follows. For the first-step VQ-GAN [16], we followed the original code [46]: the channel number for the encoder was set to 512, and the codebook size was set to 1024. The structures of the content encoder (RGB image input) and the style encoder (IR image input) were identical, except that the input channel number for the IR image changed from 3 to 1. For the second-step transformer encoder, we set the dimension of the vector quantized codebook to $n_c = 1024$, and both the input and output channels to $d = 512$, to match the VQ-GAN encoder [16]. During training, the input image size was scaled to 512 × 512. The parameters were $\beta_1 = 0.01$ and $\beta_2 = 0.999$, and the initial learning rate was $2.5 \times 10^{-5}$. The first-step training ran for 500 epochs, while the transformer training consisted of 320,000 iterations.
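For reference, the hyperparameters reported above can be collected as follows; the optimizer type is not stated in the text, so Adam is an assumption here, with only the betas and learning rate taken from the paper.

```python
import torch

# Hyperparameters as reported above, gathered in one place for reference.
config = {
    "encoder_channels": 512,        # VQ-GAN encoder channel number
    "codebook_size": 1024,          # vector-quantized codebook entries (n_c)
    "transformer_dim": 512,         # transformer input/output channels (d)
    "image_size": 512,              # training images scaled to 512 x 512
    "stage1_epochs": 500,           # first-step VQ-GAN training
    "stage2_iterations": 320_000,   # second-step transformer training
}

def make_optimizer(params):
    # betas and learning rate as reported; the optimizer choice itself is an assumption
    return torch.optim.Adam(params, lr=2.5e-5, betas=(0.01, 0.999))
```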

4.2. Qualitative Evaluation

4.2.1. Unconditional Translation

As shown in Figure 4, our model was trained using unpaired datasets. Most of the images in the dataset are daytime RGB images (about 80%), and models can easily convert night-style images into daytime-style ones, indicating overfitting toward the daytime majority. In fact, in Figure 4, the different scenes, such as the sky or background, are not uniformly black but display different grayscale values depending on day–night or temperature changes. Our algorithm accurately matched the sky grayscale when using multimodal translation, a functionality currently not achievable with other asymmetric matching algorithms.
Our transformer decoder based on the MLP projector effectively maintains the low- and medium-level features of the RGB input image [56], thereby enhancing the transfer performance in unsupervised learning. Our outputs exhibit high-quality infrared texture and precise geometric features. Notably, our two-path hybrid transformer enables us to produce results for various modalities, as illustrated in Figure 4. This novel RGB-IR approach, which facilitates cross-modal interaction, distinguishes our work from previous studies.

4.2.2. Exemplar Translation

Contrastive algorithms based on pixel-to-pixel mapping are influenced by the overall style of the reference image when transferring objects. For example, when the black background of the sky occupies a large portion of the reference image, the grayscale values of their generated results tend to converge toward the grayscale values of the sky, resulting in darker overall infrared (IR) outputs.
Our algorithm possesses two critical advantages for such tasks: (1) A dedicated MLP and a mixed transformer mechanism enable the network to flexibly and accurately transfer detailed texture from the infrared image without being distracted by structural details [56]; (2) Feature reconstruction through vector quantization effectively preserves geometric structural information, leading to highly accurate results. As illustrated in Figure 5 and Figure 6, our method performed well in extracting the infrared features of the various target objects from the reference image and in accurately translating these texture features into the generated outputs, whether on the texture-rich FLIR data or on M3FD with its more prominent infrared sources.
For the targets and background, as shown in block 7 in Figure 5, the infrared texture features of the road in the background vary significantly under different periods and lighting conditions. The contrastive algorithms failed to extract the infrared texture features of the road in the reference image effectively, resulting in translated results that still retain some RGB image features, such as road reflections.
As shown in red block 2 in Figure 5, the other GAN-based and attention-based algorithms failed to localize and transfer the infrared texture of high-intensity targets, such as buildings and pedestrians, while preserving the overall style transfer. The pedestrians and vehicles in their IR outputs appear more like grayscale RGB images than objects with infrared characteristics. Furthermore, in the fourth box of the second row in the figure, it can be observed that the results obtained from the CUT [25] and QSA [57] algorithms lack the desired high-intensity infrared features and instead exhibit blurriness and geometric distortions due to the influence of the RGB image.
The results are depicted in Figure 5 and Figure 6, where we merge the content and style extracted from two images to translate from the RGB domain to the IR domain. Our method can generate diverse and vivid greyscale images, with reference sampling from the hybrid transformer (e.g., the people in blocks 2, 4, and 6 and the buildings in blocks 1 and 8).
Figure 6 presents a comprehensive analysis of the migration outcomes on the short-wave infrared remote sensing imagery of VEDAI. The results affirm the proficiency of our algorithm in upholding the integrity of building edge details and textures within boxes 1, 3, and 4, thereby eliminating the undesirable edge blurring encountered with the CUT and QSA algorithms. Furthermore, block 2 showcases our algorithm's aptitude for extracting intricate vehicle textures and detailed information. However, it is important to acknowledge that the road generation was subject to the influence of the reference image's road gray level, leading to a road gray level that closely aligns with the reference example.

4.3. Quantitative Evaluation

We evaluated the quality of the translated images from two perspectives: naturalness with respect to real-world scenarios, and similarity to the original image. To assess translation fidelity, we calculated two commonly used image evaluation metrics, the average structural similarity index (SSIM [58]) and the peak signal-to-noise ratio (PSNR [59]), between the source image and the translated image. Higher SSIM and PSNR values indicate a closer resemblance between the transformed and source images. Naturalness with respect to real-world scenarios was quantitatively measured using the FID (Fréchet inception distance) [60], which measures the distance between two distributions: source images and generated images. The expression for the FID [60] is as follows:
$$FID(s, c) = \lVert \mu_c - \mu_s \rVert^2 + \mathrm{Tr}\left( \Sigma_s + \Sigma_c - 2\sqrt{\Sigma_s \Sigma_c} \right)$$

where $\mu_c$ and $\mu_s$ represent the average feature vectors obtained by inputting real images and translated images into the InceptionV3 model [61], and $\Sigma_s$ and $\Sigma_c$ represent the covariance matrices of real images and translated images, respectively. $\mathrm{Tr}$ denotes the matrix trace. A smaller FID indicates that the feature vectors are closer, resulting in more realistic translated images.
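A minimal implementation of this formula, given pre-extracted InceptionV3 feature vectors for the real and translated image sets, might look as follows (the feature extraction itself is omitted).

```python
import numpy as np
from scipy import linalg

# Minimal FID computation from the formula above, given InceptionV3 feature
# vectors (one row per image) for the real and translated image sets.
def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_s, mu_c = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_s = np.cov(real_feats, rowvar=False)
    sigma_c = np.cov(fake_feats, rowvar=False)
    # ||mu_c - mu_s||^2 + Tr(Sigma_s + Sigma_c - 2 (Sigma_s Sigma_c)^(1/2))
    covmean, _ = linalg.sqrtm(sigma_s @ sigma_c, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu_s - mu_c) ** 2) + np.trace(sigma_s + sigma_c - 2.0 * covmean))
```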
We compared the transfer results of our model with three typical image translation models, InfraGAN [62], CUT [25], and QSA [57]. Table 1 shows the quantitative evaluation metrics for each model on the test set, where our model outperformed the other models comprehensively. In the control group, InfraGAN [62] achieved the highest structural similarity (SSIM [58]) and FID [60] scores on the VEDAI [54] dataset but performed poorly on M3FD [55] and FLIR. This was likely because the pix2pix-based algorithm was more likely to obtain higher scores due to its symmetric matching mechanism and the smaller amount of data with a single scene in VEDAI [54]. On the other hand, M3FD [55] and FLIR have larger quantities of data and include multiple scenes and day-night shooting environments. The proximity of the near-infrared band to the visible light band resulted in similar geometric characteristics of the targets and backgrounds in both types of image. In contrast, the geometric features of the targets in the long-wave infrared images were influenced by the propagation of infrared radiation, and the imaging principles of long-wave infrared significantly differ from those of visible light, leading to discrepancies in their geometric attributes. Therefore, the asymmetric matching algorithms performed better on these datasets. For example, both the SSIM [58] and FID [60] metrics of QSA [57] were closer to our proposed model’s metrics.
Therefore, the RGB-IR model that supports complex scenarios should have better adaptability and generalization ability. As shown in Table 1, our model performed well on multiple datasets. It achieved the functionality of a unified model, which means it does not require specialized training for specific datasets and can also effectively translate RGB-IR.
Table 2 shows that, even though the target reference greyscales used in training were not entirely accurate, the unpaired-matching migration results remained accurate. The greyscale similarity of the targets (vehicles and people), as well as the FID [60] values, improved somewhat, which means that the local editing function improved the accuracy of the RGB-IR migration.

4.4. Ablation Experiments

In infrared (IR) images, specific targets such as people and vehicles exhibit more pronounced radiation measurements compared to the overall image. Conventional transfer methods often fail to accurately translate the gray values of these targets, particularly for distant vehicles and people, which tend to blend into the surrounding environment. Our translation method based on local enhancement incorporates a secondary gray correction by precisely locating the target. We therefore adopted the irradiance of the synchronized exemplar input image as the reference standard for the IR-style translation. As depicted in Figure 7, the greyscales of the special targets in the RGB image are adjusted to a certain extent based on the values of the input reference images, thereby enhancing conformity with the actual IR characteristics. When extracting and assigning the target irradiance, commonly referred to as a greyscale value, it should be noted that this dataset lacks calibration data and detailed radiometric temperature reference information.
In certain scenarios, specific infrared (IR) targets with high brightness, such as people and vehicles, may be occluded or appear in the background. Due to the unique characteristics of long-wave infrared, these targets should exhibit prominent, high-brightness infrared textures in the IR results. As described in Section 3.2, we employed instance segmentation to enhance the texture and brightness of these targets by searching the dataset for semantically similar reference images as exemplars. Figure 7 shows a person in red blocks 1 and 2, located within a forest. Due to the fusion of feature mappings during the transfer process, the radiance characteristics of this pedestrian were not accurately represented. However, by applying the enhancement mentioned above, the output results became closer to the ground truth.
Artflow [63] showed that content information becomes embedded in the style during iterative transformation training, leading to leakage in migration results. To address this issue of content leakage during migration, we adopted a decomposition technique for the texture representation that controls only the texture style while preserving the underlying structure. We also proposed using a transformer codebook with vector quantization as the image coding method for feature extraction and migration, enabling precise preservation of the RGB image content structure and IR image texture without content leakage, as illustrated in Figure 8. In Figure 8, we can observe that the shadows cast by moving vehicles on the ground should exhibit distinct radiation characteristics compared to those cast by stationary vehicles. Our framework struggled to distinguish this contrast from the existing training data, leading to uncertainties in the grayscale values of vehicle shadows in the generated results. However, these uncertainties had a relatively modest impact on the overall image quality.
We trained a Condition-VQGAN and a TransGAN variant to verify whether our algorithm could better translate and reconstruct infrared images. The former has the same VQ-GAN structure as our algorithm but converts the transferred style image into a mask and inserts it after encoding the content image for transformer training, while the latter abandons the VQ-GAN-based training in our method and directly uses a dual-path transformer encoder combined with a GAN to achieve image translation. We selected 8000 images from FLIR and M3FD for training and calculated the FID, PSNR, and SSIM of the test results to compare performance. As illustrated in Table 3, compared with Condition-VQGAN and TransGAN, our method produced fewer distortions in the translated images and achieved superior evaluation metrics.

5. Conclusions

This paper presents VQ-InfraTrans, a novel framework for translating RGB images into infrared images. Our approach trains a vector quantized codebook to capture the invariant content information of the input domains. This input domain information is transformed into a codebook, and the domain-specific distance information in the codebook can be captured by a transformer-based content/style migration encoder, which finally translates the RGB content sequences with reference into IR image sequences. We proposed a unified framework for multimodal transfer learning, with a specific focus on the intricate task of converting RGB images into long-wave and short-wave infrared images. Furthermore, our framework empowers the synthesis of cross-scene conversion between daytime and nighttime scenarios, presenting a significant advancement in the field. We proposed an innovative approach that leverages deep learning techniques to address the inherent differences in spectral characteristics between RGB and IR modalities. VQ-InfraTrans offers a quantitative and qualitative performance comparable to several baselines for the translation of RGB into IR images. Our work highlights the potential of our approach for various computer vision applications.

Author Contributions

Q.S.: Conceptualization, Methodology, Investigation, Writing—original draft, Software, Data curation. X.W.: Supervision, Formal analysis. X.Z.: Visualization, Investigation. C.Y.: Writing—review and editing, Resources. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original FLIR dataset is openly available and can be accessed at: https://www.flir.com/oem/adas/adas-dataset-form (accessed on 27 May 2023). The original M3FD dataset is openly available and can be accessed at: https://drive.google.com/drive/folders/1H-oO7bgRuVFYDcMGvxstT1nmy0WF_Y_6?usp=sharing (accessed on 27 July 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  2. Everingham, M.; Eslami, S.A.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 2015, 111, 98–136. [Google Scholar] [CrossRef]
  3. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings, Part V 13, Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  4. Chang, Y.; Luo, B. Bidirectional convolutional LSTM neural network for remote sensing image super-resolution. Remote Sens. 2019, 11, 2333. [Google Scholar] [CrossRef]
  5. Gu, J.; Sun, X.; Zhang, Y.; Fu, K.; Wang, L. Deep residual squeeze and excitation network for remote sensing image super-resolution. Remote Sens. 2019, 11, 1817. [Google Scholar] [CrossRef]
  6. Lu, T.; Wang, J.; Zhang, Y.; Wang, Z.; Jiang, J. Satellite image super-resolution via multi-scale residual deep neural network. Remote Sens. 2019, 11, 1588. [Google Scholar] [CrossRef]
  7. Haut, J.M.; Fernandez-Beltran, R.; Paoletti, M.E.; Plaza, J.; Plaza, A.; Pla, F. A new deep generative network for unsupervised remote sensing single-image super-resolution. IEEE Trans. Geosci. Remote Sens. 2018, 56, 6792–6810. [Google Scholar] [CrossRef]
  8. Lei, S.; Shi, Z.; Zou, Z. Coupled adversarial training for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2019, 58, 3633–3643. [Google Scholar] [CrossRef]
  9. Xiong, Y.; Guo, S.; Chen, J.; Deng, X.; Sun, L.; Zheng, X.; Xu, W. Improved SRGAN for remote sensing image super-resolution across locations and sensors. Remote Sens. 2020, 12, 1263. [Google Scholar] [CrossRef]
  10. Zhang, D.; Shao, J.; Li, X.; Shen, H.T. Remote sensing image super-resolution via mixed high-order attention network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5183–5196. [Google Scholar] [CrossRef]
  11. Salvetti, F.; Mazzia, V.; Khaliq, A.; Chiaberge, M. Multi-image super resolution of remotely sensed images using residual attention deep neural networks. Remote Sens. 2020, 12, 2207. [Google Scholar] [CrossRef]
  12. Zhang, S.; Yuan, Q.; Li, J.; Sun, J.; Zhang, X. Scene-adaptive remote sensing image super-resolution using a multiscale attention network. IEEE Trans. Geosci. Remote Sens. 2020, 58, 4764–4779. [Google Scholar] [CrossRef]
  13. Yang, S.; Sun, M.; Lou, X.; Yang, H.; Zhou, H. An unpaired thermal infrared image translation method using GMA-CycleGAN. Remote Sens. 2023, 15, 663. [Google Scholar] [CrossRef]
  14. Huang, S.; Jin, X.; Jiang, Q.; Liu, L. Deep learning for image colorization: Current and future prospects. Eng. Appl. Artif. Intell. 2022, 114, 105006. [Google Scholar] [CrossRef]
  15. Liang, W.; Ding, D.; Wei, G. An improved DualGAN for near-infrared image colorization. Infrared Phys. Technol. 2021, 116, 103764. [Google Scholar] [CrossRef]
  16. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  17. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  18. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  19. Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  20. Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  21. Zhang, L.; Gonzalez-Garcia, A.; Van De Weijer, J.; Danelljan, M.; Khan, F.S. Synthetic data generation for end-to-end thermal infrared tracking. IEEE Trans. Image Process. 2018, 28, 1837–1850. [Google Scholar] [CrossRef] [PubMed]
  22. Cui, Z.; Pan, J.; Zhang, S.; Xiao, L.; Yang, J. Intelligence Science and Big Data Engineering. Visual Data Engineering. In Proceedings, Part I, Proceedings of the 9th International Conference, IScIDE 2019, Nanjing, China, 17–20 October 2019; Springer Nature: Berlin/Heidelberg, Germany, 2019; Volume 11935. [Google Scholar]
  23. Lee, H.Y.; Tseng, H.Y.; Huang, J.B.; Singh, M.; Yang, M.H. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 35–51. [Google Scholar]
  24. Lee, H.Y.; Tseng, H.Y.; Mao, Q.; Huang, J.B.; Lu, Y.D.; Singh, M.; Yang, M.H. Drit++: Diverse image-to-image translation via disentangled representations. Int. J. Comput. Vis. 2020, 128, 2402–2417. [Google Scholar] [CrossRef]
  25. Park, T.; Efros, A.A.; Zhang, R.; Zhu, J.Y. Contrastive learning for unpaired image-to-image translation. In Proceedings, Part IX 16, Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 319–345. [Google Scholar]
  26. Mao, Q.; Lee, H.Y.; Tseng, H.Y.; Ma, S.; Yang, M.H. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1429–1437. [Google Scholar]
  27. Mao, Q.; Tseng, H.Y.; Lee, H.Y.; Huang, J.B.; Ma, S.; Yang, M.H. Continuous and diverse image-to-image translation via signed attribute vectors. Int. J. Comput. Vis. 2022, 130, 517–549. [Google Scholar] [CrossRef]
  28. Lee, H.Y.; Li, Y.H.; Lee, T.H.; Aslam, M.S. Progressively Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation. Sensors 2023, 23, 6858. [Google Scholar] [CrossRef] [PubMed]
  29. Kuang, X.; Zhu, J.; Sui, X.; Liu, Y.; Liu, C.; Chen, Q.; Gu, G. Thermal infrared colorization via conditional generative adversarial network. Infrared Phys. Technol. 2020, 107, 103338. [Google Scholar] [CrossRef]
  30. Kniaz, V.V.; Knyaz, V.A.; Hladuvka, J.; Kropatsch, W.G.; Mizginov, V. Thermalgan: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  32. Wang, H.; Cheng, C.; Zhang, X.; Sun, H. Towards high-quality thermal infrared image colorization via attention-based hierarchical network. Neurocomputing 2022, 501, 318–327. [Google Scholar] [CrossRef]
  33. Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510. [Google Scholar]
  34. Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; Yang, M.H. Universal style transfer via feature transforms. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  35. Sheng, L.; Lin, Z.; Shao, J.; Wang, X. Avatar-net: Multi-scale zero-shot style transfer by feature decoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8242–8250. [Google Scholar]
  36. Gu, S.; Chen, C.; Liao, J.; Yuan, L. Arbitrary style transfer with deep feature reshuffle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8222–8231. [Google Scholar]
  37. Jing, Y.; Liu, X.; Ding, Y.; Wang, X.; Ding, E.; Song, M.; Wen, S. Dynamic instance normalization for arbitrary style transfer. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 4369–4376. [Google Scholar]
  38. An, J.; Li, T.; Huang, H.; Shen, L.; Wang, X.; Tang, Y.; Ma, J.; Liu, W.; Luo, J. Real-time universal style transfer on high-resolution images via zero-channel pruning. arXiv 2020, arXiv:2006.09029. [Google Scholar]
  39. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
  40. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings, Part II 14, Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 694–711. [Google Scholar]
  41. Li, C.; Wand, M. Precomputed real-time texture synthesis with markovian generative adversarial networks. In Proceedings, Part II 14, Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 702–716. [Google Scholar]
  42. Chen, D.; Yuan, L.; Liao, J.; Yu, N.; Hua, G. Stylebank: An explicit representation for neural image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1897–1906. [Google Scholar]
  43. Dumoulin, V.; Shlens, J.; Kudlur, M. A learned representation for artistic style. arXiv 2016, arXiv:1610.07629. [Google Scholar]
  44. Lin, M.; Tang, F.; Dong, W.; Li, X.; Xu, C.; Ma, C. Distribution aligned multimodal and multi-domain image stylization. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–17. [Google Scholar] [CrossRef]
  45. Chen, H.; Wang, Z.; Zhang, H.; Zuo, Z.; Li, A.; Xing, W.; Lu, D. Artistic style transfer with internal-external learning and contrastive learning. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 26561–26573. [Google Scholar]
  46. Razavi, A.; Van den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  47. Han, L.; Ren, J.; Lee, H.Y.; Barbieri, F.; Olszewski, K.; Minaee, S.; Metaxas, D.; Tulyakov, S. Show me what and tell me how: Video synthesis via multimodal conditioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3615–3625. [Google Scholar]
  48. Van Den Oord, A.; Vinyals, O. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  49. Deng, Y.; Tang, F.; Dong, W.; Ma, C.; Pan, X.; Wang, L.; Xu, C. Stytr2: Image style transfer with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11326–11336. [Google Scholar]
  50. Jiang, Y.; Chang, S.; Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 14745–14758. [Google Scholar]
  51. Bruckner, S.; Gröller, M.E. Style transfer functions for illustrative volume rendering. In Proceedings of the Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2007; Volume 26, pp. 715–724. [Google Scholar]
  52. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  53. Huang, Z.; Zhao, N.; Liao, J. UniColor: A Unified Framework for Multi-Modal Colorization with Transformer. ACM Trans. Graph. 2022, 41, 1–16. [Google Scholar] [CrossRef]
  54. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  55. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  56. Wang, Y.; Tang, S.; Zhu, F.; Bai, L.; Zhao, R.; Qi, D.; Ouyang, W. Revisiting the Transferability of Supervised Pretraining: An MLP Perspective. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9173–9183. [Google Scholar]
  57. Hu, X.; Zhou, X.; Huang, Q.; Shi, Z.; Sun, L.; Li, Q. Qs-attn: Query-selected attention for contrastive learning in i2i translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18291–18300. [Google Scholar]
  58. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  59. Huynh-Thu, Q.; Ghanbari, M. Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 2008, 44, 800–801. [Google Scholar] [CrossRef]
  60. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
  61. Fardoulis, J.; Depreytere, X.; Gallien, P.; Djouhri, K.; Abdourhmane, B.; Sauvage, E. Proof: How Small Drones Can Find Buried Landmines in the Desert Using Airborne IR Thermography. J. Conv. Weapons Destr. 2020, 24, 15. [Google Scholar]
  62. Özkanoğlu, M.A.; Ozer, S. InfraGAN: A GAN architecture to transfer visible images to infrared domain. Pattern Recognit. Lett. 2022, 155, 69–76. [Google Scholar] [CrossRef]
  63. An, J.; Huang, S.; Song, Y.; Dou, D.; Liu, W.; Luo, J. Artflow: Unbiased image style transfer via reversible neural flows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 862–871. [Google Scholar]
Figure 2. VQ-InfraTrans pipeline. The framework translates an RGB image according to the target IR band or scene, blending RGB features with the radiation characteristics of local targets in IR images. A vector-quantized codebook serves as the intermediate representation for both RGB and IR images; a hybrid-transformer codec integrates the content and texture features of the vectorized codebooks, and the fused representation is decoded into the IR image. Our objective is to generate diverse IR results from RGB inputs by leveraging the vector-quantized codebook as an intermediate representation for (1) image-to-image translation between the RGB and IR domains and (2) unconditional generation in each domain; to achieve this, we propose two sub-networks that disentangle and quantize the VQ-GAN [48] representation from the continuous grayscale input, together with a hybrid transformer that learns unified condition-based translation. (3) A local target editing module adjusts the grayscale of local targets in the RGB image using the target's texture mask, making the local IR features more consistent with the actual radiation.
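For readers who want a concrete picture of the quantization step mentioned above, the snippet below is a minimal sketch of a VQ-GAN-style nearest-codebook lookup with a straight-through gradient estimator. The codebook size, embedding dimension, and module names are illustrative assumptions, not the exact configuration used by VQ-InfraTrans.

```python
# Minimal sketch of a VQ-GAN-style vector quantizer: encoder features are
# snapped to their nearest codebook entry, and a straight-through estimator
# lets gradients flow back to the encoder. Hyperparameters are illustrative.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size: int = 1024, embed_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)

    def forward(self, z: torch.Tensor):
        # z: (B, C, H, W) continuous encoder features
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)                  # (B*H*W, C)
        # Squared Euclidean distance from every feature to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2.0 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)                                     # discrete token ids
        z_q = self.codebook(idx).reshape(b, h, w, c).permute(0, 3, 1, 2)
        z_q = z + (z_q - z).detach()                                 # straight-through
        return z_q, idx.reshape(b, h, w)


vq = VectorQuantizer()
z_q, tokens = vq(torch.randn(1, 256, 16, 16))
print(z_q.shape, tokens.shape)   # torch.Size([1, 256, 16, 16]) torch.Size([1, 16, 16])
```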
Figure 3. Local target editing module: instance segmentation is applied to the input exemplar IR image and to the content RGB image; the exemplar IR features and the RGB content features produced by the encoder are concatenated; the concatenated features are then fed into the transformer encoder and the decoder.
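As a rough illustration of the feature-fusion step described in the caption, the sketch below concatenates exemplar IR tokens with RGB content tokens along the sequence axis and passes them through a standard transformer encoder. The token counts, embedding width, and layer depth are placeholder assumptions rather than the paper's settings.

```python
# Minimal sketch of fusing exemplar IR tokens with RGB content tokens via a
# standard transformer encoder; sizes and depth are placeholder assumptions.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

rgb_tokens = torch.randn(1, 16 * 16, d_model)  # content tokens from the RGB branch
ir_tokens = torch.randn(1, 16 * 16, d_model)   # exemplar tokens from the IR branch

tokens = torch.cat([ir_tokens, rgb_tokens], dim=1)  # concatenate along the sequence axis
fused = encoder(tokens)                             # fused features handed to the decoder
print(fused.shape)                                  # torch.Size([1, 512, 256])
```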
Figure 4. Qualitative evaluation of the proposed image-to-image translation method, showing diverse results on two unpaired datasets. Variations in the grayscale values of the sky, vehicles, and pedestrians can be observed across different wavelengths and day-night transitions due to radiation changes. Our model generates diverse results for each input image from FLIR and M3FD. The second column shows an example of daytime IR images generated in the LWIR band and the third column shows the corresponding nighttime IR results; the fourth column shows daytime IR images generated in the SWIR band and the fifth column shows the corresponding nighttime IR results.
Figure 5. Comparative results of multiple algorithms on sample image pairs from two LWIR datasets (FLIR and M3FD [55]). The temperature threshold windows for salient objects in different scenarios were derived from an exemplar reference IR image. Compared with other exemplar-based methods, our approach selectively inherits the grayscales of the example images with high confidence. As shown in the fifth column, our method accurately extracts the irradiance information of people, roads, and buildings from the reference IR image and consistently produces vivid and accurate results. For the FLIR dataset, our results demonstrate accurate extraction of the radiometric appearance of buildings and roads in blocks 1, 3, 5, 7, and 8; successful texture extraction of the reference targets is also observed in the grayscale representation of people in blocks 2, 4, and 6. The 3rd through 5th columns are the results of CUT [25], QSA [57], and our proposal. Both FLIR and M3FD [55] contain long-wave infrared scenes.
Figure 6. Comparative results of multiple algorithms on sample image pairs from the VEDAI [54] dataset (short-wave infrared). The first two columns are the RGB image and the IR reference; the remaining columns are the results of InfraGAN [6], CUT [25], QSA [57], and our proposal.
Figure 7. The presence of hidden or obscured targets with high radiation in most infrared scenes poses a significant challenge for RGB-IR mapping. We propose a local reinforcement method based on target matching to address this issue. The results obtained by modifying the input IR images demonstrate that VQ-InfraTrans generates high-quality IR images, particularly in its accurate depiction of people and vehicle targets. Owing to the local target editing module, our transformer-based framework produces consistent grayscale across target pixels of vehicles and pedestrians that share the same semantics (e.g., the vehicles in block 1 in the 1st row, and the pedestrians in blocks 2 and 3 in the 2nd and 3rd rows, within the red boxes).
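The snippet below sketches the general idea behind such local reinforcement: pixels under a target mask have their grayscale statistics pulled toward the exemplar's targets while the generator's texture elsewhere is preserved. The blending rule and the alpha parameter are illustrative assumptions, not the exact editing module of VQ-InfraTrans.

```python
# Rough sketch of mask-based local reinforcement: the masked target region is
# shifted toward the exemplar's mean grayscale and blended with it, keeping
# the generator's output untouched elsewhere. The rule and alpha are illustrative.
import numpy as np


def reinforce_targets(generated: np.ndarray, exemplar: np.ndarray,
                      mask: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """generated/exemplar: float IR images in [0, 1]; mask: boolean target mask."""
    out = generated.copy()
    if mask.any():
        shift = exemplar[mask].mean() - generated[mask].mean()
        out[mask] = np.clip(alpha * (generated[mask] + shift)
                            + (1.0 - alpha) * exemplar[mask], 0.0, 1.0)
    return out


rng = np.random.default_rng(0)
gen = rng.random((64, 64))                # stand-in for a generated IR image
ref = rng.random((64, 64))                # stand-in for an exemplar IR image
m = np.zeros((64, 64), dtype=bool)
m[20:40, 20:40] = True                    # stand-in target mask (e.g., a pedestrian)
edited = reinforce_targets(gen, ref, m)
```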
Figure 8. Comparison of image reconstruction results, showing the effect of the TransGAN model. The TransGAN model often fails to accurately translate the gray values of pronounced radiation targets; the object in the red box illustrates how the radiation texture changes during training.
Table 1. Comparison of the metrics of different algorithms on three datasets (FLIR, M3FD, and VEDAI). The best result for each dataset is underlined. An upward arrow indicates that larger values are better; a downward arrow indicates that smaller values are better.
FLIR            SSIM ↑    PSNR ↑    FID ↓
CUT             0.7707    33.360    71.962
QSA             0.7768    42.869    70.648
VQ-InfraTrans   0.7876    39.873    69.445

M3FD            SSIM ↑    PSNR ↑    FID ↓
CUT             0.7326    17.281    81.30
QSA             0.8077    37.193    74.55
VQ-InfraTrans   0.8258    40.539    71.23

VEDAI           SSIM ↑    PSNR ↑    FID ↓
InfraGAN        0.898     27.64     50.485
CUT             0.792     19.83     121.557
QSA             0.7815    17.10     116.659
VQ-InfraTrans   0.881     28.31     73.79
Table 2. Comparison of four different algorithms on FLIR for each class. Each entry shows two metrics, in the order SSIM/FID. The best result for each metric is underlined. An upward arrow indicates that larger values are better; a downward arrow indicates that smaller values are better.
FLIR            Vehicle (SSIM ↑/FID ↓)    People (SSIM ↑/FID ↓)
InfraGAN        0.66/222.575              0.65/123.033
CUT             0.72/87.44                0.76/84.78
QSA             0.74/80.09                0.75/83.17
VQ-InfraTrans   0.72/77.82                0.80/80.15
Table 3. Ablation study of VQ-InfraTrans: Condition-VQGAN uses a mixed transformer encoder, while TransGAN does not adopt a vector encoder. Each entry shows three metrics, in the order SSIM, PSNR, and FID. An upward arrow indicates that larger values are better; a downward arrow indicates that smaller values are better. The best result for each metric is underlined.
Method            SSIM ↑   PSNR ↑   FID ↓
Condition-VQGAN   0.771    34.8     70.6
TransGAN          0.735    37.2     80.09
VQ-InfraTrans     0.787    39.87    69.45
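The SSIM and PSNR entries in Tables 1–3 can be reproduced per image pair with standard implementations such as those in scikit-image; the sketch below assumes single-channel 8-bit IR images and uses random stand-in data. FID, by contrast, is a distribution-level score computed over the whole test set with an Inception feature extractor, so it is not shown per pair here.

```python
# Sketch of per-pair SSIM/PSNR evaluation with scikit-image; the inputs here
# are random stand-ins for a generated IR image and its reference.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate_pair(generated: np.ndarray, reference: np.ndarray):
    """Both inputs: single-channel uint8 IR images of identical size."""
    ssim = structural_similarity(reference, generated, data_range=255)
    psnr = peak_signal_noise_ratio(reference, generated, data_range=255)
    return ssim, psnr


rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
noise = rng.integers(-10, 10, size=ref.shape)
gen = np.clip(ref.astype(np.int16) + noise, 0, 255).astype(np.uint8)
print(evaluate_pair(gen, ref))
```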