V2T-GAN: Three-Level Refined Light-Weight GAN with Cascaded Guidance for Visible-to-Thermal Translation

Infrared image simulation is challenging because the underlying physics is complex to model. To estimate the corresponding infrared image directly from a visible light image, we propose a three-level refined light-weight generative adversarial network with cascaded guidance (V2T-GAN), which improves the accuracy of the simulated infrared image. V2T-GAN is guided by cascaded auxiliary tasks and auxiliary information: the first-level adversarial network uses semantic segmentation as an auxiliary task, focusing on the structural information of the infrared image; the second-level adversarial network uses the grayscale-inverted visible (GIV) image as an auxiliary task to supplement the texture details of the infrared image; and the third-level network obtains sharp and accurate edges by adding the edge image as auxiliary information and applying a displacement network. Experiments on the public Multispectral Pedestrian Dataset demonstrate that the structure and texture of the infrared images simulated by V2T-GAN are correct, and that our method outperforms state-of-the-art methods in both objective metrics and subjective visual quality.


Introduction
Infrared images are widely used in military, medical, industrial and agricultural fields, and are generally obtained by imaging target scenes with an infrared thermal imager. However, in some special environments the amount of image data that can be captured by a thermal imager is limited, and the equipment is expensive. These problems restrict the acquisition of infrared image data, which has motivated ongoing research on infrared image simulation.
Traditional infrared simulation approaches can be divided into two types: infrared image simulation based on three-dimensional modeling and infrared image simulation based on visible light images. The first type builds a three-dimensional model of the scene and then simulates it according to its infrared radiation characteristics [1][2][3], without requiring real visible light images. Its disadvantages are that the overall process is complicated and the texture of the simulated result is unnatural; furthermore, because each model targets a single scene, generalization is poor. The second type requires real visible light images and is simpler and more convenient, but it suffers from low simulation accuracy and poor generalization. In view of these problems, we aim to provide a more convenient, accurate and robust infrared simulation approach.
Infrared image simulation based on a visible light image is a pixel-level image conversion task, which predicts the corresponding infrared image from the visible light image. Recently, pixel-level image conversion based on deep learning has achieved great success, and such algorithms are relatively simple and convenient. Common pixel-level image conversion tasks include monocular depth estimation [4,5], semantic segmentation [6,7], optical flow estimation [8], image style transfer [9], etc.
To achieve the conversion from a visible light image to an infrared image, we propose a three-level refined light-weight GAN with cascaded guidance. As shown in Figure 1, V2T-GAN is a three-level cascaded network. The first-level network uses semantic segmentation images as an auxiliary task to guide G1 to learn infrared images with more accurate structural information; the second-level network uses grayscale-inverted visible (GIV) images as an auxiliary task to guide G2 to learn infrared images with more accurate detailed textures; and the third-level network uses visible light edge images as auxiliary information to further optimize the predicted infrared images. Gd predicts the displacement offset map of the second-level network's output image T2 in the x and y directions, and then resamples T2 according to this displacement information to obtain the final infrared image T3.

First-Level Network
The first-level network uses semantic segmentation images as an auxiliary task to guide the first-level target task generator to predict infrared images with more correct structural information. As shown in Figure 1, the blue part is the first-level network, including a target task generator G1, a discriminator D1 and an auxiliary task network Gs. G1 estimates the corresponding infrared image T1 from the visible light image, and D1 is responsible for distinguishing the predicted infrared image T1 from the target infrared image Ttrue. Gs then estimates the semantic segmentation image from T1; by predicting the segmentation image, it guides G1 to pay more attention to structural information, so that G1 predicts infrared images with more correct structure.
The network structure of G1 and Gs is the U-net generator [31] of pix2pix [9], and the network structure of D1 is the pix2pix discriminator. To reduce the overall number of network parameters, we set the initial output channel number of Gs to 4; that is, the number of channels of every feature map in the network is 1/16 of that of the original U-net. To improve the computational efficiency of the overall algorithm, we generally use lightweight convolutions in the network, such as GConv, DSConv, BSConv and the Ghost module. In V2T-GAN, G1 has the largest number of parameters, so making it lightweight has the greatest impact on the overall network. We therefore analyzed and compared different lightweight variants of G1 and finally adopted GConv: the standard convolutions of G1, Gs and D1 are all replaced by GConv with a group number of 4.
There have been many studies on lightweight convolutions; the methods applied in this paper are introduced below.
The specific implementation of GConv (group convolution) is divided into three steps:
• Divide the input channels into even, non-overlapping groups according to the group number g;
• Perform a standard convolution independently on each group;
• Concatenate the results of the group-wise convolutions along the channel dimension.
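The three steps above can be sketched with a grouped 1 × 1 (pointwise) convolution in numpy; the function names and the pointwise simplification are ours, not from the paper, but the split/convolve/concatenate structure and the g-fold weight saving are the same for any kernel size.

```python
import numpy as np

def grouped_pointwise_conv(x, weights):
    """Grouped 1x1 convolution: split -> per-group conv -> concat.

    x       : input feature map, shape (C_in, H, W)
    weights : list of g arrays, each of shape (C_out // g, C_in // g)
    """
    g = len(weights)
    groups = np.split(x, g, axis=0)                   # step 1: split channels
    outs = [np.einsum('oc,chw->ohw', w, xg)           # step 2: independent convs
            for w, xg in zip(weights, groups)]
    return np.concatenate(outs, axis=0)               # step 3: concat on channels

def conv_params(c_in, c_out, k, g=1):
    """Weight count of a k x k convolution with g groups (biases ignored)."""
    return (c_in // g) * (c_out // g) * k * k * g

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w = [rng.standard_normal((4, 2)) for _ in range(4)]   # g = 4, C_out = 16
y = grouped_pointwise_conv(x, w)
print(y.shape)                      # (16, 4, 4)
print(conv_params(64, 64, 3, g=4))  # one quarter of the standard conv's weights
```

With g = 4, as used for G1, Gs and D1, the weight count of each convolution drops to a quarter of that of the corresponding standard convolution.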
Depthwise convolution (DConv) [32] is a special case of GConv in which the number of groups and the number of output channels both equal the number of input channels. DSConv (depthwise separable convolution) consists of two steps:
• Perform DConv;
• Apply a standard convolution with a 1 × 1 kernel to adjust the number of output channels.
BSConv reverses this order and is also divided into two steps:
• First apply a standard convolution with a 1 × 1 kernel to adjust the number of output channels;
• Then perform DConv.
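The parameter savings of DSConv and BSConv follow directly from the two-step factorizations above. A small sketch (helper names are ours) counts the weights of each variant, ignoring biases:

```python
def standard_params(c_in, c_out, k):
    """Weights of one standard k x k convolution."""
    return c_in * c_out * k * k

def dsconv_params(c_in, c_out, k):
    """DConv (k x k, one filter per input channel) followed by a 1x1 conv."""
    return c_in * k * k + c_in * c_out

def bsconv_params(c_in, c_out, k):
    """1x1 conv to c_out channels first, then DConv on those channels."""
    return c_in * c_out + c_out * k * k

c_in, c_out, k = 64, 128, 3
print(standard_params(c_in, c_out, k))  # 73728
print(dsconv_params(c_in, c_out, k))    # 8768
print(bsconv_params(c_in, c_out, k))    # 9344
```

For this typical layer size, both factorizations use roughly an eighth of the weights of the standard convolution.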
The Ghost module is divided into three steps:
• Perform a standard convolution with a 1 × 1 kernel; the number of output channels in this step is C1 = [Cin/r], where C1 is the number of output channels of the first step, Cin is the number of input channels, and r is a manually set rate;
• Apply GConv to the output of the first step, with the number of groups equal to the number of output channels of the first step (that is, equal to C1); the number of output channels in this step is C1 × (r − 1);
• Concatenate the outputs of the first and second steps to obtain the final result.
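A minimal numpy sketch of the three Ghost module steps, under the simplifying assumption that the "cheap" per-channel operations of the second step are plain scalings rather than the depthwise convolutions used in practice (function names are ours):

```python
import numpy as np

def ghost_module(x, w_primary, cheap_scales, r):
    """Ghost module sketch: primary 1x1 conv + cheap per-channel ops + concat.

    x            : input feature map, shape (C_in, H, W)
    w_primary    : (C1, C_in) weights of the primary 1x1 convolution
    cheap_scales : (C1, r - 1) per-channel factors standing in for the
                   depthwise "cheap" convolutions of the real module
    """
    primary = np.einsum('oc,chw->ohw', w_primary, x)       # step 1: C1 maps
    cheap = [cheap_scales[:, j, None, None] * primary      # step 2: C1*(r-1) maps
             for j in range(r - 1)]
    return np.concatenate([primary] + cheap, axis=0)       # step 3: C1*r maps

rng = np.random.default_rng(0)
c_in, r = 16, 4
c1 = c_in // r
x = rng.standard_normal((c_in, 8, 8))
y = ghost_module(x, rng.standard_normal((c1, c_in)),
                 rng.standard_normal((c1, r - 1)), r)
print(y.shape)  # (16, 8, 8): C1 * r = 4 * 4 = 16 output channels
```

Only the C1 primary channels require a full convolution; the remaining C1 × (r − 1) channels are produced by cheap per-channel operations, which is where the saving comes from.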


Second-Level Network
The second-level network further optimizes the infrared image output by the first-level network, using GIV images as an auxiliary task to guide the second-level target task generator to predict infrared images with more accurate details and textures. The network structure is shown in the purple part of Figure 1, including a target task generator G2, a discriminator D2 and an auxiliary task network Gg. The input of G2 is the concatenation of the predicted infrared image T1 from the first-level network, the output feature map of the penultimate layer of G1 and the GIV image Ig; its output is the predicted infrared image T2. D2 has the same structure and function as D1 and is used to discriminate the predicted infrared image T2 from the target infrared image Ttrue. Gg estimates the GIV image from T2, guiding G2 to further optimize the predicted infrared image.
Under good lighting, visible light images contain more detailed texture information than infrared images. GIV images also have rich detailed texture and, compared with visible light images, are highly similar to infrared images both in terms of human visual perception and in objective metrics such as FID [33] and LPIPS [34]. Therefore, we adopt the GIV image as the auxiliary task of the second-level network to guide G2 to further optimize the detailed texture information of the predicted infrared image.

Target Task Generator G2
An illustration of the second-level target task generator G2 is depicted in Figure 2; it consists of an MFM module [28], four L-FMR modules with skip connections and a Ghost module. The MFM module is shown in Figure 3. To obtain information from multiple receptive fields, the input is passed through four dilated convolutions with a 3 × 3 kernel and dilation rates of 1, 2, 3 and 4, respectively. The outputs of the dilated convolutions with different dilation rates are added and fused, and the added results are then concatenated. The input and output of G2 are similar, so to strengthen the direct mapping between them, the input is passed through a pointwise convolution and added to the final concatenated result. The L-FMR module is improved from the FMRB [35], a network module for image deblurring; it has been verified that the FMRB can learn and restore detailed texture information. To reduce the overall number of network parameters, we replace all standard convolutions in the FMRB with a Ghost module with a rate of 4, yielding the L-FMR module in Figure 2.
Figure 2. The proposed network of G2.
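The four parallel dilated convolutions in the MFM module enlarge the receptive field without extra parameters: a 3 × 3 kernel with dilation d covers a (2d + 1) × (2d + 1) neighborhood, and padding equal to d keeps the spatial size unchanged at stride 1. A small sketch (helper names are ours):

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

def same_padding(k, d):
    """Padding that preserves spatial size (stride 1, odd effective kernel)."""
    return (effective_kernel(k, d) - 1) // 2

for d in (1, 2, 3, 4):
    print(d, effective_kernel(3, d), same_padding(3, d))
# dilation rates 1..4 cover 3x3, 5x5, 7x7 and 9x9 neighborhoods respectively
```

All four branches therefore keep the same output resolution and can be fused by addition and concatenation as described above.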


Second-Level Auxiliary Task Network Gg
To better guide G2 to learn detailed texture information and obtain a predicted infrared image T2 with rich texture, Gg only needs to attend to the details of T2; therefore, its receptive field should be small. The network structure of Gg is shown in Figure 4. The upper part is the overall structure of Gg, which contains four blocks; the numbers in the middle denote the channel counts. The lower part shows the specific structure of each block: three cascaded Ghost modules, where the number is the kernel size and the stride is 1. With this design, the overall receptive field of Gg is only 5 × 5, and its number of parameters is extremely small.
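The 5 × 5 receptive field of Gg can be checked with the standard receptive-field recurrence for stacked convolutions. The kernel combination below is an assumption for illustration (e.g. kernels of 3, 3 and 1 at stride 1 reproduce the stated 5 × 5); the actual sizes are given in Figure 4:

```python
def receptive_field(kernels, strides=None):
    """Receptive field of a stack of convolutions.

    The field grows by (k - 1) times the product of all preceding strides
    for each layer.
    """
    strides = strides or [1] * len(kernels)
    rf, jump = 1, 1
    for k, s in zip(kernels, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3, 3, 1]))  # 5: consistent with the 5 x 5 stated for Gg
print(receptive_field([3, 3, 3]))  # 7: three 3x3 layers would already exceed it
```

Keeping the receptive field this small forces Gg to judge only local texture, which is exactly its role as a detail-oriented auxiliary network.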

Third-Level Network
To further optimize the predicted infrared image and obtain clear edges, the third-level network adds the edge image of the visible light image as auxiliary information. At the same time, inspired by [36], we learn position offset information to obtain infrared images with sharper and more accurate edges. As shown in the green part of Figure 1, the third level consists of a single displacement network Gd. The input of Gd is the concatenation of the predicted infrared image T2 from the second-level network and the edge image of the visible light image; the output is a pair of position offset maps of the input image in the row and column directions. The input image T2 is then resampled according to the two offset maps to obtain the final predicted infrared image T3.



The overall network structure of Gd is shown in Figure 5a. It uses an encoder-decoder structure, and the encoder includes four down-sampling residual blocks (D-Res). The specific structure of D-Res is shown in Figure 5b: the input passes through two standard convolutions with 3 × 3 kernels and then through bilinear down-sampling, which halves the resolution of the feature map; finally, a skip connection consisting of a convolution with a 4 × 4 kernel and stride 2 is added to the down-sampled result. The decoder is symmetric to the encoder and includes four up-sampling residual blocks (U-Res). The specific structure of U-Res is shown in Figure 5c: the input passes through two standard convolutions with 3 × 3 kernels and then through bilinear up-sampling, which doubles the resolution of the feature map; finally, a skip connection consisting of a deconvolution with a 4 × 4 kernel and stride 2 is added to the up-sampled result.
According to the row-direction position offset map IRow and the column-direction position offset map ICol predicted by Gd, the second-level output T2 is resampled to obtain the final predicted infrared image T3. The resampling process is defined as Equation (1), where T3(x, y) and T2(x, y) denote the gray values of the third-level and second-level output images at position (x, y), and Row(x, y) and Col(x, y) denote the position offsets in the row and column directions:

T3(x, y) = T2(x + Row(x, y), y + Col(x, y))   (1)
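The resampling step can be sketched as follows; this is a nearest-neighbor variant with border clamping (function names are ours), whereas a trainable displacement network would normally use differentiable bilinear sampling:

```python
import numpy as np

def resample_with_offsets(t2, row_off, col_off):
    """Warp t2 by per-pixel offsets: T3(x, y) = T2(x + Row(x,y), y + Col(x,y)).

    Nearest-neighbor lookup with offsets rounded and clamped to the image
    border; a differentiable implementation would interpolate bilinearly.
    """
    h, w = t2.shape
    xs, ys = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    src_x = np.clip(np.rint(xs + row_off).astype(int), 0, h - 1)
    src_y = np.clip(np.rint(ys + col_off).astype(int), 0, w - 1)
    return t2[src_x, src_y]

t2 = np.arange(16, dtype=float).reshape(4, 4)
zero = np.zeros((4, 4))
assert np.array_equal(resample_with_offsets(t2, zero, zero), t2)  # zero offset = identity
shifted = resample_with_offsets(t2, zero, np.ones((4, 4)))        # constant column offset
print(shifted[0])  # first row becomes [1. 2. 3. 3.] (content shifted, border clamped)
```

A constant offset simply shifts the image; the learned offset maps instead apply small, spatially varying shifts that snap edges of T2 onto the visible-light edge map.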

Loss Function
The three-level network is jointly trained in an end-to-end manner. Gradient descent on the discriminators and generators is performed alternately: we first fix the parameters of D1 and D2 and train G1, Gs, G2, Gg and Gd, and then fix G1, Gs, G2, Gg and Gd and train D1 and D2. The overall loss function Lfinal follows a min-max training strategy and consists of LGAN, the sum of the adversarial losses, and Lpixel, the sum of the pixel-level losses. LGAN includes the first-level adversarial loss LGAN1 and the second-level adversarial loss LGAN2.

The first-level discriminator D1 distinguishes the synthetic image pair [Irgb, T1] from the real image pair [Irgb, Ttrue], using a cross-entropy adversarial loss; the second-level discriminator D2 likewise distinguishes the synthetic pair [Irgb, T2] from the real pair [Irgb, Ttrue]. The total pixel-level loss Lpixel includes the L1 losses LG1 and LGs of the first-level networks G1 and Gs; the L1 losses LG2 and LGg of the second-level networks G2 and Gg; the gradient loss Lg_G2, which is more sensitive to texture; and the L1 loss LGd computed after resampling. λ1 to λ6 are hyperparameters that weight each loss term. The target task networks G1, G2 and Gd receive the highest weights; the auxiliary task networks Gs and Gg receive lower weights; and the gradient loss, which increases the network's sensitivity to edges, receives the smallest weight. After experiments, we set λ1 to λ6 to 100, 5, 200, 10, 0.5 and 100, respectively.
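The loss expressions themselves were lost in extraction; the following is a hedged reconstruction consistent with the description above, i.e. the standard conditional-GAN min-max objective with cross-entropy adversarial terms and a weighted sum of pixel-level terms (the exact forms in the paper may differ):

```latex
% Hedged reconstruction of the omitted loss expressions (assumption, not
% copied from the paper), following the standard cGAN formulation:
\min_{G_1,G_s,G_2,G_g,G_d}\;\max_{D_1,D_2}\; L_{final} = L_{GAN} + L_{pixel},
\qquad L_{GAN} = L_{GAN1} + L_{GAN2}

L_{GAN1} = \mathbb{E}\big[\log D_1(I_{rgb}, T_{true})\big]
         + \mathbb{E}\big[\log\big(1 - D_1(I_{rgb}, T_1)\big)\big]

L_{pixel} = \lambda_1 L_{G_1} + \lambda_2 L_{G_s} + \lambda_3 L_{G_2}
          + \lambda_4 L_{G_g} + \lambda_5 L_{g\_G_2} + \lambda_6 L_{G_d}
```

LGAN2 takes the same form as LGAN1 with D2 and T2 in place of D1 and T1, and the weights λ1 to λ6 are the values 100, 5, 200, 10, 0.5 and 100 given above.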
The L1 loss function is the mean absolute error, expressed as LL1 = (1/N) Σi |yi − y*i|, where i is the pixel index, N is the total number of pixels in an image, and yi and y*i represent the real and predicted gray values at pixel i, respectively. The gradient loss function Lg_G2 applies the same mean-absolute-error form to horizontal image gradients, where ∇h ŷi and ∇h yi represent the gradient values in the horizontal direction at pixel i of the target infrared image Ttrue and the infrared simulation image, respectively.
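A small numpy sketch of the two pixel-level terms (function names are ours; the horizontal gradient is taken as a simple finite difference, which is one common choice):

```python
import numpy as np

def l1_loss(y_true, y_pred):
    """Mean absolute error over all pixels."""
    return np.mean(np.abs(y_true - y_pred))

def horizontal_gradient_loss(y_true, y_pred):
    """L1 distance between horizontal finite-difference gradients."""
    grad_true = np.diff(y_true, axis=1)   # difference of adjacent columns
    grad_pred = np.diff(y_pred, axis=1)
    return np.mean(np.abs(grad_true - grad_pred))

y_true = np.array([[0.0, 2.0, 4.0]])
y_pred = np.array([[1.0, 2.0, 3.0]])
print(l1_loss(y_true, y_pred))                   # (1 + 0 + 1) / 3
print(horizontal_gradient_loss(y_true, y_pred))  # gradients [2,2] vs [1,1] -> 1.0
```

The gradient term penalizes differences in local intensity changes rather than absolute values, which is why it is more sensitive to texture and edges than the plain L1 loss.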

Dataset
We performed the experiments on the Multispectral Pedestrian Dataset (MPD) [11], which consists of pairs of visible light images and corresponding infrared images with a resolution of 640 × 512. The training set and test set contain 50,187 and 45,141 image pairs, respectively. Both sets cover three scenes (campus, street and suburbs), and each scene contains images taken during the day and at night. We select the daytime image pairs as the network's training set, giving 33,399 image pairs. Correspondingly, we randomly select 565 daytime image pairs from the MPD test set as the network's test set. We resize the images to 256 × 256 by bilinear down-sampling.
Predicting the semantic segmentation image and the grayscale-inverted image of the visible light image are the auxiliary tasks of the network. The grayscale-inverted (GIV) image is obtained by converting the visible light image from color to grayscale and then inverting the gray values. Semantic segmentation images are predicted by feeding the visible light images into a RefineNet [37] model trained on Cityscapes, a large semantic segmentation dataset whose main scenes are outdoor streets, similar to MPD. The edge image of the visible light image is the auxiliary information of the third-level network; it is extracted with the Canny operator, with the lower and upper thresholds set to 60 and 120, respectively.
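The GIV image construction can be sketched in numpy; the BT.601 luma weights used for the color-to-gray conversion are an assumption, since the paper does not state which conversion it uses:

```python
import numpy as np

def giv_image(rgb):
    """Grayscale-inverted visible (GIV) image from an RGB array in [0, 255].

    Assumption: the ITU-R BT.601 luma weights are used for the
    RGB-to-gray conversion.
    """
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return 255.0 - gray

white = np.full((2, 2, 3), 255.0)
black = np.zeros((2, 2, 3))
print(giv_image(white)[0, 0])  # ~0: bright visible pixels become dark
print(giv_image(black)[0, 0])  # 255: dark visible pixels become bright
```

This inversion is what makes the GIV image resemble an infrared image: surfaces that are bright in visible light (e.g. sky) tend to appear dark in thermal imagery, and vice versa.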

Evaluation Metrics
In previous work on image domain conversion tasks, several recognized evaluation metrics measure the similarity between the network's predicted image and the real target image. We use the mean absolute relative error (Rel), mean log10 error (Log10), root mean squared error (Rms) and the accuracy index (δ < 1.25^i, i = 1, 2, 3). In the expressions of these metrics, i indexes the pixels, N is the total number of pixels in an infrared image, and yi and y*i represent the gray values of the target and predicted images at pixel i, respectively. We also employ pixel-level similarity metrics, namely the Structural Similarity (SSIM) and the Peak Signal-to-Noise Ratio (PSNR). As standard evaluation metrics for image deblurring and super-resolution, PSNR and SSIM better reflect the similarity between two images.
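A numpy implementation of the listed metrics, using the standard definitions from the depth-estimation literature (the paper's exact expressions were not preserved in extraction, so treat these formulas as assumptions):

```python
import numpy as np

def eval_metrics(y_true, y_pred, eps=1e-8):
    """Rel, Log10, Rms and threshold accuracies delta < 1.25**i.

    Standard depth-estimation-style definitions; eps guards against
    division by zero and log of zero on dark pixels.
    """
    y_true = y_true.astype(float).ravel()
    y_pred = y_pred.astype(float).ravel()
    rel = np.mean(np.abs(y_true - y_pred) / (y_true + eps))
    log10 = np.mean(np.abs(np.log10(y_true + eps) - np.log10(y_pred + eps)))
    rms = np.sqrt(np.mean((y_true - y_pred) ** 2))
    ratio = np.maximum((y_true + eps) / (y_pred + eps),
                       (y_pred + eps) / (y_true + eps))
    acc = {i: np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)}
    return rel, log10, rms, acc

y = np.array([[100.0, 200.0], [50.0, 150.0]])
rel, log10, rms, acc = eval_metrics(y, y)  # identical images: all errors vanish
print(rel, rms, acc[1])
```

Lower Rel, Log10 and Rms indicate a better prediction, while the δ accuracies should be as close to 1 as possible.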

Training Setup
Our method was implemented in PyTorch using one NVIDIA GeForce RTX 2080 Ti GPU with 16 GB of memory. We used a Gaussian distribution with a mean of 0 and a standard deviation of 0.2 for weight initialization. We minimized the loss function using the Adam optimizer with a momentum of 0.5 and an initial learning rate of 0.0001, and set the batch size to 4.

Results
This section compares our method with other state-of-the-art image domain conversion methods based on generative adversarial networks; the comparison results are shown in Table 1. Pix2pix [9] is a popular cGAN that realizes image-to-image conversion and is applicable to all image domain conversion tasks; its input is the conditional information. Here, the input of pix2pix is the visible light image and the output is the corresponding infrared image, without any auxiliary tasks or auxiliary information. X-Fork [38] is a GAN for cross-view image translation that requires a semantic segmentation auxiliary task. Selection-GAN [24] is also a cross-view translation GAN; its structure is a two-level GAN in which each level is guided by a semantic segmentation auxiliary task. SEAN [39] achieves image fusion and conversion; a style image must be added as auxiliary information when converting the input image to the target image. Here, the semantic segmentation image is used as its input, the GIV image as the style image, and the output is the predicted infrared image. LG-GAN [40] explores scene generation in local environments and considers the global and local context simultaneously, which effectively handles the generation of small objects and scene details. As Table 1 shows, compared with other advanced generative adversarial networks for image domain transformation, our algorithm achieves the best results on all objective evaluation metrics.
The visual comparison between our method and other advanced algorithms is shown in Figure 6. Our proposed V2T-GAN has the smallest number of network parameters, only 15.24 M, and the lowest RMS error. Its overall parameter count is about 73.35%, 73.68%, 73.83%, 94.29% and 14.03% lower than that of the Pix2pix, X-Fork, Selection-GAN, SEAN and LG-GAN algorithms, respectively.
Table 1. Comparison of the algorithms in objective metrics.


Ablation Study
To further analyze the details of the proposed approach, ablation experiments were conducted by investigating different configurations of the components of V2T-GAN.

Three-Level Network Structure
To verify the effectiveness of the three-level network structure, this section compares the experimental results of the one-level, two-level and three-level networks. The comparison results are shown in Table 2. We can observe the improvement brought by the three-level structure in this table, which outperforms the other structures in all metrics. The predicted infrared images of the one-level and two-level networks are shown in Figure 7. It can be seen that the results of the one-level network are relatively rough, while the results of the two-level network are more accurate in detail and more similar to the target image. For example, for the road signs selected in the first image, part of the structure is missing in the result of the one-level network, while the outline produced by the two-level network is more complete. The framed parts in the second image include road signs and branches. Comparing the two results, we can observe that the detailed texture information of the two-level network's results is relatively more accurate.

Figure 7. Comparison of the one-level and two-level networks' visualization results.

Figure 8 shows the infrared simulation images output by the two-level and three-level networks. We can see that the visualization results of the area selected by the yellow box are poor after the position offset optimization, mainly reflected in blurred image edges, unclear textures and many errors at the edges. This is because the position offset network adopts the CNN training method; that is, it learns to convert the image directly through a pixel-level loss function. Although the converted result performs better on pixel-level objective metrics, the subjective perception of the human eye is poor. From the perspective of the local image, it can be found from the cars selected in the second and fourth columns that the contour of the two-level network's conversion result is easier to identify and more similar to the target infrared image.
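The pixel-level training objective mentioned above is typically a mean absolute (L1) or mean squared error over pixels. A minimal sketch of the L1 variant (our illustration, not necessarily the paper's exact loss) shows why such an objective can score well while the output still looks blurred:

```python
import numpy as np

def l1_pixel_loss(pred, target):
    """Mean absolute per-pixel error. A low value does not imply sharp edges,
    which is why pixel-level objectives can favor blurry predictions."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(pred - target)))

# A uniformly offset (blurry-looking) prediction still yields a small loss.
print(l1_pixel_loss(np.full((4, 4), 0.5), np.full((4, 4), 0.6)))
```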


Auxiliary Task
Auxiliary tasks are added to the method in this paper to improve network performance. In this section, we compare the effects of the auxiliary tasks. The auxiliary task of the first-level network is semantic segmentation, and the auxiliary task of the second-level network is predicting the GIV image. The results of the specific ablation experiments are shown in Table 3: Structure 1 removes both auxiliary tasks, semantic segmentation and GIV images; that is, it removes the Gs and Gg networks. Structure 2 removes only the GIV image, which means there is no Gg network. Structure 3 removes only the semantic segmentation image; that is, there is no Gs network. Row 4 represents the complete V2T-GAN. It can be seen from Table 3 that our complete V2T-GAN, including both the semantic segmentation and GIV image auxiliary tasks, obtains the best experimental results. Its accuracy rate δ < 1.25 is 2.30%, 1.96% and 1.30% higher than Structures 1-3, respectively. Structure 1 has no auxiliary tasks and performs the worst. Structure 3 is better than Structure 2 in various objective metrics, indicating that the GIV auxiliary task has a greater effect than semantic segmentation.
Although the auxiliary task of semantic segmentation in Structure 2 enables the network to learn more correct structural information, the calculation of objective metrics cannot give extra weight to structural information. The GIV image auxiliary task in Structure 3 enables the network to obtain more detailed image information; even if there are some differences in structure, it can still ensure better metrics. This is also a limitation of objective metrics.
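The accuracy rate δ < 1.25 quoted above is the standard threshold accuracy from the depth-estimation literature: the fraction of pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. A minimal sketch (our implementation, assuming strictly positive pixel values):

```python
import numpy as np

def threshold_accuracy(pred, target, thresh=1.25):
    """Fraction of pixels with max(pred/target, target/pred) < thresh."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    ratio = np.maximum(pred / target, target / pred)
    return float(np.mean(ratio < thresh))

# Identical images satisfy the threshold at every pixel.
print(threshold_accuracy(np.ones((4, 4)), np.ones((4, 4))))  # 1.0
```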

Edge Auxiliary Information
In order to guide the third-level network to learn clearer and more accurate edge information, the edge image of the visible light image is added to the input of the third-level network as auxiliary information. We conducted an experimental analysis of the effectiveness of the edge-image auxiliary information, and the results are shown in Table 4. We found that adding visible light edge images as auxiliary information improves the objective metrics of the predicted infrared images, which means that V2T-GAN has indeed learned sharper edges from this auxiliary information.
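The visible-light edge image can be produced by any standard edge detector; the paper does not tie the approach to a particular one, so as a hedged illustration, a Sobel gradient-magnitude map in plain NumPy might look like this:

```python
import numpy as np

def sobel_edges(gray):
    """Gradient-magnitude edge map of a grayscale image via 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T  # vertical-gradient kernel
    h, w = gray.shape
    out = np.zeros((h, w))
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx = np.sum(patch * kx)  # horizontal gradient
            gy = np.sum(patch * ky)  # vertical gradient
            out[i, j] = np.hypot(gx, gy)
    return out

# A vertical step edge responds strongly along the boundary columns only.
img = np.zeros((5, 6))
img[:, 3:] = 1.0
edges = sobel_edges(img)
```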

Lightweight Convolution
To reduce the overall number of network parameters, we generally use lightweight convolution in the proposed network. The two sub-networks with the largest numbers of parameters in V2T-GAN are G1 and Gs. Therefore, this section compares different lightweight variants of G1 and Gs. The experimental results are shown in Table 5. BSConv, DSConv, GhostModule and GConv, respectively, represent the replacement of the standard convolutions in G1 and Gs with blueprint separable convolution, depthwise separable convolution, GhostModule and group convolution. In the experiments, the group numbers of group convolution and GhostModule were both set to 4. It can be seen from Table 5 that the overall network using BSConv has the fewest parameters, but the overall network using GConv performs best on the various objective metrics. The goal of making V2T-GAN lightweight is to reduce the overall number of network parameters, improve computational efficiency, alleviate network overfitting and improve conversion accuracy. To trade off accuracy and efficiency, we finally use GConv in our network.
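The parameter savings of these lightweight variants follow from simple counting: a standard k×k convolution has Cin·Cout·k² weights, group convolution divides this by the number of groups g, and depthwise separable convolution uses Cin·k² + Cin·Cout weights. A quick sketch (bias terms omitted; the 256-channel layer below is a hypothetical example, not a layer from the paper):

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def gconv_params(c_in, c_out, k, groups):
    """Group convolution: each group maps c_in/g channels to c_out/g channels."""
    return conv_params(c_in, c_out, k) // groups

def dsconv_params(c_in, c_out, k):
    """Depthwise separable: depthwise k x k conv plus 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out

# Hypothetical layer: 256 -> 256 channels, 3 x 3 kernel, 4 groups (g = 4 as in Table 5).
print(conv_params(256, 256, 3))      # 589824
print(gconv_params(256, 256, 3, 4))  # 147456
print(dsconv_params(256, 256, 3))    # 67840
```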

Conclusions
We propose a three-level refined lightweight GAN with cascaded guidance (V2T-GAN) to address the image domain conversion task from a visible light image to the corresponding infrared simulation image. In the three-level network, semantic segmentation images, GIV images and visible light edge images are used as input information for the auxiliary tasks. The experimental results on the MPD show that our method obtains much better results than the state of the art on the task of feature conversion from visible light to infrared images.
In the future, we would like to be able to convert from visible light to infrared images without having to create a one-to-one mapping between the training data, as well as apply the idea of the algorithm in this paper to other fields, such as the fusion of visible light and infrared images and the detailed enhancement of infrared images.