Article

DBSF-Net: Infrared Image Colorization Based on the Generative Adversarial Model with Dual-Branch Feature Extraction and Spatial-Frequency-Domain Discrimination

1 PLA Rocket Force University of Engineering, Xi’an 710025, China
2 Department of Automation, Tsinghua University, Beijing 100084, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(20), 3766; https://doi.org/10.3390/rs16203766
Submission received: 5 September 2024 / Revised: 30 September 2024 / Accepted: 1 October 2024 / Published: 10 October 2024

Abstract

Thermal infrared cameras can image stably in complex scenes such as night, rain, snow, and dense fog. However, humans are more sensitive to visual colors, so there is an urgent need to convert infrared images into color images in areas such as assisted driving. This paper studies a colorization method for infrared images based on a generative adversarial model. The proposed dual-branch feature extraction network ensures the stability of the content and structure of the generated visible light image, and the proposed discrimination strategy combining spatial- and frequency-domain constraints effectively alleviates the undersaturated coloring and loss of texture details in the edge areas of the generated visible light image. Comparative experiments on a public infrared–visible paired dataset show that the algorithm proposed in this paper achieves the best performance in maintaining the consistency of the content structure of the generated image, restoring the image color distribution, and restoring the image texture details.

1. Introduction

Visual cameras cannot produce clear images in low-visibility environments such as night, heavy rain, heavy snow, and heavy fog. However, infrared sensors have stable imaging capabilities in such complex scenes [1,2]. As shown in Figure 1, infrared cameras can work in harsh environments such as night and smoke and can achieve all-day, all-weather perception and monitoring of the environment, so they can serve as environmental perception devices for night driving assistance systems. In such applications, long-wavelength (7.5–13 µm) infrared cameras are more robust to the interference produced by headlights and traffic signals, and the human body radiates most strongly at a wavelength of about 9.3 µm within the long-wavelength infrared band [3], which supports the suitability of thermal cameras for capturing humans.
For the assisted driving system, infrared imaging is unaffected by low visibility environments such as nighttime and exhibits more stable perception performance. Unfortunately, infrared images have low contrast, low resolution, and no color information. According to the mechanism of human visual cognition, the human visual system can distinguish thousands of color tones and intensities but can only distinguish about twenty grayscale levels [4]. In addition, visual psychophysics experiments have suggested that, compared with pseudo-color images and grayscale images, color images can improve the speed and accuracy of the observer’s recognition of targets [5]. By using the infrared image colorization method to impart color information to the infrared image, human drivers can more clearly identify roads, obstacles, and other vehicles at night, enabling them to judge road conditions with higher accuracy while reducing the risk of accidents during nighttime driving, thereby improving the safety and visibility of nighttime driving. Therefore, the colorization of infrared images is of great significance for nighttime assisted driving.
Image colorization is an important research topic in the field of low-level computer vision, and its methods can be divided into two categories: traditional image colorization methods and deep learning-based image colorization methods. Traditional image colorization methods are semi-automatic colorization methods that require human interaction, mainly including methods based on local color extension and reference image color migration. Color extension requires the manual annotation of coloring lines in advance, while color migration requires the selection of reference color images in advance [6,7,8,9,10,11]. The reference image of the color migration method can be directly specified by the user or obtained from the Internet or from a large dataset [12,13,14,15]. Regions with similar colors or textures often have similar structures or lines, so the colorization of the image can be guided by identifying the similarities between the target image and the reference image [16,17,18,19]. The method based on reference image color migration does not require manual annotation, but relies on one or more color reference images to represent the color information. The color migration of the target infrared image is achieved by statistically analyzing the high-order statistics in the color reference image. The higher the texture and semantic information match between the reference image and the original infrared image, the better the color transfer effect. These traditional methods are not only complex in terms of their calculation process but also have a high calculation cost, making it difficult to achieve real-time colorization.
Deep neural networks have automated image colorization, effectively reducing the time required for the colorization process, and are increasingly favored by researchers [20,21,22,23]. Deep learning-based colorization automatically learns the mapping between data domains and continuously optimizes the model weights, significantly improving the generalization ability, colorization quality, and real-time performance of the algorithm. At the same time, automated colorization still has certain problems. When colorization is performed with convolutional neural networks (CNNs), the features extracted by the CNN structure are translation invariant and the background features and detail features are weighted equally [24,25,26,27]; the resulting understanding of image features is insufficient, which degrades network performance and causes problems such as blurred and disconnected coloring. To make the network focus on more prominent features, attention structures represented by the Transformer are increasingly being applied to colorization tasks; for example, Transformer-based approaches perform colorization through color upsampling [28]. However, during training, color errors continue to accumulate, resulting in decreased colorization accuracy and saturation.
The colorization structure based on the generative adversarial mechanism is still the most widely used approach with the best colorization effect and the most stable network structure. The image colorization method based on a generative adversarial structure [29,30,31] may cause changes in the image content and object structure due to insufficient constraints during the game between the generator and discriminator, and it may also result in edge blurring, line distortion, and other problems in the generated image. Therefore, in order to achieve a better infrared image colorization effect, this study improves and optimizes the colorization pipeline within the generative adversarial framework. The key contributions of this study can be summarized as follows: Firstly, this study proposes a dual-branch feature extraction network for the infrared image colorization method. The proposed method incorporates a dual-branch feature extraction structure in generative adversarial networks, extracting both basic and detailed features separately, effectively capturing long-range dependencies while preserving the original texture details as much as possible, thereby reducing edge blur and distortion in generated images. Through global and local saliency self-supervised contrastive losses, better consistency between the generated image and the original content and structure can be maintained. Secondly, this study proposes a colorization method that combines spatial-domain and frequency-domain discrimination strategies. The proposed method utilizes spatial discriminator constraints on the color distribution of the generated image, addressing the problems of color undersaturation and inaccurate coloring. Meanwhile, the frequency-domain discriminator extracts detailed features of different frequency bands, making the generated color images more realistic and natural. The experimental results suggest that the method outperforms the existing methods in restoring image color distribution and texture detail information.
This paper proposes a dual-branch feature extraction and spatial-frequency-domain discrimination network, named DBSF-Net. By combining dual-branch feature extraction with a discrimination strategy that uses both spatial-domain and frequency-domain information, DBSF-Net achieves high-quality infrared image colorization. DBSF-Net exhibits excellent performance on different infrared–visible light datasets, achieving better quantitative and qualitative results than existing state-of-the-art (SOTA) colorization algorithms. Meanwhile, by deploying the algorithm on embedded computing platforms, real-time infrared image colorization can be achieved in real scenarios, providing support for the real-time application of the algorithm in tasks such as autonomous driving.

2. Related Work

Deep learning-based image colorization methods can be mainly categorized, from a network perspective, into CNN-based methods, generative adversarial network (GAN)-based methods, and Transformer-based methods.

2.1. CNN-Based Methods

CNNs were among the first deep networks applied to the image colorization task. Cheng et al. first proposed a fully automatic colorization method based on CNN [24]. Although this approach is effective, it still requires a large amount of training data and its performance is prone to being limited by the training dataset. Iizuka et al. used a data-driven CNN method that combines long-range global and short-range local features to realize fully automatic colorization [25]. Larsson et al. reported a self-supervised automatic coloring method which realizes colorization by predicting pixel color histograms and extracts multi-scale features using convolutional deep networks [26]. Zhang et al. regarded colorization as a classification task to increase color diversity [27]. Zhang et al. proposed a real-time user-guided colorization architecture which is subdivided into local and global prompt networks [28]. Through adopting similarity sub-networks and coloring sub-networks, He et al. implemented different colorization styles based on reference images, calculated similarity, and refined colors [32]. Subsequently, Zhang et al. applied the above method to colorize grayscale videos, introducing a temporal consistency loss to reduce color fluctuations [33]. Dabas et al. used a VGG network for fully automatic image coloring [34]. Dong et al. acquired three-dimensional weight volumes based on convolutional networks, achieved high-precision colorization, and updated distorted information maps using Markov random field methods [35]. Pang et al. proposed the SFAC algorithm, which uses structural feature alignment to colorize old photos, eliminating big-data dependencies and ensuring that semantically related objects have similar colors [36].

2.2. GAN-Based Methods

The most commonly used method for image colorization is the GAN. Goodfellow et al. first proposed the GAN, which generates realistic color images through the adversarial training of generators and discriminators [29]. Later, Isola et al. proposed a conditional adversarial network for image transformation tasks [30]. Zhu et al. first proposed CycleGAN, which enables the self-supervised image cross-domain transformation of unpaired data [31]. Suárez et al. reported DCGAN for near-infrared image coloring [37]. By preserving the distances between samples to achieve good mapping performance, Benaim et al. proposed DistanceGAN [38]. Subsequently, Bansal et al. proposed RecycleGAN for unsupervised video colorization, overcoming the challenge of infrared video colorization [39]. Kniaz et al. proposed ThermalGAN for cross-modal color thermal infrared facial re-recognition [40]. Mehri et al. proposed a near-infrared image colorization method based on unpaired datasets [41]. Abbott et al. introduced adaptive CycleGAN for the conversion of long-wave infrared and visible light datasets [42]. Emami et al. proposed SPA-GAN, which guides the generator to focus on key image regions and generate more realistic images through discriminative salient feature scores [43]. Chen et al. proposed a GAN without independent encoders, which simplifies the structure and directly trains the encoder using the adversarial loss of multi-scale discriminators to improve training efficiency [44]. Park et al. utilized an image transformation framework based on contrastive learning to increase mutual information [45]. Han et al. proposed an unsupervised image transformation method based on the contrastive learning of dual-branch encoders, which exhibits excellent performance [46]. Huang et al. combined U-Net and skip connections to improve the cycle-consistent GAN for the automatic colorization of images and used a composite loss function to enhance the authenticity of color images [47]. Li et al. proposed an unsupervised infrared video conversion method that constrains the video conversion process through perceptual cycle loss and region similarity loss [48]. Yadav et al. proposed an efficient attention recursive GAN based on MobileNet for the conversion of resource-limited infrared to visible light [49]. Luo et al. designed PearlGAN, which includes an attention-guided module and a structured gradient alignment loss to promote edge consistency [50]. Yu et al. proposed the ROMA framework to convert unpaired nighttime infrared videos into fine-grained daytime visible light videos and designed a multi-scale region discriminator [51]. Guo et al. proposed structural consistency constraints to improve image structural consistency [52]. Lin et al. designed an image structure preservation method based on cycle consistency to alleviate the problem of insufficient supervised labels in source-domain images [53]. Bharti et al. proposed a multi-objective recurrent adversarial network with quantized evolutionary gradient perception, which uses evolutionary computation and multi-objective optimization methods to improve coloring efficiency [54].

2.3. Transformer-Based Structures

To learn key image features, researchers have incorporated attention mechanisms into neural networks to improve network performance. Transformer-based attention mechanisms have been widely adopted for image colorization tasks. Zhao et al. proposed a new cyclic switching GAN combining the Swin Transformer and convolutional layers to improve the colorization effect of infrared images [55]. Feng et al. proposed a multi-scale training structure and a progressive growth generator method to generate fine-grained style images through cross-attention mechanisms [56]. Guo et al. proposed multi-feature contrastive learning to improve discriminator performance and solve model collapse problems [57]. Liu et al. proposed the TCVC network to enhance the temporal consistency of video coloring using a self-normalization learning scheme that does not require training with ground truth color videos [58]. Liang et al. introduced color control and multi-mode colorization methods, utilizing pre-trained Stable Diffusion models to achieve accurate local color operations [59]. Wei et al. proposed a new infrared colorization algorithm that achieves cross-modal zero-shot learning through frequency-domain feature decoupling and reconstruction without requiring infrared datasets for training [60]. Kumar et al. proposed ColTran, a grayscale colorization architecture based on Transformer blocks, which produces diverse colorizations of grayscale images through an autoregressive colorizer, a color upsampler, and a spatial upsampler [61]. Kim et al. proposed the InstaFormer model, which integrates global and instance-level information [62]. The ColorFormer proposed by Ji et al. automatically colors through a mixed attention mechanism assisted by color memory [63]. Zheng et al. proposed an efficient image transformation structure which includes a hybrid perception block and a dual-pruning self-attention module [64]. Torbunov et al. utilized the Vision Transformer (ViT) to achieve a high correlation between the original image and the translated image [65]. Ma et al. proposed CFFT-GAN, which improves image conversion performance by decoupling and fusing features through cascaded CFFT modules [66]. Jiang et al. proposed the MATEBIT structure to learn cross-domain correspondence relationships and enhance feature acquisition for high-quality images [67]. Lee et al. proposed an efficient ViT architecture for real-time interactive coloring which realizes acceleration by pruning redundant image blocks and layers [28]. Chen et al. proposed a sample-based video colorization framework with long-term spatiotemporal dependencies which generates more diverse, realistic, and stable results [68].

3. Materials and Methods

3.1. Overall Framework of the Proposed Algorithm

This study proposes a colorization method for infrared images based on a dual-branch feature extraction network. Specifically, the dual-branch feature extraction structure is integrated into the GAN to extract both basic and detailed features, and global and local saliency self-supervised contrastive losses are established by adding saliency feature query modules to ensure the invariance of the generated image content and structure. This addresses a problem of existing GAN-based colorization methods, in which content of the original image is added or deleted and the original object structure is changed in the generated color image, and maintains the invariance of the generated content and structure. In addition, due to the limitations of the imaging mechanism of infrared cameras, the grayscale values of neighboring pixels are similar, so using pixel-level loss discrimination during colorization can result in problems such as undersaturation or inaccurate coloring in target edge regions. Moreover, the multiple convolution and sampling operations in deep networks can cause the loss of the original infrared image details and texture information in the generated image. In response to these problems, this study proposes a dual-discrimination strategy that combines the spatial domain and the frequency domain to achieve higher-quality colorization. The spatial discriminator primarily aims at making more precise judgments on the color distribution of the generated results and real samples to solve the problems of undersaturation and inaccurate coloring. The frequency-domain discriminator mainly addresses the problem of generated color images losing the texture details of the original infrared images: the original infrared images are subjected to multi-scale geometric analysis and transformation to obtain frequency-domain images, and by effectively extracting detailed features of different frequency bands, the texture details of the generated images are enhanced.
The schematic diagram of the overall structure of the colorization algorithm network proposed in this subsection is shown in Figure 2. The proposed network is based on a generative adversarial structure in which the generator introduces a long-short distance feature extraction module based on the Transformer and CNN, according to the U-Net encoding and decoding structure, and introduces a saliency feature query module to construct global and local self-supervised contrastive losses, thereby effectively constraining the content and structure of the generated images by the generator. Firstly, the original source-domain infrared image I x is input into the U-Net encoding and decoding structure to generate a three-channel color image G ( I x ) through continuous downsampling and deconvolution upsampling while calculating the adversarial loss Lg during this process. The basic module of the U-Net is chosen to be Darknet to maintain the efficiency of image generation, and single-channel I x is duplicated and expanded to 3 channels before it enters the network. By encoding I x and G ( I x ) separately, the features F x and F y of the source-domain X and target-domain Y are obtained. Then, the source-domain and target-domain features are input into the Transformer feature extraction network, and the multi-head attention mechanism in the Transformer is used to extract the long-distance low-frequency global features of images. Meanwhile, the source-domain and target-domain features are input into the CNN feature extraction network, and the local features are extracted, preserving the texture detail information of the original source-domain image as much as possible. Subsequently, the global saliency feature query module calculates its global self-supervised comparison loss with the input, while the local saliency feature query module calculates its local self-supervised comparison loss with the input. The saliency feature query module selects the anchor point of the saliency feature position from the target-domain features and then samples a corresponding positive and negative feature from the source-domain features to calculate the comparison loss of the anchor point, through which the model can use the mutual information among the corresponding features as much as possible. Finally, a dual-discrimination hybrid strategy using spatial-domain and frequency-domain information is used to discriminate the generated color images. The discrimination process is guided by a composite loss constraint of adversarial loss and saliency contrast loss until a consistent colorization effect of the generated content structure is achieved.
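To make the data flow described above concrete, the following minimal PyTorch sketch traces one forward pass under simplified assumptions: the generator and encoder are small stand-in modules, and all names and layer sizes are illustrative rather than the exact DBSF-Net architecture.

```python
# Minimal sketch of one forward pass through the pipeline of Figure 2.
# All modules below are simplified stand-ins (names and layer sizes are
# illustrative assumptions, not the exact architecture of DBSF-Net).
import torch
import torch.nn as nn

B, H, W = 1, 256, 256
I_x = torch.rand(B, 1, H, W)                 # source-domain infrared image

generator = nn.Sequential(                   # stand-in for the U-Net generator
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
encoder = nn.Sequential(                     # stand-in shared feature encoder
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())

I_x3 = I_x.repeat(1, 3, 1, 1)                # duplicate 1 channel -> 3 channels
G_Ix = generator(I_x3)                       # generated color image G(I_x)

F_x = encoder(I_x3)                          # source-domain features
F_y = encoder(G_Ix)                          # target-domain features
# F_x / F_y would then feed the Transformer (global) and CNN (local) branches,
# whose outputs drive the saliency-query contrastive losses, while G(I_x) is
# scored by the spatial- and frequency-domain discriminators.
print(G_Ix.shape, F_x.shape, F_y.shape)
```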

3.2. The Structure of Multi-Branch Feature Extraction Network

3.2.1. Lightweight-Transformer-Based Global Feature Extractor

For the features F x and F y encoded from the input original infrared image and the generated color image, a Transformer extraction network with spatial self-attention mechanism is used to extract their global features F x g and F y g , respectively. Specifically, the feature map can be segmented into fixed-size feature blocks, and the linear projections of these feature blocks and their corresponding image positions can be input into the Transformer encoder. In the encoder, a multi-head attention mechanism is used to calculate the attention score between each feature block and other feature blocks, thereby capturing global features. However, conventional Transformer modules require the extraction of global and local contexts, with a large model parameter scale and high computational complexity, which makes it difficult to deploy on mobile platforms. Therefore, the feature extraction network introduces the Lite Transformer (LT) [69] block as the basic unit of the global feature extraction network. Compared with the traditional Transformer structure, the overall computational complexity of the LT structure is reduced by half. While ensuring the same performance, the model is lighter and more in line with the computing power requirements of resource-constrained devices, such as vehicular platforms.
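As an illustration of the global branch, the following sketch splits a feature map into fixed-size blocks, projects them to tokens, and applies multi-head self-attention; the standard nn.MultiheadAttention module is used as a stand-in for the Lite Transformer block [69], and the block size, dimensions, and head count are assumptions.

```python
# Sketch of a Transformer-style global branch: the feature map is split into
# fixed-size blocks, linearly projected, and passed through multi-head
# self-attention. nn.MultiheadAttention stands in for the Lite Transformer
# block; patch size and dimensions are assumptions.
import torch
import torch.nn as nn

class GlobalFeatureExtractor(nn.Module):
    def __init__(self, in_ch=64, dim=128, patch=4, heads=4):
        super().__init__()
        # patch embedding: each patch x patch block of the feature map -> token
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, heads)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f):                        # f: (B, C, H, W)
        tok = self.embed(f)                      # (B, dim, H/p, W/p)
        B, D, Hp, Wp = tok.shape
        tok = tok.flatten(2).permute(2, 0, 1)    # (N, B, dim) token sequence
        out, _ = self.attn(tok, tok, tok)        # global self-attention
        out = self.norm(out + tok)               # residual + layer norm
        return out.permute(1, 2, 0).reshape(B, D, Hp, Wp)

f = torch.rand(1, 64, 64, 64)
print(GlobalFeatureExtractor()(f).shape)         # torch.Size([1, 128, 16, 16])
```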

3.2.2. Inverse Neural Network (INN)-Based Detail Feature Extractor

This module introduces a local feature extractor based on the CNN architecture which extracts the local detail features F_x^l and F_y^l from the features F_x and F_y, respectively, to better preserve the texture detail information of the original image and perform feature extraction as losslessly as possible. The more lightweight INN block is used as the basic unit of the detail feature extractor. The INN is built from coupling layers in which the activation results of each layer can be inferred in reverse from the results of the next layer. Therefore, during backpropagation, the INN only stores the results of the last layer of the network and the parameters of each intermediate layer, and the results of each intermediate layer can be recovered by inverse computation; there is no need to store the activation results of each layer directly. This significantly reduces memory usage.
The forward propagation process of INN is shown in Figure 3a. Each layer’s input is divided into two parts, u 1 and u 2 , and then transformed and alternately coupled by learning the functions F and P to obtain the output ( v 1 , v 2 ). The calculation process is shown in Equation (1).
$$ v_1 = u_1 + F(u_2), \qquad v_2 = u_2 + P(v_1) \tag{1} $$
The backpropagation process of INN is shown in Figure 3b, in which ( u 1 , u 2 ) can be directly recovered by reverse calculation. The reverse calculation and forward calculation use the same parameters, and the reverse calculation process is described by
$$ u_2 = v_2 - P(v_1), \qquad u_1 = v_1 - F(u_2) \tag{2} $$
where F and P are similar residual functions without requiring reversibility.
Therefore, the INN blocks must be connected consecutively, with no other network modules inserted between them, so that the reversible computation chain is not interrupted and no information is lost.
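A minimal sketch of the additive coupling block defined by Equations (1) and (2) is given below; the sub-networks F and P are small convolutional stacks here, which is an illustrative assumption since their exact form is not specified.

```python
# Minimal sketch of the reversible (INN) coupling block of Equations (1)-(2).
# F and P are small convolutional sub-networks; their exact form in DBSF-Net
# is not specified, so this is an illustrative assumption.
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        make = lambda: nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.F, self.P = make(), make()

    def forward(self, u1, u2):          # Eq. (1): forward coupling
        v1 = u1 + self.F(u2)
        v2 = u2 + self.P(v1)
        return v1, v2

    def inverse(self, v1, v2):          # Eq. (2): exact inverse, same weights
        u2 = v2 - self.P(v1)
        u1 = v1 - self.F(u2)
        return u1, u2

block = CouplingBlock(32)
u1, u2 = torch.rand(1, 32, 64, 64), torch.rand(1, 32, 64, 64)
v1, v2 = block(u1, u2)
r1, r2 = block.inverse(v1, v2)
print(torch.allclose(r1, u1, atol=1e-6), torch.allclose(r2, u2, atol=1e-6))
```

Because the inputs can be recovered exactly from the outputs, intermediate activations do not need to be stored, which is the memory saving described above.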

3.2.3. Saliency Query Module

The saliency query module is introduced to calculate the contrast loss. The learning process is constrained by global and local contrast losses to solve the problem that the original content and structure are changed during image colorization.
(1) Global self-supervised saliency query module
The global self-supervised saliency query module applies contrastive learning between the source-domain global feature F_x^g and the target-domain global feature F_y^g extracted by the feature extraction network. The overall structure of the global query module is shown in Figure 4. Firstly, the feature map F_x^g ∈ R^{h×w×c} (where h, w, and c denote the height, width, and channel count of the feature, respectively) is reshaped into the query matrix Q ∈ R^{hw×c}, and the key matrix K ∈ R^{c×hw} is obtained. Then, multiplying Q by K and applying a softmax yields the global saliency attention matrix B_g ∈ R^{hw×hw}. After that, the saliency of the corresponding features is measured by calculating the entropy H_g of each row in B_g, which is described as
$$ H_g(m) = -\sum_{n=1}^{hw} B_g(m,n) \log B_g(m,n), \tag{3} $$
where B_g(m,n) ∈ [0,1] is the output of the softmax layer, and m and n are the indices of Q and K, respectively, corresponding to the rows and columns of the global saliency matrix.
Entropy describes the degree of disorder in a vector. When one element of the m-th row of B_g is close to 1 and the rest are close to 0, the entropy H_g(m) approaches 0. Thus, a lower H_g(m) indicates a stronger saliency of the feature. The proposed algorithm needs to select the salient queries: the rows of the global saliency attention matrix B_g are sorted in ascending order of their entropy H_g, and the N rows with the lowest entropy are selected to construct a new matrix B_g^N ∈ R^{N×hw}.
Furthermore, based on the saliency attention matrix B_g^N, the salient feature positions are selected from F_y^g as anchor points q, and F_x^g is routed to obtain a positive feature value v^+ and (N−1) negative feature values v^−. Based on the obtained routing values and the temperature hyper-parameter τ, a global self-supervised contrastive loss L_c^g is constructed to ensure the consistency of source-domain feature relationships in the generated results. The calculation of L_c^g is expressed as
$$ L_c^g = -\log \frac{\exp(q \cdot v^{+} / \tau)}{\exp(q \cdot v^{+} / \tau) + \sum_{i=1}^{N-1} \exp(q \cdot v_i^{-} / \tau)} \tag{4} $$
(2) Local self-supervised saliency query module
Similarly, the local self-supervised saliency query module applies contrastive learning between the source-domain local feature F_x^l and the target-domain local feature F_y^l extracted by the feature extraction network. The local query module measures the similarity between queries and adjacent keys within a fixed-size (s × s) local area, which captures the spatial features of that area. Following the same scheme as the global attention module, F_x^l is reshaped as Q_l ∈ R^{hw×c} with a key matrix K_l ∈ R^{hw×s²×c}. These two matrices are multiplied and a softmax is applied to obtain the local attention matrix B_l ∈ R^{hw×s²}, and the local entropy of each row is calculated according to Equation (5):
$$ H_l(m) = -\sum_{n=1}^{s^2} B_l(m,n) \log B_l(m,n) \tag{5} $$
H_l is sorted in ascending order, the N rows with the smallest H_l are selected as the matrix B_l^N, and routing is then performed to obtain v_l^+, v_l^−, and q_l. The local saliency contrastive loss L_c^l is constructed as:
$$ L_c^l = -\log \frac{\exp(q_l \cdot v_l^{+} / \tau)}{\exp(q_l \cdot v_l^{+} / \tau) + \sum_{i=1}^{N-1} \exp(q_l \cdot v_{l,i}^{-} / \tau)} \tag{6} $$
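The following sketch illustrates the global saliency query and contrastive loss of Equations (3) and (4) under simplifying assumptions: the positive feature is taken at the same spatial index as the anchor and the negatives at the other selected indices, and the scaling of the attention logits is assumed.

```python
# Sketch of the global saliency query and contrastive loss (Eqs. (3)-(4)).
# The routing of positives/negatives is simplified: the positive shares the
# anchor's spatial index and the negatives come from the other selected
# indices, which is an assumption about the sampling scheme.
import torch
import torch.nn.functional as F

def global_saliency_contrast(F_x, F_y, N=64, tau=0.07):
    B, C, H, W = F_x.shape
    Q = F_y.flatten(2).transpose(1, 2)            # (B, HW, C) queries from target
    K = F_x.flatten(2)                            # (B, C, HW) keys from source
    Bg = torch.softmax(Q @ K / C**0.5, dim=-1)    # (B, HW, HW) saliency attention
    Hg = -(Bg * (Bg + 1e-8).log()).sum(-1)        # Eq. (3): row entropies
    idx = Hg.argsort(dim=-1)[:, :N]               # N most salient (lowest entropy)

    q = torch.gather(Q, 1, idx.unsqueeze(-1).expand(-1, -1, C))           # anchors
    v = torch.gather(K.transpose(1, 2), 1, idx.unsqueeze(-1).expand(-1, -1, C))
    q, v = F.normalize(q, dim=-1), F.normalize(v, dim=-1)
    logits = q @ v.transpose(1, 2) / tau          # (B, N, N) similarity matrix
    labels = torch.arange(N, device=q.device).expand(B, N)
    # Eq. (4): InfoNCE with the matching index as positive, the rest as negatives
    return F.cross_entropy(logits.reshape(B * N, N), labels.reshape(B * N))

loss = global_saliency_contrast(torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32))
print(loss.item())
```

The local variant would follow the same pattern with keys restricted to an s × s neighborhood around each query position.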

3.2.4. Discriminator Based on Spatial-Domain and Frequency-Domain Information

As shown in Figure 5, the source-domain infrared image I x and the target-domain real-color image I y are simultaneously input into a colorization generator to produce the colorized image G ( I x ) . Next, G ( I x ) and the real-color image I y are fed into a Markov (PatchGAN) spatial discriminator, which ensures that the generated image shares the same background and color distribution as the real image. The PatchGAN discriminator effectively addresses issues like inaccurate coloring and undersaturation and can effectively capture local spatial features. Furthermore, a frequency-domain discriminator based on Haar is used to extract frequency-domain features. G ( I x ) and the original infrared image I x are input into a frequency-domain discriminator, based on Haar wavelet transforms, to ensure the generated image preserves the texture details of the infrared image, thereby solving problems such as line distortion and blurred details.
(1) Spatial-domain discriminator
The structure of the spatial discriminator D1 is shown in Figure 6; it introduces learning-process constraints based on the Markovian discriminator. Specifically, this discriminator uses six strided convolutions with a kernel size of 5 and a stride of 2. These convolutional layers can effectively capture the statistical feature information of Markovian image patches. Firstly, the PatchGAN maps the input image to N × N patch matrices and then characterizes the local features of the image through patch-wise discrimination. Finally, the per-patch classification results are averaged to obtain an overall discrimination result. In this way, the discriminator can finely distinguish the local features of the image, thereby guiding the generator to produce more accurately and uniformly colored images.
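A minimal sketch of such a PatchGAN-style spatial discriminator, with six stride-2 convolutions of kernel size 5 as described above, is given below; the channel widths and the final averaging are assumptions.

```python
# Sketch of the spatial (PatchGAN) discriminator D1 described above: six
# stride-2 convolutions with kernel size 5; channel widths are assumptions.
# The final 1-channel map holds per-patch real/fake scores, which are averaged.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=5, stride=2, padding=2),
        nn.LeakyReLU(0.2, inplace=True))

class SpatialDiscriminator(nn.Module):
    def __init__(self, in_ch=3, widths=(32, 64, 128, 256, 256, 256)):
        super().__init__()
        layers, prev = [], in_ch
        for w in widths:                       # six stride-2 convolutions
            layers.append(conv_block(prev, w))
            prev = w
        layers.append(nn.Conv2d(prev, 1, kernel_size=3, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, img):                    # img: (B, 3, H, W)
        patch_scores = self.net(img)           # per-patch real/fake score map
        return patch_scores.mean(dim=(1, 2, 3))  # averaged discrimination result

d1 = SpatialDiscriminator()
print(d1(torch.rand(1, 3, 256, 256)).shape)    # torch.Size([1])
```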
(2) Frequency-domain discriminator
The frequency-domain discrimination strategy based on the Haar wavelet transform is adopted to retain sufficient texture and detail information of the original infrared image I_x. Firstly, the original infrared image I_x and the generated image G(I_x) are decomposed into a base sub-image and three high-frequency sub-images using the Haar transform. These sub-images are then concatenated and sent to the frequency-domain discriminator D_2 for discrimination. The structure of D_2 is shown in Figure 7. Grouped convolution is used in D_2 to extract features from the four sub-band images. Considering the different importance of each frequency component in the feature representation of infrared images, Squeeze-and-Excitation (SE) is used to weight the features of different frequency bands. Therefore, during the discrimination process of D_2, the texture details in the generated image are enhanced.
(a) The Haar wavelet transform
The equation of the wavelet transform can be expressed as
$$ WT(a,\tau) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} f(t)\, \psi\!\left(\frac{t-\tau}{a}\right) dt \tag{7} $$
where a denotes the scaling factor, τ denotes the translation of the mother wavelet, t denotes time, f(·) denotes the time-domain signal, and ψ(·) denotes the mother wavelet function. The mother wavelet of the Haar wavelet transform is expressed by
$$ \psi(t) = \begin{cases} 1, & 0 \le t < 1/2 \\ -1, & 1/2 \le t < 1 \\ 0, & \text{else} \end{cases} \tag{8} $$
The corresponding father (scaling) wavelet is expressed by
$$ \varphi(t) = \begin{cases} 1, & 0 \le t < 1 \\ 0, & \text{else} \end{cases} \tag{9} $$
In order to meet the requirement of orthogonality between the father wavelet and the mother wavelet, a filter is required, which can be described as
$$ h[n] = \begin{cases} 1/\sqrt{2}, & n = 0, 1 \\ 0, & \text{else} \end{cases} \tag{10} $$
Based upon the above Equation (10), we obtain
$$ \psi(x) = \varphi(2x) - \varphi(2x-1) \tag{11} $$
The Haar scaling (father) function captures the coarse-scale information of the signal, while the Haar wavelet function represents the detail information. The Haar wavelet has symmetry and only takes the values of 1 and −1, which makes it easy to compute.
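As an illustration, the following sketch performs the one-level 2D Haar decomposition into a base (LL) sub-image and three high-frequency sub-images (LH, HL, HH) and stacks them as the discriminator input; it assumes the PyWavelets package is available and uses a random array as a stand-in infrared image.

```python
# Sketch of the one-level 2D Haar decomposition used to build the input of the
# frequency-domain discriminator: one base (LL) sub-image and three
# high-frequency sub-images, concatenated channel-wise.
# Assumes the PyWavelets package (pywt) is available.
import numpy as np
import pywt
import torch

def haar_subbands(gray):                       # gray: (H, W) numpy array
    LL, (LH, HL, HH) = pywt.dwt2(gray, 'haar') # one-level Haar DWT
    bands = np.stack([LL, LH, HL, HH], axis=0) # (4, H/2, W/2)
    return torch.from_numpy(bands).float().unsqueeze(0)  # (1, 4, H/2, W/2)

I_x = np.random.rand(256, 256)                 # stand-in infrared image
x_freq = haar_subbands(I_x)
print(x_freq.shape)                            # torch.Size([1, 4, 128, 128])
```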
(b) Squeeze-and-Excitation module
The Squeeze-and-Excitation (SE) block [70] is an information feature construction module used in CNNs which recalibrates channel feature maps by modeling the dependency relationships between channels. The overall structure of the SE block is shown in Figure 8. Firstly, the input information X is mapped to the feature U through a transform F, which can be a convolution, a pooling operation, or another linear transformation, depending on the requirements and design of the network. Then, feature U passes through a squeeze layer, which performs channel-wise feature aggregation over the spatial dimensions (H × W), compressing each channel of feature U into a single descriptor value; the resulting vector is called the channel descriptor. The squeeze operation collapses the spatial dimensions of the feature map, thereby reducing computational and model complexity. Next, the channel descriptor passes through the excitation layer, which uses a gating mechanism to generate a weight for each channel from the descriptor and re-weights the feature map U accordingly, yielding the output tensor of the SE block. The output tensor of the SE block can be directly input into other networks or combined with other building blocks to construct more sophisticated network structures.
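A minimal sketch of such an SE block, with global average pooling as the squeeze and two fully connected layers with a sigmoid gate as the excitation, is shown below; the reduction ratio is an assumption.

```python
# Minimal sketch of the Squeeze-and-Excitation block [70] used to weight the
# four frequency sub-bands; the reduction ratio r is an assumption.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        hidden = max(channels // r, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, u):                       # u: (B, C, H, W)
        s = u.mean(dim=(2, 3))                  # squeeze: channel descriptor
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)  # excitation: channel weights
        return u * w                            # re-weighted feature map

u = torch.rand(1, 4, 128, 128)                  # e.g., the four Haar sub-bands
print(SEBlock(4)(u).shape)
```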

3.3. Loss Function

In order to effectively constrain the learning process of the colorization network, a composite loss function is designed to guide training, so that the generated colorized image maintains consistency with the original infrared image in terms of its content structure. The specific composite loss includes the adversarial loss of the generator, the global self-supervised contrastive loss, the local saliency contrastive loss, and the discriminator mean square error (MSE). The overall composite loss function is expressed as
$$ L_{Total} = L_g + \lambda_1 L_c^g + \lambda_2 L_c^l + \lambda_3 L_{MSE} \tag{12} $$
where λ 1 , λ 2 and λ 3 represent the hyper-parameter weights of the global saliency contrastive loss, local saliency contrastive loss, and the MSE loss function, respectively.
The adversarial loss generated by the generator can ensure the overall coloring effect of the colorization network and obtain a color image with true overall color. The contrast loss of global and local saliency is used to compare the global basis and local details, respectively, ensuring the consistency of the content and structure through the contrast loss while colorizing the background and detail information of the results. The generated adversarial loss is expressed as Equation (13):
$$ L_g = \mathbb{E}_{I_x \sim X}\big[\log\big(1 - D(G(I_x))\big)\big] + \mathbb{E}_{I_y \sim Y}\big[\log D(I_y)\big] \tag{13} $$
where D(·) denotes the discriminator operation, E stands for the expectation operator, G(·) denotes the generator operation, I_x denotes the raw input infrared data, and I_y denotes the raw input visible light data. The global self-supervised contrastive loss L_c^g and the local saliency contrastive loss L_c^l are described in Section 3.2.3. Finally, in order to constrain the generator to generate samples approximating the real samples, the MSE loss is expressed in Equation (14):
$$ L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \big(G(I_x^i) - I_y^i\big)^2 \tag{14} $$
where N denotes the number of samples, G(Ix) denotes the generated image, and Iy denotes the real-color image.
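The following sketch assembles the composite objective of Equation (12) from precomputed components; the adversarial term is the generator part of Equation (13), and the contrastive losses are assumed to be supplied by the saliency query modules.

```python
# Sketch of the composite training objective of Eq. (12): generator adversarial
# term from Eq. (13), global/local saliency contrastive losses, and MSE
# (Eq. (14)). Weights follow the paper's setting (lambda1 = lambda2 = 0.5,
# lambda3 = 1); the contrastive terms are passed in as precomputed tensors.
import torch

def generator_total_loss(d_fake, L_cg, L_cl, G_Ix, I_y,
                         lam1=0.5, lam2=0.5, lam3=1.0, eps=1e-8):
    L_g = torch.log(1.0 - d_fake + eps).mean()       # generator part of Eq. (13)
    L_mse = ((G_Ix - I_y) ** 2).mean()               # Eq. (14)
    return L_g + lam1 * L_cg + lam2 * L_cl + lam3 * L_mse   # Eq. (12)

d_fake = torch.sigmoid(torch.randn(4))               # stand-in D(G(I_x)) scores
G_Ix, I_y = torch.rand(4, 3, 256, 256), torch.rand(4, 3, 256, 256)
total = generator_total_loss(d_fake, torch.tensor(0.3), torch.tensor(0.2), G_Ix, I_y)
print(total.item())
```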

4. Experiment and Results

Comparative experiments were conducted on the proposed colorization algorithm and the selected comparison methods. Firstly, the datasets used in these experiments and the details of the experimental implementation are introduced. Then, the proposed method is qualitatively and quantitatively compared with other comparative methods in different experiments. In addition, ablation experiments are conducted to demonstrate the effectiveness of important components such as the saliency query module and the spatial-frequency domain information discriminator. Finally, considering the limited computing resources of the autopilot platform, the efficiency of different comparison methods is evaluated on the commonly used edge computing platform of NVIDIA DRIVE AGX Orin (ARM+GPU architecture) in assisted driving applications.

4.1. Implementation Details

Experimental environment:
This subsection conducts comparative experiments on the colorization algorithm studied and implemented in this work and the selected comparison methods. Training and testing are carried out on the laboratory's server, which integrates three 2080 Ti 11 GB graphics cards (manufactured by NVIDIA, Santa Clara, CA, USA), a system disk, and two storage disks. The operating system is Ubuntu 16.04, and the running environment includes deep learning libraries such as Python 3, Torch 1.6, Torchvision 0.7.0, and Dominate 2.4.
Experiment data: The datasets used for colorization network training include publicly available datasets (KAIST) [71]. The KAIST is a pedestrian dataset in which the infrared images are captured by FLIR-A35 at wavelengths of 7.5–13 µm. The dataset consists of 95,328 sets of images, each containing a pair of color and thermal infrared images, which use beam splitter-based hardware to physically align the two image domains, thus covering the same spatial locations. The intrinsic spatial resolutions of the visible RGB image and the infrared image are the same. This dataset provides various conventional traffic scenarios such as campuses, streets, pedestrians, etc., during day and night. The image size in the dataset is 640 × 480, and the dataset includes 12 sets of folder data. Figure 9 shows partially paired images from the KAIST dataset.
Experimental setup: Two different datasets are used, one from KAIST and the other a self-built dataset. In total, 4000 pairs of infrared and visible light data are randomly selected as the training set for each dataset, and 1000 pairs are randomly selected to construct the test set. All images are resized to 256 × 256 to ensure input consistency. During the training stage of the proposed algorithm, the number of epochs is set to 200 and the batch size to 1. To learn the model parameters effectively, the initial learning rate is set to 0.0002 and the weight model is saved every 5 epochs so that the model state can be saved and restored during training. The entire training and testing process is based on the PyTorch framework and uses a single computer with a single graphics card (1 GPU) to ensure the efficiency and accuracy of training. In Equation (12), the hyperparameters λ_1, λ_2, and λ_3 are set to 0.5, 0.5, and 1, respectively; these values were determined experimentally to be the optimal weights.
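The configuration sketch below collects the settings listed above (200 epochs, batch size 1, initial learning rate 0.0002, checkpoints every 5 epochs, and the loss weights of Equation (12)); the choice of the Adam optimizer and its momentum terms is an assumption not stated here.

```python
# Sketch of the training configuration described above. The Adam optimizer
# and its betas are assumptions; the generator is a trivial stand-in module.
import torch

config = dict(epochs=200, batch_size=1, lr=2e-4, image_size=256,
              save_every=5, lam1=0.5, lam2=0.5, lam3=1.0)

model = torch.nn.Conv2d(3, 3, 3, padding=1)        # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"],
                             betas=(0.5, 0.999))   # assumed optimizer settings

for epoch in range(1, config["epochs"] + 1):
    # ... one pass over the 4000 training pairs would run here ...
    if epoch % config["save_every"] == 0:
        torch.save(model.state_dict(), f"dbsf_net_epoch_{epoch:03d}.pth")
        break                                      # demo only: stop after the first save
```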
Evaluation criterion: The quantitative indicators used in this study include peak signal-to-noise ratio (PSNR), which evaluates the quality of the generated samples, and the structural similarity (SSIM), which evaluates the similarity between the generated samples and real samples, as well as the learning perceptual image patch similarity (LPIPS) and vision information fidelity (VIF), which are evaluation indicators based on human vision, thereby evaluating the quality of the output images of the generated model.
(a) PSNR: The PSNR is an indicator for measuring the quality of the generated images, and the higher the PSNR value, the better the quality of the generated image. The specific process of calculating the PSNR first requires obtaining the MSE of the image (as shown in Equation (15)) and performing logarithmic operations based on MSE to obtain the PSNR index. The PSNR unit is dB. Given the generated image Y and the reference sample X, the larger the PSNR calculation result, the better the algorithm’s coloring effect.
$$ \mathrm{MSE} = \frac{1}{HW} \sum_{i=0}^{H-1} \sum_{j=0}^{W-1} \big\| I_y(i,j) - I_x(i,j) \big\|_2^2, \tag{15} $$
The PSNR calculation is expressed by
$$ \mathrm{PSNR} = 10 \times \log_{10}\!\left(\frac{(2^c - 1)^2}{\mathrm{MSE}}\right), \tag{16} $$
where c denotes the number of image bits, H denotes the height of the image, and W denotes the width of the image.
(b) SSIM: The SSIM is an indicator used to evaluate the structural similarity between the generated samples and real samples. The SSIM mainly calculates similarity differences based on three parameters: the pixel values, contrast, and overall image structure between two compared images. The SSIM value is between [0, 1]; the larger the SSIM value, the more similar the generated sample is to the reference sample. The SSIM calculation can be expressed as Equation (17):
$$ \mathrm{SSIM}(I_x, I_y) = \frac{(2\mu_{I_x}\mu_{I_y} + c_1)(2\sigma_{I_x I_y} + c_2)}{(\mu_{I_x}^2 + \mu_{I_y}^2 + c_1)(\sigma_{I_x}^2 + \sigma_{I_y}^2 + c_2)}, \tag{17} $$
where μ denotes the sample mean, σ denotes the sample standard deviation, σ_{I_x I_y} denotes the sample covariance of I_x and I_y, and c_1 and c_2 are two constants.
(c) LPIPS [72]: Also known as “perception loss,” the LPIPS is more in line with human visual perception than the commonly used PSNR and SSIM indicators for image evaluation. The lower the LPIPS value, the more similar the two sample data are. The LPIPS can be calculated by
$$ D(I_y, I_x) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \big\| w_l \odot \big( \hat{y}^{l}_{hw} - \hat{x}^{l}_{hw} \big) \big\|_2^2, $$
where D denotes the distance between the generated sample and the reference sample. The feature stack of the l-th layer is extracted and unit-normalized along the channel dimension, the activations are scaled channel-wise by the learned weights w_l, and the L2 distance is computed; the result is then averaged over the spatial dimensions and summed over the layers.
(d) VIF [73]: The VIF is an indicator used to evaluate the visual quality and details of the original image in digital image processing which is based on natural scene statistical estimation and human visual distortion modeling. The value of the VIF index is obtained according to the calculation of mutual information, and the value of the VIF index is between [0, 1]. The larger the index value, the better the image quality.
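As an illustration, the PSNR and SSIM of Equations (15)–(17) can be computed with scikit-image as sketched below; LPIPS and VIF require additional packages and are omitted, and the random arrays stand in for generated and reference images.

```python
# Sketch of the PSNR and SSIM evaluation (Eqs. (15)-(17)) using scikit-image.
# LPIPS and VIF need extra packages (e.g., the "lpips" library) and are
# omitted; the random arrays are stand-ins for generated/reference images.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

I_y = np.random.rand(256, 256, 3)              # reference visible image
G_Ix = np.clip(I_y + 0.05 * np.random.randn(256, 256, 3), 0, 1)  # generated

psnr = peak_signal_noise_ratio(I_y, G_Ix, data_range=1.0)
ssim = structural_similarity(I_y, G_Ix, channel_axis=-1, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```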

4.2. Experimental Results

The colorization performance of the proposed algorithm and the comparison methods is verified and tested on the experimental data. The comparative experiments select five coloring algorithms: CycleGAN [31], MocycleGAN [74], ToDayGAN [75], CUT [45], and DECENT [46]. All methods are compared and validated under the same experimental conditions, trained on the same dataset, and tested for coloring effects in different scenarios.
For the selected KAIST experimental dataset, this subsection randomly selects data from various scenarios for comparative experimental training and testing. The colorization effect of the various algorithms on infrared images is evaluated through the evaluation indicators. The experimental comparison results are shown in Figure 10 and Figure 11, covering two scenarios: road and traffic conditions. Meanwhile, subjective and objective evaluations are conducted on the experimental results, and the evaluation results are listed in Table 1, Table 2 and Table 3, in which bold values indicate the optimal results. These experimental results are of great significance for evaluating the performance of the various algorithms in colorization conversion and help to better understand and compare the effects and characteristics of each algorithm.
(1) Experimental results and analysis based on KAIST “Traffic” scenario
Qualitative Evaluation: The colorization results for the “traffic” scenario in the KAIST dataset are depicted in Figure 10. From the perspective of overall image color, each method exhibits a color distribution similar to that of the original visible light image and, with the exception of the DECENT method, shows no significant coloration errors. Upon observing the road marking lines, in the colorization result of the DECENT method the position of the double yellow lines on the ground is noticeably incorrect, with the lines distorted and disconnected from the distant double yellow lines, and the colorization position of the white dividing lines on the ground is also noticeably incorrect. Near the position of the double yellow lines, the MocycleGAN and CUT methods also produce some extraneous content. Observing the ground shadow areas, the colorization results of CycleGAN, ToDayGAN, MocycleGAN, CUT, and the other methods have issues with missing color and unbalanced shape in the ground shadow areas. In the DECENT method’s colorization result, the color of the ground shadow area overflows beyond the original area, and the shape distribution is completely unbalanced. In contrast, the shape of the ground shadow area colorized by the method proposed in this section is relatively consistent with the original image, with clear colors and less noise. From the perspective of the building area, the reconstruction results of this method have clear edges and richer texture details and can reconstruct the overall scene of pedestrians and the supermarket entrance in the first-floor roadside area. The first-floor roadside building area reconstructed by CycleGAN, ToDayGAN, MocycleGAN, and the other methods is blurred, making it difficult to distinguish people and scenes. The scene reconstructed by the DECENT method in this area is very blurred, with colors overlapping with other areas, making it impossible to distinguish the scene. Observing the roadside cars, the reconstruction results of the CycleGAN, ToDayGAN, MocycleGAN, and DECENT methods show distorted car shapes and blurred car images, making them difficult to recognize, whereas the shape and color of the cars reconstructed by this method are roughly consistent with the original visible light image. Finally, observing the row of roadside trees, the tree areas in the reconstruction results of MocycleGAN and DECENT are blended together in color, losing the original shape and details of the trees. In summary, the method proposed in this section maintains consistency with the original image in content and structure, retains more texture and detail information, and does not exhibit large-scale color overflow or shadow errors.
Quantitative evaluation: By analyzing the comparative results of the various metrics in Table 1, it can be observed that the proposed method improves the PSNR metric by approximately 3% over the next best result, with the generated images having the smallest error relative to the reference images. In terms of the SSIM index, the proposed method improves on the next best result by about 6%, with the overall structure of the reconstruction results being closer to the original visible light images. Regarding the VIF index, the proposed method improves by about 1% compared to the next best result. The proposed method outperforms the other algorithms in terms of the PSNR, SSIM, and VIF metrics. In the LPIPS index, the difference between the proposed method and the optimal method is very small. In summary, the proposed method performs satisfactorily in both subjective and objective assessments.
(2) Experimental results and analysis based on KAIST “Road1” and “Road2” scenarios
Qualitative evaluation: As shown in Figure 11, from the colorization results of the “Road1” scene images in the KAIST dataset, each method maintains a color distribution that is roughly consistent with that of the original visible light images. However, upon observing the figures in the image, the reconstruction results of the CycleGAN, ToDayGAN, MocycleGAN, and DECENT methods exhibit inconsistencies in the content and structure of the figures compared with the original images. For instance, the figures reconstructed by MocycleGAN lost the lower leg content, while the CUT method and the proposed method essentially reconstructed the general shape of the figures, with the edges of the figures in the reconstruction results of the proposed method being clearer and more distinct. Observing the roadside steps, the distant steps in the reconstruction results of CycleGAN and ToDayGAN did not have the corresponding colors reconstructed. The steps in the MocycleGAN reconstruction results were blurred, losing edges and details. In the CUT method’s reconstruction results, line distortion occurred in the distant steps. In the reconstruction results of the proposed method, the steps were almost completely reconstructed, with the color distribution being essentially consistent with the original image. Observing the streetlight area in the image, the reconstruction results of the CycleGAN, MocycleGAN, CUT, and DECENT methods indicate that the distant streetlights are not prominent, with content loss, making it difficult to fully distinguish their shapes. However, in the reconstruction results of the proposed method, the distant streetlights are relatively more prominent. From the perspective of observing the ground marking lines, compared with other baseline algorithms, the reconstruction results of the proposed method show clearer edges of the marking lines, with relatively balanced line shapes, and the highest similarity to the original image. Observing the red shrub area on the left side of the image, the reconstruction results of CycleGAN, ToDayGAN, and MocycleGAN failed to reconstruct this part of the content. The colorization position of the infrared shrubs in the DECENT reconstruction results is incorrect, while both the CUT method and the proposed method can correctly reconstruct the position of the red shrubs.
Quantitative evaluation: In the analysis of the comparative results of various indicators in Table 2 and Table 3, for the “Road1” and “Road2” scenarios, the proposed method has achieved the best performance, generating images with the highest quality and lowest distortion, and the overall structure has higher similarity. Furthermore, judging from the aforementioned metrics, our generated images have colors that align more closely with human basic cognition compared with those from methods based on color lookup tables, and they exhibit superior textural characteristics.
(3) The efficiency of different comparison methods
Considering the load capacity and power supply of assisted driving vehicles, the performance of on-board computing platforms is limited. Thus, the efficiency of all comparison methods must be taken into consideration. The efficiency of all the methods is compared on the NVIDIA DRIVE AGX Orin, and the results are shown in Table 4.
Methods with fewer parameters (M denotes megabytes of memory) that still achieve shorter runtimes are considered highly efficient. The experimental results show that the proposed method has the smallest amount of parameters and computation and a higher inference speed compared with the other methods. In the physical experiments, only the image generation network module participates in the computation. This module is based on the Darknet architecture to ensure efficient image generation. With the acceleration of the TensorRT engine, it achieves an inference speed of over 15 frames per second on the in-vehicle intelligent inference terminal NVIDIA DRIVE AGX Orin. These results indicate that the proposed method is suitable for applications with strict power and speed constraints, such as assisted driving.

4.3. Ablation Experiment

To further validate the effectiveness of each module of the improved method, ablation experiments are conducted on the KAIST dataset. The baseline network is the QS-Attn network, which is based on a generative adversarial structure, also uses U-Net as the generator, and employs spatial-domain information (PatchGAN) for the discriminator. Based on the baseline network, this study introduces a local feature extraction network based on CNN and establishes a corresponding local saliency contrastive loss. Meanwhile, a global feature extraction network based on the Transformer is introduced and a global saliency contrastive loss is thus established. To verify the effectiveness of the local and global feature extraction networks, the ablation experiment is organized as follows: (1) the baseline QS-Attn model without the local and global feature extraction modules; (2) the baseline model with the local feature extraction module and the local saliency contrastive loss; (3) the baseline model with the global feature extraction module and the global saliency contrastive loss; (4) the proposed method. The results of the ablation experiment are shown in Figure 12. The objective evaluation indicators are listed in Table 5, in which the bold data indicate the optimal results.
Qualitative evaluation: The ablation comparison results for the improved modules in the “Road” scenario of the KAIST dataset are shown in Figure 12. Compared with the baseline method, adding either the local or the global feature extraction module improves the restoration of the road-surface text in the road-surface portion of the image, and the proposed method restores the road-surface text more completely still. Additionally, observing the zebra crossing area in the image, both the baseline method and the variant with only local feature extraction show large areas of missing zebra crossing color. In contrast, the reconstruction of the zebra crossing area by the variant with global feature extraction and by the proposed method is more complete, and the edge color distribution of the zebra crossing produced by the proposed method is also more accurate.
Quantitative evaluation: The comparative results of the indicators in Table 5 demonstrate that the proposed global feature extraction, local feature extraction, and spatial-frequency-domain discrimination have all improved the quality of image generation to varying degrees. Among them, the addition of a frequency-domain discriminator only leads to a slight decrease in the VIF metric. Overall, by using a dual-branch structure to selectively extract relevant features and calculate the corresponding saliency contrast loss, the learning process is constrained to maintain balance and consistency in the structure of the reconstructed image content. The dual-branch feature extraction and spatial-frequency-domain discrimination ensure the best quality and lowest distortion in the colorization results of infrared images, with the overall structure being closer to that of the original visible light images.

5. Conclusions

Using infrared image colorization methods to imbue infrared images with color information can reduce driving risks in harsh environments. This study mainly investigates the method of infrared image colorization, proposes and implements two innovative technologies, and verifies its reliability and practicability on the edge computing platform. Firstly, a dual-branch feature extraction network is proposed to address the issues of content consistency and structural preservation during the colorization of infrared images. This proposal combines GAN with Transformer and CNN dual-branch feature extractors to capture long-range features while retaining the original texture details. By introducing a global and local self-supervised saliency query module, the proposed method compares the losses between global and local self-supervision to ensure that the content and structure of the generated image are consistent with the original image, thereby improving the quality and stability of image colorization. Secondly, this study proposes a method that employs spatial- and frequency-domain discrimination strategies, addressing the issues of undersaturated and inaccurate coloring in image edge areas as well as the detail loss caused by deep networks during colorization tasks. The spatial discriminator refines color distribution, while the frequency-domain discriminator extracts key features and preserves original details through multi-scale geometric analysis transformation, making the generated color images more realistic and natural. Thirdly, an image colorization processing system has been established based on the Orin edge computing platform, dual-spectrum intelligent cameras, and monitors, completing hardware connections and testing, thereby verifying the feasibility and practicality of the proposed method.
Despite these findings, the proposed method can be further optimized. Although the dual-branch feature extraction network is effective, the use of a Transformer complicates the overall structure and increases the computational load, resulting in slower training and inference. Lightweight network design and knowledge distillation could reduce the model complexity and computational pressure while preserving effectiveness. In addition, although the combined spatial- and frequency-domain discrimination strategy improves coloring realism and detail preservation, object edges in the generated images are still unstable. Our future research will focus on adding loss functions to the frequency-domain discriminator to strengthen its constraint on the generator, thereby further improving image quality and stability (one candidate form of such a loss is sketched below). Furthermore, employing multi-band infrared images can enable a rough separation of different surface materials within a scene and thus allow better color restoration; this is work we intend to pursue in the next phase.
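As an illustration of the kind of additional frequency-domain loss mentioned above, the following sketch (an assumption for discussion, not part of the published method) penalizes the L1 distance between the log-amplitude spectra of the generated and reference visible images; the weight lambda_freq in the usage comment is hypothetical.

```python
# Candidate frequency-consistency loss for the generator (illustrative only).
import torch

def amplitude_spectrum_loss(fake_rgb: torch.Tensor, real_rgb: torch.Tensor) -> torch.Tensor:
    """L1 distance between log-amplitude spectra of generated and real images."""
    fake_amp = torch.log1p(torch.fft.fft2(fake_rgb, norm='ortho').abs())
    real_amp = torch.log1p(torch.fft.fft2(real_rgb, norm='ortho').abs())
    return torch.mean(torch.abs(fake_amp - real_amp))

# Possible usage inside the generator objective (lambda_freq is hypothetical):
# total_g_loss = adv_loss + lambda_freq * amplitude_spectrum_loss(fake, real)
```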

Author Contributions

Conceptualization, S.L., T.Z. and D.M.; methodology, S.L. and D.M.; software, S.L. and D.M.; validation, T.Z., Y.D. and S.L.; formal analysis, Y.D. and T.Z.; investigation, S.L., Y.D. and D.M.; resources, Y.D.; data curation, S.L., D.M., Y.X. and Y.D.; writing—original draft preparation, S.L., Y.D. and D.M.; writing—review and editing, S.L., Y.D. and D.M.; visualization, S.L., D.M. and Y.X.; supervision, Y.D. and T.Z.; project administration, S.L., Y.D. and Y.X.; funding acquisition, S.L. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by a grant from the National Natural Science Foundation of China (No. 62103432), a grant from the China Postdoctoral Science Foundation (No. 2022M721841), and the Young Talent Fund of the University Association for Science and Technology in Shaanxi, China (No. 2021108).

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Zhao, G.; Hu, Z.; Feng, S.; Wang, Z.; Wu, H. GLFuse: A Global and Local Four-Branch Feature Extraction Network for Infrared and Visible Image Fusion. Remote Sens. 2024, 16, 3246. [Google Scholar] [CrossRef]
  2. Gao, X.; Liu, S. BCMFIFuse: A Bilateral Cross-Modal Feature Interaction-Based Network for Infrared and Visible Image Fusion. Remote Sens. 2024, 16, 3136. [Google Scholar] [CrossRef]
  3. St-Laurent, L.; Maldague, X.; Prévost, D. Combination of colour and thermal sensors for enhanced object detection. In Proceedings of the 2007 10th International Conference on Information Fusion, Quebec, QC, Canada, 9–12 July 2007; pp. 1–8. [Google Scholar]
  4. Watson, J.D. 9—The Human Visual System. In Brain Mapping: The Systems; Toga, A.W., Mazziotta, J.C., Eds.; Academic Press: San Diego, CA, USA, 2000; pp. 263–289. [Google Scholar] [CrossRef]
  5. Luo, F.Y.; Liu, S.L.; Cao, Y.J.; Yang, K.F.; Xie, C.Y.; Liu, Y.; Li, Y.J. Nighttime Thermal Infrared Image Colorization with Feedback-Based Object Appearance Learning. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 4745–4761. [Google Scholar] [CrossRef]
  6. Yatziv, L.; Sapiro, G. Fast image and video colorization using chrominance blending. IEEE Trans. Image Process. 2006, 15, 1120–1129. [Google Scholar] [CrossRef]
  7. Qu, Y.; Wong, T.T.; Heng, P.A. Manga colorization. ACM Trans. Graph. 2006, 25, 1214–1220. [Google Scholar] [CrossRef]
  8. Luan, Q.; Wen, F.; Cohen-Or, D.; Liang, L.; Xu, Y.Q.; Shum, H.Y. Natural image colorization. In Proceedings of the 18th Eurographics Conference on Rendering Techniques, Goslar, Germany, 25–27 June 2007; EGSR’07. pp. 309–320. [Google Scholar]
  9. An, X.; Pellacini, F. AppProp: All-pairs appearance-space edit propagation. ACM Trans. Graph. 2008, 27, 1–9. [Google Scholar] [CrossRef]
  10. Fattal, R. Edge-avoiding wavelets and their applications. ACM Trans. Graph. 2009, 28, 22. [Google Scholar] [CrossRef]
  11. Xu, K.; Li, Y.; Ju, T.; Hu, S.M.; Liu, T.Q. Efficient affinity-based edit propagation using K-D tree. ACM Trans. Graph. 2009, 28, 1–6. [Google Scholar] [CrossRef]
  12. Ironi, R.; Cohen-Or, D.; Lischinski, D. Colorization by example. In Proceedings of the Eurographics Symposium on Rendering, Konstanz, Germany, 29 June–1 July 2005. [Google Scholar]
  13. Liu, X.; Wan, L.; Qu, Y.; Wong, T.T.; Lin, S.; Leung, C.S.; Heng, P.A. Intrinsic colorization. ACM Trans. Graph. 2008, 27, 152. [Google Scholar] [CrossRef]
  14. Morimoto, Y.; Taguchi, Y.; Naemura, T. Automatic colorization of grayscale images using multiple images on the web. In Proceedings of the SIGGRAPH 2009: Talks, New York, NY, USA, 3–7 August 2009. SIGGRAPH ’09. [Google Scholar] [CrossRef]
  15. Gupta, R.K.; Chia, A.Y.S.; Rajan, D.; Ng, E.S.; Zhiyong, H. Image colorization using similar images. In Proceedings of the 20th ACM International Conference on Multimedia, New York, NY, USA, 29 October–2 November 2012; MM ’12. pp. 369–378. [Google Scholar] [CrossRef]
  16. Bugeau, A.; Ta, V.T.; Papadakis, N. Variational Exemplar-Based Image Colorization. IEEE Trans. Image Process. 2014, 23, 298–307. [Google Scholar] [CrossRef]
  17. Li, B.; Lai, Y.K.; John, M.; Rosin, P.L. Automatic Example-Based Image Colorization Using Location-Aware Cross-Scale Matching. IEEE Trans. Image Process. 2019, 28, 4606–4619. [Google Scholar] [CrossRef] [PubMed]
  18. Fang, F.; Wang, T.; Zeng, T.; Zhang, G. A Superpixel-Based Variational Model for Image Colorization. IEEE Trans. Vis. Comput. Graph. 2020, 26, 2931–2943. [Google Scholar] [CrossRef]
  19. Wang, J.; Wang, X. VCells: Simple and Efficient Superpixels Using Edge-Weighted Centroidal Voronoi Tessellations. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1241–1247. [Google Scholar] [CrossRef]
  20. Yang, S.; Sun, M.; Lou, X.; Yang, H.; Liu, D. Nighttime Thermal Infrared Image Translation Integrating Visible Images. Remote Sens. 2024, 16, 666. [Google Scholar] [CrossRef]
  21. Yang, S.; Sun, M.; Lou, X.; Yang, H.; Zhou, H. An Unpaired Thermal Infrared Image Translation Method Using GMA-CycleGAN. Remote Sens. 2023, 15, 663. [Google Scholar] [CrossRef]
  22. Tan, D.; Liu, Y.; Li, G.; Yao, L.; Sun, S.; He, Y. Serial GANs: A Feature-Preserving Heterogeneous Remote Sensing Image Transformation Model. Remote Sens. 2021, 13, 3968. [Google Scholar] [CrossRef]
  23. Tang, R.; Liu, H.; Wei, J. Visualizing Near Infrared Hyperspectral Images with Generative Adversarial Networks. Remote Sens. 2020, 12, 3848. [Google Scholar] [CrossRef]
  24. Cheng, Z.; Yang, Q.; Sheng, B. Deep Colorization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; ICCV ’15. pp. 415–423. [Google Scholar] [CrossRef]
  25. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 2016, 35, 110. [Google Scholar] [CrossRef]
  26. Larsson, G.; Maire, M.; Shakhnarovich, G. Learning Representations for Automatic Colorization. arXiv 2017, arXiv:1603.06668. [Google Scholar]
  27. Zhang, R.; Isola, P.; Efros, A.A. Colorful Image Colorization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  28. Lee, G.; Shin, S.; Na, T.; Woo, S.S. Real-Time User-guided Adaptive Colorization with Vision Transformer. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 1–6 January 2024; pp. 473–482. [Google Scholar]
  29. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, Cambridge, MA, USA, 8–13 December 2014; NIPS’14. pp. 2672–2680. [Google Scholar]
  30. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar] [CrossRef]
  31. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
  32. He, M.; Chen, D.; Liao, J.; Sander, P.V.; Yuan, L. Deep exemplar-based colorization. ACM Trans. Graph. 2018, 37, 47. [Google Scholar] [CrossRef]
  33. Zhang, B.; He, M.; Liao, J.; Sander, P.V.; Yuan, L.; Bermak, A.; Chen, D. Deep Exemplar-Based Video Colorization. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8044–8053. [Google Scholar] [CrossRef]
  34. Dabas, C.; Jain, S.; Bansal, A.; Sharma, V. Implementation of image colorization with convolutional neural network. Int. J. Syst. Assur. Eng. Manag. 2020, 11, 625–634. [Google Scholar] [CrossRef]
  35. Dong, X.; Li, W.; Wang, X. Pyramid convolutional network for colorization in monochrome-color multi-lens camera system. Neurocomputing 2021, 450, 129–142. [Google Scholar] [CrossRef]
  36. Pang, Y.; Jin, X.; Fu, J.; Chen, Z. Structure-preserving feature alignment for old photo colorization. Pattern Recogn. 2024, 145, 109968. [Google Scholar] [CrossRef]
  37. Suárez, P.L.; Sappa, A.D.; Vintimilla, B.X. Colorizing Infrared Images Through a Triplet Conditional DCGAN Architecture. In Proceedings of the International Conference on Image Analysis and Processing, Catania, Italy, 11–15 September 2017. [Google Scholar]
  38. Benaim, S.; Wolf, L. One-sided unsupervised domain mapping. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NIPS’17. pp. 752–762. [Google Scholar]
  39. Bansal, A.; Ma, S.; Ramanan, D.; Sheikh, Y. Recycle-GAN: Unsupervised Video Retargeting. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018; pp. 122–138. [Google Scholar] [CrossRef]
  40. Kniaz, V.V.; Knyaz, V.A.; Hladůvka, J.; Kropatsch, W.G.; Mizginov, V. ThermalGAN: Multimodal Color-to-Thermal Image Translation for Person Re-identification in Multispectral Dataset. In Proceedings of the ECCV Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  41. Mehri, A.; Sappa, A.D. Colorizing Near Infrared Images through a Cyclic Adversarial Approach of Unpaired Samples. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 971–979. [Google Scholar] [CrossRef]
  42. Abbott, R.; Robertson, N.M.; del Rincón, J.M.; Connor, B. Unsupervised object detection via LWIR/RGB translation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 407–415. [Google Scholar]
  43. Emami, H.; Aliabadi, M.M.; Dong, M.; Chinnam, R.B. SPA-GAN: Spatial Attention GAN for Image-to-Image Translation. arXiv 2020, arXiv:1908.06616. [Google Scholar]
  44. Chen, R.; Huang, W.; Huang, B.; Sun, F.; Fang, B. Reusing Discriminators for Encoding: Towards Unsupervised Image-to-Image Translation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8165–8174. [Google Scholar] [CrossRef]
  45. Park, T.; Efros, A.A.; Zhang, R.; Zhu, J.Y. Contrastive Learning for Unpaired Image-to-Image Translation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  46. Han, J.; Shoeiby, M.; Petersson, L.; Armin, M.A. Dual Contrastive Learning for Unsupervised Image-to-Image Translation. [arXiv:cs.CV/2104.07689]. 2021. Available online: http://arxiv.org/abs/2104.07689 (accessed on 17 September 2024).
  47. Huang, S.; Jin, X.; Jiang, Q.; Li, J.; Lee, S.J.; Wang, P.; Yao, S. A fully-automatic image colorization scheme using improved CycleGAN with skip connections. Multimed. Tools Appl. 2021, 80, 26465–26492. [Google Scholar] [CrossRef]
  48. Li, S.; Han, B.; Yu, Z.; Liu, C.H.; Chen, K.; Wang, S. I2V-GAN: Unpaired Infrared-to-Visible Video Translation. In Proceedings of the 29th ACM International Conference on Multimedia, New York, NY, USA, 17 October 2021; MM ’21. pp. 3061–3069. [Google Scholar] [CrossRef]
  49. Yadav, N.K.; Singh, S.K.; Dubey, S.R. MobileAR-GAN: MobileNet-Based Efficient Attentive Recurrent Generative Adversarial Network for Infrared-to-Visual Transformations. IEEE Trans. Instrum. Meas. 2022, 71, 1–9. [Google Scholar] [CrossRef]
  50. Luo, F.; Li, Y.; Zeng, G.; Peng, P.; Wang, G.; Li, Y. Thermal Infrared Image Colorization for Nighttime Driving Scenes With Top-Down Guided Attention. IEEE Trans. Intell. Transp. Syst. 2021, 23, 15808–15823. [Google Scholar] [CrossRef]
  51. Yu, Z.; Chen, K.; Li, S.; Han, B.; Liu, C.H.; Wang, S. ROMA: Cross-Domain Region Similarity Matching for Unpaired Nighttime Infrared to Daytime Visible Video Translation. [arXiv:cs.CV/2204.12367]. 2022. Available online: http://arxiv.org/abs/2204.12367 (accessed on 17 September 2024).
  52. Guo, J.; Li, J.; Fu, H.; Gong, M.; Zhang, K.; Tao, D. Alleviating Semantics Distortion in Unsupervised Low-Level Image-to-Image Translation via Structure Consistency Constraint. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18228–18238. [Google Scholar] [CrossRef]
  53. Lin, Y.; Zhang, S.; Chen, T.; Lu, Y.; Li, G.; Shi, Y. Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation. In Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 10 October 2022; MM ’22. pp. 1186–1194. [Google Scholar] [CrossRef]
  54. Bharti, V.; Biswas, B.; Shukla, K.K. QEMCGAN: Quantized Evolutionary Gradient Aware Multiobjective Cyclic GAN for Medical Image Translation. IEEE J. Biomed. Health Inform. 2024, 28, 1240–1251. [Google Scholar] [CrossRef] [PubMed]
  55. Zhao, M.; Feng, G.; Tan, J.; Zhang, N.; Lu, X. CSTGAN: Cycle Swin Transformer GAN for Unpaired Infrared Image Colorization. In Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System, New York, NY, USA, 26–28 August 2022; CCRIS ’22. pp. 241–247. [Google Scholar] [CrossRef]
  56. Feng, L.; Geng, G.; Li, Q.; Jiang, Y.H.; Li, Z.; Li, K. CRPGAN: Learning image-to-image translation of two unpaired images by cross-attention mechanism and parallelization strategy. PLoS ONE 2023, 18, e0280073. [Google Scholar] [CrossRef]
  57. Gou, Y.; Li, M.; Song, Y.; He, Y.; Wang, L. Multi-feature contrastive learning for unpaired image-to-image translation. Complex Intell. Syst. 2022, 9, 4111–4122. [Google Scholar] [CrossRef]
  58. Liu, Y.; Zhao, H.; Chan, K.C.K.; Wang, X.; Loy, C.C.; Qiao, Y.; Dong, C. Temporally consistent video colorization with deep feature propagation and self-regularization learning. Comput. Vis. Media 2021, 10, 375–395. [Google Scholar] [CrossRef]
  59. Liang, Z.; Li, Z.; Zhou, S.; Li, C.; Loy, C.C. Control Color: Multimodal Diffusion-based Interactive Image Colorization. arXiv 2024, arXiv:2402.10855. [Google Scholar]
  60. Wei, C.; Chen, H.; Bai, L.; Han, J.; Chen, X. Infrared colorization with cross-modality zero-shot learning. Neurocomputing 2024, 579, 127449. [Google Scholar] [CrossRef]
  61. Kumar, M.; Weissenborn, D.; Kalchbrenner, N. Colorization Transformer. arXiv 2021, arXiv:2102.04432. [Google Scholar]
  62. Kim, S.; Baek, J.; Park, J.; Kim, G.; Kim, S. InstaFormer: Instance-Aware Image-to-Image Translation with Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18300–18310. [Google Scholar] [CrossRef]
  63. Ji, X.; Jiang, B.; Luo, D.; Tao, G.; Chu, W.; Xie, Z.; Wang, C.; Tai, Y. ColorFormer: Image Colorization via Color Memory Assisted Hybrid-Attention Transformer; Springer: Cham, Switzerland, 2022. [Google Scholar]
  64. Zheng, W.; Li, Q.; Zhang, G.; Wan, P.; Wang, Z. ITTR: Unpaired Image-to-Image Translation with Transformers. arXiv 2022, arXiv:2203.16015. [Google Scholar]
  65. Torbunov, D.; Huang, Y.; Yu, H.; zhi Huang, J.; Yoo, S.; Lin, M.; Viren, B.; Ren, Y. UVCGAN: UNet Vision Transformer cycle-consistent GAN for unpaired image-to-image translation. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 702–712. [Google Scholar]
  66. Ma, T.; Li, B.; Liu, W.; Hua, M.; Dong, J.; Tan, T. CFFT-GAN: Cross-domain Feature Fusion Transformer for Exemplar-based Image Translation. arXiv 2023, arXiv:2302.01608. [Google Scholar] [CrossRef]
  67. Jiang, C.; Gao, F.; Ma, B.; Lin, Y.; Wang, N.; Xu, G. Masked and Adaptive Transformer for Exemplar Based Image Translation. [arXiv:cs.CV/2303.17123]. 2023. Available online: http://arxiv.org/abs/2303.17123 (accessed on 17 September 2024).
  68. Chen, S.Y.; Li, X.; Zhang, X.; Wang, M.; Zhang, Y.; Han, J.; Zhang, Y. Exemplar-based Video Colorization with Long-term Spatiotemporal Dependency. Knowl. Based Syst. 2023, 284, 111240. [Google Scholar] [CrossRef]
  69. Wu, Z.; Liu, Z.; Lin, J.; Lin, Y.; Han, S. Lite Transformer with Long-Short Range Attention. In Proceedings of the International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
  70. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  71. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar] [CrossRef]
  72. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  73. Sheikh, H.; Bovik, A.; de Veciana, G. An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Trans. Image Process. 2005, 14, 2117–2128. [Google Scholar] [CrossRef] [PubMed]
  74. Chen, Y.; Pan, Y.; Yao, T.; Tian, X.; Mei, T. Mocycle-GAN: Unpaired Video-to-Video Translation. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019. [Google Scholar]
  75. Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Gool, L.V. Night-to-Day Image Translation for Retrieval-Based Localization. arXiv 2019, arXiv:1809.09767. [Google Scholar]
Figure 1. Comparison of visible light camera and infrared camera imaging in complex traffic environments.
Figure 2. The schematic diagram of the overall structure of the proposed colorization algorithm.
Figure 3. Calculation process of INN: (a) forward calculation process and (b) reverse calculation process.
Figure 4. Global self-supervised saliency query module.
Figure 5. The overall structure of spatial-frequency-domain discriminator.
Figure 6. Structure of spatial-domain discriminator.
Figure 7. Structure of frequency-domain discriminator.
Figure 8. Schematic diagram of the Squeeze-and-Excitation block.
Figure 9. Partial infrared and visible light data from KAIST.
Figure 10. Qualitative colorization results of the traffic scene of KAIST dataset.
Figure 11. Qualitative colorization results of “Road1” scene in KAIST dataset.
Figure 12. Qualitative colorization effect of infrared images using different modules.
Table 1. KAIST “Traffic” scenario indicator quantitative results.

Indicator   CycleGAN   ToDayGAN   MocycleGAN   CUT       DECENT    Ours
PSNR ↑      13.3907    10.6866    13.9347      14.2114   15.2230   15.7025
SSIM ↑      0.3091     0.2952     0.4075       0.3659    0.3035    0.4038
LPIPS ↓     0.4109     0.5542     0.4647       0.3528    0.3831    0.3637
VIF ↑       0.7827     0.8057     0.8040       0.8026    0.8068    0.8162

The bolded data indicate the best results.
Table 2. Indicator result of “Road1” scene in KAIST.

Indicator   CycleGAN   ToDayGAN   MocycleGAN   CUT       DECENT    Ours
PSNR ↑      14.1729    10.3912    10.0528      12.3128   17.5589   17.6733
SSIM ↑      0.4584     0.2694     0.2245       0.3153    0.5026    0.5697
LPIPS ↓     0.4113     0.5733     0.5986       0.4287    0.4952    0.3868
VIF ↑       0.7907     0.7788     0.7934       0.7864    0.8063    0.8471
Table 3. Indicator result of “Road2” scene in KAIST.

Indicator   CycleGAN   ToDayGAN   MocycleGAN   CUT       DECENT    Ours
PSNR ↑      13.6457    10.5213    11.5467      12.5498   16.8880   17.4392
SSIM ↑      0.4408     0.2441     0.3059       0.3102    0.5798    0.6123
LPIPS ↓     0.4329     0.6035     0.5130       0.4422    0.2965    0.2919
VIF ↑       0.8266     0.8039     0.8026       0.8073    0.8400    0.8475
Table 4. Efficiency of different comparison methods.

Indicator          CycleGAN   ToDayGAN   MocycleGAN   CUT      DECENT   Ours
Parameters (M) ↓   28.286     83.171     42.147       14.703   57.183   12.371
Runtime (s) ↓      0.251      0.783      0.374        0.137    0.243    0.081
Table 5. Objective indicators for infrared image colorization using different modules.

Methods Compared                               PSNR ↑    SSIM ↑   LPIPS ↓   VIF ↑
Baseline                                       17.1539   0.5285   0.3457    0.8320
Baseline + local feature extraction            18.7583   0.5580   0.3123    0.7947
Baseline + global feature extraction           20.1790   0.6102   0.2921    0.8424
Baseline + global + local feature extraction   20.2499   0.6114   0.2769    0.8438
Ours                                           20.5134   0.6156   0.2407    0.8404
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
