1. Introduction
Adding colors to grayscale or black-and-white images is known as image colorization. This technology holds considerable importance across multiple fields, such as the digital restoration of old photographs, entertainment and media sectors, historical preservation, and augmentation of visual communication. Incorporating color into images can enhance their realism and visual appeal by accurately representing the depicted scene or object.
Despite its significance, image colorization poses numerous challenges. One of the foremost challenges is the precise selection of suitable colors for individual pixels within an image, particularly when color data are unavailable. Accurately predicting the appropriate colors involves understanding the context and semantics of the image, which can be especially challenging for complex scenes with multiple objects and varying textures. The difficulty level of this task escalates when confronted with complex visuals or uncertain grayscale variations. Throughout the years, researchers have proposed various methodologies to address the issue of image colorization. Historically, conventional techniques have frequently required human involvement, whereby skilled artists or experts have meticulously incorporated hues into monochromatic images. Although these methods produced adequate outcomes, they were time-intensive and required knowledge of color theory and image manipulation. Furthermore, the manual process was not scalable and could not meet the growing demand for colorized images in various industries, including entertainment, restoration, and digital art.
Automated image colorization techniques have attracted significant attention recently, owing to the progress made in deep learning and computer vision. These methodologies utilize extensive datasets, convolutional neural networks (CNNs), and generative models [1] to learn the correlation between grayscale and color images. In this study, we aim to develop an automated system for predicting and assigning appropriate colors to grayscale pixels, thereby significantly reducing the need for manual intervention.
Although current methods leverage CNNs or transformer architectures, they face challenges such as color bleeding, desaturation, and limitations in effectively capturing local and global features. A central difficulty of colorization is balancing attention to fine details, such as textures, with comprehension of the broader context; striking this balance is crucial for achieving natural and realistic results. These challenges make it difficult for automated methods to consistently produce accurate and visually pleasing colorizations, and overcoming them is essential for advancing state-of-the-art image colorization.
This paper presents a new method for image colorization that overcomes certain limitations of current methodologies. The proposed method utilizes a color encoder [2], a color transformer, and an encoder–decoder-based generative adversarial network (GAN) [1] architecture to achieve precise and effective colorization. The objective of incorporating a color transformer and color encoder into the generator architecture is to improve the colorization procedure by providing color assignments that are more contextually relevant and coherent. By adding these components, our method aims to leverage the strengths of both transformer networks and GANs to produce visually pleasing colorized images.
The subsequent sections of this paper are organized as follows. Section 2 gives a detailed review of existing works, providing insight into existing methodologies and their limitations. Section 3 elaborates on our proposed method, detailing its theoretical basis and practical implementation. Section 4 presents the results of our experimental evaluation, which includes comparisons with other state-of-the-art techniques. Finally, Section 5 concludes the paper, summarizing key findings and outlining directions for future research.
2. Related Works
This section provides a comprehensive overview of current image colorization methodologies, including conventional and deep learning-based techniques. We analyze their strengths, limitations, and areas for improvement, and highlight the specific gaps in existing research that our proposed method addresses.
Historically, conventional methods of colorizing images have relied heavily on the involvement of skilled artists and professionals. The aforementioned techniques entail a rigorous procedure for incorporating hues into monochromatic images based on color theory and artistic proficiency. An example of such a methodology is the research conducted by Levin et al., who presented a colorization technique based on scribbles [3]. Although manual techniques can produce acceptable outcomes, they are time-consuming and labor-intensive, and they require proficient human operators. Traditional example-based methods were crucial in early attempts at image colorization. Approaches such as optimization techniques using graph cuts and energy minimization [4], texture-based methods involving texture synthesis, and patch-based techniques [5] have been explored. These methods often rely on transferring color information from reference or exemplar images to grayscale targets, although they can face challenges in handling complex scenes and may introduce artifacts, such as color bleeding. The manual selection of reference images is also time-consuming.
Automated image colorization techniques have emerged as a promising approach owing to advancements in deep learning and computer vision. Deep learning techniques utilize extensive datasets and convolutional neural networks (CNNs) to comprehensively understand the complex associations between grayscale and color images. One noteworthy technique employs a deep learning-based strategy that incorporates both classification and colorization networks [6].
GANs have recently gained considerable attention, as generative models enable multimodal colorization. In a recent study [1], a conditional GAN-based image-to-image translation model was proposed utilizing a generator based on the UNet architecture. The results of this approach demonstrate improved image colorization, which can be attributed to the use of adversarial training. In [7], the model was extended to handle high-resolution images. Generative priors for colorization were further investigated in [8], given that the spatial structures of the image had already been produced. SCGAN [9] is a GAN-based image colorization method that uses saliency maps to guide colorization. By employing saliency maps, SCGAN can first focus on the most significant portions of an image, resulting in more accurate and realistic colorization results. The double-channel-guided GAN (DCGAN) [10] is another GAN-based image colorization approach that guides the colorization process using two channels: the first channel contains a grayscale image, and the second contains a color palette. DCGAN learns the structure of the image from the grayscale channel and the color distribution from the color palette. Vivid and diverse image colorization with a generative color prior (GCPrior) [11] is an image colorization method that learns color priors using a generative model; the color prior is the distribution of probable colors for each pixel in the image. GCPrior uses this prior to guide the colorization process, resulting in more vivid and diverse colorizations. DDColor [12] is an image colorization approach that uses dual decoders. The pixel decoder reconstructs the spatial resolution of the image, whereas the query-based color decoder learns semantically aware color representations from multiscale visual data. The two decoders are merged using cross-attention to establish correlations between color and semantic information, substantially alleviating the color-bleeding effect.
Transformers [13] have attracted significant interest in computer vision. Vaswani et al. [13] initially presented the transformer architecture. Subsequently, a novel approach to image classification was introduced, denoted as vision transformers (ViTs) [14]. ViTs adapt the transformer architecture for image data, enabling the model to efficiently capture long-range dependencies and hierarchical representations. In addition to image classification, transformers have been utilized in various other image-processing tasks, including object detection, segmentation, image super-resolution, denoising, and colorization. The transformer’s ability to process entire images as sequences of patches has enabled more comprehensive feature extraction and representation. In addition, transformers, exemplified by ColTran [15], have exhibited encouraging outcomes in the image colorization task, thereby attesting to their efficacy in this domain. CT2 [16] is another example of image colorization that uses an end-to-end transformer framework. Grayscale features are extracted and encoded, and discrete color tokens representing the quantized ab space are introduced. A dedicated color transformer fuses the image and color information, guided by luminance selection and color attention modules.
Despite notable performance improvements, existing colorization networks that rely on CNNs or transformers encounter significant limitations, including color bleeding, desaturation, and difficulties in capturing local and global features. To overcome these challenges, we introduce a novel image colorization method that strategically integrates transformers and CNNs into the generator architecture. This approach aims to address the challenges of existing methods by leveraging the strengths of both architectures together with adversarial training. Moreover, we introduce two key components in the generator, the color encoder and the color transformer, to further augment the colorization process. The color encoder focuses on capturing intricate color features, whereas the color transformer enhances the integration of local and global information using a transformer architecture. Together, these modules form a comprehensive and robust image colorization framework, mitigating the shortcomings observed in current state-of-the-art approaches.
3. Proposed Method
The proposed colorization method in this paper uses an encoder–decoder architecture with a color transformer at the bottleneck and a color encoder block in the generator. In this section, we describe the overall architecture of the proposed method. We then provide the details of the proposed generator architecture, which includes the color encoder, color transformer, and proposed objective function.
3.1. Overall Architecture
Figure 1 shows the overall architecture of the proposed image colorization network. The proposed method introduces a comprehensive architectural design that integrates several key components. Specifically, we employ VGG-based global feature extraction, a color encoder, a color transformer, and a GAN architecture to enhance visual quality. Initially, the RGB color space is converted into the CIELAB color space (Lab) [17]. The Lab color space separates luminance from chromaticity, thus providing a perceptually uniform space: L represents the luminance channel of an image, and ab represents its chrominance channels. This separation helps the colorization model capture chromatic details independent of luminance, thereby improving the overall accuracy and perceptual quality. The luminance channel input undergoes initial processing via a pretrained VGG network and an encoder, extracting high-level global features that capture semantic information. The global features from the VGG network are combined with the encoder layers, as shown in Figure 1. This integration of pretrained VGG features at different encoder levels is designed to enrich the encoder's understanding of the input image, providing an enhanced representation that facilitates improved colorization performance. Concurrently, a color encoder uses convolutional layers to produce color features from a normal distribution, as described in [2]. The integration of global and color-specific information is facilitated by fusing the color-encoded features at the bottleneck in the color transformer block and the global features in the encoder layers, as shown in Figure 1. The fused features are then fed into a Swin Transformer [18] block that captures long-range dependencies and spatial relationships in the image; two transformer blocks are used to effectively capture the global information. Finally, the decoder network employs a gradual upsampling process to reconstruct the ab channels of the Lab color space while preserving fine-grained details using skip connections.
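As a concrete illustration of the Lab preprocessing step described above, the following sketch converts an RGB image into Lab and splits it into the luminance input and chrominance targets. It assumes scikit-image for the conversion; the function names and normalization constants are illustrative choices, not values taken from the paper.

```python
import numpy as np
from skimage import color  # scikit-image provides rgb2lab / lab2rgb

def split_lab(rgb):
    """Convert an RGB image with values in [0, 1] to Lab and split it into
    the luminance input (L) and the chrominance targets (ab)."""
    lab = color.rgb2lab(rgb)              # L in [0, 100], a/b roughly in [-128, 127]
    L = lab[..., :1] / 50.0 - 1.0         # normalize luminance to [-1, 1]
    ab = lab[..., 1:] / 128.0             # normalize chrominance to roughly [-1, 1]
    return L.astype(np.float32), ab.astype(np.float32)

def merge_lab(L, ab):
    """Recombine predicted ab channels with the input luminance and convert back to RGB."""
    lab = np.concatenate([(L + 1.0) * 50.0, ab * 128.0], axis=-1)
    return color.lab2rgb(lab)             # returns RGB values in [0, 1]
```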
A GAN architecture, consisting of a generator (comprising the encoder, color transformer, color encoder, and decoder) and a discriminator, is utilized to improve visual fidelity. The generator tries to produce convincingly realistic colorizations, thereby deceiving the discriminator, whereas the discriminator's role is to differentiate between the colorized outputs and the actual color images that serve as the ground truth (GT). The training process is guided by various loss functions, such as perceptual loss, adversarial loss, and color loss, which collectively contribute to precise colorization. In our proposed architecture, we utilize a PatchGAN-based discriminator [1] for image colorization. The PatchGAN discriminator assesses local image patches instead of the entire image, allowing for a more detailed evaluation of textures and features. By concentrating on smaller regions, our method significantly enhances the synthesis of colorized images, achieving improved local coherence and a realistic distribution of textures, which contributes to the enhanced overall quality of the generated results.
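A minimal PyTorch sketch of such a patch-based discriminator is shown below. The layer widths, normalization choice, and input channel count are illustrative assumptions in the style of [1], not the exact configuration used in this work.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Markovian (PatchGAN) discriminator: outputs a grid of real/fake scores,
    each covering a local receptive field instead of the whole image."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        def block(cin, cout, stride, norm=True):
            layers = [nn.Conv2d(cin, cout, 4, stride, 1)]
            if norm:
                layers.append(nn.InstanceNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_channels, base, 2, norm=False),
            *block(base, base * 2, 2),
            *block(base * 2, base * 4, 2),
            *block(base * 4, base * 8, 1),
            nn.Conv2d(base * 8, 1, 4, 1, 1),   # one score per local patch
        )

    def forward(self, x):
        # x: the L channel concatenated with real or predicted ab channels
        return self.net(x)
```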
3.2. Color Encoder
The color encoder plays a crucial role in the proposed image colorization by generating color features from a Gaussian normal distribution. This module utilizes a CNN to convert randomly sampled normal features into meaningful color-encoded features. The sampled features are fed into the color encoder and subjected to multiple convolutional layers. These layers learn to extract spatially relevant information from the sampled features, resulting in color-encoded features that capture color-specific information.
To train the color encoder, its output is compared with the global features produced by a VGG network. The VGG network receives a color image input that comprises the L channel input image and the GT ab channels, and it extracts global features from this color image, capturing high-level semantic information. The color-encoded features generated by the color encoder are compared with these global features using the color loss defined in Section 3.4. The color encoder is crucial for generating colorful and visually appealing colorization results.
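The following sketch illustrates the idea of such a module. The number of layers, channel widths, and noise shape are illustrative assumptions; only the mapping from sampled Gaussian noise to color-encoded features follows the description above. During training, the output of this module is matched against the VGG features of the GT color image through the color loss.

```python
import torch
import torch.nn as nn

class ColorEncoder(nn.Module):
    """Maps features sampled from a normal distribution to color-encoded
    features that are fused with grayscale features at the generator bottleneck."""
    def __init__(self, noise_channels=64, out_channels=512):
        super().__init__()
        self.noise_channels = noise_channels
        self.net = nn.Sequential(
            nn.Conv2d(noise_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, out_channels, 3, padding=1),
        )

    def forward(self, batch_size, height, width, device):
        # z ~ N(0, 1); its spatial size matches the bottleneck feature map.
        z = torch.randn(batch_size, self.noise_channels, height, width, device=device)
        return self.net(z)
```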
3.3. Color Transformer
A color transformer module is designed to improve the image colorization process. This is achieved by integrating color features with grayscale image features and subsequently passing them through two Swin Transformer [18] layers, as shown in Figure 2. The fusion process generates a comprehensive representation by integrating global and color-specific information. Swin Transformer layers can capture global dependencies and spatial relationships, which facilitates the model's understanding of long-range dependencies and complex relationships within the image. The incorporation of global information from the color transformer enhances the precision and visual quality of the colorized outputs. A residual connection is used to compensate for missing information and improve gradient flow. This process can be described by (1)–(5). First, the outputs of the encoder and color encoder, $F_{enc}$ and $F_{ce}$, are concatenated into a single tensor $F_{cat}$, setting the foundation for integrated feature processing:

$F_{cat} = \mathrm{Concat}(F_{enc}, F_{ce})$.  (1)

Let $\mathrm{Conv}(\cdot)$ denote a convolution operation that transforms $F_{cat}$ by extracting spatial relationships. We use a standard 3 × 3 convolutional kernel for this operation, which effectively captures local features while maintaining computational efficiency. Then $F_{conv}$ is obtained by (2):

$F_{conv} = \mathrm{Conv}(F_{cat})$.  (2)

Subsequently, $F_{conv}$ is passed through two Swin Transformer blocks, represented as $\mathrm{ST}_1$ and $\mathrm{ST}_2$:

$F_{s1} = \mathrm{ST}_1(F_{conv})$,  (3)

$F_{s2} = \mathrm{ST}_2(F_{s1})$.  (4)

The two Swin Transformer blocks are constructed following the architecture of the original Swin Transformer [18]. Each block consists of a (shifted) window-based multi-head self-attention (W-MSA/SW-MSA) module and a multi-layer perceptron (MLP) with a GELU activation function. Layer normalization (LN) is applied before both the self-attention module and the MLP, and a residual connection is employed after each module. The Swin Transformer blocks enable the extraction and enhancement of long-range dependencies within the data. Finally, the obtained output $F_{s2}$ is added elementwise to $F_{conv}$, as represented in (5):

$F_{ct} = F_{s2} + F_{conv}$,  (5)

where $F_{ct}$ represents the output of the color transformer module. This addition of the initial convolutional features to the advanced features processed by the Swin Transformer blocks creates a residual connection that enhances the flow of gradients and compensates for any potential loss of information.
The color transformer ensures the seamless integration of grayscale and color features, facilitating a comprehensive understanding of the image content within the color transformer module. Incorporating Swin Transformers and the residual connection collectively enhances the capability of the model to produce accurate and visually compelling colorizations.
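A sketch that mirrors Equations (1)–(5) is given below. A generic pre-norm transformer encoder layer stands in for the Swin Transformer blocks (which use windowed attention), and all names and dimensions are our assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ColorTransformer(nn.Module):
    """Fuses encoder features with color-encoded features, refines them with two
    transformer blocks, and adds a residual connection, following Eqs. (1)-(5)."""
    def __init__(self, enc_channels=512, color_channels=512, dim=512, heads=8):
        super().__init__()
        self.fuse = nn.Conv2d(enc_channels + color_channels, dim, 3, padding=1)
        # Stand-in for the two Swin Transformer blocks ST1 and ST2 in (3)-(4).
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       activation="gelu", batch_first=True,
                                       norm_first=True)
            for _ in range(2)
        ])

    def forward(self, f_enc, f_color):
        f_cat = torch.cat([f_enc, f_color], dim=1)        # Eq. (1): concatenation
        f_conv = self.fuse(f_cat)                         # Eq. (2): 3x3 convolution
        b, c, h, w = f_conv.shape
        tokens = f_conv.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        for blk in self.blocks:
            tokens = blk(tokens)                          # Eqs. (3)-(4): transformer blocks
        f_trans = tokens.transpose(1, 2).reshape(b, c, h, w)
        return f_trans + f_conv                           # Eq. (5): residual addition
```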
3.4. Objective Function
The objective function of the proposed method is defined as (6):

$\mathcal{L}_{total} = \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{per}\,\mathcal{L}_{per} + \lambda_{1}\,\mathcal{L}_{1} + \lambda_{color}\,\mathcal{L}_{color}$,  (6)

where $\mathcal{L}_{total}$ represents the total loss and $\mathcal{L}_{adv}$ denotes the adversarial Wasserstein (WGAN) loss [19], which is used to avoid the vanishing gradient problem and achieve stable training of the GAN. The purpose of the objective function is to optimize the colorization process by balancing the different loss terms. The adversarial loss $\mathcal{L}_{adv}$ encourages the generator to produce images that are indistinguishable from real images, thereby improving the realism of the generated images.
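For reference, a minimal sketch of the standard WGAN critic and generator terms is shown below. How the Lipschitz constraint is enforced (e.g., gradient penalty or weight clipping) is not detailed above, so it is left out of the sketch.

```python
def wgan_critic_loss(critic, real, fake):
    """Critic maximizes D(real) - D(fake); we return the negation to minimize."""
    return critic(fake.detach()).mean() - critic(real).mean()

def wgan_generator_loss(critic, fake):
    """Generator maximizes D(fake), i.e., minimizes -D(fake)."""
    return -critic(fake).mean()
```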
The perceptual loss $\mathcal{L}_{per}$ is obtained by comparing the high-level feature representations of the generated and GT images using a pretrained VGG network. Specifically, the distance between the feature maps of the generated image $\hat{y}$ and the GT image $y$ at different layers $l$ of the VGG network is computed as follows:

$\mathcal{L}_{per} = \sum_{l} \left\lVert \phi_{l}(y) - \phi_{l}(\hat{y}) \right\rVert$,  (7)

where $\phi_{l}$ represents the feature extractor function of the pretrained VGG network; that is, $\phi_{l}(x)$ denotes the feature map extracted from an input image $x$ at the $l$-th layer of the VGG network. Each $\phi_{l}$ can be considered as a mapping function $\phi_{l}: \mathbb{R}^{H \times W \times C} \rightarrow \mathbb{R}^{H_{l} \times W_{l} \times C_{l}}$, where $H$, $W$, and $C$ are the height, width, and number of channels of the input image; $H_{l}$, $W_{l}$, and $C_{l}$ are the dimensions of the feature map at layer $l$; and $y$ and $\hat{y}$ represent the GT and output image, respectively. In our implementation, we use the VGG16 model pretrained on ImageNet and divide it into several blocks corresponding to different layers. Specifically, we utilize the feature maps from the conv1_2, conv2_2, conv3_3, and conv4_3 layers of the VGG16 model. These layers are selected to capture both low-level and high-level features, which are crucial for ensuring high perceptual similarity between the generated and GT images. The perceptual loss $\mathcal{L}_{per}$ helps to preserve high-level features and details by ensuring that the generated images are perceptually similar to the GT images as perceived by the VGG network.
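A sketch of this loss using torchvision's pretrained VGG16 follows. In torchvision's vgg16.features indexing, conv1_2, conv2_2, conv3_3, and conv4_3 are modules 2, 7, 14, and 21; the choice of the L1 distance and the exact slicing points are our assumptions, and the inputs are assumed to be 3-channel images already normalized with ImageNet statistics.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(nn.Module):
    """Sums the distances between VGG16 feature maps of the output and GT images."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        # Slices ending at conv1_2 (idx 2), conv2_2 (7), conv3_3 (14), conv4_3 (21).
        self.slices = nn.ModuleList([vgg[:3], vgg[3:8], vgg[8:15], vgg[15:22]])
        for p in self.parameters():
            p.requires_grad = False

    def forward(self, pred, target):
        loss = 0.0
        x, y = pred, target
        for block in self.slices:
            x, y = block(x), block(y)
            loss = loss + F.l1_loss(x, y)
        return loss
```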
The $\mathcal{L}_{1}$ loss is the conventional $L_1$ loss, which is obtained by computing the absolute differences between the pixel values of the generated image $\hat{y}$ and the GT image $y$. This can be mathematically expressed as (8):

$\mathcal{L}_{1} = \left\lVert y - \hat{y} \right\rVert_{1}$,  (8)

where $\lVert \cdot \rVert_{1}$ denotes the $L_1$ norm, which sums the absolute differences between the corresponding pixels of the two images. The $\mathcal{L}_{1}$ loss ensures pixel-wise accuracy by minimizing these differences, which helps to maintain the overall structure and color integrity.
$\mathcal{L}_{color}$ is the color loss, which compares the feature map that the color encoder generates from the random normal distribution with the feature map of the GT image, and it is defined as (9):

$\mathcal{L}_{color} = \mathbb{E}\left[ \left\lVert CE(z) - \phi(y) \right\rVert \right], \quad z \sim \mathcal{N}(\mu, \sigma)$,  (9)

where $\mathbb{E}$ represents the expectation operator, which averages the loss over the distribution of the training data. This means that the color loss $\mathcal{L}_{color}$ is computed as the expected value of the norm of the differences between the generated color features and the VGG features of the GT images over all samples in the training set. Furthermore, $\mathcal{N}(\mu, \sigma)$ is the random normal distribution with mean $\mu$ and standard deviation $\sigma$, $CE(\cdot)$ represents the function of the color encoder, and $y$ is the GT image. The color loss $\mathcal{L}_{color}$ ensures that the generated color features are consistent with those of the GT image, thereby enhancing the color accuracy and vividness of the final output. The weighting factors $\lambda$ in (6) are fixed and set empirically.
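As a brief sketch of how the terms in (6) and (9) combine in practice, the helpers below take the individual loss values and weights as inputs; the names are ours, and the empirical λ values are supplied by the caller rather than reproduced here.

```python
import torch

def color_loss(color_features, vgg_features_gt):
    """Eq. (9): expected distance between color-encoded features generated from
    sampled normal noise and VGG features of the GT color image (L1 assumed)."""
    return torch.mean(torch.abs(color_features - vgg_features_gt))

def total_generator_loss(l_adv, l_per, l_1, l_color,
                         lam_adv, lam_per, lam_1, lam_color):
    """Eq. (6): weighted sum of adversarial, perceptual, L1, and color losses.
    The weights are set empirically and passed in by the caller."""
    return lam_adv * l_adv + lam_per * l_per + lam_1 * l_1 + lam_color * l_color
```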