Article

Image Colorization with Residual Attention U-Net

Jun Yang, Donghui Zhang, Fan Wu and Le Yang
1 College of Artificial Intelligence, Jiaxing University, Jiaxing 314001, China
2 School of Control Engineering, Northeastern University, Shenyang 110004, China
3 School of Intelligent Sensing and Optoelectronic Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China
* Authors to whom correspondence should be addressed.
Electronics 2026, 15(7), 1462; https://doi.org/10.3390/electronics15071462
Submission received: 14 February 2026 / Revised: 26 March 2026 / Accepted: 27 March 2026 / Published: 1 April 2026

Abstract

Image colorization aims to add plausible colors to grayscale images. However, existing methods often suffer from detail loss, dull colors, and unrealistic results. To address these issues, we propose a novel image colorization method based on a residual attention U-Net. First, a shallow feature extraction module with a fusion attention mechanism is designed to capture shallow features. Second, a residual attention U-Net is constructed by integrating a residual attention module into an improved U-Net architecture. Finally, we fuse the extracted shallow features with the shallow attention features within the residual attention U-Net to enhance detail preservation and improve colorization quality. Experimental results on the summer2winter dataset show that, compared with the best competing method, our approach improves the average PSNR by 1.32 dB and SSIM by 0.0139 while reducing LPIPS by 0.01. Furthermore, our method achieves the best average PSNR and LPIPS on the NCData and COCO-Stuff datasets. Visual results demonstrate that our approach preserves fine details, produces more vibrant colors, and achieves a higher degree of realism and naturalness.

1. Introduction

Image colorization is a classic task in computer vision that aims to add natural and realistic colors to grayscale images, bridging the gap between black-and-white visual information and colorful real-world scenes. This technology has broad applications in fields such as historical image restoration, medical image processing, film and television post-production, and virtual reality [1,2,3]. Despite significant advancements in recent years, image colorization still faces several key challenges. Balancing color addition with feature preservation is challenging, as it is crucial to ensure that the added colors do not obscure the inherent details of the image. Ensuring color consistency across different regions of the image is another significant challenge. For example, adjacent regions should have colors that blend smoothly and logically, reflecting real-world color transitions. Image colorization, especially using deep learning models, can be computationally intensive. This can lead to longer processing times, which may not be feasible for real-time applications or large-scale image datasets. While some colorization methods are fully automatic, others require user input to guide the colorization process. Balancing the need for user interaction with the desire for automation is a delicate task. Generalizing colorization methods to diverse types of images, including those with complex scenes, varying lighting conditions, and different textures, remains a challenge.
In this paper, we propose a deep learning colorization method based on feature fusion, aiming to preserve the details of the colorized image and improve the colorization effect. We construct a Shallow Feature Extraction Block (SFEB) to obtain the shallow features of the image and design a Residual Attention Block (RAB), which is combined with an improved U-Net to form the residual attention U-Net. The shallow features extracted by the SFEB are fused with the low-level features extracted by the low-level network of the residual attention U-Net. The fused features are then fed into the high-level network of the residual attention U-Net to extract deep features and enrich the representational ability of the model. Finally, the effectiveness of the proposed method is verified through comparisons with existing image colorization methods.
The main contributions of this work are clearly summarized as follows:
  • A shallow feature extraction module is constructed to extract image detail features. An attention mechanism is then incorporated into the module to assign different weights to each element in the feature information, thereby increasing the model’s focus on useful features;
  • A residual attention module is designed and combined with the improved U-Net to construct the residual attention U-Net, which effectively leverages both shallow and deep image features to enhance the colorization effect;
  • An image colorization method based on residual attention U-Net is proposed, utilizing the shallow feature extraction module and residual attention U-Net to achieve colorization while preserving more image details and producing more vibrant colors.

2. Related Works

Image colorization methods can be divided into traditional methods and deep learning methods. Traditional methods [4,5] suffer from poor colorization quality, heavy computation, and considerable manual interaction. In recent years, with the development of artificial intelligence, deep learning methods [6,7] driven by large-scale data have provided a new way to overcome these drawbacks. Image colorization methods can also be divided into scribble-based methods, reference image-based (exemplar-based) methods, and automatic colorization methods.

2.1. Scribble-Based Methods

The scribble-based image colorization method is a technique that uses user input in the form of simple scribbles to guide the process of adding color to black-and-white images. This approach allows for greater control and precision, as users can directly indicate which areas should be colored with specific hues. Levin et al. [8] proposed the first scribble-based colorization method. Huang et al. [9] proposed an adaptive edge detection colorization method based on Sobel filters to prevent color overflow at edges. Yatziv et al. [10] proposed a chrominance fusion colorization method, which calculates the distance between each pixel and multiple scribbles and then determines the pixel color based on the weighted sum of scribble colors. Compared with the method of Levin et al. [8], Yatziv et al. [10] achieved lower time and computational complexity; however, color overflow may still occur in regions with weak image edges. Kim et al. [11] improved upon Yatziv et al.'s method by introducing a data-driven distance measurement approach based on restart random walks [12], ensuring more consistent edge colors. Scribble-based colorization methods require manual participation and demand a certain level of color perception from users, which imposes relatively strict requirements on their use. Therefore, as deep learning-based image colorization methods have emerged, research on scribble-based methods has gradually declined.

2.2. Exemplar-Based Methods

Exemplar-based image colorization methods utilize reference images to add color to black-and-white or grayscale images. These methods rely on finding similar exemplars in a database of color images and using their color information to guide the colorization process. Ironi et al. [13] incorporated image segmentation information into the colorization process and used domain matching algorithms to assign colors from the reference image to each pixel. However, when lighting conditions differ significantly between the reference and target images, the colorization effect is poor. To address this issue, Liu et al. [4] proposed an intrinsic colorization method. First, an image is represented as two components: reflectance and illumination. The reflectance from the reference image is then combined with the illumination from the grayscale image to generate a preliminary color image. Finally, a subset of pixels is extracted from the color image as color scribbles, and the method of Levin et al. [8] is applied to perform colorization. Xu et al. [14] proposed a fast instance colorization network based on stylization to achieve spatial consistency and improve colorization quality. Welsh et al. [15] transferred the color information of a reference image to a grayscale image by matching brightness and texture features; however, this local matching method often lacks spatial coherence, resulting in suboptimal colorization. Compared with scribble-based methods, reference image-based methods reduce manual involvement by introducing reference images, but their results are highly dependent on the reference image. If there is a significant visual discrepancy between the two images, the colorization quality may degrade. Additionally, the number of reference images is a key factor; using too few reference images may lead to overfitting.

2.3. Automatic Image Colorization Methods

Automatic image colorization methods aim to add color to black-and-white or grayscale images without manual guidance. These methods use algorithms and machine learning techniques to analyze the content of an image and predict appropriate colors for different objects, backgrounds, and lighting conditions. By leveraging large datasets of color images, they learn patterns and relationships between different elements in an image, allowing color to be applied accurately with minimal human intervention. With the advent of deep learning, various techniques have been proposed for automatic image colorization; for example, image processing methods based on well-designed deep convolutional neural networks (CNNs) have largely surpassed traditional approaches [16,17,18,19,20]. Cheng et al. [21] proposed the first neural network-based image colorization method, extracting feature information from different regions of the image as input to the network and then applying joint bilateral filtering to remove artifacts. Wu et al. [22] proposed a method for colorizing remote sensing images based on a deep convolutional generative adversarial network. Wang et al. [23] proposed an automatic colorization framework for Thangka sketches that responds accurately to user selections. The model proposed by Cheng et al. [21] relies on manually designed features, which prevents end-to-end training. Therefore, in the method proposed by Iizuka et al. [1], grayscale images are used directly as network inputs and the predicted chrominance channels as outputs; a second network extracts global information from the image, which is fused with the chrominance branch to give the model a better understanding of the overall semantics, thereby improving the colorization effect and alleviating color overflow. Moreover, because this method takes grayscale images directly as input, the colorization time is reduced. Deshpande et al. [3] used a variational autoencoder to learn a low-dimensional embedding of colors and combined it with a mixture density network to generate diverse color images. Isola et al. [24] proposed an image translation method based on the pix2pix network, which is also applicable to image colorization. Yoo et al. [25] proposed a memory-augmented colorization network focused on few-shot colorization, which maintains high-quality results when data are limited. Xia et al. [26] proposed a double-branch colorization network that includes a color modeler, which predicts anchor-point colors to represent the color distribution, and a color generator, which predicts pixel colors by referencing the sampled anchor points. Zhong et al. [27] proposed a grayscale enhancement colorization network (GECNet) to bridge the modality gap by retaining the rich structural information of the colorized image.
In recent years, several studies on colorization using residual U-Nets have achieved considerable performance [28,29,30,31]. Sharma et al. [28] proposed Robust Image Colorization using a Self-attention-based Progressive Generative Adversarial Network (RICSPGAN), which cascades a residual encoder–decoder (RED) network with a Self-attention-based Progressive Generative Network (SP-GAN) to perform denoising and colorization. Kumar et al. [29] presented a parallel GAN-based colorization framework that uses parallel GANs tailored to colorize the foreground (using object-level features) and background (using full-image features) independently and performs unbalanced GAN training. Guo et al. [30] designed a novel GAN-based Bilateral Res-Unet, in which the generator transfers color features on both sides of the encoder. Liu et al. [31] proposed an efficient anime sketch colorization method using a swish-gated residual U-Net (SGRU) and a spectrally normalized GAN (SNGAN) to address low-quality colorization results. These residual U-Net-based methods have made remarkable progress in image colorization, but they still have limitations in feature fusion and detail preservation. Our method differs from these works in the strategic integration of a shallow feature extraction block (SFEB) with a residual attention U-Net, and in a feature fusion mechanism that strengthens shallow features to enhance detail preservation, effectively addressing the shortcomings of existing residual U-Net colorization methods.
In addition, with the rapid development of deep learning, transformer-based and diffusion-based colorization methods have become new research hotspots and achieved excellent performance. For example, DDColor [32] proposed a dual-decoder structure to achieve photo-realistic image colorization; BigColor [33] utilized a generative color prior to improve colorization quality; and L-CAD [34] introduced language-based colorization with diffusion priors. These methods have shown superior performance in some scenarios, but they often have higher computational complexity and require more computing resources. Our method, based on residual attention U-Net, achieves a better balance between colorization quality and computational efficiency, and still maintains competitive performance compared with these recent advanced methods.
Although deep learning methods can achieve automatic image colorization, the models are strongly affected by the color tones of the training images, which can lead to unsatisfactory results such as detail loss, dull colors, and unrealistic colors. The color tones in a training dataset can have a significant impact on model performance, particularly in image colorization. To ensure robust and generalizable models, it is important to consider the diversity of color tones in the dataset, employ data augmentation, normalize or standardize the color distribution, and carefully select and tune the model. By addressing these aspects, researchers and practitioners can improve the performance and reliability of their models across various real-world scenarios. This paper proposes a deep learning colorization method based on a residual attention U-Net, which preserves the details of the colorized image and improves the colorization effect.

3. Methodology

3.1. Network Structure

We convert the image from the RGB color space to the Lab color space so that the lightness information (L) is separated from the chrominance information (ab), which facilitates processing by the model. The proposed image colorization network structure is shown in Figure 1. The L-channel image is fed into the SFEB to obtain shallow features F_s; in parallel, the L-channel image passes through a convolutional layer and the shallow network of the residual attention U-Net to obtain residual attention features F_us. This process can be expressed by Equations (1) and (2):
F_s = S_{SFEB}(X_L)    (1)
F_{us} = U_S(C_{4×4}(X_L))    (2)
where X_L is the L-channel image, C_{4×4} is the 4 × 4 convolution operation, S_{SFEB} denotes the shallow feature extraction, and U_S denotes the shallow network of the residual attention U-Net. The feature fusion module is shown in Figure 2. The shallow features F_s are expanded to the same size as the features F_us so that features of equal size can be fused. The fused feature F_f is then obtained via a convolution operation. This fusion strengthens the shallow feature information so that the colorized image retains rich detail. The process can be represented by Equation (3):
F_f = C_{1×1}(δ(E(F_s), F_{us}))    (3)
where E denotes the extension operation, δ denotes the feature stitching (concatenation) operation, and C_{1×1} is the 1 × 1 convolution. The 1 × 1 convolution operates across channels without altering spatial dimensions, making it efficient for feature recombination.
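A minimal PyTorch sketch of this fusion step is given below. It assumes the extension operation E is a bilinear resize, and the module name, parameter names, and channel counts are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFusion(nn.Module):
    """Sketch of Eq. (3): extend F_s to the size of F_us, concatenate, then apply a 1x1 conv."""

    def __init__(self, s_channels: int, us_channels: int, out_channels: int):
        super().__init__()
        # The 1x1 convolution recombines channels without changing the spatial size.
        self.conv1x1 = nn.Conv2d(s_channels + us_channels, out_channels, kernel_size=1)

    def forward(self, f_s: torch.Tensor, f_us: torch.Tensor) -> torch.Tensor:
        # E(.): extend the shallow features to the spatial size of F_us (bilinear resize assumed).
        f_s = F.interpolate(f_s, size=f_us.shape[-2:], mode="bilinear", align_corners=False)
        # delta(.): feature stitching, i.e., channel-wise concatenation.
        fused = torch.cat([f_s, f_us], dim=1)
        return self.conv1x1(fused)


# Example: fuse 64-channel shallow features with 64-channel U-Net features into 64 channels.
fusion = FeatureFusion(s_channels=64, us_channels=64, out_channels=64)
f_f = fusion(torch.randn(1, 64, 16, 16), torch.randn(1, 64, 128, 128))
print(f_f.shape)  # torch.Size([1, 64, 128, 128])
```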
The fused features F_f are then fed into the deep network of the residual attention U-Net and passed through a convolutional layer to obtain the ab-channel image. This process extracts deeper features by traversing from shallow to deep layers and integrates them through skip connections, thereby preserving more detailed information in the colorized image. As a result, the colors are rendered with greater realism and naturalness. This calculation can be represented by Equation (4):
ab = C_{1×1}(U_h(F_f))    (4)
where U_h denotes the deep network of the residual attention U-Net, and C_{1×1} is the 1 × 1 convolution. Finally, the L- and ab-channel images are combined to form the image in the Lab color space, which is subsequently converted into the RGB color space. This process can be expressed by Equations (5) and (6):
X_{Lab} = δ(X_L, X_{ab})    (5)
X_{RGB} = R_{lab2rgb}(X_{Lab})    (6)
where δ denotes the feature stitching operation, and R_{lab2rgb} converts an image in the Lab color space into its RGB representation.
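The color-space handling in Equations (5) and (6) can be sketched with scikit-image as below, assuming an H × W × 3 uint8 RGB input; the helper function names are ours, not the paper's.

```python
import numpy as np
from skimage import color


def rgb_to_l_and_ab(rgb_uint8: np.ndarray):
    """Split an RGB image (H, W, 3, uint8) into the L channel and the ab channels."""
    lab = color.rgb2lab(rgb_uint8 / 255.0)  # L in [0, 100], a/b roughly in [-128, 127]
    return lab[..., :1], lab[..., 1:]


def l_and_ab_to_rgb(l_chan: np.ndarray, ab_chan: np.ndarray) -> np.ndarray:
    """Eqs. (5)-(6): stitch L with the predicted ab channels and convert back to RGB."""
    lab = np.concatenate([l_chan, ab_chan], axis=-1)
    return color.lab2rgb(lab)  # float RGB in [0, 1]
```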

3.2. Shallow Feature Extraction Module

The shallow features contain image texture information, which is crucial in the colorization process and conducive to restoring image details. In addition, shallow feature information can generate preliminary color information for the image, which helps the model obtain the color distribution of different regions and improves the accuracy of the image colors. DenseNet121 [35] adopts a densely connected network structure, with each layer directly connected to all previous layers. This structure realizes feature reuse and reduces information loss during feature extraction. DenseNet121 is widely chosen for its strong performance, efficient parameter usage, and ease of modification and adaptation. Its dense connections enhance gradient flow and feature reuse, although color reproduction accuracy is not its primary design goal. The attention module enables the model to enhance or suppress elements in the extracted features, strengthening its focus on features useful for colorization. The attention module can also assign higher weights to texture details so that the model focuses on high-frequency regions and the colorized image retains more detailed information. In this paper, a convolution layer is introduced between the first three blocks (TBs) of DenseNet121 and the efficient channel attention block (ECAB) [36]. We then perform element-wise multiplication between the feature outputs of the convolution layer and those of the attention module. Based on these improvements, we build the SFEB structure shown in Figure 3. The L-channel image X_L is input to the TBs and then passed through the convolution layer to obtain the feature information F_c. This calculation can be represented by Equation (7):
F_c = C_{1×1}(T_3(X_L))    (7)
where T_3 represents the TBs. The feature F_c then passes through the ECAB to obtain the feature weights W_a, which are multiplied element-wise with F_c to obtain the shallow features F_s. This process assigns weights to elements in different regions of the image, capturing feature information that is useful for colorization. The calculation can be represented by Equations (8) and (9):
W_a = A_{ECA}(F_c)    (8)
F_s = F_c ⊗ W_a    (9)
where A_{ECA} represents the efficient channel attention mechanism, and ⊗ denotes element-wise multiplication.
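The SFEB pipeline of Equations (7)–(9) can be sketched in PyTorch as below. The mapping of the paper's first three blocks (TBs) onto DenseNet121 layers, the output channel count, and the attention kernel size are our assumptions; the single L channel is repeated three times because the DenseNet stem expects RGB input.

```python
import torch
import torch.nn as nn
import torchvision


class ECABlock(nn.Module):
    """Minimal efficient channel attention: returns per-channel weights W_a (see Section 3.3)."""

    def __init__(self, k: int = 5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.pool(x).squeeze(-1).transpose(-1, -2)                       # (B, 1, C)
        return torch.sigmoid(self.conv(w)).transpose(-1, -2).unsqueeze(-1)   # (B, C, 1, 1)


class SFEB(nn.Module):
    """Sketch of the shallow feature extraction block: DenseNet121 front end + ECAB (Eqs. 7-9)."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        backbone = torchvision.models.densenet121(weights=None).features
        # Assumption: the first three blocks (TBs) correspond to the layers up to denseblock3,
        # whose output has 1024 channels.
        self.tb = nn.Sequential(*list(backbone.children())[:9])
        self.conv1x1 = nn.Conv2d(1024, out_channels, kernel_size=1)
        self.eca = ECABlock(k=5)

    def forward(self, x_l: torch.Tensor) -> torch.Tensor:
        x = x_l.repeat(1, 3, 1, 1)          # repeat the L channel to fit the 3-channel stem
        f_c = self.conv1x1(self.tb(x))      # Eq. (7)
        w_a = self.eca(f_c)                 # Eq. (8)
        return f_c * w_a                    # Eq. (9): element-wise weighting


# Example: a 256x256 L-channel image yields shallow features of shape (1, 64, 16, 16).
print(SFEB()(torch.randn(1, 1, 256, 256)).shape)
```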

3.3. Channel Attention Module

To improve the utilization of image feature information, this paper introduces the ECAB in the shallow feature extraction module. The ECAB is a lightweight attention module that uses one-dimensional convolution to facilitate information exchange between channels, avoiding the negative impact of dimensionality reduction and improving the performance of convolutional neural networks with a small number of parameters. To obtain more effective shallow feature information, the ECAB is used to enhance useful information in the detailed features. The ECAB structure is shown in Figure 4. H, W, and C represent the height, width, and number of channels of the feature map, respectively. Given the value of C for a feature map, the size of the convolution kernel k can be determined, as expressed by Equation (10):
k = φ(C) = |log_2(C)/γ + b/γ|_odd    (10)
where γ and b are constant coefficients, experimentally set to 2 and 1, respectively; with these settings, the convolution kernel size is 5 for the input features F_c. The hyperparameters γ and b control how the kernel size scales with the number of channels. The specific choice of γ = 2 and b = 1 is motivated by the following three aspects (a small implementation sketch follows the list).
  • Balancing Receptive Field and Efficiency: The term log_2(C) ensures that the kernel size grows logarithmically with the channel count, preventing excessively large kernels in high-dimensional spaces. Dividing by γ = 2 moderates the growth rate, keeping the kernel compact while still allowing sufficient cross-channel interaction. The bias term b = 1 ensures that even for small C, the kernel size does not collapse to an ineffective value.
  • Empirical Performance: Experiments show that this configuration provides a good trade-off between model accuracy and computational cost. A larger γ would shrink the kernel too aggressively, while a smaller γ would make it unnecessarily large.
  • Ensuring Odd Kernel Sizes: The |·|_odd operation rounds the result to the nearest odd integer, ensuring symmetry in the 1D convolution.
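A small sketch of the adaptive kernel-size rule and the resulting ECAB is given below; the function and class names are ours, and the example channel counts are only illustrative.

```python
import math
import torch
import torch.nn as nn


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Eq. (10): k = |log2(C)/gamma + b/gamma|_odd, taking the next odd value when even."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1


class ECAB(nn.Module):
    """Efficient channel attention block: global average pooling + 1D conv across channels."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = eca_kernel_size(channels, gamma, b)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.pool(x).squeeze(-1).transpose(-1, -2)      # (B, 1, C) channel descriptor
        w = torch.sigmoid(self.conv(w))                     # cross-channel interaction
        return x * w.transpose(-1, -2).unsqueeze(-1)        # re-weight the input channels


# With gamma = 2 and b = 1, e.g. 256 or 1024 channels both give k = 5.
print(eca_kernel_size(256), eca_kernel_size(1024))  # 5 5
```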
Figure 4. The ECAB structure.

3.4. Residual Attention U-Net

The residual block can prevent gradient vanishing as the network deepens and enables the model to capture more feature information. The attention mechanism can assign different weights to the elements in the feature information, allowing the model to focus on important features. Therefore, we integrate the ECAB into a residual block and design the residual attention block (RAB) on this basis. Its structure is shown in Figure 5. The input feature F_in passes sequentially through a convolution layer, an activation function, another convolution layer, and the ECAB to obtain the feature information F_a. The output feature F_out is obtained by adding F_a and F_in. This process can be expressed by Equations (11) and (12):
F_a = ECAB(C_{3×3}(ψ(C_{3×3}(F_{in}))))    (11)
F_{out} = F_{in} ⊕ F_a    (12)
where C_{3×3} represents the 3 × 3 convolution operation, ψ represents the ReLU activation function, and ⊕ denotes element-wise addition.
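A self-contained sketch of the RAB is shown below; it inlines the same efficient channel attention as in the Section 3.3 sketch, with a fixed kernel size of 5 assumed for brevity.

```python
import torch
import torch.nn as nn


class RAB(nn.Module):
    """Sketch of the residual attention block (Eqs. 11-12): conv -> ReLU -> conv -> ECAB, plus skip."""

    def __init__(self, channels: int, k: int = 5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Efficient channel attention applied to the convolved features.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attn = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        f = self.body(f_in)                                    # C_3x3(ReLU(C_3x3(F_in)))
        w = self.pool(f).squeeze(-1).transpose(-1, -2)         # (B, 1, C) channel descriptor
        w = torch.sigmoid(self.attn(w)).transpose(-1, -2).unsqueeze(-1)
        f_a = f * w                                            # Eq. (11): channel re-weighting
        return f_in + f_a                                      # Eq. (12): residual addition


# The block preserves spatial size and channel count, e.g. (1, 512, 2, 2) in and out.
print(RAB(512)(torch.randn(1, 512, 2, 2)).shape)
```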
U-Net [37] is composed of an encoder and a decoder, and this symmetrical structure is conducive to extracting image context information. U-Net connects the same dimensional levels in the encoder and decoder through skip connections to effectively utilize the shallow and deep features of the image.
To enhance the model’s representational capacity and obtain more accurate semantic expressions, we improve the U-Net structure. To enable the model to learn the relationships between color information in different regions, higher weights are assigned to color information that is beneficial for colorization. In this way, color information can be fully utilized. We add RABs between the encoder and decoder and design a residual attention U-Net.
Convolution operations are used throughout the residual attention U-Net. A feature map of size 256 × 256 with 64 channels is input. Each strided convolution halves the spatial size, and after 7 downsampling layers, the size is reduced to 2 × 2 with 512 channels. The RABs keep both the size and the channel count unchanged. To ensure that the input and output sizes are consistent, deconvolution is used for upsampling; after 7 upsampling layers, the feature map is restored to the original size with 64 channels. Finally, the feature information extracted during downsampling and upsampling is fused via skip connections to fully utilize both the shallow features and the high-level semantic information of the image.
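The shape arithmetic above (256 → 2 over seven stride-2 stages and back) can be checked with the toy encoder/decoder below. The channel widths are illustrative, and skip connections and RABs are omitted; this is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Seven stride-2 convolutions halve a 256x256 feature map down to 2x2; seven stride-2
# transposed convolutions restore the original size.
enc = nn.Sequential(*[
    nn.Conv2d(64 if i == 0 else 512, 512, kernel_size=4, stride=2, padding=1)
    for i in range(7)
])
dec = nn.Sequential(*[
    nn.ConvTranspose2d(512, 512 if i < 6 else 64, kernel_size=4, stride=2, padding=1)
    for i in range(7)
])

x = torch.randn(1, 64, 256, 256)
z = enc(x)
print(z.shape)        # torch.Size([1, 512, 2, 2])
print(dec(z).shape)   # torch.Size([1, 64, 256, 256])
```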

3.5. Loss Function and Training Strategy

The loss function quantifies the disparity between predicted and actual pixel values, with smaller values indicating better model performance. Mean Square Error (MSE) is employed as a quantitative metric, and model parameters are optimized by minimizing the MSE between predicted and ground-truth images. It should be noted that MSE loss has a well-known limitation: it tends to produce overly smooth and desaturated colorization results due to its averaging of pixel-wise differences. This is a potential shortcoming of our current method, as it may lead to reduced color vividness in some cases. While MSE loss ensures high fidelity in terms of pixel accuracy, it does not fully capture perceptual color quality. We discuss this limitation further in the Discussion section and propose improvement directions in Future Work. For image X, MSE can be represented by Equation (13):
MSE = (1/n) ∑_{i=1}^{n} (X_i − X̂_i)²    (13)
where X_i and X̂_i denote the i-th pixel values of the target image X and the generated image X̂, respectively, and n is the number of pixels.
During training, we employ dropout to stochastically deactivate the outputs of selected neurons, thereby mitigating inter-neuron dependencies and enhancing model generalization. To prevent overfitting, dropout is applied to the decoder with a rate of 0.5. Additionally, while batch normalization (BN) is commonly used to normalize input data by calculating the mean and variance within a batch, it relies heavily on the batch size and can be affected by other images in the batch. To address these limitations, we adopt Instance Normalization (IN), which normalizes each image independently, reducing inter-image dependencies and more accurately reflecting the numerical distribution of individual images, thereby improving convergence. Replacing BN with IN significantly influences the training dynamics, particularly the convergence behavior, sensitivity to hyperparameters, and generalization. During training, images are cropped to a fixed size, and the network uses 8 Residual Attention Blocks (RABs). Optimization is performed using the Adam optimizer with an initial learning rate of 0.0001. When the loss stops decreasing, the learning rate is automatically reduced by a factor of 10 until convergence is achieved over 200 training iterations.
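A minimal training-loop sketch for this setup is shown below. The tiny stand-in model and the dummy batches are placeholders; only the MSE loss, the Adam optimizer (lr = 1e-4), the 0.5 dropout rate, IN layers, and the plateau-based 10x learning-rate decay follow the description in this section.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                                  # stand-in for the residual attention U-Net
    nn.Conv2d(1, 64, 3, padding=1),
    nn.InstanceNorm2d(64),                              # IN normalizes each image independently
    nn.ReLU(inplace=True),
    nn.Dropout2d(p=0.5),                                # dropout rate used in the decoder
    nn.Conv2d(64, 2, 3, padding=1),                     # predict the two ab channels from L
)
criterion = nn.MSELoss()                                # Eq. (13)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# Dummy (L, ab) batches standing in for the real data loader.
train_loader = [(torch.randn(4, 1, 64, 64), torch.randn(4, 2, 64, 64)) for _ in range(8)]

for epoch in range(200):
    epoch_loss = 0.0
    for l_chan, ab_true in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(l_chan), ab_true)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)            # reduce lr by 10x when the loss stops decreasing
```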

4. Experiment

In this paper, the public datasets summer2winter [38], NCData [39] and COCO-Stuff [40] are used. The summer2winter dataset includes 1231 color images for training and 309 color images for testing. The NCData dataset includes 721 color images, which are randomly divided into training and testing sets in an 8:2 ratio. The original COCO dataset provides instance-level annotations for 80 thing classes, and COCO-Stuff adds dense pixel-wise annotations for 91 stuff classes. The experiments were conducted on a 64-bit Ubuntu system with an Intel(R) Core(TM) i9-10900X CPU @ 3.70 GHz and a GeForce RTX 3090 Ti graphics card, using Python 3.8.0, PyTorch 1.11.0 and CUDA 11.3.

4.1. Comparisons with Other Methods

4.1.1. The summer2winter Dataset

To verify the effectiveness of our method, we compared it with several methods, including pix2pix [24], Memo (MemoPainter) [25], and the method of [26], on the summer2winter dataset. For a fair evaluation, all methods were trained and tested on the same dataset. The objective evaluation metrics are calculated and averaged over the test set, as shown in Table 1. We use PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index Measure) and LPIPS (Learned Perceptual Image Patch Similarity) to evaluate the colorization methods. PSNR evaluates the degree of image distortion by calculating the peak signal-to-noise ratio and serves as an objective measure of image quality. SSIM evaluates image similarity by considering brightness, contrast and structure, which is closer to human visual perception. LPIPS takes two images as input and outputs a perceptual similarity score between them, computed by comparing the feature representations of the two images in a deep neural network.
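These three metrics can be computed per image pair as in the sketch below, assuming float RGB images in [0, 1]; it uses scikit-image for PSNR/SSIM and the lpips package (AlexNet backbone) for LPIPS, and the helper function name is ours.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")             # perceptual metric backbone


def evaluate_pair(pred_rgb: np.ndarray, gt_rgb: np.ndarray) -> dict:
    """PSNR / SSIM / LPIPS for one colorized image against its ground truth (H, W, 3 in [0, 1])."""
    psnr = peak_signal_noise_ratio(gt_rgb, pred_rgb, data_range=1.0)
    ssim = structural_similarity(gt_rgb, pred_rgb, channel_axis=-1, data_range=1.0)

    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1).unsqueeze(0).float() * 2 - 1
    lp = lpips_fn(to_tensor(pred_rgb), to_tensor(gt_rgb)).item()

    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```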
Compared with the best competing method, the average PSNR and SSIM increase by 1.32 dB and 0.0139, respectively, while the average LPIPS decreases by 0.01. These results show that the colorized images are more consistent with the real images in terms of distortion, structure and perceptual similarity.
The subjective visual results of the colorized images are shown in Figure 6. In the images colored by the pix2pix [24] method (Column 2), the lake in the first row and the ground areas in the fourth row appear yellow. In the images colored by the Memo [25] method (Column 3), the lake surfaces in Rows 1 and 2 are yellow, the image in Row 3 is dim, and the image in Row 4 is dark blue. In the images colored by the method of [26] (Column 4), the trees in the first and fourth rows are fluorescent green, while the lake in the second row and the grass in the fourth row are dim. The images colored by our method (Column 5) exhibit a natural and realistic effect. The pix2pix results (Column 2) differ considerably from the original images (Column 6), as do the Column 4 results in Rows 2, 3 and 4, especially in the second row; these images need improvement in terms of detail. In contrast, our colorized images are closer to the real images in the sky, lake, mountain and ground areas. This is because the enhancement of shallow features allows our method to retain more detailed information, and the residual attention U-Net extracts more effective feature and semantic information, making the colors richer, more realistic and more natural.

4.1.2. The NCData Dataset

To further verify the effectiveness of the proposed method, experiments were carried out on the NCData dataset. As shown in Table 2, compared with the second-best method, the average PSNR increases by 1.19 dB and the average LPIPS decreases by 0.007; although the average SSIM is slightly lower than that of EnCycleGAN [42], the overall improvement is clear. A possible reason for the slightly lower SSIM on NCData is that its images have simple structures and smooth textures, which makes SSIM less discriminative for this content. These comparison results show that the proposed method performs better overall.
The colorization results are shown in Figure 7. The images colored by our method (Column 5) exhibit rich, realistic, and natural colors. The corn, cherry and strawberry plants appear close to the real images. Although there are slight differences, the visual perception is not adversely affected. The other images reproduce most of the colors in the real images, with only minor distortions, such as in the carrot stems and eggplant leaves. Therefore, the proposed method yields rich and realistic colors on the NCData dataset while preserving more details and achieving better subjective visual effects.
We also compared our method with the recent approach in [41]. The subjective and objective comparisons are shown in Figure 8 and in the last two rows of Table 2, respectively. Our approach achieves better PSNR values despite lower SSIM. The lower SSIM may be attributed to the following two reasons:
  • SSIM is the structural similarity metric for images. Since the images in the NCData dataset contain mostly smooth regions with relatively few complex textures and edge contours, the larger deviation in this metric does not significantly impact perceived image quality, as the reduction in structural similarity has a minimal effect.
  • The experiments on the NCData dataset primarily aim to demonstrate that our method achieves better fidelity and more natural colorization. Therefore, PSNR and LPIPS are the more suitable metrics for evaluating the quality of the colorized images, and Figure 8 shows that the visual effect of our method is noticeably better than that of the method in [41].
Figure 8. Comparisons of our method with [41] on three images (Apple25, Cherry12 and Brinjal19) in NCData dataset (From left to right: The Original, gray, Reference [41] and ours).

4.1.3. The COCO-Stuff Dataset

For a fair comparison, we also conduct comparative experiments with recent image colorization algorithms on the COCO-Stuff dataset. The COCO-Stuff dataset is a substantial extension of the widely used Microsoft Common Objects in Context (COCO) dataset [40]. While the original COCO dataset focused primarily on thing classes (countable objects such as people, cars, and animals), COCO-Stuff augments it with dense pixel-level annotations for stuff classes, i.e., amorphous background materials and surfaces such as sky, grass, walls, and floors. This comprehensive annotation enables more holistic scene understanding, bridging the gap between object recognition and scene parsing. Table 3 shows that our method achieves the best performance among the compared methods.

4.2. Ablation Experiments

4.2.1. Comparisons of Various Modules

The network model is composed of the residual attention U-Net and the SFEB. To demonstrate the rationality and validity of the designed network model, we compare the image colorization effects before and after adding each module. The PSNR, SSIM and LPIPS over the whole test set are calculated and averaged, and the comparisons are shown in Table 4. The baseline is a U-Net with 7 upsampling and downsampling layers.
From Table 4, it can be seen that after adding RAB, the average values of PSNR and SSIM increased by 0.03 dB and 0.0010 respectively, while the average value of LPIPS decreased by 0.002. This is because RAB helps the model focus more on useful features for colorization in the image while enhancing the information transfer between the encoder and decoder in U-Net. The model can selectively retain image feature information, which helps reduce information loss and improve the quality of colorized images. After adding SFEB, the average values of PSNR and SSIM increased by 0.17 dB and 0.0014 respectively, while the average value of LPIPS decreased by 0.004. Shallow features contain detailed information from the image, and adding SFEB can enhance the extraction of shallow feature information, which helps the model better restore image details. It can also serve as a supplement to deep features, which is beneficial for the model to extract rich textures and edges, thereby generating more natural color images. After improving U-Net by incorporating both the SFEB and RAB, the objective evaluation index is optimal. The above results indicate that the improved U-Net model, with the addition of SFEB and RAB simultaneously, has the best effect. The disparity between the colorized image and the original image is negligible, while their structural resemblance remains intact.
To further validate the statistical significance of the improvements observed in Table 4, we performed paired t-tests between the baseline and each variant across the test set. The results indicate that the improvements in PSNR, SSIM, and LPIPS are statistically significant with p-values < 0.05 for all comparisons. Specifically, the addition of RAB yields p-values of 0.032 (PSNR), 0.028 (SSIM), and 0.041 (LPIPS), while the addition of SFEB yields p-values of 0.017 (PSNR), 0.022 (SSIM), and 0.035 (LPIPS). The full model with both modules achieves p-values < 0.01 across all three metrics. These results confirm that the observed improvements are not due to random variation and demonstrate the effectiveness of the proposed modules.
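A paired t-test of this kind can be run with SciPy as sketched below; the per-image PSNR arrays here are hypothetical placeholders standing in for the actual test-set scores of the baseline and the full model.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-image PSNR scores for the baseline and the full model on the same test images.
psnr_baseline = np.array([24.8, 25.3, 24.1, 25.9, 24.6])
psnr_full = np.array([25.2, 25.6, 24.5, 26.4, 25.0])

t_stat, p_value = ttest_rel(psnr_full, psnr_baseline)   # paired t-test over matched images
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")            # p < 0.05 indicates a significant gain
```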
The visual effects are shown in Figure 9. In the color images generated with both the RAB and SFEB modules (Column 5), the color of the sky in the first row and the clouds in the second row are relatively close to the real images. In the color images generated without the RAB and SFEB modules (Column 2), the sky in the first row is lighter and the left mountain peak is yellowish. In the color images generated with the RAB module (Column 3), the sky in the first row and the clouds in the second row are relatively dim. In the color images generated with the SFEB module (Column 4), the sky in the second row is darker. From these comparisons, the incorporation of the shallow feature extraction module and the residual attention module facilitates the transfer of local information to deep features while assigning weights to features at various positions and fusing them together. This enhances the model's perception of image details and contextual information, resulting in more realistic colorized images with considerably improved visual effects.

4.2.2. Comparisons for Different Layers

We also compare the colorization effects of U-Net with different numbers of upsampling and downsampling layers. The PSNR, SSIM, and LPIPS values over the entire test set are calculated and averaged, and the comparison results are shown in Table 5. The gains across different layer configurations are relatively small, indicating that the exact number of layers is not critical. It can be observed that setting the number of upsampling and downsampling layers to 7 yields the best objective evaluation metrics. As the number of U-Net layers increases, the receptive field gradually expands, allowing the model to extract richer deep feature information and obtain more accurate semantic representation. Additionally, more layers facilitate greater feature reuse and deeper feature fusion, enabling the model to comprehensively utilize multi-level features, thereby improving colorization accuracy and overall effect.
The visual effects of the colorized images are shown in Figure 10. When the number of downsampling layers is set to 7 (Column 5), the color of the lake in the first row and the colors of the clouds and grasslands in the second row are close to the real images. When the number of downsampling layers is set to 4 (Column 2), the lake in the first row is slightly green and the clouds in the second row are slightly red. When the number of downsampling layers is set to 5 (Column 3), the sky in the first row is darker. When the number of downsampling layers is set to 6 (Column 4), the peaks in the second row are slightly white. These visualization results show that as the number of upsampling and downsampling layers increases, the model extracts deeper feature information from the images, demonstrating better performance in both color restoration and realism.

4.2.3. Residual Attention U-Net Partitioning

To investigate the impact of the number of downsampling layers in the low-level network of the residual attention U-Net on colorization performance, we configure the low-level network with different numbers of downsampling layers. We calculate the average PSNR, SSIM, and LPIPS on the test set, and the results are shown in Table 6. When the number of layers in the low-level network is set to 3, although the average PSNR is slightly lower than that with 2 layers, the average SSIM and LPIPS are improved. This indicates that with three layers, the model achieves a better balance between image details and global information: features are more fully integrated, and the colorization effect is clearly enhanced.

5. Discussion

Automatic colorization is one of the most interesting problems in computer graphics. This paper presents an automatic colorization method based on an improved residual attention U-Net that colorizes grayscale images in an end-to-end manner. The proposed method achieves better subjective visual effects and objective metrics, with richer colors and better-preserved details in the colorized images, resulting in higher overall quality and more natural colors. This research provides a new method and direction for automatic colorization technology.

6. Conclusions

In this paper, we design an image colorization method based on a residual attention U-Net to make grayscale images more visually appealing and realistic. The method uses the SFEB to extract shallow feature information from the image and fuses these shallow features with the low-level features extracted by the low-level network of the residual attention U-Net to generate more vibrant color images. However, this paper studies only the single task of colorization, using models trained on small datasets. Future work will consider jointly reconstructing grayscale images (e.g., through denoising, inverse halftoning, and super-resolution) and colorizing them in a single end-to-end model. Additionally, we will use more diverse datasets to further enhance the model's practicality.

Author Contributions

Conceptualization, D.Z.; methodology, J.Y.; software, F.W.; validation, J.Y., D.Z. and L.Y.; formal analysis, L.Y.; investigation, D.Z.; resources, F.W.; data curation, D.Z.; writing—original draft preparation, D.Z.; writing—review and editing, J.Y.; visualization, F.W.; supervision, J.Y. and L.Y.; project administration, J.Y.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation Project of China grant number 62302197.

Data Availability Statement

Our code repository includes pre-trained weights, and the pre-trained models are available upon request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
SFEB	Shallow Feature Extraction Block
RAB	Residual Attention Block
ECAB	Efficient Channel Attention Block
CNN	Convolutional Neural Networks
MSE	Mean Square Error
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index Measure
LPIPS	Learned Perceptual Image Patch Similarity
BN	Batch Normalization
IN	Instance Normalization
RED	Residual Encoder–Decoder
SP-GAN	Self-attention-based Progressive Generative Network
SGRU	Swish-gated Residual U-Net
SNGAN	Spectrally Normalized GAN
GECNet	Grayscale Enhancement Colorization Network
TIC	Text-guided Image Colorization
HAC-Net	Hybrid Attention Network with Color Query

References

  1. Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 2016, 35, 110–111. [Google Scholar] [CrossRef]
  2. Zhang, R.; Isola, P.; Efros, A.A. Colorful Image Colorization. In Proceedings of the 14th European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2016; pp. 649–666. [Google Scholar]
  3. Deshpande, A.; Lu, J.; Yeh, M.C.; Chong, M.J.; Forsyth, D. Learning Diverse Image Colorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 6837–6845. [Google Scholar]
  4. Liu, X.; Wan, L.; Qu, Y.; Wong, T.T.; Lin, S.; Leung, C.S.; Heng, P.A. Intrinsic colorization. ACM Trans. Graph. 2008, 27, 1–9. [Google Scholar] [CrossRef]
  5. Chia, A.Y.S.; Zhuo, S.; Gupta, R.K.; Tai, Y.W.; Cho, S.Y.; Tan, P.; Lin, S. Semantic colorization with internet images. ACM Trans. Graph. 2011, 30, 1–8. [Google Scholar] [CrossRef]
  6. Zou, A.; Shen, X.; Zhang, X.; Wu, Z. Neutral Color Correction Algorithm for Color Transfer Between Multicolor Images. In Advances in Graphic Communication, Printing and Packaging Technology and Materials; Springer: Singapore, 2021; pp. 176–182. [Google Scholar]
  7. Zhao, J.; Liu, L.; Snoek, C.G.M.; Han, J.; Shao, L. Pixel-level Semantics Guided Image Colorization. arXiv 2018, arXiv:1808.01597. [Google Scholar] [CrossRef]
  8. Levin, A.; Lischinski, D.; Weiss, Y. Colorization using optimization. ACM Trans. Graph. 2004, 23, 689–694. [Google Scholar] [CrossRef]
  9. Huang, Y.C.; Tung, Y.S.; Chen, J.C.; Wang, S.W.; Wu, J.L. An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th Annual ACM International Conference on Multimedia; Association for Computing Machinery: New York, NY, USA, 2005; pp. 351–354. [Google Scholar]
  10. Yatziv, L.; Sapiro, G. Fast image and video colorization using chrominance blending. IEEE Trans. Image Process. 2006, 15, 1120–1129. [Google Scholar] [CrossRef]
  11. Kim, T.H.; Lee, K.M.; Lee, S.U. Edge-preserving colorization using data-driven Random Walks with Restart. In Proceedings of the 16th IEEE International Conference on Image Processing; IEEE: Piscataway, NJ, USA, 2009; pp. 1661–1664. [Google Scholar]
  12. Kim, T.; Lee, K.; Lee, S. Generative Image Segmentation Using Random Walks with Restart. In Proceedings of the 10th European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2008; pp. 264–275. [Google Scholar]
  13. Ironi, R.; Cohen-Or, D.; Lischinski, D. Colorization by Example. Render. Tech. 2005, 29, 201–210. [Google Scholar]
  14. Xu, Z.; Wang, T.; Fang, F.; Sheng, Y.; Zhang, G. Stylization-Based Architecture for Fast Deep Exemplar Colorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 9363–9372. [Google Scholar]
  15. Welsh, T.; Ashikhmin, M.; Mueller, K. Transferring color to greyscale images. ACM Trans. Graph. 2002, 21, 277–280. [Google Scholar] [CrossRef]
  16. Lin, X.; Sun, S.; Huang, W.; Sheng, B.; Li, P.; Feng, D.D. EAPT: Efficient attention pyramid transformer for image processing. IEEE Trans. Multimed. 2021, 25, 50–61. [Google Scholar] [CrossRef]
  17. Xie, Z.; Zhang, W.; Sheng, B.; Li, P.; Chen, C.P. BaGFN: Broad attentive graph fusion network for high-order feature interactions. IEEE Trans. Neural Netw. Learn. Syst. 2021, 34, 4499–4513. [Google Scholar] [CrossRef]
  18. Al-Jebrni, A.H.; Ali, S.G.; Li, H.; Lin, X.; Li, P.; Jung, Y.; Kim, J.; Feng, D.D.; Sheng, B.; Jiang, L.; et al. Sthy-net: A feature fusion-enhanced dense-branched modules network for small thyroid nodule classification from ultrasound images. Vis. Comput. 2023, 39, 3675–3689. [Google Scholar] [CrossRef]
  19. Zhou, Y.; Chen, Z.; Li, P.; Song, H.; Chen, C.P.; Sheng, B. FSAD-Net: Feedback spatial attention dehazing network. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 7719–7733. [Google Scholar] [CrossRef]
  20. Huang, S.; Liu, X.; Tan, T.; Hu, M.; Wei, X.; Chen, T.; Sheng, B. TransMRSR: Transformer-based self-distilled generative prior for brain MRI super-resolution. Vis. Comput. 2023, 39, 3647–3659. [Google Scholar] [CrossRef]
  21. Cheng, Z.; Yang, Q.; Sheng, B. Deep Colorization. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2015; pp. 415–423. [Google Scholar]
  22. Wu, M.; Jin, X.; Jiang, Q.; Lee, S.j.; Liang, W.; Lin, G.; Yao, S. Remote sensing image colorization using symmetrical multi-scale DCGAN in YUV color space. Vis. Comput. 2021, 37, 1707–1729. [Google Scholar] [CrossRef]
  23. Wang, F.; Geng, S.; Zhang, D.; Zhou, M. Automatic colorization for Thangka sketch-based paintings. Vis. Comput. 2024, 40, 761–779. [Google Scholar] [CrossRef]
  24. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 1125–1134. [Google Scholar]
  25. Yoo, S.; Bahng, H.; Chung, S.; Lee, J.; Chang, J.; Choo, J. Coloring With Limited Data: Few-Shot Colorization via Memory-Augmented Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2019; pp. 11283–11292. [Google Scholar]
  26. Xia, M.; Hu, W.; Wong, T.T.; Wang, J. Disentangled image colorization via global anchors. ACM Trans. Graph. 2022, 41, 1–13. [Google Scholar] [CrossRef]
  27. Zhong, X.; Lu, T.; Huang, W.; Ye, M.; Lin, C.W. Grayscale Enhancement Colorization Network for Visible-infrared Person Re-identification. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1418–1430. [Google Scholar] [CrossRef]
  28. Sharma, M.; Makwana, M.; Upadhyay, A.; Pratap Singh, A.; Badhwar, A.; Trivedi, A.; Saini, A.; Chaudhury, S. Robust image colorization using self attention based progressive generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  29. Kumar, H.; Banerjee, A.; Saurav, S.; Singh, S. ParaColorizer-Realistic image colorization using parallel generative networks. Vis. Comput. 2024, 40, 4039–4054. [Google Scholar] [CrossRef]
  30. Guo, H.; Guo, Z.; Pan, Z.; Liu, X. Bilateral Res-Unet for image colorization with limited data via GANs. In Proceedings of the IEEE 33rd International Conference on Tools with Artificial Intelligence; IEEE: Piscataway, NJ, USA, 2021; pp. 729–735. [Google Scholar]
  31. Liu, G.; Chen, X.; Hu, Y. Anime Sketch Coloring with Swish-gated Residual U-net and Spectrally Normalized GAN. Eng. Lett. 2019, 27, 1–7. [Google Scholar]
  32. Kang, X.; Yang, T.; Ouyang, W.; Ren, P.; Li, L.; Xie, X. Ddcolor: Towards photo-realistic image colorization via dual decoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 328–338. [Google Scholar]
  33. Kim, G.; Kang, K.; Kim, S.; Lee, H.; Kim, S.; Kim, J.; Baek, S.H.; Cho, S. Bigcolor: Colorization using a generative color prior for natural images. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 350–366. [Google Scholar]
  34. Weng, S.; Zhang, P.; Li, Y.; Li, S.; Shi, B.; Chang, Z. L-cad: Language-based colorization with any-level descriptions using diffusion priors. Adv. Neural Inf. Process. Syst. 2023, 36, 77174–77186. [Google Scholar]
  35. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2017; pp. 4700–4708. [Google Scholar]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 11534–11542. [Google Scholar]
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  38. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2017; pp. 2223–2232. [Google Scholar]
  39. Anwar, S.; Tahir, M.; Li, C.; Mian, A.; Khan, F.S.; Muzaffar, A.W. Image colorization: A survey and dataset. Inf. Fusion 2025, 114, 102720. [Google Scholar] [CrossRef]
  40. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 1209–1218. [Google Scholar]
  41. Li, B.; Lu, Y.; Pang, W.; Xu, H. Image Colorization using CycleGAN with semantic and spatial rationality. Multimed. Tools Appl. 2023, 82, 21641–21655. [Google Scholar] [CrossRef]
  42. Zhou, L. An enhanced CycleGAN approach for landscape design: Style transfer and color harmonization. Alex. Eng. J. 2025, 133, 225–238. [Google Scholar] [CrossRef]
  43. Ghosh, S.; Roy, P.; Bhattacharya, S.; Pal, U.; Blumenstein, M. TIC: Text-guided image colorization using conditional generative model. Multimed. Tools Appl. 2024, 83, 41121–41136. [Google Scholar] [CrossRef]
  44. Yun, J.; Lee, S.; Park, M.; Choo, J. iColoriT: Towards propagating local hints to the right region in interactive colorization by leveraging vision transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 1787–1796. [Google Scholar]
  45. Zhao, T.; Li, G.; Zhao, S. End-to-end image colorization with multiscale pyramid transformer. IEEE Trans. Multimed. 2024, 26, 11332–11344. [Google Scholar] [CrossRef]
  46. Cong, X.; Wu, Y.; Chen, Q.; Lei, C. Automatic controllable colorization via imagination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 2609–2619. [Google Scholar]
  47. Kochkombaev, B.; Dizdaroğlu, B. Image Colorization with an Attention-Mechanism-Based Encoder and Decoder Approach. In Proceedings of the 2025 16th International Conference on Electrical and Electronics Engineering (ELECO); IEEE: Piscataway, NJ, USA, 2025; pp. 1–5. [Google Scholar]
  48. Wang, Y.; Wang, H.; Liu, R.; Wang, S.; Liu, Y. A Hybrid Attention Network with Color Query for Image Colorization. In Proceedings of the 2025 44th Chinese Control Conference (CCC); IEEE: Piscataway, NJ, USA, 2025; pp. 8068–8073. [Google Scholar]
Figure 1. The proposed network structure based on residual attention U-Net.
Figure 2. The proposed feature fusion block.
Figure 3. The improved SFEB structure.
Figure 5. The proposed RAB structure.
Figure 6. Comparisons of methods on several images in summer2winter dataset (From left to right: Gray, pix2pix [24], Memo [25], Reference [26], ours and ground truth).
Figure 7. Comparisons of methods on several images in NCData dataset (From left to right: Gray, pix2pix [24], Memo [25], Reference [26], ours and the original).
Figure 9. Comparison results for different blocks (From left to right: Gray, baseline, baseline + RAB, baseline + SFEB, baseline + RAB + SFEB and the original).
Figure 10. Comparison results for different numbers of layers (From left to right: grayscale, 4 layer, 5 layer, 6 layer, 7 layer, real and original).
Table 1. Comparisons of different models on the summer2winter dataset. ↑: higher is better; ↓: lower is better. The best, second best, and third best results in each column are marked in red, blue, and green, respectively.

Methods | PSNR/dB ↑ | SSIM ↑ | LPIPS ↓
pix2pix [24] | 23.24 | 0.8859 | 0.108
Memo [25] | 21.91 | 0.8808 | 0.136
Reference [26] | 24.02 | 0.9164 | 0.104
CycleGAN [41] | 22.31 | 0.8600 | 0.250
Ours | 25.34 | 0.9303 | 0.094
Table 2. Comparisons of methods on the NCData dataset. ↑: higher is better; ↓: lower is better. The best, second best, and third best results in each column are marked in red, blue, and green, respectively.

Method | PSNR/dB ↑ | SSIM ↑ | LPIPS ↓
pix2pix [24] | 24.01 | 0.8944 | 0.074
Memo [25] | 22.18 | 0.8696 | 0.101
Reference [26] | 24.03 | 0.9150 | 0.078
EnCycleGAN [42] | 24.54 | 0.9358 | –
TIC [43] | 23.27 | 0.9170 | 0.133
Ours | 25.22 | 0.9046 | 0.071
Table 3. Comparisons of methods on the COCO-Stuff dataset. ↑: higher is better; ↓: lower is better. The best, second best, and third best results in each column are marked in red, blue, and green, respectively.

Method | PSNR/dB ↑ | SSIM ↑ | LPIPS ↓
DDColor [32] | 23.10 | 0.9040 | 0.169
BigColor [33] | 21.35 | 0.8818 | 0.231
L-CAD [34] | 24.13 | 0.9101 | 0.187
iColoriT [44] | 23.34 | 0.7522 | 0.433
Color-Attention [45] | 23.85 | – | –
Reference [46] | 23.30 | 0.8590 | 0.180
Reference [47] | 21.92 | 0.7584 | 0.197
HAC-Net [48] | 23.51 | – | –
Ours | 24.22 | 0.9127 | 0.166
Table 4. Comparison results for different modules. ↑: higher is better; ↓: lower is better. The best, second best, and third best results in each column are marked in red, blue, and green, respectively.

Block | SFEB | RAB | PSNR/dB ↑ | SSIM ↑ | LPIPS ↓
Baseline | | | 25.01 | 0.9265 | 0.101
Baseline | | ✓ | 25.04 | 0.9275 | 0.099
Baseline | ✓ | | 25.18 | 0.9279 | 0.097
Baseline | ✓ | ✓ | 25.34 | 0.9303 | 0.094
Table 5. Comparison results for different numbers of layers. The best results are marked in bold black.

Layers | PSNR/dB | SSIM | LPIPS
4 | 24.41 | 0.9160 | 0.103
5 | 24.89 | 0.9251 | 0.101
6 | 25.01 | 0.9254 | 0.101
7 | 25.01 | 0.9265 | 0.101
Table 6. Comparisons for dividing different numbers of layers. The best results are marked in bold black.

Layers | PSNR/dB | SSIM | LPIPS
2 | 25.35 | 0.9282 | 0.097
3 | 25.34 | 0.9303 | 0.094
4 | 25.20 | 0.9300 | 0.097
5 | 24.78 | 0.9255 | 0.103
6 | 24.99 | 0.9264 | 0.100
7 | 24.77 | 0.9248 | 0.102
