Attentional Colorization Networks with Adaptive Group-Instance Normalization

We propose a novel end-to-end image colorization framework that integrates an attention mechanism and a learnable adaptive normalization function. In contrast to previous colorization methods that directly generate the whole image, we believe that the color of the significant area determines the quality of the colorized image. The attention mechanism uses the attention map obtained from the auxiliary classifier to guide our framework to produce more subtle content and visually pleasing color in salient visual regions. Furthermore, treating colorization as a particular style transfer task, we apply the Adaptive Group Instance Normalization (AGIN) function to help our framework generate vivid colorized images flexibly. Experiments show that our model is superior to previous state-of-the-art models in coloring foreground objects.


Introduction
Colorization is a method of propagating color to a grayscale image, and the colorized image should be reasonable in content and visually comfortable. This problem is highly ill-posed and ambiguous. Under normal circumstances, we can easily draw simple conclusions from the semantics of the scenes and the texture of the objects: the sky and the ocean are blue, and the grass and forests are green. However, for intricate artifacts, it is difficult to reproduce their true color. Moreover, the huge workload of pure hand-painting discourages even dedicated artists, not to mention ordinary users. To solve these problems, an increasing number of researchers have begun to develop automatic coloring methods.
In this paper, we propose a novel end-to-end image colorization framework that integrates an attention mechanism and an adaptive normalization function. Previous learning-based colorization methods generate the entire image directly, ignoring the attention mechanism in human perception. Our framework colors images from the grayscale domain under the guidance of an attention map, which is obtained from the encoder feature maps and the importance weights acquired from the auxiliary classifier. Both the generator and the discriminator are equipped with attention maps to focus on the important salient regions. The attention map facilitates color propagation in the generator, and it refines the discriminator by helping it distinguish colorized images from ground-truth images of the color domain.
We consider colorization as a particular style transfer task in which color information, rather than a certain style, is transferred to an image from the grayscale domain. Thus, normalization functions play a significant role in producing vivid colorized images. Inspired by Adaptive Layer-Instance Normalization (AdaLIN) [1], we present Adaptive Group-Instance Normalization (AGIN). The AGIN function helps the attention mechanism guide our model to produce visually appealing color flexibly. Specifically, the parameters in AGIN are trained to learn appropriate weights for Group Normalization (GN) [2] and Instance Normalization (IN) [3], which perform well with small batch sizes and on individual images, respectively.
Our main contributions in this paper are as follows:
• We propose a novel end-to-end framework for colorization with an attention mechanism and AGIN, a learnable normalization function.
• Our framework is guided by attention maps produced by the auxiliary classifier, which tell it where the salient areas are so that it can color them more delicately.
• AGIN helps our framework generate reasonable colors flexibly without modifying the network architecture.

Networks
The early colorization networks were very simple. For example, Koleini et al. [4] trained Artificial Neural Networks (ANN) by matching the pixels of gray and color images, and Cheng et al. [5] first applied Convolutional Neural Networks (CNN) to the colorization of grayscale images. Putri et al. [6] converted sketches into photos by predicting colors with a deep CNN. In addition, Vitynskyi et al. [7] proposed a promising approach based on the neural-like structure of the Successive Geometric Transformations Model (SGTM), which improved the accuracy of image classification and regression methods. However, this kind of network has not yet been applied in the field of image translation.
Generative Adversarial Networks (GAN) can be an excellent solution for many ill-posed image processing problems and have already achieved remarkable results in colorization, image inpainting, super-resolution, style transfer, and so on. In pursuit of higher quality images, various novel GANs have emerged. DCGAN [8] uses CNNs to implement the generator and discriminator and replaces the pooling layers with strided convolutions. Although CNNs and GANs have been combined successfully, GANs are still less robust. Isola et al. [9] applied a conditional GAN to image-to-image translation tasks and achieved excellent results even on highly complex structures. Although GANs are developing rapidly, the inside of the generator remains as opaque as a black box. StyleGAN [10] does a good job in this respect by passing the latent code through a non-linear mapping and affine transformations, and then controlling the generator through adaptive instance normalization at each convolutional layer. CycleGAN [11] differs from the three GANs mentioned above in that it can handle unpaired data, which means that it pays more attention to the migration of features between images. Moreover, the cycle consistency loss avoids contradictions between the generators.

Colorization
The scribble-based colorization methods diffuse the user's color hints (such as color points, strokes, and blocks) to the entire grayscale image, with the color propagation based on low-level features. Levin et al. [12] and Zhang et al. [13] proposed scribble-based methods that diffuse the color of the strokes provided by the user to the entire image. Sangkloy's method [14] allows the user to control the generation of color images through sketches and sparse strokes. Levin et al. [12] first proposed that adjacent pixels with similar luminance also have similar colors. Under this assumption, boundary information is not even required to go from a user's sparse, simple strokes to a complex full-color image. More advanced work extends from luminance information to textures and addresses color bleeding with edge guidance. The system developed by Zhang et al. [13] fuses the user's sparse low-level stroke information with high-level semantic information to color the grayscale image. Although all of the above methods relieve the user's burden to a certain extent, they still need more or less manual intervention. Moreover, the plausibility of the image colors depends to a large extent on the user's strokes, which means that the image quality is constrained by the user's skill and experience. Therefore, exemplar-based colorization, one of the fully automatic approaches, has become popular as a way to reduce the burden on users.
The exemplar-based colorization methods provide a color reference image similar to the target grayscale image for more direct coloring. Welsh et al. [15] focus on the global information of the image, coloring the target image by matching brightness information between the reference image and each pixel of the target image. Tai et al. [16] paid attention to the local information of the image and segmented the image with soft boundaries to achieve color transfer and propagation. However, for regions with substantial content, it is difficult to obtain low-level features with these methods. The system designed by He et al. [17,18] can recommend appropriate references based on luminance and semantic information, removing the manual screening step and thereby achieving fully automatic coloring. Yoo et al. [19] use small-scale data to produce high quality images with a colorization model that has memory components. The above methods all suffer from the same problem: the reference does not exactly match all of the brightness information in the source domain, so selecting an applicable reference becomes a challenge. Learning-based methods avoid this by learning the color transfer pattern from large-scale data and applying different loss functions to constrain the quality of the generated color images.
The learning-based colorization methods obtain networks by training on large-scale data, and these networks can automatically generate various results without user intervention. At almost the same time, Larsson et al. [20], Iizuka et al. [21], and Zhang et al. [22] proposed similar CNN-based methods with different loss functions. Larsson and Zhang applied a classification loss, and Iizuka applied an L2 regression loss. Isola et al. [9] argue that the L1 loss can reduce image blur, so they apply a combination of L1 loss and GAN loss. To produce more diverse colorization results, Messaoud et al. [23] established a conditional random field, and Cao et al. [24] developed a fully convolutional generator with multi-layer noise. Zhao et al. [25] exploit pixel-level semantic information to guide the generator.

Class Activation Mapping
Zhou et al. [26] pioneered class activation mapping (CAM), which uses global average pooling to obtain a weight for each feature map of the last convolutional layer and forms the class activation map as the weighted sum of those feature maps. An image of any size can be used as input; once the map is upsampled to the source image size, the salient area of the image is revealed. Grad-CAM [27] improves on CAM by using the global average of the gradients to compute the weights, with no need to change the network structure.
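The CAM computation described above can be illustrated with a minimal NumPy sketch. It assumes the feature maps of the last convolutional layer and the per-map class weights are already available as arrays; the function name `class_activation_map`, the nearest-neighbour upsampling, and the final rescaling to [0, 1] are illustrative choices, not details taken from [26].

```python
import numpy as np

def class_activation_map(features, class_weights, out_size):
    """Weighted sum of feature maps, upsampled to the input resolution.

    features:      array of shape (n, h, w), one map per channel
    class_weights: array of shape (n,), one weight per feature map
    out_size:      (H, W) of the source image
    """
    features = np.asarray(features, dtype=np.float64)
    class_weights = np.asarray(class_weights, dtype=np.float64)
    # Weighted sum over the channel axis gives a coarse (h, w) map.
    cam = np.tensordot(class_weights, features, axes=1)
    # Nearest-neighbour upsampling (an assumption; bilinear is also common).
    h, w = cam.shape
    rows = np.arange(out_size[0]) * h // out_size[0]
    cols = np.arange(out_size[1]) * w // out_size[1]
    cam = cam[rows][:, cols]
    # Rescale to [0, 1] so the map can be displayed as a heat map.
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: two 2x2 feature maps, each active at one location.
features = np.zeros((2, 2, 2))
features[0, 0, 0] = 1.0
features[1, 1, 1] = 3.0
heat = class_activation_map(features, [1.0, 2.0], (4, 4))
```

In the toy example the second map carries the larger weight and activation, so the bottom-right block of `heat` is the most salient region.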

Normalization
We consider colorization as the transfer of color features to the target grayscale image, which is analogous to style transfer. Accordingly, the normalization functions used in style transfer can also be used in colorization after adjustment. Although Batch Normalization (BN) [28] has been very successful, many improved methods have emerged, such as Layer Normalization [29] and GN [2]. GN is robust over a wide range of batch sizes, even small ones. IN [3] achieves remarkable results in style transfer because its statistics depend on the individual image instance. To obtain better image quality, composite methods such as Batch-Instance Normalization (BIN) [30], Adaptive Instance Normalization (AdaIN) [31], and Conditional Instance Normalization (CIN) [32] are often used instead of IN alone. CIN modifies the affine parameters of the IN layers, so that different style effects can be obtained with the same network parameters. BIN can selectively normalize the style. AdaIN produces an image in an arbitrary given style by using adaptive affine parameters, which effectively swaps styles.

Network
Given the grayscale image domain X_g and the color image domain X_c, our purpose is to train a mapping G_{g→c} that generates X_c domain images from X_g domain images. Our framework comprises two generators (G_g, G_c) and two discriminators (D_g, D_c), all of which incorporate the attention mechanism. The framework structure is based on CycleGAN, so only G_g and D_c in the forward cycle are explained here (see Figure 1); the reverse cycle follows the same principle. To distinguish the two, the input of the forward cycle is denoted by x and the input of the reverse cycle by y.

Figure 1. The architecture of our framework; the details are covered in Section 3.1.

Generator
Let x ∈ {X_g, X_c}, and let G_g(x) denote the output that maps an image from the grayscale domain to the color domain. G_g is a generator consisting of an encoder E_g, a decoder D_c, and an auxiliary classifier A_g, where E_g(x) is the activation map of the encoder, E_g^i(x) is the i-th activation map, and E_g^{i(a,b)}(x) is its value at (a, b). A_g(x) represents the degree of correspondence between x and the images in the X_g domain. The auxiliary classifier of CAM [26] uses global average pooling to learn the importance weights of the i-th activation map. We instead combine Global Average Pooling (GAP) and Median Pooling (MP) to learn edge features better and to obtain the importance weight W_g^i of the i-th activation map. A set of X_g domain-specific attention feature maps M_g(x) is then generated from the importance weights and the preceding convolutional layers, where M_g(x) = {W_g^i(x) * E_g^i(x) | 1 ≤ i ≤ n} and n is the number of encoder activation maps. Inspired by AdaLIN [1], we integrate AGIN, a fusion of GN [2] and IN [3], into the residual blocks.
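A form of AGIN consistent with the symbol definitions given in this section, and modeled on the AdaLIN formulation of [1], can be written as follows. It should be read as a reconstruction under those assumptions rather than as the authors' exact statement; the small constant ε for numerical stability is an added assumption.

```latex
\begin{aligned}
\hat{x}_G &= \frac{x - \mu_G}{\sqrt{\sigma_G^2 + \epsilon}}, \qquad
\hat{x}_I = \frac{x - \mu_I}{\sqrt{\sigma_I^2 + \epsilon}}, \\
\mathrm{AGIN}(x, \gamma, \beta) &= \gamma \cdot \bigl( \rho \cdot \hat{x}_G + (1 - \rho) \cdot \hat{x}_I \bigr) + \beta, \\
\rho &\leftarrow \mathrm{clip}_{[0,1]}\bigl( \rho - \tau \, \Delta\rho \bigr).
\end{aligned}
```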
Here, μ_G and μ_I are the means of x at the group scale and the channel scale, and σ_G and σ_I are the corresponding standard deviations. β and γ are affine transformation parameters predicted by a fully connected layer, and τ is the learning rate. ρ ∈ [0, 1], where ρ is updated by Δρ, a dynamically computed parameter vector (e.g., the gradient). The value of ρ represents the choice of normalization method: if it approaches 0, the task is better suited to IN, and if it approaches 1, GN is more important for the task.
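To make the role of ρ concrete, the following NumPy sketch mixes group-normalized and instance-normalized activations for a single sample. It assumes an AdaLIN-style convex combination in which ρ weights the GN branch and 1 − ρ the IN branch, matching the interpretation of ρ given above; the names `agin` and `update_rho`, the argument `num_groups`, and the constant `eps` are hypothetical, and γ and β are passed in directly rather than predicted by a fully connected layer.

```python
import numpy as np

def agin(x, gamma, beta, rho, num_groups, eps=1e-5):
    """Mix GN and IN statistics for one sample x of shape (C, H, W).

    gamma, beta, rho are per-channel arrays of shape (C,).
    """
    x = np.asarray(x, dtype=np.float64)
    c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("C must be divisible by num_groups")
    # Instance normalization: statistics per channel.
    mu_i = x.mean(axis=(1, 2), keepdims=True)
    var_i = x.var(axis=(1, 2), keepdims=True)
    x_in = (x - mu_i) / np.sqrt(var_i + eps)
    # Group normalization: statistics per group of contiguous channels.
    g = x.reshape(num_groups, c // num_groups, h, w)
    mu_g = g.mean(axis=(1, 2, 3), keepdims=True)
    var_g = g.var(axis=(1, 2, 3), keepdims=True)
    x_gn = ((g - mu_g) / np.sqrt(var_g + eps)).reshape(c, h, w)
    # rho -> 1 favours GN, rho -> 0 favours IN (assumed mixing form).
    rho = np.clip(np.asarray(rho, dtype=np.float64), 0.0, 1.0)[:, None, None]
    mixed = rho * x_gn + (1.0 - rho) * x_in
    gamma = np.asarray(gamma, dtype=np.float64)[:, None, None]
    beta = np.asarray(beta, dtype=np.float64)[:, None, None]
    return gamma * mixed + beta

def update_rho(rho, delta_rho, tau):
    """One update step for rho, clipped back into [0, 1]."""
    return np.clip(rho - tau * delta_rho, 0.0, 1.0)

x = np.random.default_rng(0).normal(size=(4, 3, 3))
y = agin(x, gamma=np.ones(4), beta=np.zeros(4), rho=np.full(4, 0.5), num_groups=2)
```

With `rho` fixed at 0 or 1 the function reduces to plain IN or plain GN, which is how the two ablation settings discussed later could be emulated in this sketch.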

Discriminator
G_g(X_g) is the domain of generated fake color images. Let x ∈ {X_c, G_g(X_g)} denote a sample from the color domain or the fake color domain. Our discriminator D_c comprises an encoder Ê_c, a classifier Ĉ_c, and an auxiliary classifier Â_c. Given an input x, the encoder produces feature maps Ê_c(x), from which the importance weights Ŵ_c are obtained. The attention feature maps are calculated as M_c(x) = {Ŵ_c^i(x) Ê_c^i(x) | 1 ≤ i ≤ n} and exploited by D_c. Ê_c is trained by Â_c, which, along with D_c, is trained to discriminate whether x belongs to X_c or G_g(X_g).
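The attention feature maps used by both the generator and the discriminator amount to a per-channel reweighting of the encoder activations. The sketch below assumes the importance weights are already given as a vector; how the auxiliary classifier derives them from the pooled descriptors is not specified here, so `pooled_descriptors` only computes the GAP and median-pooling values mentioned for the generator. All function names are illustrative.

```python
import numpy as np

def pooled_descriptors(features):
    """Global average and global median of each activation map.

    features: array of shape (n, H, W).
    Returns two arrays of shape (n,).
    """
    features = np.asarray(features, dtype=np.float64)
    flat = features.reshape(features.shape[0], -1)
    return flat.mean(axis=1), np.median(flat, axis=1)

def attention_feature_maps(features, weights):
    """Scale the i-th activation map by its importance weight W^i."""
    features = np.asarray(features, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    return weights[:, None, None] * features

features = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
gap, med = pooled_descriptors(features)
maps = attention_feature_maps(features, [0.5, 2.0])
```

Unlike the summed class activation map, the n reweighted maps are kept separate here so that subsequent layers can consume them.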

Loss
The full loss function of our framework is composed of four parts.

Adversarial Loss
Both the forward mapping G_g and the reverse mapping G_c apply adversarial losses, where G_g aims to generate fake images G_g(x) that fool D_c, and D_c tries to distinguish images generated from domain X_g from real images of domain X_c. Concisely, the objective min_{G_g} max_{D_c} L_{adv,G_g}(G_g, D_c, X, Y) means that G_g tries to minimize this function while D_c tries to maximize it. Similarly, the reverse mapping uses min_{G_c} max_{D_g} L_{adv,G_c}(G_c, D_g, Y, X).

Cycle Consistency Loss
To prevent the colors of all images generated from X_g from collapsing toward a single image in X_c, and to ensure that each image generated by G_g can be restored to x by G_c, we introduce a cycle consistency loss.
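In the standard CycleGAN [11] formulation, on which this framework is based, the forward-cycle term takes the form below. The use of the L1 norm and the absence of a weighting factor are assumptions carried over from [11], not details confirmed in this section; the reverse-cycle term would swap the roles of G_g and G_c.

```latex
L_{cyc,\,G_g} = \mathbb{E}_{x \sim X_g} \Bigl[ \bigl\lVert x - G_c\bigl(G_g(x)\bigr) \bigr\rVert_1 \Bigr]
```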

Content Loss
With the purpose of ensuring the input and output are similar in content, we apply content loss to restrain generators.

CAM Loss
For x ∈ {X_g, X_c}, we train G_g and D_c with the information inferred from the auxiliary classifiers A_g and Â_c. With the CAM losses, G_g becomes aware of where improvement is needed to generate images more similar to those in X_c, and D_c learns where to look for the details that distinguish images of the two domains.

Architecture
Our generator is composed of an encoder, a decoder, and an auxiliary classifier. The encoder consists of two down-sampling convolutional layers with a stride of two and four residual blocks. The decoder consists of two up-sampling convolutional layers with a stride of one and four adaptive residual blocks equipped with AGIN, unlike the encoder, where only instance normalization is used. We use two scales of PatchGAN [9] in the discriminator network, with a local patch size of 70 × 70 and a global patch size of 286 × 286. In the discriminator, we use spectral normalization. The ReLUs in the generator are not leaky, while those in the discriminator are leaky with a slope of 0.2.

Training
To augment the training data, we first resized the 256 × 256 input images to 286 × 286 and then randomly cropped them back to 256 × 256. The batch size in the experiments is set to one. We trained with Adam [33] using a learning rate of 0.0002 and momentum parameters β_1 = 0.5 and β_2 = 0.999.
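The resize-then-crop augmentation can be sketched in a few lines of NumPy. The interpolation method is not stated, so nearest-neighbour resizing is used here purely as an assumption; the function names and the `rng` argument are illustrative.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) image (assumed interpolation)."""
    h, w = image.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return image[rows][:, cols]

def random_resized_crop(image, rng, load_size=286, crop_size=256):
    """Enlarge to load_size x load_size, then crop a random crop_size window."""
    enlarged = resize_nearest(image, load_size, load_size)
    # Top-left corner chosen uniformly among all valid positions.
    top = rng.integers(0, load_size - crop_size + 1)
    left = rng.integers(0, load_size - crop_size + 1)
    return enlarged[top:top + crop_size, left:left + crop_size]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
patch = random_resized_crop(image, rng)
```

Each call yields a slightly shifted and magnified view of the same image, so a network trained with batch size one sees a different crop on every pass.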

Dataset
We train networks on COCO [34] and VisualGenome [35]. For the training, all images are resized to 256 × 256. In addition, all grayscale images are obtained by grayscale conversion of color images.
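The grayscale conversion used to build the training pairs is not specified. One common choice, shown here only as an assumption, is the ITU-R BT.601 luma weighting of the RGB channels; the function name `rgb_to_gray` is illustrative.

```python
import numpy as np

# ITU-R BT.601 luma coefficients for R, G, B (assumed conversion).
BT601_WEIGHTS = np.array([0.299, 0.587, 0.114])

def rgb_to_gray(rgb):
    """Convert an (H, W, 3) RGB image to an (H, W) luminance image."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ BT601_WEIGHTS

color = np.zeros((2, 2, 3))
color[0, 0] = [255, 255, 255]  # white pixel
color[1, 1] = [0, 255, 0]      # pure green pixel
gray = rgb_to_gray(color)
```

Because the weights sum to one, a white pixel keeps its full intensity while saturated colors map to intermediate gray levels.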

Comparisons with State-of-the-Art
We first present the results of the proposed model (see Figure 2) and also conduct ablation experiments on the attention mechanism to verify its effectiveness. We compare our model with state-of-the-art colorization methods (Zhang et al. [22]; Larsson et al. [20]; Iizuka et al. [21]). The colorization results are shown in Figure 3. In terms of overall chrominance, the results of our model are more realistic and convincing (rows 1, 3). Our model is also superior in coloring details (rows 2, 4). Moreover, our model colors foreground objects correctly while handling edges carefully (rows 5, 6). In addition, we compare the performance of our model with that of other outstanding image translation models (see Figure 4). Finally, qualitative and quantitative evaluations are used to assess the quality of the generated images.

CAM Ablation Experiment
To verify the effectiveness of the attention mechanism, we conducted a CAM ablation experiment. The results (see Figure 5) show that color bleeding (row d, columns 1, 3, 4, 5, 6) and color failure caused by blurred boundaries (row d, columns 2, 7) are common in the results without CAM. Adding CAM makes the model pay more attention to the key areas when coloring and handle boundaries more carefully. The colorized results with CAM are shown in Figure 5e, which confirms that the attention mechanism helps to mitigate the color bleeding problem in colorization.

AGIN Ablation Experiment
We consider colorization as a particular style transfer task, i.e., transferring color information rather than a specific style. We use AGIN to combine the advantages of GN and IN for better color transfer. As introduced in Section 3.1, the value of ρ is learnable. When ρ is learned to approach 0, the normalization layers tend to adopt IN; when ρ is learned to approach 1, they tend to adopt GN. We therefore conducted an AGIN ablation experiment to confirm that the AGIN used in the generator helps to produce vivid color. GN computes group-wise features, so unreasonable colors may appear in the generated image (see Figure 6, row c). IN computes channel-wise features and retains too many of them, so the overall chrominance of the colorized image is dark and the contrast is insufficient (see Figure 6, row d). We therefore believe that AGIN combines the advantages of GN and IN by allocating their weights adaptively, which makes the colorized images more visually pleasing.

Qualitative and Quantitative Evaluations
To evaluate the quality of the colorized images, we conducted a preference study. 197 observers (including researchers and people without any colorization knowledge) were asked to select the best colorized image from images generated by different methods. As can be seen from Table 1, the results of our method were preferred by the majority of users. Table 2 also shows that our method can produce higher quality images than other methods.
To assess visual appeal, we also evaluated the naturalness of the colorized images. We compare our model with state-of-the-art colorization methods (Zhang et al. [22]; Larsson et al. [20]; Iizuka et al. [21]). 15 observers were shown 500 images in random order (ground-truth images and colorized images generated by our method and the state-of-the-art methods, 100 images each), one at a time, and asked to judge intuitively whether each image looked natural to them. Similarly, we compare the naturalness of our method with that of image translation models. Table 3 shows that 93.21% of the colorized images generated by our method are considered natural, which supports the claim that our model is able to generate natural and visually pleasing color images.

Table 3. Naturalness evaluation.

Method                 Naturalness (Mean)
Zhang et al. [22]      87.53%
Larsson et al. [20]    85.58%
Iizuka et al. [21]     89
Ours                   93.21%

Evaluating the results of colorization methods is highly subjective, and both quantitative and qualitative evaluations are difficult. As for quantitative evaluation, it is very difficult to analyze such a highly ill-posed problem as colorization quantitatively. The peak signal-to-noise ratio (PSNR) is widely used in the field of image processing, and many colorization methods (Larsson et al. [20]; He et al. [17]) also use PSNR to evaluate image quality. The comparison results are shown in Table 4. Our method has a higher PSNR than the other methods, which indicates that it can produce more realistic and higher-quality images.
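For reference, PSNR between a ground-truth image and a colorized image follows directly from the mean squared error. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), and assumes 8-bit images with a peak value of 255; the function name `psnr` and the `max_value` argument are illustrative.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

ground_truth = np.zeros((4, 4, 3))
off_by_one = ground_truth + 1.0  # every value differs by exactly 1
score = psnr(ground_truth, off_by_one)
```

A higher value means the colorized image is closer to the ground truth pixel by pixel, which is why PSNR complements, but does not replace, the perceptual studies above.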

Limitations and Discussion
Colorization is a highly ill-posed and ambiguous problem. We can easily infer the colors of oceans and forests, but there is often no unique solution for the colors of the clothes people wear. A limitation of our method is that it can misjudge objects (see Figure 7, row 1, where the land was wrongly colored green) and color artifacts incorrectly (see Figure 7, row 2, where the kite was incorrectly colored). Learning-based methods are data-driven: in theory, the larger the dataset and the richer its content, the better the quality of the colorized images. At the same time, the same object presents different colors in different environments, weather, and seasons, i.e., changing lighting conditions also bring challenges to colorization. Meanwhile, making the colors of the artifacts in the dataset representative and universal may still require manual labeling. The possibility of semantic colorization is not explored in this paper.

Conclusions
In this paper, we proposed a novel end-to-end colorization framework that integrates an attention mechanism and AGIN, a learnable adaptive normalization function. Attention maps produced by the auxiliary classifier guide the generator to focus on details that are easily overlooked. The addition of the attention mechanism handles the problem of color bleeding well. Furthermore, AGIN plays an integral role in flexibly producing vivid colors when the dataset contains images with complex content and diverse scenes. Extensive experimental results verify that our framework can transfer reasonable and visually pleasing color to black-and-white images. In addition, our method is superior to other state-of-the-art GAN-based colorization methods.
In future research, from the perspective of the algorithm, the network structure can be further optimized to reduce the model training time, and video colorization can also be taken into consideration. From the perspective of the dataset, a new dataset can be built to train the network so that it can be applied to the colorization of legacy photos and old movies.

Conflicts of Interest:
The authors declare no conflict of interest.