Thangka Sketch Colorization Based on Multi-Level Adaptive-Instance-Normalized Color Fusion and Skip Connection Attention

Thangka is an important intangible cultural heritage of Tibet. Because the Thangka painting technique is complex and time-consuming, it is currently at risk of being lost. It is important to preserve the art of Thangka through digital painting methods. Machine learning-based auto-sketch colorization is one of the vital steps for digital Thangka painting. However, existing learning-based sketch colorization methods face two challenges in colorizing Thangka: (1) the extremely rich colors of the Thangka make it difficult to color accurately with existing algorithms, and (2) the line density of the Thangka makes it extremely challenging for algorithms to determine what semantic information the lines imply. To resolve these problems, we propose a Thangka sketch colorization method based on multi-level adaptive-instance-normalized color fusion (MACF) and skip connection attention (SCA). The proposed method consists of two parts: (1) a multi-level adaptive-instance-normalized color fusion (MACF) to fuse sketch features and color features; and (2) a skip connection attention (SCA) mechanism to distinguish the semantic information implied by the sketch lines. Experiments on colorizing Thangka sketches show that our method works well on two small datasets: the Danbooru 2019 dataset and the Thangka dataset. Our approach can generate exquisite Thangka.


Introduction
As a kind of Tibetan encyclopedia [1], Thangka art is one of Tibet's most valuable cultural heritages and one of the most precious materials for studying Tibetan history. It takes a professional Thangka painter dozens of days or even years to paint a beautiful Thangka. Thangka colorization is one of the essential parts of the Thangka painting process, which requires a lot of time and effort for professional Thangka painters. Recently, the colorization of a given image has attracted much attention in computer vision. The machine learning-based sketch colorization method enables us to use digital means to create and preserve Thangka better.
Although great progress [2,3] has been made in sketch colorization methods, there is no available solution for the task of Thangka sketch colorization due to three main reasons: (1) Thangka artworks are extremely colorful, which makes both traditional non-learning methods and standard convolution-based learning methods hard to color correctly; (2) the line density of the Thangka makes it difficult for the existing methods to correctly define what semantic information the Thangka lines imply; and (3) the existing Thangka dataset is too small to be trained well with the existing colorization methods.
Reference-based sketch image colorization [2][3][4] has excellent potential for sketch colorization of Thangka, and these related works have achieved remarkable results. However, these existing reference-based sketch image colorization methods are still unable to produce satisfactory colorization results because of two challenges:
Challenge 1: Existing reference-based sketch image colorization algorithms cannot correctly extract the color features of Thangka. Compared with existing datasets, such as face images, landscapes, indoor scenes, flowers, animals, animation and cars, the colors in the Thangka dataset are extremely rich. The performance of Thangka painting depends on the color feature extraction. However, the existing colorization methods are mainly applied to color animation works which only contain simple colors and single structures.
Challenge 2: Compared with other sketch images, the lines of the Thangka sketches are too dense for algorithms to define what semantic information the lines imply, which leads to problems in the colorization process (for example, in Figure 1, the results of existing methods show obvious artifacts, wrong colors, and color confusion).

Figure 1. (a) Sketch and reference images; (b) the result of our method and its details; (c) the result of the existing method and its details. The proposed MACF-SCA obtains the best colorization map with more accurate colors and better visual effects (b). Existing reference-based methods cannot accurately migrate the color semantic information (c). The detailed parts show that our model can accurately distinguish the semantic information of dense lines (I,II,III), and the existing model shows obvious visual artifacts, color confusion and color errors (IV,V,VI).
In this paper, a multi-level adaptive-instance-normalized color fusion (MACF) and skip connection attention mechanism (SCA) is proposed (Figure 2) to address Thangka sketch colorization. The proposed method consists of two parts: First (solution for Challenge 1), a new multi-level adaptive-instance-normalized color fusion (MACF) is proposed to fuse rich color features with sketch features efficiently. MACF consists of a combination of four identical Convolutional AdaIN ReLU (CAR) modules (Section 3.3). Color features and sketch features are first extracted using a color feature encoder and a sketch feature encoder, respectively. They are then fed together into the multi-level MACF for fusion, which ensures the accurate fusion of color features and sketch features.
Second (solution for Challenge 2), a new semantic information distinction based on skip connection attention (SCA) is proposed to focus on sketch lines. SCA is extremely sensitive to subtle objects, allowing our model to accurately distinguish what semantic information the subtle lines imply. Our skip connection attention-trained Thangka colorization model not only provides accurate recognition of semantic information but also avoids overfitting.
Experiments on the colorization of Thangka sketches show that our method can generate high-quality Thangka colorization results in one step without post-processing. In summary, the main contributions of this work are:

1. We propose a multi-level adaptive-instance-normalized color fusion (MACF) module, which can accurately fuse color features with sketch features and generate high-quality colorization works.
2. We propose a skip connection attention (SCA) module for accurately distinguishing semantic information consisting of dense lines.
3. For the first time, we present a framework applicable to Thangka sketch colorization, and we also construct a new Thangka dataset (5081 training images).

Automatic Sketch Colorization
Automatic sketch colorization methods based on deep learning [5][6][7][8][9][10][11] have received increasing attention in recent years. Relying on the powerful representation capability of deep neural networks, automatic sketch colorization methods can be implemented by designing various network structures and using large-scale image datasets. Liu et al. [5] used a feed-forward deep neural network as a generator to output color images with pixel-level resolution using sketches as the input. Frans et al. [6] proposed two tandem adversarial networks for the automatic colorization of sketch images. Recent studies [7,11] have built on the U-net network and proposed U-net-based architectures for automatic sketch colorization.
For all automatic sketch colorization methods, there are two main problems: (1) these methods are sensitive to visual artifacts when the sketch has complex content with multiple objects, and (2) the existing methods tend to output single-color results and have no multimodality since the network parameters are fixed.

User Prompt Based Colorization
Early interactive colorization methods [12,13] used low-similarity metrics to propagate stroke colors. Recently, some algorithms [5,14-17] introduced manual guidance to apply initial color points or strokes to the entire sketch image. Ci et al. [14] proposed a deep conditional adversarial architecture to robustly train the network to make synthetic images more natural and realistic. Zhang et al. [15] proposed a two-stage colorization framework based on semi-automatic learning to color sketches with appropriate colors, textures, and gradients. Yuan et al. [16] proposed a tandem, U-net-based framework with a spatial attention module that can generate more consistent and higher-quality sketch colorization from the cues given by the user.
However, all these methods have limitations: (1) these palette-based colorization methods are susceptible to user aesthetic limitations, and (2) it is difficult for untrained users to select the appropriate points and associated colors from the palette.

Reference-Based Sketch Image Colorization
In contrast to the user prompt-based colorization, reference-based sketch image colorization only requires a user to select a suitable reference image based on a target sketch image. Colorization of the sketch image according to the reference style is a user-friendly method that helps a designer choose the right color image for the sketch [2][3][4][18][19][20]. With the recent rise of deep neural networks, Zhang et al. [4] integrated the residual U-net into a generative adversarial network (AC-GAN) with an auxiliary classifier for the anime sketch colorization task. Due to the limitations of the sketch-reference image pair dataset, Lee et al. [2] proposed an enhanced self-reference generation method, where the reference image is generated from the original image by color perturbation and geometric distortion, followed by an attention-based pixel feature transfer module to colorize the sketch image. Li et al. [3] proposed a stop-gradient-attention (SGA) training strategy based on [2] to eliminate gradient conflicts and help models learn better colorization correspondences.
Although these models achieved good results, their results on Thangka sketches are not satisfactory. As shown in Figure 1, a comparison between the existing method (Figure 1c) and our method (Figure 1b) reveals obvious artifacts, color errors, and semantic mismatches in the output of the existing method.

Methodology
The proposed sketch colorization method for Thangka consists of two parts: (1) a multi-level adaptive-instance-normalized color fusion (MACF) for fusing color features and sketch features, and (2) a skip connection attention (SCA) module that integrates skip connection and attention mechanism, which can accurately discern the semantic information of dense lines.

Overall Workflow
As shown in Figure 2, given a color image $I$, we first convert it to an artistic line image $I_s$ using XDoG [21]. Then, inspired by [2], we obtain the expected colorization result $I_c$ by adding random color dithering to $I$. Next, a self-styled reference image $I_r$ is generated by applying the thin plate spline (TPS) transformation to $I_c$. Finally, we use a self-supervised training process similar to [2]. In the training process, our model takes $I_r$ and $I_s$ as inputs and uses two independent encoders $E_s(I_s)$ and $E_r(I_r)$ to extract sketch features $f_s \in \mathbb{R}^{c \times h \times w}$ and color features $f_r \in \mathbb{R}^{c \times h \times w}$.
To carry out sketch feature alignment and color feature fusion simultaneously, the extracted color features are fused into the deep representation of the sketch through our MACF block, which controls the feature maps. The final color image $I_g$ is then generated using multiple residual blocks and a decoder with a skip connection to the sketch encoder $E_s$. In the end, we add an adversarial loss [22] by using a discriminator $D$ to distinguish the output $I_g$ from the ground truth $I_c$. The color style of $I_g$ is similar to the reference image, and its content is consistent with the input sketch.

Self-Enhanced Self-Referential Learning
Due to the scarcity of Thangka data, preparing reference images for Thangka sketch images and linking these two inputs for pixel-level paired training is a crucial bottleneck, so we adopt the self-reference generation method from [2]. To generate a random reference image $I_r$ for a given Thangka sketch $I_s$, we transform the original real color image $I$. Since $I_r$ is essentially generated from $I$, this process guarantees enough color information to color $I_s$, which encourages the proposed model to reflect $I_r$ in the colorization process.
Detailed information on how these transformations work is described below. First, the content transformation $C(\cdot)$ adds a specific random color perturbation to $I$. The resulting output $C(I)$ is then used as the ground truth $I_c$ for the colorization output of our model. We impose the color perturbation on the original $I$ to increase the number of training samples: the same original color image $I$ can yield different reference images. Afterward, we further apply the thin plate spline (TPS) [23] transform $T(\cdot)$, a non-linear spatial transformation operator, to $C(I)$ (that is, $I_c$), resulting in our final reference image $I_r$. This prevents our model from lazily copying colors from $I$ at the same pixel location and forces it to extract semantic color information only from the reference image, even if the reference has a different spatial layout, for example, differences in orientation, shape, and posture. The above two transformations help our model learn to transfer the correct color information from the reference image to the target image.
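The two transformations above can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the authors' pipeline: the function names, the per-channel offset used for $C(\cdot)$, the 3 x 3 control grid, the jitter range, and the nearest-neighbour sampling are all hypothetical choices made to keep the example short.

```python
import numpy as np

def color_perturb(img, rng, strength=0.2):
    """C(.): add one random offset per RGB channel (img is float in [0, 1]).
    The offset-based perturbation is an assumption for illustration."""
    shift = rng.uniform(-strength, strength, size=(1, 1, 3))
    return np.clip(img + shift, 0.0, 1.0)

def tps_kernel(r2):
    """TPS radial basis U(r) = r^2 log r, written in terms of r^2; U(0) = 0."""
    return 0.5 * r2 * np.log(np.where(r2 == 0, 1.0, r2))

def tps_warp(img, rng, grid=3, jitter=0.1):
    """T(.): backward-warp img with a thin plate spline fitted to jittered control points."""
    h, w = img.shape[:2]
    # Control points on a regular grid in normalized [0, 1] coordinates.
    lin = np.linspace(0.0, 1.0, grid)
    dst = np.stack(np.meshgrid(lin, lin), axis=-1).reshape(-1, 2)  # output positions
    src = dst + rng.uniform(-jitter, jitter, size=dst.shape)       # where to sample from
    n = len(dst)
    # Solve the TPS linear system so that f(dst_i) = src_i.
    d2 = ((dst[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    P = np.hstack([np.ones((n, 1)), dst])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = tps_kernel(d2)
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.vstack([src, np.zeros((3, 2))])
    params = np.linalg.solve(A, b)  # first n rows: kernel weights, last 3 rows: affine part
    # Evaluate f at every output pixel.
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel() / (w - 1), ys.ravel() / (h - 1)], axis=1)
    q2 = ((pts[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    affine = np.hstack([np.ones((len(pts), 1)), pts]) @ params[n:]
    mapped = tps_kernel(q2) @ params[:n] + affine
    # Nearest-neighbour sampling, clamped at the image borders.
    sx = np.clip(np.rint(mapped[:, 0] * (w - 1)), 0, w - 1).astype(int)
    sy = np.clip(np.rint(mapped[:, 1] * (h - 1)), 0, h - 1).astype(int)
    return img[sy, sx].reshape(img.shape)

rng = np.random.default_rng(0)
I = rng.random((32, 32, 3))   # stand-in for a real color image
I_c = color_perturb(I, rng)   # ground truth for the colorization output
I_r = tps_warp(I_c, rng)      # spatially distorted reference image
```

With `jitter=0.0` the control points do not move and the warp reduces to the identity, which is a convenient sanity check for the spline solve.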

Multi-Level Adaptive-Instance-Normalized Color Fusion (MACF)
Existing deep learning methods, such as SCFT [2] and SGA [3], have reached state-of-the-art image colorization. However, they still fail to correctly migrate rich colors through deep networks, which inevitably leads to inaccurate colorization of the final results. Our MACF consists of four identical Convolutional AdaIN ReLU (CAR) modules, and the composition of the CAR module is shown in Figure 3. MACF fuses the color feature map extracted by the color encoder with the sketch feature map extracted by the sketch encoder. In MACF, we use the AdaIN layer to control the input feature maps to achieve alignment of color features with sketch features. Multilayer CAR is used to output multi-scale result maps.
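For readers unfamiliar with adaptive instance normalization, the sketch below shows the standard AdaIN operation, which re-normalizes each channel of one feature map to the per-channel mean and standard deviation of another. The `car_block` function is a hypothetical stand-in for a CAR module: the 1 x 1 convolution, the random weights, and feeding the same color features to all four blocks are assumptions, since the exact layer configuration is given only in Figure 3.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Standard AdaIN on (C, H, W) arrays: give `content` the channel-wise
    mean and standard deviation of `style`."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

def car_block(sketch_feat, color_feat, weight):
    """Hypothetical Conv -> AdaIN -> ReLU block; the convolution is a 1x1
    channel mix here purely to keep the example small."""
    mixed = np.einsum("oc,chw->ohw", weight, sketch_feat)
    return np.maximum(adain(mixed, color_feat), 0.0)

rng = np.random.default_rng(0)
f_s = rng.normal(size=(8, 16, 16))            # sketch features
f_r = rng.normal(2.0, 3.0, size=(8, 16, 16))  # color features
aligned = adain(f_s, f_r)

# Four stacked CAR blocks, mirroring the four-module structure of MACF.
fused = f_s
for _ in range(4):
    fused = car_block(fused, f_r, rng.normal(size=(8, 8)) / np.sqrt(8))
```

After `adain`, the sketch features carry the first- and second-order channel statistics of the color features, which is the sense in which AdaIN "aligns" the two feature maps.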

Skip Connection Attention (SCA)
Inspired by [24], we note that attention gates (AG) are extremely sensitive to subtle changes, which helps us process complex textured images such as Thangka. We merged the AG into our network architecture to highlight the sketch features passing through the skip connection. For the SCA module, we provide two inputs: a complete sketch feature map and a rough feature map. As shown in Figure 2, the sketch feature information is roughly extracted for gating in the $E_s$ encoder using the VGG19 network [25] to eliminate irrelevant noise and ambiguous responses in the skip connection. This is performed before the concatenation operation so that only the relevant feature information is merged and activated in the decoder.
As shown in Figure 4, in order to accurately distinguish the semantic information of sketch lines, a sufficiently sizeable receptive field needs to be captured. When the decoder is connected to the skip connection, the attention gate calculates the activation weights in the skip connection, locks the spatial region and scales the sketch features delivered by the skip connection. The sketch feature map is scaled using the attention factor (α) computed by AG. The spatial regions are selected by analyzing the activation and line information provided by the gating signal (G), which is collected from a coarser scale. Finally, sketch feature mapping and color feature fusion are performed in the decoder to obtain the final colorized Thangka image.

Loss Function
For machine learning, the design of the loss function depends on the goal of the training. In this work, the goal of Thangka sketch colorization is to give the Thangka sketch appropriate colors to show the beauty of the Thangka artwork. To achieve this goal, the Reconstruction Loss ($L_{rec}$), Adversarial Gen Loss ($L_{adv}$), Perceptual Loss ($L_{perc}$) and Style Loss ($L_{style}$) functions are used in our method.
Reconstruction Loss. According to Section 3.1, the generated image $I_g$ and the ground truth image $I_c$ should be stylistically consistent with the reference image $I_r$ and retain contours consistent with the sketch image $I_s$. Therefore, we use the L1 [26] criterion to measure the difference between $I_g$ and $I_c$, which ensures that the model adds color correctly and distinctly. The reconstruction loss can be expressed as:
$$L_{rec} = \mathbb{E}\left[\lVert G(I_s, I_r) - I_c \rVert_1\right]$$
where $G(I_s, I_r)$ means coloring the sketch $I_s$ with the reference $I_r$, and $I_c$ is the color image.
Adversarial Gen Loss. As an adversary of the generator, the discriminator $D$ aims to distinguish the images generated by the generator from the real ones. The output of the real/fake classifier $D(X)$ represents the probability that an image $X$ is a real image. We chose conditional GANs, which use generated samples and additional conditions [27] simultaneously. In this work, we use the input image $I_s$ as the condition for the adversarial loss because preserving the content of $I_s$ and generating plausible fake images is important. The adversarial loss is expressed as a standard cross-entropy loss:
$$L_{adv} = \mathbb{E}_{(I_s, I_c)}\left[\log D(I_s, I_c)\right] + \mathbb{E}_{(I_s, I_r)}\left[\log\left(1 - D(I_s, G(I_s, I_r))\right)\right]$$
where $G$ represents the generator, $D$ represents the discriminator, $I_s$ is the sketch image, $I_r$ is the reference image, $I_c$ is the color image, and $I_g = G(I_s, I_r)$ is the generated image.
The first term $\mathbb{E}_{(I_s, I_c)}[\log D(I_s, I_c)]$ represents the discriminator's loss for real image pairs $(I_s, I_c)$; the goal of this term is to make the discriminator better at distinguishing real images from fake ones by pushing its output $D(I_s, I_c)$ toward 1. The second term $\mathbb{E}_{(I_s, I_r)}[\log(1 - D(I_s, G(I_s, I_r)))]$ covers the generated fake pairs $(I_s, G(I_s, I_r))$; for the discriminator, this term aims to push $D(I_s, G(I_s, I_r))$ toward 0, so that $\log(1 - D(I_s, G(I_s, I_r)))$ approaches 0, while the generator is trained to counteract it.
Perceptual Loss. As shown in previous work [28], perceptual loss [29] can drive the network to produce perceptually plausible outputs and has also been shown to facilitate the training of sketch colorization models [30,31]. We use the perceptual loss computed on the VGG19 network [25] pre-trained on ImageNet as the content loss of the generator:
$$L_{perc} = \mathbb{E}\left[\sum_i \frac{1}{T_i} \lVert F^{(i)}(I_g) - F^{(i)}(I_c) \rVert_1\right]$$
where $T_i$ is the number of elements in the $i$-th layer of VGG19 and $F^{(i)}$ is the feature mapping in the $i$-th layer.
Style Loss. Lee et al. [2] have shown that style loss helps the network to produce reasonable outputs. The style loss is calculated as:
$$L_{style} = \mathbb{E}\left[\lVert \mathcal{G}(F^{(i)}(I_g)) - \mathcal{G}(F^{(i)}(I_c)) \rVert_1\right]$$
where $\mathcal{G}$ is the Gram matrix.
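In summary, the overall loss function for training is defined as:
$$L_{total} = L_{adv} + \lambda_{rec} L_{rec} + \lambda_{perc} L_{perc} + \lambda_{style} L_{style}$$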
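As a rough illustration of how the four terms combine on the generator side, the following NumPy sketch evaluates a weighted sum using the coefficients reported in the implementation details (30, 0.01, and 50). Several details are assumptions: the mean absolute error stands in for the normalized L1 terms, the Gram matrix normalization and the choice of the last feature layer for the style term are hypothetical, and `feats_g` / `feats_c` stand in for VGG19 activations that are not computed here.

```python
import numpy as np

LAMBDA_REC, LAMBDA_PERC, LAMBDA_STYLE = 30.0, 0.01, 50.0  # weights from the paper

def l1(a, b):
    """Mean absolute error; the mean plays the role of the 1/T_i normalization."""
    return float(np.abs(a - b).mean())

def gram(feat):
    """Gram matrix of a (C, H, W) feature map (normalization is an assumption)."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def generator_loss(i_g, i_c, feats_g, feats_c, d_fake):
    """i_g, i_c: generated and ground-truth images.
    feats_g, feats_c: lists of feature maps (stand-ins for VGG19 layers).
    d_fake: discriminator probabilities D(I_s, I_g) in (0, 1)."""
    rec = l1(i_g, i_c)
    perc = sum(l1(fg, fc) for fg, fc in zip(feats_g, feats_c))
    style = l1(gram(feats_g[-1]), gram(feats_c[-1]))
    adv = float(np.log(1.0 - d_fake + 1e-8).mean())  # generator minimizes log(1 - D(fake))
    return adv + LAMBDA_REC * rec + LAMBDA_PERC * perc + LAMBDA_STYLE * style

rng = np.random.default_rng(0)
img = rng.random((3, 16, 16))
feats = [rng.random((4, 8, 8)), rng.random((8, 4, 4))]
perfect = generator_loss(img, img, feats, feats, d_fake=np.zeros(4))
worse = generator_loss(img + 0.1, img, feats, feats, d_fake=np.zeros(4))
```

When the generated image and its features match the ground truth exactly, only the adversarial term remains, so `perfect` is essentially zero and any pixel error increases the total.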

Dataset
We used the Danbooru 2019 and Thangka datasets to train and validate our model.
Danbooru 2019 dataset [32]. Danbooru 2019 is the most widely used dataset in animation sketch colorization. We selected 16,170 images from it for training and 2000 images for testing. The selected images show objects on black backgrounds. Since a black background produces obvious area boundaries when extracting lines, we replaced all black backgrounds with white backgrounds to facilitate the extraction of sketches. This dataset is used to train our model in the cartoon domain so that the Thangka sketches have the stylistic characteristics of anime.
Thangka dataset. Since there is no publicly available Thangka dataset for our study, we collected 128 ultra-high-definition Thangka images (size 12,869 × 16,710) from the Internet and then manually cropped and cut out beautiful partial pictures of these Thangka murals containing portraits of Buddha statues, lotus bases, sacred animals, auspicious clouds, auspicious treasures, temples, etc. Finally, 5662 Thangka images (size 512 × 512) were obtained for the experiment. We allocate 5081 images for training and 581 images for testing.
To simulate the lines drawn by the artist for both the Danbooru 2019 dataset and the Thangka dataset, we used XDoG [21] to extract the sketch inputs and set the parameter of the XDoG algorithm to $\varphi = 1 \times 10^9$ in order to maintain a step transition at the boundary of the sketch lines. For other parameters, we set $\sigma = 0.5$, $p = 19$, $k = 4.5$, and $\varepsilon = 0.01$ by default in XDoG.
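The role of these parameters can be illustrated with a small NumPy sketch of the commonly used XDoG formulation: a sharpened difference of Gaussians followed by a soft threshold. Whether the authors' implementation uses exactly this parameterization is an assumption, and the separable blur helper is a simple stand-in for a library Gaussian filter.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2D array with edge padding."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def xdog(gray, sigma=0.5, k=4.5, p=19.0, eps=0.01, phi=1e9):
    """gray: 2D float image in [0, 1]. Returns a line image in [0, 1]
    (1 = white paper, 0 = black line)."""
    g1 = gaussian_blur(gray, sigma)
    g2 = gaussian_blur(gray, k * sigma)
    s = (1.0 + p) * g1 - p * g2  # sharpened difference of Gaussians
    # Soft threshold; with phi = 1e9 the tanh ramp is effectively a step.
    return np.where(s >= eps, 1.0, 1.0 + np.tanh(phi * (s - eps)))

# A white image with one dark vertical stripe yields a black line at the stripe.
gray = np.ones((32, 32))
gray[:, 16] = 0.0
sketch = xdog(gray)
```

The very large `phi` explains the "step transition" mentioned above: pixels whose response falls below `eps` are mapped almost exactly to 0, and all others to 1.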

Implementation Details
We trained our model on a single NVIDIA 3090 GPU and set the coefficients of each loss term as follows: $\lambda_{rec} = 30$, $\lambda_{perc} = 0.01$, and $\lambda_{style} = 50$. We use the Adam solver [33] for optimization with momentum hyperparameters $\beta_1 = 0.5$ and $\beta_2 = 0.999$. The learning rates of the generator and discriminator are initially set to 0.0001 and 0.0002, respectively. For each dataset, the size of the input image is fixed as 512 × 512.

Qualitative Evaluation
We conducted experiments on the Danbooru 2019 and Thangka datasets and compared our approach with existing state-of-the-art methods that include not only reference-based line art colorization [2,3] but also image-to-image translation [20]. Figure 5 visually compares the overall qualitative results of our method with the state-of-the-art methods. Figure 5A,B show the results on the Danbooru 2019 dataset, and Figure 5C,D compare the results on the Thangka dataset. The sketch and reference images are given in the first and second columns, respectively. On each dataset, our model extracts the exact colors from the reference image and injects them into the corresponding positions in the sketch. For example, in the first row of Figure 5, our model colored the lotus base correctly, while SCFT [2], SGA [3] and Munit [20] all showed unsatisfactory visual effects. In contrast, our method finely fills in the same colors as the reference image. As shown in the third row of Figure 5, the results on the Thangka dataset show that Munit [20] was unable to learn the correct semantic information of the Thangka image. SCFT [2] and SGA [3] also showed obvious color overflow, errors, and noticeable visual artifacts.
The experimental results of the Danbooru 2019 and Thangka datasets show the superiority of our method over SCFT [2], SGA [3] and Munit [20], demonstrating the advantages of our model in establishing visual correspondences and generating appropriate colors in Thangka images.

Quantitative Evaluation
In traditional sketch colorization setups, pixel-level evaluation metrics such as peak signal-to-noise ratio (PSNR) and contour retention evaluation metrics such as structural similarity index (SSIM) are widely used. In our study, we use the following three metrics to evaluate the results of our model quantitatively.
Fréchet inception distance (FID) [34]. FID is a well-known metric used to evaluate the performance of generative models by measuring the Wasserstein-2 distance between the feature space representations of the real images and the generated outputs. A low FID score indicates that the model generates images with quality and diversity close to the real data distribution.
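The distance underlying FID has a closed form when both feature sets are modeled as Gaussians. The sketch below computes that Fréchet distance from two feature matrices; it assumes the features have already been extracted (in practice from an Inception network, which is not included here), and the eigenvalue-based trace of the matrix square root is one possible numerical route, not necessarily the one used by the authors.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """feats_*: arrays of shape (N, D), one feature vector per image.
    Returns ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2.
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
same = frechet_distance(real, real)           # identical sets: distance near 0
shifted = frechet_distance(real, real + 2.0)  # same covariance, mean shifted by 2 per dim
```

Shifting every feature by 2 in each of the 4 dimensions leaves the covariance unchanged, so the distance reduces to the squared mean difference, 4 × 2² = 16.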
Peak signal-to-noise ratio (PSNR). PSNR is based on the error between corresponding pixel points and is one of the most widely used objective image evaluation metrics. A higher PSNR score indicates a better similarity between the reconstructed and ground truth color images.
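For concreteness, PSNR can be computed from the mean squared error between two images as in the short sketch below. The function name and the default peak value of 255 (8-bit images) are assumptions for illustration.

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = img_a.astype(np.float64) - img_b.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

a = np.zeros((8, 8, 3), dtype=np.uint8)
b = np.ones((8, 8, 3), dtype=np.uint8)
score = psnr(a, b)  # MSE = 1, so the score is 20 * log10(255)
```

Casting to `float64` before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.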
Structural similarity index (SSIM). SSIM is computed between the reconstructed image and the original color image and measures how well the contours of the drawing are preserved during the colorization process. The higher the score, the more similar the two images are; the ideal value is 1.
To evaluate the performance of different methods, we randomly selected reference and sketch images for colorization and used the above three metrics for a quantitative study. Table 1 shows the results.
Table 1. Quantitative comparisons show that our model trained for Thangka colorization outperforms the models trained by SCFT [2], SGA [3] and Munit [20] (tests are conducted on both the Danbooru 2019 dataset of 2000 images and the Thangka dataset of 581 images). FID [34] score: a lower score is better. PSNR and SSIM score: a higher score is better.
We report the FID, SSIM and PSNR scores calculated by these models on different datasets in Table 1. Our model scores show that the proposed SCA module in the model plays a valuable role in generating realistic images by establishing context-supervised semantic correspondence through skip connections. Our method produces results closest to ground truth color images, which demonstrates the realism and robustness of our method on different images. We show more examples of Thangka sketch coloring in Figure 6.

Figure 6. Columns: Sketch, Reference, Output.

Ablation Study
We conducted several ablation experiments to validate the effectiveness of each component of our approach, namely multi-level adaptive-instance-normalized color fusion (MACF) and skip connection attention (SCA). Table 2 reports the quantitative ablation results, reflecting the validity of our model. PSNR/SSIM metrics are evaluated by paired sketch/reference inputs, and the FID is assessed by random reference.
First, we removed the MACF module to evaluate the effectiveness of multi-level adaptive-instance-normalized color fusion (MACF), which obtained poor performance in Table 2, verifying the necessity of our MACF.
Second, we conducted an ablation study on the skip connection attention (SCA) module to verify the advantages of the skip attention mechanism in our framework. Table 2 shows that the model's performance with the SCA module is significantly better than the model without the SCA module. Although a realistic image can be generated without the SCA module, it has a lower contour retention, i.e., SSIM metric.
Finally, in the third row of Table 2, we show the results of our whole model, and we can see that it performs with a significantly superior quality of image generation. The contour retention rate, i.e., the SSIM, is also the highest. Our ablation study demonstrated the effectiveness of MACF and SCA. We show two qualitative examples in Figure 7 that demonstrate the effectiveness of MACF and SCA.

Figure 7. Columns: Sketch and Reference, W/O MACF, W/O SCA, FULL. The proposed multi-level adaptive-instance-normalized color fusion (MACF) and skip connection attention (SCA) allow our MACF-SCA to produce visually better and more accurate colorization results by multi-level color fusion (see the fourth column). In contrast, the second column (without MACF strategy) shows obvious color confusion. In the third column (without SCA strategy), color errors caused by misjudging the semantic information of the lines are easy to find.

Conclusions
We propose a method of colorizing Thangka sketches based on multi-level adaptive-instance-normalized color fusion (MACF) and skip connection attention (SCA) for generating Thangka artworks. The method consists of two parts: (1) a new multi-level adaptive-instance-normalized color fusion module (MACF) for the accurate fusion of color features with sketch features, and (2) a new skip connection attention (SCA) for accurately distinguishing semantic information composed of dense lines. Experiments on two different datasets show that our method can produce more visually plausible and richer colorization maps compared to the existing methods. Both objective and subjective evaluations validated the performance of our method.
Although our method works well on anime and Thangka sketches, our outputs can still be affected by the colors and textures of style references. If the style reference contains little color information, output quality may be unsatisfactory. In addition, this study focuses more on coloring religious artworks, Thangka and anime, while other types of inputs still need further optimization and improvement. Future work includes using more related techniques such as color mapping, color gradient generation, and color blending to enhance the expressiveness and fidelity of coloring. It also includes adapting the model to accommodate more types of inputs and improving the model to handle higher resolution images.

Data Availability Statement:
The dataset used in this study is available upon request from livingsailor@gmail.com; public release will be considered at a later stage.