Exemplar-Based Sketch Colorization with Cross-Domain Dense Semantic Correspondence

This paper addresses the task of colorizing a sketch image given a ready-colored exemplar image. Conventional exemplar-based colorization methods transfer styles from reference images to grayscale images by employing image-analogy techniques or establishing semantic correspondences. However, their practical capabilities are limited when semantic correspondences are elusive. This is the case for sketch colorization, where semantic correspondences are challenging to find because a sketch contains only the edge information of an object and usually much noise. To address this, we present a framework for exemplar-based sketch colorization that synthesizes colored images from a sketch input and a reference input in a distinct domain. Specifically, we jointly train a domain alignment network, in which dense semantic correspondence can be established, with a simple but effective adversarial strategy that we term the structural and colorific conditions. Furthermore, we propose to utilize a self-attention mechanism for style transfer from the exemplar to the sketch; it facilitates the establishment of dense semantic correspondence, and we term it the spatially corresponding semantic transfer module. We demonstrate the effectiveness of the proposed method on several sketch-related translation tasks via quantitative and qualitative evaluation.


Introduction
A sketch roughly describes the attributes and appearance of an object or a scene with a series of lines; sketch colorization assigns colors to such binary line images to improve their visual quality while preserving the original semantic information. Neural style translation has succeeded in image translation, rendering an image with changed color and texture while keeping its content characteristics unchanged [1][2][3][4][5][6]. Previous neural translation methods perform well on grayscale images but not on sketch manuscripts, so translation tasks on sketches have attracted a great deal of attention in both the content industry and computer vision. In contrast to sketch colorization, grayscale colorization mainly relies on the assumption that neighboring pixels with similar grayscale intensities should have similar colors. Sketch images are information-scarce, which makes their colorization naturally challenging, and we consider that previous methods may fail to learn the more difficult mapping from sketches with intricate edges to colored images. Two types of sketch colorization methods have been explored: hint-based approaches (e.g., strokes, palettes, and text) and reference-based approaches.
An intuitive way to colorize a sketch is with a small amount of auxiliary information given by users, such as stroke hints [7][8][9][10], color palettes [11,12], and text labels [13][14][15]. Although these hint-based colorization methods show impressive results, they still require unambiguous color information and precise spatial user input at every step. A more convenient coloring mode has therefore appeared: utilizing exemplar images for sketch colorization. In exemplar colorization, a critical point is preparing a sufficiently large number of semantically paired training images together with ground truth that reflects the color of a given exemplar. One attempt [16] used geometric distortion and color perturbation to synthesize a pseudo ground truth; however, it fails to handle cross-domain samples well and is prone to mode collapse. Consequently, some research has targeted cross-domain learning and successfully employed it in image translation. Early methods [17][18][19][20][21][22] focus on utilizing low-level features to compose the colorization. Although these early methods broadened thinking about style transfer, they still have many limitations: (1) the source and target images must be somewhat similar in form and shape; (2) the global semantic features of the image are poorly represented; and (3) the style of the generated image is monotonous, with insufficient texture diversity.
To surmount such problems, recent studies [23][24][25][26][27][28][29][30] have explored establishing cross-domain correspondence between the exemplar and the source input. Extensions of Image Analogies [28] and Deep Analogy [29] try to establish dense, semantically meaningful correspondence for an input pair using pre-trained VGG layers. We deem that such methods may fail to handle sketch colorization. To accommodate the sketch (or mask, edge) format in image translation, some studies [24,31,32] explicitly divide the exemplars into semantic regions and learn to synthesize each region separately. Other research [23,27,30] utilizes deep networks to compose semantically close source-reference pairs or exploits histograms [30] to incorporate sketches into training, and in this manner produces high-quality results. However, these methods are domain-specific and unsuited to sketch colorization, where the input consists only of complex edges. Additionally, they transfer only the global context style, disregarding spatially relevant information and partial local styles.
Our concern is how to establish dense correspondence between a sketch and an exemplar more efficiently. Our motivation rests on two issues: firstly, how to model and extract local and non-local styles from exemplar images more efficiently; secondly, how to learn the mapping with the desired style information extracted from exemplars while preserving the semantically meaningful sketch composition. For the first issue, we propose a cross-domain alignment module that transforms the distinct domain inputs into a shared, embedded space to further learn the dense correspondence in both local and non-local style manners. For the second, we propose a module that explicitly transfers the canonical contextual representation to the spatial locations of the sketch input through a self-attentive pixel-wise feature transfer mechanism, which we term the cross-domain spatially feature transfer module (CSFT). Finally, a set of spatially-invariant de-normalization blocks with a Moment Shortcut (MS) connection [33] is employed to synthesize the output progressively; then, a specific adversarial framework for colorization tasks, dual multiscale discriminators capable of distinguishing structural composition and style coloration, respectively, is introduced to facilitate the joint training of the alignment module and guide the reconstruction of the stylized output. This indirect supervision removes the requirement for manually annotated samples with visual correspondence between source-exemplar pairs and enables the network to be fully optimized in an end-to-end manner.
Qualitative and quantitative experimental results show that our method outperforms previous methods and exhibits state-of-the-art performance. These promising results demonstrate its great potential for practical applications in various fields. The main contributions of this paper can be summarized as follows: • The cross-domain alignment module is proposed for projecting the distinct domains into a shared, embedded space, progressively aligning and outputting the warped image in a coarse-to-fine manner. • To facilitate the establishment of dense correspondence, we propose an explicit style transfer module utilizing a self-attention-based pixel-wise feature transfer mechanism, which we term the cross-domain spatially feature transfer module (CSFT).
• We propose a specific adversarial strategy for exemplar-based sketch colorization to improve imaging quality and stabilize adversarial training.

Image-to-Image Translation
Image-to-image translation is the problem of converting one possible representation of a scene into another, such as mapping a semantic mask to an RGB image or vice versa. Most prominent previous approaches tackle translation tasks with a generative adversarial network [34] that leverages either paired data [6,35,36] or unpaired data [37][38][39]. These generative models solve image-to-image translation across different domains, but they can only learn the latent representation between two specific domains at a time, which makes it hard to handle transformations between multiple domains. Therefore, Liu et al. [40] designed the UNIT network based on GANs and VAEs, realizing unsupervised image-to-image conversion by learning a shared latent space. Then, Choi et al. [41] proposed StarGAN, which is trained on multiple cross-domain datasets to realize multi-domain transformation. However, none of these methods address the geometric gap between the source content and the style target. Additionally, previous methods lack delicate control over the final output because the latent-space representation is rather complex and only implicitly corresponds to the exemplar style. In contrast, our cross-domain alignment module supports customization of the final colorization result by a given user-guided exemplar in a coarse-to-fine manner of warping and refining, allowing users to control the desired effect flexibly.

Sketch-Based Tasks
A sketch is a rough visual representation of a scene or object made up of lines and edges. It has been utilized in several computer vision tasks such as image retrieval [42,43], sketch generation [44,45], and sketch recognition [46]. Unlike other image-to-image translation settings, sketch colorization plays a unique role in content creation. Frans [47] used a GAN-based colorization model with a user-defined color scheme, but it hardly generated agreeable results. Ci et al. [7] explored line art colorization in the field of animation, introducing ResNeXt and a pre-trained model to alleviate overfitting. Hati et al. [9] built on Ci's model, introducing a dual generator to improve visual fidelity at the cost of greatly increasing the number of parameters. Style2Paints [8] was published as a famous GitHub project with 14k stars, its newest version being Style2Paints V4.5 beta; the V4.5 version generates visually pleasing line art colorization results by splitting line art images into different parts and colorizing them separately. Zhang et al. [48] used a U-Net residual architecture and an auxiliary classifier to preliminarily realize anime-style sketch colorization. Although these methods show impressive results for sketch-based coloring, they inevitably require precise color information and a certain amount of geometric cueing that the user must provide at each step.
An alternative approach, which utilizes an already-colored image as an exemplar to colorize sketches, has been introduced to surmount these inconveniences. Lee et al. [16] explored geometrically augmented self-reference in the training process to generate forged sample pairs. Sun et al. [30] composed semantically related reference pairs by color histogram. Lian et al. [49] explored an encoder-free anime sketch colorization network using Spatially-Adaptive Normalization. However, these pair-composition methods tend to be domain-sensitive, limiting their capability to a specific dataset. In contrast, our cross-domain model applies better to cross-domain learning and different types of datasets. At the same time, we have designed a novel adversarial strategy for sketch colorization to improve the final imaging quality.

Exemplar-Based Image Synthesis
More recently, researchers [25,50,51] have proposed to synthesize images from the semantic layout of the input under the guidance of exemplars. Zhang et al. [27] design a novel end-to-end dual-branch network architecture that, when reliable reference pictures are unavailable, learns reasonable local coloring to generate meaningful references and makes reasonable color predictions. Huang et al. [51] and Ma et al. [24] propose employing Adaptive Instance Normalization [52] to transfer the style latent from the exemplar image. Park et al. [25] proposed a novel normalization layer for image synthesis, solving the problem of sparse semantic maps vanishing during synthesis in previous image synthesis tasks. In contrast to the above approaches, which pass only global styles, our approach passes fine-grained local styles from the semantically corresponding regions of the exemplar through the proposed self-attention mechanism.
Our work is inspired by recent exemplar-based image colorization, but we address a subtler problem: exemplar-based coloring of semantically sparse and informationally complex sketches. At the same time, we present a novel training scheme to learn visual cross-domain correspondence and a sound adversarial strategy designed for sketch-based tasks, aiming to improve the final imaging quality.

Proposed Method
In this section, we describe the details of the proposed method, as shown in Figure 1. We first introduce a learnable domain alignment network in which dense semantic correspondences can be established, where the CSFT module is used to find spatial-level correspondences between the inputs. Then, we apply a coarse-to-fine generator to refine the coarse images gradually. Finally, we describe the structural and colorific strategy of the proposed discriminator.
Figure 1. Overview of the framework: the Domain Alignment Network, the Generator Network with the Moment Shortcut strategy, and the Discriminator Network with the structural and colorific conditions. Given the sketch input x_s ∈ R^(H×W×1) and the exemplar input y_e ∈ R^(H×W×3), the Domain Alignment Network adapts them into a common domain c, where the dense correspondence is established, to get the coarse outputs. Then, the generator refines the coarse images and outputs the refined images.

Domain Alignment Network
Image analogy [28,29,53] is a typical style migration method that uses a pre-trained VGG network to extract high-level abstract semantic information and find a suitable match on the target image (e.g., a realistic photo converted to a painting with the same semantic target). However, this approach does not apply to the migration task of sketches, since sketches contain only a limited binary structure and the conventional VGG layers cannot extract suitable features for matching. Therefore, we propose a domain alignment network to establish correspondence between sketches and exemplars. Because conventional domain alignment struggles to obtain a common domain across different semantics and styles, we further propose a cross-domain spatially feature transfer (CSFT) module to help solve this problem.

Domain Alignment
To be specific, let the user inputs be x_s ∈ R^(H×W×1) and y_e ∈ R^(H×W×3), where s denotes the sketch domain, e denotes the exemplar domain, and H, W denote the height and width, respectively. Additionally, we construct exemplar training pairs using paired data {x_s, x_e} that are semantically aligned but differ in domain; exemplar training pairs {y_e, y_s} are constructed in the same way, as shown in Figure 2. Firstly, we project the given inputs x_s and y_e into a common domain c whose representation can capture the semantics of both input domains. Let F(x_s), F(y_e) be the corresponding features of x_s, y_e, where F(·) ∈ R^(H×W×L), L denotes the number of produced activation maps (f_1, f_2, ..., f_L), and H, W are the feature spatial sizes. Then, let F_(s→c) and F_(e→c) be the feature embeddings into the common domain c, so the representations can be formulated as:
x_c = F_(s→c)(x_s; θ),    y_c = F_(e→c)(y_e; θ),
where θ denotes the learnable parameters of the feature layers. The representations x_c and y_c contain the semantic and stylistic features of the inputs. In practice, domain alignment is crucial for correspondence establishment because x_c and y_c can then be matched with a similarity measure in the same domain. Therefore, how to draw the representations x_c and y_c closer together is a critical issue.
Figure 2. The illustration of training pairs. We construct paired data {x_s, x_e} (a,c) and {y_e, y_s} (b,d). In the training phase, we shuffle the data as the training-pair inputs (e.g., e-h). Subscript e denotes the exemplar domain, and subscript s denotes the sketch domain.

Dense Correspondence
This subsection describes how to close the distance between the features x_c and y_c obtained in the previous section. We use the cosine distance proposed by Zhang et al. [27], which has the advantage of reducing intra-class distances while enlarging inter-class differences. Our goal is to build a learnable module that finds the correlation matrix M ∈ R^(HW×HW), which records the spatial correspondence between the representations. Let i and j denote spatial positions of the channel-wise centralized features x̂_c(i) ∈ R^C and ŷ_c(j) ∈ R^C. The correlation can then be written as:
M(i, j) = x̂_c(i)^T ŷ_c(j) / (‖x̂_c(i)‖ ‖ŷ_c(j)‖).
The matrix M indicates a dense pixel-by-pixel spatial correspondence.
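The correlation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: features are assumed to be flattened to one vector per spatial position, and the centralization axis is a simplifying assumption.

```python
import numpy as np

def correlation_matrix(x_c, y_c, eps=1e-8):
    """Pairwise cosine similarity between two sets of feature vectors.

    x_c, y_c: (HW, C) arrays, one C-dimensional feature per spatial position.
    Returns M of shape (HW, HW) with M[i, j] = cos(x_hat[i], y_hat[j]),
    where features are first centralized channel-wise.
    """
    x_hat = x_c - x_c.mean(axis=0, keepdims=True)  # subtract per-channel mean
    y_hat = y_c - y_c.mean(axis=0, keepdims=True)
    x_hat = x_hat / (np.linalg.norm(x_hat, axis=1, keepdims=True) + eps)
    y_hat = y_hat / (np.linalg.norm(y_hat, axis=1, keepdims=True) + eps)
    return x_hat @ y_hat.T
```

Each entry of M is bounded in [-1, 1], and identical feature sets yield a unit diagonal.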
To establish an efficient spatially dense correspondence, we also need an efficient feature transfer module that maps different local features of the input to valid regions. We do not apply direct supervised learning to the domain alignment network; instead, we perform indirect joint training through the proposed Dynamic Moment Shortcut method, which allows the entire architecture to preserve end-to-end optimization capability. In this way, the transformation network may find that high-quality coloring images can only be produced by correct domain mapping of the exemplar input, which explicitly compels the network to learn accurate dense correspondence. In light of this, we compute the warped image w_(y→x) by matching the most relevant pixels of y_e according to the matrix M in the shared domain c:
w_(y→x)(i) = Σ_j softmax_j(α M(i, j)) y_e(j),
where α denotes a coefficient controlling the degree of soft smoothing (100 by default), and y_e ∈ R^HW denotes the flattened vector of y_e.
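The soft warping step above can be sketched as a row-wise softmax over the correspondence matrix followed by a weighted average of exemplar pixels. A minimal NumPy illustration, assuming M and the flattened exemplar are already computed:

```python
import numpy as np

def warp_exemplar(M, y_e, alpha=100.0):
    """Soft warp of the exemplar toward the sketch layout.

    M:   (HW, HW) correspondence matrix from the shared domain c.
    y_e: (HW, 3) flattened exemplar pixels.
    Each target position i receives a softmax-weighted average of exemplar
    pixels; alpha controls how sharp the soft selection is.
    """
    logits = alpha * M
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)                 # row-wise softmax
    return w @ y_e
```

With a large alpha the softmax approaches a hard argmax, so each sketch position essentially copies its single best-matching exemplar pixel.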

Cross-Domain Spatially Feature Transfer
Under the guidance of Equation (4), we therefore propose the Cross-Domain Spatially Feature Transfer module, which effectively facilitates the establishment of spatially dense correspondence using the global statistical relationship between input features, as shown in Figure 3.
Figure 3. The illustration of the cross-domain spatially feature transfer (CSFT) module. CSFT establishes the dense correspondence mapping through the self-attention mechanism. The output results are used in the next conversion step, i.e., to calculate the correspondence matrix M.
To begin with, each of the two feature pyramid networks E_r and E_s consists of L convolutional layers, producing L activation maps (f_1, f_2, ..., f_L). Then, we downsample each response map f_i to the spatial size of f_L and concatenate them along the channel dimension, obtaining the organized activation feature map V, i.e.,
V = [φ(f_1); φ(f_2); ...; f_L],
where φ denotes the spatial downsampling of each feature map. In this manner, we simultaneously obtain semantic information from high to low levels of the inputs.
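The multi-level aggregation above can be illustrated with a minimal NumPy sketch. It assumes channels-last maps whose sizes are integer multiples of the smallest map, and uses average pooling as a stand-in for the paper's downsampling φ:

```python
import numpy as np

def build_activation_map(feature_maps):
    """Downsample L activation maps to the spatial size of the last (smallest)
    map via average pooling, then concatenate along the channel axis.

    feature_maps: list of (H_i, W_i, C_i) arrays whose sizes are integer
    multiples of the last map's size (a simplifying assumption here).
    """
    h_t, w_t = feature_maps[-1].shape[:2]
    pooled = []
    for f in feature_maps:
        h, w, c = f.shape
        sh, sw = h // h_t, w // w_t
        # average-pool by reshaping into (h_t, sh, w_t, sw, c) blocks
        pooled.append(f.reshape(h_t, sh, w_t, sw, c).mean(axis=(1, 3)))
    return np.concatenate(pooled, axis=-1)
```

The result stacks coarse-to-fine semantics at a single spatial resolution, ready for the attention step that follows.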
After that, given v_s^i and v_r^j, we obtain the self-attention matrix A ∈ R^(hw×hw); following [54], the scaled dot-product attention weight α_ij is:
α_ij = softmax_j( (W_q v_s^i)^T (W_k v_r^j) / √d_v ),
where W_q, W_k ∈ R^(d_v×d_v) are learned projections (multilayer perceptrons) and √d_v denotes the scaling factor. α_ij measures how much information v_s^i should draw from v_r^j. We can then obtain the context vector V* of region i of the exemplar image.
Then, the dimension of V* is adjusted by operations such as a 1 × 1 convolution to obtain x_c and y_c.
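A minimal NumPy sketch of the scaled dot-product attention used by CSFT. Plain matrices stand in for the learned projections, which is an assumption for illustration:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def csft_attention(v_s, v_r, W_q, W_k):
    """Scaled dot-product attention from sketch features to exemplar features.

    v_s, v_r: (hw, d_v) flattened activation maps; W_q, W_k: (d_v, d_v)
    projection matrices (stand-ins for the paper's multilayer perceptrons).
    Returns the attention matrix A (hw, hw) and context vectors V* = A @ v_r.
    """
    d_v = v_s.shape[1]
    scores = (v_s @ W_q) @ (v_r @ W_k).T / np.sqrt(d_v)
    A = softmax(scores, axis=1)  # how much position i draws from position j
    return A, A @ v_r
```

Each row of A sums to one, so every sketch position receives a convex combination of exemplar features.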

Coarse-to-Fine Generator
We employ a coarse-to-fine generative architecture to jointly train the domain alignment network, providing end-to-end training capability for the model. To avoid the failure of coarse image generation, we incorporate a Dynamic Moment Shortcut (DMS) structure in the generator, which has been shown to facilitate the generation of coarse deformation images.

Dynamic Moment Shortcut
Inspired by Dynamic Layer Normalization [55,56] and Positional Normalization [33], we employ a Dynamic Moment Shortcut (DMS) in our generator. In generative models, although a conventional normalization layer may promote convergence, it eliminates important semantic information from the images, which can cause generation failures and forces decoder structures with huge parameter counts to relearn the feature maps.
Instead, DMS injects the positional moments extracted from earlier layers into later layers of the network, enabling joint training of the domain alignment network with a decoder of low parameter count.
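A minimal NumPy sketch of the positional moments underlying the shortcut, following Positional Normalization [33]: the per-position mean and standard deviation across channels are extracted when normalizing an early layer and re-injected into a later layer. The function names are illustrative:

```python
import numpy as np

def positional_normalize(x, eps=1e-5):
    """Positional Normalization: per-position moments across channels.

    x: (C, H, W). Returns the normalized map plus the moments (mu, sigma);
    a Moment Shortcut re-injects them into a later layer so structural
    information removed by normalization is not lost.
    """
    mu = x.mean(axis=0, keepdims=True)           # (1, H, W)
    sigma = x.std(axis=0, keepdims=True) + eps   # (1, H, W)
    return (x - mu) / sigma, mu, sigma

def moment_shortcut(later, mu, sigma):
    """Re-inject earlier positional moments into a later feature map."""
    return later * sigma + mu
```

Applying the shortcut to the normalized map itself recovers the original features, which is exactly the structural signal the decoder would otherwise have to relearn.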

Structural and Colorific Strategy
To improve the color quality of the sketches, we propose the colorific and structural strategy, which effectively contributes to aesthetically pleasing coloring results. Next, we describe the structural and colorific strategy in detail.

Structural Condition
The structural conditions are a brief overview and representation of the objects; we represent them using a series of binary black-and-white images, i.e., the sketches themselves. Concretely, we apply xDoG [57] in the training phase to generate simulated sketches, which constitute our structural conditions. We train the discriminator by pairing the structural information with the exemplar and the generated samples, respectively, letting the discriminator focus on the structural plausibility of the generated images and their consistency with the sketches. The ablation experiments show that the structural discriminator reduces the occurrence of color diffusion.
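For intuition, here is a heavily simplified sketch-extraction step in the spirit of the difference-of-Gaussians family that xDoG belongs to. It is not the paper's xDoG: a box blur stands in for Gaussian smoothing and the parameters are illustrative only.

```python
import numpy as np

def box_blur(img, radius):
    """Cheap separable box blur standing in for Gaussian smoothing."""
    k = 2 * radius + 1
    pad = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def dog_sketch(gray, r1=1, r2=2, tau=0.95, thresh=0.0):
    """Difference-of-Gaussians style binarization: subtract two blurred
    copies of a grayscale image and threshold to a binary line image."""
    d = box_blur(gray, r1) - tau * box_blur(gray, r2)
    return (d > thresh).astype(np.uint8)
```

In training, such binarized outputs of color images play the role of the structural condition fed to the discriminator.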

Colorific Condition
The colorific condition indicates whether an image's colors match the exemplar image, and it is the key to generating reasonable colors. Our model strives to generate a reasonable coloring result given a sketch image and a reference image. We apply multi-scale discriminators in the discriminator network and use image-processing techniques to automatically extract sketches and color styles from the RGB images.
Specifically, we compute a 3D Lab color histogram (8 × 8 × 8) for each RGB image [30] and then merge exemplar images whose colors are close by measuring histogram similarity via k-means clustering. As shown in Figure 4, we obtain an image with color similar to the reference input as our color conditional input. In this way, the discriminator improves its sensitivity to the color correlation between the generated images and the exemplar.
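The histogram step above can be sketched as follows. This is a minimal illustration assuming pixels are already converted to Lab; the value ranges are the common Lab conventions and the clustering step is left out:

```python
import numpy as np

def lab_histogram(lab_pixels, bins=8):
    """Normalized 3D color histogram (bins^3 entries) over Lab pixels.

    lab_pixels: (N, 3) with L in [0, 100] and a, b in [-128, 127].
    Two exemplars can then be compared (or grouped, e.g. with k-means)
    by the distance between their histogram vectors.
    """
    ranges = [(0.0, 100.0), (-128.0, 127.0), (-128.0, 127.0)]
    hist, _ = np.histogramdd(lab_pixels, bins=(bins,) * 3, range=ranges)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)
```

With 8 bins per axis, each image is summarized as a 512-dimensional vector that sums to one.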

Structural and Colorific Discriminators
As shown in Figure 5, we use paired discriminators with structural and colorific conditions to jointly train the generator. Specifically, the structural discriminator determines whether the generated images are structurally plausible and maintain structural consistency with the sketch input; we carefully designed positive and negative sample pairs to make it sensitive only to the resulting structure. The colorific discriminator identifies whether the resulting colors are reasonable; we compose positive and negative samples from images with different structures but similar colors, which forces it to be more sensitive to changes in color patterns and promotes generated images that retain more of the exemplar's style. The structural discriminator attends to the spatial scale, while the colorific discriminator focuses on the style domain.

Loss for Exemplar-Based Sketch Colorization
We jointly train the domain alignment network and generator network along with the following loss functions.

Loss for Exemplar Translation
As shown in previous work [58], perceptual loss penalizes the semantic gap in the generated output, measured as the multi-scale spatial differences of intermediate activation feature maps between the generated output and the ground truth in a pre-trained VGG network:
L_perc = ‖φ(G(x_s, y_e)) − φ(x_e)‖_1, (10)
where φ denotes the activation feature maps extracted at relu5_2 of the pre-trained VGG19 network. Sajjadi et al. [59] have shown that reducing a style loss on the difference between the covariances of the activation maps helps resolve the checkerboard effect. Therefore, we apply a style loss to facilitate style transfer from the exemplars as follows:
L_style = ‖G(φ(G(x_s, y_e))) − G(φ(y_e))‖_1, (11)
where G(·) applied to the activations denotes the Gram matrix. Meanwhile, we employ the contextual loss of [60] to let the output adopt the style of the semantically corresponding patches from y_e:
L_context = Σ_l w_l [ −log( (1/n_l) Σ_i max_j A_l(φ_i^l(G(x_s, y_e)), φ_j^l(y_e)) ) ], (12)
where i and j index the feature maps of layer φ^l, which contains n_l feature maps, A_l denotes the pairwise affinity of [60], and w_l controls the relative importance of different layers. In contrast to the style loss, which primarily utilizes high-level features, the contextual loss uses the relu2_2 through relu5_2 layers because low-level features capture richer style information (e.g., color or texture) used to convey the exemplar appearance.
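The Gram matrix at the core of the style loss is simple to compute. A minimal NumPy sketch, using a channels-first activation map and the common normalization by the number of spatial positions (an assumption, since the paper does not spell out the normalization):

```python
import numpy as np

def gram_matrix(phi):
    """Gram matrix of an activation map phi: (C, H, W) -> (C, C),
    normalized by the number of spatial positions."""
    c, h, w = phi.shape
    f = phi.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_loss(phi_out, phi_ref):
    """L1 distance between Gram matrices of output and exemplar activations."""
    return np.abs(gram_matrix(phi_out) - gram_matrix(phi_ref)).sum()
```

Because the Gram matrix discards spatial arrangement, the loss compares feature co-occurrence statistics rather than layout, which is why it captures style.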

Loss for Pseudo Reference Pairs
We construct training exemplar pairs {x_s, x_e} that are semantically aligned but domain-separated. Concretely, we apply a random geometric distortion s(·), such as a thin-plate-spline (TPS) transformation, a non-linear spatial warping operator, to x_e, and obtain the distorted image x̃_e = s(x_e). This keeps our model from lazily copying colors from the same spatial positions of x_e. The output for x_s should be its counterpart x_e when x̃_e is used as the exemplar. We penalize the pixel-wise difference between the output and the ground truth x_e as below:
L_pixel = ‖G(x_s, x̃_e) − x_e‖_1.

Loss for Domain Alignment
We need to ensure that the representations x_c and y_c lie in the same domain for the domain alignment to be meaningful. To achieve this, we use the pseudo exemplar pairs {x_s, x_e} and {y_s, y_e} to establish the shared domain c by penalizing the L1 distance between the representations:
L_align = ‖F_(s→c)(x_s) − F_(e→c)(x_e)‖_1 + ‖F_(s→c)(y_s) − F_(e→c)(y_e)‖_1. (14)
In this way, the model gradually learns to map different domains into a common domain.

Loss for Adversarial Network
We propose to train a conditional discriminator [34] with the structural and colorific conditions to discriminate the translated output from ground-truth samples of distinct domains. We construct the discriminator input as in Section 3.3.
where I_similar denotes the sample that is similar in color to the exemplar input x_e.

Experiments
This section demonstrates the superiority of our approach on a range of domain datasets, including real photos and anime (comics).

Implementation
We implement our model with input images fixed at 256 × 256 resolution on every dataset. For training, we adopt the Adam solver with β_1 = 0.5, β_2 = 0.999, and the learning rates of the generator and discriminator are both initially set to 0.0001, following TTUR [61]. We conduct the experiments on an NVIDIA GeForce RTX 3090 with a batch size of 8; training 100 epochs on the Animepair dataset takes about three days.

Anime-Sketch-Colorization-Pair Dataset
We use Kaggle's anime-sketch-colorization-pair [62] dataset to train our model to validate the model's performance on hand-drawn data. It contains 14,224 training samples and 3545 test samples, including paired hand-crafted sketch images and corresponding color images.

Animal Face Dataset
The Animal Face Dataset [63] includes 16,130 high-quality animal face images spanning several distinct domains of animal species, namely cats, dogs, and wildlife, with the wildlife domain including lions, tigers, foxes, and other animals. We use this dataset to validate the model's performance on cross-domain image translation, and it turns out that our model works well.

Edge2Shoe Dataset
Edge2Shoe [64,65] contains paired sketch and color shoe images that have been widely used for image-to-image conversion tasks. With this dataset, we can effectively evaluate the performance of our method and existing methods on unpaired image-to-image transformation tasks.

Comparisons to Baselines
We select different state-of-the-art image translation methods for visual comparison.

Quantitative Evaluation
The quantitative model performance on different datasets is shown in Table 1. We evaluate our proposed method from five aspects:
• Firstly, we use the Fréchet Inception Distance (FID) [61] to measure the distance between the synthetic and natural image distributions. FID computes the Wasserstein-2 distance between two Gaussian distributions fitted to the feature representations of the pre-trained InceptionV3 network [66]. As Table 2 shows, our proposed model attains the best FID score among the compared models.
• Peak Signal-to-Noise Ratio (PSNR) is an engineering term for the ratio between a signal's maximum power and the power of the noise that corrupts its fidelity. We also evaluate the PSNR of the models on different datasets, as shown in Table 3, and our model achieves good performance.
• Structural Similarity (SSIM) [67] is an image quality metric that measures the similarity of two images in three aspects: brightness, contrast, and structure. Larger values are better, with a maximum of 1. The quantitative results are shown in Table 4.
• NDB [68] and JSD [69]. To measure the similarity between the distributions of real and generated images, we use two bin-based metrics, NDB (Number of Statistically-Different Bins) and JSD (Jensen-Shannon Divergence), which evaluate the degree of mode missing in the generative model. Our model achieves good performance, as shown in Table 5.
Figure 8. Qualitative results of our method on the anime dataset. Each row has the same semantic content, while each column has the same reference style. Please note that all the above results are generated from unseen images because the goal of our task is not to reconstruct the original image.
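As a concrete reference for one of the metrics above, PSNR has a short closed form; the following is a standard definition in NumPy, not tied to any particular evaluation toolkit:

```python
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images (higher is better)."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For example, two 8-bit images differing everywhere by 16 gray levels score roughly 24 dB.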

Figure 9. Qualitative results of our method on the animal-face dataset (columns: exemplar, coarse output, output, ground truth). Each row has the same semantic content, while each column has the same reference style.

Ablation Study
In order to verify the effectiveness of each part, we organized tailored ablation experiments. As Table 6 shows, the domain alignment loss L_align plays a crucial role in cross-domain image translation; it not only effectively facilitates training but also helps generate satisfying images. We also ablate the contextual loss L_context. In our experiments, we found that although the network still produced a final output without it, the feature correspondence could have large mismatches, and using L_context enabled the correspondence to be well established. As shown in Table 2, we performed ablation experiments for the proposed structural and colorific conditions; the results prove that the strategy effectively reduces detail loss and color diffusion. As shown in Figure 10, the colorific condition promotes correct matching between the exemplar style and the sketch, and the structural condition reduces mismatches and color diffusion. As Table 2 shows, the FID metric improves with the addition of the CSFT module, since CSFT effectively facilitates the establishment of pixel-level correspondences and eliminates certain incorrect dense correspondences. At the same time, we found in practice that joint training with CSFT facilitates coarse image generation for the domain alignment network. The control group for CSFT is a series of residual convolutions that maintain input-output invariance.

Discussion
The method proposed in this paper facilitates the solution of the problem of coloring sketches with sparse information. Traditional image translation or image transfer methods are not well suited for sketch colorization tasks because they have limited capability to establish correspondence between sparse semantic images and exemplars. Therefore, for sketch colorization, we propose a cross-domain alignment network that facilitates dense correspondence at the pixel scale using the proposed CSFT module, and the proposed structural and colorific conditions can be effectively applied to exemplar-based sketch colorization tasks.
Our model is mainly trained on cropped image data at a restricted resolution (e.g., 256 × 256); we do not employ a multi-scale architecture like pix2pixHD [36] for high-resolution image synthesis. Moreover, the model is not exhaustive: because of the diversity and uncertainty of user input, it is difficult to establish a perfect correspondence. It is therefore challenging for the model to judge the suitability of a given style and to colorize reasonably within a specific limit from the style of the user-given exemplar. For example, on the animal face dataset, we find that the converted results are not always satisfactory. This is caused firstly by excessive differences between species and secondly by the fact that the model cannot yet establish a perfect dense correspondence; in such cases, how to generate aesthetically and intuitively appropriate results should be our consideration.
Currently, the model proposed in this paper has been initially tried in a sketch colorization task. We believe that the proposed model has good potential for cross-domain image translation tasks. In the future, we plan to extend the framework to the high-resolution domain and integrate style-consistent examples into the keyframes of video data.

Conclusions
In this paper, we present a cross-domain translation framework for exemplar-based sketch colorization tasks. We propose the cross-domain alignment module, which effectively establishes correspondence between isolated domains. To further promote cross-domain learning, we propose a pixel-wise feature transfer component based on the self-attention mechanism, called the cross-domain spatially feature transfer module (CSFT). At the training stage, we design a simple and effective strategy that we term the structural and colorific conditions, which effectively promotes image quality. Our method achieves better performance than existing methods in both qualitative and quantitative experiments. In addition, our method learns dense correspondences of sketch images, paving the way for interesting future applications and showing significant potential in content creation and other fields.

Conflicts of Interest:
The authors declare no conflict of interest.