Syntheses of Dual-Artistic Media Effects Using a Generative Model with Spatial Control

We present a generative model with spatial control to synthesize dual-artistic media effects: it generates different artistic media effects on the foreground and background of an image. To apply a distinct artistic media effect to a photograph, deep learning-based models require a training dataset composed of pairs of photographs and their corresponding artwork images. To build such a dataset, we apply existing techniques that generate artwork images, including colored pencil, watercolor and abstraction effects, from photographs. To produce a dual artistic effect, we apply a semantic segmentation technique that separates the foreground and background of a photograph. Our model then applies different artistic media effects to the foreground and background using a spatial control module such as a SPADE block.


Introduction
Synthesizing artwork images is one of the most frequently addressed problems in computer graphics. Researchers in the computer graphics community have developed various computational models that mimic artistic media effects in order to synthesize visually pleasing styles expressed with real media such as pencils or brushes [1][2][3]. Recently, the advancement of deep learning techniques has greatly accelerated artistic style synthesis [4]. The style embedded in a sample artwork is transferred to a target photograph using deep convolutional neural networks, without developing explicit computational models.
Using a texture-based approach, deep learning-based artistic style synthesis techniques transfer the styles extracted from a sample artwork image to a target photograph. Recently, pix2pix [5] successfully translated images from one domain to another by learning the styles inherent in a domain and applying them to images from the other. However, pix2pix requires a training set of matched pairs: the content of the images belonging to the two domains must coincide. Consequently, pix2pix has hardly been applied to artistic style translation, since it is very difficult to collect paired sets of photographs and matching artwork images.
Zhu et al. [6] presented CycleGAN, which removes pix2pix's requirement for a paired dataset. Instead, they present a cyclic generation of images in two distinct domains and devise a cycle consistency constraint between the original and generated images. CycleGAN successfully applies various artistic styles inherent in an artwork domain to images in the photograph domain. However, the style applied to photographs tends to be faint, since CycleGAN averages the styles extracted from many samples. Later, Park et al. [7] introduced GauGAN, which includes a Spatially Adaptive Denormalization (SPADE) module that regulates style transfer in locally segmented regions. They collect similar regions from a set of images and apply styles through region-based normalization.
In some visual content, selected regions of a scene that the creator wants to emphasize are rendered in a different style. For example, in Schindler's List, the famous black-and-white film, only the girl in a red coat was rendered in color, conveying her plight to the audience more intensely. The most widely used way to implement different artistic styles is to build a mask that specifies the regions where each style is applied. A stylized image produced with such a mask, which segments an image into separate regions, can serve as the ground truth for the result of our method. Building on recent deep learning-based methods that produce artistic styles, we aim to develop a technique that applies dual artistic styles to a scene in an end-to-end framework. A model that produces both single and dual artistic styles in an identical end-to-end framework can reduce the effort of synthesizing artistic styles with masks.
We present SDAGAN (Spatially-controlled Dual Artistic effect Generative Adversarial Network) to synthesize dual artistic effects from an input photograph. The first motivation of SDAGAN is to apply the pix2pix approach to style transfer in order to produce a salient artistic media effect on an input photograph. To build a training dataset of matched pairs, we employ existing techniques that mimic artistic media effects. We chose the three most representative artistic media effects used in the computer graphics and computer vision communities: watercolor, color pencil drawing and abstraction with lines. We use Kang et al.'s work [8] for abstraction with lines, Bousseau et al.'s work [2] for watercolor and Yang et al.'s work [3] for color pencil drawing.
The second motivation is to apply different artistic media effects on an input photograph. For this purpose, we apply a semantic segmentation scheme to segment the photograph into foreground and background. We employ SPADE from Park et al.'s work [7] to apply dual artistic effects on a photograph. For example, the foreground of a photograph may be stylized with a watercolor effect, while the background may be stylized with a pencil drawing effect. Some researchers [9][10][11][12] have used this approach. However, these schemes have limitations in synthesizing artistic media effects.
We illustrate our framework in Figure 1. A single artistic effect is generated using our SDAGAN with deactivated SPADE blocks. We attach a control vector to select the artistic effect to generate. To generate dual artistic effects, we employ a segmentation module that segments the input photograph into foreground and background. This segmentation map is fed into the SPADE blocks to specify the regions where the artistic effects are applied. We assign two condition vectors for dual artistic effects. As illustrated in Figure 1b, we can generate various combinations of dual artistic effects. SDAGAN is trained in two modes: single artistic media effect synthesis and dual artistic media effect synthesis. The single synthesis mode is trained like pix2pix [5], using pairs of input photographs and their matched stylized images; the SPADE blocks are deactivated in this mode. The dual synthesis mode reuses the pre-trained parameters from the single synthesis mode and additionally uses the segmentation information during training. After training, SDAGAN synthesizes dual artistic media effects on an input photograph, with different effects applied to the foreground and background.
Our contribution can be summarized as follows:
• We build a paired dataset of photographs and their matched artwork images by employing existing techniques that synthesize various artistic media effects. This approach enables us to generate artwork images with very realistic artistic media effects. Using this dataset, we can execute the pix2pix approach for translating a photograph into artwork images of various styles, including abstraction, watercolor and pencil drawing.
• We develop a framework that synthesizes dual artistic media effects using region information extracted from a semantic segmentation scheme. The SPADE module introduced in GauGAN [7] is employed for this purpose. Our framework can successfully produce an artwork image whose foreground and background are depicted with different artistic media effects.

Related Work

Abstraction
Abstraction is a drawing technique that removes tiny textures and abstracts complex colors to present an object's coarse shape. Many researchers [1,8,[13][14][15][16] present abstraction techniques for photograph colors and tiny textures. They also presented various line drawing schemes for clearly presenting the distinguishing shapes of objects [17][18][19].

Watercoloring
Simulating the physics of watercolor on paper surfaces was pioneered by Curtis et al. [20]. Later, Bousseau et al. [2,21] presented a texture-based approach to rendering video clips in watercolor style. Kang et al. [22] presented a computational model for watercolor and Laerhoven et al. [23] presented a physically-based model for watercolor.

Pencil Drawing
Sousa and Buchanan [24] analyzed the physics of pencils and paper surfaces to mimic pencil drawing. Later, Matsui et al. [25] and Murakami et al. [26] presented pencil drawing stroke overlapping schemes. Kwon et al. [27] and Yang et al. [3] created convolution-based frameworks for pencil drawing on 3D meshes or photographs.

Deep Learning-Based Work
Gatys et al. [4] recently presented a texture-based style transfer scheme that fundamentally changed the artistic image synthesis framework. They define a Gram matrix for extracting styles at texture scale from a sample image and present an iterative optimization process for applying the styles captured in the Gram matrix to a content image. This scheme prompted a series of follow-up studies on applying styles from samples to a target image in the absence of explicit procedures. These methods are known as image-optimization-based stylization.
Model-based style transfer, another category of deep learning-based stylization, originates from Goodfellow et al.'s generative adversarial network (GAN) [28]. A GAN comprises a generator and a discriminator: the generator creates images intended to fool the discriminator, whose goal is to detect the generator's forgeries. This antagonistic relationship increases the power of the generator. After training, the generator can create images that are recognized as real photographs.
Radford et al. [29] enhanced the GAN structure by combining a convolutional network with a GAN to improve image synthesis. Isola et al. [5] presented the pix2pix framework, which learns styles from many samples and then applies the learned style to an input image. As a result, an abstract style, such as van Gogh's or Monet's, can be applied to a photograph. This approach has the limitation that styles learned from many samples tend to be averaged. Later, Zhu et al. [6] presented CycleGAN, which translates images between the two domains in both directions: an artwork image is converted to the photograph domain, and vice versa.
Recently, some researchers have focused on producing individual artistic effects. Chen and Tak [30] and Zhou et al. [31] applied the GAN architecture to produce pencil drawing effects. Kim et al. [32] applied a GAN with an attention module to produce image abstraction effects from portraits. Platkevic et al. [33] and Sochorova and Jamriska [34] presented physically-based models for simulating artistic media effects such as watercolor and oil painting.

Region-Based Work
Gatys et al. [35] presented a spatial control scheme for style transfer that applies a mask to select a series of styles from multiple style images. Champandard [9] proposed a doodle-based semantic map scheme for spatially controlled artistic image synthesis. Rather than combining styles from various samples, he focuses on controlling the style of the semantic map according to the doodles.
Huang and Belongie [10] presented an adaptive instance normalization (AdaIN) scheme for artistic image synthesis. AdaIN replaces batch normalization, which normalizes input images at the batch scale, to improve the efficiency of learning and stylization. They also used a feed-forward spatial control scheme. Li et al. [11] also presented a mask-guided spatial control scheme that uses user-edited masks to apply a series of styles to an input image. Castillo et al. [12] presented an instance semantic segmentation-based spatial control scheme for artistic image synthesis. They blend different styles on the segmented regions using a Markov random field.
Park et al. [7] recently presented GauGAN, which applies doodle-guided spatially adaptive denormalization (SPADE) for photographic image synthesis. The regions guided by a series of doodles are normalized for efficient image synthesis.

Dual Style Transfer
We briefly review several existing works that combine two different styles in a result image. Some of them focus on preserving both style and content and others consider two artistic styles.
Artists alter their styles over their lifetimes. To follow this alternation of styles, Kotovenko et al. [36] present a style transfer model that separates style and content for stylization. They aim to capture both a common style spread over all samples and the fine-detailed style of a specific sample. For this purpose, their model involves two novel loss terms: a fixpoint triplet style loss and a fixpoint disentanglement loss. These two loss terms enable a better separation of style distributions. Kotovenko et al. [37] present a style transfer model that transfers both content and style simultaneously. They focus on content transfer, which had not been studied in existing works, claiming that content details are altered according to the styles. Therefore, they devise a content transformation module located between an encoder and a decoder. This module includes a local feature normalization layer, which is effective in reducing artifacts in stylized images. Recently, Svoboda et al. [38] presented a stylization model that recombines style and content in a latent space using a two-stage peer-regularization layer. Since their model does not rely on a pre-trained network, it allows zero-shot style transfer. This approach preserves the content of the content samples in a solid way, while transferring the style from the style samples in a visually pleasing way.
Transferring styles from only one or two sample images may produce unwantedly biased results. Sanakoyeu et al. [39] resolved this limitation by devising a style-aware content loss term that trains a style transfer network composed of an encoder and a decoder. This loss term enables real-time and high-resolution stylization. Chen et al. [40] present a dual style transfer model that learns a holistic artist style and a specific artwork style simultaneously: the holistic artist style expresses the tone, while the specific artwork style depicts the detail of the stylization.

Generator
We build the structure of our generative model for artistic style synthesis on the CycleGAN structure [6]. CycleGAN's original generator combines an encoder and a decoder block. However, the input of our approach includes bi-segmentation information for applying different artistic media effects to an image. We therefore design our generator with residual blocks and SPADE [7] to properly use the bi-segmentation information. Our generator has four blocks: an encoder block, a residual block, a SPADE block and a decoder block, as illustrated in Figure 2. Figure 2. The architecture of our generator: an encoder block, a residual block and a decoder block are aligned in sequence. In the decoder block, we place three SPADE blocks to properly process the region information.

Encoder Block
In the encoder block, we downsample the input image by applying two convolutional layers. The input image, whose dimension is 256 × 256 × 3, is encoded to a 128 × 128 × 128 feature map by conv3-64 and conv3-128 layers.
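These dimensions can be checked with the standard convolution output-size formula. The strides and padding below are assumptions (the paper does not state them), chosen so that the two 3×3 convolutions halve the spatial resolution exactly once:

```python
def conv2d_out(size, kernel=3, stride=1, pad=1):
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical encoder schedule: conv3-64 with stride 2 halves 256 -> 128,
# conv3-128 with stride 1 keeps 128, matching the 128 x 128 x 128 feature map.
s = conv2d_out(256, kernel=3, stride=2, pad=1)  # -> 128
s = conv2d_out(s, kernel=3, stride=1, pad=1)    # -> 128
```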

Residual Block
At the end of the encoder block, we devise a residual block that adds two feature maps element-wise. A conv3-256 layer produces the second feature map in the residual block. This residual block prevents the loss of information that may be caused by the convolution layers.
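The element-wise sum above is the usual identity-shortcut pattern; a minimal sketch with a toy shape-preserving transform standing in for the conv3-256 layer:

```python
import numpy as np

def residual_add(x, f):
    """Identity shortcut: element-wise sum of the input and transformed features."""
    y = f(x)
    assert y.shape == x.shape, "residual branch must preserve shape"
    return x + y

# Toy stand-in for the conv3-256 branch (any shape-preserving transform works).
x = np.ones((1, 256, 32, 32))
out = residual_add(x, lambda t: 0.5 * t)
```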

SPADE Block
In the SPADE block, the bi-segmentation information is convolved and then multiplied with and added to the feature map element-wise. The feature map can come from the residual block or from the decoder block's transposed convolution layers. We place SPADE blocks on the feature maps of the transposed convolution layers, because the bi-segmentation information there separates the styles for the segmented regions. Figure 3 depicts the structure of our SPADE block.
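The modulation described above can be sketched with NumPy. The 1×1 convolutions that map the segmentation map to the scale γ and shift β are simplified here to per-class weight vectors, an assumption for illustration only:

```python
import numpy as np

def spade(h, segmap, gamma_w, beta_w, eps=1e-5):
    """Normalize h per channel, then modulate it spatially from the segmentation map.

    h:       (N, C, H, W) activations from the residual or transconv layers
    segmap:  (N, S, H, W) one-hot bi-segmentation (S = 2: foreground/background)
    gamma_w: (C, S) weights standing in for the learned 1x1 conv producing gamma
    beta_w:  (C, S) weights standing in for the learned 1x1 conv producing beta
    """
    mu = h.mean(axis=(0, 2, 3), keepdims=True)
    sigma = h.std(axis=(0, 2, 3), keepdims=True)
    h_norm = (h - mu) / (sigma + eps)
    # Spatially varying modulation maps derived from the segmentation.
    gamma = np.einsum('cs,nshw->nchw', gamma_w, segmap)
    beta = np.einsum('cs,nshw->nchw', beta_w, segmap)
    return gamma * h_norm + beta

# Toy demo: 2 channels, foreground in the left half, background in the right.
rng = np.random.default_rng(0)
h = rng.normal(size=(1, 2, 4, 4))
seg = np.zeros((1, 2, 4, 4))
seg[:, 0, :, :2] = 1.0
seg[:, 1, :, 2:] = 1.0
out = spade(h, seg, gamma_w=np.ones((2, 2)), beta_w=np.zeros((2, 2)))
```

With unit γ and zero β the block reduces to plain per-channel normalization; distinct weights per segmentation class yield different styles per region.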

Decoder Block
Finally, our decoder block restores the resolution of the input image using two transposed convolution layers, transconv3-128 and transconv3-64. The output of each transposed convolution layer is post-processed by a SPADE block.
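The restored resolution can likewise be verified with the transposed-convolution output-size formula; the strides and output padding below are assumptions chosen to mirror the encoder sketch (one stride-2 upsampling layer):

```python
def transconv2d_out(size, kernel=3, stride=2, pad=1, out_pad=1):
    """Spatial output size of a 2D transposed convolution along one axis."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad

# Hypothetical decoder schedule: one stride-2 layer doubles 128 -> 256,
# the other keeps the resolution (out_pad must be < stride, so 0 here).
s = transconv2d_out(128, kernel=3, stride=2, pad=1, out_pad=1)  # -> 256
s = transconv2d_out(s, kernel=3, stride=1, pad=1, out_pad=0)    # -> 256
```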

Discriminator
For our generative model's discriminator, we use the patchGAN architecture of the CycleGAN discriminator [6]. Our discriminator is composed of six convolutional layers. At the end of the discriminator, a 14 × 14 single-channel map is produced, where each element corresponds to the discrimination result for one patch. Figure 4 presents our discriminator's architecture.
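One layer schedule consistent with a six-layer patchGAN producing a 14 × 14 map is four stride-2 downsamplings followed by two stride-1 layers; the kernel sizes and strides below are assumptions (the paper does not list them):

```python
def conv2d_out(size, kernel, stride, pad):
    """Spatial output size of a 2D convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical six-layer schedule (kernel, stride, pad): 256 -> 128 -> 64
# -> 32 -> 16 -> 15 -> 14, ending in the 14 x 14 patch map.
layers = [(4, 2, 1)] * 4 + [(4, 1, 1)] * 2
s = 256
for k, st, p in layers:
    s = conv2d_out(s, k, st, p)
# s == 14
```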

Loss Function
The loss function in this work is presented in Equation (1): a weighted sum of the GAN loss term in Equation (2) and the cycle loss term in Equation (3). These equations were developed by Zhu et al. [6].
X and Y denote the input image domain and the stylized image domain, respectively. G is a generator that produces stylized images from input images, and F is a generator that does the reverse.
Additionally, we normalize the segmented regions according to the segmentation of the input image using Equation (4), which was developed in Park et al.'s work [7]. In Equation (4), n, c, y, and x denote the batch size, number of channels, height, and width, respectively. h is the activation value before normalization, and µc and σc denote the mean and standard deviation of the activation values at channel c. γ and β are the parameters learned at the normalization layer.
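Equations (1)–(4) do not survive in this text. Since they follow Zhu et al. [6] and Park et al. [7], their standard forms are reproduced below for reference; the paper's exact notation and weighting may differ:

```latex
\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y)
  + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda\,\mathcal{L}_{cyc}(G, F) \tag{1}

\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y}\!\left[\log D_Y(y)\right]
  + \mathbb{E}_{x}\!\left[\log\!\left(1 - D_Y(G(x))\right)\right] \tag{2}

\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x}\!\left[\lVert F(G(x)) - x \rVert_1\right]
  + \mathbb{E}_{y}\!\left[\lVert G(F(y)) - y \rVert_1\right] \tag{3}

\gamma_{c,y,x}(m)\,\frac{h_{n,c,y,x} - \mu_c}{\sigma_c} + \beta_{c,y,x}(m) \tag{4}
```

In Equation (4), m denotes the segmentation map from which the modulation parameters γ and β are computed.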

Generation of the Training Dataset
We train our model on artistic images synthesized using existing non-photorealistic rendering (NPR) studies. Kang et al.'s work [8] generates abstracted images by integrating color along the smooth flow embedded in an image; this scheme can generate abstracted images both with and without lines. Bousseau et al.'s work [2] is used to create the watercolor style by synthesizing watercolor textures using Perlin noise. Yang et al.'s work [3] uses a line integral convolution (LIC) scheme to mimic various pencil styles, including color pencil and monochrome pencil with various strokes; it produces four variants of pencil drawing styles. Among the various styles these three techniques can produce, we choose three: abstraction with lines, watercolor and color pencil drawing with thick strokes. We generate 1.5 K images for each style, so our dataset has 4.5 K images: 3.3 K are used for training, 0.6 K for validation and 0.6 K for testing. The images generated with these schemes are shown in Figure 5, and details of the dataset are given in Table 1. Figure 5. Training dataset produced using the existing techniques. Among the seven styles we can produce, we select three for our dual style synthesis: abstraction with lines, watercolor and color pencil drawing with thick strokes. The selected styles are marked with red rectangles.

Foreground/Background Segmentation
Many researchers have presented deep learning-based schemes that segment a scene into foreground and background [41][42][43]. In these schemes, objects in the foreground of a scene are separated from the background through a series of convolutional neural networks. Bouwmans et al. presented a brief survey of these schemes [44]. In this study, we employ BiSeNet [45], which effectively segments scenes into several regions and is particularly effective at segmenting regions of different depths. Since many landscape images are composed of closer regions and farther backgrounds, BiSeNet successfully segments the foreground and background of a scene.

Implementation
We implemented our framework on a PC with an Intel i7 CPU and 16 GBytes of main memory. An nVidia TitanX GPU with 6 GBytes of memory accelerates our implementation. We used PyTorch version 1.2.0 with the cu92 CUDA build as our software environment. It takes 18 h to train our model for generating artistic media effects. In every training process, we set the λ of the GAN loss to 10. We employed ADAM as our optimizer, with a learning rate of 0.0002, 200 epochs, and a batch size of 8. After training, our model takes about 1.5 s to produce a result image.
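The reported hyperparameters can be collected into a single configuration; the Adam betas are an assumption, since the paper does not state them (the values below are the common CycleGAN defaults):

```python
# Training configuration mirroring the reported settings.
config = {
    "optimizer": "adam",
    "betas": (0.5, 0.999),   # assumed, not stated in the paper
    "learning_rate": 2e-4,
    "epochs": 200,
    "batch_size": 8,
    "lambda_gan": 10,        # weight of the GAN loss term
}
```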

Results
We apply SDAGAN to twenty images and produce a series of result images, including three images with single artistic media effects and six with dual artistic media effects.

Synthesis of Single Artistic Media Effects
We generate images with single artistic styles by executing SDAGAN with the SPADE modules disabled. Among the seven styles in Figure 5 that we can produce using existing methods, we focus on three: abstraction with lines, watercolor and color pencil drawing with thick strokes. After training SDAGAN on the dataset in Table 1, our model can produce images whose styles are similar to those of the training samples (see Figure 6). Figure 6. Result images with a single artistic style, generated by SDAGAN trained on images produced by the existing methods. These images are generated with the SPADE modules disabled.

Synthesis of Dual Artistic Media Effects
The artistic media effects produced in this study have their own benefits and shortcomings. Abstraction with lines, for example, can express the details of an object in the foreground, but sacrifices smooth tonal change in the background. The watercolor effect may lose object detail, because it expresses a scene with smooth tonal values. The pencil drawing effect can express foreground objects with detail and tonal changes in the background, but reduces a scene's contrast.
The shortcomings of a single artistic media effect can therefore be addressed by mixing two artistic media effects. We present dual-artistic media effects synthesized from abstraction and watercolor effects in Figures 7-9. The original image of Figure 7 depicts mountains, an ocean and a blue sky. The land in the foreground appears dark, while the ocean and sky in the background appear bright. The abstraction effect depicts the details of the mountains, but loses the smooth tonal change of the blue sky and ocean. In contrast, the watercolor effect expresses the smooth tonal change in the background, but loses the details of the mountains. We therefore separate the foreground and background and apply different artistic media effects: the foreground is expressed with abstraction and the background with a watercolor effect. This dual-artistic media effect expresses detail in the foreground with abstraction with lines and smooth tonal change in the background with the watercolor effect. Figures 8 and 9 show similar results.
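The mask-based reference for such a dual effect amounts to alpha compositing the two single-style renderings with the foreground mask; a minimal NumPy sketch (the constant "renderings" are placeholders for actual stylized images):

```python
import numpy as np

def composite(fg_styled, bg_styled, mask):
    """Blend two stylized renderings using a foreground mask.

    fg_styled, bg_styled: (H, W, 3) images; mask: (H, W), 1 = foreground.
    """
    m = mask[..., None].astype(float)
    return m * fg_styled + (1.0 - m) * bg_styled

# Toy 2 x 2 example: constant stand-ins for abstraction / watercolor outputs.
fg = np.full((2, 2, 3), 0.2)
bg = np.full((2, 2, 3), 0.8)
mask = np.array([[1.0, 0.0], [1.0, 0.0]])
out = composite(fg, bg, mask)
```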
We present another dual-artistic media effect, synthesized from pencil drawing and watercolor effects, in Figures 10-13. This synthesis achieves effects similar to those of the abstraction-watercolor synthesis, as well as an additional one: as shown in Figures 7-9, the pencil drawing effects brighten the results compared to the original photographs. Figure 10 depicts the advantage of our approach. The pencil drawing effect applied to the original photograph boosts its brightness and restores the red color of the board player's clothes. The watercolor effect preserves the photograph's tone while depicting the mountain and sky with smooth tonal gradations. Combining these effects draws attention to the board player in the center of the image while preserving the overall tone. Figures 11-13 depict similar effects.

Comparison to Existing Studies
Champandard [9] proposed a 2-bit doodle-based approach that segments an image into four regions and applies different styles to the regions (see Figure 14a). Objects in the scene are rendered with appropriate styles because the styles are applied based on the segmented regions; however, they did not experiment with dual-artistic styles in their results. Huang and Belongie [10] proposed an adaptive instance normalization scheme to address existing style transfer issues such as slow optimization and texture smearing. They calculate the mean and variance of the style features and apply them to the content features. Using this strategy, they segment an image into two regions and apply different style textures to create a dual-styled result. However, this scheme does not consider the effects of artistic media in its styles (see Figure 14b). Castillo et al. [12] proposed an instance-aware style transfer scheme based on MRF transfer. They segment an input image and choose a region to which their style is applied (see Figure 14c). They do not consider applying different styles to different regions and combining them into a single final result. Li et al. [11] proposed a universal style transfer scheme that addresses the limitations of existing works, such as poor generalization to unseen styles or poor visual quality. They separate the stylization process into whitening and coloring transforms. They apply this strategy to a multi-segmented image and produce a multi-styled result (see Figure 14d). Figure 14. Similar results from existing works: (a) doodle-based style transfer [9], (b) style transfer on a bi-segmented image [10], (c) stylization on a segmented region [12], (d) universal style transfer on a multi-segmented image [11].

Evaluation
In many existing style transfer studies, applying different styles to different regions of a single image is hardly observed. Therefore, a direct comparison of our scheme with similar existing studies is not feasible. Instead, we compare our results with images stylized using a single medium. We present two evaluations: the first uses FID and the second is a user survey.

FID Evaluation
We evaluate the FID (Fréchet Inception Distance) for three stylized images: an image stylized with the foreground effect, an image stylized with the background effect, and our result, which combines the two. We compare the images in Figure 15 and report the FID values for the three stylized images in the left column of Table 2, where our result shows the smallest FID value in nine of ten cases. The ten images in Figure 15 are grouped into five subsets, each of which contains one input image.
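FID is the Fréchet distance between Gaussians fitted to Inception features of the two image sets. Restricted to diagonal covariances, a simplification made here purely for illustration (the full metric uses a matrix square root of the covariance product), it reduces to:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    mean_term = np.sum((mu1 - mu2) ** 2)
    # Diagonal case of Tr(S1 + S2 - 2 (S1 S2)^{1/2}).
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)
```

Identical feature statistics yield a distance of zero; smaller values indicate that the stylized set sits closer to the reference set in feature space.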

User Survey
For the user survey, we built two questions to evaluate our results. We prepared 10 sets, each consisting of an input image and three stylized images: one stylized with the foreground effect, one stylized with the background effect, and our result. The 10 sets are grouped into five subsets, each of which shares an input image; each input image is stylized into two different result images. We aim to estimate the effects of stylization on the identical input image. We pose two questions: one on the degree of stylization and the other on the preservation of content.
• q1. Evaluate the stylization of each stylized image. Mark the degree of stylization on a five-point scale: 1 for least stylized, 2 for less stylized, 3 for medium, 4 for more stylized, 5 for maximally stylized.
• q2. Evaluate how much information from the input image is preserved in each stylized image. The information includes details of the objects in the scene as well as tone and color. Mark the degree of preservation on a five-point scale: 1 for least preserved, 2 for less preserved, 3 for medium, 4 for somewhat preserved, 5 for maximally preserved.
We recruited twenty participants: nine female and eleven male, twelve in their twenties and eight in their thirties. The results of the user survey are summarized in the right column of Table 2. The values in Table 2 are averaged over the twenty participants.

Analysis
We perform two analyses on the FID values and user survey scores: a t-test and an effect size estimation.

t-Test
We perform a t-test on two pairs of stylized images: (i) foreground versus ours and (ii) background versus ours. For the FID values, the p values are 0.022 and 0.012, respectively, so the differences are significant at p < 0.05. For the user survey scores, the p values are 3.7 × 10−5 and 9.7 × 10−5 for stylization, and 0.00033 and 7.7 × 10−8 for preservation of content, so these differences are significant at p < 0.01. These values are presented in the upper row of Table 3.
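The paper does not state whether a paired or independent test was used; the sketch below shows the independent (Welch) form of the t statistic, which makes no equal-variance assumption:

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return float((a.mean() - b.mean()) / np.sqrt(se2))
```

The p value then follows from the t distribution with the Welch-Satterthwaite degrees of freedom (e.g. via scipy.stats).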

Effect Size
We estimate the effect size for two pairs of stylized images: (i) foreground versus ours and (ii) background versus ours. For the FID values, Cohen's d is 1.64 and 1.76, respectively, which denotes a large effect size between the two groups. For the user survey scores, Cohen's d is 3.43 and 3.61 for stylization, and 3.14 and 5.48 for preservation of content. From these values, we conclude that the effect sizes between the two groups are large. Figure 15. The images used in the FID estimation and user study. Images in green boxes are rendered in the pencil style, images in blue boxes in abstraction, and images in red boxes in watercolor.
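Cohen's d as reported here is the mean difference divided by the pooled standard deviation; a minimal sketch:

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two groups."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)
```

By the usual convention, |d| ≥ 0.8 is considered a large effect, so the reported values (1.64 and above) all fall in the large range.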

Discussion
Existing stylization studies have concentrated on applying styles while preserving content, or on synthesizing various styles captured from various samples. Our technique differs from this existing research in that it segments an image into foreground and background and applies different artistic styles to the segmented regions. This approach, which has not been tried in existing studies, can express images more dramatically by applying different artistic styles to them, and can be extended to a method that applies different styles to the various regions segmented from an image.

Conclusions and Future Work
This study proposes a framework for applying two different artistic effects to a photograph. Most deep learning-based models require a paired dataset to generate distinct artistic effects. To satisfy this requirement, we applied existing artistic media simulation techniques that produce color pencil, watercolor, and abstraction effects to build a dataset composed of photographs and their artistic images. Trained on this dataset, our model successfully produces visually pleasing artistic media effects. Furthermore, in order to apply two different artistic effects, an input photograph is segmented into foreground and background using a semantic segmentation model. By applying a series of SPADE blocks that incorporate the region information into style transfer, our model successfully produces different artistic media effects in the foreground and background of a photograph.
In future work, we plan to improve the generative model to express more diverse artistic media effects. Furthermore, we will enrich our styles by segmenting an image into more detailed regions and applying proper styles to the regions. Finally, we will extend our dataset of artistic styles by applying more existing artistic style synthesis schemes.