Coarse-to-Fine Structure-Aware Artistic Style Transfer

Abstract: Artistic style transfer aims to use a style image and a content image to synthesize a target image that retains the artistic expression of the style image while preserving the basic content of the content image. Many recently proposed style transfer methods share a common problem: they simply transfer the texture and color of the style image onto the global structure of the content image. As a result, the stylized image has local structures that are not similar to the local structures of the style image. In this paper, we present an effective method for transferring style patterns while fusing the local style structure into the local content structure. In our method, different levels of coarse stylized features are first reconstructed at low resolution using a coarse network, in which the style color distribution is roughly transferred and the content structure is combined with the style structure. Then, the reconstructed features and the content features are used to synthesize high-quality, high-resolution structure-aware stylized images using a fine network with three structural selective fusion (SSF) modules. The effectiveness of our method is demonstrated through the generation of appealing high-quality stylization results and a comparison with some state-of-the-art style transfer methods.


Introduction
Artistic style transfer is an attractive image-processing technique that is used to generate a new image that preserves the structure of a content image but carries the pattern of a style image. Recently, the seminal image-optimization method proposed by Gatys et al. [1] achieved style transfer by adopting the correlation of features extracted from a pretrained deep neural network and an iterative optimization process. Like the method presented by Gatys et al. [1], style transfer by relaxed optimal transport and self-similarity (STROTSS) [2] is also an image-optimization style transfer method; it has achieved superior stylization results by adopting the relaxed earth mover's distance (rEMD) loss in a multiscale optimization process. However, the expensive computational cost of these image-optimization methods restricts their practical application in industry. To speed up the optimization procedure, Johnson et al. [3] and Ulyanov et al. [4] proposed model-optimization style transfer methods. They train a feed-forward neural network that can synthesize images in real time for a single given style image. Both adaptive instance normalization (AdaIN) [5] and whitening and coloring transforms (WCTs) [6] are model-optimization methods and are also arbitrary style transfer methods, in which the style patterns of arbitrary style images are transferred by adopting certain feature transforms. After reviewing these methods, we have found that although local style textures and content structures can generally be combined, some key structures of the style image are not accurately learned. For example, the color blocks and brushstrokes that constitute the main objects in style images are not transferred very well. Meanwhile, in some cases, these methods produce distorted objects and incongruous artistic effects in stylized images. Therefore, our main task is to transfer the local structure of the style image to the content image and adopt a coarse-to-fine strategy to enhance the artistic details of the stylization results.
We propose a novel artistic style transfer network for fusing an essential style structure into a content structure and synthesizing a structure-aware stylized image. In our model, a coarse network is designed to obtain reconstructed coarse stylized features in the first stage. Because the coarse network works only at a low resolution, the coarse stylized features can discard the trivial structural details of the content image and combine the global content structure with the style patterns. Then, in the second stage, the task of a fine network is to take these reconstructed coarse stylized features obtained at low resolution, together with the original high-resolution content image, and synthesize the final high-resolution stylized image. By adopting SSF modules to fuse the coarse stylized features into the fine network, the final high-resolution stylized images can selectively integrate structural information at different scales. Our main contributions are as follows:

1. We introduce a novel style transfer model that can be used to synthesize appealing structure-aware stylization results. This model consists of a coarse network and a fine network. The former roughly transfers style patterns that include holistic structural information and color distribution information, and the latter then enhances the details of the style patterns by fusing multiscale features.

2. We propose an SSF module for fusing the reconstructed features into the content features in the fine network. This module helps the fine network select essential structural information for feature fusion on the basis of the channel attention mechanism. As a result, the color distribution of the style images can be accurately transferred.

3. It is demonstrated through experiments that our method can synthesize high-quality stylizations in which the main structures of the content image are preserved and the local structures of the style image are transferred. These stylization results maintain the same artistic expression as the style images by discarding trivial content details and injecting key local style structures.
The rest of the paper is organized as follows. In Section 2, works related to different style transfer methods are reviewed. In Section 3, the pipeline of our framework and the details of our two networks are described, and the different loss functions are introduced. Different experimental results are shown and discussed in Section 4. The conclusion is summarized in Section 5.

Related Work

Style Transfer
The goal of style transfer is to combine the texture of a style image with the structure of a content image. Gatys et al. [1] proposed a seminal iterative method based on a pretrained visual geometry group (VGG) network [7]. In this method, the content structure and the style texture can be used to synthesize a new image, but the method is computationally expensive, and a stylized image is generated only after the optimization process has been completed. Inspired by Gatys et al. [1], Johnson et al. [3] proposed a feed-forward method, which can synthesize arbitrary images with a fixed style using an encoder-decoder architecture; the time and computation costs are reduced when using this method. Numerous methods have been developed to speed up the style transfer process [4,8] and improve the visual quality [9][10][11]. Sanakoyeu et al. [12] also improved the stylization quality by proposing a style-aware loss, but they trained a network with a set of style images instead of a single style image. This approach aimed to combine many style images created by one artist to synthesize a stylized image with the overall style of that artist. The dual style generative adversarial network (DualStyleGAN) [13] was proposed to characterize the content and style of a portrait by retaining an intrinsic style path to control the style of the original domain and an extrinsic path to model the style of the target extended domain. Peking Opera face makeup (POFMakeup) [14] is also a portrait style transfer method, which can transfer the style of a portrait with a Peking Opera face to a target portrait. Lin et al. [15] combined a universal style transfer method with image fusion and color enhancement methods to address the problems of the color scheme, the strength of style strokes, and the adjustment of image contrast.
To simultaneously handle multiple styles, [16] proposed a flexible conditional instance normalization approach embedded in style transfer networks to learn multiple styles, and [17] achieved multistyle generation in a generative network architecture with a learnable inspiration layer. Ye et al. [18] adopted a mechanism and instance segmentation to achieve a regional multistyle transfer model, which can solve the problem of unnatural connections between regions. Alexandru et al. [19] combined various existing style transfer frameworks to propose a novel framework that can generate intriguing artistic stylization results by performing geometric deformation and using different styles from multiple artists.
In AdaIN [5], adaptive instance normalization is implemented to train a network with various styles, providing the ability to transfer arbitrary styles after the training process. In WCT [6], whitening and coloring transforms are adopted to synthesize arbitrary styles with a pretrained VGG network and a series of pretrained image-reconstruction decoders. Based on WCT, Wang et al. [20] achieved diversity in style transfer by adopting a deep feature perturbation (DFP) operation while preserving the quality of the stylization results, and Wang et al. [21] synthesized ultraresolution stylized images and reduced the number of convolutional filters by using a knowledge-distillation method. A style-attentional network (SANet) [22] is also an arbitrary style transfer method that can efficiently generate stylized images by injecting local style patterns into content features on the basis of a style attention mechanism.

Style Transfer Based on Multiscale Learning
Recently, some style transfer methods have transferred style patterns on the basis of multiscale learning. Multiscale holistic style transfer is achieved in Avatar-Net [23] with an hourglass network with multiple skip connections and a style decorator. STROTSS [2] is an image-optimization method that adopts multiscale learning to update the content image and generate high-quality stylized images. Yang et al. [24] proposed a novel video style transfer framework that can render high-quality artistic portraits on the basis of multiscale content features and preserve the frame details. A Laplacian pyramid style network (LapStyle) [25] also exhibits high visual quality; it is based on a drafting network and a revision network, where the former first transfers the global style patterns and the latter then enhances the local style details. However, too many content structural details are preserved in these methods, and key local style structures are not fused into the stylized images. In contrast, our method transfers global style patterns at low resolution using a coarse network, which needs to be trained only once to reconstruct the coarse stylized features. Our fine network enhances local style details with multiscale features from the coarse network and the high-resolution content image. As a result, our method can discard trivial local content structures and synthesize high-quality structure-aware stylized images through a coarse-to-fine process. The differences between our method and the methods in previous studies are shown in Table 1.

Framework Overview

Inspired by the painting process of artists, in which the coarse structure and color distribution are first constructed and then fine details are added, our framework employs a coarse network and a fine network to simulate the coarse-to-fine process. As shown in Figure 1, given a content image x_c ∈ R^(3×h_c×w_c) and a style image x_s ∈ R^(3×h_s×w_s), our model eventually generates a stylized image x_cs ∈ R^(3×h_cs×w_cs). In the first stage, the coarse network takes x̄_c and x̄_s as inputs, where x̄_c and x̄_s are the results of downsampling x_c and x_s by 2, respectively. The coarse network then produces three restructured coarse stylized features f_r^(i) (i = 1, 2, 3). In the second stage, the fine network encodes the high-resolution content image to obtain the content features, and these content features and the three coarse reconstructed features from the coarse network are decoded to generate the high-quality structure-aware stylized image.

As shown in Figure 2, different stylized images are generated by our method. In Figure 2b, we adopt only the last restructured coarse stylized feature to directly restructure the coarse stylized image with the coarse network in the first stage. The coarse stylized image discards the unnecessary local structures of the content image and transfers the global color distribution of the style image. As shown in Figure 2c, the final appealing stylized image is synthesized by adopting our full model with a coarse network and a fine network. Moreover, to more clearly show the local style structure of the final stylized image, we use the color control method [26] to keep the color of the final stylized image consistent with the color of the original content image. As illustrated in Figure 2d, although the color distribution of the stylized image remains the same as that of the content image, the local structure of the stylized image is similar to that of the style image.
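To make the two-stage data flow concrete, the following NumPy sketch traces the pipeline under stated assumptions: `coarse_network` and `fine_network` are hypothetical placeholders for the trained networks described above, and downsampling by 2 is implemented as 2×2 average pooling (the paper does not specify the resampling filter).

```python
import numpy as np

def downsample_by_2(img: np.ndarray) -> np.ndarray:
    """Halve spatial resolution via 2x2 average pooling (img: C x H x W)."""
    c, h, w = img.shape
    cropped = img[:, :h - h % 2, :w - w % 2]
    return cropped.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def stylize(x_c, x_s, coarse_network, fine_network):
    """Two-stage coarse-to-fine pipeline (networks are placeholders)."""
    # Stage 1: the coarse network runs at half resolution and returns
    # the three reconstructed coarse stylized features f_r^(1..3).
    f_r = coarse_network(downsample_by_2(x_c), downsample_by_2(x_s))
    # Stage 2: the fine network fuses f_r with the full-resolution content.
    return fine_network(x_c, f_r)
```

Only the resolutions and the two-stage ordering follow the text; the network internals are filled in by the sections below.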

Coarse Network
One problem with recent style transfer methods is that too many structural details of the content image are retained during the transfer of style patterns. In the stylized image, some small structures from the content image remain unchanged; they simply take on the color and texture of the style image. These local structures, which do not exist in the style image, appear in the stylized image, so the stylized image fails to convey the spirit of the artistic expression of the style image. The reason is that these methods directly extract features from high-resolution images and cannot decide which details to discard from the content image. Contrary to previous work, our coarse network transfers rough style patterns at low resolution. As a result, there is a larger receptive field for learning the low-frequency information that determines the overall structure of the image, and some unnecessary high-frequency information is ignored during training. As shown in Figure 3, at high resolution, the coarse network transfers many details that are unnecessary in the coarse stylized image. At low resolution, the coarse network can discard some trivial structural details and keep the objects smooth in the stylized image.



WCT Module
Inspired by WCT [6], our coarse network adopts whitening and coloring transforms to transfer coarse style patterns at low resolution. The whitening transform removes inessential style-related information while preserving the global structure of the content. The coloring transform then captures the salient visual style and fuses some style structures into the content structures. WCT is a multilevel stylization process that uses different rectified linear unit (ReLU) layers of VGG features ReLU_X_1 (X = 1, 2, . . ., 5) and transfers style patterns in a coarse-to-fine pipeline. The higher-layer features are adopted to capture complex local structures, while the lower-layer features carry low-level color and texture information. The difference between our coarse network and WCT is that we use only a single-level whitening and coloring transform for stylization. Moreover, we do not directly reconstruct the stylized features to generate an image; instead, we utilize the reconstructed features at different layers during reconstruction. As a result, our coarse network, which can capture multilevel information by reconstructing the coarse stylized features at different levels, saves computing resources.
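As a rough illustration of the single-level transform borrowed from WCT, the following NumPy sketch applies the standard whitening and coloring equations to feature maps flattened to (channels, positions). The `eps` regularizer and the eigendecomposition route are our assumptions, not the paper's exact implementation.

```python
import numpy as np

def wct(f_c: np.ndarray, f_s: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Single-level whitening and coloring transform.

    f_c, f_s: VGG features flattened to (channels, positions).
    Returns the colored feature with the shape of f_c.
    """
    # Whitening: decorrelate the centered content feature.
    mu_c = f_c.mean(axis=1, keepdims=True)
    fc = f_c - mu_c
    cov_c = fc @ fc.T / (fc.shape[1] - 1) + eps * np.eye(fc.shape[0])
    d_c, e_c = np.linalg.eigh(cov_c)
    f_white = e_c @ np.diag(d_c ** -0.5) @ e_c.T @ fc

    # Coloring: re-correlate with the style covariance, add the style mean.
    mu_s = f_s.mean(axis=1, keepdims=True)
    fs = f_s - mu_s
    cov_s = fs @ fs.T / (fs.shape[1] - 1) + eps * np.eye(fs.shape[0])
    d_s, e_s = np.linalg.eigh(cov_s)
    return e_s @ np.diag(d_s ** 0.5) @ e_s.T @ f_white + mu_s
```

After the transform, the output feature carries (approximately) the mean and covariance of the style feature while retaining the spatial arrangement of the content feature, which is exactly what the coloring step is meant to achieve.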

Architecture of Coarse Network
The architecture of the coarse network, which is shown in Figure 1, includes an encoder, a WCT module, and a decoder. (1) The encoder is a pretrained VGG-19 network, which is fixed during training. Given the downsampled inputs x̄_c and x̄_s, the VGG encoder extracts the content feature f_c and the style feature f_s at ReLU_4_1. (2) Then, we apply a WCT module for the whitening and coloring transformation. As shown in Figure 4a, the whitening transform linearly transforms f_c to obtain the whitened feature f̂_c. Next, the coloring transform is carried out to obtain f̂_cs by using f̂_c and f_s. (3) Finally, we adopt a reconstruction decoder to reconstruct the coarse stylized feature f̂_cs. The decoder is designed to be symmetrical to the VGG-19 network, where nearest-neighbor upsampling layers are used to enlarge the feature maps. We take f̂_cs as input for reconstruction and then generate the restructured stylized features f_r^(i) as outputs. In this reconstruction decoder, these outputs are taken before the second upsampling layer, before the third upsampling layer, and after the last convolution layer. These f_r^(i) become a part of the input of the fine network.


Fine Network
The fine network aims to synthesize high-resolution stylized images by fusing the reconstructed coarse stylized features into the reconstructed content features. The reconstructed content features come from the high-resolution content image and contain both global semantic information and local detail information. In contrast to the reconstructed content features at high resolution, the reconstructed stylized features generated by the coarse network preserve only the main content structure while blending in some local structural style information. By fusing multiscale information through our SSF modules, the fine network can pay more attention to the holistic structure of the content and ignore some trivial details. Then, significant details can be added to the structure, and the appealing artistic effects in the stylized image can be enhanced. In addition, fusing the reconstructed coarse stylized information greatly reduces the time cost of training the fine network, so the desired stylization results can be achieved earlier.

SSF Module
The structural selective fusion (SSF) module is designed to fuse the reconstructed coarse stylized features from the coarse network into the reconstructed content features in the decoder of the fine network. Inspired by the attention mechanism [27,28], we employ a weight matrix, learned from the merged features, to select the key structural information of the reconstructed content features. The merged features are obtained by concatenating the reconstructed coarse stylized features and the reconstructed content features. The weight matrix helps the SSF module obtain a selective feature that focuses on meaningful structural information; this selective feature is one part of the output of the SSF module. The other part of the output is the refined merged features, which include information at different scales, such as crucial local textures and global structures.
The architecture of the SSF module is shown in Figure 4b. First, we concatenate the reconstructed coarse stylized features f_r and the reconstructed content features f_cs as the input f_csr ∈ R^((c_cs+c_r)×w_r×h_r). The reconstructed content features f_cs are the output of the convolution layer in the decoder of the fine network (except that the first SSF module uses the content features f_c from the encoder of the fine network as f_cs). We adopt an average-pooling operation to aggregate the spatial information of f_csr and generate the input of a multilayer perceptron (MLP), which produces an attention map M_cs ∈ R^(c_cs×1×1) as the weight matrix. In summary, the attention map is calculated as follows:

M_cs = σ(MLP(AvgPool(f_csr))),

where σ denotes the sigmoid function. Then the selective feature f̃_cs is calculated as follows:

f̃_cs = M_cs ⊗ f_cs,

where ⊗ denotes element-wise multiplication. Meanwhile, f_csr is fed into a convolutional layer to produce a refined merged feature f̃_csr ∈ R^(c_r×w_r×h_r). Eventually, the SSF module generates the final output f_ssf ∈ R^((c_cs+c_r)×w_r×h_r) as the fused feature by directly concatenating f̃_cs and f̃_csr.
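The computation above can be sketched as follows. This is a NumPy toy version under stated assumptions: the MLP is a two-layer ReLU network, the convolutional layer is reduced to a 1×1 convolution, and all weights are random stand-ins for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SSFModule:
    """Structural selective fusion (toy sketch with random weights)."""

    def __init__(self, c_cs: int, c_r: int, hidden: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        c_in = c_cs + c_r
        self.w1 = rng.standard_normal((hidden, c_in)) * 0.1   # MLP layer 1
        self.w2 = rng.standard_normal((c_cs, hidden)) * 0.1   # MLP layer 2
        self.conv = rng.standard_normal((c_r, c_in)) * 0.1    # 1x1 conv weights

    def __call__(self, f_cs: np.ndarray, f_r: np.ndarray) -> np.ndarray:
        f_csr = np.concatenate([f_cs, f_r], axis=0)           # merged features
        pooled = f_csr.mean(axis=(1, 2))                      # spatial avg-pool
        m_cs = sigmoid(self.w2 @ np.maximum(self.w1 @ pooled, 0.0))  # M_cs
        selective = m_cs[:, None, None] * f_cs                # M_cs (x) f_cs
        refined = np.einsum('oc,chw->ohw', self.conv, f_csr)  # 1x1 conv
        return np.concatenate([selective, refined], axis=0)   # f_ssf
```

The channel-wise attention map rescales each channel of f_cs by a value in (0, 1), which is the "selection" behavior the module relies on.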

Architecture of Fine Network
As shown in Figure 1, the fine network is designed as a flexible encoder-decoder architecture with an encoder, a series of residual blocks, and a decoder. The encoder contains a convolutional layer with a stride of 1 and three convolutional layers with strides of 2, followed by several residual blocks. The decoder contains three upsampling layers, three convolutional layers with strides of 1, and three SSF modules; an SSF module is used before each upsampling layer. Given the content image x_c as the input of the fine network, the encoder and the residual blocks generate the content feature f_c. Then, the SSF modules generate the fused features f_ssf by taking f_r^(i) and f_cs as inputs, where f_cs is the output of the convolution layers in the decoder (except for the first SSF module, which takes f_c as f_cs). These fused features f_ssf are fed into an upsampling layer and a convolution layer. Finally, the decoder generates the final stylized image x_cs after the last convolution layer.

Loss Function
Our coarse network needs to be trained only once, and it is fixed during the training of the fine network. Compared with WCT [6], we train only one reconstruction decoder network to reconstruct the coarse stylized feature. Our coarse network can reconstruct the stylized features at three levels or directly generate a coarse stylized image by taking advantage of the reconstruction decoder. Following WCT, we adopt pixel reconstruction and perceptual loss [3] to train our decoder for image reconstruction:

l_rec = ||I_o − I_i||_2^2 + λ ||Φ(I_o) − Φ(I_i)||_2^2,   (1)

where I_i and I_o are the input image and output image, respectively, and Φ is the VGG encoder that extracts features at ReLU_X_1 (X = 1, 2, 3, 4). In addition, λ is the weight that balances the two losses.
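A minimal sketch of this reconstruction objective, assuming mean-squared error for both terms and a `phi` placeholder standing in for the fixed VGG encoder (here modeled as any callable returning a list of feature maps):

```python
import numpy as np

def reconstruction_loss(i_i, i_o, phi, lam: float = 1.0) -> float:
    """Pixel + perceptual reconstruction loss for training the decoder.

    i_i, i_o: input and reconstructed images as arrays of the same shape.
    phi: stand-in for the fixed VGG encoder; returns a list of feature maps.
    lam: weight balancing the pixel and perceptual terms.
    """
    pixel = np.mean((i_o - i_i) ** 2)              # pixel reconstruction term
    perceptual = sum(np.mean((fo - fi) ** 2)       # feature-space term
                     for fo, fi in zip(phi(i_o), phi(i_i)))
    return float(pixel + lam * perceptual)
```

With a perfect reconstruction the loss is zero, and λ trades off pixel fidelity against perceptual fidelity, mirroring Equation (1).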
The fine network is optimized with content and style losses during training. As shown in Figure 5, we keep a single x_s and a set of x_c from a content dataset; x_cs is then the stylized image generated by the fine network. For x_s, x_c, and x_cs, we use a pretrained VGG-19 encoder to extract their features F_s^(t), F_c^(t), F_cs^(t) ∈ R^(c_t×h_t×w_t), where t denotes the features extracted at ReLU_t (t = 1_1, 1_2, 2_1, 2_2, 3_1, 3_3, 4_1).

For the content loss, we adopt the commonly used perceptual loss between F_c^(t) and F_cs^(t) proposed in [3]. The perceptual loss measures high-level perceptual and semantic differences between images and is defined as follows:

l_p = Σ_t (1 / (c_t h_t w_t)) ||F_c^(t) − F_cs^(t)||_2^2.

For the style loss, we adopt three style losses. The first and most significant is the relaxed earth mover's distance (rEMD) loss [2], which helps the fine network generate visual effects with minimum distortion to the layout of the content image. This loss plays a key role in migrating the structural forms of the style image to the target image. The rEMD loss between F_s^(t) and F_cs^(t) can be calculated as follows:

l_r = max( (1/n) Σ_i min_j C_ij , (1/m) Σ_j min_i C_ij ),

where C is the cost matrix, which can be calculated as the cosine distance between the feature vectors of F_s^(t) and F_cs^(t):

C_ij = 1 − (F_s,i^(t) · F_cs,j^(t)) / (||F_s,i^(t)|| ||F_cs,j^(t)||).

The second style loss is the commonly used style reconstruction loss proposed by Gatys et al. [1], which is the difference between the Gram matrices of F_s^(t) and F_cs^(t):

l_g = ||G(F_s^(t)) − G(F_cs^(t))||_2^2,

where G denotes the calculation of the Gram matrix of the feature vectors. Finally, we use the mean-variance loss as the third style loss, which is similar to the style reconstruction loss. This loss reduces unnecessary visual effects in the stylized image and keeps the magnitude of the stylized feature the same as that of the style feature:

l_m = ||µ(F_s^(t)) − µ(F_cs^(t))||_2 + ||σ(F_s^(t)) − σ(F_cs^(t))||_2,

where µ and σ denote the mean and standard deviation of the feature vectors, respectively. The overall optimization objective is defined as follows:

l_total = α l_p + λ_1 l_r + λ_2 l_g + λ_3 l_m,

where α, λ_1, λ_2, and λ_3 are weight terms. By adjusting α, we can control the degree of stylization. Specifically, l_p and l_m both work on ReLU_1_1, ReLU_2_1, ReLU_3_1, and ReLU_4_1; l_r works on ReLU_2_1, ReLU_3_1, and ReLU_4_1. Following Johnson et al. [3], l_g works on ReLU_1_2, ReLU_2_2, and ReLU_3_3.
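The three style losses and the cosine cost matrix can be sketched in NumPy as follows. Feature maps are assumed to be flattened to (channels, positions), and the normalization constants are our choices rather than the paper's exact ones.

```python
import numpy as np

def cosine_cost(f_s: np.ndarray, f_cs: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between feature vectors (columns of c x n maps)."""
    a = f_s / (np.linalg.norm(f_s, axis=0, keepdims=True) + 1e-8)
    b = f_cs / (np.linalg.norm(f_cs, axis=0, keepdims=True) + 1e-8)
    return 1.0 - a.T @ b                       # shape (n_s, n_cs)

def remd_loss(f_s: np.ndarray, f_cs: np.ndarray) -> float:
    """Relaxed earth mover's distance between two feature sets."""
    c = cosine_cost(f_s, f_cs)
    return float(max(c.min(axis=1).mean(), c.min(axis=0).mean()))

def gram_loss(f_s: np.ndarray, f_cs: np.ndarray) -> float:
    """Style reconstruction loss: difference between Gram matrices."""
    gram = lambda f: f @ f.T / f.size
    return float(np.mean((gram(f_s) - gram(f_cs)) ** 2))

def mean_variance_loss(f_s: np.ndarray, f_cs: np.ndarray) -> float:
    """Match channel-wise mean and standard deviation of the features."""
    return float(np.mean((f_s.mean(axis=1) - f_cs.mean(axis=1)) ** 2)
                 + np.mean((f_s.std(axis=1) - f_cs.std(axis=1)) ** 2))
```

Summed over the listed VGG layers, these terms would then be combined as alpha * l_p + lambda1 * l_r + lambda2 * l_g + lambda3 * l_m to form the overall objective.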

Experimental Dataset and Implementation Details
During training, we use the MS-COCO [29] dataset as the set of content images and select some famous art paintings as style images. To show the experimental results of our method, we also select some copyright-free images from Pexels.com as content images.
In our experiments, the coarse network is trained on the MS-COCO dataset only once for image reconstruction, and the weight λ in Equation (1) is set to 1. We use content images and style images with a resolution of 512 × 512; these images are downsampled by 2, so each image input into the coarse network has a resolution of 256 × 256. During the training of the fine network, we use the Adam [30] optimizer with a learning rate of 1 × 10^−4, and the batch size is set to 1 because of the limitation of graphics processing unit (GPU) memory. For each style, a training process consists of 15,000 iterations. The loss weight terms α, λ_1, λ_2, and λ_3 are set to 1, 20, 1000, and 5, respectively. The experimental environment configuration is shown in Table 2.
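The reported hyperparameters can be collected into a single configuration sketch; the key names are our own for illustration, not taken from a released codebase.

```python
# Hypothetical training configuration mirroring the reported settings.
config = {
    "resolution": (512, 512),          # content/style input resolution
    "coarse_resolution": (256, 256),   # inputs after downsampling by 2
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "batch_size": 1,                   # limited by GPU memory
    "iterations_per_style": 15000,
    "loss_weights": {"alpha": 1, "lambda1": 20,
                     "lambda2": 1000, "lambda3": 5},
    "reconstruction_lambda": 1,        # weight lambda in Equation (1)
}
```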

Qualitative Comparisons with Methods in Prior Works
Inspired by the recent WCT [6] and STROTSS [2] methods, our method adopts the whiting and coloring transformation proposed in WCT and the rEMD loss proposed in STROTSS.In Figure 6, we compare our method with WCT and STROTSS.WCT can transfer the color distribution and simple texture of arbitrary style images; however, some context local structure is discarded, resulting in messy and disordered stylized images (e.g., rows 1, 2, and 3).STROTSS is an image-optimization style transfer method that transfers the visual attributes from the style image to the content image with minimum semantic distortion.Nevertheless, too many structural details are preserved, and the overall palette of the style image is not accurately transferred (e.g., rows 2 and 3).In contrast to these two methods, our method can transfer the main structure and discard some trivial details of the content image.Moreover, some notable local structures of the style image, such as brushstrokes, can be fused into the global structure of the content image, and the overall palette of the stylized image remains the same as that of the style image.For example, in the second and fourth rows, the color blocks of mountains and the brushstrokes of vegetation in our stylized images are explicitly similar to those in the style images.Our model can learn some key style structures while ignoring some unimportant content details. 
As shown in Figure 7, we compare our method with other state-of-the-art style transfer methods. Gatys et al. [1] proposed the original optimization-based style transfer algorithm, which can transfer the overall style texture and the color distribution. However, some incongruous textures appear in the stylized images, making the stylizations look unnatural (e.g., rows 4, 5, and 6). Similar to our method, the method proposed by Johnson et al. [3] is also a feed-forward method. It can combine the local color and texture of style images with the structure of the content but often maintains too many content structures and, in some cases, merely shifts the color histogram (e.g., rows 1, 2, and 3). AdaIN [5] and SANet [22] are both arbitrary style transfer models, which mainly transfer simple style patterns. AdaIN often fails to transfer the color distribution of style images, and SANet has a severe problem with messy texture and disordered structure (e.g., rows 4, 5, and 6). All of the methods mentioned above maintain some unnecessary small local structures of the content images, and the essential local structures of style images are not integrated into the target image. In contrast to these methods, our model can simultaneously transfer the style color distribution accurately and combine the local style structure with the global content structure. For example, in the fourth row, the image of the rabbits generated by our method looks more harmonious and natural in the stylized image. The style image seems to consist of ink dots, and the same artistic expression is exhibited by our method.

Quantitative Comparisons with Methods in Prior Works
In the quantitative comparison experiments, we use the learned perceptual image patch similarity (LPIPS) proposed in [31] and the structural similarity index measure (SSIM) proposed in [32] to compute the difference in style structure between the stylized image and the style image. For each method, 1500 pairs of stylized and style images covering 10 styles are used to compute the average distance. As shown in Table 3, lower LPIPS values indicate higher similarity under human perceptual judgment, and higher SSIM values indicate higher structural similarity. Under both evaluation metrics, our proposed method achieves the highest similarity in style structure. The experimental results show that our method can synthesize structure-aware stylized images that have a higher structural similarity to the style images.
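As a concrete illustration of the SSIM side of this comparison, the single-window form of the index can be sketched in a few lines. This is a simplification: the metric in [32] averages the same formula over local sliding windows, and the pixel values below are hypothetical.

```python
# Simplified global SSIM between two grayscale images given as flat
# lists of pixel values in [0, 255]. The published metric [32] applies
# this formula over local sliding windows and averages the results;
# this single-window version is a sketch for illustration only.

def ssim_global(x, y, L=255.0):
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * L) ** 2  # stabilizing constants from the SSIM paper
    c2 = (0.03 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

img = [10.0, 80.0, 200.0, 160.0, 90.0, 30.0]
print(ssim_global(img, img))  # identical images -> 1.0
print(ssim_global(img, [255 - v for v in img]) < 1.0)
```

Higher values indicate higher structural similarity, which is why Table 3 reports SSIM with "higher is better".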

Comparisons of Time Efficiency with Methods in Prior Works
We further compare the time efficiency of our proposed method with that of other state-of-the-art methods. With each method, we synthesize 100 stylized images at a resolution of 512 × 512, and all experiments are conducted under the same environment configuration. As shown in Table 4, Johnson et al. [3] achieve the highest time efficiency because they use only a simple encoder-decoder architecture to generate stylized images. Like [3], AdaIN [5] and SANet [22] also use a simple encoder-decoder network but apply feature transform modules to integrate content and style features; as a result, their time efficiency is lower than that of [3] yet still satisfactory. Different from these three methods, which work at a single image scale, our model includes two networks and works in two stages. Although our model can capture richer multiscale information and synthesize higher-quality stylized images, its time efficiency is only slightly lower than that of AdaIN and SANet: we trade a small increase in time cost for a promising improvement in the quality of stylized images. WCT [6] has low time efficiency because it uses five encoders and decoders to generate a stylized image. The time efficiency of STROTSS [2] and Gatys et al. [1] is far lower than that of the other methods because they are image-optimization methods, in which each stylized image requires a separate optimization process.
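The timing protocol above can be sketched as a small harness; `synthesize` is a placeholder for any of the compared models, not an API from this paper.

```python
# Generic timing harness in the spirit of the comparison above: run a
# stylization function n_runs times and report the mean wall-clock
# time. `synthesize` is a hypothetical stand-in; in the real
# experiment it would be one of the compared models producing a
# 512 x 512 stylized image.
import time

def average_runtime(synthesize, n_runs=100):
    start = time.perf_counter()
    for _ in range(n_runs):
        synthesize()
    return (time.perf_counter() - start) / n_runs

# Dummy workload standing in for a forward pass.
mean_s = average_runtime(lambda: sum(i * i for i in range(10_000)), n_runs=10)
print(f"{mean_s:.6f} s per image")
```

Using `time.perf_counter` rather than `time.time` avoids clock-resolution artifacts when the per-image time is small.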

User Study
The user study is conducted on social media, and all participants are anonymous and voluntary. We choose 10 content images and 10 style images to synthesize 10 stylized images with each method and then ask subjects to select their favorite. By the end of this user study, we had collected 341 votes from the anonymous participants. Figure 8 shows the percentage of votes for each method. The result shows that the stylization results obtained by our method are more appealing than those of other methods.

Ablation Study on Loss Function
We conduct ablation experiments to verify the effectiveness of each loss term used for training our model, and the results are shown in Figure 9. (1) Without perceptual loss l_p, too many structures of the content image are discarded; for example, the basic structure of the dog disappears in the stylized image. (2) Without Gram matrix loss l_g, the stylization result is acceptable because mean-variance loss l_m has a similar effect to l_g, but the color distribution of the stylized image is slightly different from that of the style image; moreover, the textures of the dog in the stylized image become denser and smaller. (3) Without rEMD loss l_r, the texture distribution is chaotic, and some visual artifacts occur in the stylized image. (4) Without mean-variance loss l_m, the global color distribution of the stylized image is not exactly the same as that of the style image; for example, the dark color of the dog in the stylized image is closer to that in the content image, even though this dark black color is completely absent in the style image.
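For readers unfamiliar with these loss terms, the Gram matrix loss l_g and the mean-variance loss l_m can be sketched for a single feature map. This is a minimal illustration assuming the usual definitions from the style transfer literature; the layer weighting and the feature extractor (typically a pretrained VGG) are omitted.

```python
# Sketch of the two statistics-matching style losses ablated above,
# for one feature map given as C rows of H*W activations.
# Layer weights and the feature extractor are omitted.

def gram(feat):
    """Gram matrix G[i][j] = <feat_i, feat_j> / (C * HW)."""
    c, hw = len(feat), len(feat[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / (c * hw) for fj in feat]
            for fi in feat]

def gram_loss(f_out, f_style):
    """Squared Frobenius distance between the two Gram matrices."""
    go, gs = gram(f_out), gram(f_style)
    return sum((a - b) ** 2 for ro, rs in zip(go, gs) for a, b in zip(ro, rs))

def mean_var_loss(f_out, f_style):
    """Match per-channel mean and standard deviation (AdaIN-style statistics)."""
    loss = 0.0
    for fo, fs in zip(f_out, f_style):
        mo, ms = sum(fo) / len(fo), sum(fs) / len(fs)
        so = (sum((v - mo) ** 2 for v in fo) / len(fo)) ** 0.5
        ss = (sum((v - ms) ** 2 for v in fs) / len(fs)) ** 0.5
        loss += (mo - ms) ** 2 + (so - ss) ** 2
    return loss

f = [[1.0, 2.0, 3.0], [0.5, 0.0, -0.5]]
print(gram_loss(f, f), mean_var_loss(f, f))  # both 0.0 for identical features
```

Both losses compare only summary statistics of the feature maps, which is why removing one of them can be partially compensated by the other, as the ablation above observes.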

Effectiveness of Coarse Network
During training, we compare our full model with the model without the coarse network. As shown in Figure 10, our full model trains faster than the model without the coarse network, and a preliminary stylization result can be obtained with fewer iterations. The stylized images produced during the training phase are compared in Figure 11. At 3000 iterations, our full model can generate a stylized image with a basic structure, while the model without the coarse network generates a completely unstructured image. At 10,000 iterations, the stylization result of our full model is substantially acceptable, whereas the result of the model without the coarse network remains unsatisfactory because the main structure has not yet been generated. At 30,000 iterations, the model without the coarse network finally synthesizes its final stylized image, but some messy textures and unnatural structures appear in it. Compared with this compromised result, our full model generates a more promising stylized result with more-refined details at 30,000 iterations; for example, the brushstrokes of the cat's fur and eyes are more delicate and finer than those at 10,000 iterations.


Effectiveness of Fine Network
As shown in Figure 12, we demonstrate the effectiveness of the fine network. Without the fine network, the coarse network can transfer the color and texture of style images, but the local details and global structure are worse than when our full model is utilized. The stylized image generated directly by the coarse network resembles an unfinished work in progress.

Effectiveness of the SSF Modules
We compare two feature fusion methods through some experiments. In the first method, the reconstructed coarse stylized features from the coarse network are fused with the reconstructed content features in the fine network on the basis of our SSF modules. In the second method, we directly concatenate these two features for feature fusion. As Figure 13 shows, the stylization results based on the second method transfer the wrong color distribution in some regions. With the first method, our model can accurately transfer the color distribution and generate more-natural textures in the stylized images by selecting more-critical information.
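The contrast between the two fusion strategies can be illustrated with a toy example. The sigmoid gate below is a generic stand-in showing how a learned mask can select, per element, between coarse stylized features and content features; it is not the actual design of the SSF module, whose internals are described elsewhere in the paper.

```python
# Toy contrast between plain concatenation and a selective,
# gate-based fusion of two feature vectors. The gate here is a
# hypothetical stand-in for selective fusion in general, not the
# SSF module's actual architecture.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def concat_fusion(stylized, content):
    # Naive fusion: stack both features; the channel dimension doubles
    # and the network must untangle them downstream.
    return stylized + content

def gated_fusion(stylized, content, gate_logits):
    # Selective fusion: a per-element gate in (0, 1) chooses how much
    # of each source to keep, so the output stays the same size.
    gates = [sigmoid(g) for g in gate_logits]
    return [g * s + (1.0 - g) * c
            for g, s, c in zip(gates, stylized, content)]

stylized, content = [1.0, -2.0, 0.5], [0.2, 0.3, -0.1]
print(concat_fusion(stylized, content))  # 6 values
print(gated_fusion(stylized, content, [10.0, -10.0, 0.0]))  # gates ~ [1, 0, 0.5]
```

The gated form keeps the feature size fixed and lets training suppress whichever source is less informative at each position, which matches the intuition of "selecting more-critical information" above.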

Additional Experiments
In Figure 14, we zoom in on some details in the style images, content images, and stylized images. The local structures of these style images are transferred to the content image, and the objects in the stylized images look like a reasonable composition of the style structures rather than a simple mixture of the content structure and the style texture.

As shown in Figure 15, we can control the stylization degree by adjusting the weight term α in the training phase. These experiments demonstrate that the main content structure can be preserved even when the stylization degree is large, and some local style structures, such as lines or color blocks, can be fused into the global content structure.

Following Gatys et al. [26], we incorporate color control and spatial control into our method. In Figure 16b, the color distribution and the local structure of the stylized image are consistent with those of the style image. We then use color control to make the stylized image preserve the global color of the content image: in Figure 16c, although the color is similar to that of the content image, the local structure and texture are the same as those of the style image. In Figure 17, we use spatial control to transfer different regions of the content image to different styles. The stylization result is appealing, as the local style structures and color distribution are well maintained. Both experiments demonstrate that our model can synthesize high-quality structure-aware stylized images by fusing key local structures from the style image into the main content structure while discarding some trivial details from the content image.

Conclusions
The conclusions are summarized as follows:
1. We proposed a novel feed-forward style transfer algorithm that fuses the local style structure into the global content structure. Different from most style transfer methods, which work at a single scale, our model can integrate richer information from features at different scales and then synthesize high-quality structure-aware stylized images.
2. We first proposed a coarse network to generate reconstructed coarse stylized features at low resolution, which capture the main structure of the content image and transfer the holistic color distribution of the style image. Then, we proposed a fine network to enhance local style patterns and three SSF modules to selectively fuse the reconstructed stylized features with the reconstructed content features at different levels.
3. Through comparative experiments, it was demonstrated that our method is effective in synthesizing appealing high-quality stylized images and that these stylization results outperform those generated by current state-of-the-art style transfer methods. The experimental results also demonstrated the effectiveness of the coarse network, the fine network, and the SSF modules.
Although high-quality stylization results can be synthesized by our method, our model can generate stylized images of only a single style per training process. In future studies, we will develop a novel arbitrary style transfer framework based on the full model presented in this paper, so that appealing high-quality structure-aware stylized images with arbitrary styles can be generated after a single training process. In addition, we will try to use other feature transform methods in place of the whitening and coloring transforms to achieve higher running-time efficiency.

In the first stage, the coarse network takes downsampled versions of the content image x_c and the style image x_s (each downsampled by 2) as inputs. Then, three restructured coarse stylized features f_r^(i) (i = 1, 2, 3) are generated by the coarse network, each with its own number of channels, height, and width. In the second stage, the fine network takes x_c and the f_r^(i) as inputs and then generates the final stylized image x_cs by adopting SSF modules for feature fusion.
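The two-stage data flow can be summarized as shape bookkeeping; the channel counts below are hypothetical placeholders, not the actual network configuration.

```python
# Shape-level sketch of the coarse-to-fine pipeline described above.
# Shapes are (channels, height, width); the channel counts are
# hypothetical placeholders, not the paper's actual configuration.

def downsample_by_2(shape):
    c, h, w = shape
    return (c, h // 2, w // 2)

def coarse_network(content_shape, style_shape, channels=(64, 128, 256)):
    """Return three restructured coarse stylized feature shapes f_r^(i),
    one per scale, each with its own channels, height, and width."""
    c, h, w = content_shape
    return [(channels[i], h // (2 ** i), w // (2 ** i)) for i in range(3)]

def fine_network(content_shape, coarse_feats):
    """Fuse each f_r^(i) via an SSF module (internals not modeled here)
    and emit the final stylized image x_cs at content resolution."""
    return content_shape

x_c = x_s = (3, 512, 512)
f_r = coarse_network(downsample_by_2(x_c), downsample_by_2(x_s))
print(f_r)                     # three multiscale feature shapes
print(fine_network(x_c, f_r))  # (3, 512, 512)
```

The point of the sketch is the division of labor: the coarse stage works at half resolution and produces multiscale features, while the fine stage restores the full content resolution.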

Figure 1 .
Figure 1. Overview of our framework.


Figure 2 .
Figure 2. Different stylization results from the same content image and style image: (a) the content image is a cat, and the style image is Starry Night by Vincent van Gogh; (b) this stylized image is generated directly by our coarse network in the first stage; (c) the final stylized image is generated by our full model in the second stage; (d) this stylized image maintains the same color as the content image using color control.


Figure 3 .
Figure 3. Comparison of two stylized images generated by the coarse network at different resolutions: (a) the original content image with a resolution of 512 × 512; (b) the stylized image with a resolution of 256 × 256; (c) the stylized image with a resolution of 512 × 512.


Figure 5 .
Figure 5. The schematic of the loss network.


Figure 6 .
Figure 6. Qualitative comparisons between our method, WCT [6], and STROTSS [2]: (a) the content images; (b) the style images; (c) the stylized images generated by our method; (d) the stylized images generated by WCT; (e) the stylized images generated by STROTSS.


Figure 7 .
Figure 7. Qualitative comparisons between our method and other state-of-the-art methods: (a) the content images; (b) the style images; (c) the stylized images generated by our method; (d) the stylized images generated by Gatys et al. [1]; (e) the stylized images generated by Johnson et al. [3]; (f) the stylized images generated by AdaIN [5]; (g) the stylized images generated by SANet [22].

Figure 9 .
Figure 9. Ablation study of the effects of different loss functions used during training.


Figure 10 .
Figure 10. Comparison of the full model and the model without the coarse network in terms of total loss.


Figure 11 .
Figure 11. Comparison of stylized images using the full model and the model without the coarse network during training. In the first row, the stylized results are generated by our full model. In the second row, the stylized results are generated by the model without the coarse network.

Figure 12 .
Figure 12. Comparison of stylized images of the full model and the model without the fine network: (a) the content images; (b) the style images; (c) the stylized images generated by the model without the fine network; (d) the stylized images generated by the full model.


Figure 13 .
Figure 13. Comparison of stylized images of our models with different feature fusion methods: (a) the content images; (b) the style images; (c) the stylized images generated by the full model; (d) the stylized images generated by the model without SSF modules.

Figure 14 .
Figure 14. Comparison of local style details.


Table 1 .
The differences between our method and those in previous studies.

Table 3 .
Quantitative comparisons of LPIPS and SSIM between our method and six state-of-the-art methods.

Table 4 .
Running time comparison between our method and six state-of-the-art methods (in seconds).