Deep Deformable Artistic Font Style Transfer

Abstract: The essence of font style transfer is to move the style features of an image into a font while maintaining the font’s glyph structure. At present, generative adversarial networks based on convolutional neural networks play an important role in font style generation. However, traditional convolutional neural networks that recognize font images suffer from poor adaptability to unknown image changes, weak generalization ability, and poor texture feature extraction. When the glyph structure is very complex, stylized font images cannot be effectively recognized. In this paper, a deep deformable style transfer network is proposed for artistic font style transfer, which can adjust the degree of font deformation according to the style and realize the multiscale artistic style transfer of text. The new model consists of a sketch module for learning glyph mapping, a glyph module for learning style features, and a transfer module for the fusion of style textures. In the glyph module, the Deform-Resblock encoder is designed to extract glyph features; it introduces a deformable convolution and changes the size of the residual module to fuse feature information at different scales, better preserve the font structure, and enhance the controllability of text deformation. Therefore, our network has greater control over text, processes image feature information better, and can produce more exquisite artistic fonts.


Introduction
An artistic font is a traditional font [1] that has been deformed artistically and decoratively according to the meaning, character shape, and structural features of the text. Because they are attractive and eye-catching, artistic fonts are widely used in publicity, advertising, trademarks, and other scenarios and are becoming increasingly popular among the public. Traditionally, artistic fonts are designed by professional font designers, so their quality depends on the designers' skill and other factors. In recent years, with the advent and development of machine learning technology [2], people have applied deep learning methods to artistic font generation to achieve better results.
Currently, the majority of image style transfer methods are based on convolutional neural networks (CNNs). These methods adjust a random noise image by optimizing a loss function so that the generated image maintains the content of the content image while taking on part of the style of the style image. Since artistic fonts can be viewed as beautiful images, image style transfer methods can also be applied to artistic font style transfer. However, the key to artistic font generation is to synthesize text texture and add colorful texture information to the target text. Compared with image style transfer methods, artistic font style transfer methods need to extract the edge features of text more accurately to maintain the integrity of the font structure in the stylization process. As CNNs adopt convolution kernels of a fixed shape and lack internal mechanisms to adapt that shape, it is difficult to extend them to new tasks with unknown, complex geometric transformations. For example, in visual recognition with fine localization, different locations need different scales or receptive field sizes, but fixed convolutional kernels prevent CNNs from satisfying this requirement. For artistic font style transfer methods based on CNNs, both structural disjunction and stroke overlap occur when the glyph structure is complex. In addition, some conventional CNNs treat background features as edge features of the text during feature extraction, which adds noise points to the image while the glyph features are extracted and results in double shadows and style spillover in the style transfer process. These problems directly lead to stylized font images not being accurately identified.
Shape-matching GAN [3] is an effective model that can realize multiscale deformation of artistic fonts. The model encoder consists of a general CNN and a controllable module. The original controllable module consists of two identical branches, each with two convolutional layers; each layer has a kernel size of 3, so the receptive field of the resulting feature map is 5. A deep network with residuals at different depths performs better than a shallow network, but the more layers there are, the more overfitting occurs. In addition, features are lost in the convolution process, and a larger receptive field can better ensure the integrity of the information. In summary, the inherent nature of general convolutions and the small receptive field of the controllable module prevent shape-matching GAN from recognizing complex fonts, resulting in unclear image glyphs and style overflow.
Based on the above analysis, artistic font style transfer methods based on CNNs face two challenges in improving their performance: (1) how to accurately extract the edge features of text to provide complete glyphs for generating artistic fonts; (2) how to eliminate the double shadows and style overflow caused by noise. In this paper, a novel artistic font generation network is proposed. To address the first challenge, an encoder for glyph generation is designed that introduces a deformable convolution [4,5], which can freely change the receptive field by adjusting the offsets of the sampling locations, thus improving the encoder's ability to model the geometric variations of text and allowing it to learn more complete glyph information. To address the second challenge, the difference between adjacent pixel values is calculated as a smoothing loss, and the smoothness of image edges is maintained by reducing this loss.
The contributions of this paper are summarized as follows:
(1) A deep deformable artistic font style transfer network (DAF) is proposed, which consists of a Sketch module for learning glyph mapping, a Glyph module for learning style features, and a Transfer module for the fusion of style textures. In the Glyph module, a Deform-Resblock encoder (DR encoder) is designed to extract glyph features, in which a dilated convolution and a deformable convolution are used to change the receptive field so that the encoder focuses on the more critical features. The deformable convolution also helps integrate feature information at different scales to ensure that the generated glyphs maintain their complete font structure.
(2) Ghosting is eliminated and the image is smoothed by introducing a smoothing loss function that reduces the difference between the values of adjacent pixels in the image.
(3) The proposed model is compared with four current advanced artistic font style transfer methods, and the experimental results show that it is effective and performs better.
The remainder of this paper is organized as follows. Section 2 reviews related work on deformable convolutional networks, image style transfer, and font style transfer. Section 3 describes the proposed DAF model in detail. To evaluate the performance of our model, a series of experiments is conducted in Section 4. Finally, we summarize this paper in Section 5.

Deformable Convolutional Networks
Research on CNNs [6,7] dates back to the neocognitron model proposed by Japanese scientist Kunihiko Fukushima [8] in 1980. It was the first neural network to use convolution and downsampling, and it is also the prototype of convolutional neural networks. In 1989, Yann LeCun [9,10] constructed a CNN for computer vision problems, which was one of the first CNNs, i.e., the original version of LeNet. It used convolutional layers and pooling layers for the first time and achieved remarkable accuracy in handwritten character recognition tasks. As LeNet continued to be studied and its subsequent variants defined the basic structure of modern CNNs, its success drew attention to the application of CNNs [11]. Recently, CNNs [12] have achieved significant success in visual recognition tasks, but CNNs are limited in modeling large, unknown transformations and lack internal mechanisms to handle geometric transformations. They have difficulty with finely localized visual recognition of objects at different scales or under deformation. Deformable convolutional networks [4,5] overcome these limitations by introducing a deformable convolution module and a deformable RoI pooling module to improve the transformation modeling capabilities of CNNs. In the deformable convolution module, the grid sampling positions of the standard convolution are shifted by 2D offsets learned by an extra convolutional layer. Deformable RoI pooling adds 2D offsets to each bin position of the standard RoI pooling. Thus, the sampling and pooling of a deformable convolutional network can vary with the structure of the object so that the network adjusts its feature representation according to the object's configuration. Currently, deformable convolutional networks are widely used in image processing [13][14][15], complex vision [16,17], pattern recognition [18,19], and other fields [20,21], where they show powerful performance.

Image Style Transfer
Image style transfer is the migration of a style image so that the input image has the style characteristics of the style image. In 2015, Gatys et al. [22] proposed neural style transfer to facilitate image style transfer. However, neural style transfer has some drawbacks. For example, the network must be trained for each migration, which is very slow and cannot achieve real-time migration, and the style migrations on photos may be distorted. To address these problems, Johnson et al. [23] proposed a fast neural style transfer method that trains one network for each style image so that, at test time, only one forward pass is needed to obtain the generated image for a given content image. Luan et al. [24] proposed photo style transfer, which solved the photo distortion problem by improving the loss function. Generative adversarial networks (GANs) [25] are neural networks designed to solve the problem of generative modeling, in which the generative model learns to capture the statistical distribution of the training data and synthesizes samples from the learned distribution. Consequently, GANs have become a prevalent method for image style transfer [26,27]. Conditional GANs (CGANs) [28] introduce image-to-image translation and a structured loss function so that networks learn not only the mapping from an input image to an output image but also the loss function for training this mapping. This feature makes CGANs suitable for image generation problems. Because there are no content and style constraints on CGANs, their outputs are more similar to artistic creations, and the effect is significantly improved compared with other image generation models. In recent years, GANs have become a hot research direction, and more variant models have been proposed, such as CycleGAN [29], Wasserstein GAN (WGAN) [30], deep convolutional GAN (DCGAN) [31], and shape-matching GAN [3]. Compared with other methods, GANs produce richer artistic effects for image style transfer.

Font Style Transfer
Font style transfer [32], the process of extracting artistic features from images of a given style and integrating them into text images, is a long-standing research problem. Font synthesis is the process of translating a font from one domain to another, and the key to this process is predicting the shapes of the glyphs. Unlike font synthesis, font style transfer is the challenging problem of transferring the color and texture of artistic styles to new glyphs. The BAIR Lab at Berkeley collaborated with Adobe to design a multi-content GAN for font style transfer [33], developing a new decorative network to predict the color and texture of the final glyphs. Then, Yang et al. [34] studied dynamic artistic text style migration with control over the degree of glyph stylization and proposed a novel bidirectional shape-matching framework for font style migration. They introduced a scale-aware shape-matching GAN to learn the glyph style shape mapping, model the style shape features at multiple scales simultaneously, transfer them to target glyphs, and generate high-quality and controllable artistic text. Subsequently, building on Yang's work, Zhang et al. proposed a font effect generation model [35] based on pyramid-style features, using morphological operations to improve the transfer effect. Recently, a diverse transformation network for text style transformation [36] has been proposed, which can generate multiple styles of text images in a single model, allowing all styles to be effectively trained on the network.

Artistic Font Generation Network
Image style transfer migrates the style of an image to another image. Unlike image style transfer, font style transfer migrates the style of an image into the text of another image. Thus, if an image style transfer method is applied directly, the structural characteristics of the text will be destroyed. Yang et al. [3] studied fast and controllable artistic text style transfer in terms of font deformation and proposed a shape-matching GAN for text style transfer. However, by repeatedly testing the shape-matching GAN model, we found that it is unable to extract clear font features for fonts with complex strokes, which leads to problems such as stroke adhesion, fuzzy edges, and severe deformation of the font. Currently, these problems are addressed by preprocessing the input image, but this method is time-consuming and difficult to implement. To overcome these limitations, in this section, we propose a deep deformable artistic font style transfer network. The key features of this network are the design of the DR encoder to learn font features and extract more image information, and the introduction of a smoothing loss to preserve key edge details of the font images. Thus, the network is better able to extract features, control font deformation, and maintain the structure of complex glyphs.

Overall Network Architecture
The overall network architecture consists of three main components: (1) a Sketch module for learning glyph mapping, (2) a Glyph module for learning style features to generate deformed fonts, and (3) a Transfer module for style texture fusion. The network architecture is shown in Figure 1.
As shown in Figure 1a, the network training process is divided into three modules: the Sketch module, the Glyph module, and the Transfer module.

In the Sketch module, we first process the style image and turn it into a mask, which can be easily obtained using image editing tools. We call the mask the structure mapping $X$ of the style image and then use $X$ and the style image as the input of the sketch module $G_B$. The sketch module $G_B$ is composed of a smoothing block, a transformation block, and a smoothing loss function. The smoothing block is used to smooth the input image, and the transformation block is used to map the smoothed style image back to the text domain. Therefore, the edge of the style image can learn the edge features of the text image and realize structural transformation. We first smooth the input style image mask $X$ with the sketch module $G_B$ to weaken the edges. We introduce a new loss function, called the smooth loss, to maintain the smoothness of the image so that the font can better learn the features of the style image that we provide. We transform the style image into different degrees of deformation by adjusting the parameter $\ell$ ($\ell \in [0, 1]$). After deformation, the mask is generated.

In the Glyph module, we train the $G_S$ network. By cropping the mask, training pairs of sketch shapes with different smoothness can be obtained. Each training pair is fed into the glyph network $G_S$, which is trained to map it to the original $X$ so that it can characterize the shape features of $X$ and transfer these features to the target text. This increases data diversity and forces the model to learn more robust features, which effectively improves the generalization ability of the model. Through the $G_S$ network, the font learns the style structure features and obtains the deformed font mask.

In the Transfer module, we train the network $G_T$, which is similar to training $G_S$. A style image mask $X$ and a style image are randomly cropped to form a training pair as the input of $G_T$. The network $G_T$ is trained to perform texture rendering instead. Style migration is performed on the input image so that the deformed font has the style features of the style image.

Figure 1b shows the network testing process. $G_S$ learns the structural features of style images through training. Given a text mask image and a style mask image, the text image learns the corresponding style features, and a deformed text mask image is generated. The deformed text mask is then input into $G_T$ for style texture migration to obtain the final result.
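The training pairs described above depend on cutting the same window out of the mask and the style image, so that structure and texture stay aligned. The following is a minimal sketch of such a paired random crop, not the authors' implementation; the function name `paired_random_crop` and the list-of-rows image representation are assumptions made for illustration.

```python
import random


def paired_random_crop(mask, style, size, rng=random):
    """Cut the same size x size window out of a mask and its style image.

    Both inputs are 2D lists (rows of pixels) with identical height and
    width; sharing one window keeps mask and texture pixels aligned.
    """
    h, w = len(mask), len(mask[0])
    if len(style) != h or len(style[0]) != w:
        raise ValueError("mask and style image must have the same size")
    if size > h or size > w:
        raise ValueError("crop size exceeds image size")
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)

    def crop(img):
        return [row[left:left + size] for row in img[top:top + size]]

    return crop(mask), crop(style)


# Toy 8x8 "images": the style pixel is the negated mask pixel, so the
# alignment of the two crops can be checked directly.
rng = random.Random(0)
mask = [[y * 8 + x for x in range(8)] for y in range(8)]
style = [[-(y * 8 + x) for x in range(8)] for y in range(8)]
m, s = paired_random_crop(mask, style, 4, rng)
print(all(s[i][j] == -m[i][j] for i in range(4) for j in range(4)))  # True
```

In a real pipeline the same idea would be applied to image tensors, with the crop size set to the 256 × 256 used for training.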

Glyph Networks ($G_S$)
The generator encoder of a generative adversarial network is generally a convolutional neural network, which consists of convolutional layers, pooling layers, and batch normalization layers. The DR encoder is redesigned as shown in Figure 2. First, we repeatedly pad the input feature map to a specific size and then use dilated convolution to expand the receptive field of the network on the feature map. Second, we downsample the feature map twice and shift the target features through deformable convolution to obtain more accurate edge features. Finally, the feature map is fed into the controllable deep residual network and linearly superimposed, and the corresponding feature map is output. By continuously learning and adjusting the size of the convolutional layers to obtain the most suitable depth for this network, the texture generation network retains as many complex font structure features as possible. Considering that pooling degrades the performance of the generative model, the encoder uses strided convolution for downsampling. In addition, we use transposed convolution for feature upsampling to avoid checkerboard artifacts.

The generator of the glyph generation network consists of an encoder and a decoder. The encoder is crucial in the glyph network because it determines whether the feature fusion process can maintain the glyph structure. To allow the network to effectively recognize font details, we design the structure of the DR encoder, as shown in Figure 2. The glyph generation network extracts the desired text and style features with the DR encoder and optimizes the training process by learning from a large number of samples. After training, the network has learned the corresponding stylistic features and can directly perform stylization to generate font masks with stylistic features, which significantly reduces the time and space complexity compared with other networks, making the application of style transformation possible.
The structure of the DR encoder is improved mainly by redesigning the residual module size and introducing deformable convolution. The convolutional layer in the encoder is responsible for acquiring the local image features. In ordinary convolutions, the receptive field is fixed by the size of the convolution kernel. We can expand the receptive field only by changing the size of the convolutional kernel or increasing the number of convolutional layers, which inevitably increases the number of parameters and computations of the network model and affects model efficiency. Therefore, we use dilated convolutional layers instead of normal convolutional layers to expand the receptive field without changing the size of the convolutional kernel, so that the network attends to more features and obtains more detailed information. In CNNs, we calculate the size of the receptive field by Equation (1):

$$g_n = g_{n-1} + (k - 1) \prod_{i=1}^{n-1} S_i \qquad (1)$$

where $g_n$ is the receptive field of the $n$-th layer, $n$ is the number of layers, $S_i$ is the step size of the $i$-th convolution or pooling layer, and $k$ is the size of the convolution kernel. According to Equation (1), the receptive field can grow exponentially with depth.
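As a worked illustration of the receptive field calculation, the sketch below assumes that Equation (1) is the standard recurrence $g_n = g_{n-1} + (k - 1)\prod_{i<n} S_i$ with $g_0 = 1$. The optional dilation argument uses the usual effective kernel size $k + (k - 1)(r - 1)$ of a dilated convolution; the function names are hypothetical and the code is not taken from the authors' implementation.

```python
def effective_kernel(k, r=1):
    """Span covered by a k-tap kernel with dilation rate r."""
    return k + (k - 1) * (r - 1)


def receptive_field(layers):
    """Receptive field of a stack of convolution or pooling layers.

    Each layer is a (kernel_size, stride, dilation) tuple, listed from
    input to output.
    """
    g = 1      # a single input pixel sees only itself
    jump = 1   # product of the strides of all earlier layers
    for k, s, r in layers:
        g += (effective_kernel(k, r) - 1) * jump
        jump *= s
    return g


# Two plain 3x3 layers, as in the controllable module of shape-matching GAN.
print(receptive_field([(3, 1, 1), (3, 1, 1)]))  # 5
# The same two layers with dilation rate 2: larger field, same parameters.
print(receptive_field([(3, 1, 2), (3, 1, 2)]))  # 9
# Strided layers multiply the contribution of every later layer.
print(receptive_field([(3, 2, 1), (3, 2, 1), (3, 1, 1)]))  # 15
```

The first call reproduces the receptive field of 5 quoted for two stacked kernel-size-3 layers, and the second shows how dilation enlarges the field without adding weights.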
The dilated convolution has a hyperparameter, the dilation rate $r$, which represents the interval between the elements of the convolution kernel; the dilation rate of the standard convolution is 1. We calculate $r$ through Equation (2).

[Equation (2) is garbled in the source; only the symbols $r$, 2, and 1 survive.]

In our calculation formula, rate defaults to 1. Dilated convolution increases the receptive field of the convolution kernel while keeping the number of parameters constant, so that each convolution output contains a larger range of information, allowing us to better detect feature targets and capture contextual information. However, dilated convolution has a limitation for fonts with complex cavities: when the font has many strokes, too large a receptive field blurs detailed features. Therefore, to compensate for this shortcoming of dilated convolution, we introduce deformable convolution in the encoder and use additional offsets to augment the spatial sampling positions in the module so that our convolutional layer can automatically adjust the scale or receptive field to obtain the best image.

In addition, an adjustment of the direction vector of the convolution kernel is added to the traditional convolution to shift the shape of the convolution kernel closer to the feature object. The convolution process is shown in Figure 3. A CNN extracts feature maps, which are then used as the input of another convolutional layer. In Figure 3, there is an additional convolutional layer that shares the input feature maps and learns the offsets. The purpose of this layer is to obtain the offsets of the convolutional deformation; following Equation (3), an interpolation algorithm is used to sample the feature map at the shifted positions so that the offsets can be learned by backpropagation. Deformable ConvNets differ in that they perform dense spatial transformations in a simple, efficient, deep, and end-to-end manner. The deformable convolution introduces an offset $\Delta P_n$ for each sampling point; the offset is generated from the input feature map by another convolution and is usually fractional. In Equation (4), $P_n$ enumerates the sampling positions within the range of the convolution kernel relative to the position $P_0$.
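To make the sampling step concrete, the sketch below follows the formulation of deformable convolution in [4], where the output at $P_0$ is $\sum_n w(P_n)\, x(P_0 + P_n + \Delta P_n)$ and fractional positions are read by bilinear interpolation. It is assumed, not confirmed by the text, that Equations (3) and (4) take this form. The code evaluates a single output location in pure Python with hand-written offsets; in the actual network the offsets would come from the extra convolutional layer, and the function names are hypothetical.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D list `feature` at fractional (y, x).

    Neighbours outside the map contribute zero (zero padding).
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                weight = (1 - abs(y - qy)) * (1 - abs(x - qx))
                value += weight * feature[qy][qx]
    return value


def deform_conv_at(feature, weights, offsets, p0):
    """Deformable 3x3 convolution output at one location p0 = (row, col).

    weights[n] is the kernel weight and offsets[n] the (dy, dx) shift
    for the n-th grid position P_n, in row-major order.
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # the P_n
    out = 0.0
    for (py, px), w, (oy, ox) in zip(grid, weights, offsets):
        out += w * bilinear(feature, p0[0] + py + oy, p0[1] + px + ox)
    return out


# Horizontal ramp: feature[y][x] = x, averaged by a 3x3 box kernel.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
box = [1.0 / 9.0] * 9
no_shift = [(0.0, 0.0)] * 9
half_right = [(0.0, 0.5)] * 9
print(round(deform_conv_at(ramp, box, no_shift, (2, 2)), 6))    # 2.0
print(round(deform_conv_at(ramp, box, half_right, (2, 2)), 6))  # 2.5
```

With zero offsets the result equals an ordinary 3 × 3 convolution; shifting every sampling point half a pixel to the right moves the averaged window accordingly, which is the mechanism that lets the kernel follow stroke edges.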
In the subsequent ablation experiments, it is also verified that introducing deformable convolution into the encoder better extracts the features of text images and style images and improves the control of font deformation. This innovation is the key to solving two problems: inadequate feature extraction for complex fonts, and font deformation that severely damages the original font structure.
In addition to improving the encoder by introducing deformable convolution, we find that surface subdivision artifacts appear in the input residual module. These artifacts can carry edge features of the font that are not recognized by the network, which can lead to style overflow after style migration. To address this issue, we increase the depth of the residual module to further control the font deformation strength.

Transfer Networks ($G_T$)
For the $G_T$ module, we use the texture network structure of the shape-matching GAN [3] model. After the glyph generation network, we obtain a text mask image with the learned style features. As with the glyph generation network $G_S$, a large number of data pairs are obtained by cropping the style image and the text mask image, and these pairs are used to train an end-to-end fast text style model so that the style network can adapt to the shape of the text and quickly generate target images. The main idea is that, by taking the deformed text images with style characteristics that we have generated as the input of the transfer network, we can select the style images that need to be migrated, and all text images can be effectively trained in the network to obtain the corresponding style images. The advantage of this network is that multiple text styles can be generated using a single model, and the generation of text styles can be controlled.

Loss Function
The loss of the network $G_S$ contains a reconstruction loss and an adversarial loss. In the reconstruction loss, $\ell$ ($\ell \in [0, 1]$) controls the degree of deformation; setting $\ell$ controls font deformation and realizes multiscale style migration. $x$ represents the structural sketch obtained after a binary transformation of the style image, and $y$ represents a raw style image.

$\tilde{x}_{\ell}^{i}$ represents the style structure images with different degrees of deformation obtained from the $G_B$ network. We use a mask image as an information guide to reconstruct the structure of the different style images. The reconstruction loss restores the structure of the images at each degree of deformation, for each style, to the structure of the original. In the adversarial loss, we add the mask images to the generator and the discriminator, similar to the conditional GAN procedure.
The overall $G_S$ loss combines the reconstruction loss and the adversarial loss.

The main task of the $G_T$ network is to assign texture features to the structural images obtained from $G_S$. The loss of the network $G_T$ includes a reconstruction loss, a conditional adversarial loss, a style loss, and a texture loss. The style loss $L_T^{sty}$ is the one proposed in neural style transfer [22]. The overall $G_T$ loss combines these four terms.

The loss functions described above apply to our basic networks, $G_S$ and $G_T$. In the sketch model, we first select a text image $t$ as the base image and randomly select a value of $\ell$ within [0, 1] to reconstruct image $t$.
After obtaining the reconstructed image, we apply an adversarial loss to make the reconstructed image more similar to the original image.
When the sketch model smooths the edge features of the target image, the image is not smoothed well because the recovery algorithm amplifies noise; some features are lost, and additional noise features are added to the image that is input to $G_S$. Consequently, when the output of $G_S$ undergoes style migration, even a small amount of noise has a great impact on the result, producing shadows at the edges of the images, and the proportion of images contaminated by noise is significantly larger than the proportion of noise-free images. Therefore, we design a new smooth loss by adding a regular term to the sketch model to maintain the smoothness of the image. Reducing this loss reduces the difference between adjacent pixel values in the image, which alleviates the edge shading problem. The noise constraint comes at the cost of some image sharpness, but it resolves the problems of noise and poor edge smoothing in the image. The following equation is the regular term that we add.
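The exact form of the regular term is not recoverable from the text here, so the following is only an illustrative sketch of a loss built from differences between adjacent pixel values, in the spirit of a total variation penalty. The choice of absolute differences, the averaging over neighbour pairs, and the name `smooth_loss` are all assumptions rather than the authors' definition.

```python
def smooth_loss(image):
    """Mean absolute difference between horizontally and vertically
    adjacent pixels of a 2D grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            if x + 1 < w:   # right-hand neighbour
                total += abs(image[y][x + 1] - image[y][x])
                count += 1
            if y + 1 < h:   # neighbour below
                total += abs(image[y + 1][x] - image[y][x])
                count += 1
    return total / count if count else 0.0


flat = [[0.5] * 4 for _ in range(4)]
spike = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0]]
print(smooth_loss(flat))             # 0.0
print(round(smooth_loss(spike), 4))  # 0.3333
```

A perfectly flat image incurs no penalty, while a single noisy pixel raises the loss, so minimizing a term of this kind pushes the network toward smoother edges at some cost in sharpness, as described above.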
The overall sketch model loss is as follows:
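As a rough illustration of a regular term that penalizes differences between adjacent pixel values, the sketch below uses a total-variation-style penalty. This specific form, the function names `smooth_penalty` and `sketch_loss_with_smoothing`, and the `weight` parameter are assumptions made for illustration; they are not the paper's definition of the smooth loss.

```python
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def smooth_penalty(img: Image) -> float:
    """Mean squared difference between horizontally and vertically adjacent pixels.

    Assumed total-variation-style form: small for smooth images,
    large where neighbouring pixels differ sharply.
    """
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # right neighbour
                total += (img[r][c + 1] - img[r][c]) ** 2
                count += 1
            if r + 1 < h:  # lower neighbour
                total += (img[r + 1][c] - img[r][c]) ** 2
                count += 1
    return total / count if count else 0.0


def sketch_loss_with_smoothing(base_loss: float, img: Image, weight: float = 1.0) -> float:
    """Add the weighted smoothness penalty to an already computed loss value."""
    return base_loss + weight * smooth_penalty(img)
```

A constant image yields a penalty of zero, while a checkerboard pattern is penalized heavily. A larger `weight` trades sharpness for smoothness, matching the trade-off described above.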

Experiment

Dataset
We use the TE141K dataset [37], which contains 152 professionally designed text effects rendered on glyphs, including English letters, Chinese characters, and Arabic numerals. The dataset is divided according to an 8:2 ratio, with 608 pictures in the training set and 152 pictures in the test set. This dataset is one of the largest font style migration datasets to date and can be used in research areas such as font style migration, multidomain transfer, and image-to-image translation.
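The 8:2 division can be sketched as a seeded shuffle followed by a cut at 80% of the items. The helper name `split_dataset`, the seed, and the use of a random shuffle are assumptions for illustration; the text does not say how the pictures were assigned to each subset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle items reproducibly and split them into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]
```

With 760 pictures in total, a fraction of 0.8 gives the 608 training and 152 test pictures reported above.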

Training
Our model consists of the sketch module G B, the glyph module G S, and the transfer module G T, so we divide the training strategy into three steps. Before training starts, we randomly crop the images to 256 × 256. We use the Adam optimizer with a learning rate of 0.0002 and perform 3 training epochs.

First, we need only input a style image mask to train the sketch module G B. The model smooths the input image to reduce the sharpness of the image edges, and the smooth loss we design further improves the smoothing effect of the network. We need the model to connect the source style domain and the target text domain using a smoothing block, which maps the style image and the style image mask to the smoothing domain, where the details are eliminated and the contours show a similar degree of smoothing. According to the adjustment parameter l (l ∈ [0, 1]), the smoothed style image is transformed into masks with different degrees of smoothness.

Next, we train G S. By cropping the mask, training pairs of sketch shapes with different smoothness can be obtained. Each training pair is fed into the glyph network G S, which is trained to map it to the original text mask so that G S can characterize the shape features of the text image mask and transfer these features to the target text. The encoder we design can control font deformation more flexibly at different levels and enhances the generalization ability of the model. The dilated convolution, deformable convolution, and residual block structure we design make the edges of stylized images converge more closely to the edges of text images and make font deformation more flexible and controllable.

Finally, we train the G T module. Here, a style image mask and a style image are randomly cropped to form a training pair as the input for G T. The network G T is trained to perform texture rendering: style migration is performed on the input image so that the deformed font has the style features of the style image.
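The cropping and parameter sampling described above can be sketched as follows. One detail worth making explicit is that a style image and its mask must be cropped with the same window, otherwise the training pair no longer lines up. The constants repeat the values stated in the text; the function names and the list-of-rows image representation are assumptions for illustration, not the authors' implementation.

```python
import random
from typing import List, Optional, Tuple

Image = List[List[float]]  # image stored as rows of pixel values

CROP_SIZE = 256          # crop size stated in the text
LEARNING_RATE = 0.0002   # Adam learning rate stated in the text
EPOCHS = 3               # number of training epochs stated in the text


def paired_random_crop(
    style: Image, mask: Image, size: int = CROP_SIZE,
    rng: Optional[random.Random] = None,
) -> Tuple[Image, Image]:
    """Crop the same random size x size window from a style image and its mask."""
    rng = rng or random.Random()
    h, w = len(style), len(style[0])
    if (len(mask), len(mask[0])) != (h, w):
        raise ValueError("style image and mask must have the same shape")
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top = rng.randint(0, h - size)    # randint is inclusive at both ends
    left = rng.randint(0, w - size)

    def crop(img: Image) -> Image:
        return [row[left:left + size] for row in img[top:top + size]]

    return crop(style), crop(mask)


def sample_deformation_level(rng: Optional[random.Random] = None) -> float:
    """Draw the adjustment parameter l from [0, 1]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, 1.0)
```

Passing a seeded `random.Random` instance makes the crops reproducible, which helps when comparing training runs.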

Comparisons with State-of-the-Art Methods
We used shape-matching GAN as the baseline and conducted a number of experiments. The effects of our proposed method on artistic text style transfer are shown in Figure 4. Compared with the baseline, our method stylizes complex glyphs better while keeping the font structure clear and improving legibility.
Effect picture comparison. In Figure 5, we qualitatively compare our method with four state-of-the-art style transfer methods: neural style transfer (NST) [21], LapStyle [38], multi-style transfer (MST) [36], and shape-matching GAN [3]. These methods are chosen because they are all one-way style transfers, and most style transfer methods are derivatives of them. (a) NST is the most basic style transfer method; it uses a CNN for feature extraction and then uses the extracted features for reconstruction. It can transfer the style but cannot learn the style features, and the glyphs are homogenized. (b) LapStyle splits the complex style migration into an initial migration at low resolution and a correction process at high resolution, which effectively improves the quality and speed of stylization. LapStyle is therefore more suitable for overall image style migration. It is not applicable to artistic font generation because it is ineffective at extracting the features of fonts, which represent only one aspect of text images. (c) MST is a recently proposed, diversified transformation network for text style transfer that can generate multiple text images in a single model and control the text style in a simple way. (d) Shape-matching GAN is our baseline method, which cannot maintain the structure of complex font glyphs.

As seen from the results of the comparative experiments in Figure 5, our proposed method has obvious advantages in terms of the effectiveness of artistic font generation. Most other methods transfer the style of the whole style image and thus extract insufficient features from the text and style images, which makes them unable to generate clear and beautiful artistic text. In contrast, by introducing a deformable convolution and an improved residual module, our proposed network enhances the control of font deformation, enables more detailed font feature extraction, and solves the problems of severe font deformation and unclear character shapes. It differs from other style transfer networks in that the text acquires texture details while learning the image style, making the generated artistic characters unique and artistically ornamental.

Execution time comparison. We compare the time that different models need to generate an image during testing on a machine with an Intel Core i7-11700K and an RTX 3080 (10 GB), as shown in Table 1. We input 320 × 320 images into each model and average the inference time over 100 pictures. As seen from Table 1, each image generated by our proposed model requires only 0.039 s on average, so we can interact with users in nearly real time. Our time is slightly longer than that of shape-matching GAN [3] because of the deformable convolution added to the model, which introduces only a small overhead in model parameters and computation. However, it is precisely because of deformable convolution that our model can better capture the edge features of fonts and produce better results. NST [22] takes a long time to execute because it requires several iterations during testing to generate the final result.
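The averaging procedure used for the execution time comparison can be sketched as below. The function name `mean_inference_time` and the `model` callable are hypothetical stand-ins, not the authors' benchmarking code.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def mean_inference_time(
    model: Callable[[T], object],
    inputs: Iterable[T],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the average wall-clock seconds that model(x) takes per input."""
    total, n = 0.0, 0
    for x in inputs:
        start = clock()
        model(x)                  # only the forward pass is timed
        total += clock() - start
        n += 1
    if n == 0:
        raise ValueError("at least one input is required")
    return total / n
```

Applied to 100 images of size 320 × 320, this yields a per-image figure comparable to those in Table 1. When the model runs on a GPU, the device usually has to be synchronized before each clock reading because kernels are launched asynchronously; otherwise the measured time can be too short.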

Ablation Study
To analyze the advantages of our improvements on the baseline model, we design the following experiments with different configurations:

• Baseline: Our baseline network uses the original shape-matching GAN approach [3], trained to directly map the structure map X back to the style image Y.

The results of this ablation experiment are shown in Figure 6. Compared with the baseline network, W/o SL enhances the smoothing performance, which lets the text learn style features better while maintaining the font. The W/o NCR model improves the legibility of the font and preserves its structural features. However, unnecessary edge features are also recognized, resulting in style overflow. Owing to the flexibility of deformable convolution in feature extraction, the W/o DC model solves the problem of style overflow caused by identifying unnecessary edge features. In sum, when we adopt the full model, the results effectively solve the problem of the missing glyph structure and greatly increase the legibility of the text.

Conclusions
In this paper, we propose the deep deformable artistic font style transfer network, which maps the stylistic features of an image to the text of a text image and controls the degree of font deformation by adjusting parameters to achieve diverse style migration. In the network, the DR encoder that we designed can effectively extract font features, control font deformation, greatly improve the recognition accuracy of complex fonts, and enable the network to generate more exquisite art fonts. The DAF network is divided into three modules, and each module can be trained separately. In the sketch module, smooth loss is introduced to enhance the smoothness of the font edges and improve the similarity between the font edges and the edge transformations of the style images. In the G S module, the novel DR encoder is used to better preserve the font structure and improve font legibility. The G T module is trained to transfer the style image features to the font image so that the font not only retains its own glyph structure but also integrates the style features. We verified the effectiveness and robustness of the method by comparing it with state-of-the-art migration algorithms. In future work, we hope to integrate the attention mechanism with the DR encoder to improve font adaptivity, which will make the font style transfer more precise for text, resulting in more beautifully migrated text. Additionally,

Electronics 2023, 16

Figure 1. Architecture of the deep deformable artistic font style transfer network.

Figure 3. Schematic diagram of the joined deformable convolutional network.

Figure 4. Our artistic text style transfer effects.

Figure 5. Comparison with state-of-the-art methods on various styles.

Figure 6. Comparison chart of ablation experiments: (a) represents our original data, and (b) is the style features we need to migrate. The first row on the right is the resulting graph of the texture generation network G S, and the second row is the final output graph. From left to right are the output results of the model we proposed above.

Table 1. Execution time comparison.