Mathematics
  • Article
  • Open Access

11 January 2024

DALS: Diffusion-Based Artistic Landscape Sketch

Department of Computer Science, Sangmyung University, Seoul 03016, Republic of Korea
Department of Software, Sangmyung University, Cheonan 31066, Republic of Korea
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.

Abstract

We propose a framework that synthesizes artistic landscape sketches using a diffusion model-based approach. Furthermore, we suggest a three-channel perspective map (3CPM) that mimics the artistic skill used by real artists. We employ Stable Diffusion as the backbone and use ControlNet to process the 3CPM within Stable Diffusion. Additionally, we adopt the Low Rank Adaptation (LoRA) method to fine-tune our framework, thereby enhancing sketch quality and resolving the color-remaining problem, a frequently observed artifact in sketch images produced by diffusion models. We implement a bimodal sketch generation interface: text to sketch and image to sketch. In producing a sketch, a guide token is used so that our method synthesizes an artistic sketch in both cases. Finally, we evaluate our framework using both quantitative and qualitative schemes. The various sketch images synthesized by our framework demonstrate the effectiveness of our approach.

1. Introduction

Artistic sketch is a genre of fine arts that has long been beloved. Many researchers have proposed various schemes for synthesizing sketches from images [,,,,,,,,,,,,,,,,,,]. However, many of them focus on synthesizing portrait sketches [,,,,,]. Furthermore, many sketch studies produce doodle-like sketches, which can be used for sketch-based retrieval [,,,,]. Therefore, we focus on synthesizing artistic landscape sketch, which is one of the important artistic sketch genres.
Our primary goal is to develop a diffusion model-based sketch synthesis model that mimics artistic landscape sketch drawing techniques. Important techniques include vanishing points and perspective levels of detail. In an artistic landscape sketch, the objects in a scene are aligned along vanishing points, and artists can control the number of vanishing points in their artworks. Perspective level of detail is a technique that artists use to draw objects in detail or in abstract form according to their perspective level.
Another goal is to build a sketch generation framework that relieves the burden of collecting datasets. Generative models that synthesize sketches have long suffered from the lack of data, since a well-drawn sketch requires a lot of time and effort from an artist. Recently, the progress of diffusion models has greatly improved the generative models. Since existing diffusion models are trained with vast datasets, they have rich prior knowledge of objects and they produce impressive image synthesis results with zero-shot learning. Therefore, we employ the pre-trained diffusion model for the backbone of our framework to resolve the lack of sketch datasets.
To facilitate vanishing points and perspective levels of detail, we propose a three-channel perspective map (3CPM). Landscape sketch artists depict a scene by decomposing it into three parts: background, midrange, and foreground. The objects in the foreground are depicted in detail, those in the background are abstracted, and those in the midrange are depicted at an intermediate level. 3CPM arranges the objects of the background, midrange, and foreground in separate maps. We then train a module that encodes the 3CPM and passes its representation to our framework.
3CPM is technically a semantic segmentation map that segments the objects in a scene into three categories: foreground, midrange, and background. A 3CPM can be produced by manually segmenting sketches or images, or it can be synthesized with a depth estimator. 3CPMs paired with sketches are used to train our backbone model.
A serious challenge of deep learning-based sketch generation is the lack of data. To resolve this challenge, we employ a pre-trained diffusion model as the backbone network of our framework. Since well-known diffusion models are pre-trained on a variety of datasets, most of them can generate convincing images. Among diffusion models, we employ Stable Diffusion [], which is recognized for producing visually pleasing images and for being easy to train. However, vanilla Stable Diffusion shows limitations in synthesizing artistic sketches. The synthesized sketches show many artifacts, such as unwanted tiny strokes or a color-remaining artifact in the result image. Therefore, we improve vanilla Stable Diffusion in order to synthesize artistic landscape sketch images.
For effective sketch synthesis, we consider both text-to-sketch and image-to-sketch frameworks. Although vanilla Stable Diffusion supports an image-to-image framework using CLIP, which converts images into semantic tokens, the image-to-sketch results of vanilla Stable Diffusion are heavily affected by the texture or colors of the input images. In order to preserve the salient visual information of the input image, we employ edge maps for the image-to-sketch framework.
We present landscape sketch images produced by both text-to-sketch and image-to-sketch approaches; some results of our framework are shown in Figure 1. We evaluate our results quantitatively by estimating various metrics, including Frechet Inception Distance (FID), Art Frechet Inception Distance (ArtFID) [], Contrastive Language-Image Pre-training (CLIP) score, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Multi-Scale Structural Similarity Index (MS-SSIM). We also perform a qualitative evaluation through a user study.
Figure 1. Teaser images: The upper row shows image-to-sketch results, and the lower row shows text-to-sketch results. The token for text-to-sketch is “duomo”. The image-to-sketch results show levels of detail according to perspective level, and the text-to-sketch results show one, two, and zero vanishing points.
Our contributions are listed as follows:
  • We present a three-channel perspective map (3CPM), which controls the perspective levels of detail for landscape sketches, a capability not addressed by existing sketch generation schemes.
  • We present a landscape sketch generation framework that controls the number of vanishing points, which has not been controlled in existing schemes.
  • We present a bi-modal sketch generation framework that supports both image-to-sketch and text-to-sketch generation.

3. Overview

Our model employs Stable Diffusion as a backbone network and ControlNet to process the edge map and the 3CPM. We further improve our model using LoRA to enhance the quality of the results. Our approach resembles Ryu et al.’s work [], where LoRA components are added to the query, key, and value computations in every attention module.
We implement both text-to-sketch and image-to-sketch approaches. In both cases, a guide token, “*”, is used so that the backbone network synthesizes a sketch. During inference, we use “ldsktch” as “*”.
Figure 2 illustrates the overview of our framework. In text-to-sketch generation, shown in the orange box, a prompt and a 3CPM are fed into our model. The prompt, given as “*, cafe”, is encoded into tokens, and the tokens are fed into both ControlNet and Stable Diffusion. The 3CPM is optional and can be constructed by a perspective map estimator (PM) or by an artist.
Figure 2. Overview of our framework.
In image-to-sketch generation, a prompt and an image are given as input. The prompt is given as “*”. The image is fed into two preprocessors, a Canny edge detector and a perspective map estimator, and then the preprocessed edge map and 3CPM are fed into two different ControlNet models.
In the denoising step, the noise fed into the UNet of Stable Diffusion is treated as a latent seed, $z$.
The text tokens $c$ and the latent $z$ are fed into the cross-attention and LoRA modules. The results of the two modules are added, and a feature from ControlNet is added to the output of the cross-attention and LoRA modules in the decoder of the UNet in Stable Diffusion. After several denoising steps, the latent vector is gradually denoised. The results of both the text-to-sketch and image-to-sketch frameworks are presented in Figure 2.

4. Method

4.1. Preliminaries on Stable Diffusion

The architecture of Stable Diffusion is a variation of the variational auto-encoder (VAE) [], which employs a re-parameterization strategy; Stable Diffusion, in contrast, generates images through a denoising process.
Stable Diffusion belongs to the category of Latent Diffusion Models (LDMs), in which the denoising process is executed in a latent space to reduce training time and to enhance training stability. Stable Diffusion encodes a 512 × 512 image into a 64 × 64 latent, performs the diffusion process in that latent space, and finally decodes the result back to 512 × 512 resolution. The diffusion process is performed by the inner UNet []. The UNet consists of an encoder, a decoder, and a middle block, which are built from several ResNet [] and ViT [] blocks. To synthesize a sketch, conditions are fed into the cross-attention modules in the ViTs. Every cross-attention module in the transformers of each layer is computed as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) \cdot V,$ (1)
where $Q = W_Q^{(i)} \cdot \psi_i(z_t)$, $K = W_K^{(i)} \cdot \tau_\theta(c)$, $V = W_V^{(i)} \cdot \tau_\theta(c)$, $\psi_i(z_t) \in \mathbb{R}^{N \times d_\epsilon^i}$, $W_Q^{(i)} \in \mathbb{R}^{d \times d_\epsilon^i}$, $W_K^{(i)} \in \mathbb{R}^{d \times d_\tau}$, $W_V^{(i)} \in \mathbb{R}^{d \times d_\tau}$, and $c$ is a condition vector such as a text embedding.
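For reference, the cross-attention computation in Eq. (1) can be sketched in a few lines of PyTorch; the class and dimension names (`d_latent`, `d_cond`, `d_head`) are illustrative and do not correspond to the actual Stable Diffusion implementation.

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention block mirroring Eq. (1); names are illustrative."""
    def __init__(self, d_latent: int, d_cond: int, d_head: int):
        super().__init__()
        self.w_q = nn.Linear(d_latent, d_head, bias=False)  # W_Q^(i)
        self.w_k = nn.Linear(d_cond, d_head, bias=False)    # W_K^(i)
        self.w_v = nn.Linear(d_cond, d_head, bias=False)    # W_V^(i)

    def forward(self, psi_z: torch.Tensor, tau_c: torch.Tensor) -> torch.Tensor:
        # psi_z: (B, N, d_latent) flattened UNet features; tau_c: (B, M, d_cond) text embedding
        q, k, v = self.w_q(psi_z), self.w_k(tau_c), self.w_v(tau_c)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1]), dim=-1)
        return attn @ v
```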
The training phase is composed of two steps. The first step is the training of the encoder that maps input images to the latent space and the decoder that recovers the latent vector to the image.
$\hat{x} = D_{VAE}(E_{VAE}(x)),$ (2)
where $x$ is an input image and $\hat{x}$ is the reconstructed image.
The second step trains the inner UNet by minimizing the following loss term:
$L_{LDM} = \mathbb{E}_{E_{VAE}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \| \epsilon - \epsilon_\theta(z_t, t) \|_2^2 \right],$ (3)
where $E_{VAE}$ is the VAE encoder, $\epsilon$ is sampled from a normal distribution, and $\epsilon_\theta$ is the UNet of Stable Diffusion.
Since Stable Diffusion is basically a text-to-image model, it processes text as a condition. For this purpose, Equation (3) is manipulated as follows:
$L_{SD} = \mathbb{E}_{E_{VAE}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \| \epsilon - \epsilon_\theta(z_t, t, c_{text}) \|_2^2 \right],$ (4)
where $c_{text}$ is the text condition.
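As an illustration of this objective, the following PyTorch sketch assumes placeholder `unet`, `vae_encoder`, and `text_encoder` callables and a precomputed `alphas_cumprod` noise schedule; it is a minimal rendering of Eq. (4), not the actual Stable Diffusion training code.

```python
import torch
import torch.nn.functional as F

def sd_loss(unet, vae_encoder, text_encoder, x, prompt_tokens, alphas_cumprod):
    """One simplified training step of the objective in Eq. (4).

    `unet`, `vae_encoder`, and `text_encoder` are placeholders for the corresponding
    modules; `alphas_cumprod` is a 1-D tensor of cumulative noise-schedule products.
    """
    z0 = vae_encoder(x)                                  # E_VAE(x): latent of the input image (B, C, H, W)
    c_text = text_encoder(prompt_tokens)                 # tau_theta(c_text): text embedding
    t = torch.randint(0, alphas_cumprod.numel(), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                           # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps   # forward diffusion to step t
    return F.mse_loss(unet(z_t, t, c_text), eps)         # || eps - eps_theta(z_t, t, c_text) ||_2^2
```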

4.2. A Three-Channel Perspective Map (3CPM)

Artists who draw landscape sketches align the objects in their sketch along the vanishing points of the scene and draw the objects with perspective levels of detail. They first view a scene and determine its vanishing points. Afterward, they draw a horizon line and perspective lines extending from the vanishing points, explicitly or implicitly. Finally, they separate the objects into perspective levels of detail by distance and draw them according to these lines and separations.
We propose the three-channel perspective map (3CPM) in order to mimic this artistic sketch drawing technique. 3CPM is a segmentation map that organizes the scene into three categories: background, midrange, and foreground. They are represented in different colors: yellow is for the foreground, red is for the midrange, and blue is for the background (see Figure 3). A 3CPM can be easily crafted by hand, or it can be produced by segmenting the depth map of a scene, extracted by a depth estimator, into three parts with respect to its pixel values.
Figure 3. Examples of 3CPM: one vanishing point, two vanishing points and a far-sighted view. Yellow is for the foreground, red is for the midrange and blue for the background.
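As a rough illustration, a 3CPM can be derived from an estimated depth map by thresholding it into three bands; the quantile thresholds in the following NumPy sketch are an illustrative choice rather than the exact procedure used for our dataset.

```python
import numpy as np

def depth_to_3cpm(depth: np.ndarray, near_q: float = 0.33, far_q: float = 0.66) -> np.ndarray:
    """Split a depth map (H, W), larger = farther, into a 3CPM-style color map.

    The quantile thresholds are illustrative; in practice a 3CPM can also be hand-crafted.
    """
    t_near, t_far = np.quantile(depth, [near_q, far_q])
    h, w = depth.shape
    pm = np.zeros((h, w, 3), dtype=np.uint8)
    pm[depth <= t_near] = (255, 255, 0)                      # foreground: yellow
    pm[(depth > t_near) & (depth <= t_far)] = (255, 0, 0)    # midrange: red
    pm[depth > t_far] = (0, 0, 255)                          # background: blue
    return pm
```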
In our study, we train ControlNet by collecting 3CPMs from artistic open-source sketches. An example pair from our dataset is illustrated in Figure 4. The trained ControlNet is incorporated into Stable Diffusion for sketch image generation.
Figure 4. An example pair of 3CPM data. Yellow is for the foreground, red is for the midrange and blue for the background.
The inner UNet $\epsilon_\theta$ consists of an encoder and a decoder. Since this architecture follows the UNet [] architecture, each layer of the encoder is connected to the corresponding layer of the decoder via a skip connection. Therefore, we can say that $(E^i, D^i)$ is a corresponding encoder-decoder layer pair of $\epsilon_\theta$.
For the training of ControlNet, the encoder $E$ of $\epsilon_\theta$ is copied as $E_{PM}$, and we denote each layer of $E_{PM}$ as $E_{PM}^i$. A specific condition, i.e., a 3CPM or a Canny edge map, is denoted as $c$. $c$ is fed into the zero convolution layer $ZC_0$, added to the latent vector $z_t$, and the sum is fed into $E_{PM}$. For brevity, $c_f$ denotes the collection of the intermediate outputs of every $E_{PM}^i$:
$c_f = E_{PM}(z_t + ZC_0(c), t, c_{text}),$ (5)
where $c_f$ consists of the intermediate terms $c_f^i$ of every $E_{PM}^i$, and each $c_f^i$ is passed through the corresponding $ZC_i$ layer before being fed into the corresponding decoder layer $D^i$. We can formulate this process as follows:
$z_{dec}^i = D^i(z_{dec}^{i+1} + ZC_i(c_f^i)),$ (6)
where $z_{dec}^i$ is the intermediate output of $D^i$. ControlNet is trained with the same form of loss term as Stable Diffusion; the loss term used to train ControlNet for 3CPM is defined as follows:
$L_{PM} = \mathbb{E}_{z_t, t, c_{text}, c_f,\, \epsilon \sim \mathcal{N}(0,1)}\left[ \| \epsilon - \epsilon_\theta(z_t, t, c_{text}, c_f) \|_2^2 \right],$ (7)
where $t \sim \{1, 2, 3, \ldots, T\}$.
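The structure described by Eqs. (5)–(7) can be sketched in PyTorch as follows; the layer list, channel sizes, and the omission of the time step and text condition are simplifications, and the snippet is not the actual ControlNet implementation.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero, standing in for the ZC_i layers."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlBranch(nn.Module):
    """Trainable copy E_PM of the frozen UNet encoder E, following Eqs. (5)-(7) in spirit."""
    def __init__(self, unet_encoder_layers, channels):
        super().__init__()
        self.layers = copy.deepcopy(unet_encoder_layers)               # E_PM, copied from E
        self.zc_in = zero_conv(channels[0])                            # ZC_0 applied to the condition c
        self.zc_out = nn.ModuleList([zero_conv(c) for c in channels])  # one ZC_i per encoder level

    def forward(self, z_t, cond):
        # cond is assumed to be the 3CPM (or Canny edge map) already embedded to the latent shape;
        # the time step t and the text condition c_text are omitted here for brevity.
        h = z_t + self.zc_in(cond)                 # z_t + ZC_0(c), Eq. (5)
        feats = []
        for layer, zc in zip(self.layers, self.zc_out):
            h = layer(h)
            feats.append(zc(h))                    # c_f^i, later added to decoder layer D^i, Eq. (6)
        return feats
```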

4.3. Enhancing Sketch Quality

We aim to implement a framework that synthesizes artistic landscape sketches. As the backbone network of our framework, we employ a diffusion model that can synthesize high-fidelity and diverse images and that is widely used and released as an open-source model, namely Stable Diffusion. However, vanilla Stable Diffusion is not well suited to synthesizing artistic sketches.
In Figure 5, the four sketches in Columns 1 and 2 are synthesized by vanilla Stable Diffusion with the prompt “<LOCATION>, sketch, lineart”, where <LOCATION> is “duomo”, “landscape”, “city”, or “cafe” for each result. These four results have low quality. The two sketches in Column 3 are synthesized by vanilla Stable Diffusion with a ControlNet that uses a Canny edge map; unfortunately, colors remain on the sketches.
Figure 5. Limitations of vanilla Stable Diffusion in synthesizing sketches.
To address these two problems, (1) low quality and (2) colors remaining on the sketch, we fine-tune our backbone model with an efficient fine-tuning method, LoRA, using the collected artistic landscape sketches.
Low Rank Adaptation (LoRA) was proposed by Hu et al. []. Instead of updating a gigantic weight matrix directly, LoRA trains a tiny low-rank module alongside it. A LoRA module can be formulated as $AB$, where $A \in \mathbb{R}^{d_{in} \times r}$ and $B \in \mathbb{R}^{r \times d_{out}}$, while the gigantic weight is $W \in \mathbb{R}^{d_{in} \times d_{out}}$. A layer adapted with a LoRA module can be formulated such that
$W(x) \leftarrow W(x) + AB(x).$ (8)
In the fine-tuning phase, the gigantic weight $W$ is frozen, and only the LoRA modules $AB$ are trained.
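A minimal PyTorch sketch of such an adapted layer is shown below, assuming a standard `nn.Linear` as the frozen base weight; the initialization scale is illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen weight W plus a trainable low-rank update AB, as in Eq. (8); r = 8 as in Section 5.3."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # lock the gigantic weight W
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # A in R^{d_in x r}
        self.B = nn.Parameter(torch.zeros(r, d_out))        # B in R^{r x d_out}, zero-initialized

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A @ self.B          # W(x) + AB(x)
```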
Ryu et al. [] provide scripts that adapt this method to Stable Diffusion. They attach LoRA modules to the $Q$, $K$, and $V$ projections of the cross-attention modules in the inner UNet of Stable Diffusion. $Q$, $K$, and $V$ can then be formulated as follows:
$Q = W_Q^{(i)} \cdot \psi_i(z_t) + A_Q^{(i)} B_Q^{(i)} \cdot \psi_i(z_t)$, $K = W_K^{(i)} \cdot \tau_\theta(c_{text}) + A_K^{(i)} B_K^{(i)} \cdot \tau_\theta(c_{text})$, $V = W_V^{(i)} \cdot \tau_\theta(c_{text}) + A_V^{(i)} B_V^{(i)} \cdot \tau_\theta(c_{text}).$ (9)
The loss term is the same as that of Stable Diffusion, since this is a fine-tuning procedure:
$L_{LoRA} = \mathbb{E}_{z_t, t, c_{text},\, \epsilon \sim \mathcal{N}(0,1)}\left[ \| \epsilon - \epsilon_\theta(z_t, t, c_{text}) \|_2^2 \right].$ (10)
As a result, we fine-tune vanilla Stable Diffusion so that it works efficiently and resolves the limitations of existing diffusion-based sketch generation schemes, including low quality and colors remaining on a sketch.

4.4. Multimodal Sketch Generation

We implement the text-to-sketch and image-to-sketch functions using the ControlNet and LoRA modules trained in the previous stage. Both functions are assigned the prompt “∗” in order to guide sketch synthesis. In the text-to-sketch mode, location words such as “cafe”, “station”, and “NewYork” can be attached after the basic prompt. In the image-to-sketch mode, the prompt is fixed as “∗”, and the information of the scene is given through preprocessed inputs such as a 3CPM or a Canny edge map.

4.4.1. Text to Sketch

In the text-to-sketch mode, a basic prompt such as “*, lineart” and a location prompt such as “cafe” or “city” are given as input. To synthesize a sketch without an input photograph, only seed noise and a prompt are fed into our framework. The sampling formulation for this process is as follows:
$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c_{text})\right) + \sigma_t \epsilon.$ (11)
Sampling formulation with 3CPM can be presented as follows:
$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c_t, E_{PM}(c_{PM}))\right) + \sigma_t \epsilon,$ (12)
where $\alpha_t$ is a denoising scheduling parameter with respect to time $t$ and $\bar{\alpha}_t = \prod_{n=1}^{t} \alpha_n$. $z_t$ is a latent vector, $c_t$ is the text condition, $\epsilon_\theta$ is the inner UNet, $t$ is a time step sampled from $\{1, 2, \ldots, T\}$, $E_{PM}$ is the encoder of ControlNet, and $c_{PM}$ is the 3CPM. $\sigma_t$ is the variance at time step $t$.
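A single reverse step of Eqs. (11) and (12) can be sketched as follows; `unet` stands for $\epsilon_\theta$ (optionally receiving ControlNet features), and the schedule tensors `alphas`, `alphas_cumprod`, and `sigmas` are assumed to be precomputed. This is a simplified rendering, not the exact sampler used in our framework.

```python
import torch

@torch.no_grad()
def denoise_step(unet, z_t, t, c_text, alphas, alphas_cumprod, sigmas, control_feats=None):
    """One reverse step of Eqs. (11)/(12); `unet` is a placeholder for eps_theta.

    `alphas`, `alphas_cumprod`, and `sigmas` are precomputed 1-D schedule tensors,
    and `t` is an integer time step.
    """
    eps_hat = unet(z_t, t, c_text, control_feats)        # eps_theta(z_t, t, c_t, E_PM(c_PM))
    a_t, a_bar_t = alphas[t], alphas_cumprod[t]
    mean = (z_t - (1.0 - a_t) / (1.0 - a_bar_t).sqrt() * eps_hat) / a_t.sqrt()
    noise = torch.randn_like(z_t) if t > 0 else torch.zeros_like(z_t)
    return mean + sigmas[t] * noise                      # z_{t-1}
```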

4.4.2. Image to Sketch

In the image-to-sketch mode, a basic prompt and an input photograph are given to our framework. The input photograph is fed into the preprocessing modules, including a Canny edge estimator and a perspective map estimator, and the preprocessed information is fed into ControlNet. Notably, since two ControlNet models trained using different schemes are employed, the two outcomes from the ControlNets are merged through a weighted average and fed into the inner UNet of our framework. The weights for the two ControlNets are $\alpha_{PM}$ and $\alpha_{can}$, respectively.
Finally, a sketch is synthesized using the following sampling formulation:
$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(z_t, t, c_t, \bar{c}_f)\right) + \sigma_t \epsilon,$ (13)
where $\bar{c}_f = \alpha_{PM} E_{PM}(c_{PM}) + \alpha_{can} E_{can}(c_{can})$ and $\alpha_{PM} + \alpha_{can} = 1$. $c_{PM}$ is the 3CPM, $c_{can}$ is the Canny edge map, and $E_{can}$ is the encoder of the ControlNet trained with Canny edge maps. The other variables are identical to those in the text-to-sketch case.
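The weighted merging of the two ControlNet feature sets can be sketched as follows; the default weight of 0.5 is an illustrative placeholder rather than a prescribed value.

```python
def merge_control_features(feats_pm, feats_canny, alpha_pm=0.5):
    """Weighted average of the two ControlNet feature lists, as in Eq. (13)'s condition term.

    alpha_pm + alpha_canny = 1; the 0.5 default is illustrative only.
    """
    alpha_canny = 1.0 - alpha_pm
    return [alpha_pm * f_pm + alpha_canny * f_can
            for f_pm, f_can in zip(feats_pm, feats_canny)]
```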
Finally, the latent vector is recovered into an image by the VAE decoder $D_{VAE}$ as follows:
$\hat{x} = D_{VAE}(z_0),$ (14)
where $z_0$ is obtained from the noise $z_T$ through $T$ denoising steps.

5. Implementation and Results

We train and test our framework in the Naver Clova NSML cloud environment, using a 16-core Intel(R) Xeon(R) Gold 5220 CPU @ 2.20 GHz and two Tesla V100-SXM2-32GB GPUs.
To fine-tune and test our model, we employ the Automatic1111 [] open-source framework, which provides not only a web UI but also an API, so that Stable Diffusion can be driven from short code snippets. Furthermore, it provides various extensions that combine ControlNet and LoRA with Stable Diffusion.
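For example, a sketch can be requested from a locally running Automatic1111 instance through its HTTP API roughly as follows; the endpoint and payload fields reflect common usage of that API and should be verified against the local API documentation.

```python
import base64
import requests

# Minimal txt2img request against a local Automatic1111 instance started with the --api flag.
payload = {
    "prompt": "ldsktch, cafe, lineart",   # guide token "ldsktch" plus a location prompt
    "steps": 20,                          # illustrative sampling settings
    "width": 512,
    "height": 512,
}
resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=300)
resp.raise_for_status()
with open("sketch.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))   # first returned image, base64-encoded
```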
We implement a user interface to our framework and present a GitHub webpage of this study in “https://github.com/comeeasy/DALS” (accessed on 1 January 2023).

5.1. Results

We present two categories of sketch images synthesized using our framework. In Figure 6, sketch images synthesized from input images are presented. We apply five landscape photographs and produce sketch images. Details of the photographs, such as trees and brick tiles, are successfully generated; the objects in the foreground are illustrated in detail and those in the background are abstracted. In Figure 7, sketch images generated from keywords are presented. As input, we employ the following keywords with different 3CPMs and numbers of vanishing points: “duomo”, “station”, “New York”, “India”, “Tokyo”, “Shanghai”, “London”, “cafe”, “amusement park”, “space station”, “auditorium”, “city hall”, and “university”. In addition to these keywords, we present three different 3CPMs, which lead to different sketch images.
Figure 6. Results of our framework that produces a sketch from an image.
Figure 7. Results of our framework that produces a sketch from text.

5.2. Dataset

We collect 147 artistic landscape sketches for training and 12 images for testing from [,]. Additionally, we produce hand-crafted 3CPMs for each collected sketch to train ControlNet. Therefore, we use 147 sketches to train LoRA and 147 pairs of sketches and their corresponding 3CPMs to train ControlNet, and we use 12 landscape images to test our framework.
To evaluate our framework, we estimate Sketch-FID. We download the ImageNet-Sketch dataset from Kaggle [] and train a ResNet50 [] on it.

5.3. Hyperparameters

In order to train LoRA, we set the rank of LoRA, $r$, to eight and attach LoRA modules only to $Q$, $K$, and $V$ of the UNet of our framework. We set the batch size to four, the number of epochs to 500, the learning rate to 0.0001, and the network $\alpha$ to one. We employ a cosine LR scheduler, and the input image size is 512 × 512.
A prompt used for training LoRA is “ldsktch”, which is a guide token. When it comes to synthesizing, text-to-sketch and image-to-sketch modes use a default prompt, “ldsktch”, that guides our framework to synthesize the result in an artistic landscape sketch style. The text-to-sketch mode uses a location prompt in addition to the default prompt. A location prompt specifies the keyword of a place including “cafe”, “city” or “beach”, etc.
To train ControlNet, we employ Stable Diffusion v1.5. We set the batch size to four, the number of epochs to 1000, and the learning rate to 0.00085, which is gradually increased to 0.012 using a linear scheduler; the image size is 512 × 512 and the latent size is 64.
A prompt used for training ControlNet is “landscape sketch line-art, drawing detail is separated by distance”.
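For convenience, the reported hyperparameters can be summarized as follows; the dictionary keys are illustrative and do not correspond to any particular training script.

```python
# Hyperparameters from Section 5.3, gathered into two illustrative configuration dicts.
lora_config = {
    "rank": 8,                              # r; LoRA modules attached only to Q, K, V of the UNet
    "batch_size": 4,
    "epochs": 500,
    "learning_rate": 1e-4,
    "network_alpha": 1,
    "lr_scheduler": "cosine",
    "resolution": (512, 512),
    "train_prompt": "ldsktch",              # guide token
}
controlnet_config = {
    "base_model": "Stable Diffusion v1.5",
    "batch_size": 4,
    "epochs": 1000,
    "learning_rate": 8.5e-4,                # gradually increased to 1.2e-2 with a linear scheduler
    "lr_scheduler": "linear",
    "image_size": (512, 512),
    "latent_size": 64,
    "train_prompt": "landscape sketch line-art, drawing detail is separated by distance",
}
```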

5.4. Comparison

We compare the results produced by our image-to-sketch and text-to-sketch frameworks with existing studies. For the comparison of the image-to-sketch framework, we sample seven important studies that generate sketch images [,,,,,,]. Among the various categories of sketch synthesis studies, Coherent Line [] is selected from pre-deep learning schemes and CycleGAN [] from general deep learning schemes; U-GAT-IT [] and Kim et al.’s scheme [] are from sketch-specific deep learning schemes; and Chan et al.’s [], Vinker et al.’s [], and Vinker et al.’s [] are from diffusion model-based schemes. Figure 8 illustrates the sketch images from these schemes and ours. For the comparison of the text-to-sketch framework, we sample two important studies [,]. Figure 9 illustrates the sketch images produced by these schemes and ours.
Figure 8. Comparison of image-to-sketch results: The results of our study are compared with those from important studies including Coherent Line [], CycleGAN [], U-GAT-IT [], Kim et al. [], info-draw [], CLIPascene [], and CLIPasso [].
Figure 9. Comparison of text-to-sketch results: The results of our study are compared with those from important studies including DiffSketcher [] and vanilla Stable Diffusion (SD) [].

6. Evaluation

We evaluate the sketch images synthesized by our framework using both a quantitative approach and a qualitative approach. In the quantitative evaluation, we compare the results of the sampled studies [,,,,,,] with ours. Since these studies produce sketch images from an input photograph, we evaluate the sketch images produced by our image-to-sketch framework. In the qualitative evaluation, we evaluate both image-to-sketch and text-to-sketch results. We recruit twenty human participants and conduct surveys on the results.

6.1. Quantitative Evaluation

For the seven important studies presented in Figure 8, we estimate the following metrics, which have been frequently used for the evaluation of the synthesized images in many studies:
  • FID (Frechet Inception Distance): a metric that evaluates the quality and diversity of generated images, particularly in the context of generative models including GANs and diffusion models. FID is estimated using a ResNet50 [] trained on the ImageNet-Sketch [] dataset.
  • ArtFID: an enhanced FID metric that measures the similarity of the contents of a stylized image and its original image. It not only calculates FID but also considers identity consistency between an input image and a synthesized sketch.
  • CLIP (Contrastive Language-Image Pre-training) score: a metric that measures similarity in the CLIP embedding space. We extract features from both the input photograph and the sketch image using the CLIP encoder and compute the cosine similarity between the features.
  • PSNR (Peak Signal-to-Noise Ratio): a metric that measures the quality of a reconstructed image by comparing it to the original image.
  • SSIM (Structural Similarity Index): a metric that evaluates the similarity between two images considering luminance, contrast, and structure.
  • MS-SSIM (Multi-Scale Structural Similarity Index): an extension of the traditional SSIM metric. MS-SSIM takes into account variations in structure and texture across multiple scales of two images.
We estimate these metrics on the images in Figure 8 and present the values in Table 1. According to Table 1, our results show the best scores for three metrics (FID, ArtFID, and CLIP score) and the second-best score for PSNR, while ranking lower for SSIM and MS-SSIM. Therefore, we conclude that our scheme produces higher-quality sketch images than the compared existing studies.
Table 1. The values from the quantitative evaluation. Bold figures denote the best results; red indicates metrics for which the smallest value is best, and blue indicates metrics for which the biggest value is best.
According to our comparison in Table 1, our results show lower FID and ArtFID scores than the other compared models. In our analysis, both FID and ArtFID measure the distance between two images, so the structural coincidence between the images affects both metrics, as does the absence of unwanted artifacts in the sketch image. Since the structures of our results coincide best with those of the input photographs and unwanted artifacts are very rare in our results, our results show the lowest FID and ArtFID scores.
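For reference, the pixel-level metrics (PSNR, SSIM, and MS-SSIM) can be computed, for example, with the torchmetrics package as sketched below; this is an illustrative choice of implementation, and FID, ArtFID, and the CLIP score additionally require the trained ResNet50 and CLIP encoders described above.

```python
import torch
from torchmetrics.image import (
    MultiScaleStructuralSimilarityIndexMeasure,
    PeakSignalNoiseRatio,
    StructuralSimilarityIndexMeasure,
)

# sketch and photo: float tensors in [0, 1] with shape (N, 3, H, W); random data for illustration.
sketch = torch.rand(2, 3, 512, 512)
photo = torch.rand(2, 3, 512, 512)

psnr = PeakSignalNoiseRatio(data_range=1.0)(sketch, photo)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)(sketch, photo)
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)(sketch, photo)
print(f"PSNR {psnr:.2f} dB, SSIM {ssim:.3f}, MS-SSIM {ms_ssim:.3f}")
```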

6.2. Qualitative Evaluation

As a qualitative evaluation, we conduct a user study with twenty participants. Seventeen of them are in their twenties and three are in their thirties; eleven are female and nine are male. We present three questions as follows:
Q1:
Evaluate the visual quality of the sketch image in a ten-point metric. Mark 10 for the best and mark 1 for the worst.
Q2:
Evaluate the artifacts in the sketch image in a ten-point metric. Mark 10 for an image free from artifacts and mark 1 for an image full of artifacts.
Q3:
Evaluate the preservation of the content in the sketch image in a ten-point metric. Mark 10 for the image whose content coincides with that of the input and mark 1 for the image whose content is not preserved at all.
We execute our user study on both image-to-sketch and text-to-sketch frameworks.

6.2.1. Image to Sketch

For the user study on the image-to-sketch framework, we sample the images in Figure 8 and pose the three questions to the participants. The scores averaged over the images for each sketch generation scheme are presented in Table 2. From Table 2, we can conclude that our scheme shows the best results among the sampled image-to-sketch studies [,,,,,,].
Table 2. The scores of user study on image-to-sketch results. The blue bold figures represent the highest score, which means the best scheme.

6.2.2. Text to Sketch

For the user study on the text-to-sketch framework, we sample the images in Figure 9 and pose questions Q1 and Q2 to the participants. Since the text-to-sketch framework does not require an input photograph, question Q3 is not applicable. The scores averaged over the images for each sketch generation scheme are presented in Table 3. From Table 3, we can conclude that our scheme shows the best results among the sampled text-to-sketch studies [,].
Table 3. The scores of user study on text-to-sketch results. The blue bold figures represent the highest score, which means the best scheme.

6.3. Ablation Study

In the ablation study, we combine various configurations of our framework and synthesize sketch images. We identify the components as Canny edge, LoRA, and 3CPM and combine them as Canny only, Canny + LoRA, Canny + 3CPM, and Canny + LoRA + 3CPM. The results of this ablation study are illustrated in Figure 10 and presented in Table 4. As illustrated in Figure 10, our framework with Canny only and with Canny + 3CPM cannot successfully manage the color-remaining problem, which results in a red sky. The Canny + LoRA configuration resolves the color-remaining problem; however, it cannot successfully produce the details of the sketch images. Our approach, which combines Canny, LoRA, and 3CPM simultaneously, produces sketch images that describe the details of the objects and resolves the color-remaining problem.
Figure 10. Ablation study: the results of combining various components such as Canny, LoRA, and 3CPM. The rightmost column shows our results.
Table 4. Ablation study. Bold figures denote the best results; red indicates metrics for which the smallest value is best, and blue indicates metrics for which the biggest value is best.

6.4. Limitation

Our framework shows several limitations. The major limitation is that the results of our framework depend on the Canny edge map: scenes whose Canny edges are clear show better sketch results than those whose Canny edges are obscure and noisy. Another limitation comes from the pre-trained Stable Diffusion model, which is the backbone network of our framework. Since the backbone network is pre-trained, the scenes in the dataset used for pre-training affect our sketch images; therefore, our framework sometimes produces unexpected objects in the sketch. Finally, our framework falls short in controlling the level of abstraction or the sketch style during sketch synthesis.

7. Conclusions and Future Work

In this paper, we present DALS, a Diffusion-based Artistic Landscape Sketch generation framework. Even though landscape sketch is an important genre of sketch artwork, few studies have tried to learn and apply artistic landscape sketch drawing techniques such as vanishing points and levels of detail. Pre-deep learning schemes have limitations in applying artistic techniques and in controlling unwanted artifacts, while deep learning schemes have difficulty collecting proper training datasets for sketch generation models. Our framework analyzes artistic landscape drawing techniques and applies techniques such as vanishing points and a three-channel perspective map (3CPM) to generate sketch images. Furthermore, since our framework requires a relatively small dataset, we employ a pre-trained Stable Diffusion model as its backbone network. Our framework can produce sketch images in a bi-modal manner: image to sketch and text to sketch.
Our first plan is to extend the capability of our framework to cover more diverse landscape images by collecting diverse sketch datasets. We also plan to devise and implement an optimal structure that can produce and control the sketch images in various formats including vectorized sketch. We also plan to enhance our framework to control the levels of abstraction and style of sketch.

Author Contributions

Methodology, J.K.; Writing—original draft, H.Y.; Writing—review & editing, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Sangmyung University in 2021.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kang, H.; Lee, S.; Chui, C.K. Coherent line drawing. In Proceedings of the 5th International Symposium on Non-Photorealistic Animation and Rendering, San Diego, CA, USA, 4–5 August 2007; pp. 43–50. [Google Scholar]
  2. Kang, H.; Lee, S.; Chui, C.K. Flow-based image abstraction. IEEE Trans. Vis. Comput. Graph. 2008, 15, 62–76. [Google Scholar] [CrossRef]
  3. Winnemöller, H. XDoG: Advanced image stylization with extended difference-of-Gaussians. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Non-Photorealistic Animation and Rendering, New York, NY, USA, 5–7 August 2011; pp. 147–156. [Google Scholar]
  4. Gatys, A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423. [Google Scholar]
  5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets; NeurIPS; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  6. Isola, P.; Zhu, J.; Zhou, T.; Efros, A. Image-to-image translation with conditional adversarial networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  7. Zhu, Y.; Park, T.; Isola, P.; Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2223–2232. [Google Scholar]
  8. Liu, M.; Breuel, T.; Kautz, J. Unsupervised Image-to-Image Translation Networks; NeurIPS; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  9. Huang, X.; Liu, M.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 172–189. [Google Scholar]
  10. Yeom, J.; Yang, H.; Min, K. An Attention-Based Generative Adversarial Network for Producing Illustrative Sketches. Mathematics 2021, 9, 2791. [Google Scholar] [CrossRef]
  11. Kim, J.; Kim, M.; Kim, H.; Lee, K. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. In Proceedings of the Eighth International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  12. Li, M.; Lin, Z.; Mech, R.; Yumer, E.; Ramanan, D. Photo-sketching: Inferring contour drawings from images. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 7–11 January 2019; pp. 1403–1412. [Google Scholar]
  13. Yi, R.; Lai, Y.; Rosin, P. APDrawingGAN: Generating artistic portrait drawings from face photos with hierarchical gans. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10743–10752. [Google Scholar]
  14. Su, H.; Niu, J.; Liu, X.; Li, Q.; Cui, J.; Wan, J. MangaGAN: Unpaired photo-to-manga translation based on the methodology of manga drawing. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35. [Google Scholar]
  15. Kim, H.; Oh, H.; Yang, H. A Transfer Learning for Line-Based Portrait Sketch. Mathematics 2022, 10, 3869. [Google Scholar] [CrossRef]
  16. Peng, C.; Zhang, C.; Liu, D.; Wang, N.; Gao, X. Face photo–sketch synthesis via intra-domain enhancement. Knowl. Based Syst. 2023, 259, 110026. [Google Scholar] [CrossRef]
  17. Zhu, M.; Wu, Z.; Wang, N.; Yang, H.; Gao, X. Dual Conditional Normalization Pyramid Network for Face Photo-Sketch Synthesis. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5200–5211. [Google Scholar] [CrossRef]
  18. Koh, I. AI-Urban-Sketching: Deep Learning and Automating Design Perception for Creativity. Transformations 2022, 36, 14443775. [Google Scholar]
  19. Qian, W.; Yang, F.; Mei, H.; Li, H. Artificial intelligence-designer for high-rise building sketches with user preferences. Eng. Struct. 2023, 275, 115171. [Google Scholar] [CrossRef]
  20. Selim, A.; Mohamed, E.; Linda, D. Painting style transfer for head portraits using convolutional neural networks. ACM ToG 2016, 35, 129. [Google Scholar] [CrossRef]
  21. Chan, C.; Durand, F.; Isola, P. Learning to generate line drawings that convey geometry and semantics. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7905–7915. [Google Scholar]
  22. Vinker, Y.; Pajouheshgar, E.; Bo, J.; Bachmann, R.C.; Bermano, A.H.; Cohen-Or, D.; Zamir, A.; Shamir, A. CLIPasso: Semantically-Aware Object Sketching. ACM Trans. on Graph. 2022, 41, 86. [Google Scholar] [CrossRef]
  23. Vinker, Y.; Alalus, Y.; Cohen-Or, D.; Shamir, A. CLIPascene: Scene Sketching with Different Types and Levels of Abstraction. In Proceedings of the ICCV, Paris, France, 2–6 October 2023; pp. 4146–4156. [Google Scholar]
  24. Wang, Q.; Deng, H.; Qi, Y.; Da, L.; Song, Y. SketchKnitter: Vectorized Sketch Generation with Diffusion Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  25. Xing, X.; Wang, C.; Zhou, H.; Zhang, J.; Yu, Q.; Xu, D. DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models. arXiv 2023, arXiv:2306.14685. [Google Scholar]
  26. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  27. Wright, A.; Ommer, B. ArtFID: Quantitative Evaluation of Neural Style Transfer. In Proceedings of the DAGM GCPR, Konstanz, Germany, 27–30 September 2022; pp. 560–576. [Google Scholar]
  28. Canny, J. A computational approach to edge detection. IEEE Trans. Patt. Anal. Mach. Intell. 1986, 8, 679–698. [Google Scholar] [CrossRef]
  29. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. NeurIPS 2021, 34, 8780–8794. [Google Scholar]
  30. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. NeurIPS 2020, 33, 6840–6851. [Google Scholar]
  31. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  32. Song, Y.; Sohl-Dickstein, J.; Kingma, D.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  33. Jain, A.; Xie, A.; Abbeel, P. VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1911–1920. [Google Scholar]
  34. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. arXiv 2023, arXiv:2302.05543. [Google Scholar]
  35. Mirza, M.; Simon, O. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  36. Liu, M.; Tuzel, O. Coupled Generative Adversarial Networks. NeurIPS 2016, 29, 469–477. [Google Scholar]
  37. Yi, R.; Liu, Y.; Lai, Y.; Rosin, P. Unpaired Portrait Drawing Generation via Asymmetric Cycle Mapping. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8217–8225. [Google Scholar]
  38. Available online: https://github.com/AUTOMATIC1111/stable-diffusion-webui (accessed on 31 December 2023).
  39. Available online: https://civitai.com/ (accessed on 31 December 2023).
  40. Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  41. Ryu, S. Low-Rank Adaptation for Fast Text-to-Image Diffusion Fine-Tuning. 2022. Available online: https://github.com/cloneofsimo/lora (accessed on 31 December 2023).
  42. Kingma, D.; Welling, M. Auto-Encoding Variational Bayes. Stat 2014, 1050, 1. [Google Scholar]
  43. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Volume 9351, pp. 234–241. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16X16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  46. Available online: https://www.pexels.com/ (accessed on 31 December 2023).
  47. Available online: https://pixabay.com/ (accessed on 31 December 2023).
  48. Available online: https://www.kaggle.com/datasets/wanghaohan/imagenetsketch (accessed on 31 December 2023).
  49. Wang, H.; Ge, S.; Lipton, Z.; Xing, E. Learning Robust Global Representations by Penalizing Local Predictive Power. In Advances in Neural Information Processing Systems 32; NeurIPS; Curran Associates, Inc.: Red Hook, NY, USA, 2019. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
